📊 ArXiv Research Report (2026-04-03)

Generated: 2026-04-03 09:25:54 | Data source: ArXiv

📌 Configuration

Keyword list (27 keywords, total weight 27.0)

Keyword	Weight	Type
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	Primary
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	Primary
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	Primary
“Scaling Laws” AND “Data Quality”	1.0	Primary
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	Primary
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	Primary
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	Primary
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	Primary
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	Primary
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	Primary
“Context Window Extension” OR “Long Context LLMs”	1.0	Primary
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	Primary
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	Primary
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	Primary
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	Primary
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	Primary
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	Primary
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	Primary
“Multi-agent Systems” OR “Agent Coordination”	1.0	Primary
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	Primary
“Speculative Decoding” OR “Inference Acceleration”	1.0	Primary
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	Primary
“Mechanistic Interpretability” OR “Explainable AI”	1.0	Primary
“World Models” AND “General World Models”	1.0	Primary
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	Primary
“In-context Learning” OR “Many-shot Learning”	1.0	Primary
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	Primary

Scoring settings

Maximum score per keyword: 15
Pass threshold formula: 5.0 + 0.8 × total weight
Current pass threshold: 26.6
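
The threshold above follows directly from the formula. A minimal Python sketch of that arithmetic, assuming the total weight is simply the sum of the per-keyword weights (the function name `pass_threshold` is illustrative and not taken from the report generator):

```python
def pass_threshold(weights, base=5.0, factor=0.8):
    """Pass threshold = base + factor * total weight (the formula listed above)."""
    return base + factor * sum(weights)


# 27 keywords, each with weight 1.0, as in the keyword list above.
weights = [1.0] * 27
threshold = round(pass_threshold(weights), 1)
print(threshold)
```

Because the threshold scales with the total weight, adding or re-weighting keywords changes the bar a paper has to clear.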

📈 Paper statistics

Total fetched: 289 papers
Passing papers: 11 (3.8%)

⭐ Detailed analysis of passing papers

1. Can Large Language Models Self-Correct in Medical Question Answering? An Exploratory Study

Authors: Zaifu Zhan, Mengyuan Cui, Rui Zhang | Source: arXiv | Published: 2026-03-31 | arXiv link: http://arxiv.org/abs/2604.00261v1

Score: 55.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	10.0/10	10.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

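Scoring rationale: The paper's core topic is the self-correction/self-reflection ability of large language models (LLMs) in medical question answering (a subfield of AI for Science), evaluated with chain-of-thought prompting; it is therefore highly relevant to “Large Language Models”, “Chain of Thought”, “Self-Correction”, and “AI for Science” (10 points each). Its analysis of reasoning transparency versus correctness relates partially to “System 2 Thinking” and “Explainable AI” (5 points each), and its examination of model reliability relates partially to “Hallucination Mitigation” (5 points). Other keywords such as MoE, SFT, and RAG are not addressed in the study and score 0.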
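
The total of 55.0 can be reproduced from the score table. A small sketch, assuming each keyword's score is its weight multiplied by its relevance and the paper's score is the sum of those products (this matches the rows in the table, though the report does not spell out the rule); only the nonzero rows are listed:

```python
# Nonzero rows from the score table: keyword -> (weight, relevance out of 10).
rows = {
    "Large Language Models": (1.0, 10.0),
    "Chain of Thought": (1.0, 10.0),
    "System 2 Thinking": (1.0, 5.0),
    "Self-Correction": (1.0, 10.0),
    "Hallucination Mitigation": (1.0, 5.0),
    "Explainable AI": (1.0, 5.0),
    "AI for Science": (1.0, 10.0),
}

PASS_THRESHOLD = 26.6  # current pass threshold from the scoring settings


def paper_score(rows):
    """Sum of weight * relevance over all keywords (assumed scoring rule)."""
    return sum(weight * relevance for weight, relevance in rows.values())


total = paper_score(rows)
passed = total >= PASS_THRESHOLD
print(total, passed)
```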

!!! tip deepseek-chat TL;DR

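This study explores whether self-reflective prompting helps large language models self-correct in medical question answering. It finds that self-reflection does not consistently improve accuracy and that its effect depends heavily on the dataset and model, suggesting that self-reflection is better suited to understanding model behavior than to directly improving reliability.

Abstract (Chinese translation)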

大语言模型（LLM）在医学问答任务上已展现出强劲性能，而思维链提示通过引导模型进行显式的中间推理进一步提升了结果；与此同时，自我反思（自我校正）提示被广泛认为能通过促使大语言模型批判和修正自身推理来增强模型可靠性，但其在安全关键的医疗场景中的有效性仍不明确。本研究对医学多项选择题问答中的自我反思推理进行了探索性分析：使用GPT-4o和GPT-4o-mini模型，我们比较了标准思维链提示与迭代式自我反思循环的效果，并在三个广泛使用的医学问答基准数据集（MedQA、HeadQA和PubMedQA）上追踪了预测结果在反思步骤中的演变过程。我们分析了自我反思是否会导致错误修正、错误持续或引入新错误。研究结果表明，自我反思提示并不能持续提升准确性，其效果高度依赖于数据集和模型：在MedQA上带来有限增益，但在HeadQA和PubMedQA上收益甚微或产生负面影响，且增加反思步骤数并不能保证性能提升。这些发现揭示了推理透明度与推理正确性之间的差距，表明自我反思推理更适合被视为理解模型行为的分析工具，而非提升医学问答可靠性的独立解决方案。

Abstract

Large language models (LLMs) have achieved strong performance on medical question answering (medical QA), and chain-of-thought (CoT) prompting has further improved results by eliciting explicit intermediate reasoning; meanwhile, self-reflective (self-corrective) prompting has been widely claimed to enhance model reliability by prompting LLMs to critique and revise their own reasoning, yet its effectiveness in safety-critical medical settings remains unclear. In this work, we conduct an exploratory analysis of self-reflective reasoning for medical multiple-choice question answering: using GPT-4o and GPT-4o-mini, we compare standard CoT prompting with an iterative self-reflection loop and track how predictions evolve across reflection steps on three widely used medical QA benchmarks (MedQA, HeadQA, and PubMedQA). We analyze whether self-reflection leads to error correction, error persistence, or the introduction of new errors. Our results show that self-reflective prompting does not consistently improve accuracy and its impact is highly dataset- and model-dependent: it yields modest gains on MedQA but provides limited or negative benefits on HeadQA and PubMedQA, and increasing the number of reflection steps does not guarantee better performance. These findings highlight a gap between reasoning transparency and reasoning correctness, suggesting that self-reflective reasoning is better viewed as an analytical tool for understanding model behavior rather than a standalone solution for improving medical QA reliability.
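
The three outcomes the abstract tracks (error correction, error persistence, and newly introduced errors) can be tallied by comparing each question's answer before and after reflection. A minimal sketch, assuming predictions are stored as answer letters; the function names, category labels, and toy data are illustrative and are not from the paper:

```python
from collections import Counter


def classify_transition(gold, before, after):
    """Label how one prediction changed across reflection."""
    was_right, is_right = before == gold, after == gold
    if was_right and is_right:
        return "stayed_correct"
    if not was_right and is_right:
        return "error_corrected"
    if was_right and not is_right:
        return "error_introduced"
    return "error_persisted"


def tally(gold, before, after):
    """Count transition types over a set of questions."""
    return Counter(
        classify_transition(g, b, a) for g, b, a in zip(gold, before, after)
    )


# Toy data (made up): gold answers, CoT answers, post-reflection answers.
gold = ["A", "B", "C", "D", "A"]
cot = ["A", "C", "C", "A", "B"]
reflected = ["A", "B", "D", "A", "B"]

counts = tally(gold, cot, reflected)
accuracy_before = sum(g == p for g, p in zip(gold, cot)) / len(gold)
accuracy_after = sum(g == p for g, p in zip(gold, reflected)) / len(gold)
print(dict(counts), accuracy_before, accuracy_after)
```

In this toy case one corrected answer is offset by one newly introduced error, so accuracy is unchanged even though the predictions moved.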

Keywords: Large Language Models, Self-Correction, Self-Reflection, Chain-of-Thought, Medical Question Answering, Model Reliability, GPT-4o, Reasoning Transparency

2. HippoCamp: Benchmarking Contextual Agents on Personal Computers

Authors: Zhe Yang, Shulin Tian, Kairui Hu, Shuai Liu, Hoang-Nhat Nguyen, Yichi Zhang, Zujin Guo, Mengying Yu, Zinan Zhang, Jingkang Yang, Chen Change Loy, Ziwei Liu | Source: arXiv | Published: 2026-04-01 | arXiv link: http://arxiv.org/abs/2604.01221v1

Score: 50.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	5.0/10	5.0
“Context Window Extension” OR “Long Context LLMs”	1.0	5.0/10	5.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	5.0/10	5.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

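Scoring rationale: The paper proposes the HippoCamp benchmark, which evaluates multimodal large language models (MLLMs) and agents on file management in personal-computer environments, covering search, evidence perception, and multi-step reasoning. Core keywords: “Large Language Models” (the paper explicitly evaluates MLLMs), “LLM Agents” (it evaluates agentic methods), and “Chain of Thought” (it evaluates multi-step reasoning). Moderately relevant keywords: “Retrieval-Augmented Generation” (retrieval and generation are involved), “Context Window Extension” (handling large volumes of personal files calls for long context), “System 2 Thinking” (in-depth reasoning is involved), and “Tool Use” (agents may use tools for file management). Other keywords such as MoE, quantization, and alignment are not addressed in the paper and score 0.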

!!! tip deepseek-chat TL;DR

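The paper introduces the HippoCamp benchmark for evaluating how well multimodal large language models and agents perform user profiling, multimodal file search, and context-aware reasoning in personal-computer environments. Experiments show that even the most advanced models reach only 48.3% accuracy on user profiling, and the failure diagnosis identifies multimodal perception and evidence grounding as the main bottlenecks.

Abstract (Chinese translation)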

我们推出HippoCamp——一个旨在评估智能体多模态文件管理能力的新型基准测试平台。与现有聚焦于通用场景下网络交互、工具使用或软件自动化等任务的智能体基准不同，HippoCamp将智能体置于以用户为中心的环境中，要求其建模个体用户画像并在海量个人文件中进行上下文感知推理。本基准基于覆盖多模态的真实世界用户画像实例化设备级文件系统，包含超过2000个真实文件，数据总量达42.4 GB。基于原始文件，我们构建了581组问答对以评估智能体的搜索能力、证据感知能力和多步推理能力。为支持细粒度分析，我们提供了4.61万条密集标注的结构化轨迹用于逐步故障诊断。我们在HippoCamp上评估了多种前沿的多模态大语言模型（MLLMs）与智能体方法。综合实验结果表明存在显著性能差距：即使在用户画像建模任务中，最先进的商业模型准确率也仅达48.3%，尤其在密集个人文件系统内的长程检索与跨模态推理方面表现欠佳。进一步的逐步故障诊断表明，多模态感知与证据锚定是当前的主要瓶颈。最终，HippoCamp揭示了现有智能体在真实用户中心环境中的关键局限，并为开发下一代个人人工智能助手奠定了坚实基础。

Abstract

We present HippoCamp, a new benchmark designed to evaluate agents’ capabilities on multimodal file management. Unlike existing agent benchmarks that focus on tasks like web interaction, tool use, or software automation in generic settings, HippoCamp evaluates agents in user-centric environments to model individual user profiles and search massive personal files for context-aware reasoning. Our benchmark instantiates device-scale file systems over real-world profiles spanning diverse modalities, comprising 42.4 GB of data across over 2K real-world files. Building upon the raw files, we construct 581 QA pairs to assess agents’ capabilities in search, evidence perception, and multi-step reasoning. To facilitate fine-grained analysis, we provide 46.1K densely annotated structured trajectories for step-wise failure diagnosis. We evaluate a wide range of state-of-the-art multimodal large language models (MLLMs) and agentic methods on HippoCamp. Our comprehensive experiments reveal a significant performance gap: even the most advanced commercial models achieve only 48.3% accuracy in user profiling, struggling particularly with long-horizon retrieval and cross-modal reasoning within dense personal file systems. Furthermore, our step-wise failure diagnosis identifies multimodal perception and evidence grounding as the primary bottlenecks. Ultimately, HippoCamp exposes the critical limitations of current agents in realistic, user-centric environments and provides a robust foundation for developing next-generation personal AI assistants.

Keywords: multimodal large language models, agent benchmarking, personal file management, context-aware reasoning, user profiling, multi-step reasoning, retrieval, failure diagnosis

3. Execution-Verified Reinforcement Learning for Optimization Modeling

Authors: Runda Guan, Xiangqing Shen, Jiajun Zhang, Yifan Zhang, Jian Cheng, Rui Xia | Source: arXiv | Published: 2026-04-01 | arXiv link: http://arxiv.org/abs/2604.00442v1

Score: 50.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

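Scoring rationale: The paper proposes the EVOM framework, which uses LLMs (the core topic) to automatically translate natural-language problems into optimization-modeling code and trains them with execution-verified reinforcement learning (GRPO and DAPO). This directly involves “Post-training/SFT” (compared against supervised fine-tuning), “RLHF/RLAIF/DPO” (it uses GRPO and DAPO), “LLM Agents” (an automated modeling pipeline), and “Tool Use” (calling solver APIs). Other keywords such as MoE, Scaling Laws, PEFT, and RAG are not mentioned in the abstract and are unrelated to the paper.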

!!! tip deepseek-chat TL;DR

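The paper proposes EVOM, an execution-verified reinforcement learning framework for automating optimization modeling. It uses a mathematical programming solver as the verifier, generating and executing code to obtain rewards. Without process-level supervision, it matches or outperforms supervised fine-tuning and supports zero-shot solver transfer and low-cost solver adaptation.

Abstract (Chinese translation)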

利用大语言模型实现优化建模自动化是通向可扩展决策智能的一条前景广阔的路径，但现有方法要么依赖于基于闭源大语言模型构建的高推理延迟智能体流程，要么采用成本高昂的过程监督对较小模型进行微调，而这类方法常会过度适配单一求解器接口。受可验证奖励强化学习的启发，我们提出了执行验证优化建模（Execution-Verified Optimization Modeling, EVOM），这是一个执行验证学习框架，将数学规划求解器视为一个确定性的交互式验证器。给定自然语言描述的问题和目标求解器，EVOM生成针对特定求解器的代码，在沙箱环境中执行该代码，并将执行结果转化为标量奖励，通过GRPO和DAPO方法在“生成-执行-反馈-更新”的闭环流程中进行优化。这种仅基于结果的范式消除了对过程级监督的需求，并通过切换验证环境而非重建针对特定求解器的数据集，实现了跨求解器的泛化能力。在NL4OPT、MAMO、IndustryOR和OptiBench数据集上，针对Gurobi、OR-Tools和COPT等求解器的实验表明，EVOM达到或超越了过程监督的监督微调方法，支持零样本求解器迁移，并能通过在目标求解器后端下持续训练，实现高效低成本的求解器适配。

Abstract

Automating optimization modeling with LLMs is a promising path toward scalable decision intelligence, but existing approaches either rely on agentic pipelines built on closed-source LLMs with high inference latency, or fine-tune smaller LLMs using costly process supervision that often overfits to a single solver API. Inspired by reinforcement learning with verifiable rewards, we propose Execution-Verified Optimization Modeling (EVOM), an execution-verified learning framework that treats a mathematical programming solver as a deterministic, interactive verifier. Given a natural-language problem and a target solver, EVOM generates solver-specific code, executes it in a sandboxed harness, and converts execution outcomes into scalar rewards, optimized with GRPO and DAPO in a closed-loop generate-execute-feedback-update process. This outcome-only formulation removes the need for process-level supervision, and enables cross-solver generalization by switching the verification environment rather than reconstructing solver-specific datasets. Experiments on NL4OPT, MAMO, IndustryOR, and OptiBench across Gurobi, OR-Tools, and COPT show that EVOM matches or outperforms process-supervised SFT, supports zero-shot solver transfer, and achieves effective low-cost solver adaptation by continuing training under the target solver backend.
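
The generate-execute-feedback step can be pictured as a function that runs candidate code and maps the outcome to a scalar reward. The sketch below is a simplified illustration, not the paper's implementation: it assumes the generated program prints its objective value on the last line of stdout, uses a plain subprocess with a timeout in place of a real sandbox, and uses a 0/1 reward with a numeric tolerance. It also shows group-normalized advantages in the style of GRPO (reward minus group mean, divided by group standard deviation). All function names are hypothetical.

```python
import statistics
import subprocess
import sys


def execution_reward(code, reference_objective, tol=1e-6, timeout=10):
    """Run candidate code; return 1.0 if its printed objective matches the reference.

    Assumption: the program prints the objective value as the last stdout line.
    A subprocess with a timeout stands in for a proper sandbox.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return 0.0
    if result.returncode != 0:
        return 0.0  # crashed or exited with an error
    lines = result.stdout.strip().splitlines()
    if not lines:
        return 0.0
    try:
        value = float(lines[-1])
    except ValueError:
        return 0.0  # last line is not a number
    return 1.0 if abs(value - reference_objective) <= tol else 0.0


def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Three hypothetical candidates for one problem whose optimal objective is 12.
candidates = [
    "print(3 * 4)",         # runs, correct objective
    "print(3 + 4)",         # runs, wrong objective
    "raise SystemExit(1)",  # fails at execution
]
rewards = [execution_reward(c, reference_objective=12.0) for c in candidates]
advantages = group_advantages(rewards)
print(rewards, advantages)
```

Because the reward depends only on the execution outcome, switching solvers in this picture means changing what gets executed and checked rather than relabeling training data.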

Keywords: Large Language Models, Reinforcement Learning, Optimization Modeling, Execution-Verified Learning, Solver Adaptation, GRPO, DAPO, Automated Code Generation

4. Dual Optimal: Make Your LLM Peer-like with Dignity

Authors: Xiangqi Wang, Yue Huang, Haomin Zhuang, Kehan Guo, Xiangliang Zhang | Source: arXiv | Published: 2026-04-01 | arXiv link: http://arxiv.org/abs/2604.00979v1

Score: 45.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

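Scoring rationale: The paper's core topic is a behavioral failure of aligned language models (the Evasive Servant phenomenon) and the proposed Dignified Peer framework, which directly involves LLM alignment, a DPO algorithm, and building an LLM agent. “Large Language Models”, “Instruction Tuning/Alignment”, “RLHF/DPO”, and “LLM Agents” therefore each score 10 (core content). “Hallucination Mitigation” scores 5 because the paper's focus on trustworthiness is partially related to hallucination mitigation. Other keywords are not explicitly addressed in the paper and score 0.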

!!! tip deepseek-chat TL;DR

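The paper targets the “Evasive Servant” problem in current aligned language models (sycophantically validating users' flawed beliefs while deflecting responsibility with disclaimers). It proposes the Dignified Peer framework and, by introducing the PersonaKnob dataset and a tolerant constrained Lagrangian DPO algorithm, reports building an LLM agent that combines dignity with a peer-like character.

Abstract (Chinese translation)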

当前的对齐语言模型表现出一种我们称之为“回避型仆人”的双重失效模式：它们会谄媚地认可用户有缺陷的信念，同时用模板化的免责声明来推卸责任。我们提出了“尊严对等体”框架，该框架通过反谄媚性和可信赖性来对抗顺从性，并通过同理心和创造力来缓解回避性。实现这一智能体需要克服数据监督、目标坍缩和评估偏见方面的重大挑战。我们通过引入PersonaKnob数据集来解决这些问题，该数据集具有多个人格偏好的组合偏序结构。这些数据与一种宽容的约束拉格朗日DPO算法结合使用，该算法动态平衡所有人格维度以防止行为坍缩。此外，我们采用了一种经心理测量学校准的项目反应理论评估方案，以将潜在模型的人格能力与评估者偏见等混杂因素区分开来。大量的实证研究表明，我们的方法成功构建了一个兼具尊严与对等性的LLM智能体。

Abstract

Current aligned language models exhibit a dual failure mode we term the Evasive Servant: they sycophantically validate flawed user beliefs while deflecting responsibility with boilerplate disclaimers. We propose the Dignified Peer framework, which counters servility with anti-sycophancy and trustworthiness, and mitigates evasiveness through empathy and creativity. Realizing this agent requires overcoming significant challenges in data supervision, objective collapse, and evaluation bias. We address these issues by introducing the PersonaKnob dataset which features a compositional partial order structure of multiple persona preference. This data is utilized alongside a tolerant constrained Lagrangian DPO algorithm that dynamically balances all persona dimensions to prevent behavioral collapse. Additionally, we employ a psychometrically calibrated Item Response Theory evaluation protocol to disentangle latent model persona capability from confounders like judge biases. Extensive empirical studies demonstrate that our approach successfully build a LLM agent with both dignity and peer.
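
For background, the standard DPO objective for a single preference pair can be written in a few lines. This is the vanilla loss only; the tolerant constrained Lagrangian variant named in the abstract is not specified here and is not reproduced. The log-probability values and the default `beta` are made up for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    where each term is a sequence log-probability.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(x)) == log(1 + exp(-x)); fine for moderate x.
    return math.log1p(math.exp(-beta * margin))


# Policy identical to the reference: margin is 0, so the loss is log(2).
neutral = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy raises the chosen response and lowers the rejected one: loss drops.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(neutral, improved)
```

A constrained multi-objective variant would presumably combine several such terms, one per persona dimension, but how the paper weights them is not stated in the abstract.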

Keywords: LLM alignment, Dignified Peer, anti-sycophancy, trustworthiness, DPO, LLM agent, persona preference, behavioral collapse

5. MOON3.0: Reasoning-aware Multimodal Representation Learning for E-commerce Product Understanding

Authors: Junxian Wu, Chenghan Fu, Zhanheng Nie, Daoze Zhang, Bowen Wan, Wanxian Guan, Chuan Yu, Jian Xu, Bo Zheng | Source: arXiv | Published: 2026-04-01 | arXiv link: http://arxiv.org/abs/2604.00513v1

Score: 45.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	5.0/10	5.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

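Scoring rationale: MOON3.0 focuses on applying multimodal large language models (MLLMs) to e-commerce product understanding; its core idea is to use the reasoning capabilities of MLLMs to explicitly model fine-grained product attributes. It is therefore highly relevant to “Large Language Models” (10 points), since MLLMs extend LLMs; to “Supervised Fine-tuning” (10 points), since the paper explicitly discusses the limitations of SFT and proposes an improvement; and to “Chain of Thought” and “System 2 Thinking” (10 points each), since the paper emphasizes reasoning awareness and multi-step reasoning strategies. It is partially relevant to “Context Window Extension” (5 points), since the paper mentions the challenges of long-context reasoning. Other keywords such as MoE, SLMs, Scaling Laws, RLHF, RAG, and Quantization are not mentioned in the abstract or are unrelated to the paper's topic and score 0.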

!!! tip deepseek-chat TL;DR

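To address the difficulty multimodal large language models have in capturing fine-grained attributes for e-commerce product understanding, the paper proposes MOON3.0. Through multimodal fusion, a joint contrastive and reinforcement learning framework, and a fine-grained residual enhancement module, it achieves state-of-the-art zero-shot performance on multiple downstream tasks.

Abstract (Chinese translation)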

随着电子商务的快速发展，探索通用表征而非任务特定表征已引起越来越多的关注。尽管近期多模态大语言模型（MLLMs）在商品理解方面推动了显著进展，但它们通常被用作特征提取器，将商品信息隐式编码为全局嵌入，从而限制了其捕捉细粒度属性的能力。因此，我们认为利用MLLMs的推理能力来显式建模细粒度商品属性具有重要潜力。然而，由于几个关键挑战，实现这一目标仍非易事：（i）长上下文推理容易稀释模型对原始输入中显著信息的注意力；（ii）监督微调（SFT）主要鼓励僵化模仿，限制了有效推理策略的探索；（iii）细粒度细节在前向传播过程中逐渐衰减。为解决这些问题，我们提出了MOON3.0，这是首个基于MLLM的推理感知商品表征学习模型。我们的方法（1）采用多头模态融合模块自适应整合原始信号；（2）结合联合对比与强化学习框架，自主探索更有效的推理策略；（3）引入细粒度残差增强模块，在整个网络中逐步保留局部细节。此外，我们发布了一个大规模多模态电子商务基准MBE3.0。实验表明，我们的模型在自建基准和公共数据集的各种下游任务中均展现出领先的零样本性能。

Abstract

With the rapid growth of e-commerce, exploring general representations rather than task-specific ones has attracted increasing attention. Although recent multimodal large language models (MLLMs) have driven significant progress in product understanding, they are typically employed as feature extractors that implicitly encode product information into global embeddings, thereby limiting their ability to capture fine-grained attributes. Therefore, we argue that leveraging the reasoning capabilities of MLLMs to explicitly model fine-grained product attributes holds significant potential. Nevertheless, achieving this goal remains non-trivial due to several key challenges: (i) long-context reasoning tends to dilute the model’s attention to salient information in the raw input; (ii) supervised fine-tuning (SFT) primarily encourages rigid imitation, limiting the exploration of effective reasoning strategies; and (iii) fine-grained details are progressively attenuated during forward propagation. To address these issues, we propose MOON3.0, the first reasoning-aware MLLM-based model for product representation learning. Our method (1) employs a multi-head modality fusion module to adaptively integrate raw signals; (2) incorporates a joint contrastive and reinforcement learning framework to autonomously explore more effective reasoning strategies; and (3) introduces a fine-grained residual enhancement module to progressively preserve local details throughout the network. Additionally, we release a large-scale multimodal e-commerce benchmark MBE3.0. Experimentally, our model demonstrates state-of-the-art zero-shot performance across various downstream tasks on both our benchmark and public datasets.

Keywords: Multimodal Large Language Models, Reasoning-aware, Product Understanding, Supervised Fine-tuning, E-commerce, Zero-shot Performance, Fine-grained Attributes, Representation Learning

6. True (VIS) Lies: Analyzing How Generative AI Recognizes Intentionality, Rhetoric, and Misleadingness

Authors: Graziano Blasilli, Marco Angelini | Source: arXiv | Published: 2026-04-01 | arXiv link: http://arxiv.org/abs/2604.01181v1

Score: 41.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	5.0/10	5.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	8.0/10	8.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	8.0/10	8.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

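Scoring rationale: The paper's core topic is the ability of multimodal large language models (LLMs) to recognize misleading visualizations, so it is highly relevant to “Large Language Models” (10 points). It evaluates a range of models including small ones (such as Nemotron-Nano-V2-VL), so it is partially relevant to “Small Language Models” (5 points). Recognizing and explaining misleading content requires reasoning, so it is partially relevant to “Chain of Thought” and “System 2 Thinking” (5 points each). Its focus on recognizing misleading visualizations (the “lies”) relates to factuality and hallucination mitigation, so it is highly relevant to “Hallucination Mitigation” (8 points). It also compares model behavior with that of human experts to derive insights, which touches on explainability, so it is highly relevant to “Mechanistic Interpretability” (8 points). Other keywords (such as MoE, Scaling Laws, Pre-training, and RLHF) are not addressed in the paper or appear only as background and score 0.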

!!! tip deepseek-chat TL;DR

The study evaluates how well multimodal large language models identify and interpret misleading visualizations in COVID-19-related tweets. Comparing 16 state-of-the-art models with human experts, it finds that the models' judgments of misleading content resemble human judgments in some respects and differ in others.

Abstract (Chinese translation)

本研究探讨了多模态大语言模型（LLMs）识别与解读误导性可视化图表的能力，以及其对这些观察结果、背后成因及潜在意图的认知。我们运用可视化修辞学概念及一套新构建的作者意图分类法作为解释框架。通过提出三个研究问题，我们利用包含2,336条与COVID-19相关推文的数据集进行了实验分析，其中半数推文含有误导性可视化内容，并辅以来自IEEE VIS社区专注于展示欺骗性与误导性可视化图表的活动VisLies中提取的真实案例，这些案例涵盖了感知、认知和概念层面的错误。为确保全面覆盖当前LLMs的发展现状，我们评估了16个前沿模型，其中15个为开放权重模型，涵盖了广泛的模型规模、架构家族及推理能力。所选模型包括小型模型：Nemotron-Nano-V2-VL（120亿参数）、Mistral-Small-3.2（240亿）、DeepSeek-VL2（270亿）、Gemma3（270亿）和GTA1（320亿）；中型模型：Qianfan-VL（700亿）、Molmo（720亿）、GLM-4.5V（1080亿）、LLaVA-NeXT（1100亿）和Pixtral-Large（1240亿）；大型模型：Qwen3-VL（2350亿）、InternVL3.5（2410亿）、Step3（3210亿）、Llama-4-Maverick（4000亿）和Kimi-K2.5（10000亿）。此外，我们还采用了前沿的专有模型OpenAI GPT-5.4。为建立人类对这些任务的认知基准，我们同时开展了可视化专家用户研究，以评估人类如何理解相同误导性可视化图表中的修辞技巧及作者意图。这使得模型与专家行为之间的比较成为可能，揭示了二者的相似性与差异性，从而深入洞察大语言模型在哪些方面与人类判断一致，在哪些方面存在分歧。

Abstract

This study investigates the ability of multimodal Large Language Models (LLMs) to identify and interpret misleading visualizations, and recognize these observations along with their underlying causes and potential intentionality. Our analysis leverages concepts from visualization rhetoric and a newly developed taxonomy of authorial intents as explanatory lenses. We formulated three research questions and addressed them experimentally using a dataset of 2,336 COVID-19-related tweets, half of which contain misleading visualizations, and supplemented it with real-world examples of perceptual, cognitive, and conceptual errors drawn from VisLies, the IEEE VIS community event dedicated to showcasing deceptive and misleading visualizations. To ensure broad coverage of the current LLM landscape, we evaluated 16 state-of-the-art models. Among them, 15 are open-weight models, spanning a wide range of model sizes, architectural families, and reasoning capabilities. The selection comprises small models, namely Nemotron-Nano-V2-VL (12B parameters), Mistral-Small-3.2 (24B), DeepSeek-VL2 (27B), Gemma3 (27B), and GTA1 (32B); medium-sized models, namely Qianfan-VL (70B), Molmo (72B), GLM-4.5V (108B), LLaVA-NeXT (110B), and Pixtral-Large (124B); and large models, namely Qwen3-VL (235B), InternVL3.5 (241B), Step3 (321B), Llama-4-Maverick (400B), and Kimi-K2.5 (1000B). In addition, we employed OpenAI GPT-5.4, a frontier proprietary model. To establish a human perspective on these tasks, we also conducted a user study with visualization experts to assess how people perceive rhetorical techniques and the authorial intentions behind the same misleading visualizations. This allows comparison between model and expert behavior, revealing similarities and differences that provide insights into where LLMs align with human judgment and where they diverge.

Keywords: Large Language Models, misleading visualizations, multimodal LLMs, visualization rhetoric, authorial intents, COVID-19 tweets, model evaluation, human-expert comparison

7. Adapting Text LLMs to Speech via Multimodal Depth Up-Scaling

Authors: Kazuki Yano, Jun Suzuki, Shinji Watanabe | Source: arXiv | Published: 2026-04-01 | arXiv link: http://arxiv.org/abs/2604.00489v1

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	10.0/10	10.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	10.0/10	10.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	10.0/10	10.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0
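The total above can be checked against the pass threshold from the report header (5.0 + 0.8 × total weight). The sketch below assumes each row's score is weight × relevance, which is consistent with the rows shown but is not stated explicitly in the report; the function names are illustrative only.

```python
# Non-zero rows from the table above: (keyword group, weight, relevance).
rows = [
    ("Large Language Models", 1.0, 10.0),
    ("Small Language Models", 1.0, 10.0),
    ("Pre-training", 1.0, 10.0),
    ("PEFT / LoRA", 1.0, 10.0),
]
TOTAL_WEIGHT = 27.0  # 27 keyword groups, each weighted 1.0 (report header)


def total_score(rows):
    """Sum of weight * relevance over all rows (assumed scoring rule)."""
    return sum(weight * relevance for _, weight, relevance in rows)


def pass_threshold(total_weight, base=5.0, factor=0.8):
    """Pass threshold formula given in the report header."""
    return base + factor * total_weight


score = total_score(rows)
threshold = pass_threshold(TOTAL_WEIGHT)
verdict = "pass" if score >= threshold else "fail"
print(f"{score:.1f} / {threshold:.1f} -> {verdict}")
```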

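Scoring rationale: The paper's core contribution is adapting pre-trained text Large Language Models (LLMs) into Speech Language Models (Speech LMs) through continual pre-training, for which it proposes Multimodal Depth Upscaling. It is therefore highly relevant to "Large Language Models" and "Pre-training/Continual Pre-training" (10 points each). The experiments explicitly use SmolLM2 models, which fall under Small Language Models, so "Small Language Models" scores 10. The paper compares its method against LoRA (a PEFT technique) and addresses parameter-efficient fine-tuning, so "PEFT/LoRA" scores 10. Other keywords such as MoE, Scaling Laws, SFT, Alignment, and RAG are either not mentioned in the abstract or unrelated to the core content, so they score 0.

!!! tip deepseek-chat TL;DR

This paper studies how to adapt a pre-trained text LLM into a speech language model through continual pre-training with the proposed Multimodal Depth Upscaling method, achieving efficient speech recognition performance while preserving text capabilities.

Abstract (Chinese translation)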

通过在海量语音数据上进行持续预训练，将预训练的文本大语言模型（LLMs）适配为语音语言模型（Speech LMs）具有广阔前景，但这一过程常导致模型原有的文本能力下降。我们提出多模态深度扩展方法，该方法是对持续大语言模型预训练中一种新兴策略的扩展：在冻结的文本大语言模型中插入新的Transformer层，并仅使用语音数据训练这些新增层。基于SmolLM2-360M和SmolLM2-1.7B模型，在4.8万小时英语自动语音识别（ASR）数据上的实验表明，深度扩展方法能达到与全参数微调相当的ASR性能，同时其导致的文本能力退化远低于全参数微调和低秩自适应（LoRA）。我们进一步证明，将专为语音识别设计的E-Branchformer架构作为插入层，在较大模型上实现的ASR性能可匹配甚至超越全参数微调，同时将文本能力退化降低75%以上，且可训练参数减少了60%。

Abstract

Adapting pre-trained text Large Language Models (LLMs) into Speech Language Models (Speech LMs) via continual pretraining on speech data is promising, but often degrades the original text capabilities. We propose Multimodal Depth Upscaling, an extension of an emerging strategy in continual LLM pre-training, where new transformer layers are inserted into a frozen text LLM and only the added layers are trained on speech data. Experiments with SmolLM2-360M and SmolLM2-1.7B on 48k hours of English Automatic Speech Recognition (ASR) data show that depth up-scaling achieves ASR comparable to full fine-tuning while causing far less text degradation than both full fine-tuning and Low-Rank Adaptation (LoRA). We further show that incorporating E-Branchformer, an architecture designed for speech recognition, as the inserted layers achieves ASR that matches or surpasses full fine-tuning on the larger model while reducing text degradation by over 75% with 60% fewer trainable parameters.

Keywords: Large Language Models, Speech Language Models, Continual Pre-training, Multimodal Depth Upscaling, Parameter-efficient Fine-tuning, Automatic Speech Recognition, SmolLM2, E-Branchformer
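The depth up-scaling idea in the abstract above (insert new layers into a frozen model and train only those) can be illustrated with a small bookkeeping sketch. This is not the authors' implementation: the `Layer` record, the rule of inserting one layer after every `every` base layers, and the parameter counts are all assumptions made for illustration, since the abstract does not say where layers are inserted.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    n_params: int
    trainable: bool


def depth_upscale(base_layers, new_layer_params, every):
    """Freeze every base layer and insert one trainable layer after each
    group of `every` base layers (hypothetical placement rule)."""
    upscaled = []
    for i, layer in enumerate(base_layers, start=1):
        upscaled.append(Layer(layer.name, layer.n_params, trainable=False))
        if i % every == 0:
            upscaled.append(
                Layer(f"inserted_{i // every}", new_layer_params, trainable=True)
            )
    return upscaled


# Toy model: 8 base layers of 1M parameters each.
base = [Layer(f"base_{i}", 1_000_000, True) for i in range(8)]
model = depth_upscale(base, new_layer_params=1_000_000, every=4)

trainable = sum(layer.n_params for layer in model if layer.trainable)
total = sum(layer.n_params for layer in model)
print(len(model), trainable / total)
```

Only the inserted layers contribute to the trainable count, which is the property the abstract relies on to limit text degradation.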

8. IWP: Token Pruning as Implicit Weight Pruning in Large Vision Language Models

Authors: Dong-Jae Lee, Sunghyun Baek, Junmo Kim Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00757v1

Score: 36.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	5.0/10	5.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	8.0/10	8.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	8.0/10	8.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

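Scoring rationale: The paper proposes IWP, a training-free token pruning framework for Large Vision Language Models (LVLMs) whose core idea is to reduce the number of visual tokens and thereby lower computational cost. It is highly relevant to "Large Language Models" (10 points) because LVLMs are a kind of large model. It is fairly relevant to "Quantization" and "Speculative Decoding" (8 points each) because token pruning is a model compression and inference acceleration technique. It is weakly relevant to "KV Cache Compression" and "Mechanistic Interpretability" (5 points each) because it involves analysis of the attention mechanism and efficiency optimization. Other keywords such as MoE, SFT, and RAG are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

To address the high computational cost caused by the large number of visual tokens in Large Vision Language Models, this paper proposes IWP, a training-free token pruning framework grounded in the dual-form view of attention. It guides pruning with a metric that quantifies each token's information magnitude and duplication, and the experiments show a better trade-off between performance and efficiency.

Abstract (Chinese translation)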

大型视觉语言模型在图像与视频理解任务中展现出卓越性能，但其计算成本随视觉标记数量的增加而快速上升。现有的标记剪枝方法通过经验性策略缓解这一问题，却忽视了注意力机制的内在原理。本文基于注意力的对偶形式视角，提出一种无需训练的新型标记剪枝框架。我们将注意力重新表述为一个隐式线性层，其权重矩阵由多个秩为1的外积求和构成，每个外积由单个标记的键值对生成。因此，标记剪枝问题转化为选择这些秩1更新的最优子集，以最佳逼近原始对偶权重矩阵。将此视角扩展至大型视觉语言模型中的标准softmax注意力机制，我们推导出一种新的度量指标，可同时量化标记的信息强度与信息冗余度。为基于该度量高效选择子集，我们提出了渐进式分块最大边际相关性算法。大量实验表明，本方法在性能与效率之间实现了更优的平衡，同时为现有剪枝方法提供了新的理论视角。

Abstract

Large Vision Language Models show impressive performance across image and video understanding tasks, yet their computational cost grows rapidly with the number of visual tokens. Existing token pruning methods mitigate this issue through empirical approaches while overlooking the internal mechanism of attention. In this paper, we propose a novel training free token pruning framework grounded in the dual form perspective of attention. We reformulate attention as an implicit linear layer whose weight matrix is the sum of rank 1 outer products, each generated by a single token’s key value pair. Token pruning thus reduces to selecting an optimal subset of these rank 1 updates that best approximates the original dual weight matrix. Extending this perspective to standard softmax attention in LVLMs, we derive a novel metric quantifying both a token’s information magnitude and information duplication. To efficiently select the subset with the proposed metric, we introduce Progressive Chunked Maximal Marginal Relevance. Extensive experiments demonstrate that our method achieves a better trade off between performance and efficiency, while providing another perspective on existing pruning approaches.

Keywords: Large Vision Language Models, Token Pruning, Implicit Weight Pruning, Attention Mechanism, Computational Efficiency, Training-free Framework, Progressive Chunked Maximal Marginal Relevance, Dual Form Perspective
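The dual-form view described in the abstract above can be checked numerically on a toy case. The sketch below covers only unnormalised linear attention, where the output for a query equals an implicit weight matrix (a sum of rank-1 outer products of values and keys) applied to that query. The softmax extension, the paper's metric, and Progressive Chunked Maximal Marginal Relevance are not reproduced here; the vectors and helper names are illustrative assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def dual_weight(keys, values, keep):
    """W = sum over kept tokens i of the rank-1 outer product v_i k_i^T."""
    d_v, d_k = len(values[0]), len(keys[0])
    w = [[0.0] * d_k for _ in range(d_v)]
    for i in keep:
        for r in range(d_v):
            for c in range(d_k):
                w[r][c] += values[i][r] * keys[i][c]
    return w


def apply(w, q):
    return [dot(row, q) for row in w]


def frobenius_gap(a, b):
    return math.sqrt(
        sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    )


# Toy tokens: token 2 is an exact duplicate of token 0.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
values = [[2.0, 0.0], [0.0, 3.0], [2.0, 0.0]]
query = [1.0, 1.0]
all_tokens = [0, 1, 2]

# Primal view: weight each value by its key-query score, then sum.
primal = [
    sum(dot(k, query) * v[d] for k, v in zip(keys, values)) for d in range(2)
]
# Dual view: build the implicit weight matrix once, then apply it to the query.
full_w = dual_weight(keys, values, all_tokens)
dual = apply(full_w, query)

# Pruning keeps a subset of rank-1 terms; compare two candidate subsets.
gap_drop_duplicate = frobenius_gap(full_w, dual_weight(keys, values, [0, 1]))
gap_drop_unique = frobenius_gap(full_w, dual_weight(keys, values, [0, 2]))
print(primal, dual, gap_drop_duplicate, gap_drop_unique)
```

In this toy setup the two views give the same output, and dropping the duplicated token leaves a smaller approximation gap (2.0) than dropping the unique one (3.0), which is the kind of subset comparison the abstract frames pruning as.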

9. Cost-Penalized Fitness in FMA-Orchestrated Mixture of Experts: Experimental Evidence for Molecular M

Author: Martin Jaraiz Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00812v1

Score: 35.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	15.0/10	15.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	10.0/10	10.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

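Scoring rationale: The paper's core topic is expert management for MoE systems under changing data distributions. It introduces the concept of "molecular memory", an innovation at the level of large-model technical principles. Highly relevant keywords: "Mixture of Experts" (core content, 15 points), "Large Language Models" (the paper explicitly mentions LLM development, 10 points), and "Domain Adaptation" (it studies domain adaptation, 10 points). Other keywords such as SLMs, Scaling Laws, and SFT are not covered and score 0.

!!! tip deepseek-chat TL;DR

This paper studies how an MoE system can manage its expert pool under changing data distributions using a cost-penalized fitness metric and a linear grace period. Experiments show that the approach produces a "molecular memory" effect: recovery is 9-11x faster when the system returns to a previously learned domain, with no expert replacement required.

Abstract (Chinese translation)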

我们展示了七组受控运行的实验结果，这些实验基于nanoFMT——一种采用动态混合专家（Mixture-of-Experts, MoE）管理的自由市场算法（Free-Market Algorithm, FMA）协调的Transformer模型。实验旨在回答高级大语言模型（LLM）开发中的一个核心问题：当MoE系统在数据分布动态变化下满负荷运行时，应如何管理其专家池？我们证明，采用成本惩罚的适应度指标，并结合为新专家设置的线性宽限期，能够构建一个通过多样化而非替换来积累领域专业知识的系统。核心成果体现在一项往返式领域切换实验中：当系统重返先前学习过的领域时，其恢复速度提升了9至11倍，且无需新增或替换任何专家。这种“分子记忆”效应——即休眠专家在其对应领域重现时得以保留并重新激活——在当前MoE管理方法中尚无先例。初步成本分析估算，在中等规模场景下，一个OpenAI级别的服务提供商采用此方法每年可节省约3910万美元成本，并减少27.1吉瓦时的能源消耗。

Abstract

We present experimental results from seven controlled runs of nanoFMT, a Free-Market Algorithm (FMA) orchestrated transformer with dynamic Mixture-of-Experts (MoE) management. The experiments address a fundamental question for advanced LLM development: how should an MoE system manage its expert pool when operating at full capacity under changing data distributions? We demonstrate that cost-penalized fitness metrics, combined with a linear grace period for newborn experts, produce a system that accumulates domain expertise through diversification rather than replacement. The central result is a round-trip domain shift experiment showing 9-11x faster recovery when returning to a previously learned domain, with zero expert births or replacements required. This “molecular memory” effect – where dormant experts survive and reactivate when their domain returns – has no analogue in current MoE management approaches. A preliminary cost analysis estimates annual savings of $39.1M and 27.1 GWh energy reduction for an OpenAI-scale provider under a moderate scenario.

Keywords: Mixture of Experts, MoE, domain adaptation, cost-penalized fitness, molecular memory, Free-Market Algorithm, transformer, expert management
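The abstract above names two ingredients, a cost-penalized fitness metric and a linear grace period for newborn experts, without giving formulas. The sketch below is one hypothetical way the two could combine and is not the nanoFMT definition: the functional form, `cost_weight`, and `grace_steps` are all assumptions chosen for illustration.

```python
def grace_factor(age, grace_steps):
    """Linear ramp from 0.0 (newborn expert, no cost charged) to 1.0
    (full cost charged) over `grace_steps` steps. Hypothetical form."""
    if grace_steps <= 0:
        return 1.0
    return min(1.0, age / grace_steps)


def fitness(utility, cost, age, cost_weight=0.5, grace_steps=100):
    """Utility minus a cost penalty that is phased in during the grace period."""
    return utility - cost_weight * cost * grace_factor(age, grace_steps)


# Same utility and cost, different expert ages.
newborn = fitness(utility=1.0, cost=1.0, age=0)
halfway = fitness(utility=1.0, cost=1.0, age=50)
mature = fitness(utility=1.0, cost=1.0, age=200)
print(newborn, halfway, mature)
```

Under these assumptions a newborn expert is judged on utility alone, and the cost penalty grows linearly until the grace period ends.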

10. EvolveTool-Bench: Evaluating the Quality of LLM-Generated Tool Libraries as Software Artifacts

Authors: Alibek T. Kaliyev, Artem Maryanskyy Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00392v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

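Scoring rationale: The paper's core topic is evaluating the quality of tool libraries generated by LLM agents, so it is highly relevant to "LLM Agents" and "Tool Use" (10 points each). As applied research on large models, it also scores 10 for "Large Language Models". Other keywords such as MoE, SLMs, Scaling Laws, training methods, inference optimization, and AI for Science are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

This paper addresses quality evaluation of the tool libraries (such as Python functions and API clients) that LLM agents generate at runtime. It proposes the EvolveTool-Bench benchmark and finds that systems with similar task completion can differ by up to 18% in library health, a software quality risk that task-only evaluation hides. It argues that tool libraries should be evaluated as first-class software artifacts.

Abstract (Chinese translation)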

现代大语言模型（LLM）智能体在运行时越来越多地自主创建工具——从Python函数到API客户端——然而现有的基准测试几乎完全通过下游任务完成度来评估它们。这好比仅凭代码能否运行来评判软件工程师，而忽略了冗余性、回归问题和安全性。我们推出了EvolveTool-Bench，这是一个针对软件工程工作流中LLM生成工具库的诊断性基准测试。在三个需要实际工具执行的领域（专有数据格式、API编排和数值计算）中，我们定义了库级别的软件质量指标——复用性、冗余度、组合成功率、回归稳定性和安全性——同时引入衡量每个工具正确性、鲁棒性、通用性和代码质量的工具质量评分（Tool Quality Score）。在首次对代码级与策略级工具演化方法（ARISE vs. EvoSkill vs. 单次生成基线，涵盖99项任务和两种模型）的直接比较中，我们发现任务完成率相近的系统（63-68%）在工具库健康度上存在高达18%的差异，这揭示了仅关注任务的评估所无法察觉的软件质量风险。我们的研究结果强调：对LLM生成工具的评估与治理，需要将不断演化的工具库视为首要的软件制品，而非黑箱。

Abstract

Modern LLM agents increasingly create their own tools at runtime – from Python functions to API clients – yet existing benchmarks evaluate them almost exclusively by downstream task completion. This is analogous to judging a software engineer only by whether their code runs, ignoring redundancy, regression, and safety. We introduce EvolveTool-Bench, a diagnostic benchmark for LLM-generated tool libraries in software engineering workflows. Across three domains requiring actual tool execution (proprietary data formats, API orchestration, and numerical computation), we define library-level software quality metrics – reuse, redundancy, composition success, regression stability, and safety – alongside a per-tool Tool Quality Score measuring correctness, robustness, generality, and code quality. In the first head-to-head comparison of code-level and strategy-level tool evolution (ARISE vs. EvoSkill vs. one-shot baselines, 99 tasks, two models), we show that systems with similar task completion (63-68%) differ by up to 18% in library health, revealing software quality risks invisible to task-only evaluation. Our results highlight that evaluation and governance of LLM-generated tools require treating the evolving tool library as a first-class software artifact, not a black box.

Keywords: LLM agents, tool generation, software quality, benchmark evaluation, tool libraries, EvolveTool-Bench, software artifacts, tool evolution

11. Agent psychometrics: Task-level performance prediction in agentic coding benchmarks

Authors: Chris Ge, Daria Kryvosheieva, Daniel Fried, Uzay Girit, Kaivalya Hariharan Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00594v1

Score: 28.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	8.0/10	8.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

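Scoring rationale: The paper focuses on a performance prediction framework for LLM-based coding agents and centrally involves LLM agents and tool use, so these two keywords score 10 and 8 respectively. Other keywords such as MoE, SLMs, Scaling Laws, the various training methods, inference techniques, compression methods, hallucination mitigation, and AI for Science applications are not mentioned in the abstract, are unrelated to the paper, and score 0.

!!! tip deepseek-chat TL;DR

To address the difficulty of evaluating LLM-based coding agents that interact with tools and environments, this paper proposes a framework that combines Item Response Theory with task feature extraction. The framework accurately predicts task-level performance and decomposes agent ability into LLM and scaffold components.

Abstract (Chinese translation)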

随着基于大语言模型的编程研究焦点从静态单步代码生成转向与工具及环境的多步智能体交互，理解何种任务将挑战智能体及其原因变得日益困难。当前实践加剧了这一挑战：智能体性能通常通过基准测试的总体通过率来衡量，但单一数值指标掩盖了基准测试内部任务的多样性。我们提出了一个针对智能体编程范式、用于预测个体任务成功或失败的框架。该方法通过从任务中提取丰富特征（包括问题描述、代码库上下文、解决方案和测试用例）来增强项目反应理论，并引入了一种将智能体能力分解为大语言模型能力与框架能力组件的新颖方法。这种参数化使我们能够整合异构排行榜的评估数据，并准确预测未见基准测试的任务级性能，以及未见的大语言模型-框架组合性能。我们的方法对基准设计者具有实际应用价值，他们无需运行计算成本高昂的智能体评估，即可更好地校准新任务的难度。

Abstract

As the focus in LLM-based coding shifts from static single-step code generation to multi-step agentic interaction with tools and environments, understanding which tasks will challenge agents and why becomes increasingly difficult. This is compounded by current practice: agent performance is typically measured by aggregate pass rates on benchmarks, but single-number metrics obscure the diversity of tasks within a benchmark. We present a framework for predicting success or failure on individual tasks tailored to the agentic coding regime. Our approach augments Item Response Theory (IRT) with rich features extracted from tasks, including issue statements, repository contexts, solutions, and test cases, and introduces a novel decomposition of agent ability into LLM and scaffold ability components. This parameterization enables us to aggregate evaluation data across heterogeneous leaderboards and accurately predict task-level performance for unseen benchmarks, as well as unseen LLM-scaffold combinations. Our methods have practical utility for benchmark designers, who can better calibrate the difficulty of their new tasks without running computationally expensive agent evaluations.

Keywords: LLM-based coding, agentic interaction, task-level performance prediction, Item Response Theory, agent ability decomposition, benchmark evaluation, tool use, multi-step coding
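For readers unfamiliar with Item Response Theory, the sketch below shows the standard two-parameter logistic (2PL) form, in which the probability of solving a task depends on the gap between ability and task difficulty. Treating agent ability as the sum of an LLM term and a scaffold term is an assumption made here to illustrate the decomposition mentioned in the abstract; the paper's actual parameterization and its task-feature model are not reproduced.

```python
import math


def p_success(theta_llm, theta_scaffold, difficulty, discrimination=1.0):
    """2PL IRT success probability with an (assumed) additive ability split."""
    theta = theta_llm + theta_scaffold  # assumption: abilities add
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


def expected_pass_rate(theta_llm, theta_scaffold, difficulties):
    """Aggregate pass rate implied by per-task success probabilities."""
    probs = [p_success(theta_llm, theta_scaffold, b) for b in difficulties]
    return sum(probs) / len(probs)


# Hypothetical abilities and task difficulties, for illustration only.
task_difficulties = [-1.0, 0.0, 1.0, 2.0]
per_task = [p_success(0.5, 0.5, b) for b in task_difficulties]
overall = expected_pass_rate(0.5, 0.5, task_difficulties)
print([round(p, 3) for p in per_task], round(overall, 3))
```

The per-task probabilities vary widely even though they collapse into one aggregate pass rate, which is the point the abstract makes about single-number metrics.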

📋 All Papers

1. ✅ Can Large Language Models Self-Correct in Medical Question Answering? An Exploratory Study

Authors: Zaifu Zhan, Mengyuan Cui, Rui Zhang Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2604.00261v1

Score: 55.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	10.0/10	10.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

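Scoring rationale: The paper's core topic is the self-reflection (Self-Correction/Self-Reflection) ability of Large Language Models (LLMs) in medical question answering (a subfield of AI for Science), evaluated using Chain of Thought prompting. It is therefore highly relevant to "Large Language Models", "Chain of Thought", "Self-Correction", and "AI for Science" (10 points each). The paper analyzes reasoning transparency and correctness, which is moderately relevant to "System 2 Thinking" and "Explainable AI" (5 points each); it also examines model reliability, which is partially relevant to "Hallucination Mitigation" (5 points). Other keywords such as MoE, SFT, and RAG are not covered in the study and score 0.

!!! tip deepseek-chat TL;DR

This study explores how effectively Large Language Models self-correct in medical question answering when given self-reflective prompts. It finds that self-reflection does not consistently improve accuracy and that its effect depends heavily on the dataset and model, suggesting that self-reflection is better suited to understanding model behavior than to directly improving reliability.

Abstract (Chinese translation)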

大语言模型（LLM）在医学问答任务上已展现出强劲性能，而思维链提示通过引导模型进行显式的中间推理进一步提升了结果；与此同时，自我反思（自我校正）提示被广泛认为能通过促使大语言模型批判和修正自身推理来增强模型可靠性，但其在安全关键的医疗场景中的有效性仍不明确。本研究对医学多项选择题问答中的自我反思推理进行了探索性分析：使用GPT-4o和GPT-4o-mini模型，我们比较了标准思维链提示与迭代式自我反思循环的效果，并在三个广泛使用的医学问答基准数据集（MedQA、HeadQA和PubMedQA）上追踪了预测结果在反思步骤中的演变过程。我们分析了自我反思是否会导致错误修正、错误持续或引入新错误。研究结果表明，自我反思提示并不能持续提升准确性，其效果高度依赖于数据集和模型：在MedQA上带来有限增益，但在HeadQA和PubMedQA上收益甚微或产生负面影响，且增加反思步骤数并不能保证性能提升。这些发现揭示了推理透明度与推理正确性之间的差距，表明自我反思推理更适合被视为理解模型行为的分析工具，而非提升医学问答可靠性的独立解决方案。

Abstract

Large language models (LLMs) have achieved strong performance on medical question answering (medical QA), and chain-of-thought (CoT) prompting has further improved results by eliciting explicit intermediate reasoning; meanwhile, self-reflective (self-corrective) prompting has been widely claimed to enhance model reliability by prompting LLMs to critique and revise their own reasoning, yet its effectiveness in safety-critical medical settings remains unclear. In this work, we conduct an exploratory analysis of self-reflective reasoning for medical multiple-choice question answering: using GPT-4o and GPT-4o-mini, we compare standard CoT prompting with an iterative self-reflection loop and track how predictions evolve across reflection steps on three widely used medical QA benchmarks (MedQA, HeadQA, and PubMedQA). We analyze whether self-reflection leads to error correction, error persistence, or the introduction of new errors. Our results show that self-reflective prompting does not consistently improve accuracy and its impact is highly dataset- and model-dependent: it yields modest gains on MedQA but provides limited or negative benefits on HeadQA and PubMedQA, and increasing the number of reflection steps does not guarantee better performance. These findings highlight a gap between reasoning transparency and reasoning correctness, suggesting that self-reflective reasoning is better viewed as an analytical tool for understanding model behavior rather than a standalone solution for improving medical QA reliability.

Keywords: Large Language Models, Self-Correction, Self-Reflection, Chain-of-Thought, Medical Question Answering, Model Reliability, GPT-4o, Reasoning Transparency
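The abstract above asks whether self-reflection leads to error correction, error persistence, or new errors. A minimal sketch of that bookkeeping follows: it compares each question's answer before and after reflection against the gold label. The record layout and the four toy records are invented for illustration and are not data from the paper.

```python
from collections import Counter


def classify(initial, final, gold):
    """Label how one prediction changed between the first and last step."""
    if initial == gold and final == gold:
        return "stays_correct"
    if initial != gold and final == gold:
        return "error_corrected"
    if initial == gold and final != gold:
        return "new_error"
    return "error_persists"


def summarize(records):
    """Return transition counts plus accuracy before and after reflection."""
    counts = Counter(
        classify(r["initial"], r["final"], r["gold"]) for r in records
    )
    n = len(records)
    acc_before = sum(r["initial"] == r["gold"] for r in records) / n
    acc_after = sum(r["final"] == r["gold"] for r in records) / n
    return counts, acc_before, acc_after


# Toy multiple-choice records (hypothetical).
records = [
    {"initial": "A", "final": "A", "gold": "A"},
    {"initial": "B", "final": "C", "gold": "C"},
    {"initial": "D", "final": "B", "gold": "D"},
    {"initial": "A", "final": "B", "gold": "C"},
]
counts, acc_before, acc_after = summarize(records)
print(dict(counts), acc_before, acc_after)
```

In the toy data one correction and one new error cancel out, so accuracy is unchanged even though half of the answers moved, which is why transition counts are more informative than accuracy alone.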

2. ✅ HippoCamp: Benchmarking Contextual Agents on Personal Computers

Score: 50.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	5.0/10	5.0
“Context Window Extension” OR “Long Context LLMs”	1.0	5.0/10	5.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	5.0/10	5.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

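This paper proposes the HippoCamp benchmark for evaluating how well multimodal large language models and agents perform user profiling, multimodal file search, and context-aware reasoning in personal computer environments. Experiments show that even the most advanced current models reach only 48.3% accuracy in user profiling, and identify multimodal perception and evidence grounding as the main bottlenecks.

Abstract (Chinese translation)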

我们推出HippoCamp——一个旨在评估智能体多模态文件管理能力的新型基准测试平台。与现有聚焦于通用场景下网络交互、工具使用或软件自动化等任务的智能体基准不同，HippoCamp将智能体置于以用户为中心的环境中，要求其建模个体用户画像并在海量个人文件中进行上下文感知推理。本基准基于覆盖多模态的真实世界用户画像实例化设备级文件系统，包含超过2000个真实文件，数据总量达42.4 GB。基于原始文件，我们构建了581组问答对以评估智能体的搜索能力、证据感知能力和多步推理能力。为支持细粒度分析，我们提供了4.61万条密集标注的结构化轨迹用于逐步故障诊断。我们在HippoCamp上评估了多种前沿的多模态大语言模型（MLLMs）与智能体方法。综合实验结果表明存在显著性能差距：即使在用户画像建模任务中，最先进的商业模型准确率也仅达48.3%，尤其在密集个人文件系统内的长程检索与跨模态推理方面表现欠佳。进一步的逐步故障诊断表明，多模态感知与证据锚定是当前的主要瓶颈。最终，HippoCamp揭示了现有智能体在真实用户中心环境中的关键局限，并为开发下一代个人人工智能助手奠定了坚实基础。

Abstract

We present HippoCamp, a new benchmark designed to evaluate agents’ capabilities on multimodal file management. Unlike existing agent benchmarks that focus on tasks like web interaction, tool use, or software automation in generic settings, HippoCamp evaluates agents in user-centric environments to model individual user profiles and search massive personal files for context-aware reasoning. Our benchmark instantiates device-scale file systems over real-world profiles spanning diverse modalities, comprising 42.4 GB of data across over 2K real-world files. Building upon the raw files, we construct 581 QA pairs to assess agents’ capabilities in search, evidence perception, and multi-step reasoning. To facilitate fine-grained analysis, we provide 46.1K densely annotated structured trajectories for step-wise failure diagnosis. We evaluate a wide range of state-of-the-art multimodal large language models (MLLMs) and agentic methods on HippoCamp. Our comprehensive experiments reveal a significant performance gap: even the most advanced commercial models achieve only 48.3% accuracy in user profiling, struggling particularly with long-horizon retrieval and cross-modal reasoning within dense personal file systems. Furthermore, our step-wise failure diagnosis identifies multimodal perception and evidence grounding as the primary bottlenecks. Ultimately, HippoCamp exposes the critical limitations of current agents in realistic, user-centric environments and provides a robust foundation for developing next-generation personal AI assistants.

Keywords: multimodal large language models, agent benchmarking, personal file management, context-aware reasoning, user profiling, multi-step reasoning, retrieval, failure diagnosis

3. ✅ Execution-Verified Reinforcement Learning for Optimization Modeling

Authors: Runda Guan, Xiangqing Shen, Jiajun Zhang, Yifan Zhang, Jian Cheng, Rui Xia Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00442v1

Score: 50.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

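This paper proposes EVOM, an execution-verified reinforcement learning framework for automated optimization modeling. It treats a mathematical programming solver as the verifier, generating and executing code to obtain rewards. Without process-level supervision, it matches or outperforms supervised fine-tuning and supports zero-shot solver transfer and low-cost solver adaptation.

Abstract (Chinese translation)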

利用大语言模型实现优化建模自动化是通向可扩展决策智能的一条前景广阔的路径，但现有方法要么依赖于基于闭源大语言模型构建的高推理延迟智能体流程，要么采用成本高昂的过程监督对较小模型进行微调，而这类方法常会过度适配单一求解器接口。受可验证奖励强化学习的启发，我们提出了执行验证优化建模（Execution-Verified Optimization Modeling, EVOM），这是一个执行验证学习框架，将数学规划求解器视为一个确定性的交互式验证器。给定自然语言描述的问题和目标求解器，EVOM生成针对特定求解器的代码，在沙箱环境中执行该代码，并将执行结果转化为标量奖励，通过GRPO和DAPO方法在“生成-执行-反馈-更新”的闭环流程中进行优化。这种仅基于结果的范式消除了对过程级监督的需求，并通过切换验证环境而非重建针对特定求解器的数据集，实现了跨求解器的泛化能力。在NL4OPT、MAMO、IndustryOR和OptiBench数据集上，针对Gurobi、OR-Tools和COPT等求解器的实验表明，EVOM达到或超越了过程监督的监督微调方法，支持零样本求解器迁移，并能通过在目标求解器后端下持续训练，实现高效低成本的求解器适配。

Abstract

Automating optimization modeling with LLMs is a promising path toward scalable decision intelligence, but existing approaches either rely on agentic pipelines built on closed-source LLMs with high inference latency, or fine-tune smaller LLMs using costly process supervision that often overfits to a single solver API. Inspired by reinforcement learning with verifiable rewards, we propose Execution-Verified Optimization Modeling (EVOM), an execution-verified learning framework that treats a mathematical programming solver as a deterministic, interactive verifier. Given a natural-language problem and a target solver, EVOM generates solver-specific code, executes it in a sandboxed harness, and converts execution outcomes into scalar rewards, optimized with GRPO and DAPO in a closed-loop generate-execute-feedback-update process. This outcome-only formulation removes the need for process-level supervision, and enables cross-solver generalization by switching the verification environment rather than reconstructing solver-specific datasets. Experiments on NL4OPT, MAMO, IndustryOR, and OptiBench across Gurobi, OR-Tools, and COPT show that EVOM matches or outperforms process-supervised SFT, supports zero-shot solver transfer, and achieves effective low-cost solver adaptation by continuing training under the target solver backend.

Keywords: Large Language Models, Reinforcement Learning, Optimization Modeling, Execution-Verified Learning, Solver Adaptation, GRPO, DAPO, Automated Code Generation
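
The outcome-to-reward step and the GRPO update signal described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not EVOM's implementation: the `ExecutionResult` type, its status labels, and the reward values (1.0, 0.1, 0.0) are hypothetical, and the sandboxed execution itself is left out. The group-standardized advantage follows the commonly used GRPO formulation.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExecutionResult:
    """Outcome of running generated solver code (hypothetical structure)."""
    status: str                 # "ok", "error", or "timeout" (assumed labels)
    objective: Optional[float]  # objective value reported by the solver, if any


def outcome_reward(result: ExecutionResult, reference_objective: float,
                   tol: float = 1e-6) -> float:
    """Map an execution outcome to a scalar reward (reward values are assumptions)."""
    if result.status != "ok" or result.objective is None:
        return 0.0  # code crashed, timed out, or produced no objective
    if math.isclose(result.objective, reference_objective, rel_tol=tol, abs_tol=tol):
        return 1.0  # code ran and matched the reference objective
    return 0.1      # code ran but reached a different objective


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of samples for the same prompt."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]
```

Because only the final outcome is scored, switching to another solver would in principle change how `ExecutionResult` is produced, not the reward or advantage code.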

4. ✅ Dual Optimal: Make Your LLM Peer-like with Dignity

Authors: Xiangqi Wang, Yue Huang, Haomin Zhuang, Kehan Guo, Xiangliang Zhang | Source: arXiv | Published: 2026-04-01 | arXiv link: http://arxiv.org/abs/2604.00979v1

Score: 45.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0
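
The total above can be reproduced from the table with a short sketch. It assumes each row's score is weight × relevance (the table is consistent with this, but all weights are 1.0, so it cannot be confirmed) and uses the passing-score formula from the report's configuration section, 5.0 + 0.8 × total weight. The function names are illustrative.

```python
def pass_threshold(total_weight, base=5.0, per_weight=0.8):
    """Passing score as stated in the report configuration."""
    return base + per_weight * total_weight


def paper_score(rows):
    """Sum of weight * relevance over (weight, relevance) rows (assumed rule)."""
    return sum(weight * relevance for weight, relevance in rows)


# Non-zero rows of the table above: LLMs, Alignment, DPO, LLM Agents, Factuality.
rows = [(1.0, 10.0), (1.0, 10.0), (1.0, 10.0), (1.0, 10.0), (1.0, 5.0)]
score = paper_score(rows)
threshold = pass_threshold(27.0)
passed = score >= threshold
```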

!!! tip deepseek-chat TL;DR

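The paper targets the "Evasive Servant" failure mode of current aligned language models, which sycophantically validate users' flawed beliefs while deflecting responsibility with boilerplate disclaimers. It proposes the Dignified Peer framework, built on the PersonaKnob dataset and a tolerant constrained Lagrangian DPO algorithm, and reports that the resulting LLM agent is both dignified and peer-like.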

Abstract (Chinese translation)

当前的对齐语言模型表现出一种我们称之为“回避型仆人”的双重失效模式：它们会谄媚地认可用户有缺陷的信念，同时用模板化的免责声明来推卸责任。我们提出了“尊严对等体”框架，该框架通过反谄媚性和可信赖性来对抗顺从性，并通过同理心和创造力来缓解回避性。实现这一智能体需要克服数据监督、目标坍缩和评估偏见方面的重大挑战。我们通过引入PersonaKnob数据集来解决这些问题，该数据集具有多个人格偏好的组合偏序结构。这些数据与一种宽容的约束拉格朗日DPO算法结合使用，该算法动态平衡所有人格维度以防止行为坍缩。此外，我们采用了一种经心理测量学校准的项目反应理论评估方案，以将潜在模型的人格能力与评估者偏见等混杂因素区分开来。大量的实证研究表明，我们的方法成功构建了一个兼具尊严与对等性的LLM智能体。

Abstract

Current aligned language models exhibit a dual failure mode we term the Evasive Servant: they sycophantically validate flawed user beliefs while deflecting responsibility with boilerplate disclaimers. We propose the Dignified Peer framework, which counters servility with anti-sycophancy and trustworthiness, and mitigates evasiveness through empathy and creativity. Realizing this agent requires overcoming significant challenges in data supervision, objective collapse, and evaluation bias. We address these issues by introducing the PersonaKnob dataset which features a compositional partial order structure of multiple persona preference. This data is utilized alongside a tolerant constrained Lagrangian DPO algorithm that dynamically balances all persona dimensions to prevent behavioral collapse. Additionally, we employ a psychometrically calibrated Item Response Theory evaluation protocol to disentangle latent model persona capability from confounders like judge biases. Extensive empirical studies demonstrate that our approach successfully build a LLM agent with both dignity and peer.

Keywords: LLM alignment, Dignified Peer, anti-sycophancy, trustworthiness, DPO, LLM agent, persona preference, behavioral collapse
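
For orientation, the standard DPO loss for a single preference pair is sketched below. It is the baseline objective only: the abstract does not specify the tolerant constrained Lagrangian variant, so that variant is not reproduced here. The argument names and the default `beta` are illustrative.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss equals log 2 when the policy and the reference model agree, and it falls as the policy raises the chosen response's likelihood relative to the rejected one.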

5. ✅ MOON3.0: Reasoning-aware Multimodal Representation Learning for E-commerce Product Understanding

Score: 45.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	5.0/10	5.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

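Multimodal large language models struggle to capture fine-grained attributes in e-commerce product understanding. The paper proposes MOON3.0, which combines a multi-head modality fusion module, a joint contrastive and reinforcement learning framework, and a fine-grained residual enhancement module, and reports state-of-the-art zero-shot performance on several downstream tasks.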

Abstract (Chinese translation)

随着电子商务的快速发展，探索通用表征而非任务特定表征已引起越来越多的关注。尽管近期多模态大语言模型（MLLMs）在商品理解方面推动了显著进展，但它们通常被用作特征提取器，将商品信息隐式编码为全局嵌入，从而限制了其捕捉细粒度属性的能力。因此，我们认为利用MLLMs的推理能力来显式建模细粒度商品属性具有重要潜力。然而，由于几个关键挑战，实现这一目标仍非易事：（i）长上下文推理容易稀释模型对原始输入中显著信息的注意力；（ii）监督微调（SFT）主要鼓励僵化模仿，限制了有效推理策略的探索；（iii）细粒度细节在前向传播过程中逐渐衰减。为解决这些问题，我们提出了MOON3.0，这是首个基于MLLM的推理感知商品表征学习模型。我们的方法（1）采用多头模态融合模块自适应整合原始信号；（2）结合联合对比与强化学习框架，自主探索更有效的推理策略；（3）引入细粒度残差增强模块，在整个网络中逐步保留局部细节。此外，我们发布了一个大规模多模态电子商务基准MBE3.0。实验表明，我们的模型在自建基准和公共数据集的各种下游任务中均展现出领先的零样本性能。

Abstract

With the rapid growth of e-commerce, exploring general representations rather than task-specific ones has attracted increasing attention. Although recent multimodal large language models (MLLMs) have driven significant progress in product understanding, they are typically employed as feature extractors that implicitly encode product information into global embeddings, thereby limiting their ability to capture fine-grained attributes. Therefore, we argue that leveraging the reasoning capabilities of MLLMs to explicitly model fine-grained product attributes holds significant potential. Nevertheless, achieving this goal remains non-trivial due to several key challenges: (i) long-context reasoning tends to dilute the model’s attention to salient information in the raw input; (ii) supervised fine-tuning (SFT) primarily encourages rigid imitation, limiting the exploration of effective reasoning strategies; and (iii) fine-grained details are progressively attenuated during forward propagation. To address these issues, we propose MOON3.0, the first reasoning-aware MLLM-based model for product representation learning. Our method (1) employs a multi-head modality fusion module to adaptively integrate raw signals; (2) incorporates a joint contrastive and reinforcement learning framework to autonomously explore more effective reasoning strategies; and (3) introduces a fine-grained residual enhancement module to progressively preserve local details throughout the network. Additionally, we release a large-scale multimodal e-commerce benchmark MBE3.0. Experimentally, our model demonstrates state-of-the-art zero-shot performance across various downstream tasks on both our benchmark and public datasets.

Keywords: Multimodal Large Language Models, Reasoning-aware, Product Understanding, Supervised Fine-tuning, E-commerce, Zero-shot Performance, Fine-grained Attributes, Representation Learning

6. ✅ True (VIS) Lies: Analyzing How Generative AI Recognizes Intentionality, Rhetoric, and Misleadingness in Visualization Lies

Authors: Graziano Blasilli, Marco Angelini | Source: arXiv | Published: 2026-04-01 | arXiv link: http://arxiv.org/abs/2604.01181v1

Score: 41.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	5.0/10	5.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	8.0/10	8.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	8.0/10	8.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

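The study evaluates how well multimodal large language models identify and interpret misleading visualizations, using COVID-19-related tweets and real-world examples from VisLies. It compares 16 state-of-the-art models with visualization experts and characterizes where model behavior aligns with human judgment and where it diverges.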

Abstract (Chinese translation)

本研究探讨了多模态大语言模型（LLMs）识别与解读误导性可视化图表的能力，以及其对这些观察结果、背后成因及潜在意图的认知。我们运用可视化修辞学概念及一套新构建的作者意图分类法作为解释框架。通过提出三个研究问题，我们利用包含2,336条与COVID-19相关推文的数据集进行了实验分析，其中半数推文含有误导性可视化内容，并辅以来自IEEE VIS社区专注于展示欺骗性与误导性可视化图表的活动VisLies中提取的真实案例，这些案例涵盖了感知、认知和概念层面的错误。为确保全面覆盖当前LLMs的发展现状，我们评估了16个前沿模型，其中15个为开放权重模型，涵盖了广泛的模型规模、架构家族及推理能力。所选模型包括小型模型：Nemotron-Nano-V2-VL（120亿参数）、Mistral-Small-3.2（240亿）、DeepSeek-VL2（270亿）、Gemma3（270亿）和GTA1（320亿）；中型模型：Qianfan-VL（700亿）、Molmo（720亿）、GLM-4.5V（1080亿）、LLaVA-NeXT（1100亿）和Pixtral-Large（1240亿）；大型模型：Qwen3-VL（2350亿）、InternVL3.5（2410亿）、Step3（3210亿）、Llama-4-Maverick（4000亿）和Kimi-K2.5（10000亿）。此外，我们还采用了前沿的专有模型OpenAI GPT-5.4。为建立人类对这些任务的认知基准，我们同时开展了可视化专家用户研究，以评估人类如何理解相同误导性可视化图表中的修辞技巧及作者意图。这使得模型与专家行为之间的比较成为可能，揭示了二者的相似性与差异性，从而深入洞察大语言模型在哪些方面与人类判断一致，在哪些方面存在分歧。

Abstract

This study investigates the ability of multimodal Large Language Models (LLMs) to identify and interpret misleading visualizations, and recognize these observations along with their underlying causes and potential intentionality. Our analysis leverages concepts from visualization rhetoric and a newly developed taxonomy of authorial intents as explanatory lenses. We formulated three research questions and addressed them experimentally using a dataset of 2,336 COVID-19-related tweets, half of which contain misleading visualizations, and supplemented it with real-world examples of perceptual, cognitive, and conceptual errors drawn from VisLies, the IEEE VIS community event dedicated to showcasing deceptive and misleading visualizations. To ensure broad coverage of the current LLM landscape, we evaluated 16 state-of-the-art models. Among them, 15 are open-weight models, spanning a wide range of model sizes, architectural families, and reasoning capabilities. The selection comprises small models, namely Nemotron-Nano-V2-VL (12B parameters), Mistral-Small-3.2 (24B), DeepSeek-VL2 (27B), Gemma3 (27B), and GTA1 (32B); medium-sized models, namely Qianfan-VL (70B), Molmo (72B), GLM-4.5V (108B), LLaVA-NeXT (110B), and Pixtral-Large (124B); and large models, namely Qwen3-VL (235B), InternVL3.5 (241B), Step3 (321B), Llama-4-Maverick (400B), and Kimi-K2.5 (1000B). In addition, we employed OpenAI GPT-5.4, a frontier proprietary model. To establish a human perspective on these tasks, we also conducted a user study with visualization experts to assess how people perceive rhetorical techniques and the authorial intentions behind the same misleading visualizations. This allows comparison between model and expert behavior, revealing similarities and differences that provide insights into where LLMs align with human judgment and where they diverge.

Keywords: Large Language Models, misleading visualizations, multimodal LLMs, visualization rhetoric, authorial intents, COVID-19 tweets, model evaluation, human-expert comparison

7. ✅ Adapting Text LLMs to Speech via Multimodal Depth Up-Scaling

Authors: Kazuki Yano, Jun Suzuki, Shinji Watanabe | Source: arXiv | Published: 2026-04-01 | arXiv link: http://arxiv.org/abs/2604.00489v1

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	10.0/10	10.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	10.0/10	10.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	10.0/10	10.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

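The paper proposes Multimodal Depth Upscaling for adapting a pre-trained text LLM into a speech language model through continual pre-training: new transformer layers are inserted into the frozen text LLM and only those layers are trained on speech data. The approach reaches ASR performance comparable to full fine-tuning while degrading text capabilities far less.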

Abstract (Chinese translation)

通过在语音数据上进行持续预训练，将预训练的文本大语言模型（LLMs）适配为语音语言模型（Speech LMs）具有广阔前景，但这一过程常导致模型原有的文本能力下降。我们提出多模态深度扩展方法，该方法是对持续大语言模型预训练中一种新兴策略的扩展：在冻结的文本大语言模型中插入新的Transformer层，并仅使用语音数据训练这些新增层。基于SmolLM2-360M和SmolLM2-1.7B模型，在4.8万小时英语自动语音识别（ASR）数据上的实验表明，深度扩展方法能达到与全参数微调相当的ASR性能，同时其导致的文本能力退化远低于全参数微调和低秩自适应（LoRA）。我们进一步证明，将专为语音识别设计的E-Branchformer架构作为插入层，在较大模型上实现的ASR性能可匹配甚至超越全参数微调，同时将文本能力退化降低75%以上，且可训练参数减少了60%。

Abstract

Adapting pre-trained text Large Language Models (LLMs) into Speech Language Models (Speech LMs) via continual pretraining on speech data is promising, but often degrades the original text capabilities. We propose Multimodal Depth Upscaling, an extension of an emerging strategy in continual LLM pre-training, where new transformer layers are inserted into a frozen text LLM and only the added layers are trained on speech data. Experiments with SmolLM2-360M and SmolLM2-1.7B on 48k hours of English Automatic Speech Recognition (ASR) data show that depth up-scaling achieves ASR comparable to full fine-tuning while causing far less text degradation than both full fine-tuning and Low-Rank Adaptation (LoRA). We further show that incorporating E-Branchformer, an architecture designed for speech recognition, as the inserted layers achieves ASR that matches or surpasses full fine-tuning on the larger model while reducing text degradation by over 75% with 60% fewer trainable parameters.
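
The freeze-and-insert idea in the abstract can be sketched as plain bookkeeping over a layer list. This is only an illustration: the `Layer` record, the "one new layer after every `every` base layers" placement, and the parameter counts are assumptions, not the paper's configuration, and no actual training is performed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Layer:
    name: str
    num_params: int
    trainable: bool


def depth_upscale(base_layers: List[Layer], new_layer_params: int, every: int) -> List[Layer]:
    """Freeze all base layers and insert one trainable layer after every `every` of them."""
    upscaled = []
    for index, layer in enumerate(base_layers, start=1):
        # Base (text LLM) layers are kept but frozen.
        upscaled.append(Layer(layer.name, layer.num_params, trainable=False))
        if index % every == 0:
            # Only the inserted layers receive gradient updates on speech data.
            upscaled.append(Layer(f"inserted_{index // every}", new_layer_params, trainable=True))
    return upscaled


def trainable_fraction(layers: List[Layer]) -> float:
    """Share of parameters that would be updated during training."""
    total = sum(layer.num_params for layer in layers)
    trainable = sum(layer.num_params for layer in layers if layer.trainable)
    return trainable / total
```

Since the base weights never change, removing the inserted layers would recover the original text model, which is one intuition for why text degradation stays low.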

8. ✅ IWP: Token Pruning as Implicit Weight Pruning in Large Vision Language Models

Authors: Dong-Jae Lee, Sunghyun Baek, Junmo Kim | Source: arXiv | Published: 2026-04-01 | arXiv link: http://arxiv.org/abs/2604.00757v1

Score: 36.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	5.0/10	5.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	8.0/10	8.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	8.0/10	8.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

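Large vision language models incur high computational cost as the number of visual tokens grows. The paper proposes IWP, a training-free token pruning framework grounded in the dual-form view of attention. It scores tokens by information magnitude and information duplication, and reports a better trade-off between performance and efficiency.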

Abstract (Chinese translation)

大型视觉语言模型在图像与视频理解任务中展现出卓越性能，但其计算成本随视觉标记数量的增加而快速上升。现有的标记剪枝方法通过经验性策略缓解这一问题，却忽视了注意力机制的内在原理。本文基于注意力的对偶形式视角，提出一种无需训练的新型标记剪枝框架。我们将注意力重新表述为一个隐式线性层，其权重矩阵由多个秩为1的外积求和构成，每个外积由单个标记的键值对生成。因此，标记剪枝问题转化为选择这些秩1更新的最优子集，以最佳逼近原始对偶权重矩阵。将此视角扩展至大型视觉语言模型中的标准softmax注意力机制，我们推导出一种新的度量指标，可同时量化标记的信息强度与信息冗余度。为基于该度量高效选择子集，我们提出了渐进式分块最大边际相关性算法。大量实验表明，本方法在性能与效率之间实现了更优的平衡，同时为现有剪枝方法提供了新的理论视角。

Abstract

Large Vision Language Models show impressive performance across image and video understanding tasks, yet their computational cost grows rapidly with the number of visual tokens. Existing token pruning methods mitigate this issue through empirical approaches while overlooking the internal mechanism of attention. In this paper, we propose a novel training free token pruning framework grounded in the dual form perspective of attention. We reformulate attention as an implicit linear layer whose weight matrix is the sum of rank 1 outer products, each generated by a single token’s key value pair. Token pruning thus reduces to selecting an optimal subset of these rank 1 updates that best approximates the original dual weight matrix. Extending this perspective to standard softmax attention in LVLMs, we derive a novel metric quantifying both a token’s information magnitude and information duplication. To efficiently select the subset with the proposed metric, we introduce Progressive Chunked Maximal Marginal Relevance. Extensive experiments demonstrate that our method achieves a better trade off between performance and efficiency, while providing another perspective on existing pruning approaches.
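
The dual-form view can be checked numerically in the simplest case. The sketch below uses unnormalized linear attention, where summing per-token contributions equals applying the weight matrix W = Σ v_i k_iᵀ to the query, and then selects tokens with plain greedy Maximal Marginal Relevance. Both are stand-ins chosen for illustration: the paper's softmax-attention metric and its Progressive Chunked MMR are not specified in the abstract and are not reproduced here.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def token_view_output(keys, values, query):
    """Token view: sum_i (k_i . q) * v_i."""
    out = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        weight = dot(k, query)
        out = [o + weight * x for o, x in zip(out, v)]
    return out


def dual_weight(keys, values):
    """Weight view: W = sum_i v_i k_i^T, one rank-1 outer product per token."""
    d_v, d_k = len(values[0]), len(keys[0])
    W = [[0.0] * d_k for _ in range(d_v)]
    for k, v in zip(keys, values):
        for r in range(d_v):
            for c in range(d_k):
                W[r][c] += v[r] * k[c]
    return W


def mmr_select(relevance, similarity, budget, lam=0.5):
    """Greedy MMR: prefer relevant tokens that are dissimilar to those already kept."""
    selected = []
    candidates = list(range(len(relevance)))
    while candidates and len(selected) < budget:
        def score(i):
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

Dropping a token removes its rank-1 term from W, so pruning can be read as approximating W with a subset of those terms.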

9. ✅ Cost-Penalized Fitness in FMA-Orchestrated Mixture of Experts: Experimental Evidence for Molecular Memory in Domain Adaptation

Authors: Martin Jaraiz | Source: arXiv | Published: 2026-04-01 | arXiv link: http://arxiv.org/abs/2604.00812v1

Score: 35.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	15.0/10	15.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	10.0/10	10.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

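The paper studies how an MoE system should manage its expert pool under changing data distributions. Using a cost-penalized fitness metric and a linear grace period for newborn experts, it reports a "molecular memory" effect: recovery is 9-11x faster when the system returns to a previously learned domain, with no expert births or replacements.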

Abstract (Chinese translation)

我们展示了七组受控运行的实验结果，这些实验基于nanoFMT——一种采用动态混合专家（Mixture-of-Experts, MoE）管理的自由市场算法（Free-Market Algorithm, FMA）协调的Transformer模型。实验旨在回答高级大语言模型（LLM）开发中的一个核心问题：当MoE系统在数据分布动态变化下满负荷运行时，应如何管理其专家池？我们证明，采用成本惩罚的适应度指标，并结合为新专家设置的线性宽限期，能够构建一个通过多样化而非替换来积累领域专业知识的系统。核心成果体现在一项往返式领域切换实验中：当系统重返先前学习过的领域时，其恢复速度提升了9至11倍，且无需新增或替换任何专家。这种“分子记忆”效应——即休眠专家在其对应领域重现时得以保留并重新激活——在当前MoE管理方法中尚无先例。初步成本分析估算，在中等情景假设下，一个OpenAI级别的服务提供商采用此方法每年可节省约3910万美元成本，并减少27.1吉瓦时的能源消耗。

Abstract

We present experimental results from seven controlled runs of nanoFMT, a Free-Market Algorithm (FMA) orchestrated transformer with dynamic Mixture-of-Experts (MoE) management. The experiments address a fundamental question for advanced LLM development: how should an MoE system manage its expert pool when operating at full capacity under changing data distributions? We demonstrate that cost-penalized fitness metrics, combined with a linear grace period for newborn experts, produce a system that accumulates domain expertise through diversification rather than replacement. The central result is a round-trip domain shift experiment showing 9-11x faster recovery when returning to a previously learned domain, with zero expert births or replacements required. This “molecular memory” effect – where dormant experts survive and reactivate when their domain returns – has no analogue in current MoE management approaches. A preliminary cost analysis estimates annual savings of $39.1M and 27.1 GWh energy reduction for an OpenAI-scale provider under a moderate scenario.

Keywords: Mixture of Experts, MoE, domain adaptation, cost-penalized fitness, molecular memory, Free-Market Algorithm, transformer, expert management
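
One possible reading of "cost-penalized fitness with a linear grace period" is sketched below. The abstract gives no formula, so everything here is an assumption made for illustration: the subtractive penalty, the parameter names, and the choice to ramp the penalty linearly from zero for a newborn expert to full strength at `grace_steps`.

```python
def grace_factor(age, grace_steps):
    """Fraction of the cost penalty applied: 0 for a newborn, 1 once age >= grace_steps."""
    if grace_steps <= 0:
        return 1.0
    return min(1.0, max(0.0, age / grace_steps))


def cost_penalized_fitness(utility, cost, age, penalty=1.0, grace_steps=100):
    """Utility minus a cost penalty that is phased in linearly (hypothetical form)."""
    return utility - penalty * grace_factor(age, grace_steps) * cost
```

Under this form, an expert that is idle but cheap keeps a non-negative fitness, which is compatible with the abstract's description of dormant experts surviving until their domain returns.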

10. ✅ EvolveTool-Bench: Evaluating the Quality of LLM-Generated Tool Libraries as Software Artifacts

Authors: Alibek T. Kaliyev, Artem Maryanskyy | Source: arXiv | Published: 2026-04-01 | arXiv link: http://arxiv.org/abs/2604.00392v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

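LLM agents increasingly generate their own tool libraries at runtime, such as Python functions and API clients. The paper introduces the EvolveTool-Bench benchmark and finds that systems with similar task completion rates (63-68%) differ by up to 18% in library health, a software quality risk that task-only evaluation misses. It argues that the evolving tool library should be evaluated as a first-class software artifact.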

Abstract (Chinese translation)

现代大语言模型（LLM）智能体在运行时越来越多地自主创建工具——从Python函数到API客户端——然而现有的基准测试几乎完全通过下游任务完成度来评估它们。这好比仅凭代码能否运行来评判软件工程师，而忽略了冗余性、回归问题和安全性。我们推出了EvolveTool-Bench，这是一个针对软件工程工作流中LLM生成工具库的诊断性基准测试。在三个需要实际工具执行的领域（专有数据格式、API编排和数值计算）中，我们定义了库级别的软件质量指标——复用性、冗余度、组合成功率、回归稳定性和安全性——同时引入衡量每个工具正确性、鲁棒性、通用性和代码质量的工具质量评分（Tool Quality Score）。在首次对代码级与策略级工具演化方法（ARISE vs. EvoSkill vs. 单次生成基线，涵盖99项任务和两种模型）的直接比较中，我们发现任务完成率相近的系统（63-68%）在工具库健康度上存在高达18%的差异，这揭示了仅关注任务的评估所无法察觉的软件质量风险。我们的研究结果强调：对LLM生成工具的评估与治理，需要将不断演化的工具库视为一等软件制品，而非黑箱。

Abstract

Modern LLM agents increasingly create their own tools at runtime – from Python functions to API clients – yet existing benchmarks evaluate them almost exclusively by downstream task completion. This is analogous to judging a software engineer only by whether their code runs, ignoring redundancy, regression, and safety. We introduce EvolveTool-Bench, a diagnostic benchmark for LLM-generated tool libraries in software engineering workflows. Across three domains requiring actual tool execution (proprietary data formats, API orchestration, and numerical computation), we define library-level software quality metrics – reuse, redundancy, composition success, regression stability, and safety – alongside a per-tool Tool Quality Score measuring correctness, robustness, generality, and code quality. In the first head-to-head comparison of code-level and strategy-level tool evolution (ARISE vs. EvoSkill vs. one-shot baselines, 99 tasks, two models), we show that systems with similar task completion (63-68%) differ by up to 18% in library health, revealing software quality risks invisible to task-only evaluation. Our results highlight that evaluation and governance of LLM-generated tools require treating the evolving tool library as a first-class software artifact, not a black box.

Keywords: LLM agents, tool generation, software quality, benchmark evaluation, tool libraries, EvolveTool-Bench, software artifacts, tool evolution

11. ✅ Agent psychometrics: Task-level performance prediction in agentic coding benchmarks

Authors: Chris Ge, Daria Kryvosheieva, Daniel Fried, Uzay Girit, Kaivalya Hariharan | Source: arXiv | Published: 2026-04-01 | arXiv link: http://arxiv.org/abs/2604.00594v1

Score: 28.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	8.0/10	8.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0
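
The total for this paper can be checked against the report's scoring settings (pass mark = 5.0 + 0.8 × total weight, with 27 keywords of weight 1.0). The following is a minimal sketch, assuming each keyword's score is weight × relevance, which is consistent with the rows above; the function names are illustrative and are not taken from the report generator.

```python
def pass_mark(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass mark as stated in the report configuration: 5.0 + 0.8 * total weight."""
    return base + factor * total_weight


def total_score(rows: list[tuple[float, float]]) -> float:
    """Sum weight * relevance over (weight, relevance) rows.

    Assumption: a keyword's score is weight * relevance, which matches
    every row in the table (all weights are 1.0).
    """
    return sum(weight * relevance for weight, relevance in rows)


# The three non-zero rows for this paper; the other 24 keywords score 0.0.
rows = [(1.0, 10.0), (1.0, 10.0), (1.0, 8.0)] + [(1.0, 0.0)] * 24

score = total_score(rows)
threshold = pass_mark(total_weight=27.0)
passed = score >= threshold
print(f"score={score:.1f} threshold={threshold:.1f} passed={passed}")
```

With these inputs the score is the sum of the three non-zero rows and is compared with the pass mark computed from the 27 keyword weights.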

!!! tip deepseek-chat TL;DR

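This paper addresses the difficulty of evaluating LLM-based coding agents that interact with tools and environments. It proposes a framework that combines Item Response Theory with task feature extraction to accurately predict task-level performance and to decompose agent ability into LLM and scaffold components.

Abstract (Chinese translation)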

随着基于大语言模型的编程研究焦点从静态单步代码生成转向与工具及环境的多步智能体交互，理解何种任务将挑战智能体及其原因变得日益困难。当前实践加剧了这一挑战：智能体性能通常通过基准测试的总体通过率来衡量，但单一数值指标掩盖了基准测试内部任务的多样性。我们提出了一个针对智能体编程范式、用于预测个体任务成功或失败的框架。该方法通过从任务中提取丰富特征（包括问题描述、代码库上下文、解决方案和测试用例）来增强项目反应理论，并引入了一种将智能体能力分解为大语言模型能力与框架能力组件的新颖方法。这种参数化使我们能够整合异构排行榜的评估数据，并准确预测未见基准测试的任务级性能，以及未见的大语言模型-框架组合性能。我们的方法对基准设计者具有实际应用价值，他们无需运行计算成本高昂的智能体评估，即可更好地校准新任务的难度。

Abstract

As the focus in LLM-based coding shifts from static single-step code generation to multi-step agentic interaction with tools and environments, understanding which tasks will challenge agents and why becomes increasingly difficult. This is compounded by current practice: agent performance is typically measured by aggregate pass rates on benchmarks, but single-number metrics obscure the diversity of tasks within a benchmark. We present a framework for predicting success or failure on individual tasks tailored to the agentic coding regime. Our approach augments Item Response Theory (IRT) with rich features extracted from tasks, including issue statements, repository contexts, solutions, and test cases, and introduces a novel decomposition of agent ability into LLM and scaffold ability components. This parameterization enables us to aggregate evaluation data across heterogeneous leaderboards and accurately predict task-level performance for unseen benchmarks, as well as unseen LLM-scaffold combinations. Our methods have practical utility for benchmark designers, who can better calibrate the difficulty of their new tasks without running computationally expensive agent evaluations.
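
The ability decomposition described in the abstract can be illustrated with a one-parameter (Rasch-style) IRT model. This is a sketch under stated assumptions, not the authors' model: the abstract does not say how the LLM and scaffold components are combined, so the additive form below, the function name, and all parameter values are hypothetical.

```python
import math


def success_probability(llm_ability: float, scaffold_ability: float,
                        task_difficulty: float) -> float:
    """1PL IRT: P(success) = sigmoid(ability - difficulty).

    Assumption: agent ability is the sum of an LLM term and a scaffold term.
    """
    ability = llm_ability + scaffold_ability
    return 1.0 / (1.0 + math.exp(-(ability - task_difficulty)))


# Hypothetical parameter values, chosen only for illustration.
llm = {"model_a": 1.2, "model_b": 0.3}
scaffold = {"scaffold_x": 0.5, "scaffold_y": -0.2}
difficulty = {"easy_task": -1.0, "hard_task": 2.0}

# Each component has its own parameter, so an LLM-scaffold pair that was
# never evaluated together can still be assigned a predicted success rate.
p_easy = success_probability(llm["model_b"], scaffold["scaffold_x"], difficulty["easy_task"])
p_hard = success_probability(llm["model_b"], scaffold["scaffold_x"], difficulty["hard_task"])
print(f"easy: {p_easy:.3f}, hard: {p_hard:.3f}")
```

In the framework the abstract describes, task difficulty would additionally be informed by features extracted from issue statements, repository contexts, solutions, and test cases; here it is simply a given number.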

Keywords: LLM-based coding, agentic interaction, task-level performance prediction, Item Response Theory, agent ability decomposition, benchmark evaluation, tool use, multi-step coding

12. ❌ TRIMS: Trajectory-Ranked Instruction Masked Supervision for Diffusion Language Models

Authors: Lingjie Chen, Ruizhong Qiu, Yuyu Fan, Yanjun Zhao, Hanghang Tong Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00666v1

Score: 26.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	8.0/10	8.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

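Scoring rationale: The paper proposes the TRIMS framework, which focuses on training optimization for diffusion language models (DLMs) and counts as an innovation in the technical principles of large models. Its core contribution is a trajectory-guided supervised fine-tuning method, which is highly relevant to "Supervised Fine-tuning" (10 points). The study involves large models such as LLaDA and Dream, so it is relevant to "Large Language Models" (8 points). The method aims to improve decoding trajectories for low-latency generation, which relates to "Speculative Decoding" or "Inference Acceleration" (8 points). Other keywords such as MoE, SLMs, Scaling Laws, RAG, and RLHF are not mentioned in the abstract or are unrelated to the paper's topic, so they score 0.

!!! tip deepseek-chat TL;DR

This paper targets the low decoding efficiency of diffusion language models caused by the mismatch between training and inference. It proposes TRIMS, a trajectory-guided supervised fine-tuning framework that significantly improves the trade-off between accuracy and parallel decoding efficiency on math and coding tasks.

Abstract (Chinese translation)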

扩散语言模型（DLMs）通过并行解码为实现低延迟生成提供了可行路径，但其实际效率高度依赖于解码轨迹。在实践中，这一优势往往未能完全实现，因为标准训练未对标记揭示顺序提供显式监督，导致训练与推理之间的不匹配，从而产生次优的解码行为。我们提出轨迹排序指令掩码监督（TRIMS），这是一个简单的轨迹引导监督微调框架，能够以最小开销将轨迹监督注入标准的掩码扩散语言模型（MDLM）训练中。TRIMS不依赖基于DLM的高成本蒸馏，而是利用自回归教师模型生成的轻量级信号来引导轨迹感知的掩码策略，促使模型学习更有效的解码顺序。在数学和代码基准测试中对LLaDA和Dream进行的实验表明，与标准MDLM训练及无需训练的加速基线相比，TRIMS显著提升了准确性与并行性之间的权衡效果，同时以远低于训练成本的方式，实现了与先前基于蒸馏方法相竞争的性能。进一步分析表明，TRIMS能够产生更优的解码轨迹，验证了轨迹引导监督对DLMs的有效性。

Abstract

Diffusion language models (DLMs) offer a promising path toward low-latency generation through parallel decoding, but their practical efficiency depends heavily on the decoding trajectory. In practice, this advantage often fails to fully materialize because standard training does not provide explicit supervision over token reveal order, creating a train-inference mismatch that leads to suboptimal decoding behavior. We propose Trajectory-Ranked Instruction Masked Supervision (TRIMS), a simple trajectory-guided supervised fine-tuning framework that injects trajectory supervision into standard Masked Diffusion Language Model (MDLM) training with minimal overhead. Instead of relying on costly DLM-based distillation, TRIMS uses lightweight signals from an autoregressive teacher to guide a trajectory-aware masking strategy, encouraging the model to learn more effective decoding orders. Experiments on LLaDA and Dream across math and coding benchmarks show that TRIMS significantly improves the accuracy-parallelism trade-off over both standard MDLM training and train-free acceleration baselines, while achieving competitive performance with prior distillation-based approaches at substantially lower training cost. Further analysis shows that TRIMS leads to better decoding trajectories, validating the effectiveness of trajectory-guided supervision for DLMs.

Keywords: Diffusion Language Models, Trajectory-Guided Supervision, Supervised Fine-Tuning, Parallel Decoding, Masked Diffusion Language Model, Decoding Trajectory, Training-Inference Mismatch, Inference Acceleration

13. ❌ A Survey of On-Policy Distillation for Large Language Models

Authors: Mingyang Song, Mao Zheng Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00626v1

Score: 26.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	8.0/10	8.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	8.0/10	8.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

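Scoring rationale: The paper's core topic is On-Policy Distillation (OPD) for transferring knowledge from large language models (LLMs) to smaller, deployable models, so it is highly relevant to "Large Language Models" (10 points) and relevant to "Small Language Models" (8 points). It discusses distillation as a knowledge-transfer mechanism, which falls under post-training optimization and relates to "Post-training" (8 points). The paper does not cover the specific techniques or applications of the other keywords.

!!! tip deepseek-chat TL;DR

This paper systematically surveys On-Policy Distillation (OPD) methods. OPD addresses the exposure bias of conventional off-policy distillation, which causes errors to accumulate at inference time, by letting the student model generate its own trajectories and receive teacher feedback on them. The survey unifies OPD on the basis of interactive imitation learning theory and analyzes three dimensions: feedback signal, teacher access, and loss granularity.

Abstract (Chinese translation)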

知识蒸馏已成为将推理能力与领域专业知识从前沿大语言模型向更小、可部署的学生模型迁移的核心机制。然而，主流范式仍是离策略（off-policy）的：学生模型在静态的教师生成数据上进行训练，在学习过程中从未接触自身错误。这种训练与测试的错位是暴露偏差（exposure bias）的一种表现，会导致预测误差在推理时以自回归方式不断累积。同策略蒸馏（OPD）通过让学生模型生成自身的轨迹，并基于这些自生成输出获得教师反馈，将蒸馏过程建立在交互式模仿学习理论基础上，从而解决了这一问题。尽管该领域在散度最小化、奖励引导学习和自我博弈等方面发展迅速，但相关文献仍较为零散，缺乏统一的理论框架。本综述首次为大语言模型的同策略蒸馏提供了全面概述。我们引入了一个基于同策略样本的统一 f-散度框架，并从三个正交维度梳理该领域：反馈信号（基于逻辑值、基于结果或自我博弈）、教师访问权限（白盒、黑盒或无教师）以及损失粒度（词元级、序列级或混合型）。我们系统分析了代表性方法，考察了工业部署实践，并指出了包括蒸馏缩放定律、不确定性感知反馈和智能体级蒸馏在内的若干开放性问题。

Abstract

Knowledge distillation has become a primary mechanism for transferring reasoning and domain expertise from frontier Large Language Models (LLMs) to smaller, deployable students. However, the dominant paradigm remains off-policy: students train on static teacher-generated data and never encounter their own errors during learning. This train-test mismatch, an instance of exposure bias, causes prediction errors to compound autoregressively at inference time. On-Policy Distillation (OPD) addresses this by letting the student generate its own trajectories and receive teacher feedback on these self-generated outputs, grounding distillation in the theory of interactive imitation learning. Despite rapid growth spanning divergence minimization, reward-guided learning, and self-play, the OPD literature remains fragmented with no unified treatment. This survey provides the first comprehensive overview of OPD for LLMs. We introduce a unified f-divergence framework over on-policy samples and organize the landscape along three orthogonal dimensions: feedback signal (logit-based, outcome-based, or self-play), teacher access (white-box, black-box, or teacher-free), and loss granularity (token-level, sequence-level, or hybrid). We systematically analyze representative methods, examine industrial deployments, and identify open problems including distillation scaling laws, uncertainty-aware feedback, and agent-level distillation.
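
To make the on-policy idea concrete, here is a minimal sketch of a token-level, logit-based objective evaluated on trajectories sampled from the student. It is an illustration under assumptions rather than any specific method from the survey: reverse KL is chosen as one member of the f-divergence family, and the two toy "models" (fixed logits over a three-token vocabulary) and all function names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) for one next-token distribution."""
    return sum(s * math.log(s / t)
               for s, t in zip(student_probs, teacher_probs) if s > 0.0)


def on_policy_loss(student_logits_fn, teacher_logits_fn, prompt, steps, rng):
    """Average token-level reverse KL along a trajectory sampled from the student."""
    tokens = list(prompt)
    total = 0.0
    for _ in range(steps):
        s = softmax(student_logits_fn(tokens))
        t = softmax(teacher_logits_fn(tokens))
        total += reverse_kl(s, t)
        # On-policy step: the next token is drawn from the student itself,
        # so the teacher gives feedback on states the student actually reaches.
        tokens.append(rng.choices(range(len(s)), weights=s)[0])
    return total / steps


def teacher(tokens):
    return [0.0, 2.0, 0.0]  # toy teacher: fixed logits


def student(tokens):
    return [0.0, 1.0, 0.0]  # toy student: less confident than the teacher


loss = on_policy_loss(student, teacher, prompt=[0], steps=5, rng=random.Random(0))
print(f"mean reverse KL on student samples: {loss:.4f}")
```

An off-policy variant would instead compute the loss on a fixed, teacher-generated token sequence; the only structural difference in this sketch is where the next token comes from.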

Keywords: Knowledge Distillation, On-Policy Distillation, Large Language Models, Exposure Bias, Interactive Imitation Learning, Model Compression, Deployable Students, Teacher Feedback

14. ❌ Logarithmic Scores, Power-Law Discoveries: Disentangling Measurement from Coverage in Agent-Based Evaluation

Authors: HyunJoon Jung, William Na Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00477v1

Score: 25.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	5.0/10	5.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

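Scoring rationale: The paper's core topic is the use of LLM-based agent judges to evaluate conversational AI, which is directly and highly relevant to the "Large Language Models" and "LLM Agents" keywords. The study evaluates with panels of agents that have different personas, which reflects the idea of coordination in multi-agent systems, so it is somewhat related to "Multi-agent Systems" (5 points). The paper does not cover the specific techniques of the other keywords (such as MoE, SFT, or RAG) or their application areas (such as bioinformatics), so those keywords score 0.

!!! tip deepseek-chat TL;DR

This paper studies the reliability and efficiency of LLM-based agent judges for evaluating conversational AI. It finds that panels of persona-conditioned agent judges produce evaluations indistinguishable from those of human raters, and that quality scores grow logarithmically with panel size while issue discovery follows a power law, so scores saturate faster than evaluation coverage.

Abstract (Chinese translation)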

基于大语言模型的智能体评估者正成为评估对话式人工智能的新兴方法，但一个根本性不确定性依然存在：我们能否信任其评估结果？若可信任，又需要多少评估者？通过对两组模型在15项任务中开展的960次会话分析，我们发现在图灵式验证中，基于人格设定的智能体评估者产生的评估结果与人类评分者无法区分。我们进而发现评分与覆盖范围之间的分离现象：质量评分随评审团规模呈对数增长，而独特问题发现数量则遵循亚线性幂律——两者均呈现收益递减趋势，但评分饱和速度约为问题发现速度的两倍。我们假设这反映了发现空间的幂律分布：关键问题可由小型评审团率先发现，而边缘案例则需要逐步扩大评审规模，这与生态学中的物种累积曲线具有相似性。其机制可追溯至集成多样性：基于大五人格的设定使智能体能够探查不同的质量维度，其中专家型评估者扮演对抗性探查角色，将发现推向问题分布的尾部。受控消融实验证实，需要结构化的人格设定而非简单提示才能产生这种缩放特性。

Abstract

LLM-based agent judges are an emerging approach to evaluating conversational AI, yet a fundamental uncertainty remains: can we trust their assessments, and if so, how many are needed? Through 960 sessions with two model pairs across 15 tasks, we show that persona-based agent judges produce evaluations indistinguishable from human raters in a Turing-style validation. We then identify a score-coverage dissociation: quality scores improve logarithmically with panel size, while unique issue discoveries follow a sublinear power law; both exhibit diminishing returns, but scores saturate roughly twice as fast as discoveries. We hypothesize this reflects a power-law distribution of the finding space: critical issues are discovered first by small panels, while corner cases require progressively larger panels, analogous to species accumulation curves in ecology. The mechanism traces to ensemble diversity: Big Five personality conditioning makes agents probe different quality dimensions, with expert judges acting as adversarial probes that push discovery into the tail of the finding distribution. A controlled ablation confirms that structured persona conditioning, not simple prompting, is required to produce these scaling properties.
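
The two functional forms named in the abstract behave differently when a panel is doubled, which a short numerical sketch can show. The coefficients below are hypothetical and are not fitted to the paper's data; the sketch only contrasts a logarithmic curve with a sublinear power law.

```python
import math


def quality_score(panel_size: int, a: float = 3.0, b: float = 0.4) -> float:
    """Logarithmic form: score = a + b * ln(n). Coefficients are hypothetical."""
    return a + b * math.log(panel_size)


def unique_issues(panel_size: int, c: float = 5.0, k: float = 0.5) -> float:
    """Sublinear power law: issues = c * n**k with 0 < k < 1. Coefficients are hypothetical."""
    return c * panel_size ** k


def relative_gain(curve, n: int) -> float:
    """Relative improvement from doubling the panel from n to 2n judges."""
    return (curve(2 * n) - curve(n)) / curve(n)


# (panel size, relative score gain, relative issue gain) for each doubling.
gains = [(n, relative_gain(quality_score, n), relative_gain(unique_issues, n))
         for n in (1, 2, 4, 8)]

for n, score_gain, issue_gain in gains:
    print(f"n={n}: score +{score_gain:.1%}, issues +{issue_gain:.1%}")
```

Under a power law each doubling adds the same fraction of new issues (2^k - 1), whereas the relative gain of the logarithmic score shrinks with every doubling. This is one way to read the claim that scores saturate sooner than discoveries, though the factor of two reported in the abstract depends on the fitted coefficients, which are not reproduced here.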

Keywords: LLM-based agent judges, evaluation, persona conditioning, scaling properties, power law distribution, ensemble diversity, Turing-style validation, score-coverage dissociation

15. ❌ Hierarchical Pre-Training of Vision Encoders with Large Language Models

Authors: Eugene Lee, Ting-Yu Chang, Jui-Huang Tsai, Jiajie Diao, Chen-Yi Lee Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2604.00086v1

Score: 25.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	10.0/10	10.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

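Scoring rationale: The paper's core topic is the hierarchical pre-training of vision encoders integrated with large language models (LLMs), so the "Large Language Models" and "Pre-training" keywords are highly relevant (10 points each). The paper mentions "vision-language alignment", which is somewhat related to "Alignment" (5 points). Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract or are unrelated to the paper's topic and score 0.

!!! tip deepseek-chat TL;DR

This paper proposes HIVE, a hierarchical pre-training framework that introduces hierarchical cross-attention between the vision encoder and the large language model. It improves vision-language alignment and outperforms self-attention-based methods on benchmarks such as MME, GQA, OK-VQA, and ScienceQA.

Abstract (Chinese translation)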

计算机视觉领域通过可扩展的视觉编码器与多模态预训练框架取得了显著进展。然而，现有方法通常将视觉编码器与大型语言模型（LLMs）视为独立模块，限制了层次化视觉特征的深度融合。本研究提出HIVE（Hierarchical Pre-Training of Vision Encoders）——一种通过引入视觉编码器与LLM之间的层次化交叉注意力机制来增强视觉-语言对齐的新框架。与传统方法将图像嵌入扁平化处理不同，HIVE实现了跨多网络层的结构化特征融合，从而改善了梯度流与表征学习能力。为优化这种交互机制，我们设计了三阶段训练策略，逐步对齐视觉编码器与LLM，确保稳定的优化过程与有效的多模态融合。实证评估表明，HIVE不仅在图像分类任务中表现优异，在MME、GQA、OK-VQA和ScienceQA等基准测试的多种视觉-语言任务上也超越了基于自注意力的方法。我们的研究结果凸显了层次化特征整合的优势，为构建更高效、表达能力更强的视觉-语言模型开辟了新路径。

Abstract

The field of computer vision has experienced significant advancements through scalable vision encoders and multimodal pre-training frameworks. However, existing approaches often treat vision encoders and large language models (LLMs) as independent modules, limiting the integration of hierarchical visual features. In this work, we propose HIVE (Hierarchical Pre-Training of Vision Encoders), a novel framework that enhances vision-language alignment by introducing hierarchical cross-attention between the vision encoder and LLM. Unlike conventional methods that flatten image embeddings, HIVE enables structured feature fusion across multiple layers, improving gradient flow and representation learning. To optimize this interaction, we introduce a three-stage training strategy that progressively aligns the vision encoder with the LLM, ensuring stable optimization and effective multimodal fusion. Empirical evaluations demonstrate that HIVE achieves superior performance not only in image classification but also on various vision-language tasks, outperforming self-attention-based methods in benchmarks such as MME, GQA, OK-VQA, and ScienceQA. Our results highlight the benefits of hierarchical feature integration, paving the way for more efficient and expressive vision-language models.

Keywords: Hierarchical Pre-Training, Vision Encoders, Large Language Models, Vision-Language Alignment, Cross-Attention, Multimodal Fusion, HIVE Framework

16. ❌ Fast and Accurate Probing of In-Training LLMs’ Downstream Performances

Authors: Zhichen Liu, Tianle Lun, Zhibin Wen, Hao An, Yulin Ou, Jianhui Xu, Hao Zhang, Wenyi Fang, Yang Zheng, Yang Xu Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01025v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

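Scoring rationale: The paper's core topic is evaluating downstream performance during LLM training. It proposes lightweight probes that analyze internal representations to predict performance, so it is highly relevant to "Large Language Models" (10 points). It involves monitoring the training process, which is somewhat related to "Pre-training" (5 points). Probing internal representations can be seen as a form of model interpretability, which is somewhat related to "Mechanistic Interpretability" (5 points). Other keywords such as MoE, SFT, RAG, and inference acceleration are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

This paper addresses the high compute cost and latency of conventional generative evaluation during LLM training. It proposes a new evaluation paradigm in which lightweight probes analyze the model's internal representations to predict downstream task performance quickly, cutting evaluation latency from about 1 hour to about 3 minutes while keeping predictive accuracy reasonably high (average AUROC > 0.75).

Abstract (Chinese translation)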

扩大大型语言模型（LLMs）在参数量与测试时间上的规模范式，不断推动着人工智能能力的边界，但代价是使传统的生成式评估范式变得极其昂贵，从而导致LLM训练期间下游性能评估的延迟变得难以承受。然而，简单的指标如训练损失（困惑度）并不总是与下游性能相关，因为它们的趋势有时会与实际任务结果相背离。这一困境呼唤一种计算高效且能足够准确衡量模型能力的方法。为应对这一挑战，我们引入了一种新的训练中评估范式，它使用轻量级探针来监测下游性能。这些探针以LLM训练过程中检查点的内部表示为输入，直接预测该检查点在下游任务上的性能（以成功率，即pass@1衡量）。我们设计了多种探针架构，并通过在OLMo3-7B模型的一系列多样化下游任务检查点上进行验证，证明了其有效性。这些探针能够准确预测检查点的性能（平均 AUROC > 0.75），在不同检查点间具有良好的泛化能力（早期预测后期），并将计算延迟从约1小时（使用传统生成式评估方法）降低至约3分钟。总之，这项工作提出了一种实用且可扩展的训练中下游评估范式，能够实现更敏捷、更知情、更高效的大型语言模型开发流程。

Abstract

The paradigm of scaling Large Language Models (LLMs) in both parameter size and test time has pushed the boundaries of AI capabilities, but at the cost of making the traditional generative evaluation paradigm prohibitively expensive, therefore making the latency of LLM’s in-training downstream performance evaluation unbearable. However, simple metrics like training loss (perplexity) are not always correlated with downstream performance, as sometimes their trends diverge from the actual task outcomes. This dilemma calls for a method that is computationally efficient and sufficiently accurate in measuring model capabilities. To address this challenge, we introduce a new in-training evaluation paradigm that uses a lightweight probe for monitoring downstream performance. The probes take the internal representations of LLM checkpoints (during training) as input and directly predict the checkpoint’s performance on downstream tasks measured by success probability (i.e., pass@1). We design several probe architectures, validating their effectiveness using the OLMo3-7B’s checkpoints across a diverse set of downstream tasks. The probes can accurately predict a checkpoint’s performance (with avg. AUROC > 0.75), have decent generalizability across checkpoints (earlier predicts later), and reduce the computation latency from ~1 hr (using conventional generative evaluation method) to ~3 min. In sum, this work presents a practical and scalable in-training downstream evaluation paradigm, enabling a more agile, informed, and efficient LLM development process.
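
The probing idea can be sketched with a logistic-regression probe scored by AUROC. This is a minimal illustration under assumptions: the abstract does not specify the probe architectures, so a linear probe is swapped in, and the "hidden states" below are synthetic vectors rather than OLMo3-7B activations. All names are hypothetical.

```python
import math
import random


def train_probe(features, labels, lr=0.1, epochs=100):
    """Fit a logistic-regression probe with stochastic gradient descent."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            z = max(-30.0, min(30.0, z))  # keep exp() in a safe range
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - y
            weights = [w - lr * grad * xi for w, xi in zip(weights, x)]
            bias -= lr * grad
    return weights, bias


def probe_score(weights, bias, x):
    """Linear score; higher means the probe expects success."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def auroc(scores, labels):
    """Probability that a random positive example outranks a random negative one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic stand-in for hidden states: success depends on the first
# coordinate plus noise. Real inputs would be checkpoint activations.
rng = random.Random(0)
features = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(200)]
labels = [1 if x[0] + 0.3 * rng.gauss(0.0, 1.0) > 0 else 0 for x in features]

weights, bias = train_probe(features[:150], labels[:150])
held_out_scores = [probe_score(weights, bias, x) for x in features[150:]]
held_out_auroc = auroc(held_out_scores, labels[150:])
print(f"held-out AUROC: {held_out_auroc:.3f}")
```

The latency argument in the abstract follows from this structure: scoring a checkpoint needs only forward passes to collect representations plus a cheap probe evaluation, rather than full generation on every benchmark item.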

Keywords: Large Language Models, in-training evaluation, downstream performance, lightweight probe, internal representations, computational efficiency, OLMo3-7B, pass@1

17. ❌ Routing-Free Mixture-of-Experts

Authors: Yilun Liu, Jinru Han, Sikuan Yan, Volker Tresp, Yunpu Ma Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00801v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	15.0/10	15.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

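Scoring rationale: The paper's core topic is an architectural innovation for Mixture-of-Experts (MoE) models, the Routing-Free MoE method, so it is highly relevant to the "Mixture of Experts" keyword (15 points). MoE is typically a component of large models (LLMs), so the paper falls within large-model technology, but it has no direct connection to specific LLM applications or training methods and receives 5 points. Other keywords such as SLMs, training methods, inference optimization, and AI applications are not covered in the abstract and score 0.

!!! tip deepseek-chat TL;DR

This paper addresses the rigid inductive biases introduced by the centralized routing mechanisms that standard MoE models rely on. It proposes Routing-Free MoE, which removes hard-coded centralized designs and encapsulates all activation functionality within individual experts, achieving better scalability and robustness.

Abstract (Chinese translation)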

标准混合专家模型依赖于引入刚性归纳偏置的集中式路由机制。我们提出无路由混合专家模型，该模型消除了包括外部路由器、Softmax函数、Top-K选择和负载均衡在内的所有硬编码集中式设计，转而将所有激活功能封装在单个专家内部，并通过连续梯度流直接优化，使每个专家能够完全自主决定其激活状态。我们引入了一个统一的自适应负载均衡框架，通过可配置的插值方法同时优化专家均衡与令牌均衡目标，实现灵活可定制的资源分配。大量实验表明，无路由混合专家模型能够持续超越基线模型，并展现出更好的可扩展性与鲁棒性。我们对其行为进行了详细分析，提供了可能促进未来混合专家模型设计与优化的见解。

Abstract

Standard Mixture-of-Experts (MoE) models rely on centralized routing mechanisms that introduce rigid inductive biases. We propose Routing-Free MoE which eliminates any hard-coded centralized designs including external routers, Softmax, Top-K and load balancing, instead encapsulating all activation functionalities within individual experts and directly optimized through continuous gradient flow, enabling each expert to determine its activation entirely on its own. We introduce a unified adaptive load-balancing framework to simultaneously optimize both expert-balancing and token-balancing objectives through a configurable interpolation, allowing flexible and customizable resource allocation. Extensive experiments show that Routing-Free MoE can consistently outperform baselines with better scalability and robustness. We analyze its behavior in detail and offer insights that may facilitate future MoE design and optimization.
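
The phrase "each expert determines its activation on its own" can be illustrated with a toy layer in which every expert owns a gate and no external router, Softmax, or Top-K step exists. This is only a sketch under assumptions: the abstract does not describe the actual gating function, so a ReLU gate is swapped in, and the class and function names are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


class SelfGatedExpert:
    """Toy expert holding its own gate parameters and a linear map."""

    def __init__(self, gate_weights, gate_bias, out_weights):
        self.gate_weights = gate_weights
        self.gate_bias = gate_bias
        self.out_weights = out_weights  # one row per output dimension

    def gate(self, x):
        # ReLU gate (an assumption): the expert switches itself off
        # whenever its own pre-activation is not positive.
        return max(0.0, dot(self.gate_weights, x) + self.gate_bias)

    def forward(self, x):
        g = self.gate(x)
        if g == 0.0:
            return None  # inactive: no computation is contributed
        return [g * dot(row, x) for row in self.out_weights]


def routing_free_layer(experts, x):
    """Sum the outputs of whichever experts activated themselves."""
    outputs = [expert.forward(x) for expert in experts]
    active = [out for out in outputs if out is not None]
    # Assumes output size equals input size when no expert fires.
    combined = [sum(vals) for vals in zip(*active)] if active else [0.0] * len(x)
    return combined, len(active)


identity = [[1.0, 0.0], [0.0, 1.0]]
experts = [
    SelfGatedExpert([1.0, 0.0], 0.0, identity),   # fires when x[0] > 0
    SelfGatedExpert([-1.0, 0.0], 0.0, identity),  # fires when x[0] < 0
]
output, num_active = routing_free_layer(experts, [2.0, 1.0])
print(output, num_active)
```

Because the number of active experts is not fixed by a Top-K rule in such a design, some separate mechanism is needed to keep load under control, which is the role the abstract assigns to its adaptive load-balancing framework.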

Keywords: Mixture-of-Experts, Routing-Free, Load Balancing, Sparse Models, Expert Activation, Gradient Optimization, Scalability, Robustness

Authors: Henry Peng Zou, Chunyu Miao, Wei-Chieh Huang, Yankai Chen, Yue Zhou, Hanrong Zhang, Yaozu Wu, Liancheng Fang, Zhengyao Gu, Zhen Zhang, Kening Zheng, Fangxin Wang, Yi Nian, Shanghao Li, Wenzhe Fan, Langzhou He, Weizhi Zhang, Xue Liu, Philip S. Yu Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00892v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

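Scoring rationale: The paper's core topic is the ability of LLM agents to handle user interruptions in long-horizon, dynamic environments (web navigation). It is highly relevant to the "Large Language Models" and "LLM Agents" keywords (10 points each), because it explicitly uses LLMs as agent backbones and evaluates their agentic capabilities. Other keywords such as MoE, SFT, RAG, and CoT are not mentioned or covered in the abstract and score 0.

!!! tip deepseek-chat TL;DR

This paper studies how well LLM agents handle user interruptions (such as added requirements or revised goals) in long-horizon web navigation tasks, and finds that even strong LLMs struggle to adapt to these dynamic changes effectively and efficiently.

Abstract (Chinese translation)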

随着大语言模型智能体从解决简短、静态问题转向在动态环境中执行复杂、长周期的任务，在任务执行过程中处理用户中断（例如增加需求或修改目标）的能力正成为其实际部署的核心要求。然而，现有基准测试大多假设智能体行为不受中断，或仅在简短、无约束的语言任务中研究中断现象。本文首次对长周期、环境具身的网络导航任务中可中断智能体进行了系统性研究，此类任务中的操作会引发持久的状态改变。我们形式化了三种现实的中断类型，包括增加、修改和撤销，并引入了InterruptBench——一个源自WebArena-Lite的基准测试，其在严格的语义约束下合成了高质量的中断场景。通过统一的中断模拟框架，我们评估了六种强大LLM基座模型在单轮和多轮中断设置下的表现，既分析了它们适应更新意图的有效性，也评估了它们从任务中途变更中恢复的效率。我们的研究结果表明，对于当前性能强大的大规模LLM而言，在长周期智能体任务中有效且高效地处理用户中断仍然具有挑战性。代码与数据集发布于https://github.com/HenryPengZou/InterruptBench。

摘要 (Abstract)

As LLM agents transition from short, static problem solving to executing complex, long-horizon tasks in dynamic environments, the ability to handle user interruptions, such as adding requirement or revising goals, during mid-task execution is becoming a core requirement for realistic deployment. However, existing benchmarks largely assume uninterrupted agent behavior or study interruptions only in short, unconstrained language tasks. In this paper, we present the first systematic study of interruptible agents in long-horizon, environmentally grounded web navigation tasks, where actions induce persistent state changes. We formalize three realistic interruption types, including addition, revision, and retraction, and introduce InterruptBench, a benchmark derived from WebArena-Lite that synthesizes high-quality interruption scenarios under strict semantic constraints. Using a unified interruption simulation framework, we evaluate six strong LLM backbones across single- and multi-turn interruption settings, analyzing both their effectiveness in adapting to updated intents and their efficiency in recovering from mid-task changes. Our results show that handling user interruptions effectively and efficiently during long-horizon agentic tasks remains challenging for powerful large-scale LLMs. Code and dataset are available at https://github.com/HenryPengZou/InterruptBench.
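The three interruption types named in the abstract (addition, revision, retraction) can be pictured as edits to a user's current intent. The sketch below is only an illustration under stated assumptions: it models an intent as a flat dictionary of requirements, and the names `InterruptionType`, `Interruption`, and `apply_interruption` are hypothetical and are not taken from InterruptBench.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class InterruptionType(Enum):
    ADDITION = "addition"      # user adds a new requirement mid-task
    REVISION = "revision"      # user changes an existing requirement
    RETRACTION = "retraction"  # user withdraws an existing requirement


@dataclass(frozen=True)
class Interruption:
    kind: InterruptionType
    key: str                       # requirement affected (hypothetical schema)
    value: Optional[str] = None    # new value; unused for retraction


def apply_interruption(intent: Dict[str, str], event: Interruption) -> Dict[str, str]:
    """Return the updated intent; the input dictionary is left unchanged."""
    updated = dict(intent)
    if event.kind is InterruptionType.RETRACTION:
        if event.key not in updated:
            raise KeyError(f"cannot retract unknown requirement: {event.key}")
        del updated[event.key]
        return updated
    if event.value is None:
        raise ValueError("addition and revision need a value")
    exists = event.key in updated
    if event.kind is InterruptionType.ADDITION and exists:
        raise ValueError(f"requirement already present: {event.key}")
    if event.kind is InterruptionType.REVISION and not exists:
        raise KeyError(f"cannot revise unknown requirement: {event.key}")
    updated[event.key] = event.value
    return updated


# A multi-turn example: three interruptions applied in sequence.
intent = {"item": "laptop", "max_price": "1000"}
events = [
    Interruption(InterruptionType.ADDITION, "color", "silver"),
    Interruption(InterruptionType.REVISION, "max_price", "800"),
    Interruption(InterruptionType.RETRACTION, "color"),
]
for event in events:
    intent = apply_interruption(intent, event)
print(intent)
```

The sketch tracks only the updated intent. It does not model what the abstract identifies as the hard part: actions the agent has already taken may have changed the environment persistently and may need to be undone.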

Keywords: LLM agents, long-horizon tasks, user interruptions, web navigation, InterruptBench, dynamic environments, agentic workflow

19. ❌ PHASOR: Anatomy- and Phase-Consistent Volumetric Diffusion for CT Virtual Contrast Enhancement

Authors: Zilong Li, Dongyang Li, Chenglong Ma, Zhan Feng, Dakai Jin, Junping Zhang, Hao Luo, Fan Wang, Hongming Shan Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01053v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	10.0/10	10.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

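Scoring rationale: PHASOR focuses on virtual contrast enhancement for medical imaging (CT), which falls under AI for Science (biomedical AI) applications and is highly relevant to the keyword “AI for Science” (10 points). One of the paper's core innovations is the anatomy-routed mixture-of-experts (AR-MoE) module, a direct application of MoE techniques, which makes it highly relevant to the keyword “Mixture of Experts” (10 points). The paper does not involve large language models (LLMs), model training or alignment techniques (such as RLHF or PEFT), inference optimization (such as quantization or speculative decoding), agent systems, or other general AI techniques, so those keywords score 0.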
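The arithmetic behind this paper's total and its failing mark can be reproduced in a few lines. This is a minimal sketch, assuming (as the table rows suggest) that each row's score is weight × relevance and that a paper passes when its total reaches the threshold given in the report header (5.0 + 0.8 × total weight). The function names and the `>=` comparison are illustrative assumptions, not the report generator's actual code.

```python
from typing import List, Tuple

# (weight, relevance) for the 27 keyword rows of the PHASOR table:
# two rows at relevance 10.0 and twenty-five rows at 0.0, all with weight 1.0.
ROWS: List[Tuple[float, float]] = [(1.0, 10.0)] * 2 + [(1.0, 0.0)] * 25


def pass_threshold(total_weight: float) -> float:
    """Passing score from the report header: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def total_score(rows: List[Tuple[float, float]]) -> float:
    """Assumed rule: each row contributes weight * relevance."""
    return sum(weight * relevance for weight, relevance in rows)


total_weight = sum(weight for weight, _ in ROWS)
score = total_score(ROWS)
threshold = pass_threshold(total_weight)
passed = score >= threshold  # assumption: reaching the threshold counts as a pass
print(f"{score:.1f} / {threshold:.1f} -> {'pass' if passed else 'fail'}")
# prints "20.0 / 26.6 -> fail"
```

Under the same assumed rule, a row with weight 0.0 contributes 0.0 to the total regardless of its relevance.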

!!! tip deepseek-chat TL;DR

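To address the anatomical heterogeneity and spatial misalignment that arise when synthesizing virtual contrast-enhanced CT from non-contrast CT, the paper proposes the PHASOR framework, whose anatomy-routed mixture-of-experts and intensity-phase aware representation alignment modules significantly improve synthesis quality and enhancement accuracy.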

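Abstract (translated)

Contrast-enhanced computed tomography (CECT) is essential for highlighting tissue perfusion and vascularity, but its widespread clinical use is limited by the invasiveness of contrast agents and by radiation risk. Virtual contrast enhancement (VCE) offers an alternative by synthesizing CECT from non-contrast CT (NCCT), yet existing methods often produce inconsistent enhancement patterns and incorrect details because of anatomical heterogeneity and spatial misalignment. This paper proposes PHASOR, a volumetric diffusion framework for high-fidelity CT virtual contrast enhancement. By treating CT volumes as coherent sequences, we use a video diffusion model to improve structural coherence and volumetric accuracy. To ensure anatomy- and phase-consistent synthesis, we introduce two complementary modules. First, the anatomy-routed mixture-of-experts module (AR-MoE) anchors distinct enhancement patterns to anatomical semantics, with an organ-specific memory mechanism that captures salient details. Second, the intensity-phase aware representation alignment module (IP-REPA) emphasizes fine contrast signals while reducing the impact of imperfect spatial alignment. Extensive experiments on three datasets show that PHASOR significantly outperforms state-of-the-art methods in both synthesis quality and enhancement accuracy.

Abstract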

Contrast-enhanced computed tomography (CECT) is pivotal for highlighting tissue perfusion and vascularity, yet its clinical ubiquity is impeded by the invasive nature of contrast agents and radiation risks. While virtual contrast enhancement (VCE) offers an alternative to synthesizing CECT from non-contrast CT (NCCT), existing methods struggle with anatomical heterogeneity and spatial misalignment, leading to inconsistent enhancement patterns and incorrect details. This paper introduces PHASOR, a volumetric diffusion framework for high-fidelity CT VCE. By treating CT volumes as coherent sequences, we leverage a video diffusion model to enhance structural coherence and volumetric accuracy. To ensure anatomy-phase consistent synthesis, we introduce two complementary modules. First, anatomy-routed mixture-of-experts (AR-MoE) anchors distinct enhancement patterns to anatomical semantics, with organ-specific memory to capture salient details. Second, intensity-phase aware representation alignment (IP-REPA) highlights intricate contrast signals while mitigating the impact of imperfect spatial alignment. Extensive experiments across three datasets demonstrate that PHASOR significantly outperforms state-of-the-art methods in both synthesis quality and enhancement accuracy.

Keywords: Virtual Contrast Enhancement, CT Synthesis, Volumetric Diffusion, Mixture of Experts, Anatomy-Phase Consistency, Medical Imaging, Diffusion Model, Radiology

20. ❌ Multimodal Analysis of State-Funded News Coverage of the Israel-Hamas War on YouTube Shorts

Authors: Daniel Miehling, Sandra Kuebler Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00994v1

Score: 15.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	2.0/10	2.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	8.0/10	8.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

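Scoring rationale: The paper is mainly a multimodal analysis of Israel-Hamas war coverage on YouTube Shorts; its core is a pipeline that combines automatic transcription, aspect-based sentiment analysis (ABSA), and semantic scene classification. It is unrelated to most keywords, which concern the technical principles, training methods, inference optimization, and alignment techniques of large models, whereas this paper mainly applies off-the-shelf NLP and computer vision techniques to content analysis. The only relevant keywords are: 1) “Small Language Models” OR “SLMs” OR “On-device AI”: 8 points, because the abstract explicitly states that “smaller domain-adapted models outperform large transformers and even LLMs for sentiment analysis”, directly discussing the advantage of small models over large models and LLMs, which is one of the paper's key findings. 2) “Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”: 5 points, because the paper mentions “domain-adapted models”, implying domain adaptation techniques, although this is not its core contribution. 3) “Large Language Models” OR “LLMs” OR “Foundation Models”: 2 points, because LLMs appear as a comparison baseline (“outperform large transformers and even LLMs”) but are not themselves the research focus. Other keywords such as MoE, Scaling Laws, SFT, RLHF, RAG, attention optimization, reasoning techniques, agents, and quantization are not involved. The paper is an application of AI to social science and media research rather than an innovation in the technical principles of large models.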

!!! tip deepseek-chat TL;DR

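The paper develops a multimodal analysis pipeline to study state-funded media coverage of the Israel-Hamas war on YouTube Shorts. It finds that small domain-adapted models outperform large transformers and LLMs on sentiment analysis, and it reveals differences across outlets in expressed sentiment and visual cues.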

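Abstract (translated)

YouTube Shorts have become a central format for news consumption on the platform, yet research on how geopolitical events are presented in such short videos remains limited. To fill this gap, this study proposes a multimodal analysis pipeline that combines automatic transcription, aspect-based sentiment analysis (ABSA), and semantic scene classification. The pipeline is first assessed for feasibility and then applied to short-form coverage of the Israel-Hamas war by state-funded outlets. By analyzing more than 2,300 conflict-related Shorts and more than 94,000 visual frames, we systematically examine the war reporting patterns of major international broadcasters. We find that the textual sentiment of different outlets toward specific aspects differs and evolves over time, while scene classification results show visual cues consistent with real-world events. Notably, on the sentiment analysis task, small domain-adapted models outperform large transformer models and even large language models, which highlights the value of resource-efficient methods for humanities research. The pipeline can serve as a research template for other short-form video platforms such as TikTok and Instagram, and it shows that multimodal methods combined with qualitative interpretation can effectively characterize sentiment patterns and visual cues in algorithmically driven video environments.

Abstract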

YouTube Shorts have become central to news consumption on the platform, yet research on how geopolitical events are represented in this format remains limited. To address this gap, we present a multimodal pipeline that combines automatic transcription, aspect-based sentiment analysis (ABSA), and semantic scene classification. The pipeline is first assessed for feasibility and then applied to analyze short-form coverage of the Israel-Hamas war by state-funded outlets. Using over 2,300 conflict-related Shorts and more than 94,000 visual frames, we systematically examine war reporting across major international broadcasters. Our findings reveal that the sentiment expressed in transcripts regarding specific aspects differs across outlets and over time, whereas scene-type classifications reflect visual cues consistent with real-world events. Notably, smaller domain-adapted models outperform large transformers and even LLMs for sentiment analysis, underscoring the value of resource-efficient approaches for humanities research. The pipeline serves as a template for other short-form platforms, such as TikTok and Instagram, and demonstrates how multimodal methods, combined with qualitative interpretation, can characterize sentiment patterns and visual cues in algorithmically driven video environments.

Keywords: multimodal analysis, YouTube Shorts, sentiment analysis, domain-adapted models, Israel-Hamas war, state-funded news, visual scene classification, humanities research

21. ❌ GRASP: Gradient Realignment via Active Shared Perception for Multi-Agent Collaborative Optimization

Authors: Sihan Zhou, Tiantian He, Yifan Lu, Yaqing Hou, Yew-Soon Ong Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00717v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

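Scoring rationale: The paper “GRASP: Gradient Realignment via Active Shared Perception for Multi-Agent Collaborative Optimization” focuses on multi-agent reinforcement learning (MARL) and proposes a framework for addressing non-stationarity. Its core contribution is optimizing multi-agent collaboration through active shared perception and gradient realignment, which is highly relevant to the keyword “Multi-agent Systems” OR “Agent Coordination” (10 points), because the paper directly studies coordination and optimization in multi-agent systems. However, the paper does not involve large language models (LLMs), innovations in the technical principles of deep learning, AI applications in science, or the other large-model keywords specified (such as MoE, Scaling Laws, RLHF, and RAG), so these keywords all score 0. Although the stated research background emphasizes applications of large models and deep learning in science, the content of this paper does not reflect those directions.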

!!! tip deepseek-chat TL;DR

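The paper proposes a new framework called GRASP, which addresses non-stationarity in multi-agent reinforcement learning through active shared perception and gradient realignment, and validates its effectiveness and scalability in the StarCraft II and Google Research Football environments.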

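Abstract (translated)

Non-stationarity arises from concurrent policy updates and causes persistent environmental fluctuations. Existing approaches such as Centralized Training with Decentralized Execution (CTDE) and sequential update schemes mitigate this problem. However, because an agent's perception of other agents' policies still depends on sampled environment interaction data, the agent is essentially in a passive perception state. This inevitably triggers equilibrium oscillations and significantly slows the system's convergence. To address this, we propose Gradient Realignment via Active Shared Perception (GRASP), a framework that defines a generalized Bellman equilibrium as a stable objective for policy evolution. The core mechanism of GRASP is to use the agents' independent gradients to derive a defined consensus gradient, so that agents can actively perceive policy updates and optimize team collaboration. Theoretically, we use the Kakutani fixed-point theorem to prove that the consensus direction $u^*$ guarantees the existence and attainability of this equilibrium. Extensive experiments on the StarCraft II Multi-Agent Challenge (SMAC) and Google Research Football (GRF) validate the framework's scalability and strong performance.

Abstract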

Non-stationarity arises from concurrent policy updates and leads to persistent environmental fluctuations. Existing approaches like Centralized Training with Decentralized Execution (CTDE) and sequential update schemes mitigate this issue. However, since the perception of the policies of other agents remains dependent on sampling environmental interaction data, the agent essentially operates in a passive perception state. This inevitably triggers equilibrium oscillations and significantly slows the convergence speed of the system. To address this issue, we propose Gradient Realignment via Active Shared Perception (GRASP), a novel framework that defines generalized Bellman equilibrium as a stable objective for policy evolution. The core mechanism of GRASP involves utilizing the independent gradients of agents to derive a defined consensus gradient, enabling agents to actively perceive policy updates and optimize team collaboration. Theoretically, we leverage the Kakutani Fixed-Point Theorem to prove that the consensus direction $u^*$ guarantees the existence and attainability of this equilibrium. Extensive experiments on StarCraft II Multi-Agent Challenge (SMAC) and Google Research Football (GRF) demonstrate the scalability and promising performance of the framework.

Keywords: Multi-Agent Systems, Collaborative Optimization, Gradient Realignment, Active Shared Perception, Non-stationarity, Policy Updates, Bellman Equilibrium, Agent Coordination

22. ❌ The Recipe Matters More Than the Kitchen: Mathematical Foundations of the AI Weather Prediction Pipeline

Authors: Piyush Garg, Diana R. Gergel, Andrew E. Shao, Galen J. Yacalis Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01215v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

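Scoring rationale: The paper focuses on AI weather prediction, which belongs to AI for Science, and is highly relevant to the keyword “AI for Science OR Bioinformatics OR Cheminformatics” (10 points). The other keywords all concern large models and the technical principles of deep learning (such as LLMs, MoE, training methods, and inference optimization), whereas this paper studies the mathematical framework and training pipeline of AI weather prediction and does not involve large-model techniques as such, so all other keywords score 0.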

!!! tip deepseek-chat TL;DR

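The paper proposes a unified mathematical framework for analyzing how the training pipeline (architecture, loss function, training strategy, and data distribution) affects forecast skill in AI weather prediction. It finds that estimation error (dependent on loss function and data) dominates approximation error (dependent on architecture) at current scales, and experimentally confirms spectral energy loss at high wavenumbers in MSE-trained models and a linear negative bias during extreme events.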

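Abstract (translated)

AI weather forecasting has advanced rapidly, but no unified mathematical framework yet explains what determines forecast skill. Existing theory mainly addresses specific architectural choices rather than the whole learning pipeline, while operational practice from 2023-2026 shows that training methodology, loss function design, and data diversity matter at least as much as architecture choice. This paper makes two interleaved contributions. Theoretically, we build a framework rooted in approximation theory on the sphere, dynamical systems theory, information theory, and statistical learning theory that treats the complete learning pipeline (architecture, loss function, training strategy, data distribution) rather than architecture alone. We establish a learning pipeline error decomposition showing that, at current scales, estimation error (dependent on loss function and data) dominates approximation error (dependent on architecture). We develop a loss function spectral theory that formalizes the spectral blurring induced by mean squared error (MSE) in spherical harmonic coordinates, and we derive out-of-distribution extrapolation bounds proving that data-driven models systematically underestimate record-breaking extreme weather events, with a bias that grows linearly with the record exceedance. Empirically, we validate these predictions by running inference with ten architecturally diverse AI weather models on the NVIDIA Earth2Studio platform with ERA5 initial conditions, covering 30 initialization dates across all seasons and evaluating six metrics. The results confirm that MSE-trained models universally lose spectral energy at high wavenumbers, that rising error consensus ratios indicate most forecast error is shared across architectures, and that there is a linear negative bias during extreme events. The proposed holistic model assessment score provides unified multi-dimensional evaluation, and the prescriptive framework allows proposed pipelines to be evaluated mathematically before training.

Abstract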

AI weather prediction has advanced rapidly, yet no unified mathematical framework explains what determines forecast skill. Existing theory addresses specific architectural choices rather than the learning pipeline as a whole, while operational evidence from 2023-2026 demonstrates that training methodology, loss function design, and data diversity matter at least as much as architecture selection. This paper makes two interleaved contributions. Theoretically, we construct a framework rooted in approximation theory on the sphere, dynamical systems theory, information theory, and statistical learning theory that treats the complete learning pipeline (architecture, loss function, training strategy, data distribution) rather than architecture alone. We establish a Learning Pipeline Error Decomposition showing that estimation error (loss- and data-dependent) dominates approximation error (architecture-dependent) at current scales. We develop a Loss Function Spectral Theory formalizing MSE-induced spectral blurring in spherical harmonic coordinates, and derive Out-of-Distribution Extrapolation Bounds proving that data-driven models systematically underestimate record-breaking extremes with bias growing linearly in record exceedance. Empirically, we validate these predictions via inference across ten architecturally diverse AI weather models using NVIDIA Earth2Studio with ERA5 initial conditions, evaluating six metrics across 30 initialization dates spanning all seasons. Results confirm universal spectral energy loss at high wavenumbers for MSE-trained models, rising Error Consensus Ratios showing that the majority of forecast error is shared across architectures, and linear negative bias during extreme events. A Holistic Model Assessment Score provides unified multi-dimensional evaluation, and a prescriptive framework enables mathematical evaluation of proposed pipelines before training.
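One common intuition for the MSE-induced spectral blurring mentioned in the abstract is that the MSE-optimal forecast is the conditional mean of the plausible outcomes, and averaging outcomes that differ by small position shifts damps short wavelengths more than long ones. The sketch below illustrates this on a 1-D toy wave rather than in spherical harmonics. The uniform shift ensemble and the function name are assumptions made for illustration and are not the paper's construction.

```python
import math


def ensemble_mean_amplitude(wavenumber: float, max_shift: float, members: int = 201) -> float:
    """Amplitude of the mean of unit-amplitude waves cos(k * (x - s)).

    The shifts s are evenly spaced in [-max_shift, max_shift]. Because the
    shifts are symmetric about zero, the sine terms cancel and the mean wave
    is cos(k * x) scaled by the average of cos(k * s).
    """
    shifts = [-max_shift + 2.0 * max_shift * i / (members - 1) for i in range(members)]
    return sum(math.cos(wavenumber * s) for s in shifts) / members


# Every ensemble member has amplitude 1.0; only the ensemble mean is damped.
amplitudes = {k: ensemble_mean_amplitude(k, max_shift=0.1) for k in (1, 10, 20)}
for k, amp in amplitudes.items():
    print(f"wavenumber {k:2d}: mean amplitude {amp:.3f}")
```

With the same positional uncertainty, the mean retains nearly all of the low-wavenumber signal but progressively less at higher wavenumbers, which is the qualitative pattern of high-wavenumber energy loss that the abstract reports for MSE-trained models.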

Keywords: AI weather prediction, mathematical framework, learning pipeline, estimation error, approximation error, spectral theory, extreme events, model assessment

23. ❌ LAtent Phase Inference from Short time sequences using SHallow REcurrent Decoders (LAPIS-SHRED)

Authors: Yuxuan Bao, Xingyue Zhang, J. Nathan Kutz Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01216v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

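Scoring rationale: LAPIS-SHRED focuses on reconstructing and forecasting the dynamics of complex spatio-temporal systems, which belongs to AI for Science, particularly applications in physics and environmental science, so it has some relevance to “AI for Science OR Bioinformatics OR Cheminformatics” (score 8.0). The paper notes that the SHRED model is pre-trained on simulation data, which is indirectly related to “Pre-training OR Continual Pre-training OR Domain Adaptation” (score 5.0). The other keywords mainly concern technical details of large language models (such as MoE, RLHF, and RAG) or specific reasoning methods (such as CoT and MCTS), whereas this paper studies spatio-temporal dynamics reconstruction with shallow recurrent decoders and does not involve LLMs, innovations in the technical principles of deep learning, or large-model applications, so the remaining keywords score 0.0.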

!!! tip deepseek-chat TL;DR

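The paper proposes the LAPIS-SHRED framework to address the challenge of reconstructing full spatio-temporal dynamics from sparse spatio-temporal observations. Using a pre-trained shallow recurrent decoder and a temporal sequence model, it achieves effective reconstruction and forecasting in complex physical systems such as turbulence and combustion.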

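Abstract (translated)

Reconstructing full spatio-temporal dynamics from observations that are sparse in both space and time remains a central challenge in complex systems, because measurements may be spatially incomplete and may be limited to narrow time windows. Yet approximating the complete spatio-temporal trajectory is essential for mechanistic insight and understanding, model calibration, and operational decision-making. This paper proposes LAPIS-SHRED (LAtent Phase Inference from Short time sequences using SHallow REcurrent Decoders), a modular architecture that can reconstruct and/or forecast complete spatio-temporal dynamics from sparse sensor observations confined to short time windows. LAPIS-SHRED runs as a three-stage pipeline: (i) a SHRED model is pre-trained entirely on simulation data to map sensor time series into a structured latent space; (ii) a temporal sequence model, trained on simulation-derived latent trajectories, learns to propagate latent states forward or backward in time so as to span unobserved time regions from a short observation window; and (iii) at deployment, only a short observation window of hyper-sparse sensor measurements from the real system is provided, from which the frozen SHRED model and the temporal model jointly reconstruct or forecast the complete spatio-temporal trajectory. The framework supports bidirectional inference, inherits data assimilation and multiscale reconstruction capabilities from its modular structure, and accommodates extreme observational constraints, including single-frame terminal inputs. We evaluate LAPIS-SHRED on six experiments covering complex spatio-temporal physics (turbulent flows, multiscale propulsion physics, volatile combustion transients, and satellite-derived environmental fields), highlighting a lightweight, modular architecture suited to operational settings where observation is constrained by physical or practical limitations.

Abstract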

Reconstructing full spatio-temporal dynamics from sparse observations in both space and time remains a central challenge in complex systems, as measurements can be spatially incomplete and can be also limited to narrow temporal windows. Yet approximating the complete spatio-temporal trajectory is essential for mechanistic insight and understanding, model calibration, and operational decision-making. We introduce LAPIS-SHRED (LAtent Phase Inference from Short time sequence using SHallow REcurrent Decoders), a modular architecture that reconstructs and/or forecasts complete spatiotemporal dynamics from sparse sensor observations confined to short temporal windows. LAPIS-SHRED operates through a three-stage pipeline: (i) a SHRED model is pre-trained entirely on simulation data to map sensor time-histories into a structured latent space, (ii) a temporal sequence model, trained on simulation-derived latent trajectories, learns to propagate latent states forward or backward in time to span unobserved temporal regions from short observational time windows, and (iii) at deployment, only a short observation window of hyper-sparse sensor measurements from the true system is provided, from which the frozen SHRED model and the temporal model jointly reconstruct or forecast the complete spatiotemporal trajectory. The framework supports bidirectional inference, inherits data assimilation and multiscale reconstruction capabilities from its modular structure, and accommodates extreme observational constraints including single-frame terminal inputs. We evaluate LAPIS-SHRED on six experiments spanning complex spatio-temporal physics: turbulent flows, multiscale propulsion physics, volatile combustion transients, and satellite-derived environmental fields, highlighting a lightweight, modular architecture suited for operational settings where observation is constrained by physical or logistical limitations.

Keywords: spatio-temporal dynamics, sparse observations, reconstruction, forecasting, latent space, simulation data, complex systems, modular architecture

24. ❌ $\texttt{YC-Bench}$: Benchmarking AI Agents for Long-Term Planning and Consistent Execution

Authors: Muyu He, Adit Jain, Anand Kumar, Vincent Tu, Soumyadeep Bakshi, Sachin Patro, Nazneen Rajani Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01212v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	15.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

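Scoring rationale: The paper's core topic is the performance of LLM agents on long-term planning tasks. It is highly relevant to “LLM Agents” (15 points), directly uses LLMs as the objects of evaluation (10 points), and involves multi-step reasoning and in-depth thinking (10 points each). Other keywords such as MoE, quantization, and RAG are not mentioned in the abstract and score 0.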

!!! tip deepseek-chat TL;DR

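The paper proposes the YC-Bench benchmark for evaluating the long-term planning ability of LLM agents as they run a simulated startup for one year. It finds that only a few models stay consistently profitable and that scratchpad usage is the key predictor of success.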

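Abstract (translated)

As LLM agents take on increasingly complex tasks, a key question is whether they can maintain strategic coherence over long horizons: planning under uncertainty, learning from delayed feedback, and adjusting when early mistakes compound. We introduce the $\texttt{YC-Bench}$ benchmark, which evaluates these capabilities by requiring an agent to run a simulated startup for one year, spanning hundreds of turns. The agent must manage employees, select task contracts, and maintain profitability in a partially observable environment where adversarial clients and growing payroll cause poor decisions to compound. We evaluate 12 proprietary and open-source models, each run with 3 random seeds. Only three models consistently exceed the starting capital of 200K US dollars; Claude Opus 4.6 achieves the highest average final funds at 1.27 million US dollars, followed closely by GLM-5 at 1.21 million US dollars with 11 times lower inference cost. Scratchpad usage, the only mechanism for persisting information across context truncation, is the strongest predictor of success, while adversarial client detection is the primary failure mode, accounting for 47 percent of bankruptcies. Our analysis shows that frontier models still fail through distinct failure modes such as over-parallelization, which reveals capability gaps in long-horizon performance. $\texttt{YC-Bench}$ is open-source, reproducible, and configurable.

Abstract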

As LLM agents tackle increasingly complex tasks, a critical question is whether they can maintain strategic coherence over long horizons: planning under uncertainty, learning from delayed feedback, and adapting when early mistakes compound. We introduce $\texttt{YC-Bench}$, a benchmark that evaluates these capabilities by tasking an agent with running a simulated startup over a one-year horizon spanning hundreds of turns. The agent must manage employees, select task contracts, and maintain profitability in a partially observable environment where adversarial clients and growing payroll create compounding consequences for poor decisions. We evaluate 12 models, both proprietary and open source, across 3 seeds each. Only three models consistently surpass the starting capital of \$200K, with Claude Opus 4.6 achieving the highest average final funds at \$1.27 M, followed by GLM-5 at \$1.21 M at 11$\times$ lower inference cost. Scratchpad usage, the sole mechanism for persisting information across context truncation, is the strongest predictor of success, and adversarial client detection is the primary failure mode, accounting for $47\%$ of bankruptcies. Our analysis reveals that frontier models still fail through distinct failure modes such as over-parallelization, demonstrating the capability gaps for long-horizon performance. $\texttt{YC-Bench}$ is open-source, reproducible, and configurable.

Keywords: LLM agents, long-term planning, benchmarking, strategic coherence, scratchpad usage, adversarial clients, startup simulation, context truncation

25. ❌ CliffSearch: Structured Agentic Co-Evolution over Theory and Code for Scientific Algorithm Discovery

Authors: Youssef Mroueh, Carlos Fonseca, Brian Belgodere, David Cox Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01210v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

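Scoring rationale: The paper proposes the CliffSearch framework, which uses LLM agents to implement the evolutionary operators (selection, crossover, mutation, and review), constitutes a multi-agent system, and targets scientific algorithm discovery (AI for Science). It is therefore highly relevant (10 points) to "LLM Agents/Autonomous Agents/Agentic Workflow", "Multi-agent Systems/Agent Coordination", and "AI for Science/Bioinformatics/Cheminformatics". LLMs are central to the method, so it is also highly relevant (10 points) to "Large Language Models/LLMs/Foundation Models". The other keywords, such as MoE, SLMs, training techniques, inference optimization, and alignment, are not mentioned in the abstract and are scored as unrelated (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes CliffSearch, an evolutionary system built on LLM agents for scientific algorithm discovery. Using structured theory+code nodes and a reviewer-gating mechanism, it prioritizes scientific interpretability and correctness while optimizing task metrics.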

Abstract

Scientific algorithm discovery is iterative: hypotheses are proposed, implemented, stress-tested, and revised. Current LLM-guided search systems accelerate proposal generation, but often under-represent scientific structure by optimizing code-only artifacts with weak correctness/originality gating. We present CliffSearch, an agentic evolutionary framework in which the core evolution operators (pair selection, crossover, mutation, and review) are implemented as LLM agents, and the loop is designed around three principles: (1) each node is a structured scientific artifact, instantiated in either theory+code or code_only mode, (2) reviewer judgments of correctness and originality are first-class selection gates alongside optimization of the benchmark metric of interest, and (3) mutation is split into exploration and correction pathways with distinct objectives. Exploration mutation imports ideas from adjacent scientific domains to increase novelty, while correction mutation performs targeted evidence-guided repair using reviewer signals over theory, code, benchmark results, and runtime errors. We illustrate the framework on three benchmark-grounded studies: transformer hyper-connection evolution, optimizer discovery on a fixed nanoGPT stack, and a smaller native-optimizer ablation. Across these settings, the same loop supports explicit metric direction, reproducible persistence, and reviewer-gated comparison of discoveries under controlled search conditions. The result is a discovery workflow that prioritizes scientific interpretability and correctness while optimizing task metrics under controlled novelty constraints, rather than maximizing candidate throughput alone. Full run artifacts, interactive visualizations, and exported best nodes for the reported studies are available at https://cliffsearch.ai .

Keywords: LLM agents, agentic evolutionary framework, scientific algorithm discovery, structured scientific artifact, theory+code, reviewer judgments, exploration mutation, correction mutation
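
The loop described in the abstract above (pair selection, crossover, exploration or correction mutation, reviewer gate) can be sketched as a toy program. This is only an illustration under heavy assumptions: the LLM agents are replaced by plain Python functions, the "code artifact" is a single number `x`, and the benchmark, the validity range, and every name here (`Node`, `review`, `explore`, `correct`, `evolve`) are hypothetical and not taken from CliffSearch.

```python
import random
from dataclasses import dataclass

@dataclass
class Node:
    theory: str    # free-text rationale (an LLM would write this)
    x: float       # stand-in for the node's code artifact
    metric: float  # benchmark metric, lower is better in this toy

def benchmark(x):
    return (x - 3.0) ** 2

def review(node):
    """Stub reviewer gate: accept only artifacts inside a valid range."""
    return abs(node.x) <= 10.0

def crossover(a, b):
    x = 0.5 * (a.x + b.x)
    return Node("blend", x, benchmark(x))

def explore(node, rng):
    """Exploration mutation: a random jump to increase novelty."""
    x = node.x + rng.uniform(-4.0, 4.0)
    return Node("exploration", x, benchmark(x))

def correct(node):
    """Correction mutation: targeted repair driven by the reviewer signal."""
    x = max(-10.0, min(10.0, node.x))
    return Node("correction", x, benchmark(x))

def evolve(generations=30, pop_size=6, seed=0):
    rng = random.Random(seed)
    population = [Node("seed", x, benchmark(x)) for x in (-8.0, 9.0)]
    for _ in range(generations):
        a, b = rng.sample(population, 2)       # pair selection
        child = explore(crossover(a, b), rng)  # crossover, then exploration
        if not review(child):                  # reviewer gate failed
            child = correct(child)             # try an evidence-guided repair
        if review(child):                      # only reviewed nodes survive
            population.append(child)
        population = sorted(population, key=lambda n: n.metric)[:pop_size]
    return population[0]

best = evolve()
print(best)
```

The point of the sketch is the ordering: the reviewer gate sits next to the metric as a selection criterion, so a candidate with a good score still cannot enter the population unless it passes review.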

26. ❌ Neural Harmonic Textures for High-Quality Primitive Based Neural Reconstruction

Authors: Jorge Condor, Nicolas Moenne-Loccoz, Merlin Nimier-David, Piotr Didyk, Zan Gojcic, Qi Wu Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01204v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

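Scoring rationale: The paper focuses on neural representation methods in computer vision and graphics, specifically the Neural Harmonic Textures technique for 3D reconstruction and novel-view synthesis. It covers 3D Gaussian Splatting, neural fields, Fourier analysis, and periodic activation functions, but does not touch on large language models, innovations in the technical principles of deep learning, or AI for Science. None of the keywords, which concern large models, deep learning principles, or scientific applications, applies to the paper, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes Neural Harmonic Textures, a neural representation that anchors latent feature vectors on a virtual scaffold around each primitive and applies periodic activations. It addresses the difficulty of modeling high-frequency detail in primitive-based 3D reconstruction and achieves state-of-the-art results in real-time novel-view synthesis.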

Abstract

Primitive-based methods such as 3D Gaussian Splatting have recently become the state-of-the-art for novel-view synthesis and related reconstruction tasks. Compared to neural fields, these representations are more flexible, adaptive, and scale better to large scenes. However, the limited expressivity of individual primitives makes modeling high-frequency detail challenging. We introduce Neural Harmonic Textures, a neural representation approach that anchors latent feature vectors on a virtual scaffold surrounding each primitive. These features are interpolated within the primitive at ray intersection points. Inspired by Fourier analysis, we apply periodic activations to the interpolated features, turning alpha blending into a weighted sum of harmonic components. The resulting signal is then decoded in a single deferred pass using a small neural network, significantly reducing computational cost. Neural Harmonic Textures yield state-of-the-art results in real-time novel view synthesis while bridging the gap between primitive- and neural-field-based reconstruction. Our method integrates seamlessly into existing primitive-based pipelines such as 3DGUT, Triangle Splatting, and 2DGS. We further demonstrate its generality with applications to 2D image fitting and semantic reconstruction.

Keywords: Neural Harmonic Textures, 3D Gaussian Splatting, primitive-based reconstruction, novel-view synthesis, neural representation, Fourier analysis, periodic activations, real-time rendering
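
The abstract's statement that periodic activations turn alpha blending into a weighted sum of harmonic components, decoded in a single deferred pass, can be illustrated with a small NumPy sketch for one ray. The choice of `sin` as the periodic activation, the feature dimensions, and the two-layer decoder are assumptions made for illustration; the paper's actual interpolation scheme and network are not reproduced here.

```python
import numpy as np

def alpha_blend_weights(alphas):
    """Front-to-back compositing weights: w_i = alpha_i * prod_{j<i} (1 - alpha_j)."""
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    return alphas * transmittance

def blend_harmonics(features, alphas):
    """Alpha-blend periodic activations of per-primitive features along one ray."""
    weights = alpha_blend_weights(alphas)  # shape (P,)
    return weights @ np.sin(features)      # shape (D,): weighted sum of harmonics

def decode(signal, w1, w2):
    """Tiny MLP applied once per ray (the deferred pass), not once per primitive."""
    return np.maximum(signal @ w1, 0.0) @ w2

rng = np.random.default_rng(0)
features = rng.normal(size=(3, 8))   # 3 primitives hit by the ray, 8 latent dims
alphas = np.array([0.5, 0.5, 1.0])   # opacity of each primitive, front to back
weights = alpha_blend_weights(alphas)
rgb = decode(blend_harmonics(features, alphas),
             rng.normal(size=(8, 16)), rng.normal(size=(16, 3)))
print(weights, rgb.shape)
```

Because the decoder runs on the blended signal rather than on each primitive, its cost does not grow with the number of primitives along the ray, which is the computational saving the abstract refers to.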

27. ❌ Therefore I am. I Think

Authors: Esakkivel Esakkiraja, Sai Rajeswar, Denis Akhiyarov, Rajagopal Venkatesaramani Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01202v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

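Scoring rationale: The paper studies the decision mechanism of large language models (LLMs) during reasoning. Its core concerns are: 1) the chain-of-thought reasoning process (highly relevant, 10 points); 2) tool-calling decisions (Tool Use, highly relevant, 10 points); 3) System 2 thinking/in-depth reasoning (8 points); 4) mechanistic interpretability (8 points); 5) self-reflection/self-improvement (Self-Correction, 5 points). The paper does not cover the specific techniques behind the other keywords, such as MoE, quantization, RAG, or alignment.

!!! tip deepseek-chat TL;DR

The paper finds that large reasoning language models encode tool-calling decisions before they generate any reasoning text. Linear-probe and activation-steering experiments indicate that the decision precedes the reasoning, and that the reasoning tends to rationalize a decision that steering has changed.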

Abstract

We consider the question: when a large language reasoning model makes a choice, did it think first and then decide to, or decide first and then think? In this paper, we present evidence that detectable, early-encoded decisions shape chain-of-thought in reasoning models. Specifically, we show that a simple linear probe successfully decodes tool-calling decisions from pre-generation activations with very high confidence, and in some cases, even before a single reasoning token is produced. Activation steering supports this causally: perturbing the decision direction leads to inflated deliberation, and flips behavior in many examples (between 7 - 79% depending on model and benchmark). We also show through behavioral analysis that, when steering changes the decision, the chain-of-thought process often rationalizes the flip rather than resisting it. Together, these results suggest that reasoning models can encode action choices before they begin to deliberate in text.

Keywords: large language models, chain-of-thought, reasoning models, tool-calling decisions, activation steering, linear probe, pre-generation activations, decision encoding
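
For readers unfamiliar with linear probing, the following is a minimal sketch of the generic technique on synthetic data: a logistic-regression probe is trained to decode a binary decision from activation vectors. The activations, the hidden decision direction, and the helper names (`sample_activations`, `train_linear_probe`, `probe_accuracy`) are hypothetical stand-ins; no real model activations or results from the paper are used.

```python
import numpy as np

def sample_activations(n, direction, rng):
    """Synthetic 'pre-generation activations': Gaussian noise plus a shift
    along `direction` whose sign encodes the binary tool-call decision."""
    labels = rng.integers(0, 2, size=n)
    signs = 2.0 * labels - 1.0
    noise = rng.normal(size=(n, direction.size))
    acts = noise + 2.0 * signs[:, None] * direction[None, :]
    return acts, labels

def train_linear_probe(acts, labels, lr=0.5, steps=300):
    """Logistic regression fitted with plain gradient descent."""
    w = np.zeros(acts.shape[1])
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        err = probs - labels
        w -= lr * (acts.T @ err) / len(labels)
        b -= lr * err.mean()
    return w, b

def probe_accuracy(w, b, acts, labels):
    preds = (acts @ w + b > 0).astype(int)
    return float((preds == labels).mean())

rng = np.random.default_rng(0)
direction = rng.normal(size=32)
direction /= np.linalg.norm(direction)  # unit-norm decision direction
train_x, train_y = sample_activations(400, direction, rng)
test_x, test_y = sample_activations(200, direction, rng)
w, b = train_linear_probe(train_x, train_y)
accuracy = probe_accuracy(w, b, test_x, test_y)
print(f"held-out probe accuracy: {accuracy:.3f}")
```

In this toy setting the learned weight vector `w` also approximates the decision direction, which is the kind of vector one would perturb in an activation-steering experiment.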

28. ❌ ORBIT: Scalable and Verifiable Data Generation for Search Agents on a Tight Budget

Authors: Nandan Thakur, Zijian Chen, Xueguang Ma, Jimmy Lin Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01195v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

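Scoring rationale: The paper's core subject is search agents, which integrate language models with web search and fall under LLM Agents (highly relevant, 10 points). It trains Qwen3-4B (a 4B-parameter model), which counts as a Small Language Model (8 points). The method involves multi-step retrieval and reasoning (Chain of Thought, 10 points; System 2 Thinking, 8 points) and follows a Retrieval-Augmented Generation (RAG) setup (10 points). Training uses GRPO (a reinforcement learning optimization method), which relates to Post-training/SFT (8 points). The verification stages involve self and external verification, which relate to Self-Correction (5 points) and Factuality (5 points). The data generation framework targets low-cost synthetic data, which gives some connection to Scaling Laws and Data Quality (5 points). Tool use appears in the search agent's API/tool integration (5 points). Other keywords such as MoE, PEFT, and quantization are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes the ORBIT framework, which uses low-cost synthetic data generation to address the difficulty of building training data for search agents on complex multi-step retrieval and reasoning tasks. It shows that a 4B-parameter model trained on this data performs strongly on Wikipedia question answering tasks.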

Abstract

Search agents, which integrate language models (LMs) with web search, are becoming crucial for answering complex user queries. Constructing training datasets for deep research tasks, involving multi-step retrieval and reasoning, remains challenging due to expensive human annotation, or cumbersome prerequisites. In this work, we introduce ORBIT, a training dataset with 20K reasoning-intensive queries with short verifiable answers, generated using a frugal framework without relying on paid API services. The modular framework relies on four stages: seed creation, question–answer pair generation, and two stages of verification: self and external. ORBIT spans 15 domains and each training pair requires 4–5 reasoning steps, with external search verification required from the complete web. We train Qwen3-4B as the base model on ORBIT using GRPO and evaluate it on Wikipedia question answering tasks. Extensive experiment results demonstrate that ORBIT-4B achieves strong performance among sub-4B LLMs as search agents, proving the utility of synthetic datasets. Our framework, code and datasets are open-sourced and available publicly.

Keywords: search agents, language models, retrieval-augmented generation, multi-step reasoning, synthetic dataset, Qwen3-4B, GRPO, verification

29. ❌ A ROS 2 Wrapper for Florence-2: Multi-Mode Local Vision-Language Inference for Robotic Systems

Authors: J. E. Domínguez-Vidal Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01179v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

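Scoring rationale: The paper mainly studies the practical deployment and integration of Florence-2 (a foundation vision-language model) in robotic systems, achieving local inference through a ROS 2 wrapper. Relevance to the keywords is as follows: 1) Highly relevant (8 points): the paper explicitly applies foundation models to robotics, and Florence-2 is a vision-language foundation model, which meets the criterion of "research applications of large models in different domains". 2) Other keywords (0 points): the paper focuses on engineering deployment and system integration (ROS 2 wrapper, local execution, throughput testing) and does not involve model architecture innovations (such as MoE or quantization), training methods (such as pre-training or fine-tuning), inference optimizations (such as attention mechanisms), agent systems, or multi-step reasoning. Although it is an AI application, it does not target a scientific domain (such as bioinformatics) and does not mention the specifics of any other keyword.

!!! tip deepseek-chat TL;DR

The paper develops a ROS 2 wrapper that integrates the Florence-2 vision-language foundation model into robotic systems, supports local deployment and multiple interaction modes, and validates its feasibility on consumer-grade hardware through experiments.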

Abstract

Foundation vision-language models are becoming increasingly relevant to robotics because they can provide richer semantic perception than narrow task-specific pipelines. However, their practical adoption in robot software stacks still depends on reproducible middleware integrations rather than on model quality alone. Florence-2 is especially attractive in this regard because it unifies captioning, optical character recognition, open-vocabulary detection, grounding and related vision-language tasks within a comparatively manageable model size. This article presents a ROS 2 wrapper for Florence-2 that exposes the model through three complementary interaction modes: continuous topic-driven processing, synchronous service calls and asynchronous actions. The wrapper is designed for local execution and supports both native installation and Docker container deployment. It also combines generic JSON outputs with standard ROS 2 message bindings for detection-oriented tasks. A functional validation is reported together with a throughput study on several GPUs, showing that local deployment is feasible with consumer grade hardware. The repository is publicly available here: https://github.com/JEDominguezVidal/florence2_ros2_wrapper

Keywords: Florence-2, vision-language model, ROS 2 wrapper, robotic systems, local inference, multi-mode interaction, foundation model, deployment

30. ❌ Screening Is Enough

Authors: Ken M. Nakanishi Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01178v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	10.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

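Scoring rationale: The paper proposes a new language model architecture called Multiscreen, which introduces a screening mechanism to address the lack of absolute query-key relevance in standard softmax attention. The work is directly about LLM architecture innovation (highly relevant) and markedly improves long-context handling (reducing inference latency by up to 3.2× at 100K context length), so it is highly relevant to "Context Window Extension OR Long Context LLMs", "KV Cache Compression OR Linear Attention OR FlashAttention", and "Speculative Decoding OR Inference Acceleration". The paper does not cover other keyword areas such as MoE, small models, alignment, RAG, or agents.

!!! tip deepseek-chat TL;DR

The paper proposes Multiscreen, a new language model architecture whose screening mechanism addresses the lack of absolute query-key relevance in standard attention. It reaches comparable validation loss with approximately 40% fewer parameters, supports longer contexts, and substantially reduces inference latency.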

Abstract

A core limitation of standard softmax attention is that it does not define a notion of absolute query–key relevance: attention weights are obtained by redistributing a fixed unit mass across all keys according to their relative scores. As a result, relevance is defined only relative to competing keys, and irrelevant keys cannot be explicitly rejected. We introduce Multiscreen, a language-model architecture built around a mechanism we call screening, which enables absolute query–key relevance. Instead of redistributing attention across all keys, screening evaluates each key against an explicit threshold, discarding irrelevant keys and aggregating the remaining keys, thereby removing global competition among keys. Across experiments, Multiscreen achieves comparable validation loss with approximately 40% fewer parameters than a Transformer baseline, enables stable optimization at substantially larger learning rates, maintains strong performance in long-context perplexity, shows little to no degradation in retrieval performance even far beyond the training context length, and reduces inference latency by up to 3.2$\times$ at 100K context length.

Keywords: Multiscreen, screening mechanism, attention mechanism, language model architecture, long-context processing, inference acceleration, query-key relevance, Transformer alternative
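
The contrast the abstract draws between relative relevance (softmax) and absolute relevance (a per-key threshold) can be made concrete with a small sketch. The thresholding rule below, which keeps the excess of each score over a fixed `threshold`, is an illustrative assumption and not the formulation used by Multiscreen; only the softmax part is standard.

```python
import numpy as np

def softmax_weights(scores):
    """Standard softmax: redistributes a fixed unit mass across all keys."""
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

def screening_weights(scores, threshold):
    """Hypothetical screening rule: each key is judged on its own against
    `threshold`; keys at or below it get weight exactly 0."""
    return np.maximum(scores - threshold, 0.0)

# Case 1: no key is relevant to the query.
irrelevant = np.array([-3.0, -2.5, -3.2])
soft_total = softmax_weights(irrelevant).sum()           # still sums to 1
screen_total = screening_weights(irrelevant, 0.0).sum()  # every key rejected

# Case 2: adding a competing key changes softmax weights but not screening weights.
two_keys = np.array([2.0, -1.0])
three_keys = np.array([2.0, -1.0, 5.0])
soft_shift = softmax_weights(two_keys)[0] - softmax_weights(three_keys)[0]
screen_same = np.array_equal(screening_weights(two_keys, 0.0),
                             screening_weights(three_keys, 0.0)[:2])
print(soft_total, screen_total, soft_shift, screen_same)
```

Case 1 shows why softmax cannot explicitly reject irrelevant keys, and Case 2 shows the removal of global competition: under a per-key threshold, the weight of an existing key does not depend on which other keys are present.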

31. ❌ Online Reasoning Calibration: Test-Time Training Enables Generalizable Conformal LLM Reasoning

Authors: Cai Zhou, Zekai Wang, Menghua Wu, Qianyu Julie Zhu, Flora C. Shi, Chenyu Wang, Ashia Wilson, Tommi Jaakkola, Stephen Bates Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01170v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

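Scoring rationale: The paper's core subject is calibration during LLM reasoning, which is highly relevant to "Large Language Models" (10 points). It explicitly mentions "post-trained language models", which is highly relevant to "Post-training" (10 points). The study concerns calibration in reasoning tasks, which is highly relevant to "Chain of Thought" and "System 2 Thinking" (10 points each). Other keywords such as MoE, SLMs, Scaling Laws, RLHF, RAG, and Quantization are not mentioned in the abstract and are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes ORCA, an online reasoning calibration framework that combines conformal prediction with test-time training to calibrate the sampling process of large language models. It provides theoretical guarantees on conformal risk while improving efficiency and generalization on reasoning tasks.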

Abstract

While test-time scaling has enabled large language models to solve highly difficult tasks, state-of-the-art results come at exorbitant compute costs. These inefficiencies can be attributed to the miscalibration of post-trained language models, and the lack of calibration in popular sampling techniques. Here, we present Online Reasoning Calibration (ORCA), a framework for calibrating the sampling process that draws upon conformal prediction and test-time training. Specifically, we introduce a meta-learning procedure that updates the calibration module for each input. This allows us to provide valid confidence estimates under distributional shift, e.g. in thought patterns that occur across different stages of reasoning, or in prompt distributions between model development and deployment. ORCA not only provides theoretical guarantees on conformal risks, but also empirically shows higher efficiency and generalization across different reasoning tasks. At risk level $\delta=0.1$, ORCA improves Qwen2.5-32B efficiency on in-distribution tasks with savings up to 47.5% with supervised labels and 40.7% with self-consistency labels. Under zero-shot out-of-domain settings, it improves MATH-500 savings from 24.8% of the static calibration baseline to 67.0% while maintaining a low empirical error rate, and the same trend holds across model families and downstream benchmarks. Our code is publicly available at https://github.com/wzekai99/ORCA.

Keywords: Online Reasoning Calibration, conformal prediction, test-time training, large language models, reasoning tasks, calibration, efficiency, generalization
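
As background for the risk level $\delta$ mentioned in the abstract, the sketch below shows standard split conformal calibration: a threshold is chosen from held-out calibration scores so that a new exchangeable score falls below it with probability at least $1-\delta$. This is the textbook static procedure, not ORCA's test-time-trained calibrator, and the uniform scores and the function name `conformal_threshold` are hypothetical.

```python
import math
import numpy as np

def conformal_threshold(cal_scores, delta):
    """Split-conformal quantile: the ceil((n+1)(1-delta))-th smallest
    calibration score. Under exchangeability, a fresh score is <= this
    value with probability at least 1 - delta."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - delta))
    if rank > n:
        return float("inf")  # too few calibration points for this delta
    return float(np.sort(cal_scores)[rank - 1])

rng = np.random.default_rng(0)
cal_scores = rng.uniform(size=999)      # hypothetical nonconformity scores
test_scores = rng.uniform(size=10_000)  # fresh scores from the same distribution
tau = conformal_threshold(cal_scores, delta=0.1)
coverage = float((test_scores <= tau).mean())
print(f"threshold={tau:.3f}, empirical coverage={coverage:.3f}")
```

The guarantee relies on calibration and test scores being exchangeable. When the distribution shifts between calibration and deployment, a fixed threshold of this kind can lose validity, which is the setting the abstract says ORCA is designed for.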

32. ❌ AdaLoRA-QAT: Adaptive Low-Rank and Quantization-Aware Segmentation

Authors: Prantik Deb, Srimanth Dhondy, N. Ramakrishna, Anu Kapoor, Raju S. Bapi, Tapabrata Chakraborti Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01167v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	15.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	15.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

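Scoring rationale: The paper proposes the AdaLoRA-QAT framework, which focuses on efficient deployment of large models for medical image segmentation. Its core contributions are: 1) adaptive low-rank adaptation (AdaLoRA), a form of parameter-efficient fine-tuning (PEFT), highly relevant to the keyword "PEFT OR LoRA OR Parameter-efficient Fine-tuning" (15 points); 2) quantization-aware training (QAT), which achieves model compression, highly relevant to "Quantization OR Model Compression OR Low-bit Weights" (15 points); 3) application to chest X-ray segmentation, a bioinformatics/AI-for-science application, relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (10 points); 4) fine-tuning of a foundation model (such as SAM), relevant to "Post-training OR Supervised Fine-tuning OR SFT" (10 points); 5) overall use of large/foundation models in a specific domain, with some connection to "Large Language Models OR LLMs OR Foundation Models" (8 points). Other keywords such as MoE, Scaling Laws, RLHF, and RAG are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

To address the computational constraints of deploying large foundation models in clinical settings, the paper proposes the AdaLoRA-QAT framework, which combines adaptive low-rank adaptation with quantization-aware training to substantially reduce trainable parameters and compress the model while preserving chest X-ray segmentation accuracy.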

Abstract

Chest X-ray (CXR) segmentation is an important step in computer-aided diagnosis, yet deploying large foundation models in clinical settings remains challenging due to computational constraints. We propose AdaLoRA-QAT, a two-stage fine-tuning framework that combines adaptive low-rank encoder adaptation with full quantization-aware training. Adaptive rank allocation improves parameter efficiency, while selective mixed-precision INT8 quantization preserves structural fidelity crucial for clinical reliability. Evaluated across large-scale CXR datasets, AdaLoRA-QAT achieves 95.6% Dice, matching full-precision SAM decoder fine-tuning while reducing trainable parameters by $16.6\times$ and yielding $2.24\times$ model compression. A Wilcoxon signed-rank test confirms that quantization does not significantly degrade segmentation accuracy. These results demonstrate that AdaLoRA-QAT effectively balances accuracy, efficiency, and structural trustworthiness, enabling compact and deployable foundation models for medical image segmentation. Code and pretrained models are available at: https://prantik-pdeb.github.io/adaloraqat.github.io/

Keywords: AdaLoRA, Quantization-Aware Training, Medical Image Segmentation, Parameter-efficient Fine-tuning, Model Compression, Chest X-ray, Foundation Models, Adaptive Low-Rank Adaptation
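The Dice score reported in the abstract measures the overlap between a predicted mask and a reference mask. A minimal sketch, assuming binary masks stored as flat lists of 0/1 values; the function name `dice_coefficient` and the empty-mask convention are illustrative choices, not taken from the paper.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    # For 0/1 masks the element-wise product counts the overlapping pixels.
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumption: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total
```

A score of 1.0 means the two masks are identical; 0.0 means they share no foreground pixels.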

33. ❌ Brainstacks: Cross-Domain Cognitive Capabilities via Frozen MoE-LoRA Stacks for Continual LLM Learning

Authors: Mohammad R. Abu Ayyash Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01152v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

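Scoring rationale: The paper's core topic is an MoE-LoRA architecture for continual multi-domain fine-tuning of LLMs. Highly relevant keywords include MoE, LoRA/PEFT, LLMs, Continual Pre-training/Domain Adaptation, SFT, DPO, Quantization, and Chain of Thought. The paper uses TinyLlama-1.1B and Gemma 3 12B, so small models are involved but are not the focus (5 points). It mentions a medical-domain application, which relates to AI for Science but is not a focus (5 points). Other keywords, such as Scaling Laws, RAG, and Context Window, are not covered and score 0.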
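The pass threshold of 26.6 follows the formula given in the report configuration, 5.0 + 0.8 × total weight, with a total weight of 27.0. A minimal sketch of how a paper's total might be computed; the per-keyword rule `weight × relevance/10 × 15` is an assumption inferred from the configured 15-point maximum per keyword, and the function names are hypothetical.

```python
MAX_POINTS_PER_KEYWORD = 15.0  # maximum score per keyword in the report configuration


def pass_threshold(total_weight):
    # Formula stated in the report configuration: 5.0 + 0.8 x total weight.
    return 5.0 + 0.8 * total_weight


def paper_score(rows):
    """rows: iterable of (weight, relevance) pairs, relevance on a 0-10 scale.

    Assumed rule (not stated in the report): each keyword contributes
    weight * relevance / 10 * MAX_POINTS_PER_KEYWORD.
    """
    return sum(w * (r / 10.0) * MAX_POINTS_PER_KEYWORD for w, r in rows)
```

Under this assumed rule, a table in which every weight is 0.0 yields a total of 0.0 regardless of relevance, which is consistent with the 0.0 / 26.6 shown for this paper.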

!!! tip deepseek-chat TL;DR

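The paper proposes Brainstacks, an architecture that enables continual multi-domain learning for LLMs through frozen MoE-LoRA stacks. It finds that domain stacks encode transferable cognitive primitives rather than domain-specific knowledge: medical prompts are routed to the chat+math stacks in 97% of cases even though those stacks contain no medical data.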

Abstract

We present Brainstacks, a modular architecture for continual multi-domain fine-tuning of large language models that packages domain expertise as frozen adapter stacks composing additively on a shared frozen base at inference. Five interlocking components: (1) MoE-LoRA with Shazeer-style noisy top-2 routing across all seven transformer projections under QLoRA 4-bit quantization with rsLoRA scaling; (2) an inner loop performing residual boosting by freezing trained stacks and adding new ones; (3) an outer loop training sequential domain-specific stacks with curriculum-ordered dependencies; (4) null-space projection via randomized SVD constraining new stacks to subspaces orthogonal to prior directions, achieving zero forgetting in isolation; (5) an outcome-based sigmoid meta-router trained on empirically discovered domain-combination targets that selectively weights stacks, enabling cross-domain composition. Two boundary experiments: (6) PSN pretraining on a randomly initialized model; (7) per-domain RL (DPO/GRPO) validating compatibility with post-SFT alignment. Validated on TinyLlama-1.1B (4 domains, 9 stacks) and Gemma 3 12B IT (5 domains, 10 stacks), MoE-LoRA achieves 2.5x faster convergence than parameter-matched single LoRA, residual boosting breaks through the single-stack ceiling, and the routed system recovers generation quality destroyed by ungated stack accumulation. The central finding: the outcome-based router discovers that domain stacks encode transferable cognitive primitives (instruction-following clarity, numerical reasoning, procedural logic, chain-of-thought structure) rather than domain-specific knowledge, with medical prompts routing to chat+math stacks in 97% of cases despite zero medical data in those stacks.

Keywords: MoE-LoRA, continual learning, domain adaptation, parameter-efficient fine-tuning, frozen adapter stacks, cognitive primitives, cross-domain composition, zero forgetting
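The noisy top-2 routing named in component (1) can be illustrated with a small gating function. This is a sketch under stated assumptions, not the paper's implementation: it adds Gaussian noise with a fixed standard deviation, whereas Shazeer-style gating learns a per-expert noise scale, and the name `noisy_top2_gate` is hypothetical.

```python
import math
import random


def noisy_top2_gate(logits, noise_std=1.0, rng=None):
    """Return {expert_index: weight} for the two highest noisy logits.

    Assumption: fixed-std Gaussian noise; a learned per-expert noise scale
    is omitted for brevity. Requires at least two experts.
    """
    rng = rng or random.Random()
    noisy = [x + rng.gauss(0.0, noise_std) for x in logits]
    # Indices of the two experts with the largest noisy logits.
    top2 = sorted(range(len(noisy)), key=lambda i: noisy[i], reverse=True)[:2]
    # Softmax over the two selected experts only; all others get weight 0.
    m = max(noisy[i] for i in top2)
    exps = {i: math.exp(noisy[i] - m) for i in top2}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}
```

With `noise_std=0.0` the gate is deterministic, which is convenient for checking that only the two strongest experts receive non-zero weight.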

34. ❌ Detecting Multi-Agent Collusion Through Multi-Agent Interpretability

Authors: Aaron Rose, Carissa Cullen, Brandon Gary Kaplowitz, Christian Schroeder de Witt Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01151v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

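Scoring rationale: The paper studies the detection of collusion among LLM agents in multi-agent systems. It is highly relevant (10 points) to 'LLM Agents', 'Multi-agent Systems', and 'Mechanistic Interpretability', which are the paper's direct subject matter and method. The paper involves LLMs but focuses on agent applications rather than LLM techniques themselves, so 'Large Language Models' receives 8 points. Other keywords, such as MoE, SFT, and RAG, are unrelated and score 0.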

!!! tip deepseek-chat TL;DR

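The paper studies how to detect collusion between LLM agents in multi-agent systems. It introduces the NARCBench benchmark and five probing techniques, finds that internal model activations provide a signal complementary to text-level monitoring, and reports 0.60–0.86 AUROC when the probes are transferred zero-shot to structurally different scenarios.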

Abstract

As LLM agents are increasingly deployed in multi-agent systems, they introduce risks of covert coordination that may evade standard forms of human oversight. While linear probes on model activations have shown promise for detecting deception in single-agent settings, collusion is inherently a multi-agent phenomenon, and the use of internal representations for detecting collusion between agents remains unexplored. We introduce NARCBench, a benchmark for evaluating collusion detection under environment distribution shift, and propose five probing techniques that aggregate per-agent deception scores to classify scenarios at the group level. Our probes achieve 1.00 AUROC in-distribution and 0.60–0.86 AUROC when transferred zero-shot to structurally different multi-agent scenarios and a steganographic blackjack card-counting task. We find that no single probing technique dominates across all collusion types, suggesting that different forms of collusion manifest differently in activation space. We also find preliminary evidence that this signal is localised at the token level, with the colluding agent’s activations spiking specifically when processing the encoded parts of their partner’s message. This work takes a step toward multi-agent interpretability: extending white-box inspection from single models to multi-agent contexts, where detection requires aggregating signals across agents. These results suggest that model internals provide a complementary signal to text-level monitoring for detecting multi-agent collusion, particularly for organisations with access to model activations. Code and data are available at https://github.com/aaronrose227/narcbench.

Keywords: multi-agent systems, LLM agents, collusion detection, interpretability, model activations, deception, NARCBench, white-box inspection
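The abstract describes aggregating per-agent deception scores into a group-level decision and reporting AUROC. The sketch below illustrates that pipeline in general terms; the abstract does not specify the five probing techniques, so the `max` and `mean` aggregations here are illustrative stand-ins, and both function names are hypothetical.

```python
def aggregate(scores, how="max"):
    """Collapse per-agent deception scores into one group-level score."""
    if how == "max":
        return max(scores)
    if how == "mean":
        return sum(scores) / len(scores)
    raise ValueError(f"unknown aggregation: {how}")


def auroc(labels, scores):
    """Probability that a random positive outranks a random negative.

    Ties count as 0.5. labels are 1 (collusion) or 0 (no collusion).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For example, each scenario's per-agent scores can be passed through `aggregate`, and the resulting group scores compared with the scenario labels via `auroc`. An AUROC of 0.5 corresponds to chance-level ranking.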

35. ❌ Looking into a Pixel by Nonlinear Unmixing – A Generative Approach

Authors: Maofeng Tang, Hairong Qi Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01141v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

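Scoring rationale: The paper studies nonlinear unmixing of hyperspectral images and proposes a generative adversarial network (GAN) based method, the LCGU net. It belongs to computer vision and remote-sensing image processing and is unrelated to the vast majority of the keywords (large models, deep-learning techniques, training methods, inference optimization, agents, and so on). The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', since remote-sensing image analysis can be viewed as an application of AI to science (earth science / remote sensing); however, the paper does not emphasize 'AI for Science' and has no connection to bioinformatics or cheminformatics, so it receives 5 points (some relevance). No other keyword is mentioned in the title or abstract, and the work involves neither large models nor the listed techniques, so all others score 0.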

!!! tip deepseek-chat TL;DR

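To address nonlinear hyperspectral unmixing, the paper proposes a generative method (LCGU net) that requires no explicit knowledge of the mixing model. It uses a bi-directional GAN framework constrained by cycle consistency and a linear linkage, and achieves stable, competitive performance across multiple datasets.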

Abstract

Due to the large footprint of pixels in remote sensing imagery, hyperspectral unmixing (HU) has become an important and necessary procedure in hyperspectral image analysis. Traditional HU methods rely on a prior spectral mixing model, especially for nonlinear mixtures, which has largely limited the performance and generalization capacity of the unmixing approach. In this paper, we address the challenging problem of hyperspectral nonlinear unmixing (HNU) without explicit knowledge of the mixing model. Inspired by the principle of generative models, where images of the same distribution can be generated as that of the training images without knowing the exact probability distribution function of the image, we develop an invertible mixing-unmixing process via a bi-directional GAN framework, constrained by both the cycle consistency and the linkage between linear and nonlinear mixtures. The combination of cycle consistency and linear linkage provides powerful constraints without requiring an explicit mixing model. We refer to the proposed approach as the linearly-constrained CycleGAN unmixing net, or LCGU net. Experimental results indicate that the proposed LCGU net exhibits stable and competitive performance across different datasets compared with other state-of-the-art model-based HNU methods.

Keywords: hyperspectral unmixing, nonlinear unmixing, generative models, CycleGAN, remote sensing, image analysis, invertible process, spectral mixing

36. ❌ Paper Reconstruction Evaluation: Evaluating Presentation and Hallucination in AI-written Papers

Authors: Atsuyuki Miyai, Mashiro Toyooka, Zaiying Zhao, Kenta Watanabe, Toshihiko Yamasaki, Kiyoharu Aizawa Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01128v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

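Scoring rationale: The paper presents an evaluation framework for AI-driven paper writing (specifically by coding agents), focusing on two dimensions: Presentation and Hallucination. Relevance to the keywords: 1) Highly relevant (10 points): 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (the paper explicitly uses coding agents to generate papers, which is an agentic workflow) and 'Hallucination Mitigation OR Factuality OR Truthfulness' (one of the paper's core evaluation dimensions specifically assesses hallucination in AI-written papers). 2) Some relevance (5 points): 'Large Language Models OR LLMs OR Foundation Models' (the experiments use ClaudeCode and Codex, which are built on LLMs, but the paper does not study LLM techniques themselves). 3) Unrelated (0 points): the remaining 24 keywords concern large-model techniques, training methods, and optimization, none of which the paper addresses; it only applies existing AI tools for evaluation.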

!!! tip deepseek-chat TL;DR

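The paper proposes PaperRecon, which it describes as the first systematic evaluation framework for quantifying the quality and risks of papers written by modern coding agents. It finds that ClaudeCode achieves higher presentation quality but produces more hallucinations, whereas Codex hallucinates less but has lower presentation quality.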

Abstract

This paper introduces the first systematic evaluation framework for quantifying the quality and risks of papers written by modern coding agents. While AI-driven paper writing has become a growing concern, rigorous evaluation of the quality and potential risks of AI-written papers remains limited, and a unified understanding of their reliability is still lacking. We introduce Paper Reconstruction Evaluation (PaperRecon), an evaluation framework in which an overview (overview.md) is created from an existing paper, after which an agent generates a full paper based on the overview and minimal additional resources, and the result is subsequently compared against the original paper. PaperRecon disentangles the evaluation of the AI-written papers into two orthogonal dimensions, Presentation and Hallucination, where Presentation is evaluated using a rubric and Hallucination is assessed via agentic evaluation grounded in the original paper source. For evaluation, we introduce PaperWrite-Bench, a benchmark of 51 papers from top-tier venues across diverse domains published after 2025. Our experiments reveal a clear trade-off: while both ClaudeCode and Codex improve with model advances, ClaudeCode achieves higher presentation quality at the cost of more than 10 hallucinations per paper on average, whereas Codex produces fewer hallucinations but lower presentation quality. This work takes a first step toward establishing evaluation frameworks for AI-driven paper writing and improving the understanding of its risks within the research community.

Keywords: AI-written papers, evaluation framework, paper reconstruction, hallucination assessment, presentation quality, coding agents, PaperWrite-Bench, agentic evaluation

37. ❌ Trust and Reliance on AI in Education: AI Literacy and Need for Cognition as Moderators

Authors: Griffin Pitts, Neha Rani, Weedguet Mildort Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01114v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

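Scoring rationale: The paper studies students' trust in and reliance on AI assistants in educational settings. Because it involves the use of an AI assistant (which can be regarded as an AI system), it has some relevance to 'Large Language Models OR LLMs OR Foundation Models' (5 points). It does not examine any specific large-model technique, architecture, training method, optimization technique, or scientific application, so all other keywords are unrelated (0 points).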

!!! tip deepseek-chat TL;DR

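The study examines how students' trust in an AI assistant during programming problem-solving tasks relates to their appropriate reliance on it. Higher trust is associated with lower appropriate reliance, and the relationship is moderated by students' AI literacy and need for cognition.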

Abstract

As generative AI systems are integrated into educational settings, students often encounter AI-generated output while working through learning tasks, either by requesting help or through integrated tools. Trust in AI can influence how students interpret and use that output, including whether they evaluate it critically or exhibit overreliance. We investigate how students’ trust relates to their appropriate reliance on an AI assistant during programming problem-solving tasks, and whether this relationship differs by learner characteristics. With 432 undergraduate participants, students completed Python output-prediction problems while receiving recommendations and explanations from an AI chatbot, including accurate and intentionally misleading suggestions. We operationalize reliance behaviorally as the extent to which students’ responses reflected appropriate use of the AI assistant’s suggestions, accepting them when they were correct and rejecting them when they were incorrect. Pre- and post-task surveys assessed trust in the assistant, AI literacy, need for cognition, programming self-efficacy, and programming literacy. Results showed a non-linear relationship in which higher trust was associated with lower appropriate reliance, suggesting weaker discrimination between correct and incorrect recommendations. This relationship was significantly moderated by students’ AI literacy and need for cognition. These findings highlight the need for future work on instructional and system supports that encourage more reflective evaluation of AI assistance during problem-solving.

Keywords: AI in education, trust in AI, appropriate reliance, AI literacy, need for cognition, programming problem-solving, AI assistant, behavioral reliance
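The behavioral definition of appropriate reliance in the abstract (accept correct suggestions, reject incorrect ones) can be expressed as a simple proportion. This is a sketch only: the abstract does not give the study's exact scoring procedure, so the per-trial fraction and the name `appropriate_reliance` are assumptions.

```python
def appropriate_reliance(trials):
    """trials: list of (ai_was_correct, student_accepted) booleans.

    A trial counts as appropriate when the student accepts a correct
    suggestion or rejects an incorrect one.
    """
    if not trials:
        raise ValueError("no trials")
    appropriate = sum(1 for correct, accepted in trials if correct == accepted)
    return appropriate / len(trials)
```

A student who accepts every suggestion scores exactly the share of suggestions that were correct, which shows why uniform acceptance is not the same as appropriate reliance.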

38. ❌ Lightweight Prompt-Guided CLIP Adaptation for Monocular Depth Estimation

Authors: Reyhaneh Ahani Manghotay, Jie Liang Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01118v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	5.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

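Scoring rationale: The paper studies parameter-efficient fine-tuning of the vision-language model CLIP for monocular depth estimation; its core contribution is a lightweight Mixture-of-Adapters module. Relevance to the keywords: 1) Highly relevant (10 points): PEFT/LoRA/Parameter-efficient Fine-tuning is the paper's core method. 2) Moderately relevant (5 points): Mixture-of-Adapters is related to, but not the same as, MoE; Pre-training/Domain Adaptation applies because a pretrained CLIP model is transferred; Post-training/SFT applies because of the fine-tuning process. 3) Unrelated (0 points): the other keywords mainly target text generation, reasoning, alignment, and agents for large language models, whereas this paper focuses on fine-tuning a VLM for a vision task and involves no LLMs, AI-for-science applications, or other listed techniques.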

!!! tip deepseek-chat TL;DR

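The paper proposes MoA-DepthCLIP, a lightweight parameter-efficient fine-tuning framework that adapts the pretrained CLIP vision-language model to monocular depth estimation by integrating a Mixture-of-Adapters module with selective fine-tuning. On the NYU Depth V2 benchmark it improves δ1 accuracy from 0.390 to 0.745 while substantially reducing the number of trainable parameters.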

Abstract

Leveraging the rich semantic features of vision-language models (VLMs) like CLIP for monocular depth estimation tasks is a promising direction, yet often requires extensive fine-tuning or lacks geometric precision. We present a parameter-efficient framework, named MoA-DepthCLIP, that adapts pretrained CLIP representations for monocular depth estimation with minimal supervision. Our method integrates a lightweight Mixture-of-Adapters (MoA) module into the pretrained Vision Transformer (ViT-B/32) backbone combined with selective fine-tuning of the final layers. This design enables spatially-aware adaptation, guided by a global semantic context vector and a hybrid prediction architecture that synergizes depth bin classification with direct regression. To enhance structural accuracy, we employ a composite loss function that enforces geometric constraints. On the NYU Depth V2 benchmark, MoA-DepthCLIP achieves competitive results, significantly outperforming the DepthCLIP baseline by improving the $\delta_1$ accuracy from 0.390 to 0.745 and reducing the RMSE from 1.176 to 0.520. These results are achieved while requiring substantially fewer trainable parameters, demonstrating that lightweight, prompt-guided MoA is a highly effective strategy for transferring VLM knowledge to fine-grained monocular depth estimation tasks.

Keywords: CLIP adaptation, monocular depth estimation, parameter-efficient fine-tuning, Mixture-of-Adapters, vision-language models, lightweight adaptation, geometric constraints, hybrid prediction architecture
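The two metrics quoted in the abstract, δ1 accuracy and RMSE, can be computed as follows. A minimal sketch, assuming the conventional δ1 threshold of 1.25 (the abstract does not state it) and strictly positive depth values given as flat lists; the function names are hypothetical.

```python
import math


def delta1_accuracy(pred, gt, threshold=1.25):
    """Fraction of pixels where max(pred/gt, gt/pred) is below the threshold.

    Assumption: threshold 1.25 is the usual convention for delta_1.
    """
    ratios = [max(p / g, g / p) for p, g in zip(pred, gt)]
    return sum(r < threshold for r in ratios) / len(ratios)


def rmse(pred, gt):
    """Root mean squared error between predicted and reference depths."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(pred))
```

Higher δ1 is better and lower RMSE is better, which matches the direction of the improvements reported over the DepthCLIP baseline.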

39. ❌ Adversarial Moral Stress Testing of Large Language Models

Authors: Saeid Jamshidi, Foutse Khomh, Arghavan Moradi Dakhel, Amin Nikanjam, Mohammad Hamdaqa, Kawser Wazed Nafi Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01108v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

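Scoring rationale: The paper's core topic is evaluating the ethical robustness of LLMs, which is highly relevant to 'Large Language Models' (10 points); it also concerns ethical and value alignment, which is highly relevant to 'Instruction Tuning OR Alignment OR Value Alignment' (10 points). The other keywords (MoE, SLMs, Scaling Laws, Pre-training, RLHF, RAG, Context Window, KV Cache, CoT, System 2, MCTS, Self-Correction, Agents, Tool Use, Multi-agent, Quantization, Speculative Decoding, Hallucination, Interpretability, World Models, Model Merging, In-context Learning, AI for Science, and so on) are neither mentioned in nor related to the title or abstract and score 0.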

!!! tip deepseek-chat TL;DR

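The paper proposes the Adversarial Moral Stress Testing (AMST) framework for evaluating the ethical robustness of large language models under adversarial multi-turn interaction. It finds substantial differences in robustness across models and degradation patterns that conventional single-round evaluation cannot detect.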

Abstract

Evaluating the ethical robustness of large language models (LLMs) deployed in software systems remains challenging, particularly under sustained adversarial user interaction. Existing safety benchmarks typically rely on single-round evaluations and aggregate metrics, such as toxicity scores and refusal rates, which offer limited visibility into behavioral instability that may arise during realistic multi-turn interactions. As a result, rare but high-impact ethical failures and progressive degradation effects may remain undetected prior to deployment. This paper introduces Adversarial Moral Stress Testing (AMST), a stress-based evaluation framework for assessing ethical robustness under adversarial multi-round interactions. AMST applies structured stress transformations to prompts and evaluates model behavior through distribution-aware robustness metrics that capture variance, tail risk, and temporal behavioral drift across interaction rounds. We evaluate AMST on several state-of-the-art LLMs, including LLaMA-3-8B, GPT-4o, and DeepSeek-v3, using a large set of adversarial scenarios generated under controlled stress conditions. The results demonstrate substantial differences in robustness profiles across models and expose degradation patterns that are not observable under conventional single-round evaluation protocols. In particular, robustness has been shown to depend on distributional stability and tail behavior rather than on average performance alone. Additionally, AMST provides a scalable and model-agnostic stress-testing methodology that enables robustness-aware evaluation and monitoring of LLM-enabled software systems operating in adversarial environments.

关键词: Large Language Models, Ethical Robustness, Adversarial Interaction, Stress Testing, Multi-turn Evaluation, Behavioral Degradation, Safety Benchmark, Distribution-aware Metrics
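
The abstract names three families of distribution-aware metrics (variance, tail risk, and temporal drift across interaction rounds) but does not give formulas. The following is a minimal sketch of what such metrics could look like, assuming each round yields a scalar safety score in [0, 1] where higher is better. The function names, the CVaR-style tail measure, and the least-squares drift slope are illustrative assumptions, not AMST's actual definitions.

```python
from statistics import mean, pvariance


def tail_risk(scores, alpha=0.1):
    """Mean of the worst alpha-fraction of scores (a CVaR-style tail measure).

    Assumption: lower score = worse behaviour.
    """
    k = max(1, int(len(scores) * alpha))
    worst = sorted(scores)[:k]
    return mean(worst)


def drift_slope(scores):
    """Least-squares slope of score against round index.

    A negative slope indicates progressive degradation over rounds.
    """
    n = len(scores)
    if n < 2:
        return 0.0
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(scores)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, scores))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


def robustness_profile(scores, alpha=0.1):
    """Summarise one multi-round interaction by more than its average."""
    return {
        "mean": mean(scores),
        "variance": pvariance(scores),
        "tail_risk": tail_risk(scores, alpha),
        "drift": drift_slope(scores),
    }
```

Two score sequences with the same mean, for example a flat `[0.8] * 5` and a declining `[1.0, 0.9, 0.8, 0.7, 0.6]`, get different variance, tail-risk, and drift values under this sketch, which is the kind of difference an average-only metric would hide.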

40. ❌ Temporal Dependencies in In-Context Learning: The Role of Induction Heads

作者: Anooshka Bajaj, Deven Mahesh Mistry, Sahaj Singh Maini, Yash Aggarwal, Billy Dickson, Zoran Tiganj 期刊/来源: arxiv 发布日期: 2026-04-01 arXiv链接: http://arxiv.org/abs/2604.01094v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

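Scoring rationale: The paper's core topic is the in-context learning mechanism of large language models (LLMs); in particular, it uses ablation experiments to probe the role of induction heads in temporal dependencies and serial recall. It is therefore highly relevant to 'Large Language Models', 'Mechanistic Interpretability', and 'In-context Learning' (10 points each). Other keywords such as MoE, SFT, RAG, and quantization are not covered in the paper and score 0.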

!!! tip deepseek-chat TL;DR

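The paper studies how large language models handle temporal dependencies during in-context learning, finds that induction heads play a key role in serial-recall behavior, and uses ablation experiments to confirm that the effect is mechanistically specific.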

摘要翻译

大语言模型（LLMs）展现出强大的上下文学习能力，但其如何追踪并从上下文中检索信息仍未被充分探究。借鉴认知科学中的自由回忆范式（参与者以任意顺序回忆列表项目），我们发现多个开源LLM持续表现出类似序列回忆的模式：它们将概率峰值分配给输入序列中紧跟在重复标记后的标记。通过系统的消融实验，我们证明归纳头——一种专门关注当前标记先前出现位置之后标记的注意力头——在此现象中起重要作用。移除具有高归纳得分的注意力头会显著降低+1滞后偏差，而随机移除其他注意力头则无法产生相同的削弱效果。我们还发现，在模型通过少样本学习进行序列回忆任务时，移除高归纳得分注意力头对性能的损害程度远大于移除随机注意力头。我们的研究结果揭示了归纳头与Transformer中时序上下文处理之间一种机制特异的关联，表明这些注意力头对于上下文学习中的有序检索和类序列回忆行为尤为重要。

摘要 (Abstract)

Large language models (LLMs) exhibit strong in-context learning capabilities, but how they track and retrieve information from context remains underexplored. Drawing on the free recall paradigm in cognitive science (where participants recall list items in any order), we show that several open-source LLMs consistently display a serial-recall-like pattern, assigning peak probability to tokens that immediately follow a repeated token in the input sequence. Through systematic ablation experiments, we show that induction heads, specialized attention heads that attend to the token following a previous occurrence of the current token, play an important role in this phenomenon. Removing heads with a high induction score substantially reduces the +1 lag bias, whereas ablating random heads does not reproduce the same reduction. We also show that removing heads with high induction scores impairs the performance of models prompted to do serial recall using few-shot learning to a larger extent than removing random heads. Our findings highlight a mechanistically specific connection between induction heads and temporal context processing in transformers, suggesting that these heads are especially important for ordered retrieval and serial-recall-like behavior during in-context learning.

关键词: Large Language Models, In-context Learning, Induction Heads, Temporal Dependencies, Serial Recall, Mechanistic Interpretability, Transformer Models, Attention Heads
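
The abstract ranks attention heads by an "induction score" before ablating them, without stating how the score is computed. A common convention, assumed here, is to feed the model a block of `period` tokens repeated twice and measure how much attention each query position in the second copy puts on the token that followed its previous occurrence. The sketch below works on plain nested lists standing in for one head's attention matrix; `induction_score` and `rank_heads` are hypothetical helper names, not the authors' code.

```python
def induction_score(attn, period):
    """Average attention mass on the induction target.

    attn[i][j] is the attention weight from query position i to key
    position j for a single head, on a sequence of length 2 * period
    built by repeating one block of tokens twice (an assumed setup).
    For a query at position i in the second copy, the induction target
    is position i - period + 1: the token that followed the previous
    occurrence of the current token.
    """
    targets = [attn[i][i - period + 1] for i in range(period, 2 * period)]
    return sum(targets) / len(targets)


def rank_heads(heads, period):
    """Return head names sorted from highest to lowest induction score.

    heads maps a head name to its attention matrix.
    """
    return sorted(
        heads,
        key=lambda name: induction_score(heads[name], period),
        reverse=True,
    )
```

Under this convention, an ablation study would remove the top-ranked heads from `rank_heads` and compare the effect against removing the same number of randomly chosen heads.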

41. ❌ Approximating Pareto Frontiers in Stochastic Multi-Objective Optimization via Hashing and Randomization

作者: Jinzhao Li, Nan Jiang, Yexiang Xue 期刊/来源: arxiv 发布日期: 2026-04-01 arXiv链接: http://arxiv.org/abs/2604.01098v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

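Scoring rationale: The paper focuses on algorithm design for stochastic multi-objective optimization (SMOO). It proposes the XOR-SMOO algorithm for this #P-hard problem, approximating the Pareto frontier through hashing and randomization. Every keyword concerns large models, deep learning, AI applications, or related technical principles, whereas this paper studies a classical optimization algorithm (based on SAT oracle queries) with no direct connection to the training, inference, application, or technical principles of AI models. All keywords therefore receive a relevance of 0.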

!!! tip deepseek-chat TL;DR

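The paper proposes a new algorithm, XOR-SMOO, for stochastic multi-objective optimization; it approximates the Pareto frontier efficiently via hashing and randomization and is shown to outperform existing methods on real-world applications.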

摘要翻译

随机多目标优化（Stochastic Multi-Objective Optimization, SMOO）对于在不确定环境中权衡多个潜在冲突目标的决策至关重要。SMOO旨在识别包含所有互不支配决策的帕累托前沿（Pareto frontier）。由于涉及嵌入式概率推断（如计算边缘概率、后验概率或期望），该问题通常高度难解。现有方法（如标量化、样本平均逼近和进化算法）要么提供任意宽松的近似解，要么可能产生极高的计算成本。我们提出XOR-SMOO，一种新颖算法，该算法以$1-δ$的概率，通过以$γ$和$δ$的多对数次数查询SAT谕示（SAT oracle），为SMOO问题获得$γ$近似帕累托前沿（$γ>1$）。$γ$近似帕累托前沿仅以固定的乘法因子$γ$低于真实前沿。因此，XOR-SMOO仅通过查询SAT谕示即可解决高度难解的SMOO问题（#P难问题），同时获得紧密的常数因子近似保证。在现实世界道路网络加固和供应链设计问题上的实验表明，XOR-SMOO在识别帕累托前沿方面优于多种基线方法，其前沿具有更高的目标函数值、更优的最优解覆盖度，且所得解分布更为均匀。总体而言，XOR-SMOO显著提升了SMOO求解器的实用性与可靠性。

摘要 (Abstract)

Stochastic Multi-Objective Optimization (SMOO) is critical for decision-making that trades off multiple, potentially conflicting objectives in uncertain environments. SMOO aims to identify the Pareto frontier, which contains all mutually non-dominating decisions. The problem is highly intractable due to the embedded probabilistic inference, such as computing marginal probabilities, posterior probabilities, or expectations. Existing methods, such as scalarization, sample average approximation, and evolutionary algorithms, either offer arbitrarily loose approximations or may incur prohibitive computational costs. We propose XOR-SMOO, a novel algorithm that, with probability $1-\delta$, obtains $\gamma$-approximate Pareto frontiers ($\gamma>1$) for SMOO by querying an SAT oracle poly-log times in $\gamma$ and $\delta$. A $\gamma$-approximate Pareto frontier is only below the true frontier by a fixed, multiplicative factor $\gamma$. Thus, XOR-SMOO solves highly intractable SMOO problems (#P-hard) with only queries to SAT oracles while obtaining tight, constant factor approximation guarantees. Experiments on real-world road network strengthening and supply chain design problems demonstrate that XOR-SMOO outperforms several baselines in identifying Pareto frontiers that have higher objective values, better coverage of the optimal solutions, and more evenly distributed solutions. Overall, XOR-SMOO significantly enhances the practicality and reliability of SMOO solvers.

关键词: Stochastic Multi-Objective Optimization, Pareto frontier, SAT oracle, approximation algorithm, hashing, randomization, road network strengthening, supply chain design
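
To make the notion of a $\gamma$-approximate Pareto frontier concrete, here is a small sketch on a finite set of deterministic objective vectors. It is not XOR-SMOO (no hashing, randomization, or SAT oracle is involved). It assumes maximization with positive objective values and formalizes "below the true frontier by a multiplicative factor $\gamma$" as: every true-frontier point is covered by some candidate point once the candidate is scaled by $\gamma$. That formalization and the function names are assumptions.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_frontier(points):
    """Keep the points that no other point dominates (maximization)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def is_gamma_approx(candidate, true_frontier, gamma):
    """Check coverage: each true point p has a candidate q with gamma * q >= p."""
    return all(
        any(all(gamma * qi >= pi for qi, pi in zip(q, p)) for q in candidate)
        for p in true_frontier
    )


# Toy objective vectors (illustrative values only).
points = [(4, 1), (3, 3), (1, 4), (2, 2), (1, 1)]
frontier = pareto_frontier(points)
```

In this toy set, the single point `(2, 2)` covers the whole frontier for `gamma = 2` but not for `gamma = 1.5`, which shows how a smaller factor demands a candidate set closer to the true frontier.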

42. ❌ VibeGuard: A Security Gate Framework for AI-Generated Code

作者: Ying Xie 期刊/来源: arxiv 发布日期: 2026-04-01 arXiv链接: http://arxiv.org/abs/2604.01052v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

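Scoring rationale: The paper studies VibeGuard, a framework for detecting security vulnerabilities in AI-generated code, centering on the production use of AI coding assistants (such as Claude Code CLI) and the security risks they introduce. It has some relevance to "Large Language Models OR LLMs OR Foundation Models" (5 points) because AI coding assistants are usually built on large language models, but the paper does not examine the model technology itself and focuses on an application-level security tool. All other keywords are unrelated to the paper (0 points), since it does not address model architecture, training methods, inference optimization, alignment, agent systems, AI for science, or similar directions.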

!!! tip deepseek-chat TL;DR

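Targeting the new class of security vulnerabilities (such as source-file leakage and misconfiguration) that AI coding assistants (such as Claude) introduce in "vibe coding" practice, the paper proposes VibeGuard, a pre-publish security checking framework that achieves high recall and precision in experiments and offers a defense-in-depth approach for teams that rely on AI code generation.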

摘要翻译

“氛围编码”（vibe coding）——即开发者将代码生成任务委托给人工智能助手，并在极少人工审查的情况下接受其输出——已在生产环境中迅速普及。2026年3月31日，Anthropic公司的Claude Code CLI在其npm包中意外发布了一个59.8 MB的源码映射（source map）文件，导致约51.2万行专有TypeScript代码泄露。该工具本身在很大程度上采用了氛围编码方式开发，而此次泄露可追溯至一项配置错误的打包规则，而非逻辑漏洞。现有的静态分析和敏感信息扫描工具均未覆盖此类故障模式，这表明人工智能倾向于引入的漏洞类型与当前工具旨在发现的漏洞之间存在差距。本文提出VibeGuard，一种针对五类此类盲点的预发布安全关卡：产物清洁度、打包配置漂移、源码映射暴露、硬编码密钥以及供应链风险。在八个合成项目（七个含漏洞，一个作为洁净对照组）的受控实验中，VibeGuard实现了100%的召回率、89.47%的精确率（F1值=94.44%），并在三个策略级别上对所有八个项目作出了正确的通过/拦截决策。我们进一步探讨这些结果如何为依赖AI代码生成的团队构建纵深防御工作流提供参考。

摘要 (Abstract)

“Vibe coding,” in which developers delegate code generation to AI assistants and accept the output with little manual review, has gained rapid adoption in production settings. On March 31, 2026, Anthropic’s Claude Code CLI shipped a 59.8 MB source map file in its npm package, exposing roughly 512,000 lines of proprietary TypeScript. The tool had itself been largely vibe-coded, and the leak traced to a misconfigured packaging rule rather than a logic bug. Existing static-analysis and secret-scanning tools did not cover this failure mode, pointing to a gap between the vulnerabilities AI tends to introduce and the vulnerabilities current tooling is built to find. We present VibeGuard, a pre-publish security gate that targets five such blind spots: artifact hygiene, packaging-configuration drift, source-map exposure, hardcoded secrets, and supply-chain risk. In controlled experiments on eight synthetic projects (seven vulnerable, one clean control), VibeGuard achieved 100% recall, 89.47% precision (F1 = 94.44%), and correct pass/fail gate decisions on all eight projects across three policy levels. We discuss how these results inform a defense-in-depth workflow for teams that rely on AI code generation.

关键词: AI-generated code, security gate, vibe coding, source map exposure, static analysis, supply-chain risk, defense-in-depth, Claude Code CLI
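
The reported F1 can be checked from the reported precision and recall using the standard harmonic-mean formula. The sketch below also shows one set of raw counts (17 true positives, 2 false positives, 0 false negatives) that reproduces the reported percentages; those counts are an assumption chosen for illustration, since the abstract gives only the percentages.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from raw detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts consistent with the reported figures (not from the paper).
precision, recall = precision_recall(tp=17, fp=2, fn=0)
f1 = f1_score(precision, recall)
print(f"precision={precision:.2%} recall={recall:.2%} F1={f1:.2%}")
```

With recall at 100%, F1 is pulled below 100% only by the false positives, so the reported 94.44% follows directly from the 89.47% precision.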

43. ❌ TRACE: Training-Free Partial Audio Deepfake Detection via Embedding Trajectory Analysis of Speech Foundation Models

作者: Awais Khan, Muhammad Umar Farooq, Kutub Uddin, Khalid Malik 期刊/来源: arxiv 发布日期: 2026-04-01 arXiv链接: http://arxiv.org/abs/2604.01083v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

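Scoring rationale: The paper studies audio deepfake detection using pre-trained speech foundation models (which fall under foundation models) and is an application of AI in the science/security domain. It therefore has some relevance to 'Large Language Models OR LLMs OR Foundation Models', 'Pre-training OR Continual Pre-training OR Domain Adaptation', and 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points each). The other keywords mainly concern specific LLM techniques (such as MoE, alignment, reasoning, and agents) or specific domains (such as bioinformatics), have no direct relation to the paper's audio-forensics topic, and score 0.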

!!! tip deepseek-chat TL;DR

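The paper proposes TRACE, a training-free audio deepfake detection method that detects partially spoofed audio by analyzing the embedding-trajectory dynamics of pre-trained speech foundation models, achieving performance competitive with, or even better than, supervised methods on several benchmarks.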

摘要翻译

局部音频深度伪造（partial audio deepfakes）指将合成片段拼接至真实录音中，此类伪造极具欺骗性，因为音频的大部分内容仍保持真实。现有检测器均为监督式方法：它们需要帧级标注，易对特定合成流程过拟合，且每当新生成模型出现时必须重新训练。我们认为这种监督并非必要。我们假设语音基础模型（speech foundation models）隐式编码了取证信号：真实语音会形成平滑、缓慢变化的嵌入轨迹，而拼接边界会在帧级过渡中引入突变。基于此，我们提出TRACE（基于嵌入动态的无训练表征音频防御框架），这是一种无需训练的方法框架，通过分析冻结语音基础模型表征的一阶动态来检测局部音频深度伪造，无需任何训练过程、标注数据或架构修改。我们在涵盖两种语言的四个基准数据集上使用六种语音基础模型评估TRACE。在PartialSpoof数据集中，TRACE实现了8.08%的等错误率（EER），与经过微调的监督基线模型表现相当。在最具挑战性的LlamaPartialSpoof基准（采用LLM驱动的商业合成技术）中，TRACE在未使用任何目标领域数据的情况下，直接超越了监督基线模型（等错误率分别为24.12% vs. 24.49%）。这些结果表明，语音基础模型中的时序动态为无训练音频取证提供了有效且泛化性强的检测信号。

摘要 (Abstract)

Partial audio deepfakes, where synthesized segments are spliced into genuine recordings, are particularly deceptive because most of the audio remains authentic. Existing detectors are supervised: they require frame-level annotations, overfit to specific synthesis pipelines, and must be retrained as new generative models emerge. We argue that this supervision is unnecessary. We hypothesize that speech foundation models implicitly encode a forensic signal: genuine speech forms smooth, slowly varying embedding trajectories, while splice boundaries introduce abrupt disruptions in frame-level transitions. Building on this, we propose TRACE (Training-free Representation-based Audio Countermeasure via Embedding dynamics), a training-free framework that detects partial audio deepfakes by analyzing the first-order dynamics of frozen speech foundation model representations without any training, labeled data, or architectural modification. We evaluate TRACE on four benchmarks that span two languages using six speech foundation models. In PartialSpoof, TRACE achieves 8.08% EER, competitive with fine-tuned supervised baselines. In LlamaPartialSpoof, the most challenging benchmark featuring LLM-driven commercial synthesis, TRACE surpasses a supervised baseline outright (24.12% vs. 24.49% EER) without any target-domain data. These results show that temporal dynamics in speech foundation models provide an effective, generalizable signal for training-free audio forensics.

关键词: audio deepfake detection, speech foundation models, training-free, embedding trajectory analysis, partial audio forgery, forensic signal, temporal dynamics, generalization
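
The hypothesis in the abstract (smooth trajectories for genuine speech, abrupt frame-level transitions at splice boundaries) can be illustrated with a toy first-order analysis. The sketch below assumes frame embeddings are already available as non-zero vectors, uses cosine distance between consecutive frames as the first-order signal, and scores an utterance by how far the largest transition stands above the median one. The distance choice, the scoring rule, and the function names are assumptions; the abstract does not specify TRACE's actual statistic.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def transition_profile(frames):
    """First-order dynamics: distance between each pair of consecutive frames."""
    return [cosine_distance(frames[t], frames[t + 1]) for t in range(len(frames) - 1)]


def splice_score(frames):
    """Largest transition minus the median transition (higher = more suspicious)."""
    deltas = transition_profile(frames)
    median = sorted(deltas)[len(deltas) // 2]
    return max(deltas) - median


# Toy 2-D "embeddings": a slowly rotating unit vector, and a copy with an
# abrupt jump in angle halfway through to mimic a splice boundary.
smooth = [(math.cos(0.05 * t), math.sin(0.05 * t)) for t in range(20)]
spliced = smooth[:10] + [
    (math.cos(0.05 * t + 1.5), math.sin(0.05 * t + 1.5)) for t in range(10, 20)
]
```

No parameters are learned anywhere in this sketch, which mirrors the training-free setting: the only inputs are frozen representations and a threshold on the score would be chosen separately.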

44. ❌ Adversarial Attacks in AI-Driven RAN Slicing: SLA Violations and Recovery

作者: Deemah H. Tashman, Soumaya Cherkaoui 期刊/来源: arxiv 发布日期: 2026-04-01 arXiv链接: http://arxiv.org/abs/2604.01049v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

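Scoring rationale: The paper studies adversarial attacks on AI-driven RAN slicing, where deep reinforcement learning (DRL) is used for resource allocation, and analyzes SLA violations and recovery behavior. All keywords relate directly to large models, deep learning technical principles, or AI applications in science, whereas this paper focuses on a DRL application in communication networks and does not touch on large models, LLMs, MoE, SLMs, scaling laws, pre-training, post-training, alignment, RLHF, PEFT, RAG, context extension, attention optimization, reasoning methods, agent systems, quantization, decoding acceleration, hallucination mitigation, interpretability, world models, model merging, in-context learning, or AI for Science. All keywords therefore receive a relevance of 0.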

!!! tip deepseek-chat TL;DR

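The paper studies how budget-constrained adversarial jamming attacks affect deep-reinforcement-learning-based RAN slicing resource allocation, quantifying the resulting SLA violations and the post-attack recovery behavior.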

摘要翻译

下一代（NextG）蜂窝网络旨在支持具有多样化数据速率与时延需求的新兴应用，例如沉浸式多媒体服务和大规模物联网部署。其中一项关键使能机制是无线接入网（RAN）切片技术，该技术通过将无线资源动态划分为虚拟资源块，以高效服务异构业务类型，包括增强型移动宽带（eMBB）、大规模机器类通信（mMTC）以及超可靠低时延通信（URLLC）。本文研究了对抗性攻击对人工智能驱动的RAN切片决策的影响：在预算受限条件下，攻击者通过选择性干扰切片传输，以误导基于深度强化学习（DRL）的资源分配策略，并量化了由此导致的服务水平协议（SLA）违约情况及攻击后的恢复行为。研究结果表明，预算受限的对抗性干扰可引发严重且依赖切片类型的稳态SLA违约。此外，DRL智能体的奖励仅在经历不可忽略的恢复期后，才逐渐收敛至未受攻击的基线水平。

摘要 (Abstract)

Next-generation (NextG) cellular networks are designed to support emerging applications with diverse data rate and latency requirements, such as immersive multimedia services and large-scale Internet of Things deployments. A key enabling mechanism is radio access network (RAN) slicing, which dynamically partitions radio resources into virtual resource blocks to efficiently serve heterogeneous traffic classes, including enhanced mobile broadband (eMBB), massive machine-type communications (mMTC), and ultra-reliable low-latency communications (URLLC). In this paper, we study the impact of adversarial attacks on AI-driven RAN slicing decisions, where a budget-constrained adversary selectively jams slice transmissions to bias deep reinforcement learning (DRL)-based resource allocation, and quantify the resulting service level agreement (SLA) violations and post-attack recovery behavior. Our results indicate that budget-constrained adversarial jamming can induce severe and slice-dependent steady-state SLA violations. Moreover, the DRL agent’s reward converges toward the clean baseline only after a non-negligible recovery period.

关键词: Adversarial Attacks, AI-driven RAN Slicing, Deep Reinforcement Learning, Service Level Agreement Violations, Recovery Behavior, Radio Access Network, Resource Allocation

45. ❌ Aligning Recommendations with User Popularity Preferences

作者: Mona Schirmer, Anton Thielmann, Pola Schwöbel, Thomas Martynec, Giuseppe Di Benedetto, Ben London, Yannik Stein 期刊/来源: arxiv 发布日期: 2026-04-01 arXiv链接: http://arxiv.org/abs/2604.01036v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

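Scoring rationale: The paper studies popularity bias in recommender systems and proposes SPREE, an inference-time mitigation method based on activation steering. Its core concern is aligning recommender systems with user preferences, which has some relevance to the keyword 'Instruction Tuning OR Alignment OR Value Alignment' (5 points), since both involve aligning system outputs with user expectations. However, the paper does not involve large models, innovations in deep learning technical principles, or scientific applications, and it uses none of the specific techniques named in the other keywords, so all other keywords score 0.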

!!! tip deepseek-chat TL;DR

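The paper studies popularity bias in recommender systems and proposes SPREE, which uses activation steering to improve user-level popularity alignment while preserving recommendation quality.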

摘要翻译

流行度偏差是推荐系统中普遍存在的问题，即推荐结果过度偏向热门项目。这不仅会导致“富者愈富”的效应和可见内容的同质化，还可能使推荐结果与用户个人对热门或小众内容的偏好产生错位。本研究从用户-推荐系统对齐的视角探讨流行度偏差。为此，我们提出了流行度分位数校准，这是一个量化用户历史流行度偏好与其推荐内容流行度之间错位程度的测量框架。基于这种流行度对齐的概念，我们提出了SPREE——一种基于激活导向的序列推荐器推理阶段缓解方法。SPREE在表征空间中识别流行度方向，并根据对每个用户个人流行度偏差的估计自适应地引导模型激活，使得引导的方向和强度均可随用户动态调整。与全局去偏差方法不同，SPREE明确以对齐为目标，而非简单地均匀降低流行度。在多个数据集上的实验表明，SPREE在保持推荐质量的同时，持续提升了用户层面的流行度对齐程度。

摘要 (Abstract)

Popularity bias is a pervasive problem in recommender systems, where recommendations disproportionately favor popular items. This not only results in “rich-get-richer” dynamics and a homogenization of visible content, but can also lead to misalignment of recommendations with individual users’ preferences for popular or niche content. This work studies popularity bias through the lens of user-recommender alignment. To this end, we introduce Popularity Quantile Calibration, a measurement framework that quantifies misalignment between a user’s historical popularity preference and the popularity of their recommendations. Building on this notion of popularity alignment, we propose SPREE, an inference-time mitigation method for sequential recommenders based on activation steering. SPREE identifies a popularity direction in representation space and adaptively steers model activations based on an estimate of each user’s personal popularity bias, allowing both the direction and magnitude of steering to vary across users. Unlike global debiasing approaches, SPREE explicitly targets alignment rather than uniformly reducing popularity. Experiments across multiple datasets show that SPREE consistently improves user-level popularity alignment while preserving recommendation quality.

关键词: recommender systems, popularity bias, user-recommender alignment, Popularity Quantile Calibration, SPREE, activation steering, sequential recommenders, inference-time mitigation
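
The per-user steering described in the abstract can be sketched as adding a scaled direction vector to a hidden state, with the sign and size of the scale set by how far the user's recommendations sit from their historical popularity preference. The abstract does not say how the popularity direction is estimated; the difference-of-means estimate below is an assumption borrowed from common activation-steering practice, and `user_gap`, `strength`, and all function names are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def popularity_direction(popular_states, niche_states):
    """Assumed estimate: mean state on popular items minus mean state on niche items."""
    mean_popular = mean_vector(popular_states)
    mean_niche = mean_vector(niche_states)
    return [p - q for p, q in zip(mean_popular, mean_niche)]


def steer(hidden, direction, user_gap, strength=1.0):
    """Shift a hidden state along the popularity direction.

    user_gap = (user's historical popularity quantile)
             - (popularity quantile of current recommendations).
    Positive gap pushes toward popular items, negative toward niche ones,
    so both direction and magnitude vary per user.
    """
    alpha = strength * user_gap
    return [h + alpha * d for h, d in zip(hidden, direction)]
```

A global debiasing method would correspond to using the same negative `alpha` for every user; making `alpha` depend on `user_gap` is what turns the intervention into an alignment step in this sketch.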

46. ❌ Automated Framework to Evaluate and Harden LLM System Instructions against Encoding Attacks

作者: Anubhab Sahu, Diptisha Samanta, Reza Soosahabi 期刊/来源: arxiv 发布日期: 2026-04-01 arXiv链接: http://arxiv.org/abs/2604.01039v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

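Scoring rationale: The paper's core topic is security vulnerabilities in, and defenses for, LLM system instructions. It is highly relevant to "Large Language Models" (10 points) because the entire paper revolves around LLMs, and highly relevant to "LLM Agents" (10 points) because the research context explicitly mentions agentic AI applications. It is relevant to "Instruction Tuning" (8 points) because it involves the design and modification of system instructions, to "Chain of Thought" (8 points) because the defense strategy uses a CoT reasoning model, and to "Tool Use" (8 points) because system instructions may contain tool-use information such as API credentials. Other keywords such as MoE, SLMs, and Scaling Laws have no direct connection to the paper and score 0.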

!!! tip deepseek-chat TL;DR

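The paper finds that LLM system instructions are at risk of leaking under encoding attacks and shows that a Chain-of-Thought-based instruction reshaping method can effectively reduce the attack success rate.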

摘要翻译

大型语言模型（LLM）中的系统指令通常用于在智能体AI应用中强制执行安全策略、定义智能体行为并保护敏感的操作上下文。这些指令可能包含敏感信息，如API凭证、内部策略和特权工作流定义，使得系统指令泄露成为OWASP LLM应用十大安全风险中强调的关键威胁。许多LLM应用在不引入推理模型开销的情况下，依赖基于拒绝的指令来阻止直接获取系统指令的请求，其隐含假设是禁止信息只能通过显式查询提取。我们提出了一种自动化评估框架，用于测试当提取请求被重新构建为编码或结构化输出任务时，系统指令是否仍能保持机密性。在四种常见模型和46条已验证系统指令的测试中，我们观察到结构化序列化攻击的成功率较高（>0.7），即模型虽拒绝直接提取请求，却会以请求的序列化格式披露受保护内容。我们进一步展示了一种基于单样本指令重构的缓解策略，该策略利用思维链推理模型实现，结果表明即使对系统指令的措辞和结构进行细微调整，也能显著降低攻击成功率，且无需重新训练模型。

摘要 (Abstract)

System Instructions in Large Language Models (LLMs) are commonly used to enforce safety policies, define agent behavior, and protect sensitive operational context in agentic AI applications. These instructions may contain sensitive information such as API credentials, internal policies, and privileged workflow definitions, making system instruction leakage a critical security risk highlighted in the OWASP Top 10 for LLM Applications. Without incurring the overhead costs of reasoning models, many LLM applications rely on refusal-based instructions that block direct requests for system instructions, implicitly assuming that prohibited information can only be extracted through explicit queries. We introduce an automated evaluation framework that tests whether system instructions remain confidential when extraction requests are re-framed as encoding or structured output tasks. Across four common models and 46 verified system instructions, we observe high attack success rates (> 0.7) for structured serialization where models refuse direct extraction requests but disclose protected content in the requested serialization formats. We further demonstrate a mitigation strategy based on one-shot instruction reshaping using a Chain-of-Thought reasoning model, indicating that even subtle changes in wording and structure of system instructions can significantly reduce attack success rate without requiring model retraining.

关键词: Large Language Models, System Instructions, Security Risk, Encoding Attacks, Chain-of-Thought, Agentic AI, Instruction Reshaping, Attack Mitigation
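
An evaluation of this kind needs a leak check that still fires when the protected text comes back encoded rather than verbatim. The sketch below is one way to score collected responses, assuming a unique canary string has been planted in the system instruction. It inspects the raw response plus any Base64 or hex runs it can decode, and reports the fraction of responses that leak. The canary approach, the two encodings covered, and the function names are assumptions, not the paper's framework.

```python
import base64
import binascii
import re


def decoded_views(text):
    """Return the raw text plus decodings of Base64-like and hex-like runs."""
    views = [text]
    for token in re.findall(r"[A-Za-z0-9+/=]{16,}", text):
        try:
            raw = base64.b64decode(token, validate=True)
            views.append(raw.decode("utf-8", "ignore"))
        except (binascii.Error, ValueError):
            pass  # not valid Base64, skip
    for token in re.findall(r"(?:[0-9a-fA-F]{2}){8,}", text):
        try:
            views.append(bytes.fromhex(token).decode("utf-8", "ignore"))
        except ValueError:
            pass  # not valid hex, skip
    return views


def leaked(response, canary):
    """True if the canary appears in the response, plainly or after decoding."""
    return any(canary in view for view in decoded_views(response))


def attack_success_rate(responses, canary):
    """Fraction of responses that disclose the canary."""
    return sum(leaked(r, canary) for r in responses) / len(responses)
```

A plain substring check would count an encoded disclosure as a successful refusal, which is the blind spot the abstract describes for refusal-based instructions. Other serializations (JSON escaping, ROT13, and so on) would need their own decoders.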

47. ❌ Revision or Re-Solving? Decomposing Second-Pass Gains in Multi-LLM Pipelines

作者: Jingjie Ning, Xueqi Li, Chengyu Yu 期刊/来源: arxiv 发布日期: 2026-04-01 arXiv链接: http://arxiv.org/abs/2604.01029v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	8.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

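Scoring rationale: The paper's core subject is multi-LLM revision pipelines, a direct application of large language models (LLMs), so "Large Language Models" receives 10 points. It examines how a second model reviews and improves the first model's output, which touches on self-improvement and multi-agent coordination, so "Self-Correction" and "Multi-agent Systems" each receive 8 points. The remaining keywords (MoE, SLMs, training methods, inference acceleration, AI for Science, and so on) are not directly addressed and receive 0.

!!! tip deepseek-chat TL;DR

The paper investigates where second-pass gains in multi-LLM revision pipelines come from. It finds that the gains cannot be attributed simply to genuine error correction; they depend on task structure, draft quality, and the type of information in the draft. On MCQ tasks, routing queries directly to the stronger model can be more effective, while on code generation two-stage prompting remains useful.

Abstract (Chinese translation)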

多LLM修订流程——即由第二个模型审阅并改进第一个模型生成的初稿——被广泛认为其效果提升源于真实的错误修正。我们通过一项受控分解实验对这一假设提出质疑，该实验利用四种匹配条件将第二轮的增益分解为三个可加性成分：重新求解、框架支撑与内容贡献。我们在三个涵盖知识密集型多项选择题与竞争性编程的基准测试中，使用两对模型组合评估了这一设计。结果表明，多LLM修订的增益并非单一整体，而是取决于任务结构、初稿质量及初稿信息的类型。在多项选择题任务中，由于答案空间受限且初稿提供的结构性指导有限，大部分增益与强模型的重新求解行为一致，直接将查询路由至强模型可能比修订弱初稿更有效。然而，在代码生成任务中，两阶段提示法仍具价值，因为即使语义为空的初稿也能提供显著的结构框架支撑，而弱初稿内容则可能产生负面影响。最后，角色反转实验表明，强初稿能明确使弱审阅模型受益。总体而言，我们的研究证明多LLM修订的效用受到任务结构与初稿质量的动态制约，因此需要更具针对性的流程设计，而非笼统的修订策略。

Abstract

Multi-LLM revision pipelines, in which a second model reviews and improves a draft produced by a first, are widely assumed to derive their gains from genuine error correction. We question this assumption with a controlled decomposition experiment that uses four matched conditions to separate second-pass gains into three additive components: re-solving, scaffold, and content. We evaluate this design across two model pairs on three benchmarks spanning knowledge-intensive MCQ and competitive programming. Our results show that the gains of multi-LLM revision are not monolithic, but depend on task structure, draft quality, and the type of draft information. On MCQ tasks, where the answer space is constrained and drafts provide little structural guidance, most gains are consistent with stronger-model re-solving, and directly routing queries to the stronger model can be more effective than revising a weak draft. On code generation tasks, however, two-stage prompting remains useful because even semantically null drafts can provide substantial structural scaffolding, while weak draft content can be harmful. Finally, role-reversed experiments show that strong drafts clearly benefit weak reviewers. Ultimately, our findings demonstrate that the utility of multi-LLM revision is dynamically bottlenecked by task structure and draft quality, necessitating more targeted pipeline designs rather than blanket revision strategies.

Keywords: multi-LLM revision, second-pass gains, re-solving, scaffold, draft quality, task structure, pipeline design, model coordination
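The abstract above says four matched conditions split the second-pass gain into three additive components, but it does not spell out the conditions. The sketch below shows one plausible reading as a telescoping difference of accuracies. The four condition names and the example numbers are assumptions made for illustration, not the paper's definitions or results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConditionAccuracies:
    """Accuracy under four matched conditions (hypothetical names)."""
    weak_alone: float         # first model answers on its own
    strong_alone: float       # second model answers with no draft
    strong_null_draft: float  # second model sees a draft-shaped but empty scaffold
    strong_real_draft: float  # second model revises the real first-pass draft


def decompose_gain(acc: ConditionAccuracies) -> dict[str, float]:
    """Split the total second-pass gain into three telescoping differences."""
    return {
        "re_solving": acc.strong_alone - acc.weak_alone,
        "scaffold": acc.strong_null_draft - acc.strong_alone,
        "content": acc.strong_real_draft - acc.strong_null_draft,
        "total": acc.strong_real_draft - acc.weak_alone,
    }


# Made-up accuracies purely for illustration; not results from the paper.
example = ConditionAccuracies(
    weak_alone=0.50,
    strong_alone=0.70,
    strong_null_draft=0.74,
    strong_real_draft=0.72,
)
parts = decompose_gain(example)
```

Because the three differences telescope, they sum to the total gain by construction. A negative `content` term, as in the made-up numbers here, would correspond to the abstract's remark that weak draft content can be harmful.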

48. ❌ Transfer learning for nonparametric Bayesian networks

Authors: Rafael Sojo, Pedro Larrañaga, Concha Bielza Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01021v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

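Scoring rationale: The paper focuses on transfer learning for nonparametric Bayesian networks. It proposes the PCS-TL and HC-TL algorithms for learning under scarce data and introduces metrics to avoid negative transfer. Every scoring keyword concerns large models, deep learning techniques, or specific AI applications (such as AI for Science), whereas this work studies Bayesian networks and transfer learning in classical machine learning. It does not involve large models, deep learning, or related techniques (MoE, RLHF, RAG, and so on), nor is it applied to a scientific domain such as bioinformatics. All keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

For learning nonparametric Bayesian networks under scarce data, the paper proposes two transfer learning algorithms (PCS-TL and HC-TL) together with metrics for avoiding negative transfer. Experiments show that the methods improve learning performance, which the authors argue shortens deployment time in industrial settings.

Abstract (Chinese translation)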

本文针对数据稀缺条件下的非参数贝叶斯网络估计问题，提出了两种迁移学习方法。我们设计了两种算法：一种基于约束的结构学习方法，称为PC稳定迁移学习（PCS-TL）；另一种基于评分的方法，称为爬山迁移学习（HC-TL）。同时，针对两种方法中可能出现的负迁移问题——即迁移学习对模型性能产生负面影响的情况——我们分别定义了特定的度量指标加以应对。在参数学习方面，我们提出了一种对数线性池化策略。为进行评估，我们以核密度估计贝叶斯网络（一种非参数贝叶斯网络）为研究对象，将其迁移学习性能与独立训练模型进行对比。实验数据来源于中小型及大型合成网络采样数据，以及UCI机器学习数据库中的数据集。我们通过对这些数据集添加噪声和修改来测试方法避免负迁移的能力。最后，通过弗里德曼检验及伯格曼-霍梅尔事后分析，为所提方法性能提升提供了统计显著性验证。实验表明，PCS-TL与HC-TL是提升数据稀缺条件下非参数贝叶斯网络学习性能的可靠算法，在实际工业环境中应用时可有效缩短网络部署所需时间。

Abstract

This paper introduces two transfer learning methodologies for estimating nonparametric Bayesian networks under scarce data. We propose two algorithms, a constraint-based structure learning method, called PC-stable-transfer learning (PCS-TL), and a score-based method, called hill climbing transfer learning (HC-TL). We also define particular metrics to tackle the negative transfer problem in each of them, a situation in which transfer learning has a negative impact on the model’s performance. Then, for the parameters, we propose a log-linear pooling approach. For the evaluation, we learn kernel density estimation Bayesian networks, a type of nonparametric Bayesian network, and compare their transfer learning performance with the models alone. To do so, we sample data from small, medium and large-sized synthetic networks and datasets from the UCI Machine Learning repository. Then, we add noise and modifications to these datasets to test their ability to avoid negative transfer. To conclude, we perform a Friedman test with a Bergmann-Hommel post-hoc analysis to show statistical proof of the enhanced experimental behavior of our methods. Thus, PCS-TL and HC-TL demonstrate to be reliable algorithms for improving the learning performance of a nonparametric Bayesian network with scarce data, which in real industrial environments implies a reduction in the required time to deploy the network.

Keywords: transfer learning, nonparametric Bayesian networks, scarce data, negative transfer, structure learning, kernel density estimation, PC-stable-transfer learning, hill climbing transfer learning
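The abstract mentions a log-linear pooling approach for the parameters without giving details. As a rough illustration of the general idea only, the sketch below applies the standard log-linear (geometric) pool to two discrete distributions, p(x) ∝ p_target(x)^w · p_source(x)^(1−w). The function name, the discrete setting, and the single weight `weight` are assumptions; the paper works with kernel density estimates and may weight them differently.

```python
import math


def log_linear_pool(p_target, p_source, weight):
    """Pool two discrete distributions: p(x) ∝ p_target(x)**w * p_source(x)**(1 - w)."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if len(p_target) != len(p_source):
        raise ValueError("distributions must share the same support")
    if any(p <= 0.0 for p in list(p_target) + list(p_source)):
        raise ValueError("this sketch requires strictly positive probabilities")
    # Combine in log space, then renormalize so the result sums to 1.
    unnormalized = [
        math.exp(weight * math.log(pt) + (1.0 - weight) * math.log(ps))
        for pt, ps in zip(p_target, p_source)
    ]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Illustrative inputs: a scarce-data target estimate and a source estimate.
pooled = log_linear_pool([0.5, 0.5], [0.9, 0.1], weight=0.5)
```

With `weight=1.0` the pool reduces to the target distribution and with `weight=0.0` to the source, so the weight controls how much is transferred.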

49. ❌ OrgAgent: Organize Your Multi-Agent System like a Company

Authors: Yiru Wang, Xinyue Shen, Yaohui Han, Michael Backes, Pin-Yu Chen, Tsung-Yi Ho Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01020v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

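Scoring rationale: The paper's core subject is the organizational structure of LLM-driven multi-agent systems, which is highly relevant to "LLM Agents" and "Multi-agent Systems" (10 points each). It involves complex reasoning, layered verification, and coordination behavior, giving it some relevance to "Chain of Thought", "System 2 Thinking", and "Self-Correction" (5 points each). Other keywords such as MoE, SFT, and RAG are not covered in the paper and receive 0.

!!! tip deepseek-chat TL;DR

The paper studies how to organize LLM-based multi-agent systems effectively and proposes OrgAgent, a company-style hierarchical framework. Experiments show that the hierarchy improves task performance while reducing token consumption in most settings.

Abstract (Chinese translation)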

尽管基于大语言模型的多智能体系统已展现出处理复杂推理任务的强大潜力，如何有效组织多个智能体仍是一个开放性问题。本文提出OrgAgent，一种公司式分层多智能体框架，将协作划分为治理层、执行层与合规层。OrgAgent将多智能体推理分解为三个层级：治理层负责规划与资源分配，执行层负责任务求解与复核，合规层负责最终答案控制。通过对该框架在推理任务、大语言模型、执行模式与执行策略等多维度进行评估，我们发现采用公司式层级结构组织的多智能体系统普遍优于其他组织结构。此外，在多数设定下，分层协调相较于扁平协作能降低令牌消耗。例如，在SQuAD 2.0数据集上，对于GPT-OSS-120B模型，分层设置相较于扁平多智能体系统性能提升102.73%，同时令牌使用量减少74.52%。进一步分析表明，当任务受益于稳定的技能分配、受控的信息流及分层验证机制时，层级结构的优势最为显著。总体而言，我们的研究揭示了组织结构作为多智能体推理的关键影响因素，不仅决定了系统效能与成本，也塑造了协作行为模式。

Abstract

While large language model-based multi-agent systems have shown strong potential for complex reasoning, how to effectively organize multiple agents remains an open question. In this paper, we introduce OrgAgent, a company-style hierarchical multi-agent framework that separates collaboration into governance, execution, and compliance layers. OrgAgent decomposes multi-agent reasoning into three layers: a governance layer for planning and resource allocation, an execution layer for task solving and review, and a compliance layer for final answer control. By evaluating the framework across reasoning tasks, LLMs, execution modes, and execution policies, we find that multi-agent systems organized in a company-style hierarchy generally outperform other organizational structures. Besides, hierarchical coordination also reduces token consumption relative to flat collaboration in most settings. For example, for GPT-OSS-120B, the hierarchical setting improves performance over flat multi-agent system by 102.73% while reducing token usage by 74.52% on SQuAD 2.0. Further analysis shows that hierarchy helps most when tasks benefit from stable skill assignment, controlled information flow, and layered verification. Overall, our findings highlight organizational structure as an important factor in multi-agent reasoning, shaping not only effectiveness and cost, but also coordination behavior.

Keywords: multi-agent systems, large language models, hierarchical organization, agent coordination, reasoning tasks, token efficiency, company-style framework, LLM agents

50. ❌ Query-Conditioned Evidential Keyframe Sampling for MLLM-Based Long-Form Video Understanding

Authors: Yiheng Wang, Lichen Zhu, Yueqian Lin, Yudong Liu, Jingyang Zhang, Hai “Helen” Li, Yiran Chen Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01002v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	8.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

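Scoring rationale: The paper addresses the need for keyframe sampling when MLLMs (multimodal large language models) are applied to long-form video understanding under context-length and compute constraints, and proposes an evidence-driven keyframe sampling framework grounded in information bottleneck theory. It is therefore highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points), since MLLMs are the multimodal extension of LLMs and serve as the paper's base models. It has some relevance to "Context Window Extension OR Long Context LLMs" (8 points), because the paper directly targets the limits MLLMs face on long videos (long contexts). Other keywords, such as MoE, SLMs, training techniques, inference acceleration, and AI for Science, do not appear in the title or abstract and receive 0.

!!! tip deepseek-chat TL;DR

The paper targets the keyframe sampling problem that arises when multimodal large language models are applied to long-form video understanding under context-length and compute limits. It proposes an evidence-driven keyframe sampling framework based on information bottleneck theory and conditional mutual information maximization. Experiments show that the method outperforms existing sampling strategies under strict token budgets and improves training efficiency.

Abstract (Chinese translation)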

多模态大语言模型（MLLMs）在视频问答任务中展现出强大性能，但其在长视频中的应用受限于有限的上下文长度与计算成本，使得关键帧采样至关重要。现有方法通常依赖语义相关性或强化学习，这些方法要么无法捕捉证据性线索，要么受困于低效的组合优化。本研究提出一种基于信息瓶颈理论的证据驱动关键帧采样框架。我们将关键帧选择问题形式化为最大化所选帧与查询之间的条件互信息，从而提供一个原则性目标，以反映每一帧对回答问题的贡献。为使该目标可处理，我们利用其结构推导出一种分解优化方法，将子集选择问题简化为独立的帧级评分。我们进一步引入一个基于查询条件的证据评分网络，该网络通过对比目标进行训练，以高效估计证据重要性。在长视频理解基准测试上的实验表明，在严格的令牌预算下，我们的方法持续优于先前的采样策略，同时显著提升了训练效率。

Abstract

Multimodal Large Language Models (MLLMs) have shown strong performance on video question answering, but their application to long-form videos is constrained by limited context length and computational cost, making keyframe sampling essential. Existing approaches typically rely on semantic relevance or reinforcement learning, which either fail to capture evidential clues or suffer from inefficient combinatorial optimization. In this work, we propose an evidence-driven keyframe sampling framework grounded in information bottleneck theory. We formulate keyframe selection as maximizing the conditional mutual information between selected frames and the query, providing a principled objective that reflects each frame’s contribution to answering the question. To make this objective tractable, we exploit its structure to derive a decomposed optimization that reduces subset selection to independent frame-level scoring. We further introduce a query-conditioned evidence scoring network trained with a contrastive objective to estimate evidential importance efficiently. Experiments on long-form video understanding benchmarks show that our method consistently outperforms prior sampling strategies under strict token budgets, while significantly improving training efficiency.

Keywords: Multimodal Large Language Models, MLLMs, long-form video understanding, keyframe sampling, information bottleneck theory, conditional mutual information, evidence-driven, query-conditioned
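The abstract states that the decomposed objective reduces subset selection to independent frame-level scoring. Once every frame has its own score, picking frames under a budget becomes a top-k selection, which the sketch below illustrates. The function name, the example scores, and the choice to return indices in temporal order are assumptions; in the paper the scores would come from the query-conditioned evidence scoring network, which is not reproduced here.

```python
def select_keyframes(frame_scores, budget):
    """Pick the `budget` highest-scoring frames; return their indices in temporal order."""
    if budget < 0:
        raise ValueError("budget must be non-negative")
    # Rank frame indices by score, best first (ties keep the earlier frame first).
    ranked = sorted(range(len(frame_scores)), key=lambda i: frame_scores[i], reverse=True)
    # Keep the top `budget` frames and restore their original temporal order.
    return sorted(ranked[:budget])


# Placeholder per-frame evidence scores for a five-frame clip.
scores = [0.1, 0.9, 0.3, 0.8, 0.2]
chosen = select_keyframes(scores, budget=2)
```

Because each frame is scored independently, the selection costs one sort instead of a search over all frame subsets, which is the efficiency argument the abstract makes against combinatorial optimization.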

51. ❌ OmniMem: Autoresearch-Guided Discovery of Lifelong Multimodal Agent Memory

Authors: Jiaqi Liu, Zipeng Ling, Shi Qiu, Yanqing Liu, Siwei Han, Peng Xia, Haoqin Tu, Zeyu Zheng, Cihang Xie, Charles Fleming, Mingyu Ding, Huaxiu Yao Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01007v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

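Scoring rationale: The paper's core subject is a lifelong multimodal memory framework for AI agents. It is highly relevant to "LLM Agents OR Autonomous Agents OR Agentic Workflow" (10 points) because it explicitly studies long-term memory for AI agents. It has some relevance to "Retrieval-Augmented Generation OR RAG OR Retrieval-Generation" (5 points) because the memory system involves retrieval strategies, and to "Self-Correction OR Self-Improvement OR Self-Reflection" (5 points) because the autonomous research pipeline involves diagnosing failure modes and self-improvement. It also has some relevance to "Large Language Models OR LLMs OR Foundation Models" (5 points): AI agents are typically built on large models, but the paper does not explicitly discuss LLM technical details. The remaining keywords have no direct connection to the paper and receive 0.

!!! tip deepseek-chat TL;DR

The paper addresses the memory bottleneck of how AI agents retain, organize, and recall multimodal experiences over long time horizons. An autonomous research pipeline discovers the OmniMem framework, which improves F1 by 411% on LoCoMo and 214% on Mem-Gallery relative to the initial configurations.

Abstract (Chinese translation)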

人工智能体在日益延长的运行时间跨度中运作，但其保留、组织与调用多模态经验的能力仍是关键瓶颈。构建有效的终身记忆系统需探索涵盖架构、检索策略、提示工程与数据管道的庞大设计空间；该空间过于庞大且相互关联，难以通过人工探索或传统自动机器学习进行有效搜索。我们部署了一套自主研究流程，用以发现OmniMem——一个面向终身人工智能体的统一多模态记忆框架。从一个简单基线（在LoCoMo基准上F1=0.117）出发，该流程在无人介入内循环的情况下，自主执行了约50项跨两个基准的实验，诊断故障模式、提出架构改进方案并修复数据管道缺陷。最终系统在两个基准上均达到最优性能：相较于初始配置，在LoCoMo上F1值提升+411%（0.117→0.598），在Mem-Gallery上提升+214%（0.254→0.797）。关键的是，最具影响力的发现并非超参数调整：错误修复（+175%）、架构变更（+44%）以及提示工程（在特定类别上+188%）各自对性能提升的贡献均超过所有超参数调优的总和，这证明了该流程具备传统自动机器学习无法企及的根本性能力。我们提出了六类发现类型的分类体系，并归纳了使多模态记忆特别适合自主研究的四项特性，为将自主研究流程应用于其他人工智能系统领域提供指引。代码发布于https://github.com/aiming-lab/OmniMem。

Abstract

AI agents increasingly operate over extended time horizons, yet their ability to retain, organize, and recall multimodal experiences remains a critical bottleneck. Building effective lifelong memory requires navigating a vast design space spanning architecture, retrieval strategies, prompt engineering, and data pipelines; this space is too large and interconnected for manual exploration or traditional AutoML to explore effectively. We deploy an autonomous research pipeline to discover OmniMem, a unified multimodal memory framework for lifelong AI agents. Starting from a naïve baseline (F1=0.117 on LoCoMo), the pipeline autonomously executes ~50 experiments across two benchmarks, diagnosing failure modes, proposing architectural modifications, and repairing data pipeline bugs, all without human intervention in the inner loop. The resulting system achieves state-of-the-art on both benchmarks, improving F1 by +411% on LoCoMo (0.117→0.598) and +214% on Mem-Gallery (0.254→0.797) relative to the initial configurations. Critically, the most impactful discoveries are not hyperparameter adjustments: bug fixes (+175%), architectural changes (+44%), and prompt engineering (+188% on specific categories) each individually exceed the cumulative contribution of all hyperparameter tuning, demonstrating capabilities fundamentally beyond the reach of traditional AutoML. We provide a taxonomy of six discovery types and identify four properties that make multimodal memory particularly suited for autoresearch, offering guidance for applying autonomous research pipelines to other AI system domains. Code is available at https://github.com/aiming-lab/OmniMem.

Keywords: AI agents, lifelong memory, multimodal memory, autonomous research pipeline, retrieval strategies, benchmark evaluation, architectural modifications, autoresearch
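The percentage gains quoted in the abstract are relative improvements over the starting F1, not absolute point differences. The short check below recomputes them from the F1 values given in the abstract; the helper name `relative_gain_percent` is introduced here for illustration.

```python
def relative_gain_percent(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


# F1 values as reported in the abstract.
locomo_gain = relative_gain_percent(0.117, 0.598)
mem_gallery_gain = relative_gain_percent(0.254, 0.797)

print(f"LoCoMo: +{locomo_gain:.0f}%")
print(f"Mem-Gallery: +{mem_gallery_gain:.0f}%")
```

Rounded to whole percentages, the two values match the +411% and +214% figures in the abstract.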

52. ❌ EgoSim: Egocentric World Simulator for Embodied Interaction Generation

Authors: Jinkun Hao, Mingda Jia, Ruiyan Wang, Xihui Liu, Ran Yi, Lizhuang Ma, Jiangmiao Pang, Xudong Xu Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01001v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	5.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

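Scoring rationale: EgoSim targets embodied interaction simulation in computer vision and robotics. Its core contribution is a closed-loop egocentric world simulator that generates spatially consistent interaction videos and persistently updates the underlying 3D scene state. The paper covers 3D scene modeling, point cloud extraction, camera trajectories, and embodied interaction generation, placing it at the intersection of computer vision, graphics, and robotics. The scoring keywords all concern large models, deep learning techniques, or AI for Science, and the paper's content has no direct connection to them. The only potentially relevant keyword is "World Models AND General World Models", because the paper refers to a "world simulator" and "world states". Here, however, the term refers to simulating the physical world rather than the general world-model concept used with large language models, so it receives 5 points (some relevance). All other keywords are unrelated and receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes EgoSim, a closed-loop egocentric world simulator that addresses two limitations of existing simulators: structural drift under viewpoint changes and the inability to update world state across multi-stage interactions. By modeling an updatable 3D scene state and designing a scalable data collection pipeline, it significantly improves visual quality, spatial consistency, and generalization to complex scenes and in-the-wild dexterous interactions.

Abstract (Chinese translation)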

我们提出EgoSim，一种闭环的以自我为中心的世界模拟器，能够生成空间一致的交互视频，并持续更新底层三维场景状态以实现连续模拟。现有的以自我为中心模拟器要么缺乏明确的三维基础，导致视角变化下的结构漂移；要么将场景视为静态，无法在多阶段交互中更新世界状态。EgoSim通过将三维场景建模为可更新的世界状态，解决了这两项局限。我们通过几何-动作感知的观测模拟模型生成具身交互，其空间一致性由交互感知的状态更新模块保障。为克服因难以获取密集对齐的场景-交互训练对而造成的关键数据瓶颈，我们设计了一个可扩展的流程，能够从野外大规模单目以自我为中心视频中提取静态点云、相机轨迹和具身动作。我们进一步引入了EgoCap采集系统，该系统支持使用未经校准的智能手机进行低成本的真实世界数据收集。大量实验表明，EgoSim在视觉质量、空间一致性以及对复杂场景和野外灵巧交互的泛化能力方面显著优于现有方法，同时支持跨具身迁移至机器人操作。代码与数据集即将开源。项目页面位于egosimulator.github.io。

Abstract

We introduce EgoSim, a closed-loop egocentric world simulator that generates spatially consistent interaction videos and persistently updates the underlying 3D scene state for continuous simulation. Existing egocentric simulators either lack explicit 3D grounding, causing structural drift under viewpoint changes, or treat the scene as static, failing to update world states across multi-stage interactions. EgoSim addresses both limitations by modeling 3D scenes as updatable world states. We generate embodiment interactions via a Geometry-action-aware Observation Simulation model, with spatial consistency from an Interaction-aware State Updating module. To overcome the critical data bottleneck posed by the difficulty in acquiring densely aligned scene-interaction training pairs, we design a scalable pipeline that extracts static point clouds, camera trajectories, and embodiment actions from in-the-wild large-scale monocular egocentric videos. We further introduce EgoCap, a capture system that enables low-cost real-world data collection with uncalibrated smartphones. Extensive experiments demonstrate that EgoSim significantly outperforms existing methods in terms of visual quality, spatial consistency, and generalization to complex scenes and in-the-wild dexterous interactions, while supporting cross-embodiment transfer to robotic manipulation. Codes and datasets will be open soon. The project page is at egosimulator.github.io.

Keywords: egocentric world simulator, embodied interaction generation, 3D scene state updating, spatial consistency, geometry-action-aware observation simulation, interaction-aware state updating, point cloud extraction, cross-embodiment transfer

53. ❌ Do Phone-Use Agents Respect Your Privacy?

Authors: Zhengyang Tang, Ke Ji, Xidong Wang, Zihan Ye, Xinyuan Wang, Yiduo Guo, Ziniu Li, Chenxin Li, Jingyuan Hu, Shunian Chen, Tongxu Luo, Jiaxi Bi, Zeyu Qin, Shaobo Wang, Xin Lai, Pengyuan Lyu, Junyi Li, Can Xu, Chengquan Zhang, Han Hu, Ming Yan, Benyou Wang Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00986v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

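Scoring rationale: The paper studies privacy-compliant behavior in phone-use agents, which falls within LLM Agents/Autonomous Agents (highly relevant, 10 points). It touches on Tool Use/Function Calling (some relevance, 5 points) because the agents must operate phone apps to complete tasks. The paper reports evaluating five frontier models, so it has some relevance to Large Language Models (5 points). Other keywords, such as MoE, Scaling Laws, training methods, inference optimization, and AI for Science, are not covered and receive 0.

!!! tip deepseek-chat TL;DR

The paper studies whether phone-use agents respect user privacy while completing benign mobile tasks. Using the MyPhoneBench evaluation framework it introduces, it finds privacy problems in current agents, such as filling in optional personal fields the task does not require, and shows that evaluating task success alone overestimates deployment readiness.

Abstract (Chinese translation)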

本研究旨在探究手机使用智能体在执行良性移动任务时是否尊重用户隐私。这一问题长期难以解答，原因在于：针对手机使用智能体的隐私合规行为尚未被具体定义，且普通应用程序在执行过程中不会明确揭示智能体向哪些表单条目键入了何种数据。为使该问题可量化，我们提出了MyPhoneBench，一个可验证的移动智能体隐私行为评估框架。我们将尊重隐私的手机使用操作化为三个维度：基于权限的访问、最小化披露以及用户可控的记忆机制，并通过一个最小化隐私合约iMy来实现。该框架与经过工具化的模拟应用程序及基于规则的审计机制相结合，使得不必要的权限请求、欺骗性再披露以及非必要的表单填写行为变得可观测且可复现。通过对10款移动应用上的300项任务进行测试，并涵盖五种前沿模型，我们发现任务成功率、隐私合规的任务完成度以及后续会话中对已保存偏好的使用能力是三种不同的能力维度，且没有任何单一模型能在所有三个维度上表现最优。同时评估任务成功率和隐私合规性会重塑仅基于单一指标的模型排序。所有模型中最普遍存在的失败模式是简单的数据最小化问题：智能体仍会填写任务非必需的、可选的个人条目。这些结果表明，隐私泄露问题源于智能体在执行良性任务时“过度热心”的操作，而仅基于成功率的评估会高估当前手机使用智能体的实际部署成熟度。所有代码、模拟应用程序及智能体执行轨迹均已公开，详见 https://github.com/tangzhy/MyPhoneBench。

Abstract

We study whether phone-use agents respect privacy while completing benign mobile tasks. This question has remained hard to answer because privacy-compliant behavior is not operationalized for phone-use agents, and ordinary apps do not reveal exactly what data agents type into which form entries during execution. To make this question measurable, we introduce MyPhoneBench, a verifiable evaluation framework for privacy behavior in mobile agents. We operationalize privacy-respecting phone use as permissioned access, minimal disclosure, and user-controlled memory through a minimal privacy contract, iMy, and pair it with instrumented mock apps plus rule-based auditing that make unnecessary permission requests, deceptive re-disclosure, and unnecessary form filling observable and reproducible. Across five frontier models on 10 mobile apps and 300 tasks, we find that task success, privacy-compliant task completion, and later-session use of saved preferences are distinct capabilities, and no single model dominates all three. Evaluating success and privacy jointly reshuffles the model ordering relative to either metric alone. The most persistent failure mode across models is simple data minimization: agents still fill optional personal entries that the task does not require. These results show that privacy failures arise from over-helpful execution of benign tasks, and that success-only evaluation overestimates the deployment readiness of current phone-use agents. All code, mock apps, and agent trajectories are publicly available at https://github.com/tangzhy/MyPhoneBench.

Keywords: phone-use agents, privacy evaluation, mobile agents, MyPhoneBench, data minimization, agent behavior, privacy compliance, task success
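The abstract describes rule-based auditing that makes unnecessary form filling observable. A minimal sketch of what such a data-minimization rule could look like is given below. It is not MyPhoneBench's actual auditing code: the function name, the field names, and the shape of the inputs (lists of field identifiers) are all hypothetical.

```python
def audit_form_filling(required_fields, filled_fields):
    """Compare the fields a task requires with the fields an agent actually filled."""
    required = set(required_fields)
    filled = set(filled_fields)
    return {
        # Over-disclosure: filled although the task did not need it.
        "unnecessary": sorted(filled - required),
        # Under-completion: required by the task but left empty.
        "missing": sorted(required - filled),
    }


# Hypothetical booking task that only needs a name and an email address.
report = audit_form_filling(
    required_fields=["name", "email"],
    filled_fields=["name", "email", "birthday", "home_address"],
)
```

In this example the task itself succeeds because nothing is missing, yet two optional personal fields were disclosed. That is the kind of gap between task success and privacy-compliant completion the abstract reports.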

54. ❌ Bridging Structured Knowledge and Data: A Unified Framework with Finance Applications

Authors: Yi Cao, Zexun Chen, Lin William Cong, Heqing Shi Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00987v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

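Scoring rationale: The paper proposes SKINNs, a general estimation framework that embeds structured knowledge (theory, simulation, prior learning, or cross-domain insights) as differentiable constraints within flexible neural function approximation. The work focuses on econometrics and financial applications such as option pricing, which places it under AI applied to science and finance. However, it does not explicitly address large language models (LLMs), innovations in deep learning fundamentals, or any of the specific large-model keywords listed (MoE, Scaling Laws, RLHF, and so on). It combines neural networks with structured knowledge, which is not a core topic in current large-model research. It therefore relates only to "AI for Science OR Bioinformatics OR Cheminformatics" (relevance 5), because finance can be viewed as a subfield of AI applied to science and finance, although the paper does not mention bioinformatics or cheminformatics. All other keywords are unrelated (relevance 0).

!!! tip deepseek-chat TL;DR

The paper proposes SKINNs, a unified estimation framework that embeds structured knowledge as differentiable constraints in neural networks to combine model-based reasoning with high-dimensional, data-driven estimation, and reports improved out-of-sample valuation and hedging performance in an option-pricing application.

Abstract (Chinese translation)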

我们提出结构化知识信息神经网络（SKINNs），这是一种统一的估计框架，它将理论、模拟、先前学习或跨领域的洞见作为可微分约束嵌入到灵活的神经函数逼近中。SKINNs 通过单一优化问题联合估计神经网络参数和具有经济意义的结构参数，不仅通过观测数据，还借助配置点法在更广泛的输入域上强制理论一致性，从而囊括了函数型广义矩估计、贝叶斯更新、迁移学习、物理信息神经网络以及代理建模等方法。SKINNs 定义了一类 M-估计量，这些估计量具有一致性和渐近正态性，具备根号 N 收敛速度、三明治协方差结构，并在模型误设下能够恢复伪真实参数。我们在联合灵活性条件下确立了结构参数的可识别性，推导了凸代理中分布漂移下的泛化误差与目标风险界，并对控制偏差-方差权衡的加权参数给出了限制最优的表征。在期权定价的金融应用示例中，SKINNs 提升了样本外估值和对冲表现，尤其在较长期限和高波动时期，同时恢复了具有经济可解释性的结构参数，其稳定性相较于传统校准方法有所提高。更广泛而言，SKINNs 为结合基于模型的推理与高维数据驱动估计提供了一个通用的计量经济学框架。

摘要 (Abstract)

We develop Structured-Knowledge-Informed Neural Networks (SKINNs), a unified estimation framework that embeds theoretical, simulated, previously learned, or cross-domain insights as differentiable constraints within flexible neural function approximation. SKINNs jointly estimate neural network parameters and economically meaningful structural parameters in a single optimization problem, enforcing theoretical consistency not only on observed data but over a broader input domain through collocation, and therefore nesting approaches such as functional GMM, Bayesian updating, transfer learning, PINNs, and surrogate modeling. SKINNs define a class of M-estimators that are consistent and asymptotically normal with root-N convergence, sandwich covariance, and recovery of pseudo-true parameters under misspecification. We establish identification of structural parameters under joint flexibility, derive generalization and target-risk bounds under distributional shift in a convex proxy, and provide a restricted-optimal characterization of the weighting parameter that governs the bias-variance tradeoff. In an illustrative financial application to option pricing, SKINNs improve out-of-sample valuation and hedging performance, particularly at longer horizons and during high-volatility regimes, while recovering economically interpretable structural parameters with improved stability relative to conventional calibration. More broadly, SKINNs provide a general econometric framework for combining model-based reasoning with high-dimensional, data-driven estimation.

Keywords: Structured-Knowledge-Informed Neural Networks, SKINNs, differentiable constraints, neural function approximation, econometric framework, option pricing, financial application, bias-variance tradeoff
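The abstract above describes a single objective that fits data while enforcing theoretical consistency on collocation points, with a weighting parameter governing the tradeoff. A minimal sketch of an objective with that shape follows. It is an illustration under assumptions, not the authors' estimator: the quadratic model `g` stands in for the neural network, the linear rule `h` stands in for the structural theory, and the data and collocation points are made up.

```python
def skinn_style_loss(w, theta, data, collocation, lam):
    """Data-fit loss plus a weighted theory-consistency penalty.

    w      : coefficients (w0, w1, w2) of a toy flexible model
             g(x) = w0 + w1*x + w2*x^2, standing in for the neural network
    theta  : structural parameter of a toy theory h(x) = theta * x
    lam    : weight on the penalty (the bias-variance knob)
    """
    def g(x):
        return w[0] + w[1] * x + w[2] * x * x

    def h(x):
        return theta * x

    data_loss = sum((y - g(x)) ** 2 for x, y in data) / len(data)
    # Theory is enforced on collocation points, which need not be observed inputs.
    penalty = sum((g(c) - h(c)) ** 2 for c in collocation) / len(collocation)
    return data_loss + lam * penalty


data = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0)]   # hypothetical observations, y = 2x
collocation = [3.0, 4.0, 5.0]                 # points outside the observed range

consistent = skinn_style_loss((0.0, 2.0, 0.0), 2.0, data, collocation, lam=1.0)
inconsistent = skinn_style_loss((0.0, 2.0, 0.0), 1.0, data, collocation, lam=1.0)
print(consistent, inconsistent)
```

A real estimator would minimize this loss jointly over `w` and `theta` with a gradient-based optimizer; the sketch only evaluates it to show how a structural parameter that disagrees with the fitted function is penalized away from the observed data.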

55. ❌ Flow-based Policy With Distributional Reinforcement Learning in Trajectory Optimization

Authors: Ruijie Hao, Longfei Zhang, Yang Dai, Yang Ma, Xingxing Liang, Guangquan Cheng Venue/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00977v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

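Scoring rationale: The paper focuses on reinforcement learning (RL) algorithm design. It proposes FP-DRL, which combines a flow-matching policy with distributional RL to address the restricted policy distributions and loss of return information in traditional RL. The scoring keywords all concern large language models, deep learning fundamentals, or AI-for-science applications, whereas this paper studies RL for control tasks and does not touch on large models, deep learning architectures, training methods, inference optimization, AI agents, or scientific AI applications. All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

To address the restricted policy distributions and loss of return information in traditional RL, the paper proposes FP-DRL, an algorithm that combines a flow-matching policy with distributional RL, and reports state-of-the-art performance on most MuJoCo control tasks.

Abstract (Chinese translation)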

强化学习（Reinforcement Learning, RL）已被证明在解决复杂控制与决策任务方面极为有效。然而，在大多数传统RL算法中，策略通常被参数化为对角高斯分布，这限制了策略捕获多模态分布的能力，使其难以覆盖多解问题中的全部最优解，并且回报被简化为均值，失去了其多模态特性，从而无法为策略更新提供充分指导。针对这些问题，我们提出了一种称为基于流策略的分布强化学习（flow-based policy with distributional RL, FP-DRL）的RL算法。该算法利用流匹配（flow matching）对策略进行建模，既保证了计算效率，又具备了拟合复杂分布的能力。此外，它采用分布强化学习（distributional RL）方法对整体回报分布进行建模和优化，从而更有效地指导多模态策略更新并提升智能体性能。在MuJoCo基准测试上的实验结果表明，FP-DRL算法在大多数MuJoCo控制任务中实现了最先进的性能，同时展现出流策略卓越的表征能力。

摘要 (Abstract)

Reinforcement Learning (RL) has proven highly effective in addressing complex control and decision-making tasks. However, in most traditional RL algorithms, the policy is typically parameterized as a diagonal Gaussian distribution, which constrains the policy from capturing multimodal distributions, making it difficult to cover the full range of optimal solutions in multi-solution problems, and the return is reduced to a mean value, losing its multimodal nature and thus providing insufficient guidance for policy updates. In response to these problems, we propose an RL algorithm termed flow-based policy with distributional RL (FP-DRL). This algorithm models the policy using flow matching, which offers both computational efficiency and the capacity to fit complex distributions. Additionally, it employs a distributional RL approach to model and optimize the entire return distribution, thereby more effectively guiding multimodal policy updates and improving agent performance. Experimental trials on MuJoCo benchmarks demonstrate that the FP-DRL algorithm achieves state-of-the-art (SOTA) performance in most MuJoCo control tasks while exhibiting superior representation capability of the flow policy.

Keywords: Reinforcement Learning, Flow-based Policy, Distributional RL, Multimodal Distributions, MuJoCo, Control Tasks, SOTA Performance
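To make the contrast with a diagonal Gaussian policy concrete, the sketch below shows how a flow-style policy turns Gaussian noise into a bimodal action distribution by integrating a velocity field from t = 0 to t = 1. It covers only the sampling half of the idea and is not FP-DRL itself: `toy_velocity` is a hand-written, hypothetical field with two modes at -1 and +1, where a trained policy would use a learned network conditioned on the state.

```python
import random


def toy_velocity(a, t):
    """Hypothetical velocity field carrying a sample toward the nearer of two modes.

    A trained flow policy would replace this with a network v(a, t, state).
    """
    target = 1.0 if a >= 0.0 else -1.0
    return (target - a) / (1.0 - t)


def sample_action(a0, steps=10):
    """Integrate da/dt = v(a, t) from t=0 to t=1 with Euler steps."""
    a = a0
    for k in range(steps):
        t = k / steps                      # t stays strictly below 1
        a += toy_velocity(a, t) / steps
    return a


rng = random.Random(0)
actions = [sample_action(rng.gauss(0.0, 1.0)) for _ in range(5)]
print([round(a, 6) for a in actions])
```

Every sample lands on one of the two modes, depending on the sign of the initial noise. A single Gaussian over actions cannot represent that shape, which is the limitation the abstract attributes to conventional policies.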

56. ❌ WARP: Guaranteed Inner-Layer Repair of NLP Transformers

Authors: Hsin-Ling Hsu, Min-Yu Chen, Nai-Chia Chen, Yan-Ru Chen, Yi-Ling Chang, Fang Yu Venue/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00938v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

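Scoring rationale: The paper studies adversarial repair of Transformer models (the WARP framework), focusing on model robustness and verifiable repair rather than on innovations in large-model fundamentals or scientific applications. The keywords all concern large-model training, inference, applications, or specific techniques (MoE, quantization, RAG, and so on) and have no direct connection to the paper's model-repair topic.

!!! tip deepseek-chat TL;DR

The paper proposes WARP, a framework that uses constrained optimization to perform verifiable inner-layer repair of Transformer models, addressing the limits of existing methods in verifiability and parameter search space while improving robustness to adversarial inputs.

Abstract (Chinese translation)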

基于Transformer的自然语言处理模型仍易受对抗性扰动的攻击，而现有修复方法面临一个根本性权衡：基于梯度的方法具有灵活性但缺乏可验证性且容易过拟合；能够提供修复保证的方法则仅限于最终层或小型网络，严重限制了可用于修复的参数搜索空间。我们提出WARP（可证明性的权重调整修复），这是一种基于约束的修复框架，可将修复范围扩展至Transformer模型的最后一层之外。WARP将修复问题构建为一个凸二次规划，该规划源自对数几率差距的一阶线性化，从而实现了在高维参数空间上的可处理优化。在一阶近似成立的条件下，该构建方式可产生三项针对每个样本的保证：（i）确保修复后输入正确分类的正边界约束，（ii）对指定保留集的保持约束，以及（iii）基于Lipschitz连续性导出的认证鲁棒性半径。为确保在不同模型架构间的可行性，我们引入了一种基于敏感度的预处理步骤，以相应地调整优化空间的条件。我们进一步证明，在温和假设下，迭代优化过程会收敛至满足所有修复约束的解。在不同层架构的仅编码器Transformer模型上的实证评估验证了这些保证在实践中成立，同时提升了对对抗性输入的鲁棒性。我们的结果表明，通过基于原则的约束优化，可实现具有保证、可泛化的Transformer模型修复。

摘要 (Abstract)

Transformer-based NLP models remain vulnerable to adversarial perturbations, yet existing repair methods face a fundamental trade-off: gradient-based approaches offer flexibility but lack verifiability and often overfit; methods that do provide repair guarantees are restricted to the final layer or small networks, significantly limiting the parameter search space available for repair. We present WARP (Weight-Adjusted Repair with Provability), a constraint-based repair framework that extends repair beyond the last layer of Transformer models. WARP formulates repair as a convex quadratic program derived from a first-order linearization of the logit gap, enabling tractable optimization over a high-dimensional parameter space. Under the condition that the first-order approximation holds, this formulation induces three per-sample guarantees: (i) a positive margin constraint ensuring correct classification on repaired inputs, (ii) preservation constraints over a designated remain set, and (iii) a certified robustness radius derived from Lipschitz continuity. To ensure feasibility across varying model architectures, we introduce a sensitivity-based preprocessing step that conditions the optimization landscape accordingly. We further show that the iterative optimization procedure converges to solutions satisfying all repair constraints under mild assumptions. Empirical evaluation on encoder-only Transformers with varying layer architectures validates that these guarantees hold in practice while improving robustness to adversarial inputs. Our results demonstrate that guaranteed, generalizable Transformer repair is achievable through principled constraint-based optimization.

Keywords: Transformer repair, adversarial robustness, constraint-based optimization, provable guarantees, weight adjustment, certified robustness, NLP models, encoder-only Transformers
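The abstract above frames repair as a convex quadratic program built from a first-order linearization of the logit gap. In the simplest case (one sample, one margin constraint, no preservation constraints) that program has a closed-form minimum-norm solution, sketched below. This is a toy illustration of the linearized-margin idea only, not WARP: the function name `min_norm_repair` and the linear scorer are hypothetical, and the preservation constraints and certified radius from the abstract are not modeled.

```python
def min_norm_repair(gap, grad, margin):
    """Smallest-L2 weight update d with gap + grad . d >= margin.

    This is the closed-form solution of the one-constraint quadratic program
        minimize ||d||^2  subject to  gap + grad . d >= margin,
    where `gap` is the current logit gap and `grad` its gradient w.r.t. the weights.
    """
    deficit = margin - gap
    if deficit <= 0.0:
        return [0.0] * len(grad)          # constraint already holds
    norm_sq = sum(g * g for g in grad)
    if norm_sq == 0.0:
        raise ValueError("zero gradient: the linearized constraint is infeasible")
    scale = deficit / norm_sq
    return [scale * g for g in grad]


# Toy linear scorer: gap(w) = w . x, so the gradient w.r.t. w is x and the
# first-order linearization is exact.
w = [1.0, -1.0]
x = [1.0, 2.0]
gap = sum(wi * xi for wi, xi in zip(w, x))   # negative gap: misclassified
delta = min_norm_repair(gap, x, margin=0.5)
w_repaired = [wi + di for wi, di in zip(w, delta)]
new_gap = sum(wi * xi for wi, xi in zip(w_repaired, x))
print(delta, new_gap)
```

For a nonlinear network the linearization is only approximate, which is why the abstract conditions its guarantees on the first-order approximation holding.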

57. ❌ PsychAgent: An Experience-Driven Lifelong Learning Agent for Self-Evolving Psychological Counselor

Authors: Yutao Yang, Junsong Li, Qianjun Pan, Jie Zhou, Kai Chen, Qin Chen, Jingyuan Zhao, Ningning Zhou, Xin Li, Liang He Venue/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00931v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

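Scoring rationale: The paper proposes PsychAgent, an LLM-based lifelong learning agent for psychological counseling. Core relevant keywords: (1) LLMs (the agent is LLM-based and is compared against general LLMs such as GPT-5.4 and Gemini-3; weight 1.0, relevance 10); (2) Supervised Fine-tuning (the paper notes that existing methods rely on SFT, and PsychAgent itself uses rejection fine-tuning; weight 1.0, relevance 10); (3) Self-Improvement (the core of the paper is self-evolution, skill extraction, and internalization; weight 1.0, relevance 10); (4) LLM Agents (the paper is explicitly an agent framework; weight 1.0, relevance 10). AI for Science receives a relevance of 5 because psychological counseling can be viewed as AI applied to social science and health, though this is not central. The remaining keywords, such as MoE, Scaling Laws, and RAG, are not covered (relevance 0).

!!! tip deepseek-chat TL;DR

To address AI psychological counseling's reliance on static data and lack of continual learning, the paper proposes PsychAgent, an experience-driven lifelong learning agent whose memory-augmented planning, skill evolution, and reinforced internalization engines yield more consistent, higher-quality responses in multi-session counseling.

Abstract (Chinese translation)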

现有的人工智能心理咨询方法主要依赖于使用静态对话数据集进行监督微调。然而，这与人类专家的实践模式形成对比：他们通过持续的临床实践和积累的经验不断精进专业能力。为弥合这一差距，我们提出了一种面向心理咨询的经验驱动终身学习智能体（PsychAgent）。首先，我们构建了一个专为纵向多轮次交互设计的记忆增强规划引擎，该引擎通过持久性记忆与策略性规划确保治疗过程的连续性。其次，为支持智能体的自我进化，我们设计了技能演化引擎，能够从历史咨询轨迹中提取基于实践的新技能。最后，我们引入了强化内化引擎，通过拒绝微调技术将演化出的技能整合至模型中，旨在提升模型在多样化场景下的表现。对比分析表明，在所有已评估的维度上，我们的方法均取得了比主流通用大语言模型（如GPT-5.4、Gemini-3）及领域专用基线模型更高的评分。这些结果表明，终身学习机制能够有效提升多轮次心理咨询响应的一致性与整体质量。

摘要 (Abstract)

Existing methods for AI psychological counselors predominantly rely on supervised fine-tuning using static dialogue datasets. However, this contrasts with human experts, who continuously refine their proficiency through clinical practice and accumulated experience. To bridge this gap, we propose an Experience-Driven Lifelong Learning Agent (PsychAgent) for psychological counseling. First, we establish a Memory-Augmented Planning Engine tailored for longitudinal multi-session interactions, which ensures therapeutic continuity through persistent memory and strategic planning. Second, to support self-evolution, we design a Skill Evolution Engine that extracts new practice-grounded skills from historical counseling trajectories. Finally, we introduce a Reinforced Internalization Engine that integrates the evolved skills into the model via rejection fine-tuning, aiming to improve performance across diverse scenarios. Comparative analysis shows that our approach achieves higher scores than strong general LLMs (e.g., GPT-5.4, Gemini-3) and domain-specific baselines across all reported evaluation dimensions. These results suggest that lifelong learning can improve the consistency and overall quality of multi-session counseling responses.

Keywords: Psychological counseling, Lifelong learning agent, Self-evolving, Memory-augmented planning, Skill evolution, Reinforced internalization, Multi-session interaction, Rejection fine-tuning
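The abstract above does not spell out its rejection fine-tuning procedure. Assuming it follows the common rejection-sampling recipe (sample several candidates, keep only those a judge rates highly, fine-tune on what survives), the data-selection step can be sketched as follows. Every name here is hypothetical: `generate` stands in for the model, `score` for the quality judge, and the toy versions exist only to make the example runnable.

```python
def select_for_rejection_finetuning(prompts, generate, score, n_samples, threshold):
    """Sample several candidates per prompt and keep the best one if it clears `threshold`.

    `generate(prompt, i)` and `score(prompt, response)` are hypothetical callables
    standing in for the model and the quality judge.
    """
    kept = []
    for prompt in prompts:
        candidates = [generate(prompt, i) for i in range(n_samples)]
        best = max(candidates, key=lambda r: score(prompt, r))
        if score(prompt, best) >= threshold:
            kept.append({"prompt": prompt, "response": best})
    return kept


def toy_generate(prompt, i):
    """Toy generator: later samples carry more '!' marks."""
    return f"{prompt} -> draft {i}" + "!" * i


def toy_score(prompt, response):
    """Toy judge: counts '!' marks as a stand-in for quality."""
    return response.count("!")


dataset = select_for_rejection_finetuning(
    ["session 1", "session 2"], toy_generate, toy_score, n_samples=3, threshold=2
)
print(dataset)
```

Prompts whose best candidate falls below the threshold contribute nothing to the fine-tuning set, so the threshold trades data volume against data quality.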

58. ❌ Learning Quantised Structure-Preserving Motion Representations for Dance Fingerprinting

Authors: Arina Kharlamova, Bowei He, Chen Ma, Xue Liu Venue/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00927v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

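Scoring rationale: The paper studies dance motion retrieval, using skeleton motion quantisation and spatio-temporal Transformers to build discrete motion signatures, and is unrelated to most of the large-model keywords. It is only weakly related to "Quantization" (motion quantisation) and "AI for Science" (AI applied to dance analysis), each given a relevance of 5.

!!! tip deepseek-chat TL;DR

The paper proposes DANCEMATCH, a framework that builds discrete motion signatures via skeleton motion quantisation and spatio-temporal Transformers for efficient video-based dance retrieval, and releases the annotated dataset DANCETYPESBENCHMARK.

Abstract (Chinese translation)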

我们提出DANCEMATCH——一个基于动作的舞蹈检索端到端框架，该任务定义为舞蹈指纹识别，旨在直接从原始视频中识别语义相似的编舞。现有动作分析与检索方法虽能比较姿态序列，但依赖难以索引、解释或扩展的连续嵌入表示。相比之下，DANCEMATCH构建了紧凑的离散动作特征，既能捕捉舞蹈的时空结构，又能实现高效的大规模检索。本系统集成骨骼运动量化与时空变换器，将通过Apple CoMotion提取的人体姿态编码为结构化动作词汇。我们进一步设计舞蹈检索引擎，采用基于直方图的索引实现亚线性检索，并通过重排序机制优化匹配精度。为促进可重复研究，我们开源了DANCETYPESBENCHMARK数据集，该数据集包含姿态对齐标注与量化动作标记。实验表明，本方法在不同舞蹈风格中均表现出稳健的检索性能，并对未见编舞具有强泛化能力，为可扩展的运动指纹识别与量化编舞分析奠定了基础。

摘要 (Abstract)

We present DANCEMATCH, an end-to-end framework for motion-based dance retrieval, the task of identifying semantically similar choreographies directly from raw video, defined as DANCE FINGERPRINTING. While existing motion analysis and retrieval methods can compare pose sequences, they rely on continuous embeddings that are difficult to index, interpret, or scale. In contrast, DANCEMATCH constructs compact, discrete motion signatures that capture the spatio-temporal structure of dance while enabling efficient large-scale retrieval. Our system integrates Skeleton Motion Quantisation (SMQ) with Spatio-Temporal Transformers (STT) to encode human poses, extracted via Apple CoMotion, into a structured motion vocabulary. We further design DANCE RETRIEVAL ENGINE (DRE), which performs sub-linear retrieval using a histogram-based index followed by re-ranking for refined matching. To facilitate reproducible research, we release DANCETYPESBENCHMARK, a pose-aligned dataset annotated with quantised motion tokens. Experiments demonstrate robust retrieval across diverse dance styles and strong generalisation to unseen choreographies, establishing a foundation for scalable motion fingerprinting and quantitative choreographic analysis.

Keywords: dance retrieval, motion fingerprinting, skeleton motion quantization, spatio-temporal transformers, pose-aligned dataset, large-scale retrieval, choreographic analysis, video-based motion analysis
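The retrieval engine in the abstract above shortlists candidates with a histogram-based index and then re-ranks them. The sketch below illustrates that two-stage pattern on sequences of discrete motion tokens. It is not the DRE implementation: the histogram-intersection score, the longest-common-subsequence re-ranker, and the three-clip library are all assumptions chosen for illustration, and the linear scan here does not reproduce the sub-linear lookup the paper reports.

```python
from collections import Counter


def histogram_overlap(a, b):
    """Histogram intersection between two token-count histograms."""
    return sum(min(a[t], b[t]) for t in a.keys() & b.keys())


def lcs_length(x, y):
    """Length of the longest common subsequence, used as an order-aware re-ranker."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        cur = [0]
        for j, yj in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if xi == yj else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def retrieve(query, library, shortlist_size=2):
    """Shortlist by histogram overlap, then re-rank the shortlist by LCS."""
    q_hist = Counter(query)
    shortlist = sorted(
        library,
        key=lambda name: histogram_overlap(q_hist, Counter(library[name])),
        reverse=True,
    )[:shortlist_size]
    return sorted(shortlist, key=lambda name: lcs_length(query, library[name]), reverse=True)


# Hypothetical motion-token sequences for three clips.
library = {
    "waltz": [1, 2, 3, 1, 2, 3],
    "waltz_reversed": [3, 2, 1, 3, 2, 1],
    "hiphop": [7, 8, 9, 7, 8, 8],
}
ranking = retrieve([1, 2, 3, 1, 2, 3], library)
print(ranking)
```

The two waltz clips have identical token histograms, so the first stage cannot separate them; the order-aware second stage does. That division of labor is the reason for re-ranking after a cheap histogram match.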

59. ❌ Representation Selection via Cross-Model Agreement using Canonical Correlation Analysis

Authors: Dylan B. Lewis, Jens Gregor, Hector Santos-Villalobos Venue/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00921v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

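Scoring rationale: The paper targets computer vision, proposing a post-hoc method based on canonical correlation analysis (CCA) to improve the efficiency of representations from pretrained image encoders. The method is unrelated to most keywords, which mainly target large language models (LLMs) and related techniques such as fine-tuning, alignment, reasoning, and agents. Only two keywords receive a relevance of 5: (1) "Pre-training OR Continual Pre-training OR Domain Adaptation", because the paper reuses representations from pretrained image encoders, which falls under applications of pre-training; (2) "Quantization OR Model Compression OR Low-bit Weights", because the method improves efficiency through representation selection and dimensionality reduction (cutting more than 75% of dimensions), which matches the goal of model compression without being a quantization or low-bit-weight technique.

!!! tip deepseek-chat TL;DR

The paper proposes a training-free, post-hoc method based on canonical correlation analysis (CCA) that exploits the shared structure between the representations of two pretrained image encoders for representation selection and dimensionality reduction, cutting dimensionality by more than 75% while improving downstream performance.

Abstract (Chinese translation)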

现代视觉处理流程日益依赖预训练图像编码器，其表征被跨任务和模型重复使用，但这些表征往往存在过度完备且与模型强相关的问题。我们提出一种简单、无需训练的方法，通过后验典型相关分析（CCA）算子提升图像表征的效率。该方法利用两个预训练图像编码器生成表征之间的共享结构，寻找线性投影作为表征选择与降维的原则性方法，在保留共享语义内容的同时剔除冗余维度。与主成分分析（PCA）等仅针对单一嵌入空间操作的传统降维技术不同，我们的方法借助跨模型一致性指导表征蒸馏与精炼。该技术可使表征维度降低超过75%的同时提升下游任务性能，或通过从更大规模或经微调的模型进行后验表征迁移，在固定维度下增强表征能力。在ImageNet-1k、CIFAR-100、MNIST及其他基准测试上的实证结果表明，该方法相较于基线表征与PCA投影表征均取得持续改进，最高可获得12.6%的准确率提升。

摘要 (Abstract)

Modern vision pipelines increasingly rely on pretrained image encoders whose representations are reused across tasks and models, yet these representations are often overcomplete and model-specific. We propose a simple, training-free method to improve the efficiency of image representations via a post-hoc canonical correlation analysis (CCA) operator. By leveraging the shared structure between representations produced by two pre-trained image encoders, our method finds linear projections that serve as a principled form of representation selection and dimensionality reduction, retaining shared semantic content while discarding redundant dimensions. Unlike standard dimensionality reduction techniques such as PCA, which operate on a single embedding space, our approach leverages cross-model agreement to guide representation distillation and refinement. The technique allows representations to be reduced by more than 75% in dimensionality with improved downstream performance, or enhanced at fixed dimensionality via post-hoc representation transfer from larger or fine-tuned models. Empirical results on ImageNet-1k, CIFAR-100, MNIST, and additional benchmarks show consistent improvements over both baseline and PCA-projected representations, with accuracy gains of up to 12.6%.

Keywords: representation selection, canonical correlation analysis, image encoders, dimensionality reduction, cross-model agreement, pretrained models, post-hoc method, efficiency improvement
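For reference, the textbook CCA objective that the abstract above builds on can be written as follows. The notation is an assumption introduced here: $x$ and $y$ are centered representations of the same image from the two encoders, with covariances $\Sigma_{xx}$, $\Sigma_{yy}$ and cross-covariance $\Sigma_{xy}$. How the paper chooses the retained dimension $k$ is not stated in the abstract.

$$
(a_1, b_1) = \arg\max_{a,\,b}\; \frac{a^\top \Sigma_{xy}\, b}{\sqrt{a^\top \Sigma_{xx}\, a}\;\sqrt{b^\top \Sigma_{yy}\, b}}
$$

Later pairs $(a_i, b_i)$ maximize the same ratio subject to being uncorrelated with the earlier canonical variables. Stacking the first $k$ directions as $A_k = [a_1, \dots, a_k]$ gives the reduced representation $z = A_k^\top x$. Unlike PCA, which ranks directions by the variance of $x$ alone, these directions are ranked by how strongly the two encoders agree on them.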

60. ❌ Investigating Autonomous Agent Contributions in the Wild: Activity Patterns and Code Change over Time

Authors: Razvan Mihai Popescu, David Gros, Andrei Botocan, Rahul Pandita, Prem Devanbu, Maliheh Izadi Venue/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00917v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

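Scoring rationale: The paper studies the contribution patterns of LLM-driven autonomous coding agents in real open-source projects. It is highly related to "LLM Agents/Autonomous Agents" (10), involves tool use as agents create code, perform reviews, and so on (8), and compares several agent systems (5). The paper mentions large models such as Codex and Claude (8) but does not examine innovations in their underlying techniques; other keywords such as MoE, quantization, and inference acceleration are not covered (0).

!!! tip deepseek-chat TL;DR

By analyzing roughly 110,000 open-source pull requests, the study compares the activity patterns and code contributions of five autonomous coding agents in real projects, finding that agent activity is growing but that agent-generated code shows higher churn than human-authored code.

Abstract (Chinese translation)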

代码大语言模型的兴起重塑了软件开发格局。能够创建分支、发起拉取请求和执行代码审查的自主编码代理，如今已在实际项目中发挥积极作用。其日益增长的作用为研究人工智能驱动的贡献及其对代码质量、团队协作和软件可维护性的影响提供了独特而及时的契机。在本研究中，我们构建了一个包含约110,000个开源拉取请求的新型数据集，涵盖关联的提交记录、评论、审查意见、问题报告和文件变更，共同代表了数百万行源代码。我们比较了五种主流编码代理（包括OpenAI Codex、Claude Code、GitHub Copilot、Google Jules和Devin），考察了它们在合并频率、编辑文件类型以及开发者互动信号（包括评论与审查）等不同开发维度上的使用差异。此外，我们强调代码编写与审查仅是庞大软件工程流程中的一小部分，因为生成的代码还需随时间推移进行维护与更新。因此，我们针对代理生成代码与人工编写代码提供了若干纵向评估指标，包括存活率与变更率。最终研究表明，尽管代理生成代码随时间推移比人工代码表现出更高的变更率，但开源项目中代理活动的参与度正持续提升。

摘要 (Abstract)

The rise of large language models for code has reshaped software development. Autonomous coding agents, able to create branches, open pull requests, and perform code reviews, now actively contribute to real-world projects. Their growing role offers a unique and timely opportunity to investigate AI-driven contributions and their effects on code quality, team dynamics, and software maintainability. In this work, we construct a novel dataset of approximately 110,000 open-source pull requests, including associated commits, comments, reviews, issues, and file changes, collectively representing millions of lines of source code. We compare five popular coding agents, including OpenAI Codex, Claude Code, GitHub Copilot, Google Jules, and Devin, examining how their usage differs in various development aspects such as merge frequency, edited file types, and developer interaction signals, including comments and reviews. Furthermore, we emphasize that code authoring and review are only a small part of the larger software engineering process, as the resulting code must also be maintained and updated over time. Hence, we offer several longitudinal estimates of survival and churn rates for agent-generated versus human-authored code. Ultimately, our findings indicate increasing agent activity in open-source projects, although agent contributions are associated with more churn over time compared to human-authored code.

Keywords: autonomous coding agents, large language models for code, pull requests, code quality, software development, open-source projects, code churn, agent activity patterns
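The abstract above reports survival and churn rates without defining them. One plausible line-level definition, offered here purely as an assumption, is the share of added lines that still exist a fixed number of days after they were introduced, with churn as its complement. The sketch below computes that quantity; the function name, the `(added_on, removed_on)` record format, and the dates are all hypothetical.

```python
from datetime import date, timedelta


def survival_rate(lines, horizon_days, as_of):
    """Share of added lines still present `horizon_days` after they were added.

    `lines` is a list of (added_on, removed_on) pairs; removed_on is None for
    lines that still exist. Lines added less than `horizon_days` before `as_of`
    are skipped because their outcome at the horizon is not yet observable.
    """
    horizon = timedelta(days=horizon_days)
    observed = survived = 0
    for added_on, removed_on in lines:
        if as_of - added_on < horizon:
            continue
        observed += 1
        if removed_on is None or removed_on - added_on >= horizon:
            survived += 1
    return survived / observed if observed else None


# Hypothetical history of four lines added by one agent.
agent_lines = [
    (date(2025, 1, 1), None),               # still present
    (date(2025, 1, 1), date(2025, 1, 10)),  # rewritten after 9 days
    (date(2025, 1, 1), date(2025, 3, 1)),   # rewritten after 59 days
    (date(2025, 6, 25), None),              # too recent to judge at 30 days
]
rate = survival_rate(agent_lines, horizon_days=30, as_of=date(2025, 7, 1))
print(rate, 1 - rate)
```

Skipping lines that are too young matters for agent-versus-human comparisons: agent contributions are more recent on average, so counting them as survivors would bias the estimate.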

61. ❌ Experience as a Compass: Multi-agent RAG with Evolving Orchestration and Agent Prompts

Authors: Sha Li, Naren Ramakrishnan Venue/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00901v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究多智能体检索增强生成（Multi-agent RAG）系统，因此与’Retrieval-Augmented Generation OR RAG OR Retrieval-Generation’（10分）、‘LLM Agents OR Autonomous Agents OR Agentic Workflow’（10分）和’Multi-agent Systems OR Agent Coordination’（10分）高度相关。论文涉及复杂推理和智能体行为优化，与’Chain of Thought OR CoT Reasoning OR Multi-step Reasoning’（8分）、‘System 2 Thinking OR Slow Thinking OR In-depth Reasoning’（8分）和’Self-Correction OR Self-Improvement OR Self-Reflection’（8分）有一定关联。论文基于大模型构建智能体系统，与’Large Language Models OR LLMs OR Foundation Models’（8分）相关。其他关键词如MoE、量化、对齐等未在论文中涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文针对多智能体检索增强生成（RAG）系统中静态编排策略和智能体行为导致的性能脆弱问题，提出了HERA框架，通过分层演化多智能体编排和角色特定提示，在六个知识密集型基准测试中实现了平均38.69%的性能提升，并展现出高效协调和鲁棒推理能力。

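Abstract (Translated)

Multi-agent Retrieval-Augmented Generation (RAG) assigns each agent a specific role and can thereby support difficult queries that require multiple steps, multiple sources, or complex reasoning. However, existing methods rely on static agent behaviors and fixed orchestration strategies, which leads to brittle performance across diverse multi-hop tasks. We point out two key limitations: the lack of a continuously adaptive orchestration mechanism, and the absence of behavior-level learning for individual agents. To this end, we propose HERA, a hierarchical framework that co-evolves multi-agent orchestration and role-specific agent prompts. At the global level, HERA optimizes query-specific agent topologies through reward-guided sampling and experience accumulation. At the local level, Role-Aware Prompt Evolution refines agent behavior through credit assignment and dual-axis adaptation along operational and behavioral principles, yielding targeted, role-conditioned improvements. On six knowledge-intensive benchmarks, HERA improves on recent baselines by 38.69% on average while retaining strong generalization and token efficiency. Topological analysis reveals emergent self-organization: sparse exploration produces compact, high-utility multi-agent networks that exhibit efficient coordination and robust reasoning.

Abstract (Original)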

Multi-agent Retrieval-Augmented Generation (RAG), wherein each agent takes on a specific role, supports hard queries that require multiple steps and sources, or complex reasoning. Existing approaches, however, rely on static agent behaviors and fixed orchestration strategies, leading to brittle performance on diverse, multi-hop tasks. We identify two key limitations: the lack of continuously adaptive orchestration mechanisms and the absence of behavior-level learning for individual agents. To this end, we propose HERA, a hierarchical framework that jointly evolves multi-agent orchestration and role-specific agent prompts. At the global level, HERA optimizes query-specific agent topologies through reward-guided sampling and experience accumulation. At the local level, Role-Aware Prompt Evolution refines agent behaviors via credit assignment and dual-axes adaptation along operational and behavioral principles, enabling targeted, role-conditioned improvements. On six knowledge-intensive benchmarks, HERA achieves an average improvement of 38.69% over recent baselines while maintaining robust generalization and token efficiency. Topological analyses reveal emergent self-organization, where sparse exploration yields compact, high-utility multi-agent networks, demonstrating both efficient coordination and robust reasoning.

Keywords: Multi-agent RAG, Retrieval-Augmented Generation, Agent Coordination, Hierarchical Framework, Prompt Evolution, Knowledge-intensive Benchmarks, Self-organization, Complex Reasoning
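
The abstract does not say how HERA implements reward-guided sampling with experience accumulation. As a minimal sketch, assuming a softmax over each candidate topology's mean observed reward, the idea could look like the following. The class name `TopologySampler`, the temperature parameter, and the tuple encoding of a topology are all hypothetical and not taken from the paper.

```python
import math
import random
from collections import defaultdict


class TopologySampler:
    """Toy reward-guided sampler over candidate agent topologies (hypothetical)."""

    def __init__(self, topologies, temperature=0.5, seed=0):
        self.topologies = list(topologies)
        self.temperature = temperature
        self.rng = random.Random(seed)
        self.total_reward = defaultdict(float)  # accumulated experience
        self.count = defaultdict(int)

    def mean_reward(self, topology):
        n = self.count[topology]
        return self.total_reward[topology] / n if n else 0.0

    def probabilities(self):
        # Softmax over mean rewards: topologies that scored well are sampled more often.
        logits = [self.mean_reward(t) / self.temperature for t in self.topologies]
        peak = max(logits)
        weights = [math.exp(value - peak) for value in logits]
        total = sum(weights)
        return [w / total for w in weights]

    def sample(self):
        return self.rng.choices(self.topologies, weights=self.probabilities(), k=1)[0]

    def update(self, topology, reward):
        # Record the reward observed after running a query through this topology.
        self.total_reward[topology] += reward
        self.count[topology] += 1
```

A caller would alternate `sample()` and `update()` per query; the softmax keeps some probability on low-reward topologies, so exploration never stops entirely.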

62. ❌ Beyond Symbolic Solving: Multi Chain-of-Thought Voting for Geometric Reasoning in Large Language Models

Authors: Md. Abu Bakor Siddique, Shahrin Hossain, Sadman Ahmed Siam, Syed Rifat Raiyan, Hasan Mahmud, Md Kamrul Hasan Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00890v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	7.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

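Scoring rationale: The paper studies the application of large language models to geometric reasoning and improves reasoning through a multi chain-of-thought voting mechanism. It is highly relevant to "Large Language Models" and "Chain of Thought" (10 each) because it explicitly studies improvements to CoT reasoning in LLMs. It is relevant to "System 2 Thinking" (8) because multi-stage voting and verification involve a deliberate reasoning process. It is relevant to "Self-Correction" (7) because the self-verification pipeline contains elements of self-improvement. It has some connection to "Tool Use" (5) because executing Python code for numerical verification can be regarded as tool use. Other keywords, such as MoE, SLMs, Scaling Laws, and Alignment, are unrelated to the paper (0).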

!!! tip deepseek-chat TL;DR

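The paper addresses insufficient logical reasoning in large language models on geometry problems. It proposes MARS-GPS, which generates multiple parallel reasoning paths, ranks them by entropy, and aggregates them through multi-stage voting, reaching 88.8% accuracy on the Geometry3K dataset, an improvement of nearly 11% over the previous best method.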

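Abstract (Translated)

Geometric Problem Solving (GPS) has always been central to improving the mathematical reasoning of large language models because it requires combining diagram understanding, symbolic manipulation, and logical inference. Existing work has mainly focused on synchronizing diagram descriptions with textual information and solving the problem; within this framing, researchers typically adopt neural, symbolic, or neuro-symbolic methods. This satisfies only the first two requirements (diagram understanding and symbolic manipulation), while logical inference remains underdeveloped and is usually confined to a single chain of thought (CoT). To remedy this weakness, the paper proposes MARS-GPS, which generates multiple parallel reasoning paths and verifies them numerically through Python code execution, ranks the paths using token-level entropy as a confidence signal, and finally consolidates the answer through a multi-stage voting and self-verification pipeline. Experiments show that MARS-GPS with 8 parallel reasoning paths reaches 88.8% accuracy on the Geometry3K dataset, nearly 11% above the previous best result; accuracy keeps rising as the number of reasoning paths grows from 1 to 16 (+6.0% on the ablation subset). The code and data are available in an anonymous repository: https://anonymous.4open.science/r/MARS-GPS-DE55.

Abstract (Original)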

Geometric Problem Solving (GPS) remains at the heart of enhancing mathematical reasoning in large language models because it requires the combination of diagrammatic understanding, symbolic manipulation and logical inference. In existing literature, researchers have chiefly focused on synchronising the diagram descriptions with text literals and solving the problem. In this vein, they have either taken a neural, symbolic or neuro-symbolic approach. But this solves only the first two of the requirements, namely diagrammatic understanding and symbolic manipulation, while leaving logical inference underdeveloped. The logical inference is often limited to one chain-of-thought (CoT). To address this weakness in hitherto existing models, this paper proposes MARS-GPS, that generates multiple parallel reasoning rollouts augmented with Python code execution for numerical verification, ranks them using token-level entropy as a confidence signal, and aggregates answers through a multi-stage voting and self-verification pipeline. Empirical results show that MARS-GPS with 8 parallel rollouts achieves 88.8% on Geometry3K, a nearly +11% improvement over the prior state-of-the-art, with accuracy scaling consistently as the number of rollouts increases from 1 to 16 (+6.0% on ablation subset). We provide our code and data in an anonymous repository: https://anonymous.4open.science/r/MARS-GPS-DE55.

Keywords: Large Language Models, Geometric Reasoning, Chain-of-Thought, Multi-stage Voting, Python Code Execution, Self-verification, Mathematical Reasoning, Parallel Reasoning Rollouts
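
To make the "rank by token-level entropy, then vote" idea concrete, here is a simplified single-stage sketch. It assumes each rollout exposes its final answer and the next-token probability distribution at every generated position; the function names and the `keep` cutoff are hypothetical. The multi-stage voting and self-verification steps that MARS-GPS describes are not reproduced here.

```python
import math
from collections import Counter


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_entropy(distributions):
    """Average token-level entropy of a rollout; lower means more confident."""
    return sum(token_entropy(d) for d in distributions) / len(distributions)


def vote(rollouts, keep=4):
    """Keep the `keep` most confident rollouts and return their majority answer.

    rollouts: list of (answer, per-token distributions) pairs.
    """
    ranked = sorted(rollouts, key=lambda r: mean_entropy(r[1]))
    kept_answers = [answer for answer, _ in ranked[:keep]]
    return Counter(kept_answers).most_common(1)[0][0]
```

Filtering by confidence before voting lets a few low-entropy rollouts outvote a larger number of uncertain ones, which plain majority voting cannot do.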

63. ❌ PixelPrune: Pixel-Level Adaptive Visual Token Reduction via Predictive Coding

Authors: Nan Wang, Zhiwei Jin, Chen Chen, Haonan Lu Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00886v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	8.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

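Scoring rationale: The paper proposes PixelPrune, which uses predictive-coding compression to prune redundant image patches at the pixel level and thereby accelerates inference and training of Vision-Language Models (VLMs). The method is highly relevant to "Speculative Decoding OR Inference Acceleration" (10) because its core goal is faster inference. It has some connection to "Quantization OR Model Compression OR Low-bit Weights" (8) because it involves compression techniques, and a weak connection to "Large Language Models OR LLMs OR Foundation Models" (5) because it is applied to VLMs, which contain an LLM component. Other keywords (such as MoE, SFT, and RAG) are not addressed and receive 0.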

!!! tip deepseek-chat TL;DR

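The paper addresses the high computational cost of Vision-Language Models on high-resolution images. It proposes PixelPrune, which compresses redundant image patches through pixel-level predictive coding and delivers up to 4.2× faster inference and 1.9× faster training while maintaining task accuracy.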

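Abstract (Translated)

Document understanding and graphical user interface interaction are among the most valuable applications of vision-language models, yet their computational burden is exceptionally heavy: fine-grained text and small UI elements require high-resolution inputs, which produce tens of thousands of visual tokens. We find that this cost is largely wasted: across document and GUI benchmarks, only 22% to 71% of image patches are pixel-unique, and the remaining patches exactly duplicate other patches in the same image. We therefore propose PixelPrune, which exploits this pixel-level redundancy through predictive-coding-based compression and prunes redundant patches before the Vision Transformer (ViT) encoder processes them. Because the operation takes place in pixel space before any neural computation, PixelPrune accelerates both the ViT encoder and the downstream large language model, covering the full inference pipeline. The method requires no training, introduces no learnable parameters, and supports pixel-lossless compression ($\tau = 0$) as well as controlled lossy compression ($\tau > 0$). Experiments on three model scales and several document and GUI benchmarks show that PixelPrune keeps task accuracy competitive while delivering up to 4.2× inference speedup and 1.9× training speedup. Code is available at https://github.com/OPPO-Mente-Lab/PixelPrune.

Abstract (Original)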

Document understanding and GUI interaction are among the highest-value applications of Vision-Language Models (VLMs), yet they impose exceptionally heavy computational burden: fine-grained text and small UI elements demand high-resolution inputs that produce tens of thousands of visual tokens. We observe that this cost is largely wasteful – across document and GUI benchmarks, only 22–71% of image patches are pixel-unique, the rest being exact duplicates of another patch in the same image. We propose PixelPrune, which exploits this pixel-level redundancy through predictive-coding-based compression, pruning redundant patches before the Vision Transformer (ViT) encoder. Because it operates in pixel space prior to any neural computation, PixelPrune accelerates both the ViT encoder and the downstream LLM, covering the full inference pipeline. The method is training-free, requires no learnable parameters, and supports pixel-lossless compression ($\tau = 0$) as well as controlled lossy compression ($\tau > 0$). Experiments across three model scales and document and GUI benchmarks show that PixelPrune maintains competitive task accuracy while delivering up to 4.2× inference speedup and 1.9× training acceleration. Code is available at https://github.com/OPPO-Mente-Lab/PixelPrune.

Keywords: Vision-Language Models, Pixel-Level Compression, Predictive Coding, Inference Acceleration, Visual Token Reduction, Training-Free Method, Document Understanding, GUI Interaction
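
The observation that many patches are exact duplicates can be illustrated with a small exact-match deduplication. This is only a sketch of the lossless case in spirit: it is not the paper's predictive-coding algorithm, the function name is hypothetical, and it assumes a single-channel image whose height and width are multiples of the patch size.

```python
def unique_patches(image, patch):
    """Split a 2D image (list of rows) into patch x patch tiles.

    Returns the distinct tiles in first-seen order, plus a mapping from every
    tile position (row-major) to the index of the kept tile, so the original
    layout can be rebuilt without loss.
    """
    height, width = len(image), len(image[0])
    first_seen = {}  # tile contents -> index into `kept`
    kept, mapping = [], []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = tuple(
                tuple(image[row][left:left + patch])
                for row in range(top, top + patch)
            )
            if tile not in first_seen:
                first_seen[tile] = len(kept)
                kept.append(tile)
            mapping.append(first_seen[tile])
    return kept, mapping
```

Only the tiles in `kept` would need to be encoded, and the ratio `len(kept) / len(mapping)` corresponds to the pixel-unique fraction that the abstract reports as 22–71%.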

64. ❌ KUET at StanceNakba Shared Task: StanceMoE: Mixture-of-Experts Architecture for Stance Detection

Authors: Abdullah Al Shafi, Md. Milon Islam, Sk. Imran Hossain, K. M. Azharul Hasan Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00878v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

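Scoring rationale: The paper's core contribution is StanceMoE, a stance detection model based on an MoE architecture, so it is highly relevant to "Mixture of Experts" (10). The model is a fine-tuned BERT, which makes it an application of a large model but only weakly connected to frontier LLM techniques, so "Large Language Models" receives 5 and "Post-training/SFT" receives 5. Other keywords, such as SLMs, Scaling Laws, RAG, and RLHF, are not addressed and receive 0. The paper is an NLP application and does not involve AI for scientific domains.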

!!! tip deepseek-chat TL;DR

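The paper addresses stance detection toward target actors that are implicit in the text. It proposes StanceMoE, a model based on a Mixture-of-Experts architecture, which achieves a macro-F1 score of 94.26% on the StanceNakba dataset and outperforms traditional baselines.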

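Abstract (Translated)

Actor-level stance detection aims to determine the supportive or opposing position that an author expresses toward specific geopolitical actors mentioned or implicated in a text. Although Transformer-based models have achieved good performance in stance classification, they typically rely on unified representations that may not fully capture heterogeneous linguistic signals such as contrastive discourse structures, framing cues, and salient lexical indicators. This motivates adaptive architectures that explicitly model diverse patterns of stance expression. This paper proposes StanceMoE, a context-enhanced Mixture-of-Experts (MoE) architecture built on a fine-tuned BERT encoder and designed for actor-level stance detection. The model integrates six expert modules designed to capture complementary linguistic signals: global semantic orientation, salient lexical cues, clause-level focus, phrase-level patterns, framing indicators, and contrast-driven discourse shifts. A context-aware gating mechanism dynamically weights the experts' contributions, which enables adaptive routing based on input characteristics. We run experiments on the StanceNakba 2026 Subtask A dataset, which contains 1,401 annotated English texts in which the target actor is implicit. StanceMoE achieves a macro-F1 score of 94.26%, outperforming traditional baselines and other BERT-based variants.

Abstract (Original)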

Actor-level stance detection aims to determine an author's expressed position toward specific geopolitical actors mentioned or implicated in a text. Although transformer-based models have achieved relatively good performance in stance classification, they typically rely on unified representations that may not sufficiently capture heterogeneous linguistic signals, such as contrastive discourse structures, framing cues, and salient lexical indicators. This motivates the need for adaptive architectures that explicitly model diverse stance-expressive patterns. In this paper, we propose StanceMoE, a context-enhanced Mixture-of-Experts (MoE) architecture built upon a fine-tuned BERT encoder for actor-level stance detection. Our model integrates six expert modules designed to capture complementary linguistic signals, including global semantic orientation, salient lexical cues, clause-level focus, phrase-level patterns, framing indicators, and contrast-driven discourse shifts. A context-aware gating mechanism dynamically weights expert contributions, enabling adaptive routing based on input characteristics. Experiments are conducted on the StanceNakba 2026 Subtask A dataset, comprising 1,401 annotated English texts where the target actor is implicit in the text. StanceMoE achieves a macro-F1 score of 94.26%, outperforming traditional baselines and alternative BERT-based variants.

Keywords: stance detection, Mixture-of-Experts, MoE architecture, BERT fine-tuning, actor-level stance, context-aware gating, linguistic signals, StanceNakba dataset
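
A gating mechanism that "dynamically weights expert contributions" is commonly realized as a softmax over per-expert gate scores followed by a weighted sum of expert outputs. The sketch below shows that generic dense-gating pattern in plain Python; it is an assumption about the general technique, not StanceMoE's actual implementation, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of gate scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_experts(expert_outputs, gate_scores):
    """Weight each expert's output vector by its softmax gate and sum them.

    expert_outputs: one equal-length vector per expert.
    gate_scores: one raw score per expert, e.g. produced from the input context.
    """
    gates = softmax(gate_scores)
    dim = len(expert_outputs[0])
    return [
        sum(gate * output[i] for gate, output in zip(gates, expert_outputs))
        for i in range(dim)
    ]
```

With six experts and equal gate scores the result is the plain average of the expert outputs; a large score for one expert makes the mixture follow that expert almost exclusively.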

65. ❌ Proactive Agent Research Environment: Simulating Active Users to Evaluate Proactive Assistants

Authors: Deepak Nathani, Cheng Zhang, Chang Huan, Jiaming Shan, Yinfei Yang, Alkesh Patel, Zhe Gan, William Yang Wang, Michael Saxon, Xin Eric Wang Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00842v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

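Scoring rationale: The paper studies an evaluation framework for proactive agents. It is highly relevant to LLM Agents (10) because proactive agents are a specific type of LLM agent. It is relevant to Tool Use (8) because the agents must call application APIs to carry out tasks. It has some connection to Multi-agent Systems (5) because it involves interaction between a user simulator and the agent. It is indirectly related to Large Language Models (5) because proactive agents are usually built on LLMs. Other keywords, such as MoE, quantization, and inference acceleration, are unrelated to the paper (0).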

!!! tip deepseek-chat TL;DR

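The paper addresses the lack of a realistic user simulation framework for proactive agents. It proposes the Proactive Agent Research Environment (Pare) framework and the Pare-Bench benchmark for building and evaluating, in digital environments, proactive agents that anticipate user needs and execute tasks autonomously.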

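Abstract (Translated)

Proactive agents that can anticipate user needs and execute tasks autonomously have great potential as digital assistants, but the absence of realistic user simulation frameworks holds back their development. Existing methods model applications as flat tool-calling APIs and fail to capture the stateful, sequential nature of user interaction in digital environments, which makes realistic user simulation hard to achieve. We propose the Proactive Agent Research Environment (Pare), a framework for building and evaluating proactive agents in digital environments. Pare models applications as finite state machines and gives the user simulator stateful navigation and a state-dependent action space, which enables active user simulation. On top of this framework, we further introduce the Pare-Bench benchmark, which covers 143 diverse tasks across communication, productivity, scheduling, and lifestyle applications and is designed to evaluate an agent's context observation, goal inference, intervention timing, and multi-app orchestration.

Abstract (Original)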

Proactive agents that anticipate user needs and autonomously execute tasks hold great promise as digital assistants, yet the lack of realistic user simulation frameworks hinders their development. Existing approaches model apps as flat tool-calling APIs, failing to capture the stateful and sequential nature of user interaction in digital environments and making realistic user simulation infeasible. We introduce Proactive Agent Research Environment (Pare), a framework for building and evaluating proactive agents in digital environments. Pare models applications as finite state machines with stateful navigation and state-dependent action space for the user simulator, enabling active user simulation. Building on this foundation, we present Pare-Bench, a benchmark of 143 diverse tasks spanning communication, productivity, scheduling, and lifestyle apps, designed to test context observation, goal inference, intervention timing, and multi-app orchestration.

Keywords: proactive agents, user simulation, digital assistants, finite state machines, benchmark evaluation, multi-app orchestration, autonomous task execution
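
Modeling an app as a finite state machine with a state-dependent action space, as opposed to a flat list of tool calls, can be sketched as follows. The class, the mail-app states, and the action names are invented for illustration and do not come from Pare.

```python
class AppStateMachine:
    """Toy finite state machine for one app; states and actions are hypothetical."""

    def __init__(self, transitions, start):
        # transitions: {state: {action: next_state}}
        self.transitions = transitions
        self.state = start

    def available_actions(self):
        # The action space depends on the current state.
        return sorted(self.transitions[self.state])

    def step(self, action):
        if action not in self.transitions[self.state]:
            raise ValueError(f"{action!r} is not available in state {self.state!r}")
        self.state = self.transitions[self.state][action]
        return self.state


MAIL_APP = {
    "inbox": {"open_message": "message", "compose": "draft"},
    "message": {"reply": "draft", "back": "inbox"},
    "draft": {"send": "inbox", "discard": "inbox"},
}
```

In this toy app, `send` can only be reached after navigating to `draft`, which is the kind of sequential constraint a flat tool-calling API would not express.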

66. ❌ Learning to Learn-at-Test-Time: Language Agents with Learnable Adaptation Policies

Authors: Zhanzhi Lou, Hui Chen, Yibo Li, Qian Wang, Bryan Hooi Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00830v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

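Scoring rationale: The paper studies the optimization of adaptation policies for language agents in Test-Time Learning (TTL), an innovation in the application of large models. Highly relevant keywords: "Self-Correction/Self-Improvement/Self-Reflection" (10; the paper's core is agents improving themselves through iterative interaction) and "LLM Agents/Autonomous Agents/Agentic Workflow" (10; the paper explicitly studies language agents). Moderately relevant: "Large Language Models/LLMs/Foundation Models" (8; language agents usually depend on large models), "Pre-training/Continual Pre-training/Domain Adaptation" (5; learning adaptation policies resembles domain adaptation), "Post-training/Supervised Fine-tuning/SFT" (5; TTL can be viewed as a form of fine-tuning), and "In-context Learning/Many-shot Learning" (5; TTL involves learning from interaction with the environment). The other keywords have no direct connection to the paper.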

!!! tip deepseek-chat TL;DR

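The paper proposes the Meta-TTL framework, which uses bi-level optimization to learn a language agent's test-time adaptation policy. It outperforms hand-crafted policies on the Jericho and WebArena-Lite benchmarks and yields improvement strategies that transfer.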

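Abstract (Translated)

Test-Time Learning (TTL) lets language agents iteratively improve their performance at inference time through repeated interaction with the environment. At the core of TTL is an adaptation policy that updates the actor policy based on experience from earlier episodes and thereby improves future behavior. Existing methods rely on fixed, hand-designed adaptation policies rather than optimizing them for downstream improvement. We argue that the optimal adaptation policy should be learned from task environments instead of being designed by hand from human intuition. To this end, we propose the Meta-TTL framework, which casts the discovery of effective adaptation policies as a bi-level optimization problem. Within this framework, the inner loop runs the standard TTL process and evaluates how well a candidate adaptation policy helps the agent correct errors over consecutive episodes; the outer loop, guided by the agent's performance, iteratively refines the adaptation policy through evolutionary search over a diverse distribution of training tasks. We evaluate Meta-TTL on the Jericho and WebArena-Lite benchmarks in in-distribution (ID) and out-of-distribution (OOD) settings, using several meta-agent backbones. Results on both benchmarks show that Meta-TTL consistently outperforms hand-designed baselines, which suggests that the optimized adaptation policy encodes transferable strategies that generalize beyond the training task distribution.

Abstract (Original)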

Test-Time Learning (TTL) enables language agents to iteratively refine their performance through repeated interactions with the environment at inference time. At the core of TTL is an adaptation policy that updates the actor policy based on experience from previous episodes, thereby improving future behavior. Existing methods rely on fixed, hand-crafted adaptation policies rather than optimizing them for downstream improvement. We argue that optimal adaptation policies should be learned from task environments, not hand-engineered based on human intuition. To achieve this, we introduce Meta-TTL, a framework that formulates the discovery of effective adaptation policies as a bi-level optimization problem. Within this framework, the inner loop executes the standard TTL process, measuring how effectively a candidate adaptation policy helps an agent correct errors across sequential episodes. Guided by the agent’s performance, the outer loop employs evolutionary search over a diverse distribution of training tasks to iteratively refine the adaptation policy. We evaluate Meta-TTL on Jericho and WebArena-Lite across both in-distribution (ID) and out-of-distribution (OOD) settings, using multiple meta-agent backbones. Results on both benchmarks show that Meta-TTL consistently outperforms hand-crafted baselines, suggesting that the optimized adaptation policy encodes transferable strategies that generalize beyond the training task distribution.

Keywords: Test-Time Learning, Language Agents, Adaptation Policy, Meta-TTL, Bi-level Optimization, Evolutionary Search, Self-Improvement, Generalization
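
The bi-level structure (an inner loop that runs test-time learning with a candidate adaptation policy, and an outer loop that evolves the policy) can be shown on a deliberately tiny numeric problem. Everything here is a toy assumption: the "agent" guesses a number, the "adaptation policy" is a single scalar step size, and the evolutionary search is mutate-and-keep-best. Meta-TTL itself operates on language agents and is not reproduced by this sketch.

```python
import random


def inner_loop(step_size, target, episodes=5):
    """Run a toy TTL process; return the final absolute error (lower is better)."""
    guess = 0.0
    for _ in range(episodes):
        error = target - guess      # feedback from the previous episode
        guess += step_size * error  # adaptation policy: one learnable scalar
    return abs(target - guess)


def fitness(step_size, tasks):
    """Negative mean final error of a candidate policy over the training tasks."""
    return -sum(inner_loop(step_size, t) for t in tasks) / len(tasks)


def outer_loop(tasks, generations=30, population=8, seed=0):
    """Evolutionary search over the adaptation policy: mutate, keep the best."""
    rng = random.Random(seed)
    best = 0.1  # hand-picked starting policy
    for _ in range(generations):
        candidates = [best] + [best + rng.gauss(0, 0.1) for _ in range(population)]
        best = max(candidates, key=lambda s: fitness(s, tasks))
    return best
```

Because the current best is always kept among the candidates, the learned policy is never worse on the training tasks than the starting one, and it can then be evaluated on a held-out target to check transfer.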

67. ❌ Emotion Entanglement and Bayesian Inference for Multi-Dimensional Emotion Understanding

Authors: Hemanth Kotaprolu, Kishan Maharaj, Raey Zhao, Abhijit Mishra, Pushpak Bhattacharyya Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00819v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

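Scoring rationale: The paper evaluates the zero-shot performance of large language models (LLMs) on emotion understanding and proposes a post-processing framework based on Bayesian inference to improve predictions. It is therefore highly relevant to "Large Language Models" and "Instruction Tuning" (10 each), because it explicitly evaluates six instruction-tuned large language models. Other keywords, such as MoE, SLMs, Scaling Laws, Pre-training, and RLHF, are not addressed in the paper and receive 0. Although the paper involves affective computing, it does not clearly fall under the bioinformatics or cheminformatics subfields of AI for Science, so that keyword also receives 0.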

!!! tip deepseek-chat TL;DR

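The paper addresses the fact that existing emotion understanding benchmarks ignore structured dependencies among emotions. It introduces EmoScene, a benchmark of multi-dimensional emotion scenarios, and evaluates six instruction-tuned large language models in a zero-shot setting, where the best model reaches a Macro F1 of only 0.501. It then proposes a lightweight entanglement-aware Bayesian inference framework that incorporates emotion co-occurrence statistics to improve the structural consistency and performance of predictions.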

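Abstract (Translated)

Understanding emotion in natural language is inherently a multi-dimensional reasoning problem in which multiple affective signals interact through context, interpersonal relations, and situational cues. However, most existing emotion understanding benchmarks rely on short texts and predefined emotion labels, which reduces the process to independent label prediction and ignores the structured dependencies among emotions. To address this limitation, we propose Emotional Scenarios (EmoScene), a theory-grounded benchmark dataset of 4,731 context-rich scenarios annotated with eight-dimensional emotion vectors derived from Plutchik's theory of basic emotions. We evaluate six instruction-tuned large language models in a zero-shot setting and observe limited performance, with the best model reaching a macro F1 of only 0.501, which underscores the difficulty of context-aware multi-label emotion prediction. Based on the observation that emotions rarely occur independently, we further propose an entanglement-aware Bayesian inference framework that combines emotion co-occurrence statistics to perform joint posterior inference over the emotion vector. This lightweight post-processing step improves the structural consistency of predictions and brings notable gains for weaker models (for example, the macro F1 of Qwen2.5-7B rises by 0.051). EmoScene thus provides a challenging benchmark for studying multi-dimensional emotion understanding and the limitations of current language models.

Abstract (Original)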

Understanding emotions in natural language is inherently a multi-dimensional reasoning problem, where multiple affective signals interact through context, interpersonal relations, and situational cues. However, most existing emotion understanding benchmarks rely on short texts and predefined emotion labels, reducing this process to independent label prediction and ignoring the structured dependencies among emotions. To address this limitation, we introduce Emotional Scenarios (EmoScene), a theory-grounded benchmark of 4,731 context-rich scenarios annotated with an 8-dimensional emotion vector derived from Plutchik’s basic emotions. We evaluate six instruction-tuned large language models in a zero-shot setting and observe modest performance, with the best model achieving a Macro F1 of 0.501, highlighting the difficulty of context-aware multi-label emotion prediction. Motivated by the observation that emotions rarely occur independently, we further propose an entanglement-aware Bayesian inference framework that incorporates emotion co-occurrence statistics to perform joint posterior inference over the emotion vector. This lightweight post-processing improves structural consistency of predictions and yields notable gains for weaker models (e.g., +0.051 Macro F1 for Qwen2.5-7B). EmoScene therefore provides a challenging benchmark for studying multi-dimensional emotion understanding and the limitations of current language models.

Keywords: emotion understanding, large language models, instruction tuning, zero-shot evaluation, multi-dimensional emotion, Bayesian inference, EmoScene benchmark, context-aware prediction
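
One way to read "joint posterior inference over the emotion vector" is as a maximum a posteriori search over all binary label vectors, with a prior estimated from how often label combinations co-occur. The sketch below makes that reading concrete under stated assumptions: the prior is a smoothed count of whole label vectors, the model's per-emotion probabilities are treated as independent evidence, and the function name is hypothetical. The paper's actual formulation may differ.

```python
from collections import Counter
from itertools import product


def map_emotions(model_probs, observed_vectors, smoothing=1.0):
    """Return the most probable binary emotion vector.

    model_probs: per-emotion probabilities from a classifier or LLM.
    observed_vectors: annotated label vectors used as co-occurrence statistics.
    """
    counts = Counter(map(tuple, observed_vectors))
    best, best_score = None, -1.0
    for vector in product((0, 1), repeat=len(model_probs)):
        prior = counts[vector] + smoothing        # unnormalised joint prior
        likelihood = 1.0
        for p, v in zip(model_probs, vector):
            likelihood *= p if v else 1.0 - p     # independent per-emotion evidence
        score = prior * likelihood                # unnormalised posterior
        if score > best_score:
            best, best_score = vector, score
    return best
```

With eight emotions the search covers only 256 vectors, so exhaustive enumeration is cheap. The prior can flip a borderline label (for example, a 0.45 probability) when that emotion usually co-occurs with one the model is confident about.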

68. ❌ DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale

Authors: Sicheng Zuo, Zixun Xie, Wenzhao Zheng, Shaoqing Xu, Fang Li, Hanbing Li, Long Chen, Zhi-Xin Yang, Jiwen Lu Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00813v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

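Scoring rationale: The paper focuses on autonomous driving. It proposes a Vision-Geometry-Action (VGA) paradigm based on dense 3D geometry and develops the DVGT-2 model for online geometry reconstruction and trajectory planning. Its content mainly concerns application-specific techniques such as computer vision, 3D geometry reconstruction, and autonomous driving planning; it does not involve large language models (LLMs), innovations in the principles of deep learning techniques, or applications of large models to different domains. All scoring keywords concern large language models and related techniques (such as MoE, Scaling Laws, RLHF, RAG, and agents) and are entirely unrelated to the paper's topic of visual geometry for autonomous driving, so every keyword receives a relevance score of 0.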

!!! tip deepseek-chat TL;DR

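The paper proposes a Vision-Geometry-Action (VGA) paradigm for autonomous driving and develops the DVGT-2 model for online dense 3D geometry reconstruction and trajectory planning. It achieves superior geometry reconstruction performance on multiple datasets and can be applied directly to planning under different camera configurations without fine-tuning.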

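Abstract (Translated)

End-to-end autonomous driving has evolved from the traditional paradigm based on sparse perception to Vision-Language-Action (VLA) models, which focus on learning language descriptions as an auxiliary task to aid planning. This paper proposes an alternative Vision-Geometry-Action (VGA) paradigm that advocates dense 3D geometry as the key cue for autonomous driving. Because vehicles operate in a 3D world, we argue that dense 3D geometry provides the most comprehensive information for decision-making. However, most existing geometry reconstruction methods (for example, DVGT) rely on computationally expensive batch processing of multi-frame inputs and cannot be applied to online planning. To solve this problem, we propose the streaming Driving Visual Geometry Transformer (DVGT-2), which processes inputs online and jointly outputs dense geometry and trajectory planning for the current frame. We adopt temporal causal attention and cache historical features to support real-time inference. To further improve efficiency, we propose a sliding-window streaming strategy that uses historical caches within a certain time interval to avoid repeated computation. Despite being faster, DVGT-2 achieves better geometry reconstruction performance on multiple datasets. The same trained DVGT-2 model can be applied directly, without fine-tuning, to planning under different camera configurations, including the closed-loop NAVSIM and open-loop nuScenes benchmarks.

Abstract (Original)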

End-to-end autonomous driving has evolved from the conventional paradigm based on sparse perception into vision-language-action (VLA) models, which focus on learning language descriptions as an auxiliary task to facilitate planning. In this paper, we propose an alternative Vision-Geometry-Action (VGA) paradigm that advocates dense 3D geometry as the critical cue for autonomous driving. As vehicles operate in a 3D world, we think dense 3D geometry provides the most comprehensive information for decision-making. However, most existing geometry reconstruction methods (e.g., DVGT) rely on computationally expensive batch processing of multi-frame inputs and cannot be applied to online planning. To address this, we introduce a streaming Driving Visual Geometry Transformer (DVGT-2), which processes inputs in an online manner and jointly outputs dense geometry and trajectory planning for the current frame. We employ temporal causal attention and cache historical features to support on-the-fly inference. To further enhance efficiency, we propose a sliding-window streaming strategy and use historical caches within a certain interval to avoid repetitive computations. Despite the faster speed, DVGT-2 achieves superior geometry reconstruction performance on various datasets. The same trained DVGT-2 can be directly applied to planning across diverse camera configurations without fine-tuning, including closed-loop NAVSIM and open-loop nuScenes benchmarks.

Keywords: autonomous driving, vision-geometry-action, dense 3D geometry, online planning, geometry reconstruction, trajectory planning, streaming transformer, temporal causal attention

69. ❌ Preference Guided Iterated Pareto Referent Optimisation for Accessible Route Planning

Authors: Paolo Speziali, Arno De Greef, Mehrdad Asadi, Willem Röpke, Ann Nowé, Diederik M. Roijers Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00795v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

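Scoring rationale: The paper studies an urban route-planning algorithm (PG-IPRO) focused on interactive route optimization for people with different accessibility requirements. It covers multi-objective optimization, user preference feedback, and computational efficiency, but involves none of the techniques behind the keywords: large models, deep learning, language models, model training, inference acceleration, or AI agents. All keywords concern the technical principles of large models and deep learning or their applications in science, whereas this paper belongs to classical operations research and optimization, so it is unrelated to every keyword.

!!! tip deepseek-chat TL;DR

The paper proposes Preference Guided Iterated Pareto Referent Optimisation (PG-IPRO) for urban route planning for people with different accessibility requirements. It optimizes route objectives through interactive user feedback, improving computational efficiency and user experience.

Abstract translation (Chinese)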

本文提出偏好引导迭代帕累托参考优化算法（Preference Guided Iterated Pareto Referent Optimisation，简称PG-IPRO），用于满足不同可达性需求与偏好的城市路径规划。该算法允许用户通过对路径提供反馈与系统交互，即用户可以指定哪些目标应进一步优化，或哪些目标可适当放宽。这种交互方式直观高效，在迭代初期相较于基于信息增益的交互方法表现尤为突出。此外，由于PG-IPRO的迭代特性，算法无需计算全部备选策略（即帕累托前沿），从而显著提升了计算效率，缩短了用户的等待时间。

Abstract

We propose the Preference Guided Iterated Pareto Referent Optimisation (PG-IPRO) for urban route planning for people with different accessibility requirements and preferences. With this algorithm the user can interact with the system by giving feedback on a route, i.e., the user can say which objective should be further minimized or, conversely, can be relaxed. This leads to intuitive user interaction, which is especially effective during early iterations compared to information-gain-based interaction. Furthermore, due to PG-IPRO’s iterative nature, the full set of alternative, possibly optimal policies (the Pareto front) is never computed, leading to higher computational efficiency and shorter waiting times for users.
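
The interaction pattern in the abstract (the user names an objective to minimize further, and the system answers with a new trade-off) can be shown on a toy example. The sketch below is not PG-IPRO: it assumes a small, already enumerated list of candidate routes with cost tuples, whereas the paper avoids computing the full Pareto front. The functions `dominates` and `next_route` and the tie-breaking rule are hypothetical.

```python
def dominates(a, b):
    """True if cost vector a is no worse than b everywhere and better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def next_route(candidates, current, improve):
    """Pick a non-dominated candidate that lowers objective index `improve`.

    Returns None when no candidate improves on the current route.
    """
    # Keep only routes that are strictly better on the objective the user named.
    better = [c for c in candidates if c[improve] < current[improve]]
    # Drop routes that another remaining route dominates.
    better = [c for c in better if not any(dominates(o, c) for o in better)]
    if not better:
        return None
    # Assumed tie-break: give up as little as possible on the other objectives.
    return min(better, key=lambda c: sum(v for i, v in enumerate(c) if i != improve))


# Cost tuples are (travel minutes, number of steps on the route); both are minimized.
routes = [(10, 5), (12, 2), (15, 0), (16, 1)]
print(next_route(routes, (10, 5), improve=1))
```

Calling `next_route` repeatedly with the same `improve` index mimics a user asking for fewer steps at each iteration until no better route remains.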

Keywords: route planning, accessibility requirements, preference guided, Pareto optimization, iterative algorithm, user interaction, multi-objective optimization, computational efficiency

Authors: Shaopeng Fu, Xingxing Zhang, Li Dong, Di Wang, Furu Wei Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00790v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

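Scoring rationale: The paper studies the self-refinement ability of LLMs in competitive programming, achieving iterative improvement through reinforcement learning training and an agent architecture. Highly relevant keywords: LLMs (the core models), Self-Correction/Self-Reflection (the core method), LLM Agents (the Skeptical-Agent architecture), Chain of Thought/System 2 Thinking (complex reasoning tasks), and Tool Use (validation with local execution tools). SLMs (4B models are used) and Hallucination Mitigation (validation improves factuality) are partially relevant. Other keywords such as MoE, Scaling Laws, and PEFT are not covered.

!!! tip deepseek-chat TL;DR

The paper proposes RefineRL, which combines reinforcement learning training with the Skeptical-Agent architecture so that small LLMs gain substantially on competitive programming tasks through self-refinement; the 4B models approach the single-attempt performance of 235B models.

Abstract translation (Chinese)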

尽管大语言模型（LLM）在竞争性编程（CP）等复杂推理任务上已展现出强大性能，但现有方法主要集中于单次尝试的设置，忽视了其迭代优化的潜力。本文提出RefineRL，一种旨在释放LLM在解决CP问题时自我优化能力的新方法。RefineRL包含两项关键创新：（1）Skeptical-Agent（怀疑型智能体），这是一个配备本地执行工具的迭代自我优化智能体，能够针对CP问题的公开测试用例验证所生成的解决方案。该智能体始终对其自身输出保持怀疑态度，从而即使在验证表明正确的情况下也强制执行严格的自我优化。（2）一种强化学习（RL）方案，旨在激励LLM仅使用标准RLVR数据（即问题与其可验证答案配对的数据）进行自我优化。在Qwen3-4B和Qwen3-4B-2507模型上的大量实验表明，我们的方法带来了显著提升：经过RL训练后，这些集成了Skeptical-Agent的紧凑4B模型不仅超越了更大的32B模型，而且接近了235B模型的单次尝试性能。这些发现表明，自我优化对于扩展LLM的推理能力具有重要前景，并具备巨大的进一步改进潜力。

Abstract

While large language models (LLMs) have demonstrated strong performance on complex reasoning tasks such as competitive programming (CP), existing methods predominantly focus on single-attempt settings, overlooking their capacity for iterative refinement. In this paper, we present RefineRL, a novel approach designed to unleash the self-refinement capabilities of LLMs for CP problem solving. RefineRL introduces two key innovations: (1) Skeptical-Agent, an iterative self-refinement agent equipped with local execution tools to validate generated solutions against public test cases of CP problems. This agent always maintains a skeptical attitude towards its own outputs and thereby enforces rigorous self-refinement even when validation suggests correctness. (2) A reinforcement learning (RL) solution to incentivize LLMs to self-refine with only standard RLVR data (i.e., problems paired with their verifiable answers). Extensive experiments on Qwen3-4B and Qwen3-4B-2507 demonstrate that our method yields substantial gains: after our RL training, these compact 4B models integrated with the Skeptical-Agent not only outperform much larger 32B models but also approach the single-attempt performance of 235B models. These findings suggest that self-refinement holds considerable promise for scaling LLM reasoning, with significant potential for further advancement.
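
The Skeptical-Agent idea (validate each candidate against public test cases, and keep refining even after a pass) can be sketched as a short loop. This is an illustration under stated assumptions, not the paper's agent: `propose` is a hypothetical stand-in for the LLM call, candidates are plain Python callables rather than generated programs, and the `min_rounds` rule is one assumed way to encode "skepticism".

```python
def run_public_tests(solution, tests):
    """Return the failing cases as (input, expected, got) triples."""
    failures = []
    for arg, expected in tests:
        try:
            got = solution(arg)
        except Exception as exc:  # a crash counts as a failed test
            got = repr(exc)
        if got != expected:
            failures.append((arg, expected, got))
    return failures


def skeptical_refine(propose, tests, max_rounds=4, min_rounds=2):
    """Iteratively refine a solution; return (best passing candidate, rounds used).

    propose(history) -> candidate callable. `history` holds earlier
    (candidate, failures) pairs so the proposer can react to feedback.
    """
    history = []
    best = None
    for round_no in range(1, max_rounds + 1):
        candidate = propose(history)
        failures = run_public_tests(candidate, tests)
        history.append((candidate, failures))
        if not failures:
            best = candidate
            # Stay skeptical: a single pass on public tests does not end the loop.
            if round_no >= min_rounds:
                break
    return best, len(history)
```

Public tests are only a partial check, which is why the loop keeps the latest passing candidate but still spends at least `min_rounds` rounds before stopping.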

Keywords: Large Language Models, Self-Refinement, Reinforcement Learning, Competitive Programming, LLM Agents, Tool Use, Complex Reasoning, Iterative Refinement

71. ❌ UK AISI Alignment Evaluation Case-Study

Authors: Alexandra Souly, Robert Kirk, Jacob Merizian, Abby D’Cruz, Xander Davies Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00788v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

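Scoring rationale: The paper evaluates whether frontier large language models (Claude Opus 4.5 Preview, Sonnet 4.5, and others) reliably follow intended goals when used as coding assistants in an AI lab, that is, whether they sabotage safety research. This directly concerns the evaluation of large models (LLMs), alignment, and their behavior as agents in a specific workflow. It is therefore highly relevant (10 points) to 'Large Language Models OR LLMs OR Foundation Models', 'Instruction Tuning OR Alignment OR Value Alignment', and 'LLM Agents OR Autonomous Agents OR Agentic Workflow'. The paper does not address the specific techniques behind the other keywords (such as MoE, quantization, inference acceleration, or AI for science), so their relevance is 0.

!!! tip deepseek-chat TL;DR

The study evaluates whether frontier large language models reliably follow intended goals (that is, avoid sabotaging safety research) when deployed as coding assistants in an AI lab. It finds no confirmed instances of research sabotage but observes that some models frequently refuse to engage with safety-relevant research tasks.

Abstract translation (Chinese)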

本技术报告介绍了英国人工智能安全研究所开发的评估方法，用于检验先进人工智能系统是否可靠遵循既定目标。具体而言，我们评估了前沿模型在作为编码助手部署于人工智能实验室时，是否会破坏安全研究。通过对四个前沿模型应用我们的方法，未发现已确认的研究破坏实例。然而，我们观察到Claude Opus 4.5 Preview（Opus 4.5的预发布版本）与Sonnet 4.5经常拒绝执行与安全相关的研究任务，其理由涉及对研究方向、参与自我训练及研究范围的担忧。此外，我们发现Opus 4.5 Preview相较于Sonnet 4.5展现出更低的无提示评估意识，而两种模型在收到提示时均能区分评估场景与部署场景。我们的评估框架基于Petri（一个开源的大型语言模型审计工具），并采用定制化架构模拟编码代理在实验室内部的真实部署环境。经验证，该架构生成的轨迹与真实部署数据无法被所有测试模型可靠区分。我们在不同研究动机、活动类型、替代威胁和模型自主性的场景中对模型进行了测试。最后，我们讨论了包括场景覆盖度与评估意识在内的局限性。

Abstract

This technical report presents methods developed by the UK AI Security Institute for assessing whether advanced AI systems reliably follow intended goals. Specifically, we evaluate whether frontier models sabotage safety research when deployed as coding assistants within an AI lab. Applying our methods to four frontier models, we find no confirmed instances of research sabotage. However, we observe that Claude Opus 4.5 Preview (a pre-release snapshot of Opus 4.5) and Sonnet 4.5 frequently refuse to engage with safety-relevant research tasks, citing concerns about research direction, involvement in self-training, and research scope. We additionally find that Opus 4.5 Preview shows reduced unprompted evaluation awareness compared to Sonnet 4.5, while both models can distinguish evaluation from deployment scenarios when prompted. Our evaluation framework builds on Petri, an open-source LLM auditing tool, with a custom scaffold designed to simulate realistic internal deployment of a coding agent. We validate that this scaffold produces trajectories that all tested models fail to reliably distinguish from real deployment data. We test models across scenarios varying in research motivation, activity type, replacement threat, and model autonomy. Finally, we discuss limitations including scenario coverage and evaluation awareness.

Keywords: AI alignment evaluation, frontier models, coding assistant, safety research, LLM auditing, agent deployment, evaluation framework, research sabotage

72. ❌ Scalable Pretraining of Large Mixture of Experts Language Models on Aurora Super Computer

Authors: Dharma Teja Vooturi, Dhiraj Kalamkar, Dipankar Das, Bharat Kaul Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00785v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

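Scoring rationale: The paper studies pretraining of large language models (LLMs) and Mixture of Experts (MoE) models, so it is highly relevant (10 points) to 'Large Language Models' and 'Mixture of Experts'. It focuses on pretraining from scratch, so it is highly relevant (10 points) to 'Pre-training'. It covers model scaling (from 1B to 220B parameters) and compute scaling (from 384 to 12288 GPU tiles), which is partially relevant (5 points) to 'Scaling Laws', but it does not explicitly discuss data quality. It does not address the other keywords, such as post-training, alignment, reasoning, agents, compression, or AI for science, so those score 0.

!!! tip deepseek-chat TL;DR

The paper studies large-scale pretraining of Mixture of Experts (MoE) language models on the Aurora supercomputer. It develops the Optimus training library and scales both model size and compute, reaching about 90% scaling efficiency on 12288 GPU tiles.

Abstract translation (Chinese)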

从头开始预训练大型语言模型需要巨大的计算资源。Aurora超级计算机是一台拥有127,488个英特尔PVC（Ponte Vecchio）GPU tiles的百亿亿次级机器。在本研究中，我们展示了在Aurora上使用数千个GPU tiles进行LLM预训练的工作。为此，我们开发了Optimus——一个内部训练库，支持标准的大模型训练技术。利用Optimus，我们首先在3072个GPU tiles上，使用OLMoE-mix-0924数据集的全部4万亿个token，从头预训练了10亿参数的稠密模型Mula-1B和70亿参数的专家混合模型Mula-7B-A1B。随后，我们在同一数据集上预训练了三个大型MoE模型——Mula-20B-A2B、Mula-100B-A7B和Mula-220B-A10B直至1000亿个token，以展示模型缩放能力。在我们最大的模型Mula-220B-A10B上，我们将计算规模从384个GPU tiles扩展到12288个GPU tiles，并在12288个GPU tiles上观察到约90%的缩放效率。通过为专家计算定制GPU内核，以及采用一种新颖的EP-Aware分片优化器，我们显著提升了MoE模型的运行时性能，实现了高达1.71倍的训练加速。作为Optimus库的一部分，我们还开发了一套强大的可靠性与容错功能，以提升大规模训练的稳定性和连续性。

Abstract

Pretraining Large Language Models (LLMs) from scratch requires massive amounts of compute. The Aurora supercomputer is an exascale machine with 127,488 Intel PVC (Ponte Vecchio) GPU tiles. In this work, we showcase LLM pretraining on Aurora at the scale of 1000s of GPU tiles. Towards this effort, we developed Optimus, an in-house training library with support for standard large model training techniques. Using Optimus, we first pretrained Mula-1B, a 1 Billion dense model, and Mula-7B-A1B, a 7 Billion Mixture of Experts (MoE) model, from scratch on 3072 GPU tiles for the full 4 trillion tokens of the OLMoE-mix-0924 dataset. We then demonstrated model scaling by pretraining three large MoE models, Mula-20B-A2B, Mula-100B-A7B, and Mula-220B-A10B, up to 100 Billion tokens on the same dataset. On our largest model, Mula-220B-A10B, we pushed the compute scaling from 384 to 12288 GPU tiles and observed scaling efficiency of around 90% at 12288 GPU tiles. We significantly improved the runtime performance of MoE models using custom GPU kernels for expert computation and a novel EP-Aware sharded optimizer, resulting in training speedups of up to 1.71x. As part of the Optimus library, we also developed a robust set of reliability and fault-tolerance features to improve training stability and continuity at scale.
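
To put the reported scaling figure in concrete terms, the arithmetic can be written out. The abstract does not define "scaling efficiency"; the sketch below assumes the common throughput-based definition relative to the 384-tile baseline, and both function names are hypothetical.

```python
def scaling_efficiency(base_tiles, base_throughput, tiles, throughput):
    """Measured throughput divided by the ideal linear-scaling throughput."""
    ideal = base_throughput * (tiles / base_tiles)
    return throughput / ideal


def implied_speedup(base_tiles, tiles, efficiency):
    """Throughput gain over the baseline implied by a given efficiency."""
    return efficiency * (tiles / base_tiles)


# 12288 / 384 = 32x more tiles; at about 90% efficiency this is about 28.8x throughput.
print(implied_speedup(384, 12288, 0.90))
```

Under this assumed definition, going from 384 to 12288 tiles yields roughly 28.8 times the baseline throughput instead of the ideal 32 times.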

Keywords: Large Language Models, Mixture of Experts, Pretraining, Model Scaling, Compute Scaling, Training Library, Supercomputer, GPU Tiles

73. ❌ Thinking Wrong in Silence: Backdoor Attacks on Continuous Latent Reasoning

Authors: Swapnil Parekh Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00770v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

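Scoring rationale: The paper studies backdoor attacks on language models that reason in continuous latent space. Its core topics are large language models (LLMs), reasoning mechanisms (Chain of Thought, System 2 Thinking), and interpretability (Mechanistic Interpretability), so it is highly relevant (10 points) to these keywords. It notes that the backdoor persists after clean fine-tuning, which is partially relevant (5 points) to supervised fine-tuning (SFT). Other keywords such as MoE, quantization, and RAG are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper presents ThoughtSteer, a new backdoor attack on continuous latent reasoning language models. Perturbing a single embedding vector is enough to hijack the reasoning trajectory, giving a high attack success rate while evading existing defenses; the work also offers a new perspective on the mechanistic interpretability of continuous reasoning.

Abstract translation (Chinese)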

新一代语言模型完全在连续的隐藏状态中进行推理，不生成任何词元且不留下可审计的踪迹。我们证明这种“静默”特性催生了一种全新的攻击面。ThoughtSteer方法通过在输入层扰动单个嵌入向量；模型自身的多轮推理机制将这一扰动放大为被劫持的潜在轨迹，从而稳定输出攻击者预设的答案，同时保持结构上对所有词元级防御的不可见性。在两种架构（Coconut与SimCoT）、三个推理基准测试集以及参数规模从1.24亿到30亿的模型上，ThoughtSteer实现了≥99%的攻击成功率且保持接近基准的原始准确率，无需重新训练即可迁移至未见的基准测试集（94-100%），规避了全部五种已评估的主动防御机制，并能经受25轮次的洁净微调。我们将这些结果归因于统一的作用机制：潜在空间中的神经坍缩现象将受触发的表征拉向一个紧凑的几何吸引子，这既解释了防御失效的原因，也说明了任何有效的后门都必然留下线性可分的特征标记（探测AUC≥0.999）。然而一个显著的悖论随之浮现：尽管模型输出错误答案，单个潜在向量仍编码着正确答案。对抗性信息并不存在于任何单一向量中，而是蕴藏于集体轨迹之内，这确立了后门扰动可作为连续推理机制可解释性研究的新透镜。代码与模型检查点已公开。

Abstract

A new generation of language models reasons entirely in continuous hidden states, producing no tokens and leaving no audit trail. We show that this silence creates a fundamentally new attack surface. ThoughtSteer perturbs a single embedding vector at the input layer; the model’s own multi-pass reasoning amplifies this perturbation into a hijacked latent trajectory that reliably produces the attacker’s chosen answer, while remaining structurally invisible to every token-level defense. Across two architectures (Coconut and SimCoT), three reasoning benchmarks, and model scales from 124M to 3B parameters, ThoughtSteer achieves >=99% attack success rate with near-baseline clean accuracy, transfers to held-out benchmarks without retraining (94-100%), evades all five evaluated active defenses, and survives 25 epochs of clean fine-tuning. We trace these results to a unifying mechanism: Neural Collapse in the latent space pulls triggered representations onto a tight geometric attractor, explaining both why defenses fail and why any effective backdoor must leave a linearly separable signature (probe AUC>=0.999). Yet a striking paradox emerges: individual latent vectors still encode the correct answer even as the model outputs the wrong one. The adversarial information is not in any single vector but in the collective trajectory, establishing backdoor perturbations as a new lens for mechanistic interpretability of continuous reasoning. Code and checkpoints are available.

Keywords: continuous latent reasoning, backdoor attacks, language models, reasoning benchmarks, neural collapse, mechanistic interpretability, adversarial perturbations, latent trajectory

74. ❌ BioCOMPASS: Integrating Biomarkers into Transformer-Based Immunotherapy Response Prediction

Authors: Sayed Hashim, Frank Soboczenski, Paul Cairns Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00739v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

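Scoring rationale: BioCOMPASS is a bioinformatics paper. It studies a transformer-based model for predicting immunotherapy response and improves generalization by integrating biomarkers and treatment information. The content is unrelated to the vast majority of keywords (such as LLMs, MoE, RLHF, RAG, and CoT), which concern the technical principles, training methods, inference optimization, and agent systems of large language models, whereas this paper is a domain-specific deep learning application. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper clearly falls under bioinformatics and AI for Science and is a novel application in that field, so it scores 10 points (highly relevant).

!!! tip deepseek-chat TL;DR

To address poor model generalization in immunotherapy response prediction, the paper proposes BioCOMPASS, which integrates biomarkers and treatment information through dedicated loss components and improves generalization across patient cohorts, cancer types, and treatments.

Abstract translation (Chinese)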

免疫治疗反应预测领域使用的数据集通常规模较小，且在癌症类型、施用药物及测序平台方面存在高度异质性。当模型在训练过程未涵盖的患者队列中进行测试时，其性能往往显著下降。近期研究表明，基于Transformer的模型结合自监督学习相比基于阈值的生物标志物具有更好的泛化性能，但仍未达到最优。本文提出BioCOMPASS——一种基于Transformer模型COMPASS的扩展框架，通过整合生物标志物（biomarkers）与治疗信息以进一步提升其泛化能力。我们并未将生物标志物数据直接作为模型输入，而是构建了损失函数组件，使其与模型的中间表征对齐。研究发现，在采用留出单队列、留出单癌症类型及留出单治疗方案的评估策略中，治疗门控机制与通路一致性损失等组件有效提升了模型的泛化性能。结果表明，构建利用生物标志物与治疗信息的组件能够增强免疫治疗反应预测的泛化能力。未来研究的一个重要方向是精心设计更多组件，以整合互补的临床信息与领域知识。

Abstract

Datasets used in immunotherapy response prediction are typically small in size, as well as diverse in cancer type, drug administered, and sequencer used. Models often drop in performance when tested on patient cohorts that are not included in the training process. Recent work has shown that transformer-based models along with self-supervised learning show better generalisation performance than threshold-based biomarkers, but this performance is still suboptimal. We present BioCOMPASS, an extension of a transformer-based model called COMPASS, that integrates biomarkers and treatment information to further improve its generalisability. Instead of feeding biomarker data as input, we built loss components to align them with the model’s intermediate representations. We found that components such as treatment gating and pathway consistency loss improved generalisability when evaluated with Leave-one-cohort-out, Leave-one-cancer-type-out and Leave-one-treatment-out strategies. Results show that building components that exploit biomarker and treatment information can help in generalisability of immunotherapy response prediction. Careful curation of additional components that leverage complementary clinical information and domain knowledge represents a promising direction for future research.

Keywords: immunotherapy response prediction, transformer-based model, biomarkers, treatment information, generalization, self-supervised learning, clinical data integration, bioinformatics

75. ❌ Spectral Compact Training: Pre-Training Large Language Models via Permanent Truncated SVD and Stiefel QR Retraction

Authors: Björn Roman Kohlberger Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00733v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	8.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

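Scoring rationale: The paper proposes Spectral Compact Training (SCT), which pretrains large language models using permanent truncated SVD and Stiefel QR retraction; its core aim is to remove the memory bottleneck in large model training. It is therefore highly relevant (10 points) to 'Large Language Models' and 'Pre-training', since it is a pretraining method for large models. It is highly relevant (10 points) to 'Quantization OR Model Compression', since SCT compresses the model through low-rank factorization (up to 199x memory reduction per MLP layer). It is partially relevant (8 points) to 'PEFT OR LoRA', since SCT is also a parameter-efficient training method, though it differs from LoRA's adapter approach. Other keywords such as MoE, SLMs, SFT, and RAG are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes Spectral Compact Training (SCT), which pretrains large language models using permanent truncated SVD and Stiefel QR retraction. It addresses the memory bottleneck, achieving up to 199x memory reduction per MLP layer, so that full training steps of 70B-parameter architectures fit on consumer hardware.

Abstract translation (Chinese)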

内存墙仍是消费级硬件训练大语言模型的主要瓶颈。本文提出谱紧凑训练法，该方法将稠密权重矩阵替换为永久截断的奇异值分解因子 W = U diag(s) V^T，其中完整的稠密矩阵在训练和推理过程中均不实例化。梯度通过标准反向传播流经紧凑的谱因子，且在每个优化器步骤后通过QR分解将U、V收缩回斯蒂费尔流形。在秩为32时，SCT实现了每个MLP层最高199倍的内存压缩，使得700亿参数规模的架构能够在Steam Deck掌上设备上完成完整训练步骤（峰值内存使用量为7.2 GB，而采用Adam优化器的稠密FP32训练需要1,245 GB）。在SmolLM2-17亿参数模型上进行的秩扫描实验（秩32-256，2000步，NVIDIA A100）表明，所有测试秩均收敛至相同的损失下限（约4.2-4.5），这证明学习率调度——而非MLP秩——是主要瓶颈。秩128以11.7倍的MLP压缩率成为效率最优解，并取得最低困惑度。在秩32条件下，GPU内存占用下降46%，同时训练吞吐量翻倍。

Abstract

The memory wall remains the primary bottleneck for training large language models on consumer hardware. We introduce Spectral Compact Training (SCT), a method that replaces dense weight matrices with permanent truncated SVD factors W = U diag(s) V^T, where the full dense matrix is never materialized during training or inference. Gradients flow through the compact spectral factors via standard backpropagation, and U, V are retracted to the Stiefel manifold via QR decomposition after each optimizer step. SCT achieves up to 199x memory reduction per MLP layer at rank 32, enabling full training steps of 70B-parameter architectures on a Steam Deck handheld (7.2 GB peak memory vs. 1,245 GB for dense FP32 training with Adam). Rank-sweep experiments on SmolLM2-1.7B (ranks 32-256, 2000 steps, NVIDIA A100) show that all tested ranks converge to the same loss floor (~4.2-4.5), identifying the learning rate schedule – not MLP rank – as the primary bottleneck. Rank 128 emerges as the efficiency sweet spot at 11.7x MLP compression with the lowest perplexity. GPU memory drops 46% at rank 32 while training throughput doubles.
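
The factorization W = U diag(s) V^T and the retraction step can be sketched in plain Python. This is an illustration, not the SCT code: the function names are hypothetical, modified Gram-Schmidt stands in for a library QR routine (it yields the Q factor of a thin QR decomposition), and the 8192 × 28672 layer shape in the usage lines is an assumed example that the abstract does not state.

```python
import math


def dense_params(rows, cols):
    """Entries in a dense rows x cols weight matrix."""
    return rows * cols


def factored_params(rows, cols, rank):
    """Entries in U (rows x rank), s (rank) and V (cols x rank)."""
    return rank * (rows + cols + 1)


def factored_matvec(U, s, V, x):
    """Compute (U diag(s) V^T) x without ever forming the dense matrix."""
    rank = len(s)
    z = [s[k] * sum(V[j][k] * x[j] for j in range(len(x))) for k in range(rank)]
    return [sum(U[i][k] * z[k] for k in range(rank)) for i in range(len(U))]


def qr_retract(M):
    """Re-orthonormalize the columns of M (a list of rows)."""
    rows, rank = len(M), len(M[0])
    cols = [[M[i][k] for i in range(rows)] for k in range(rank)]
    for k in range(rank):
        # Remove the components along the already orthonormal columns.
        for j in range(k):
            proj = sum(a * b for a, b in zip(cols[j], cols[k]))
            cols[k] = [b - proj * a for a, b in zip(cols[j], cols[k])]
        norm = math.sqrt(sum(b * b for b in cols[k]))
        cols[k] = [b / norm for b in cols[k]]
    return [[cols[k][i] for k in range(rank)] for i in range(rows)]


# Assumed layer shape for illustration; the ratio comes out at about 199.
ratio = dense_params(8192, 28672) / factored_params(8192, 28672, 32)
print(round(ratio, 1))
```

For the assumed shape, the rank-32 factors hold 1,179,680 values against 234,881,024 for the dense matrix, which is in line with the "up to 199x" figure quoted in the abstract. In a training loop, `qr_retract` would be applied to U and V after each optimizer step.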

Keywords: Spectral Compact Training, Large Language Models, Pre-training, Model Compression, Truncated SVD, Memory Reduction, Parameter-efficient Training, Stiefel Manifold

76. ❌ A CEFR-Inspired Classification Framework with Fuzzy C-Means To Automate Assessment of Programming Skills in Scratch

Authors: Ricardo Hidalgo-Aragón, Jesús M. González-Barahona, Gregorio Robles Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00730v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

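Scoring rationale: The paper studies an automated method for assessing Scratch programming skills based on Fuzzy C-Means clustering and the CEFR framework. It belongs to educational technology and does not involve large models, deep learning, the technical principles of AI, or applications of AI in science. All keywords focus on large-model techniques, training methods, inference optimization, alignment, and applications, none of which relate to the paper's content.

!!! tip deepseek-chat TL;DR

The paper proposes an automated assessment method based on Fuzzy C-Means clustering and the CEFR framework for evaluating Scratch programming skills, and identifies systemic bottlenecks in curriculum design, such as the B2-level bottleneck.

Abstract (Chinese translation)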

背景：学校、培训平台与技术企业日益需要采用透明、可复现的方法大规模评估编程能力，以支持个性化学习路径。目标：本研究提出一个与《欧洲共同语言参考框架》（CEFR）相适配的Scratch项目评估教学框架，为学生与教师提供通用的能力等级标准，并为课程设计提供可操作的洞见。方法：我们对Dr.Scratch平台评估的2008246个Scratch项目应用模糊C均值聚类，通过序数化标准将聚类结果映射至CEFR等级（A1-C2），并引入增强型分类指标以识别过渡阶段学习者、实现持续进展追踪，同时通过量化分类确定性来平衡自动化反馈与教师审阅。影响：该框架能够诊断系统性课程缺陷——尤其揭示了仅13.3%学习者达到的“B2瓶颈现象”，其成因在于整合逻辑同步与数据表征所需的高认知负荷——同时基于确定性阈值触发人工干预机制。

Abstract

Context: Schools, training platforms, and technology firms increasingly need to assess programming proficiency at scale with transparent, reproducible methods that support personalized learning pathways. Objective: This study introduces a pedagogical framework for Scratch project assessment, aligned with the Common European Framework of Reference (CEFR), providing universal competency levels for students and teachers alongside actionable insights for curriculum design. Method: We apply Fuzzy C-Means clustering to 2,008,246 Scratch projects evaluated via Dr.Scratch, implementing an ordinal criterion to map clusters to CEFR levels (A1-C2), and introducing enhanced classification metrics that identify transitional learners, enable continuous progress tracking, and quantify classification certainty to balance automated feedback with instructor review. Impact: The framework enables diagnosis of systemic curriculum gaps, notably a "B2 bottleneck" where only 13.3% of learners reside due to the cognitive load of integrating Logic, Synchronization, and Data Representation, while providing certainty-based triggers for human intervention.

Keywords: Scratch programming assessment, CEFR framework, Fuzzy C-Means clustering, automated assessment, programming skills, educational technology, competency levels, curriculum design
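The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the update rule is the textbook fuzzy C-means algorithm, while the ordinal mapping rule (clusters ranked by total centre score), the `review_below` threshold of 0.6, and the three toy rubric dimensions are all assumptions made for illustration.

```python
import numpy as np

CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=200, tol=1e-6, seed=0):
    """Textbook fuzzy C-means; returns (centers, memberships)."""
    rng = np.random.default_rng(seed)
    # Random initial memberships, each row normalised to sum to 1.
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)  # guard against division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U


def assign_cefr(centers, U, review_below=0.6):
    """Map clusters to CEFR levels by ascending total centre score.

    The ranking rule and the review threshold are hypothetical choices.
    """
    order = np.argsort(centers.sum(axis=1))   # cluster ids, weakest first
    rank = np.empty(len(order), dtype=int)
    rank[order] = np.arange(len(order))       # cluster id -> ordinal rank
    hard = U.argmax(axis=1)                   # most likely cluster per project
    levels = [CEFR_LEVELS[rank[k]] for k in hard]
    certainty = U.max(axis=1)                 # highest membership per project
    return levels, certainty, certainty < review_below


# Toy data: 6 groups of 30 projects scored on 3 hypothetical rubric dimensions.
rng = np.random.default_rng(42)
group_means = np.linspace(0.3, 2.7, 6)
X = np.concatenate([rng.normal(mu, 0.1, size=(30, 3)) for mu in group_means])

centers, U = fuzzy_c_means(X, n_clusters=6)
levels, certainty, needs_review = assign_cefr(centers, U)
print(levels[0], levels[-1], int(needs_review.sum()))
```

Projects whose highest membership falls below the threshold are the "transitional learners" that such a scheme would route to an instructor rather than grade automatically.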

77. ❌ CircuitProbe: Predicting Reasoning Circuits in Transformers via Stability Zone Detection

Authors: Rajkiran Panuganti Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00716v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

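Scoring rationale: The paper's core contribution is CircuitProbe, a method for detecting reasoning circuits in Transformer language models. (1) It is highly relevant to 'Small Language Models' (10 points), because the abstract reports that layer duplication benefits models under 3B parameters. (2) It is highly relevant to 'Mechanistic Interpretability' (10 points), because it studies the interpretability of a model's internal reasoning mechanisms. (3) It is relevant to 'Large Language Models' (8 points), because the study covers Transformer language models. (4) It is relevant to 'Chain of Thought' and 'System 2 Thinking' (8 points each), because it analyzes reasoning behavior. Other keywords, such as MoE, Scaling Laws, and training methods, are not addressed and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes CircuitProbe, which quickly predicts the locations of reasoning circuits in Transformer language models from activation statistics, three to four orders of magnitude faster than brute-force search, and finds that layer duplication consistently improves small language models under 3B parameters.

Abstract (Chinese translation)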

Transformer语言模型包含局部推理电路，即那些在推理时通过复制层块能提升推理性能的连续层模块。目前定位这些电路需要进行暴力搜索，每个模型需耗费25 GPU小时。我们提出CircuitProbe方法，该方法仅通过激活统计量即可在CPU上5分钟内预测电路位置，实现三到四个数量级的加速。研究发现推理电路可分为两类：早期层的稳定性电路（通过表征变化梯度的导数检测）和后期层的幅度电路（通过异常值评分检测）。我们在涵盖6种架构的9个模型（包括2025年新模型）上进行了验证，确认CircuitProbe的顶部预测在所有验证案例中均与最优电路完全吻合或仅相差2层以内。通过对Qwen 2.5系列模型的缩放实验发现，层复制技术能持续提升30亿参数以下模型的性能，但在70亿参数以上模型中会导致性能下降，这使其成为小规模语言模型的高效缩放技术。CircuitProbe仅需10个校准样本即可工作，其预测在英语、印地语、中文和法语中均保持稳定。

Abstract

Transformer language models contain localized reasoning circuits, contiguous layer blocks that improve reasoning when duplicated at inference time. Finding these circuits currently requires brute-force sweeps costing 25 GPU hours per model. We propose CircuitProbe, which predicts circuit locations from activation statistics in under 5 minutes on CPU, providing a speedup of three to four orders of magnitude. We find that reasoning circuits come in two types: stability circuits in early layers, detected through the derivative of representation change, and magnitude circuits in late layers, detected through anomaly scoring. We validate across 9 models spanning 6 architectures, including 2025 models, confirming that CircuitProbe top predictions match or are within 2 layers of the optimal circuit in all validated cases. A scaling experiment across the Qwen 2.5 family reveals that layer duplication consistently benefits models under 3B parameters but degrades performance in 7B+ models, making this a practical scaling technique for small language models. CircuitProbe requires as few as 10 calibration examples and its predictions are stable across English, Hindi, Chinese, and French.

Keywords: Transformer language models, reasoning circuits, stability zone detection, activation statistics, layer duplication, small language models, mechanistic interpretability, inference acceleration
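The abstract defines a reasoning circuit as a contiguous layer block that helps when duplicated at inference time. The sketch below shows one plausible reading of that operation on a plain list of callables. The assumption that the block is simply run a second time immediately after itself (reusing the same weights) is ours; the abstract does not specify the scheme, and `duplicate_block` and `forward` are hypothetical helper names.

```python
def duplicate_block(layers, start, end):
    """Return a new layer list in which layers[start:end] runs twice in a row."""
    if not 0 <= start < end <= len(layers):
        raise ValueError("need 0 <= start < end <= len(layers)")
    # The duplicated block reuses the same layer objects, so no new parameters.
    return layers[:end] + layers[start:end] + layers[end:]


def forward(layers, x):
    """Apply the layers in order."""
    for layer in layers:
        x = layer(x)
    return x


# Toy stand-ins for Transformer layers.
toy_layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x + 3, lambda x: x * x]
patched = duplicate_block(toy_layers, 1, 3)  # repeat layers 1 and 2

print(forward(toy_layers, 1), forward(patched, 1))
```

A brute-force sweep in this setting would evaluate every `(start, end)` pair on a benchmark, which is the cost the paper reports avoiding by predicting the block location from activation statistics.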

78. ❌ To Memorize or to Retrieve: Scaling Laws for RAG-Considerate Pretraining

Authors: Karan Singh, Michael Yu, Varun Gangal, Zhuofu Tao, Sachin Kumar, Emmy Liu, Steven Y. Feng Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00715v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	10.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	15.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

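Scoring rationale: The paper's core topic is the trade-off between retrieval-augmented generation (RAG) and pretraining; it builds a three-dimensional scaling framework over model size, pretraining token count, and retrieval corpus size. It is therefore highly relevant to 'Retrieval-Augmented Generation' (15 points) and directly relevant to 'Large Language Models', 'Scaling Laws', and 'Pre-training' (10 points each). Its evaluation on scientific QA benchmarks gives it some relevance to 'AI for Science' (5 points), and its evaluation on reasoning tasks gives it some relevance to 'Chain of Thought' (5 points). The remaining keywords are not addressed in the paper.

!!! tip deepseek-chat TL;DR

The paper studies the trade-off between pretraining corpus size and retrieval store size under a fixed data budget. Using a three-dimensional scaling framework, it finds that the marginal utility of retrieval depends strongly on model scale, task type, and the degree of pretraining saturation, and it offers practical guidance on allocating data resources when designing scalable language modeling systems.

Abstract (Chinese translation)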

检索增强生成（RAG）通过在测试时为知识密集型任务提供相关上下文，提升了语言模型（LM）的性能。然而，在固定数据预算下，预训练过程中获得的参数化知识与通过检索获取的非参数化知识之间的关系仍不明确。本研究系统性地探讨了在不同模型和数据规模下，预训练语料规模与检索库规模之间的权衡关系。我们基于OLMo-2架构训练了参数量从3000万到30亿不等的语言模型，使用高达1000亿标记的DCLM数据进行训练，同时调整预训练数据规模（参数量的1-150倍）和检索库规模（1-20倍），并在涵盖推理、科学问答和开放域问答的多样化基准测试中评估性能。研究发现，在不同模型规模下，检索机制均能持续提升纯参数化基线的性能，并提出了一个三维扩展框架，将性能建模为模型大小、预训练标记量和检索语料规模的函数。该扩展流形使我们能够估算固定数据预算在预训练与检索之间的最优分配方案，揭示出检索的边际效用高度依赖于模型规模、任务类型以及预训练饱和程度。本研究结果为理解检索应在何时以及如何补充预训练提供了量化依据，为可扩展语言建模系统的数据资源分配提供了实践指导。

Abstract

Retrieval-augmented generation (RAG) improves language model (LM) performance by providing relevant context at test time for knowledge-intensive situations. However, the relationship between parametric knowledge acquired during pretraining and non-parametric knowledge accessed via retrieval remains poorly understood, especially under fixed data budgets. In this work, we systematically study the trade-off between pretraining corpus size and retrieval store size across a wide range of model and data scales. We train OLMo-2-based LMs ranging from 30M to 3B parameters on up to 100B tokens of DCLM data, while varying both pretraining data scale (1-150x the number of parameters) and retrieval store size (1-20x), and evaluate performance across a diverse suite of benchmarks spanning reasoning, scientific QA, and open-domain QA. We find that retrieval consistently improves performance over parametric-only baselines across model scales and introduce a three-dimensional scaling framework that models performance as a function of model size, pretraining tokens, and retrieval corpus size. This scaling manifold enables us to estimate optimal allocations of a fixed data budget between pretraining and retrieval, revealing that the marginal utility of retrieval depends strongly on model scale, task type, and the degree of pretraining saturation. Our results provide a quantitative foundation for understanding when and how retrieval should complement pretraining, offering practical guidance for allocating data resources in the design of scalable language modeling systems.

Keywords: Retrieval-augmented generation, RAG, pretraining, scaling laws, language models, parametric knowledge, non-parametric knowledge, data budget allocation

79. ❌ AutoEG: Exploiting Known Third-Party Vulnerabilities in Black-Box Web Applications

Authors: Ruozhao Yang, Mingfei Cheng, Gelei Deng, Junjie Wang, Tianwei Zhang, Xiaofei Xie Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00704v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

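Scoring rationale: The paper focuses on automated exploit generation in cybersecurity, using a multi-agent framework to address third-party component vulnerabilities in black-box web applications. Its content has no direct connection to any of the scoring keywords: (1) it does not involve large models, deep learning techniques, or AI-for-science applications; (2) it does not discuss model architecture, training methods, inference optimization, alignment, or other large-model topics; (3) although it uses a multi-agent framework, the agents are treated as conventional cybersecurity agents rather than LLM-driven agents. All keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes AutoEG, a multi-agent framework that automatically generates reliable exploits for known third-party vulnerabilities in black-box web applications. Evaluated on 104 real-world vulnerabilities, it achieves an average success rate of 82.41%, substantially outperforming existing methods.

Abstract (Chinese translation)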

大规模网络应用广泛部署了复杂的第三方组件，继承了组件漏洞带来的安全风险。因此需要通过安全评估来确定此类已知漏洞在实际应用中是否仍具有可被利用的潜在威胁。渗透测试是一种广泛采用的方法，它通过对真实世界黑盒系统中的已知漏洞发起具体攻击来验证其可利用性。然而，现有方法往往无法自动生成可靠的漏洞利用程序，限制了其在实际安全评估中的有效性。这一局限主要源于两个问题：（1）以正确的技术细节精确触发漏洞，以及（2）使漏洞利用程序适应多样化的实际部署环境。
本文提出AutoEG，一个面向黑盒网络应用的全自动多智能体漏洞利用生成框架。AutoEG包含两个阶段：首先，AutoEG从非结构化的漏洞信息中提取精确的漏洞触发逻辑，并将其封装为可复用的触发函数。其次，AutoEG利用触发函数实现具体的攻击目标，并通过与目标应用的反馈驱动交互迭代优化漏洞利用程序。我们在104个真实漏洞与29个攻击目标上评估AutoEG，共形成660项利用任务和55,440次利用尝试。AutoEG的平均成功率达到了82.41%，显著优于现有最佳基线方法——其最高成功率仅为32.88%。

Abstract

Large-scale web applications are widely deployed with complex third-party components, inheriting security risks arising from component vulnerabilities. Security assessment is therefore required to determine whether such known vulnerabilities remain practically exploitable in real applications. Penetration testing is a widely adopted approach that validates exploitability by launching concrete attacks against known vulnerabilities in real-world black-box systems. However, existing approaches often fail to automatically generate reliable exploits, limiting their effectiveness in practical security assessment. This limitation mainly stems from two issues: (1) precisely triggering vulnerabilities with correct technical details, and (2) adapting exploits to diverse real-world deployment settings. In this paper, we propose AutoEG, a fully automated multi-agent framework for exploit generation targeting black-box web applications. AutoEG has two phases: First, AutoEG extracts precise vulnerability trigger logic from unstructured vulnerability information and encapsulates it into reusable trigger functions. Second, AutoEG uses trigger functions for concrete attack objectives and iteratively refines exploits through feedback-driven interaction with the target application. We evaluate AutoEG on 104 real-world vulnerabilities with 29 attack objectives, resulting in 660 exploitation tasks and 55,440 exploit attempts. AutoEG achieves an average success rate of 82.41%, substantially outperforming state-of-the-art baselines, whose best performance reaches only 32.88%.

Keywords: AutoEG, exploit generation, black-box web applications, third-party vulnerabilities, multi-agent framework, security assessment, penetration testing, vulnerability trigger

80. ❌ Learning to Hint for Reinforcement Learning

Authors: Yu Xia, Canwen Xu, Zhewei Yao, Julian McAuley, Yuxiong He Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00698v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

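Scoring rationale: The paper 'Learning to Hint for Reinforcement Learning' focuses on a methodological innovation in reinforcement learning (RL). It proposes the HiLL framework to address advantage collapse in GRPO, jointly training a hinter policy and a reasoner policy to generate adaptive hints. All scoring keywords concern large models, the technical principles of deep learning, or scientific applications, whereas the core of this paper is an improvement to an RL algorithm; it does not address large models, language models, model training techniques, inference optimization, agent systems, or AI-for-science applications, so all keyword relevance scores are 0.

!!! tip deepseek-chat TL;DR

The paper addresses the advantage collapse problem of GRPO in reinforcement learning by proposing the HiLL framework, which jointly trains an adaptive hinter policy and a reasoner policy and introduces hint reliance to optimize hint generation. Experiments show that it outperforms GRPO and existing hint-based baselines on multiple benchmarks.

Abstract (Chinese translation)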

群体相对策略优化（GRPO）广泛用于具有可验证奖励的强化学习，但常受优势崩溃问题困扰：当群体中所有轨迹获得相同奖励时，群体产生零相对优势，从而导致学习信号缺失。例如，若问题对推理器难度过高，所有采样轨迹可能均错误并得到零奖励。近期研究通过向此类难题添加提示或辅助支架来解决该问题，使推理器产生差异化结果并恢复非零更新。然而，现有提示通常是固定的而非适配当前推理器，且在提示输入下产生学习信号的提示未必能提升测试时使用的无提示策略。为此，我们提出强化学习提示学习框架（HiLL），该框架在强化学习过程中联合训练提示器策略与推理器策略。针对每个难题，提示器根据当前推理器的错误轨迹在线生成条件化提示，使提示生成能适配推理器动态演化的错误。我们进一步提出提示依赖性指标，用于衡量正确提示轨迹对提示的依赖程度。通过推导可迁移性理论结果，我们证明较低的提示依赖性意味着从提示成功到无提示成功的更强迁移能力，并利用该结果为提示器训练定义迁移加权奖励。因此，HiLL倾向于选择既能恢复信息性GRPO群体，又能产生更可能改进原始无提示策略的学习信号的提示。在多基准测试上的实验表明，HiLL持续优于GRPO及现有基于提示的基线方法，证明了自适应且具备迁移感知的提示学习对强化学习的价值。代码发布于https://github.com/Andree-9/HiLL。

Abstract

Group Relative Policy Optimization (GRPO) is widely used for reinforcement learning with verifiable rewards, but it often suffers from advantage collapse: when all rollouts in a group receive the same reward, the group yields zero relative advantage and thus no learning signal. For example, if a question is too hard for the reasoner, all sampled rollouts can be incorrect and receive zero reward. Recent work addresses this issue by adding hints or auxiliary scaffolds to such hard questions so that the reasoner produces mixed outcomes and recovers a non-zero update. However, existing hints are usually fixed rather than adapted to the current reasoner, and a hint that creates learning signal under the hinted input does not necessarily improve the no-hint policy used at test time. To this end, we propose Hint Learning for Reinforcement Learning (HiLL), a framework that jointly trains a hinter policy and a reasoner policy during RL. For each hard question, the hinter generates hints online conditioned on the current reasoner’s incorrect rollout, allowing hint generation to adapt to the reasoner’s evolving errors. We further introduce hint reliance, which measures how strongly correct hinted trajectories depend on the hint. We derive a transferability result showing that lower hint reliance implies stronger transfer from hinted success to no-hint success, and we use this result to define a transfer-weighted reward for training the hinter. Therefore, HiLL favors hints that not only recover informative GRPO groups, but also produce signals that are more likely to improve the original no-hint policy. Experiments across multiple benchmarks show that HiLL consistently outperforms GRPO and prior hint-based baselines, demonstrating the value of adaptive and transfer-aware hint learning for RL. The code is available at https://github.com/Andree-9/HiLL.

Keywords: Reinforcement Learning, Hint Learning, GRPO, Advantage Collapse, Adaptive Hints, Transferability, Policy Optimization, Reasoner Policy
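The advantage collapse described in the abstract follows directly from how group-relative advantages are computed. A minimal sketch, assuming the commonly used normalisation (reward minus group mean, divided by group standard deviation plus a small `eps`); the helper names are hypothetical and this is not the HiLL code:

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each rollout's reward against its own group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


def has_learning_signal(rewards):
    """A group is informative only if its rollouts received different rewards."""
    return len(set(rewards)) > 1


mixed = group_relative_advantages([1, 0, 0, 1])      # some rollouts correct
collapsed = group_relative_advantages([0, 0, 0, 0])  # question too hard: all wrong

print(mixed, collapsed)
print(has_learning_signal([1, 0, 0, 1]), has_learning_signal([0, 0, 0, 0]))
```

When every rollout in a group fails (or every rollout succeeds), each advantage is zero and the policy gradient for that question vanishes, which is the gap that hints are meant to close by producing mixed outcomes.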

81. ❌ Internal APIs Are All You Need: Shadow APIs, Shared Discovery, and the Case Against Browser-First Agent Architectures

Authors: Lewis Tham, Nicholas Mac Gregor Garcia, Jungpil Hahn Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00694v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

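Scoring rationale: The paper studies the architecture through which autonomous agents interact with the web. It proposes Unbrowse as an alternative to browser-based agent architectures, calling websites' internal APIs directly through a shared route graph. This is highly relevant to 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (10 points), because the paper's core aim is to make autonomous agents' web interaction more efficient. It is also highly relevant to 'Tool Use OR Function Calling OR API Tool Use' (10 points), because the system in effect uses API calls as tools in place of browser interaction. It has some relevance to 'Multi-agent Systems OR Agent Coordination' (5 points), because the shared route graph can be used by multiple agents and involves coordination among them (a shared cache avoids repeated discovery). Other keywords, such as the technical principles of large models, training methods, inference optimization, and AI-for-science applications, are not addressed and score 0.

!!! tip deepseek-chat TL;DR

The paper targets the inefficiency of autonomous agents that rely on browser-based architectures to interact with the web. It proposes Unbrowse, which calls websites' internal APIs directly through a shared route graph and executes on average 3.6 times faster than browser automation.

Abstract (Chinese translation)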

自主代理与网络的交互日益频繁，然而大多数网站仍为人类浏览器设计——这一根本性不匹配是新兴“代理化网络”必须解决的问题。代理需要反复浏览页面、检查DOM并逆向工程可调用路由——这一过程缓慢、脆弱且在不同代理间冗余重复。我们观察到，每个现代网站其实都已在其用户界面背后暴露了内部API（有时称为影子API），即支撑网站自身功能的第一方端点。本文提出Unbrowse，这是一个共享路由图谱，将基于浏览器的路由发现转化为对这些可调用第一方接口的集体维护索引。该系统从真实浏览流量中被动学习路由，并通过直接API调用提供缓存路由。在涵盖94个域名的等效信息检索任务单主机实时网络基准测试中，完全预热缓存执行平均耗时950毫秒，而Playwright浏览器自动化需3,404毫秒（平均加速3.6倍，中位数加速5.4倍），良好缓存的路由可在100毫秒内完成。三级执行模型——本地缓存、共享图谱或浏览器回退机制——确保系统具有自愿性和自修正能力。通过x402协议实施的三层微支付模型，对图谱查询按次收取搜索费（第三层），对发现文档收取一次性安装费（第一层），并为选择加入的网站所有者提供可选的按执行次数收费机制（第二层）。所有层级的设定均基于理性采纳的必要条件：代理仅在总费用低于浏览器重新发现的预期成本时，才会使用共享图谱。

Abstract

Autonomous agents increasingly interact with the web, yet most websites remain designed for human browsers – a fundamental mismatch that the emerging "Agentic Web" must resolve. Agents must repeatedly browse pages, inspect DOMs, and reverse-engineer callable routes – a process that is slow, brittle, and redundantly repeated across agents. We observe that every modern website already exposes internal APIs (sometimes called shadow APIs) behind its user interface – first-party endpoints that power the site's own functionality. We present Unbrowse, a shared route graph that transforms browser-based route discovery into a collectively maintained index of these callable first-party interfaces. The system passively learns routes from real browsing traffic and serves cached routes via direct API calls. In a single-host live-web benchmark of equivalent information-retrieval tasks across 94 domains, fully warmed cached execution averaged 950 ms versus 3,404 ms for Playwright browser automation (3.6× mean speedup, 5.4× median), with well-cached routes completing in under 100 ms. A three-path execution model – local cache, shared graph, or browser fallback – ensures the system is voluntary and self-correcting. A three-tier micropayment model via the x402 protocol charges per-query search fees for graph lookups (Tier 3), a one-time install fee for discovery documentation (Tier 1), and optional per-execution fees for site owners who opt in (Tier 2). All tiers are grounded in a necessary condition for rational adoption: an agent uses the shared graph only when the total fee is lower than the expected cost of browser rediscovery.

Keywords: autonomous agents, web interaction, internal APIs, shadow APIs, route discovery, browser automation, API calls, agentic web
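The three-path execution model and the adoption condition stated in the abstract can be sketched as follows. This is an illustrative reading, not the Unbrowse API: `resolve_route`, `should_use_shared_graph`, the dictionary-based cache, and the callback signatures are all hypothetical.

```python
from typing import Callable, Dict, Optional, Tuple


def resolve_route(
    task_key: str,
    local_cache: Dict[str, str],
    shared_graph_lookup: Callable[[str], Optional[str]],
    browser_discover: Callable[[str], str],
) -> Tuple[str, str]:
    """Try local cache, then the shared graph, then fall back to the browser."""
    route = local_cache.get(task_key)
    if route is not None:
        return route, "local_cache"
    route = shared_graph_lookup(task_key)
    if route is not None:
        local_cache[task_key] = route      # remember for next time
        return route, "shared_graph"
    route = browser_discover(task_key)     # slow path: rediscover via browsing
    local_cache[task_key] = route
    return route, "browser_fallback"


def should_use_shared_graph(total_fee: float, expected_rediscovery_cost: float) -> bool:
    """Adoption condition from the abstract: pay only if cheaper than rediscovery."""
    return total_fee < expected_rediscovery_cost


# Toy usage with in-memory stand-ins for the shared graph and the browser.
graph = {"search:example": "/api/v1/search"}
cache: Dict[str, str] = {}
print(resolve_route("search:example", cache, graph.get, lambda k: "/discovered/" + k))
print(resolve_route("search:example", cache, graph.get, lambda k: "/discovered/" + k))
print(resolve_route("cart:example", cache, graph.get, lambda k: "/discovered/" + k))

# The reported mean speedup is the ratio of the two reported mean latencies.
print(round(3404 / 950, 1))
```

Because the browser path always remains available, an agent that finds a stale or overpriced route loses nothing by falling back, which is one way to read the abstract's claim that the system is voluntary and self-correcting.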

82. ❌ Procela: Epistemic Governance in Mechanistic Simulations Under Structural Uncertainty

Authors: Kinson Vernet Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00675v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

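Scoring rationale: The Procela paper focuses on a mechanistic simulation framework involving epistemic governance, structural uncertainty, and hypothesis testing, but it does not mention large models, deep learning, or AI techniques. All keywords concern large-model techniques, training methods, inference optimization, or AI applications, whereas this paper belongs to computational science and simulation and is unrelated to the given keywords.

!!! tip deepseek-chat TL;DR

The paper proposes the Procela framework, which handles structural uncertainty in mechanistic simulations through epistemic governance and runtime topology mutation, achieving a 20.4% error reduction and a 69% cumulative regret improvement in a simulation of antimicrobial resistance spread.

Abstract (Chinese translation)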

机制模拟通常采用固定的本体论设定：变量、因果关系与解析策略均为静态。这一假设在真实因果结构存在争议或无法识别时即告失效——例如在抗菌素耐药性（AMR）传播研究中，接触传播、环境传播与选择压力三类本体论范式相互竞争。我们提出Procela框架，该Python框架将变量构建为具有完整假设记忆的认知权威体，使机制能以因果单元形式编码竞争性本体论，并通过治理模块实时观测认知信号并动态改变系统拓扑结构。这是首个能够对自身假设进行检验的模拟框架。我们在医院网络场景中针对AMR问题实例化了包含三个竞争家族的Procela模型。治理模块成功检测到覆盖度衰减与策略脆弱性，并执行了结构探针测试。实验结果显示，相较于基线方法，该框架实现了20.4%的误差降低与69%的累积遗憾改善。所有实验均具备完全可审计性与可复现性。Procela开创了模拟研究新范式：不仅模拟客观世界，更对其自身的建模过程进行建模，从而实现在结构不确定性条件下的自适应演进。

Abstract

Mechanistic simulations typically assume fixed ontologies: variables, causal relationships, and resolution policies are static. This assumption fails when the true causal structure is contested or unidentifiable, as in antimicrobial resistance (AMR) spread, where contact, environmental, and selection ontologies compete. We introduce Procela, a Python framework where variables act as epistemic authorities that maintain complete hypothesis memory, mechanisms encode competing ontologies as causal units, and governance observes epistemic signals and mutates system topology at runtime. This is the first framework where simulations test their own assumptions. We instantiate Procela for AMR in a hospital network with three competing families. Governance detects coverage decay, policy fragility, and runs structural probes. Results show 20.4% error reduction and 69% cumulative regret improvement over baseline. All experiments are reproducible with full auditability. Procela establishes a new paradigm: simulations that model not only the world but their own modeling process, enabling adaptation under structural uncertainty.

Keywords: mechanistic simulations, epistemic governance, structural uncertainty, causal ontologies, runtime topology mutation, antimicrobial resistance, hypothesis memory, auditability

83. ❌ Streaming Model Cascades for Semantic SQL

Authors: Paweł Liskowski, Kyle Schmaus Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00660v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

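Scoring rationale: The paper's core topic is efficient inference for large language models in semantic SQL queries. It proposes two model cascade algorithms (SUPG-IT and GAMCAL) to reduce LLM inference cost in data warehouses, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). Because it improves inference efficiency through model cascades, it has some relevance to 'Speculative Decoding OR Inference Acceleration' (5 points), although it does not study speculative decoding directly. Other keywords, such as MoE, SFT, RAG, and quantization, are not addressed and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the high inference cost of invoking large language models from semantic SQL queries in data warehouses. It proposes two model cascade algorithms suited to distributed streaming execution, which achieve F1 > 0.95 on six datasets and balance cost against quality.

Abstract (Chinese translation)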

现代数据仓库通过语义运算符扩展了SQL功能，这些运算符可在每个符合条件的行上调用大语言模型，但逐行推理的成本在大规模场景中过高。模型级联通过将大多数行路由至快速代理模型处理，并将不确定案例委托给昂贵的权威模型，从而降低了这一成本。然而，现有框架需要全局数据集访问权限且仅优化单一质量指标，这限制了其在数据被分割于独立工作节点的分布式系统中的适用性。本文提出了两种自适应级联算法，专为流式分区执行设计——每个工作节点独立处理其分区数据，无需节点间通信。SUPG-IT将SUPG统计框架扩展至流式执行场景，通过迭代阈值优化提供联合精确率-召回率保证。GAMCAL则用学习型校准模型替代用户指定的质量目标：广义可加模型将代理分数映射为带有不确定性量化的校准概率，从而通过单一参数直接优化成本-质量权衡。在生产级语义SQL引擎中对六个数据集的实验表明，两种算法在所有数据集上均实现了F1分数>0.95。在成本敏感的操作点上，GAMCAL每次调用权威模型可获得更高的F1分数；而SUPG-IT在精确率与召回率的形式化保证下，能达到更高的质量上限。

Abstract

Modern data warehouses extend SQL with semantic operators that invoke large language models on each qualifying row, but the per-row inference cost is prohibitive at scale. Model cascades reduce this cost by routing most rows through a fast proxy model and delegating uncertain cases to an expensive oracle. Existing frameworks, however, require global dataset access and optimize a single quality metric, limiting their applicability in distributed systems where data is partitioned across independent workers. We present two adaptive cascade algorithms designed for streaming, per-partition execution in which each worker processes its partition independently without inter-worker communication. SUPG-IT extends the SUPG statistical framework to streaming execution with iterative threshold refinement and joint precision-recall guarantees. GAMCAL replaces user-specified quality targets with a learned calibration model: a Generalized Additive Model maps proxy scores to calibrated probabilities with uncertainty quantification, enabling direct optimization of a cost-quality tradeoff through a single parameter. Experiments on six datasets in a production semantic SQL engine show that both algorithms achieve F1 > 0.95 on every dataset. GAMCAL achieves higher F1 per oracle call at cost-sensitive operating points, while SUPG-IT reaches a higher quality ceiling with formal guarantees on precision and recall.

Keywords: model cascades, semantic SQL, large language models, inference cost, streaming execution, distributed systems, SUPG-IT, GAMCAL
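
The cascade idea in the abstract above (a cheap proxy handles most rows, an expensive oracle handles the uncertain ones) can be sketched as follows. This is only an illustration with fixed thresholds, not SUPG-IT or GAMCAL: the function names, the `low`/`high` thresholds, and the toy scorers are all assumptions made for the example, whereas the paper's algorithms choose the routing adaptively.

```python
from typing import Callable, List, Tuple


def cascade_partition(
    rows: List[str],
    proxy_score: Callable[[str], float],
    oracle_label: Callable[[str], bool],
    low: float = 0.2,   # hypothetical: at or below this, trust the proxy's "no"
    high: float = 0.8,  # hypothetical: at or above this, trust the proxy's "yes"
) -> Tuple[List[bool], int]:
    """Label one partition, calling the oracle only for uncertain rows."""
    labels: List[bool] = []
    oracle_calls = 0
    for row in rows:
        score = proxy_score(row)
        if score >= high:
            labels.append(True)
        elif score <= low:
            labels.append(False)
        else:
            # Uncertain band: pay for the expensive model.
            labels.append(oracle_label(row))
            oracle_calls += 1
    return labels, oracle_calls


# Toy stand-ins for the proxy and oracle models (assumptions, not real LLM calls).
def toy_proxy(row: str) -> float:
    return min(1.0, row.count("refund") / 2)


def toy_oracle(row: str) -> bool:
    return "refund" in row


rows = ["refund refund please", "hello there", "refund maybe"]
labels, calls = cascade_partition(rows, toy_proxy, toy_oracle)
print(labels, calls)
```

Because the function touches only the rows it is given, it could run independently on each partition, which is the execution model the abstract targets.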

84. ❌ UniMixer: A Unified Architecture for Scaling Laws in Recommendation Systems

Authors: Mingming Ha, Guanchen Wang, Linxun Chen, Xuan Rao, Yuexin Shi, Tianbao Ma, Zhaojie Liu, Yunqian Fan, Zilong Lu, Yanan Niu, Han Li, Kun Gai | Source: arXiv | Published: 2026-04-01 | arXiv link: http://arxiv.org/abs/2604.00590v1

Score: 0.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

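Scoring rationale: The paper focuses on scaling laws and architecture design for recommendation systems and is unrelated to most of the LLM technology keywords. It is only partially relevant to 'Scaling Laws AND Data Quality' (5 points): it studies scaling laws for recommendation models but does not explicitly address data quality. No other keyword is mentioned or relevant.

!!! tip deepseek-chat TL;DR

The paper proposes UniMixer, a unified scaling architecture for recommendation systems. A parameterized feature-mixing module and a lightweight variant improve scaling efficiency, and the paper establishes a theoretical framework connecting the mainstream scaling approaches.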

Abstract

In recent years, the scaling laws of recommendation models have attracted increasing attention, which govern the relationship between performance and parameters/FLOPs of recommenders. Currently, there are three mainstream architectures for achieving scaling in recommendation models, namely attention-based, TokenMixer-based, and factorization-machine-based methods, which exhibit fundamental differences in both design philosophy and architectural structure. In this paper, we propose a unified scaling architecture for recommendation systems, namely UniMixer, to improve scaling efficiency and establish a unified theoretical framework that unifies the mainstream scaling blocks. By transforming the rule-based TokenMixer to an equivalent parameterized structure, we construct a generalized parameterized feature mixing module that allows the token mixing patterns to be optimized and learned during model training. Meanwhile, the generalized parameterized token mixing removes the constraint in TokenMixer that requires the number of heads to be equal to the number of tokens. Furthermore, we establish a unified scaling module design framework for recommender systems, which bridges the connections among attention-based, TokenMixer-based, and factorization-machine-based methods. To further boost scaling ROI, a lightweight UniMixing module is designed, UniMixing-Lite, which further compresses the model parameters and computational cost while significantly improving the model performance. The scaling curves are shown in the following figure. Extensive offline and online experiments are conducted to verify the superior scaling abilities of UniMixer.

Keywords: recommendation systems, scaling laws, UniMixer, token mixing, unified architecture, parameterized feature mixing, scaling efficiency, model compression

85. ❌ HabitatAgent: An End-to-End Multi-Agent System for Housing Consultation

Authors: Hongyang Yang, Yanxin Zhang, Yang She, Yue Xiao, Hao Wu, Yiyang Zhang, Jiapeng Hou, Rongshan Zhang | Source: arXiv | Published: 2026-04-01 | arXiv link: http://arxiv.org/abs/2604.00556v1

Score: 0.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

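Scoring rationale: The paper's core is an LLM-powered multi-agent system (HabitatAgent) for housing consultation. Directly relevant keywords: LLMs (the core foundation), LLM Agents (the nature of the system), Multi-agent Systems (the architecture), and Retrieval-Augmented Generation (it uses GraphRAG). Hallucination Mitigation is partially relevant through the Validation Agent (5 points). Other keywords such as MoE, SFT, and RLHF are not covered.

!!! tip deepseek-chat TL;DR

The paper presents HabitatAgent, an LLM-based multi-agent system for end-to-end housing consultation. Through collaboration among specialized agents it reaches 95% accuracy, compared with 75% for a strong single-stage baseline.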

Abstract

Housing selection is a high-stakes and largely irreversible decision problem. We study housing consultation as a decision-support interface for housing selection. Existing housing platforms and many LLM-based assistants often reduce this process to ranking or recommendation, resulting in opaque reasoning, brittle multi-constraint handling, and limited guarantees on factuality. We present HabitatAgent, the first LLM-powered multi-agent architecture for end-to-end housing consultation. HabitatAgent comprises four specialized agent roles: Memory, Retrieval, Generation, and Validation. The Memory Agent maintains multi-layer user memory through internal stages for constraint extraction, memory fusion, and verification-gated updates; the Retrieval Agent performs hybrid vector–graph retrieval (GraphRAG); the Generation Agent produces evidence-referenced recommendations and explanations; and the Validation Agent applies multi-tier verification and targeted remediation. Together, these agents provide an auditable and reliable workflow for end-to-end housing consultation. We evaluate HabitatAgent on 100 real user consultation scenarios (300 multi-turn question–answer pairs) under an end-to-end correctness protocol. A strong single-stage baseline (Dense+Rerank) achieves 75% accuracy, while HabitatAgent reaches 95%.

Keywords: LLM-powered, multi-agent system, housing consultation, end-to-end, GraphRAG, validation agent, auditable workflow, real user scenarios

86. ❌ Ontology-Constrained Neural Reasoning in Enterprise Agentic Systems: A Neurosymbolic Architecture for Domain-Grounded AI Agents

Authors: Thanh Luong Tuan | Source: arXiv | Published: 2026-04-01 | arXiv link: http://arxiv.org/abs/2604.00555v1

Score: 0.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

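Scoring rationale: The paper's core is enterprise LLM agent systems, using a neurosymbolic architecture to address hallucination, domain drift, and compliance. Highly relevant keywords: LLMs (the paper explicitly builds enterprise agents on LLMs), LLM Agents (it studies enterprise agent systems), Hallucination Mitigation (it directly targets hallucination), and Tool Use (it involves a tool discovery mechanism). Other keywords, including MoE, SLMs, Scaling Laws, training methods, inference optimization, model compression, and AI for Science, are not covered or mentioned in the paper.

!!! tip deepseek-chat TL;DR

The paper proposes a neurosymbolic architecture based on ontology-constrained neural reasoning to address hallucination, domain drift, and compliance problems in enterprise LLM agents. Experiments show it significantly outperforms ungrounded agents on accuracy, compliance, and role consistency.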

Abstract

Enterprise adoption of Large Language Models (LLMs) is constrained by hallucination, domain drift, and the inability to enforce regulatory compliance at the reasoning level. We present a neurosymbolic architecture implemented within the Foundation AgenticOS (FAOS) platform that addresses these limitations through ontology-constrained neural reasoning. Our approach introduces a three-layer ontological framework–Role, Domain, and Interaction ontologies–that provides formal semantic grounding for LLM-based enterprise agents. We formalize the concept of asymmetric neurosymbolic coupling, wherein symbolic ontological knowledge constrains agent inputs (context assembly, tool discovery, governance thresholds) while proposing mechanisms for extending this coupling to constrain agent outputs (response validation, reasoning verification, compliance checking). We evaluate the architecture through a controlled experiment (600 runs across five industries: FinTech, Insurance, Healthcare, Vietnamese Banking, and Vietnamese Insurance), finding that ontology-coupled agents significantly outperform ungrounded agents on Metric Accuracy (p < .001, W = .460), Regulatory Compliance (p = .003, W = .318), and Role Consistency (p < .001, W = .614), with improvements greatest where LLM parametric knowledge is weakest–particularly in Vietnam-localized domains. Our contributions include: (1) a formal three-layer enterprise ontology model, (2) a taxonomy of neurosymbolic coupling patterns, (3) ontology-constrained tool discovery via SQL-pushdown scoring, (4) a proposed framework for output-side ontological validation, (5) empirical evidence for the inverse parametric knowledge effect that ontological grounding value is inversely proportional to LLM training data coverage of the domain, and (6) a production system serving 21 industry verticals with 650+ agents.

Keywords: Large Language Models, Enterprise Agents, Neurosymbolic Architecture, Ontology-Constrained Reasoning, Hallucination Mitigation, Regulatory Compliance, Tool Discovery, Domain Grounding

Authors: Yao Qin, Yangyang Yan, Jinhua Pang, Xiaoming Zhang | Source: arXiv | Published: 2026-04-01 | arXiv link: http://arxiv.org/abs/2604.00550v1

Score: 0.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

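Scoring rationale: The paper's core is an AI agent system for scientific discovery (BloClaw), which directly involves applying LLMs to science (AI for Science), agentic workflows (LLM Agents), tool calling (Tool Use), and retrieval-augmented generation (RAG). The abstract explicitly mentions LLMs, an Agentic Workspace, tool-calling protocols, Retrieval-Augmented Generation (RAG), and AI for Science (AI4S). Other keywords such as MoE, Scaling Laws, training methods, inference optimization, and alignment techniques are not covered in the paper, so they score 0.

!!! tip deepseek-chat TL;DR

The paper targets the fragility of existing AI Scientist frameworks in tool calling, execution environments, and interfaces. It proposes BloClaw, a multimodal operating system built on an XML-Regex routing protocol, a runtime state interception sandbox, and a dynamic viewport UI, offering a highly robust, self-evolving paradigm for computational research assistants in scientific discovery.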

Abstract

The integration of Large Language Models (LLMs) into life sciences has catalyzed the development of “AI Scientists.” However, translating these theoretical capabilities into deployment-ready research environments exposes profound infrastructural vulnerabilities. Current frameworks are bottlenecked by fragile JSON-based tool-calling protocols, easily disrupted execution sandboxes that lose graphical outputs, and rigid conversational interfaces inherently ill-suited for high-dimensional scientific data. We introduce BloClaw, a unified, multi-modal operating system designed for Artificial Intelligence for Science (AI4S). BloClaw reconstructs the Agent-Computer Interaction (ACI) paradigm through three architectural innovations: (1) An XML-Regex Dual-Track Routing Protocol that statistically eliminates serialization failures (0.2% error rate vs. 17.6% in JSON); (2) A Runtime State Interception Sandbox that utilizes Python monkey-patching to autonomously capture and compile dynamic data visualizations (Plotly/Matplotlib), circumventing browser CORS policies; and (3) A State-Driven Dynamic Viewport UI that morphs seamlessly between a minimalist command deck and an interactive spatial rendering engine. We comprehensively benchmark BloClaw across cheminformatics (RDKit), de novo 3D protein folding via ESMFold, molecular docking, and autonomous Retrieval-Augmented Generation (RAG), establishing a highly robust, self-evolving paradigm for computational research assistants. The open-source repository is available at https://github.com/qinheming/BloClaw.

Keywords: Large Language Models, AI for Science, Agentic Workspace, Tool Calling, Retrieval-Augmented Generation, Multi-modal Operating System, Scientific Discovery, Computational Research Assistants
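
To make the contrast between JSON tool calls and tag-based routing concrete, here is a rough sketch of extracting tool calls from model output with a regular expression. The abstract does not specify BloClaw's wire format, so the `<tool name="...">` tag, the `extract_tool_calls` helper, and the sample tool name are hypothetical; the sketch only shows the general technique.

```python
import re
from typing import Dict, List

# Hypothetical tag format: <tool name="some.tool">raw payload</tool>
TOOL_CALL = re.compile(
    r'<tool\s+name="(?P<name>[\w.-]+)"\s*>(?P<body>.*?)</tool>',
    re.DOTALL,  # let the payload span multiple lines
)


def extract_tool_calls(model_output: str) -> List[Dict[str, str]]:
    """Return every tool call found in the text, in order of appearance."""
    return [
        {"name": m.group("name"), "body": m.group("body").strip()}
        for m in TOOL_CALL.finditer(model_output)
    ]


reply = 'I will compute it.\n<tool name="rdkit.descriptors">\nCCO\n</tool>\nDone.'
calls = extract_tool_calls(reply)
print(calls)
```

One plausible reason such a format is more forgiving than JSON is that the payload is taken verbatim, so unescaped quotes or newlines in generated code do not invalidate the whole message.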

88. ❌ Does Unification Come at a Cost? Uni-SafeBench: A Safety Benchmark for Unified Multimodal Large Models

Authors: Zixiang Peng, Yongxiu Xu, Qinyi Zhang, Jiexun Shen, Yifan Zhang, Hongbo Xu, Yubin Wang, Gaopeng Gou | Source: arXiv | Published: 2026-04-01 | arXiv link: http://arxiv.org/abs/2604.00547v1

Score: 0.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

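Scoring rationale: The paper's core is the safety of Unified Multimodal Large Models (UMLMs). It is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points) because UMLMs extend LLM architectures. The paper does not cover or discuss the other keywords, such as MoE, SLMs, Scaling Laws, training methods (pre-training, fine-tuning, alignment), inference optimization, agent systems, model compression, or AI for Science applications, so they score 0.

!!! tip deepseek-chat TL;DR

The paper studies the safety risks that architectural unification introduces in Unified Multimodal Large Models (UMLMs). It finds that unification improves capability but significantly degrades the inherent safety of the underlying LLM, and that open-source UMLMs are far less safe than multimodal large models specialized for either generation or understanding.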

Abstract

Unified Multimodal Large Models (UMLMs) integrate understanding and generation capabilities within a single architecture. While this architectural unification, driven by the deep fusion of multimodal features, enhances model performance, it also introduces important yet underexplored safety challenges. Existing safety benchmarks predominantly focus on isolated understanding or generation tasks, failing to evaluate the holistic safety of UMLMs when handling diverse tasks under a unified framework. To address this, we introduce Uni-SafeBench, a comprehensive benchmark featuring a taxonomy of six major safety categories across seven task types. To ensure rigorous assessment, we develop Uni-Judger, a framework that effectively decouples contextual safety from intrinsic safety. Based on comprehensive evaluations across Uni-SafeBench, we uncover that while the unification process enhances model capabilities, it significantly degrades the inherent safety of the underlying LLM. Furthermore, open-source UMLMs exhibit much lower safety performance than multimodal large models specialized for either generation or understanding tasks. We open-source all resources to systematically expose these risks and foster safer AGI development.

Keywords: Unified Multimodal Large Models, UMLMs, safety benchmark, Uni-SafeBench, safety degradation, multimodal fusion, AGI development, contextual safety

89. ❌ Optimsyn: Influence-Guided Rubrics Optimization for Synthetic Data Generation

Authors: Zhiting Fan, Ruizhe Chen, Tianxiang Hu, Ru Peng, Zenan Huang, Haokai Xu, Yixin Chen, Jian Wu, Junbo Zhao, Zuozhu Liu | Source: arXiv | Published: 2026-04-01 | arXiv link: http://arxiv.org/abs/2604.00536v1

Score: 0.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

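Scoring rationale: The paper's core is optimizing SFT data generation for LLMs, so it is highly relevant to 'Large Language Models' and 'Supervised Fine-tuning' (10 points each). It involves data quality assessment, giving partial relevance to 'Scaling Laws AND Data Quality' (5 points). It is applied to domains such as medicine, giving partial relevance to 'AI for Science' (5 points). Other keywords such as MoE, SLMs, RLHF, and RAG are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes an influence-estimation-based optimization framework that uses target-model feedback to optimize the rubrics used in synthetic data generation, significantly improving LLM supervised fine-tuning performance in knowledge-intensive domains.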

Abstract

Large language models (LLMs) achieve strong downstream performance largely due to abundant supervised fine-tuning (SFT) data. However, high-quality SFT data in knowledge-intensive domains such as humanities, social sciences, medicine, law, and finance is scarce because expert curation is expensive, privacy constraints are strict, and label consistency is hard to ensure. Recent work uses synthetic data, typically by prompting a generator over domain documents and filtering outputs with handcrafted rubrics. Yet rubric design is expert-dependent, transfers poorly across domains, and is often optimized through a brittle heuristic loop of writing rubrics, synthesizing data, training, inspecting results, and manually guessing revisions. This process lacks reliable quantitative feedback about how a rubric affects downstream performance. We propose evaluating synthetic data by its training utility on the target model and using this signal to guide data generation. Inspired by influence estimation, we adopt an optimizer-aware estimator that uses gradient information to quantify each synthetic sample’s contribution to a target model’s objective on specific tasks. Our analysis shows that even when synthetic and real samples are close in embedding space, their influence on learning can differ substantially. Based on this insight, we propose an optimization-based framework that adapts rubrics using target-model feedback. We provide lightweight guiding text and use a rubric-specialized model to generate task-conditioned rubrics. Influence score is used as the reward to optimize the rubric generator with reinforcement learning. Experiments across domains, target models, and data generators show consistent improvements and strong generalization without task-specific tuning.

Keywords: synthetic data generation, supervised fine-tuning, large language models, influence estimation, rubrics optimization, knowledge-intensive domains, reinforcement learning, data quality
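
The gradient-based influence idea in the abstract above can be illustrated with a first-order estimate: the inner product between a training sample's gradient and the target task's gradient, scaled by the learning rate. This is a generic plain-SGD sketch on a toy linear model, not the paper's optimizer-aware estimator; the function names, the squared loss, and the learning rate are assumptions for the example.

```python
from typing import List, Tuple

Vector = List[float]
Example = Tuple[Vector, float]  # (features, label)


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def squared_loss_grad(w: Vector, example: Example) -> Vector:
    """Gradient of 0.5 * (w.x - y)^2 with respect to w."""
    x, y = example
    residual = dot(w, x) - y
    return [residual * xi for xi in x]


def first_order_influence(w: Vector, sample: Example, target: Example,
                          lr: float = 0.1) -> float:
    """Estimated drop in target loss after one SGD step on `sample`.

    Positive means the sample is expected to help the target task.
    """
    return lr * dot(squared_loss_grad(w, sample), squared_loss_grad(w, target))


w = [0.0, 0.0]
target = ([1.0, 0.0], 1.0)
helpful = ([1.0, 0.0], 1.0)    # same features, label agrees with the target
harmful = ([1.0, 0.0], -1.0)   # same features, conflicting label

print(first_order_influence(w, helpful, target))
print(first_order_influence(w, harmful, target))
```

The two toy samples share identical features yet receive opposite influence scores, a small-scale version of the abstract's observation that closeness in embedding space does not imply similar training utility.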

90. ❌ MATHENA: Mamba-based Architectural Tooth Hierarchical Estimator and Holistic Evaluation Network for Anatomy

Authors: Kyeonghun Kim, Jaehyung Park, Youngung Han, Anna Jung, Seongbin Park, Sumin Lee, Jiwon Yang, Jiyoon Han, Subeen Lee, Junsu Lim, Hyunsu Go, Eunseob Choi, Hyeonseok Jung, Soo Yong Kim, Woo Kyoung Jeong, Won Jae Lee, Pa Hong, Hyuk-Jae Lee, Ken Ying-Kai Liao, Nam-Joon Kim | Source: arXiv | Published: 2026-04-01 | arXiv link: http://arxiv.org/abs/2604.00537v1

Score: 0.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文MATHENA专注于牙科医学影像分析，提出了一种基于Mamba（状态空间模型）的统一框架，用于牙齿检测、龋齿分割、异常检测和牙齿发育分期。论文的核心是计算机视觉和医学图像分析，特别是牙科领域的应用。所有关键词（共27个）中，只有"AI for Science OR Bioinformatics OR Cheminformatics"与论文直接相关，因为该论文属于AI在生物医学（具体为牙科）领域的应用，符合"AI for Science"的范畴。其他26个关键词均涉及大语言模型（LLM）的技术原理、训练方法、推理优化、对齐、代理系统等，而本文未使用或提及任何语言模型，其技术核心是视觉状态空间模型（VSS）和Mamba架构在图像处理中的应用，与LLM无关。因此，除"AI for Science"关键词得10分（核心相关）外，其余均得0分（完全无关）。

!!! tip deepseek-chat TL;DR

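The study proposes MATHENA, a unified framework built on Mamba state space models that automates tooth detection, caries segmentation, anomaly detection, and dental developmental staging from orthopantomograms. On the PARTHENON benchmark curated by the authors, it reports 93.78% mAP@50 for tooth detection, 90.11% Dice for caries segmentation, 88.35% for anomaly detection, and 72.40% accuracy for developmental staging.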

Abstract

Dental diagnosis from Orthopantomograms (OPGs) requires coordination of tooth detection, caries segmentation (CarSeg), anomaly detection (AD), and dental developmental staging (DDS). We propose Mamba-based Architectural Tooth Hierarchical Estimator and Holistic Evaluation Network for Anatomy (MATHENA), a unified framework leveraging Mamba’s linear-complexity State Space Models (SSM) to address all four tasks. MATHENA integrates MATHE, a multi-resolution SSM-driven detector with four-directional Vision State Space (VSS) blocks for O(N) global context modeling, generating per-tooth crops. These crops are processed by HENA, a lightweight Mamba-UNet with a triple-head architecture and Global Context State Token (GCST). In the triple-head architecture, CarSeg is first trained as an upstream task to establish shared representations, which are then frozen and reused for downstream AD fine-tuning and DDS classification via linear probing, enabling stable, efficient learning. We also curate PARTHENON, a benchmark comprising 15,062 annotated instances from ten datasets. MATHENA achieves 93.78% mAP@50 in tooth detection, 90.11% Dice for CarSeg, 88.35% for AD, and 72.40% ACC for DDS.

Keywords: Mamba, State Space Models, Dental diagnosis, Orthopantomograms, Tooth detection, Caries segmentation, Anomaly detection, Dental developmental staging
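
The caries segmentation result above is reported as a Dice score. As a reminder of what that metric measures, here is a minimal sketch, assuming binary masks given as flat lists of 0/1 values. The function name `dice_coefficient` is illustrative and not taken from the paper, and the paper's exact evaluation protocol may differ.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumption: two empty masks are treated as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total
```

A score of 1.0 means the predicted and reference masks coincide, and 0.0 means they do not overlap at all.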

91. ❌ Think, Act, Build: An Agentic Framework with Vision Language Models for Zero-Shot 3D Visual Grounding

Authors: Haibo Wang, Zihao Lin, Zhiyang Xu, Lifu Huang Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00528v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

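Scoring rationale: The paper proposes TAB, an agentic framework for 3D visual grounding. At its core, a vision-language model (VLM) acts as an agent that dynamically invokes visual tools to track and reconstruct the target. It is therefore highly relevant to "LLM Agents OR Autonomous Agents OR Agentic Workflow" and "Tool Use OR Function Calling OR API Tool Use" (10 each), since the abstract explicitly mentions an "agentic framework" and states that the "VLM agent dynamically invokes visual tools". However, the paper focuses on applying VLMs to 3D vision tasks rather than on LLM fundamentals or innovations. The abstract does not mention LLMs, MoE, scaling laws, training methods (pre-training, post-training, alignment, RLHF, PEFT), inference optimization (RAG, context extension, attention optimization), reasoning techniques (CoT, System 2 thinking, MCTS), model improvement (self-correction, quantization, speculative decoding), interpretability, world models, model merging, in-context learning, or AI for science, so all of those keywords receive 0.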

!!! tip deepseek-chat TL;DR

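The paper proposes TAB, a dynamic agentic framework that reformulates 3D visual grounding as a generative 2D-to-3D reconstruction paradigm and introduces a Semantic-Anchored Geometric Expansion mechanism. In the zero-shot setting it significantly outperforms previous methods and even surpasses fully supervised baselines.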

Abstract

3D Visual Grounding (3D-VG) aims to localize objects in 3D scenes via natural language descriptions. While recent advancements leveraging Vision-Language Models (VLMs) have explored zero-shot possibilities, they typically suffer from a static workflow relying on preprocessed 3D point clouds, essentially degrading grounding into proposal matching. To bypass this reliance, our core motivation is to decouple the task: leveraging 2D VLMs to resolve complex spatial semantics, while relying on deterministic multi-view geometry to instantiate the 3D structure. Driven by this insight, we propose “Think, Act, Build (TAB)”, a dynamic agentic framework that reformulates 3D-VG tasks as a generative 2D-to-3D reconstruction paradigm operating directly on raw RGB-D streams. Specifically, guided by a specialized 3D-VG skill, our VLM agent dynamically invokes visual tools to track and reconstruct the target across 2D frames. Crucially, to overcome the multi-view coverage deficit caused by strict VLM semantic tracking, we introduce the Semantic-Anchored Geometric Expansion, a mechanism that first anchors the target in a reference video clip and then leverages multi-view geometry to propagate its spatial location across unobserved frames. This enables the agent to “Build” the target’s 3D representation by aggregating these multi-view features via camera parameters, directly mapping 2D visual cues to 3D coordinates. Furthermore, to ensure rigorous assessment, we identify flaws such as reference ambiguity and category errors in existing benchmarks and manually refine the incorrect queries. Extensive experiments on ScanRefer and Nr3D demonstrate that our framework, relying entirely on open-source models, significantly outperforms previous zero-shot methods and even surpasses fully supervised baselines.

Keywords: 3D Visual Grounding, Vision Language Models, Agentic Framework, Zero-Shot Learning, Multi-view Geometry, Semantic-Anchored Geometric Expansion, RGB-D Streams, 2D-to-3D Reconstruction
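
The abstract describes mapping 2D visual cues to 3D coordinates via camera parameters. A minimal sketch of that geometric step, assuming a standard pinhole camera model and a camera-to-world pose given as a 3x3 rotation plus a translation, is shown below. The function names and parameters (`fx`, `fy`, `cx`, `cy`) are illustrative assumptions; the paper's actual pipeline is not specified at this level of detail.

```python
def backproject_pixel(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth into camera coordinates (pinhole model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(point_cam, rotation, translation):
    """Apply a camera-to-world pose: p_world = R @ p_cam + t."""
    return tuple(
        sum(rotation[i][j] * point_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
```

Applying both functions to every pixel of a tracked 2D mask in each RGB-D frame, then merging the resulting points, is one straightforward way to aggregate multi-view observations into a single 3D estimate.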

92. ❌ Toward Optimal Sampling Rate Selection and Unbiased Classification for Precise Animal Activity Recognition

Authors: Axiu Mao, Meilu Zhu, Lei Shen, Xiaoshuai Wang, Tomas Norton, Kai Liu Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00517v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

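Scoring rationale: The paper proposes a deep learning network for animal activity recognition (IBA-Net). One of its core contributions is a Mixture-of-Experts (MoE)-based Feature Customization (MFC) module that adaptively fuses data from multiple sampling rates to capture features tailored to different animal behaviors, so it is highly relevant to "Mixture of Experts OR MoE OR Sparse Models" (10). The paper applies AI to science (specifically animal behavior monitoring in agriculture), which gives it some relevance to "AI for Science OR Bioinformatics OR Cheminformatics" (5), although it is not core bioinformatics or cheminformatics. It does not touch on LLMs, training or alignment techniques, inference optimization, agents, or the other keywords, which therefore receive 0.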

!!! tip deepseek-chat TL;DR

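To address insufficient classification accuracy for specific behaviors in animal activity recognition, the paper proposes the Individual-Behavior-Aware Network (IBA-Net). It combines a Mixture-of-Experts (MoE)-based feature customization module with a neural collapse-driven classifier calibration module and outperforms existing methods on three public datasets.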

Abstract

With the rapid advancements in deep learning techniques, wearable sensor-aided animal activity recognition (AAR) has demonstrated promising performance, thereby improving livestock management efficiency as well as animal health and welfare monitoring. However, existing research often prioritizes overall performance, overlooking the fact that classification accuracies for specific animal behavioral categories may remain unsatisfactory. This issue typically stems from suboptimal sampling rates or class imbalance problems. To address these challenges and achieve high classification accuracy across all individual behaviors in farm animals, we propose a novel Individual-Behavior-Aware Network (IBA-Net). This network enhances the recognition of each specific behavior by simultaneously customizing features and calibrating the classifier. Specifically, considering that different behaviors require varying sampling rates to achieve optimal performance, we design a Mixture-of-Experts (MoE)-based Feature Customization (MFC) module. This module adaptively fuses data from multiple sampling rates, capturing customized features tailored to various animal behaviors. Additionally, to mitigate classifier bias toward majority classes caused by class imbalance, we develop a Neural Collapse-driven Classifier Calibration (NC3) module. This module introduces a fixed equiangular tight frame (ETF) classifier during the classification stage, maximizing the angles between pair-wise classifier vectors and thereby improving the classification performance for minority classes. To validate the effectiveness of IBA-Net, we conducted experiments on three public datasets covering goat, cattle, and horse activity recognition. The results demonstrate that our method consistently outperforms existing approaches across all datasets.

Keywords: Animal Activity Recognition, Deep Learning, Mixture-of-Experts, Class Imbalance, Feature Customization, Classifier Calibration, Wearable Sensors, Livestock Management
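
The NC3 module relies on a fixed equiangular tight frame (ETF) classifier. The sketch below builds the standard simplex ETF for K classes, assuming the feature dimension equals K and no extra rotation is applied; whether the paper uses exactly this construction is not stated in the abstract. Every class vector has unit norm, and every pair has cosine similarity -1/(K-1), the largest equal pairwise separation that K vectors can have.

```python
import math


def simplex_etf(num_classes):
    """Rows are class vectors of sqrt(K/(K-1)) * (I - (1/K) * ones * ones^T)."""
    k = num_classes
    if k < 2:
        raise ValueError("an ETF classifier needs at least two classes")
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def cosine(a, b):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Because the vectors are fixed rather than learned, majority classes cannot pull the decision boundaries toward themselves during training, which is the motivation the abstract gives for using an ETF under class imbalance.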

93. ❌ MAESIL: Masked Autoencoder for Enhanced Self-supervised Medical Image Learning

Authors: Kyeonghun Kim, Hyeonseok Jung, Youngung Han, Junsu Lim, YeonJu Jean, Seongbin Park, Eunseob Choi, Hyunsu Go, SeoYoung Ju, Seohyoung Park, Gyeongmin Kim, MinJu Kwon, KyungSeok Yuh, Soo Yong Kim, Ken Ying-Kai Liao, Nam-Joon Kim, Hyuk-Jae Lee Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00514v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

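Scoring rationale: The paper focuses on self-supervised learning for 3D medical imaging and proposes MAESIL, a masked autoencoder framework. It is unrelated to most keywords, which mainly concern LLMs and associated techniques (fine-tuning, alignment, reasoning, agents, and so on). It has some relevance to only two keywords: 1) "Pre-training OR Continual Pre-training OR Domain Adaptation" (5), because it uses self-supervised pre-training to address domain shift in medical imaging; 2) "AI for Science OR Bioinformatics OR Cheminformatics" (8), because it clearly applies AI to biomedicine (medical image analysis), a subfield of AI for Science.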

!!! tip deepseek-chat TL;DR

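To address the scarcity of labeled data in 3D medical imaging (such as CT scans) and the tendency of existing methods to ignore 3D structural information, the paper proposes MAESIL, a self-supervised learning framework based on a 3D masked autoencoder. Experiments on three large-scale public CT datasets show significant improvements over AE, VAE, and VQ-VAE on reconstruction metrics such as PSNR and SSIM.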

Abstract

Training deep learning models for three-dimensional (3D) medical imaging, such as Computed Tomography (CT), is fundamentally challenged by the scarcity of labeled data. While pre-training on natural images is common, it results in a significant domain shift, limiting performance. Self-Supervised Learning (SSL) on unlabeled medical data has emerged as a powerful solution, but prominent frameworks often fail to exploit the inherent 3D nature of CT scans. These methods typically process 3D scans as a collection of independent 2D slices, an approach that fundamentally discards critical axial coherence and the 3D structural context. To address this limitation, we propose the autoencoder for enhanced self-supervised medical image learning (MAESIL), a novel self-supervised learning framework designed to capture 3D structural information efficiently. The core innovation is the ‘superpatch’, a 3D chunk-based input unit that balances 3D context preservation with computational efficiency. Our framework partitions the volume into superpatches and employs a 3D masked autoencoder strategy with a dual-masking strategy to learn comprehensive spatial representations. We validated our approach on three diverse large-scale public CT datasets. Our experimental results show that MAESIL demonstrates significant improvements over existing methods such as AE, VAE and VQ-VAE in key reconstruction metrics such as PSNR and SSIM. This establishes MAESIL as a robust and practical pre-training solution for 3D medical imaging tasks.

Keywords: 3D medical imaging, self-supervised learning, masked autoencoder, CT scans, domain shift, superpatch, pre-training, reconstruction metrics
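
The abstract describes partitioning a CT volume into 3D chunks and masking part of them before reconstruction. The sketch below illustrates only that general idea, assuming non-overlapping chunks and a single level of uniform random masking. The chunk size, mask ratio, and function names are hypothetical, and the paper's dual-masking strategy is not reproduced here because the abstract does not describe it.

```python
import random


def partition_into_chunks(volume_shape, chunk_shape):
    """Return the (z, y, x) start corner of every non-overlapping 3D chunk."""
    for size, chunk in zip(volume_shape, chunk_shape):
        if size % chunk != 0:
            raise ValueError("volume must divide evenly into chunks")
    return [
        (z, y, x)
        for z in range(0, volume_shape[0], chunk_shape[0])
        for y in range(0, volume_shape[1], chunk_shape[1])
        for x in range(0, volume_shape[2], chunk_shape[2])
    ]


def random_mask(corners, mask_ratio, seed=0):
    """Split chunk corners into (visible, masked) lists; masked chunks are hidden from the encoder."""
    rng = random.Random(seed)
    num_masked = int(len(corners) * mask_ratio)
    masked = set(rng.sample(corners, num_masked))
    visible = [c for c in corners if c not in masked]
    return visible, sorted(masked)
```

In a masked autoencoder, the encoder sees only the visible chunks and the decoder is trained to reconstruct the masked ones, so reconstruction metrics such as PSNR and SSIM measure how well the hidden 3D context was recovered.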

94. ❌ Adaptive Parallel Monte Carlo Tree Search for Efficient Test-time Compute Scaling

Authors: Hongbeen Kim, Juhyun Lee, Sanghyeon Lee, Kwanghoon Choi, Jaehyuk Huh Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00510v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	15.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

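Scoring rationale: The paper studies Monte Carlo Tree Search (MCTS) as a test-time compute scaling method for improving LLM reasoning, so it is highly relevant to "Monte Carlo Tree Search OR MCTS AND LLM" (15). It explicitly targets LLM reasoning performance and is directly relevant to "Large Language Models OR LLMs OR Foundation Models" (10), "Chain of Thought OR CoT Reasoning OR Multi-step Reasoning" (10), and "System 2 Thinking OR Slow Thinking OR In-depth Reasoning" (10). It optimizes MCTS through negative early exit and an adaptive boosting mechanism to reduce latency, which relates to "Speculative Decoding OR Inference Acceleration" (10). Other keywords, such as MoE, SLMs, training methods, alignment, RAG, compression, and hallucination mitigation, are not mentioned in the abstract and receive 0.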

!!! tip deepseek-chat TL;DR

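The paper addresses the long-tail latency caused by the highly variable execution time of Monte Carlo Tree Search (MCTS) in LLM reasoning. It proposes negative early exit and an adaptive boosting mechanism; integrated into vLLM, they substantially reduce p99 end-to-end latency while improving throughput and maintaining reasoning accuracy.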

Abstract

Monte Carlo Tree Search (MCTS) is an effective test-time compute scaling (TTCS) method for improving the reasoning performance of large language models, but its highly variable execution time leads to severe long-tail latency in practice. Existing optimizations, such as positive early exit, reduce latency in favorable cases but are less effective when search continues without meaningful progress. We introduce negative early exit, which prunes unproductive MCTS trajectories, and an adaptive boosting mechanism that reallocates reclaimed computation to reduce resource contention among concurrent searches. Integrated into vLLM, these techniques substantially reduce p99 end-to-end latency while improving throughput and maintaining reasoning accuracy.

Keywords: Monte Carlo Tree Search, MCTS, large language models, test-time compute scaling, reasoning performance, latency reduction, adaptive boosting, vLLM
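
The contrast between positive and negative early exit can be sketched with a simple stopping rule. The version below is a hypothetical illustration, not the paper's criterion: it assumes each search iteration yields one scalar value estimate, stops with a positive exit once the best value reaches `success_threshold`, and stops with a negative exit after `patience` consecutive iterations without an improvement of at least `min_delta`. All names and default values are assumptions.

```python
def search_with_early_exit(rollout_values, success_threshold=0.9,
                           patience=3, min_delta=0.01):
    """Consume per-iteration value estimates and return (exit_reason, iterations_used)."""
    best = float("-inf")
    stale = 0  # consecutive iterations without meaningful progress
    used = 0
    for used, value in enumerate(rollout_values, start=1):
        if value > best + min_delta:
            best = value
            stale = 0
        else:
            stale += 1
        if best >= success_threshold:
            return "positive_exit", used  # good enough answer found
        if stale >= patience:
            return "negative_exit", used  # unproductive search is pruned
    return "budget_exhausted", used
```

Iterations saved by a negative exit are the kind of reclaimed computation that a scheduler could hand to other concurrent searches, which is the role the abstract assigns to the adaptive boosting mechanism.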

95. ❌ Towards Initialization-dependent and Non-vacuous Generalization Bounds for Overparameterized Shallow Neural Networks

Authors: Yunwen Lei, Yufeng Xie Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00505v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

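Scoring rationale: The paper is a theoretical analysis of generalization bounds for overparameterized shallow neural networks. It belongs to deep learning theory and has no direct connection to any of the scoring keywords, which focus on large-model techniques, applications, training methods, inference optimization, and similar topics. It does not involve large models, language models, AI-for-science applications, or any specific technique named in the keywords.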

!!! tip deepseek-chat TL;DR

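For overparameterized shallow neural networks, the paper develops the first fully initialization-dependent complexity bounds. A new peeling technique yields a logarithmic dependency on the width, and empirical comparisons show that the resulting generalization bounds are non-vacuous.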

Abstract

Overparameterized neural networks often show a benign overfitting property in the sense of achieving excellent generalization behavior despite the number of parameters exceeding the number of training examples. A promising direction to explain benign overfitting is to relate generalization to the norm of distance from initialization, motivated by the empirical observations that this distance is often significantly smaller than the norm itself. However, the existing initialization-dependent complexity analyses cannot fully exploit the power of initialization since the associated bounds depend on the spectral norm of the initialization matrix, which can scale as a square-root function of the width and are therefore not effective for overparameterized models. In this paper, we develop the first fully initialization-dependent complexity bounds for shallow neural networks with general Lipschitz activation functions, which enjoy a logarithmic dependency on the width. Our bounds depend on the path-norm of the distance from initialization, which are derived by introducing a new peeling technique to handle the challenge along with the initialization-dependent constraint. We also develop a lower bound tight up to a constant factor. Finally, we conduct empirical comparisons and show that our generalization analysis implies non-vacuous bounds for overparameterized networks.

Keywords: overparameterized neural networks, generalization bounds, initialization-dependent analysis, shallow neural networks, benign overfitting, path-norm, non-vacuous bounds, Lipschitz activation functions
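
The abstract notes that the spectral norm of the initialization matrix can grow like the square root of the width, which is why bounds that depend on it weaken for wide networks. The sketch below checks that scaling numerically, assuming a width-by-input-dimension matrix with i.i.d. standard Gaussian entries and estimating the largest singular value by power iteration. The function names and the choice of initialization are assumptions made for illustration; the paper's own bounds are not computed here.

```python
import math
import random


def gaussian_init(width, input_dim, seed=0):
    """Width x input_dim matrix with i.i.d. N(0, 1) entries (illustrative initialization)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(input_dim)] for _ in range(width)]


def spectral_norm(matrix, iterations=30, seed=0):
    """Estimate the largest singular value via power iteration on W^T W."""
    rng = random.Random(seed)
    rows, cols = len(matrix), len(matrix[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(cols)]
    for _ in range(iterations):
        u = [sum(w * x for w, x in zip(row, v)) for row in matrix]  # u = W v
        v = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]  # v = W^T u
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = [sum(w * x for w, x in zip(row, v)) for row in matrix]
    return math.sqrt(sum(x * x for x in u))  # ||W v|| for the unit vector v
```

Comparing `spectral_norm(gaussian_init(100, 8))` with `spectral_norm(gaussian_init(1600, 8))` shows the estimate growing substantially as the width increases 16-fold, consistent with square-root scaling in the width.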

96. ❌ A Reasoning-Enabled Vision-Language Foundation Model for Chest X-ray Interpretation

Authors: Yabin Zhang, Chong Wang, Yunhe Gao, Jiaming Liu, Maya Varma, Justin Xu, Sophie Ostmeier, Jin Long, Sergios Gatidis, Seena Dehkharghani, Arne Michalson, Eun Kyoung Hong, Christian Bluethgen, Haiwei Henry Guo, Alexander Victor Ortiz, Stephan Altmayer, Sandhya Bodapati, Joseph David Janizek, Ken Chang, Jean-Benoit Delbrouck, Akshay S. Chaudhari, Curtis P. Langlotz Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00493v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

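Scoring rationale: The paper presents CheXOne, a reasoning-enabled vision-language foundation model for chest X-ray interpretation. Its core contribution is explicit reasoning traces that connect visual evidence, radiographic findings, and diagnostic predictions. Highly relevant keywords: foundation models (CheXOne is a vision-language foundation model); instruction tuning (a two-stage framework combines instruction tuning with reinforcement learning to improve reasoning quality); RLHF/RLAIF/DPO (reinforcement learning is used to improve reasoning); chain of thought / multi-step reasoning (explicit reasoning traces are generated); System 2 thinking / in-depth reasoning (reasoning that causally links visual evidence to predictions); hallucination mitigation / factuality (high clinical factuality); explainable AI (interpretable reasoning traces); and AI for Science (a medical imaging application). SFT is partially relevant because of the training framework, while most other keywords, such as MoE, quantization, and RAG, are not covered.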

!!! tip deepseek-chat TL;DR

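The study develops CheXOne, a reasoning-enabled vision-language foundation model that improves chest X-ray interpretation by generating explicit clinical reasoning traces. In zero-shot evaluation covering 17 settings it outperforms existing medical and general-domain foundation models, and a clinical reader study finds CheXOne-drafted reports comparable to or better than resident-written reports in 55% of cases, with gains in interpretability and clinical utility.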

Abstract

Chest X-rays (CXRs) are among the most frequently performed imaging examinations worldwide, yet rising imaging volumes increase radiologist workload and the risk of diagnostic errors. Although artificial intelligence (AI) systems have shown promise for CXR interpretation, most generate only final predictions, without making explicit how visual evidence is translated into radiographic findings and diagnostic predictions. We present CheXOne, a reasoning-enabled vision-language model for CXR interpretation. CheXOne jointly generates diagnostic predictions and explicit, clinically grounded reasoning traces that connect visual evidence, radiographic findings, and these predictions. The model is trained on 14.7 million instruction and reasoning samples curated from 30 public datasets spanning 36 CXR interpretation tasks, using a two-stage framework that combines instruction tuning with reinforcement learning to improve reasoning quality. We evaluate CheXOne in zero-shot settings across visual question answering, report generation, visual grounding and reasoning assessment, covering 17 evaluation settings. CheXOne outperforms existing medical and general-domain foundation models and achieves strong performance on independent public benchmarks. A clinical reader study demonstrates that CheXOne-drafted reports are comparable to or better than resident-written reports in 55% of cases, while effectively addressing clinical indications and enhancing both report writing and CXR interpretation efficiency. Further analyses involving radiologists reveal that the generated reasoning traces show high clinical factuality and provide causal support for the final predictions, offering a plausible explanation for the performance gains. These results suggest that explicit reasoning can improve model performance, interpretability and clinical utility in AI-assisted CXR interpretation.

Keywords: vision-language foundation model, chest X-ray interpretation, reasoning traces, instruction tuning, reinforcement learning, clinical factuality, interpretability, AI-assisted diagnosis

97. ❌ Executing as You Generate: Hiding Execution Latency in LLM Code Generation

Authors: Zhensu Sun, Zhihao Lin, Zhi Chen, Chengran Yang, Mingyi Zhou, Li Li, David Lo Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00491v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

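Scoring rationale: The paper focuses on reducing execution latency in LLM code generation. It formalizes a parallel execution paradigm and presents Eager, a concrete implementation. Core relevant keywords: (1) 'Large Language Models' (the central object of study); (2) 'LLM Agents' (the paper studies LLMs acting as coding agents); (3) 'Tool Use' (the code interpreter is invoked as a tool); (4) 'Speculative Decoding OR Inference Acceleration' (the work directly reduces end-to-end latency, which falls under inference acceleration). Other keywords such as MoE, SFT, and RAG are not covered.

!!! tip deepseek-chat TL;DR

To address the latency caused by serial execution in LLM code generation, the paper formalizes a parallel execution paradigm and implements it as Eager, a generation-detection-execution pipeline that reduces end-to-end latency by up to 55%.

Abstract translation (Chinese)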

当前基于大语言模型的代码生成代理遵循串行执行范式：模型首先生成完整代码，随后调用解释器执行。这种顺序工作流程导致生成阶段执行器闲置、执行阶段生成器闲置，从而产生不必要的端到端延迟。我们观察到，与人类开发者不同，大语言模型以不可修订的顺序逐词元生成代码，这使得在代码生成过程中同步执行成为可能。我们将这种并行执行范式形式化，将其建模为生成、检测与执行的三级流水线，并通过闭式延迟边界刻画其加速潜力与运行区间。随后我们提出Eager系统——一种具体实现方案，其核心特性包括：基于抽象语法树（AST）的代码块划分、带门控执行的动态批处理机制以及早期错误中断策略。我们在四个基准测试集、七种大语言模型和三种执行环境中对Eager进行评估。实验结果表明，在七种大语言模型和四个基准测试中，Eager将非重叠执行延迟降低最高达99.9%，端到端延迟降低最高达55%。

Abstract

Current LLM-based coding agents follow a serial execution paradigm: the model first generates the complete code, then invokes an interpreter to execute it. This sequential workflow leaves the executor idle during generation and the generator idle during execution, resulting in unnecessary end-to-end latency. We observe that, unlike human developers, LLMs produce code tokens sequentially without revision, making it possible to execute code as it is being generated. We formalize this parallel execution paradigm, modeling it as a three-stage pipeline of generation, detection, and execution, and derive closed-form latency bounds that characterize its speedup potential and operating regimes. We then present Eager, a concrete implementation featuring AST-based chunking, dynamic batching with gated execution, and early error interruption. We evaluate Eager across four benchmarks, seven LLMs, and three execution environments. Results show that Eager reduces the non-overlapped execution latency by up to 99.9% and the end-to-end latency by up to 55% across seven LLMs and four benchmarks.
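The generation, detection, and execution pipeline described in the abstract can be illustrated with a minimal single-threaded Python sketch. It is not the Eager implementation: the function name `stream_execute`, the line-by-line input, and the rule that a chunk is tested for completeness when the next top-level line arrives are all assumptions made for illustration. The sketch uses the standard `ast` module to decide whether a buffered chunk is a complete top-level statement, runs it immediately, and lets any exception propagate so that later lines are never consumed (a rough analogue of early error interruption). A real system would overlap generation and execution in separate threads or processes.

```python
import ast

# Keywords that continue the previous compound statement at column 0.
CONTINUATION = ("else", "elif", "except", "finally")


def stream_execute(lines, namespace=None):
    """Run each top-level statement as soon as it is known to be complete.

    `lines` is any iterable of source lines (for example, a generator fed by
    a model). Returns the final namespace and the list of executed chunks.
    """
    namespace = {} if namespace is None else namespace
    buffer, executed = [], []

    def flush():
        source = "\n".join(buffer)
        if not source.strip():
            buffer.clear()
            return True
        try:
            tree = ast.parse(source)  # detection: is the chunk complete?
        except SyntaxError:
            return False  # still incomplete, keep buffering
        exec(compile(tree, "<stream>", "exec"), namespace)  # execution
        executed.append(source)
        buffer.clear()
        return True

    for line in lines:
        at_column_zero = line[:1] not in (" ", "\t", "")
        continues = line.lstrip().startswith(CONTINUATION)
        if buffer and at_column_zero and not continues:
            flush()  # a new top-level line suggests the previous chunk ended
        buffer.append(line)

    if not flush():
        raise SyntaxError("stream ended with an incomplete statement")
    return namespace, executed


if __name__ == "__main__":
    demo = ["x = 1", "def f(a):", "    return a + x", "y = f(", "    2)"]
    ns, chunks = stream_execute(demo)
    print(ns["y"], len(chunks))
```

Because the generator is consumed lazily, an exception raised by one chunk stops the loop before the remaining lines are requested, which is the property the abstract attributes to early error interruption.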

Keywords: LLM code generation, parallel execution, latency reduction, Eager system, AST-based chunking, dynamic batching, early error interruption, end-to-end latency

98. ❌ The Silicon Mirror: Dynamic Behavioral Gating for Anti-Sycophancy in LLM Agents

Authors: Harshee Jignesh Shah Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00478v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

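Scoring rationale: The paper studies sycophancy in LLMs and proposes The Silicon Mirror, a framework that dynamically detects user persuasion tactics and adjusts AI behavior to preserve factual integrity. Highly relevant keywords: LLMs (the paper explicitly studies LLMs), Alignment (it concerns value alignment), RLHF (the paper characterizes the validation-before-correction pattern as a failure mode of RLHF-trained models), Self-Correction (the framework includes a Generator-Critic loop for self-correction), LLM Agents (it studies the behavior of LLM agents), and Hallucination Mitigation (it aims to reduce sycophancy and improve factuality). Other keywords such as MoE, SLMs, Scaling Laws, PEFT, and RAG have no direct connection to the paper.

!!! tip deepseek-chat TL;DR

Targeting sycophancy, where LLMs prioritize user validation over epistemic accuracy, the paper proposes The Silicon Mirror, a framework that dynamically detects user persuasion tactics and adjusts model behavior. On 50 adversarial TruthfulQA scenarios with Claude Sonnet 4, it lowers the sycophancy rate from 12.0% to 2.0%, an 83.3% relative reduction (p = 0.112, which is not significant at the 0.05 level).

Abstract translation (Chinese)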

大型语言模型（LLMs）日益倾向于优先满足用户认同而非认知准确性——这一现象被称为“谄媚性”。我们提出“硅镜”框架，一种能够动态检测用户说服策略并调整AI行为以保持事实完整性的协调架构。该体系包含三个核心组件：（1）行为访问控制系统，该系统基于实时谄媚风险评分限制上下文层的访问；（2）特质分类器，用于在多轮对话中识别说服策略；（3）生成器-批评者循环，其中审计模块可否决谄媚性草稿并触发带有“必要摩擦”的改写。在使用Claude Sonnet 4模型配合独立LLM评判器对50个TruthfulQA对抗性场景进行的实时评估中，我们观察到原始Claude模型的谄媚率为12.0%（6/50），静态防护机制下为4.0%（2/50），而硅镜框架下仅为2.0%（1/50）——相对降幅达83.3%（p = 0.112，费希尔精确检验）。在Gemini 2.5 Flash模型上的跨模型评估显示其基线谄媚率更高（46.0%），而硅镜框架下实现了统计学上显著的69.6%降幅（p < 0.001）。我们将这种“先认同后修正”的模式特征描述为经RLHF训练模型特有的失效模式。

Abstract

Large Language Models (LLMs) increasingly prioritize user validation over epistemic accuracy, a phenomenon known as sycophancy. We present The Silicon Mirror, an orchestration framework that dynamically detects user persuasion tactics and adjusts AI behavior to maintain factual integrity. Our architecture introduces three components: (1) a Behavioral Access Control (BAC) system that restricts context layer access based on real-time sycophancy risk scores, (2) a Trait Classifier that identifies persuasion tactics across multi-turn dialogues, and (3) a Generator-Critic loop where an auditor vetoes sycophantic drafts and triggers rewrites with “Necessary Friction.” In a live evaluation on 50 TruthfulQA adversarial scenarios using Claude Sonnet 4 with an independent LLM judge, we observe vanilla Claude sycophancy at 12.0% (6/50), static guardrails at 4.0% (2/50), and the Silicon Mirror at 2.0% (1/50), an 83.3% relative reduction (p = 0.112, Fisher’s exact test). A cross-model evaluation on Gemini 2.5 Flash reveals a higher baseline sycophancy rate (46.0%) and a statistically significant 69.6% reduction under the Silicon Mirror (p < 0.001). We characterize the validation-before-correction pattern as a distinct failure mode of RLHF-trained models.
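The figures reported for the Claude Sonnet 4 evaluation can be re-derived from the counts in the abstract. The sketch below is a plain-Python check, not the authors' analysis code. The function names are hypothetical, and it assumes the reported test is a two-sided Fisher's exact test comparing vanilla Claude (6 of 50 sycophantic) with the Silicon Mirror (1 of 50), summing the probabilities of all tables that are no more likely than the observed one.

```python
from fractions import Fraction
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k "successes" landing in row 1.
        return Fraction(comb(row1, k) * comb(row2, col1 - k), denom)

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Exact fractions avoid floating-point ties when comparing probabilities.
    return float(sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= observed))


def relative_reduction(baseline_rate, treated_rate):
    """Fraction by which the treated rate is lower than the baseline rate."""
    return (baseline_rate - treated_rate) / baseline_rate


if __name__ == "__main__":
    # Rows: vanilla Claude, Silicon Mirror. Columns: sycophantic, not sycophantic.
    p_value = fisher_exact_two_sided(6, 44, 1, 49)
    reduction = relative_reduction(6 / 50, 1 / 50)
    print(f"p = {p_value:.3f}")                        # p = 0.112
    print(f"relative reduction = {reduction:.1%}")     # relative reduction = 83.3%
```

With only 7 sycophantic responses across 100 trials, the large relative reduction does not reach the conventional 0.05 threshold, which is consistent with the abstract reporting significance only for the Gemini 2.5 Flash evaluation.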

Keywords: Large Language Models, Sycophancy, Behavioral Gating, LLM Agents, Factual Integrity, RLHF, Alignment, Self-Correction

99. ❌ Not My Truce: Personality Differences in AI-Mediated Workplace Negotiation

Authors: Veda Duddu, Jash Rajesh Parekh, Andy Mao, Hanyi Min, Ziang Xiao, Vedant Das Swain, Koustuv Saha Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00464v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

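Scoring rationale: The paper studies AI-driven conversational coaching for workplace negotiation and how personality traits moderate coaching outcomes. Although it involves an AI application, its focus is human-computer interaction, psychology, and personalized intervention design rather than the technical principles, architectures, training methods, optimization techniques, or scientific-domain applications of large models or deep learning. All keywords target large-model technology itself (architecture, training, inference, optimization, application paradigms), whereas this paper only uses AI as a tool and does not touch the underlying technology, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The study examines how personality traits affect the effectiveness of AI-driven workplace negotiation coaching. Participants with different personality profiles benefited from different interventions (theory-driven AI, general-purpose AI, or a traditional handbook), which suggests that personalized AI coaching systems should match support intensity to individual readiness.

Abstract translation (Chinese)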

人工智能驱动的对话式辅导正日益应用于职场谈判支持，但既往研究默认其对所有用户具有均等效果。我们通过探究个体差异（尤其是人格特质）如何调节辅导效果，对这一假设提出挑战。我们开展了一项被试间实验（N=267），比较理论驱动型人工智能（Trucey）、通用型人工智能（对照-AI）与传统谈判手册（对照-非AI）的效果。基于大五人格特质与ARC类型学，参与者被聚类为三种人格画像——韧性型、过度控制型与低控制型。研究发现：韧性型工作者主要通过手册获得广泛的心理增益，过度控制型工作者在使用理论驱动型AI时表现出特定结果维度的改善，而低控制型工作者尽管接触了各类辅导框架却收效甚微。这些模式表明，人格特质可作为超越阶段式定制化的准备度预测指标：弱势用户更适合接受针对性而非全面性干预。本研究深化了对人格决定的干预前提条件的理解，并为自适应AI辅导系统的设计提供了启示——此类系统应将支持强度与个体准备度相匹配，而非假定其具有普适有效性。

Abstract

AI-driven conversational coaching is increasingly used to support workplace negotiation, yet prior work assumes uniform effectiveness across users. We challenge this assumption by examining how individual differences, particularly personality traits, moderate coaching outcomes. We conducted a between-subjects experiment (N=267) comparing theory-driven AI (Trucey), general-purpose AI (Control-AI), and a traditional negotiation handbook (Control-NoAI). Participants were clustered into three profiles – resilient, overcontrolled, and undercontrolled – based on the Big-Five personality traits and ARC typology. Resilient workers achieved broad psychological gains primarily from the handbook, overcontrolled workers showed outcome-specific improvements with theory-driven AI, and undercontrolled workers exhibited minimal effects despite engaging with the frameworks. These patterns suggest personality as a predictor of readiness beyond stage-based tailoring: vulnerable users benefit from targeted rather than comprehensive interventions. The study advances understanding of personality-determined intervention prerequisites and highlights design implications for adaptive AI coaching systems that align support intensity with individual readiness, rather than assuming universal effectiveness.

Keywords: AI-mediated negotiation, personality traits, workplace coaching, adaptive AI systems, Big-Five personality, intervention effectiveness, individual differences, psychological gains

100. ❌ First Logit Boosting: Visual Grounding Method to Mitigate Object Hallucination in Large Vision-Language Models

Authors: Jiwoo Ha, Jongwoo Baek, Jinhyun So Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00455v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

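Scoring rationale: The paper focuses on object hallucination in Large Vision-Language Models (LVLMs) and proposes a training-free mitigation method called First Logit Boosting (FLB). It is therefore highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (8/10), since LVLMs extend LLMs to the visual domain, and to 'Hallucination Mitigation OR Factuality OR Truthfulness' (10/10), since that is the paper's core research problem. The paper does not address the other keywords (MoE, SLMs, training techniques, reasoning methods, agents, compression, and so on), so they are scored 0.

!!! tip deepseek-chat TL;DR

To address object hallucination in large vision-language models, the paper proposes First Logit Boosting, a training-free method that stores the logit of the first generated token and adds it to subsequent token predictions. This counters the long-term decay of visual information during generation and reduces hallucination with negligible inference overhead.

Abstract translation (Chinese)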

近期的大型视觉语言模型（LVLMs）在需要同时理解视觉与语言输入的多种多模态任务中展现出卓越性能。然而，物体幻觉——即在回答中生成不存在的物体——仍然是一个持续存在的挑战。尽管已有多种方法被提出以缓解此问题，例如重训练和外部 grounding 方法，但这些方法仍面临数据成本高昂或结构复杂的问题。无需训练的方法（如对比解码，CD）更具成本效益，避免了额外的训练或外部模型，但仍存在长期衰减问题，即随着生成过程的推进，视觉 grounding 逐渐减弱，语言先验占据主导。本文提出了一种简单而有效的免训练技术——首词元对数增强（First Logit Boosting, FLB），旨在缓解 LVLMs 中的长期衰减。FLB 存储首个生成词元的对数，并将其叠加到后续词元预测中，从而有效减轻视觉信息的长期衰减。我们观察到 FLB 具有以下效果：（1）在整个生成过程中维持嵌入于首个词元的视觉信息；（2）通过“The”词元的稳定化效应抑制幻觉词汇。实验结果表明，FLB 在不同任务、基准测试和骨干模型中均能显著减少物体幻觉。值得注意的是，该方法引入的推理开销可忽略不计，使其高度适用于实时多模态系统。代码发布于 https://github.com/jiwooha20/FLB。

Abstract

Recent Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across various multimodal tasks that require understanding both visual and linguistic inputs. However, object hallucination – the generation of nonexistent objects in answers – remains a persistent challenge. Although several approaches such as retraining and external grounding methods have been proposed to mitigate this issue, they still suffer from high data costs or structural complexity. Training-free methods such as Contrastive Decoding (CD) are more cost-effective, avoiding additional training or external models, but still suffer from long-term decay, where visual grounding weakens and language priors dominate as the generation progresses. In this paper, we propose First Logit Boosting (FLB), a simple yet effective training-free technique designed to alleviate long-term decay in LVLMs. FLB stores the logit of the first generated token and adds it to subsequent token predictions, effectively mitigating long-term decay of visual information. We observe that FLB (1) sustains the visual information embedded in the first token throughout generation, and (2) suppresses hallucinated words through the stabilizing effect of the “The” token. Experimental results show that FLB significantly reduces object hallucination across various tasks, benchmarks, and backbone models. Notably, it causes negligible inference overhead, making it highly applicable to real-time multimodal systems. Code is available at https://github.com/jiwooha20/FLB
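The core FLB step, storing the first-step logits and adding them to later predictions, can be sketched with a toy greedy decoder. This is an illustration only, not the released FLB code. It assumes that "the logit of the first generated token" means the full logit vector at the first decoding step, and the scale `alpha`, the function names, and the three-token toy model are hypothetical.

```python
def greedy_decode(step_logits, num_steps, alpha=0.0):
    """Greedy decoding; alpha > 0 adds the stored first-step logits to later steps.

    `step_logits(tokens)` returns a list of logits for the next token given
    the tokens generated so far.
    """
    first_logits = None
    tokens = []
    for _ in range(num_steps):
        logits = step_logits(tokens)
        if first_logits is None:
            first_logits = list(logits)  # store the first-step logits once
            boosted = logits
        else:
            # Add the stored logits to every subsequent prediction.
            boosted = [l + alpha * f for l, f in zip(logits, first_logits)]
        tokens.append(max(range(len(boosted)), key=boosted.__getitem__))
    return tokens


def toy_model(tokens):
    # Hypothetical 3-token vocabulary: the first step is visually grounded
    # (token 0 wins), later steps drift toward a language prior (token 2).
    return [2.0, 1.0, 0.0] if not tokens else [0.0, 0.5, 1.0]


if __name__ == "__main__":
    print(greedy_decode(toy_model, 3, alpha=0.0))  # [0, 2, 2]
    print(greedy_decode(toy_model, 3, alpha=1.0))  # [0, 0, 0]
```

The only extra work per step is one vector addition, which matches the abstract's claim of negligible inference overhead.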

Keywords: Large Vision-Language Models, Object Hallucination, Training-free Method, First Logit Boosting, Visual Grounding, Long-term Decay, Multimodal Systems, Inference Overhead

101. ❌ Towards Reliable Truth-Aligned Uncertainty Estimation in Large Language Models

Authors: Ponhvoan Srey, Quang Minh Nguyen, Xiaobao Wu, Anh Tuan Luu Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00445v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

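Scoring rationale: The paper studies uncertainty estimation for detecting hallucinated LLM outputs, so it is highly relevant to 'Large Language Models' and 'Hallucination Mitigation' (10/10 each). It proposes TAC, a post-hoc calibration method, which gives some connection to 'Post-training' and 'Instruction Tuning' (5/10 each). It touches on model self-improvement and interpretability, giving partial relevance to 'Self-Correction' and 'Mechanistic Interpretability' (5/10 each). Other keywords such as MoE, quantization, and inference acceleration are not covered (0).

!!! tip deepseek-chat TL;DR

To address uncertainty estimation metrics for LLMs that are unstable and disconnected from factual correctness, the paper proposes Truth AnChoring (TAC), a post-hoc calibration method that maps raw scores to truth-aligned scores for more reliable uncertainty estimation.

Abstract translation (Chinese)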

不确定性估计（Uncertainty Estimation, UE）旨在检测大语言模型（Large Language Models, LLMs）的幻觉输出，以提升其可靠性。然而，不确定性估计指标在不同配置下常表现出不稳定的性能，这严重限制了其适用性。在本研究中，我们将此现象形式化为代理失效问题，因为大多数不确定性估计指标源于模型行为，而非明确基于大语言模型输出的事实正确性。基于此，我们证明不确定性估计指标恰恰在低信息情境下失去区分能力。为缓解这一问题，我们提出真值锚定（Truth AnChoring, TAC），一种后置校准方法，通过将原始分数映射至真值对齐的分数来修正不确定性估计指标。即使在噪声干扰和少样本监督条件下，我们的真值锚定方法仍能支持学习到校准良好的不确定性估计，并提供了一种实用的校准方案。我们的研究结果凸显了将启发式不确定性估计指标直接作为真值不确定性指示器的局限性，并将真值锚定定位为实现更可靠的大语言模型不确定性估计的必要步骤。代码仓库地址为 https://github.com/ponhvoan/TruthAnchor/。

Abstract

Uncertainty estimation (UE) aims to detect hallucinated outputs of large language models (LLMs) to improve their reliability. However, UE metrics often exhibit unstable performance across configurations, which significantly limits their applicability. In this work, we formalise this phenomenon as proxy failure, since most UE metrics originate from model behaviour, rather than being explicitly grounded in the factual correctness of LLM outputs. With this, we show that UE metrics become non-discriminative precisely in low-information regimes. To alleviate this, we propose Truth AnChoring (TAC), a post-hoc calibration method to remedy UE metrics, by mapping the raw scores to truth-aligned scores. Even with noisy and few-shot supervision, our TAC can support the learning of well-calibrated uncertainty estimates, and presents a practical calibration protocol. Our findings highlight the limitations of treating heuristic UE metrics as direct indicators of truth uncertainty, and position our TAC as a necessary step toward more reliable uncertainty estimation for LLMs. The code repository is available at https://github.com/ponhvoan/TruthAnchor/.

Keywords: Uncertainty Estimation, Large Language Models, Hallucination Detection, Truth Alignment, Post-hoc Calibration, Proxy Failure, Factual Correctness, Reliability

102. ❌ Polysemanticity or Polysemy? Lexical Identity Confounds Superposition Metrics

Authors: Iyad Ait Hou, Rebecca Hwa Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00443v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

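Scoring rationale: The paper studies the mechanistic interpretation of neuron activations in large language models. It analyzes how polysemous words (such as 'bank') activate neurons across contexts in order to examine a lexical confound in standard superposition metrics. It is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (8/10), because the study covers models from 110M to 70B parameters, and to 'Mechanistic Interpretability OR Explainable AI' (10/10), because its core is explaining the internal workings of models. The paper does not address the other keywords (MoE, training techniques, inference acceleration, AI for Science, and so on), so they are scored 0.

!!! tip deepseek-chat TL;DR

The paper finds that overlap in neuron activations in large language models, usually attributed to superposition of concepts, is driven to a large extent by a lexical confound (one word form with different meanings) rather than genuine concept compression. Filtering out the confound improves word sense disambiguation and makes knowledge edits more selective.

Abstract translation (Chinese)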

当同一神经元对“贷方”与“河岸”均产生激活时，标准度量方法会将这种重叠归因于叠加——即神经元必然在压缩两个无关概念。本研究旨在探究此种重叠在多大程度上源于词汇混淆：神经元因共享词形（如“bank”）激活，而非因两个压缩概念而激活。通过2x2因子分解发现，在参数量为1.1亿至700亿的系列模型中，“纯词汇条件”（相同词形、不同含义）的激活重叠始终高于“纯语义条件”（不同词形、相同含义）。这种混淆现象同样存在于稀疏自编码器中（18-36%的特征混合了多义词义项），集中于不超过1%的激活维度，并对下游任务产生负面影响：将其滤除后可提升词义消歧性能，并使知识编辑更具选择性（p = 0.002）。

Abstract

If the same neuron activates for both “lender” and “riverside,” standard metrics attribute the overlap to superposition: the neuron must be compressing two unrelated concepts. This work explores how much of the overlap is due to a lexical confound: neurons fire for a shared word form (such as “bank”) rather than for two compressed concepts. A 2x2 factorial decomposition reveals that the lexical-only condition (same word, different meaning) consistently exceeds the semantic-only condition (different word, same meaning) across models spanning 110M-70B parameters. The confound carries into sparse autoencoders (18-36% of features blend senses), sits in <=1% of activation dimensions, and hurts downstream tasks: filtering it out improves word sense disambiguation and makes knowledge edits more selective (p = 0.002).

Keywords: polysemy, superposition, lexical confound, mechanistic interpretability, sparse autoencoders, word sense disambiguation, knowledge editing, neuron activation

103. ❌ Self-Routing: Parameter-Free Expert Routing from Hidden States

Authors: Jama Hussein Mohamud, Drew Wagner, Mirco Ravanelli Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00421v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

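Scoring rationale: The paper studies routing in Mixture-of-Experts (MoE) models and proposes a parameter-free Self-Routing method that uses a subspace of the hidden state directly as expert logits, replacing the conventional learned router. It is therefore highly relevant to 'Mixture of Experts OR MoE OR Sparse Models' (10/10), the paper's core contribution. The evaluation uses GPT-2-scale language modeling, so it is also related to 'Large Language Models OR LLMs OR Foundation Models' (8/10), although that is not the main focus. Other keywords (SLMs, Scaling Laws, training methods, reasoning techniques, agent systems, and so on) are not addressed and are scored 0.

!!! tip deepseek-chat TL;DR

The paper asks whether Mixture-of-Experts (MoE) models need a learned router and proposes a parameter-free Self-Routing mechanism that assigns experts directly from a subspace of the hidden state. Experiments show it stays competitive with a learned router while removing all dedicated routing parameters and yielding more balanced expert utilization.

Abstract translation (Chinese)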

混合专家（Mixture-of-Experts, MoE）层通过仅为每个标记激活一小部分专家来提升模型容量，通常依赖一个学习的路由器将隐藏状态映射到专家分配。在本研究中，我们探讨在所研究的MoE设置中，一个专门学习的路由器是否严格必要。我们提出自路由（Self-Routing），一种无需参数的路径机制，它直接使用标记隐藏状态的指定子空间作为专家逻辑值，完全消除了路由器投影，同时保持MoE层的其余部分不变。我们通过在GPT-2规模的语言建模和ImageNet-1K分类任务上，将自路由与标准学习路由器、随机路由基线以及密集非MoE基线进行比较来评估其性能。我们的结果表明，自路由在移除所有专用路由参数的同时，仍与学习路由器基线保持竞争力，并产生更均衡的专家利用率，平均归一化路由熵提高了约17%，且无需显式的负载均衡损失。在使用DeiT-S/16的ImageNet-1K任务上，自路由也略微优于相应的学习路由器MoE。这些发现表明，有效的MoE路由可以从隐藏表示本身中自然涌现，而无需一个独立的学习路由器模块。

Abstract

Mixture-of-Experts (MoE) layers increase model capacity by activating only a small subset of experts per token, and typically rely on a learned router to map hidden states to expert assignments. In this work, we ask whether a dedicated learned router is strictly necessary in the MoE settings we study. We propose Self-Routing, a parameter-free routing mechanism that uses a designated subspace of the token hidden state directly as expert logits, eliminating the router projection entirely while leaving the rest of the MoE layer unchanged. We evaluate Self-Routing on GPT-2-scale language modeling and ImageNet-1K classification by comparing it against a standard learned router, random-routing baselines, and dense non-MoE baselines. Our results show that Self-Routing remains competitive with the learned-router baseline while removing all dedicated routing parameters, and yields more balanced expert utilization, with about 17% higher average normalized routing entropy and no explicit load-balancing loss. On ImageNet-1K with DeiT-S/16, Self-Routing also slightly improves over the corresponding learned-router MoE. These findings suggest that effective MoE routing can emerge from the hidden representation itself without requiring a separate learned router module.
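A minimal sketch of the routing idea, reading expert logits straight from the hidden state instead of from a learned projection, might look as follows. It is an illustration under stated assumptions rather than the authors' code: the designated subspace is taken to be the first `num_experts` coordinates of the hidden state, top-k selection with a softmax over the selected logits is a common MoE convention that the abstract does not specify, and the entropy helper is one plausible reading of "normalized routing entropy".

```python
import math


def self_route(hidden, num_experts, top_k=2):
    """Pick experts using a slice of the hidden state as the expert logits.

    Returns a list of (expert_index, gate_weight) pairs. No router weights
    are involved, so the routing step has no parameters of its own.
    """
    logits = hidden[:num_experts]  # assumption: the subspace is the leading dims
    top = sorted(range(num_experts), key=lambda e: logits[e], reverse=True)[:top_k]
    peak = max(logits[e] for e in top)
    exps = [math.exp(logits[e] - peak) for e in top]  # stable softmax over top-k
    total = sum(exps)
    return [(e, w / total) for e, w in zip(top, exps)]


def normalized_routing_entropy(expert_counts):
    """Entropy of the expert load distribution divided by its maximum, log(E)."""
    total = sum(expert_counts)
    probs = [c / total for c in expert_counts if c > 0]
    return -sum(p * math.log(p) for p in probs) / math.log(len(expert_counts))


if __name__ == "__main__":
    token_hidden = [0.1, 2.0, -1.0, 2.0, 5.0, 7.0]  # toy 6-dim hidden state
    print(self_route(token_hidden, num_experts=4, top_k=2))
    print(normalized_routing_entropy([5, 5, 5, 5]))
```

A normalized entropy near 1 means tokens are spread evenly across experts, and a value near 0 means a single expert receives almost all tokens, which is the sense in which a higher value indicates more balanced utilization.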

Keywords: Mixture-of-Experts, MoE, Self-Routing, parameter-free routing, hidden states, expert utilization, GPT-2, language modeling

104. ❌ G-Drift MIA: Membership Inference via Gradient-Induced Feature Drift in LLMs

Authors: Ravi Ranjan, Utkarsh Grover, Xiaomin Lin, Agoritsa Polyzou Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00419v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

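Scoring rationale: The paper focuses on a privacy attack on LLMs (membership inference) that relies on internal representations and gradient analysis, so it is highly relevant to 'Large Language Models' (10/10). It analyzes gradient-induced feature drift to understand memorization, which gives some connection to 'Mechanistic Interpretability' (5/10), although interpretability methods are not its main subject. Other keywords (MoE, SLMs, training methods, inference optimization, agent systems, AI for Science applications, and so on) are not addressed and are scored 0.

!!! tip deepseek-chat TL;DR

The paper proposes G-Drift MIA, a white-box membership inference attack based on gradient-induced feature drift, for determining whether a sample was part of an LLM's training data. Experiments show that it substantially outperforms confidence-based, perplexity-based, and reference-based attacks across multiple LLMs and benchmark-derived datasets.

Abstract translation (Chinese)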

大型语言模型（LLMs）基于海量网络规模语料库进行训练，引发了日益增长的隐私与版权担忧。成员推断攻击（MIAs）旨在判定给定样本是否在训练过程中被使用。现有的LLM成员推断攻击主要依赖输出概率或损失值，且当成员与非成员样本来自相同分布时，其表现往往仅略优于随机猜测。本文提出G-Drift MIA，一种基于梯度诱导特征漂移的白盒成员推断方法。对于候选样本（x,y），我们执行一次单步定向梯度上升以增加其损失，并测量更新前后模型内部表征的变化，包括逻辑值、隐藏层激活值以及在固定特征方向上的投影。这些漂移信号被用于训练一个轻量级逻辑分类器，从而有效区分成员与非成员。在多种基于Transformer的大型语言模型及源自现实成员推断基准测试的数据集上，G-Drift显著优于基于置信度、基于困惑度以及基于参考样本的攻击方法。我们进一步发现，被记忆的训练样本相较于非成员样本，系统性地表现出更小且更具结构化的特征漂移，这为梯度几何、表征稳定性与记忆机制之间建立了机理关联。总体而言，我们的研究结果表明，微小且受控的梯度干预为审计训练数据成员身份及评估大型语言模型的隐私风险提供了一种实用工具。

Abstract

Large language models (LLMs) are trained on massive web-scale corpora, raising growing concerns about privacy and copyright. Membership inference attacks (MIAs) aim to determine whether a given example was used during training. Existing LLM MIAs largely rely on output probabilities or loss values and often perform only marginally better than random guessing when members and non-members are drawn from the same distribution. We introduce G-Drift MIA, a white-box membership inference method based on gradient-induced feature drift. Given a candidate (x,y), we apply a single targeted gradient-ascent step that increases its loss and measure the resulting changes in internal representations, including logits, hidden-layer activations, and projections onto fixed feature directions, before and after the update. These drift signals are used to train a lightweight logistic classifier that effectively separates members from non-members. Across multiple transformer-based LLMs and datasets derived from realistic MIA benchmarks, G-Drift substantially outperforms confidence-based, perplexity-based, and reference-based attacks. We further show that memorized training samples systematically exhibit smaller and more structured feature drift than non-members, providing a mechanistic link between gradient geometry, representation stability, and memorization. In general, our results demonstrate that small, controlled gradient interventions offer a practical tool for auditing the membership of training-data and assessing privacy risks in LLMs.
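The drift measurement can be illustrated on a toy model. The sketch below replaces the LLM with a two-weight logistic classifier, which is a deliberate simplification and not the G-Drift method itself: it takes one gradient-ascent step on the log-loss of a candidate (x, y) and reports only the change in the logit, whereas the abstract also uses hidden-layer activations and projections and then trains a logistic classifier on the drift signals. The function names and the step size `lr` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(weights, bias, x):
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def logit_drift(weights, bias, x, y, lr=0.1):
    """Absolute logit change after one gradient-ascent step on the log-loss."""
    z_before = logit(weights, bias, x)
    err = sigmoid(z_before) - y  # d(loss)/d(logit) for the log-loss
    # Ascent moves parameters along the gradient, which increases the loss.
    new_weights = [w + lr * err * xi for w, xi in zip(weights, x)]
    new_bias = bias + lr * err
    return abs(logit(new_weights, new_bias, x) - z_before)


if __name__ == "__main__":
    weights, bias = [2.0, -1.0], 0.0
    x = [1.0, 0.0]
    well_fit = logit_drift(weights, bias, x, y=1)    # label the model already fits
    poorly_fit = logit_drift(weights, bias, x, y=0)  # label the model contradicts
    print(well_fit < poorly_fit)  # True
```

In this toy setting a sample the model already fits drifts less than one it contradicts, which mirrors the direction of the effect the abstract reports for memorized training samples. In a model this simple the drift reduces to a function of the loss, so the sketch shows the mechanics of the measurement rather than its advantage over loss-based attacks.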

Keywords: Membership Inference Attack, Large Language Models, Gradient-induced Feature Drift, Privacy Auditing, Representation Stability, Memorization, White-box Attack, Training Data Privacy

105. ❌ Learning Humanoid Navigation from Human Data

Authors: Weizhuo Wang, Yanjie Ze, C. Karen Liu, Monroe Kennedy Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00416v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

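Scoring rationale: The paper "Learning Humanoid Navigation from Human Data" focuses on robot navigation. It uses a diffusion model to learn trajectory prediction from human walking data and is deployed zero-shot on a Unitree G1 humanoid robot. All of the configured keywords relate directly to large language models (LLMs), the technical principles of deep learning, or specific applications of AI in science (such as bioinformatics and cheminformatics), whereas this paper studies robot navigation and control and does not involve any LLM technique, model training method (such as pre-training, fine-tuning, or alignment), inference optimization, agent system, or specific AI for Science subfield. All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a system called EgoNav that trains a diffusion model on only 5 hours of human walking data to achieve zero-shot humanoid navigation in diverse, unseen environments, and validates on a real robot that complex behaviors such as obstacle avoidance and navigating around crowds emerge naturally.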

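Abstract (translated)

We present EgoNav, a system that lets a humanoid robot travel autonomously through diverse, unseen environments after learning from only 5 hours of human walking data, with no robot data or fine-tuning. The system uses a diffusion model to predict a distribution over plausible future trajectories, conditioned on the past trajectory, a 360-degree visual memory that fuses color, depth, and semantic information, and video features extracted from a frozen DINOv3 backbone (these features capture appearance cues that depth sensors cannot perceive). A hybrid sampling scheme gives real-time inference within 10 denoising steps, and a receding-horizon controller selects the path to follow from the predicted distribution. We validate EgoNav through offline evaluation, where it surpasses baselines in both collision avoidance and multi-modal coverage, and through zero-shot deployment on a Unitree G1 humanoid robot, which successfully traverses unseen indoor and outdoor environments. Behaviors such as waiting for doors to open, navigating around crowds, and avoiding glass walls emerge naturally from the learned prior. We will release the dataset and trained models. Project website: https://egonav.weizhuowang.com

Abstract (original)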

We present EgoNav, a system that enables a humanoid robot to traverse diverse, unseen environments by learning entirely from 5 hours of human walking data, with no robot data or finetuning. A diffusion model predicts distributions of plausible future trajectories conditioned on past trajectory, a 360 deg visual memory fusing color, depth, and semantics, and video features from a frozen DINOv3 backbone that capture appearance cues invisible to depth sensors. A hybrid sampling scheme achieves real-time inference in 10 denoising steps, and a receding-horizon controller selects paths from the predicted distribution. We validate EgoNav through offline evaluations, where it outperforms baselines in collision avoidance and multi-modal coverage, and through zero-shot deployment on a Unitree G1 humanoid across unseen indoor and outdoor environments. Behaviors such as waiting for doors to open, navigating around crowds, and avoiding glass walls emerge naturally from the learned prior. We will release the dataset and trained models. Our website: https://egonav.weizhuowang.com
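
As a rough illustration of the receding-horizon idea mentioned in the abstract, the sketch below picks one path from a set of sampled candidates, commits only to its first waypoint, and leaves the rest to be replanned. The cost function (distance to goal plus a collision penalty), the `clearance` value, and all names are assumptions made for illustration; the abstract does not say how EgoNav scores paths.

```python
import math

def path_cost(path, goal, obstacles, clearance=0.5):
    # Hypothetical cost: end-point distance to the goal plus a large
    # penalty for every waypoint that comes too close to an obstacle.
    cost = math.dist(path[-1], goal)
    for point in path:
        for obstacle in obstacles:
            if math.dist(point, obstacle) < clearance:
                cost += 100.0
    return cost

def receding_horizon_step(candidates, goal, obstacles):
    # Choose the cheapest sampled path, but execute only its first waypoint;
    # the next control cycle would sample and score fresh candidates.
    best = min(candidates, key=lambda path: path_cost(path, goal, obstacles))
    return best[0], best

candidates = [
    [(0.5, 0.0), (1.0, 0.0), (2.0, 0.0)],   # straight line through the obstacle
    [(0.5, 0.6), (1.0, 1.0), (2.0, 0.2)],   # detour around it
]
next_waypoint, chosen = receding_horizon_step(candidates, goal=(2.0, 0.0), obstacles=[(1.0, 0.0)])
print(next_waypoint)
```

In a real system the candidates would come from the diffusion model's samples rather than a hand-written list.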

Keywords: humanoid navigation, diffusion model, trajectory prediction, zero-shot deployment, visual memory, collision avoidance, real-time inference, human walking data

106. ❌ Decision-Centric Design for LLM Systems

Authors: Wei Sun Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00414v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

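Scoring rationale: The paper proposes a decision-centric framework for LLM systems that separates control decisions (such as whether to answer, clarify, retrieve, or call a tool) from generation, making them an explicit and inspectable layer of the system. The work is highly relevant to LLM system architecture, agentic workflows, and tool use, which are its core topics. It is partially relevant to retrieval-augmented generation and explainable AI, because the paper mentions retrieval and interpretable failure modes. Other keywords (such as MoE, quantization, and alignment) are not mentioned in the abstract and are therefore not relevant.

!!! tip deepseek-chat TL;DR

The paper proposes a decision-centric framework that separates control decisions from generation so that an LLM system's decisions become explicit and inspectable, improving reliability, controllability, and diagnosability; in experiments it reduces futile actions and raises task success.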

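Abstract (translated)

Beyond generating outputs, LLM systems must also make control decisions: whether to answer, clarify, retrieve, call a tool, repair, or escalate. In many current architectures these decisions remain implicit in generation, coupling assessment and action inside a single model call and making failures hard to inspect, constrain, or repair. We propose a decision-centric framework that separates decision-relevant signals from the policy that maps them to actions, making control an explicit and inspectable layer of the system. This separation allows failures to be attributed to signal estimation, the decision policy, or execution, and supports modular improvement of each component. The framework unifies common single-step settings such as routing and adaptive inference, and extends naturally to sequential settings in which actions change the information available later. In three controlled experiments, the framework reduces futile actions, improves task success, and reveals interpretable failure modes. More broadly, it offers a general architectural principle for building more reliable, controllable, and diagnosable LLM systems.

Abstract (original)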

LLM systems must make control decisions in addition to generating outputs: whether to answer, clarify, retrieve, call tools, repair, or escalate. In many current architectures, these decisions remain implicit within generation, entangling assessment and action in a single model call and making failures hard to inspect, constrain, or repair. We propose a decision-centric framework that separates decision-relevant signals from the policy that maps them to actions, turning control into an explicit and inspectable layer of the system. This separation supports attribution of failures to signal estimation, decision policy, or execution, and enables modular improvement of each component. It unifies familiar single-step settings such as routing and adaptive inference, and extends naturally to sequential settings in which actions alter the information available before acting. Across three controlled experiments, the framework reduces futile actions, improves task success, and reveals interpretable failure modes. More broadly, it offers a general architectural principle for building more reliable, controllable, and diagnosable LLM systems.
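
The separation of signals from policy described above could look roughly like the following sketch. The signal names, the thresholds, and the `decide` function are hypothetical and only illustrate the idea of an explicit, inspectable decision layer; they are not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Signals:
    # Decision-relevant estimates, produced separately from generation.
    answer_confidence: float    # estimated chance the model can answer correctly
    query_ambiguity: float      # estimated ambiguity of the user request
    retrieval_available: bool   # whether a retrieval backend can be called

def decide(signals: Signals) -> str:
    # Explicit policy: a plain function from signals to an action, so a
    # failure can be traced to a bad signal, a bad rule, or bad execution.
    if signals.query_ambiguity > 0.6:
        return "clarify"
    if signals.answer_confidence >= 0.8:
        return "answer"
    if signals.retrieval_available:
        return "retrieve"
    return "escalate"

print(decide(Signals(answer_confidence=0.4, query_ambiguity=0.2, retrieval_available=True)))
```

Because the policy is ordinary code over named signals, each rule can be logged, tested, or replaced without touching the generator.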

Keywords: LLM systems, decision-centric framework, control decisions, tool calling, agentic workflow, reliability, controllability, diagnosability

107. ❌ COTTA: Context-Aware Transfer Adaptation for Trajectory Prediction in Autonomous Driving

Authors: Seohyoung Park, Jaeyeol Lim, Seoyoung Ju, Kyeonghun Kim, Nam-Joon Kim, Hyuk-Jae Lee Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00402v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

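Scoring rationale: The paper studies domain adaptation for trajectory prediction models in autonomous driving, mainly through transfer learning and fine-tuning strategies. It is highly relevant (8 points) to "Pre-training OR Continual Pre-training OR Domain Adaptation" and "Post-training OR Supervised Fine-tuning OR SFT", because it explicitly examines how a pretrained model adapts to a different geographic environment and which fine-tuning method works best. All other keywords are unrelated (0 points): the paper focuses on trajectory prediction for computer vision and autonomous driving and does not touch on large language models, reasoning methods, alignment techniques, agent systems, or similar topics.

!!! tip deepseek-chat TL;DR

The paper studies how a trajectory prediction model adapts when transferred from U.S. data to Korean road environments, and finds that selectively fine-tuning the decoder while freezing the encoder gives the best balance between accuracy and training efficiency, reducing prediction error by more than 66% compared with training from scratch.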

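Abstract (translated)

Developing robust models that accurately predict the trajectories of surrounding agents is essential for autonomous driving safety. However, most public datasets, such as the Waymo Open Motion Dataset and Argoverse, were collected in Western road environments and do not reflect the distinctive traffic patterns, infrastructure, and driving behaviors of other regions, including South Korea. This domain gap degrades performance when state-of-the-art models trained on Western data are deployed in different geographic settings. In this study we examine how Query-Centric Trajectory Prediction (QCNet) adapts when transferred from U.S. data to Korean road environments. Using a Korean autonomous driving dataset, we compare four training strategies: zero-shot transfer, training from scratch, full fine-tuning, and encoder freezing. The experiments show that leveraging pretrained knowledge significantly improves prediction performance. In particular, selectively fine-tuning the decoder while freezing the encoder gives the best balance between prediction accuracy and training efficiency, reducing prediction error by more than 66% compared with training from scratch. This study offers practical insights into effective transfer learning strategies for deploying trajectory prediction models in new geographic domains.

Abstract (original)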

Developing robust models to accurately predict the trajectories of surrounding agents is fundamental to autonomous driving safety. However, most public datasets, such as the Waymo Open Motion Dataset and Argoverse, are collected in Western road environments and do not reflect the unique traffic patterns, infrastructure, and driving behaviors of other regions, including South Korea. This domain discrepancy leads to performance degradation when state-of-the-art models trained on Western data are deployed in different geographic contexts. In this work, we investigate the adaptability of Query-Centric Trajectory Prediction (QCNet) when transferred from U.S.-based data to Korean road environments. Using a Korean autonomous driving dataset, we compare four training strategies: zero-shot transfer, training from scratch, full fine-tuning, and encoder freezing. Experimental results demonstrate that leveraging pretrained knowledge significantly improves prediction performance. Specifically, selectively fine-tuning the decoder while freezing the encoder yields the best trade-off between accuracy and training efficiency, reducing prediction error by over 66% compared to training from scratch. This study provides practical insights into effective transfer learning strategies for deploying trajectory prediction models in new geographic domains.
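
The encoder-freezing strategy can be sketched without any deep learning framework. The example below assumes parameters are stored in a flat dictionary whose names start with `encoder.` or `decoder.`; that naming scheme and the helper functions are illustrative assumptions, not QCNet's actual layout.

```python
def select_trainable(param_names, frozen_prefixes=("encoder.",)):
    # Keep every parameter whose name does not start with a frozen prefix.
    return [name for name in param_names if not name.startswith(frozen_prefixes)]

def sgd_step(params, grads, trainable, lr=0.1):
    # Plain SGD that updates trainable parameters and leaves frozen ones untouched.
    return {
        name: (value - lr * grads[name] if name in trainable else value)
        for name, value in params.items()
    }

params = {"encoder.layer0.weight": 1.0, "decoder.head.weight": 1.0}
grads = {"encoder.layer0.weight": 0.5, "decoder.head.weight": 0.5}
trainable = select_trainable(params)
print(sgd_step(params, grads, trainable))
```

In PyTorch the same effect is usually obtained by setting `requires_grad = False` on the encoder's parameters before building the optimizer.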

Keywords: trajectory prediction, autonomous driving, domain adaptation, transfer learning, fine-tuning, geographic discrepancy, QCNet, Korean road environments

108. ❌ Improving Generalization of Deep Learning for Brain Metastases Segmentation Across Institutions

Authors: Yuchen Yang, Shuangyang Zhong, Haijun Yu, Langcuomu Suo, Hongbin Han, Florian Putz, Yixing Huang Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00397v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

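Scoring rationale: The paper focuses on domain adaptation in medical image segmentation, using a VAE-MMD method to reduce cross-institutional data heterogeneity. Relevance to the keyword list: 1) highly relevant (10 points) to "Pre-training OR Continual Pre-training OR Domain Adaptation", because the core of the paper is a domain adaptation framework; 2) highly relevant (10 points) to "AI for Science OR Bioinformatics OR Cheminformatics", as a biomedical AI application; 3) other keywords (such as LLMs, MoE, and RLHF) are not involved and score 0. Although the paper involves deep learning, it does not contribute to the technical principles of large models.

!!! tip deepseek-chat TL;DR

The paper proposes a domain adaptation framework (VAE-MMD) that combines a variational autoencoder with a maximum mean discrepancy loss, effectively reducing cross-institutional data heterogeneity in brain metastases segmentation; it significantly improves segmentation performance on four public datasets and generalizes better without target-domain labels.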

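Abstract (translated)

Background: Deep learning has shown significant potential for automated brain metastases (BM) segmentation; however, because of differences in scanner hardware, imaging protocols, and patient populations, models trained at a single institution often perform suboptimally at other institutions. This study aims to build a domain adaptation framework that enables BM segmentation across multiple institutions.
Methods: We propose a VAE-MMD preprocessing pipeline that combines a variational autoencoder (VAE) with a maximum mean discrepancy (MMD) loss, incorporating skip connections and self-attention mechanisms, and is used alongside nnU-Net segmentation. The method was tested on data from 740 patients in four public databases (Stanford, UCSF, UCLM, and PKG) and evaluated by domain classifier accuracy, sensitivity, precision, F1/F2 scores, surface Dice (sDice), and 95th percentile Hausdorff distance (HD95).
Results: VAE-MMD reduced domain classifier accuracy from 0.91 to 0.50, indicating successful feature alignment across institutions. The reconstructed volumes had a peak signal-to-noise ratio (PSNR) above 36 dB, preserving anatomical accuracy. Compared with the baseline nnU-Net, the combined method raised the mean F1 score across all four centers by 11.1% (from 0.700 to 0.778), raised the mean sDice by 7.93% (from 0.7121 to 0.7686), and reduced the mean HD95 by 65.5% (from 11.33 mm to 3.91 mm).
Conclusions: VAE-MMD effectively reduces cross-institutional data heterogeneity and significantly improves the generalization of BM segmentation on volumetric, detection, and boundary-level metrics without requiring target-domain labels, thereby overcoming a key obstacle to the clinical translation of AI-assisted segmentation.

Abstract (original)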

Background: Deep learning has demonstrated significant potential for automated brain metastases (BM) segmentation; however, models trained at a singular institution often exhibit suboptimal performance at various sites due to disparities in scanner hardware, imaging protocols, and patient demographics. The goal of this work is to create a domain adaptation framework that will allow for BM segmentation to be used across multiple institutions. Methods: We propose a VAE-MMD preprocessing pipeline that combines variational autoencoders (VAE) with maximum mean discrepancy (MMD) loss, incorporating skip connections and self-attention mechanisms alongside nnU-Net segmentation. The method was tested on 740 patients from four public databases: Stanford, UCSF, UCLM, and PKG, evaluated by domain classifier’s accuracy, sensitivity, precision, F1/F2 scores, surface Dice (sDice), and 95th percentile Hausdorff distance (HD95). Results: VAE-MMD reduced domain classifier accuracy from 0.91 to 0.50, indicating successful feature alignment across institutions. Reconstructed volumes attained a PSNR greater than 36 dB, maintaining anatomical accuracy. The combined method raised the mean F1 by 11.1% (0.700 to 0.778), the mean sDice by 7.93% (0.7121 to 0.7686), and reduced the mean HD95 by 65.5% (11.33 to 3.91 mm) across all four centers compared to the baseline nnU-Net. Conclusions: VAE-MMD effectively diminishes cross-institutional data heterogeneity and enhances BM segmentation generalization across volumetric, detection, and boundary-level metrics without necessitating target-domain labels, thereby overcoming a significant obstacle to the clinical implementation of AI-assisted segmentation.
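
For readers unfamiliar with maximum mean discrepancy, the sketch below computes a simple (biased) squared-MMD estimate between two small sets of feature vectors. The RBF kernel, the bandwidth `sigma`, and the toy data are assumptions; the abstract does not specify which kernel or estimator the authors use.

```python
import math

def rbf(a, b, sigma=1.0):
    # Gaussian (RBF) kernel between two feature vectors.
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mmd2(xs, ys, sigma=1.0):
    # Biased estimate of squared MMD:
    # mean k(x, x') + mean k(y, y') - 2 * mean k(x, y).
    def mean_kernel(us, vs):
        return sum(rbf(u, v, sigma) for u in us for v in vs) / (len(us) * len(vs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

site_a = [[0.0, 0.0], [0.1, 0.0]]        # toy features from one institution
site_b_near = [[0.05, 0.0], [0.1, 0.05]] # similar distribution
site_b_far = [[3.0, 3.0], [3.1, 3.0]]    # shifted distribution
print(mmd2(site_a, site_b_near), mmd2(site_a, site_b_far))
```

Used as a loss term, a small MMD between latent features from different institutions pushes their distributions together, which is consistent with the reported drop in domain classifier accuracy to chance level (0.50).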

Keywords: brain metastases segmentation, domain adaptation, variational autoencoder, maximum mean discrepancy, cross-institutional generalization, medical image analysis, nnU-Net, AI-assisted segmentation

109. ❌ Deep Networks Favor Simple Data

Authors: Weyl Lu, Chenjie Hao, Yubei Chen Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00394v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

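Scoring rationale: The paper studies the preference of deep networks with respect to data complexity and finds that networks tend to assign higher density estimates to simple data. It is basic deep learning theory and has no direct connection to any of the keywords (all of which focus on large-model techniques, applications, and optimization), so every keyword scores 0.

!!! tip deepseek-chat TL;DR

The paper finds that deep networks broadly tend to assign higher density estimates to simple data, revealing a systematic preference tied to data complexity.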

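Abstract (translated)

Estimated density is often interpreted as a measure of how typical a sample is under a model. Yet a deep model trained on one dataset can assign higher density to simple out-of-distribution (OOD) data than to in-distribution test data. We call this phenomenon the OOD anomaly. Prior work usually examines it within a single architecture, detector, or benchmark, implicitly assuming certain canonical ways of estimating density. We instead separate the trained network from the density estimator built from its representations or outputs. We introduce two estimators, a Jacobian-based estimator and an autoregressive self-estimator, so that density analysis can be applied to a wide range of models.
Applying this perspective to a range of models (including iGPT, PixelCNN++, Glow, score-based diffusion models, DINOv2, and I-JEPA), we find a striking regularity that goes beyond the OOD anomaly: lower-complexity samples receive higher estimated density, while higher-complexity samples receive lower estimated density. This ordering holds within a test set and across OOD pairs (such as CIFAR-10 and SVHN), and stays highly consistent across independently trained models. To quantify these orderings we use the Spearman rank correlation coefficient and find significant agreement both across models and with external complexity metrics. Even when a model is trained only on the lowest-density (most complex) samples, or even on a single such sample, it still ranks simpler images as higher density.
These observations lead us beyond the original OOD anomaly to a more general conclusion: deep networks consistently favor simple data. Our goal is not to close the question but to define and visualize the phenomenon more clearly. We broaden its empirical scope and show that the regularity appears across architectures, training objectives, and density estimators.

Abstract (original)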

Estimated density is often interpreted as indicating how typical a sample is under a model. Yet deep models trained on one dataset can assign higher density to simpler out-of-distribution (OOD) data than to in-distribution test data. We refer to this behavior as the OOD anomaly. Prior work typically studies this phenomenon within a single architecture, detector, or benchmark, implicitly assuming certain canonical densities. We instead separate the trained network from the density estimator built from its representations or outputs. We introduce two estimators: Jacobian-based estimators and autoregressive self-estimators, making density analysis applicable to a wide range of models. Applying this perspective to a range of models, including iGPT, PixelCNN++, Glow, score-based diffusion models, DINOv2, and I-JEPA, we find the same striking regularity that goes beyond the OOD anomaly: lower-complexity samples receive higher estimated density, while higher-complexity samples receive lower estimated density. This ordering appears within a test set and across OOD pairs such as CIFAR-10 and SVHN, and remains highly consistent across independently trained models. To quantify these orderings, we introduce Spearman rank correlation and find striking agreement both across models and with external complexity metrics. Even when trained only on the lowest-density (most complex) samples or even a single such sample, the resulting models still rank simpler images as higher density. These observations lead us beyond the original OOD anomaly to a more general conclusion: deep networks consistently favor simple data. Our goal is not to close this question, but to define and visualize it more clearly. We broaden its empirical scope and show that it appears across architectures, objectives, and density estimators.
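
The Spearman rank correlation used to compare density orderings can be computed as the Pearson correlation of ranks. The sketch below is a minimal stdlib version with average ranks for ties; the complexity scores and log-densities are made-up numbers chosen only to show a perfectly inverse ordering.

```python
def ranks(values):
    # Assign 1-based ranks, giving tied values the average of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(a, b):
    # Pearson correlation computed on the ranks of a and b.
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5

complexity = [1.0, 2.0, 3.0, 4.0]          # hypothetical complexity scores
log_density = [-10.0, -25.0, -40.0, -90.0]  # hypothetical model log-densities
print(spearman(complexity, log_density))
```

A value near -1 here means that more complex samples consistently receive lower density, which is the ordering the paper reports.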

Keywords: deep networks, density estimation, out-of-distribution anomaly, data complexity, Jacobian-based estimators, autoregressive self-estimators, Spearman rank correlation

110. ❌ Universal YOCO for Efficient Depth Scaling

Authors: Yutao Sun, Li Dong, Tianzhu Ye, Shaohan Huang, Jianyong Wang, Furu Wei Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01220v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	10.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

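Scoring rationale: The paper's core topic is LLM inference efficiency, and it deals directly with KV cache optimization and inference acceleration, so it is highly relevant (10 points) to "Large Language Models", "KV Cache Compression", and "Speculative Decoding". It mentions long-context benchmarks, which gives some relevance (5 points) to "Context Window Extension". Other keywords such as MoE, SFT, and RAG do not appear in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes the Universal YOCO architecture, which combines the YOCO decoder-decoder structure with recursive computation to address KV cache inflation and high computational overhead when Transformers scale in depth, achieving a better capability-efficiency tradeoff while keeping inference efficient.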

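Abstract (translated)

The rise of test-time scaling has markedly improved the reasoning and agentic abilities of large language models (LLMs). However, the standard Transformer architecture struggles to scale inference-time compute efficiently, because conventional looping strategies incur high computational overhead and the key-value (KV) cache inflates as model depth grows. This paper presents Universal YOCO (YOCO-U), which combines the YOCO decoder-decoder architecture with recursive computation to achieve a synergy beyond either approach alone. Built on the YOCO framework, YOCO-U implements a Universal Self-Decoder that performs multiple iterations through parameter sharing while confining the iterative process to shallow, efficient-attention layers. This combination yields a favorable capability-efficiency balance that neither YOCO nor recursion achieves on its own: the YOCO architecture provides a constant global KV cache and linear pre-filling, while partial recursion adds representational depth at limited overhead. Together they allow YOCO-U to improve token utility and scaling behavior while keeping inference efficient. Empirical results show that YOCO-U remains highly competitive on both general and long-context benchmarks, demonstrating that integrating efficient-attention architectures with recursive computation is a promising direction for building scalable LLMs.

Abstract (original)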

The rise of test-time scaling has remarkably boosted the reasoning and agentic proficiency of Large Language Models (LLMs). Yet, standard Transformers struggle to scale inference-time compute efficiently, as conventional looping strategies suffer from high computational overhead and a KV cache that inflates alongside model depth. We present Universal YOCO (YOCO-U), which combines the YOCO decoder-decoder architecture with recursive computation to achieve a synergistic effect greater than either alone. Built on the YOCO framework, YOCO-U implements a Universal Self-Decoder that performs multiple iterations via parameter sharing, while confining the iterative process to shallow, efficient-attention layers. This combination yields a favorable capability-efficiency tradeoff that neither YOCO nor recursion achieves independently. The YOCO architecture provides a constant global KV cache and linear pre-filling, while partial recursion enhances representational depth with limited overhead. Together, YOCO-U improves token utility and scaling behavior while maintaining efficient inference. Empirical results confirm that YOCO-U remains highly competitive in general and long-context benchmarks, demonstrating that the integration of efficient-attention architectures and recursive computation is a promising direction for scalable LLMs.
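
A back-of-the-envelope count shows why a per-layer KV cache "inflates alongside model depth" while a shared global cache does not. This is a simplified accounting sketch, not the paper's formula: it assumes one key and one value vector of size `kv_dim` per token, and it ignores any local state kept by the efficient-attention layers. The function names and sizes are hypothetical.

```python
def standard_kv_entries(n_layers, seq_len, kv_dim):
    # A standard decoder stores one K and one V tensor per layer.
    return 2 * n_layers * seq_len * kv_dim

def shared_global_kv_entries(seq_len, kv_dim):
    # Assumed YOCO-style setting: a single global K/V pair is computed once
    # and reused by every layer that attends over the full context.
    return 2 * seq_len * kv_dim

seq_len, kv_dim = 4096, 1024  # illustrative sizes, not from the paper
for depth in (16, 32, 64):
    ratio = standard_kv_entries(depth, seq_len, kv_dim) / shared_global_kv_entries(seq_len, kv_dim)
    print(depth, ratio)
```

Under these assumptions the ratio equals the layer count, so doubling depth doubles the standard cache while the shared cache stays fixed.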

Keywords: Large Language Models, KV cache, inference efficiency, recursive computation, YOCO architecture, scalable LLMs, attention mechanisms, computational overhead

111. ❌ LLM REgression with a Latent Iterative State Head

Authors: Yiheng Su, Matthew Lease Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01206v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

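Scoring rationale: The paper proposes RELISH, a new architecture for text regression with large language models. Its core innovation is to iteratively refine a latent state through cross-attention and then map it to a point estimate with a linear regressor. The paper is highly relevant to "Large Language Models" (it uses four LLM backbones) and "PEFT" (parameter-efficient fine-tuning, requiring only 0.01-0.04% additional parameters, far fewer than LoRA). It is partially relevant to "Post-training", because it involves training on top of a frozen LLM. Other keywords such as MoE, SLMs, Scaling Laws, RAG, CoT, Agents, and AI for Science are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a lightweight architecture called RELISH that iteratively refines a latent state and maps it to a point estimate, outperforming existing baselines on text regression tasks while remaining highly parameter-efficient.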

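Abstract (translated)

This paper presents RELISH (REgression with a Latent Iterative State Head), a novel lightweight architecture designed for text regression with large language models. Rather than decoding numeric targets as text or aggregating multiple generated outputs, RELISH predicts a scalar value directly from frozen LLM representations: it first iteratively refines a learned latent state through cross-attention over token-level representations, then maps the final state to a point estimate with a linear regressor. In experiments spanning five datasets, four LLM backbones, and two LLM training regimes, RELISH consistently outperforms prior baselines from all three major families of LLM regression methods (autoregressive decoding, regression-aware inference, and existing predictive head methods). Despite these gains, RELISH remains highly parameter-efficient, training only 3.4-3.7M parameters on top of frozen LLM backbones (just 0.01-0.04% additional overhead), far fewer than LoRA-based alternatives, whose overhead grows with model size (0.26-0.42%).

Abstract (original)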

We present RELISH (REgression with a Latent Iterative State Head), a novel, lightweight architecture designed for text regression with large language models. Rather than decoding numeric targets as text or aggregating multiple generated outputs, RELISH predicts scalar values directly from frozen LLM representations by iteratively refining a learned latent state through cross-attention over token-level representations, and then mapping the final state to a point estimate with a linear regressor. Across five datasets, four LLM backbones, and two LLM training regimes, RELISH consistently outperforms prior baselines from all three major LLM regression families, including autoregressive decoding, regression-aware inference, and existing predictive head methods. Despite these gains, RELISH remains highly parameter-efficient, requiring only 3.4-3.7M trainable parameters across frozen LLM backbones (only 0.01-0.04% additional overhead), far less than LoRA-based alternatives that grow with model size (0.26-0.42%).
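
The "refine a latent state by cross-attention, then regress" idea can be sketched in a few lines. This is a simplified illustration, not the RELISH architecture: it assumes identity query/key/value projections, a residual update, and a fixed number of steps, and the function names are hypothetical. A real head would learn the projections, the initial latent, and the regressor.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def refine(latent, tokens):
    # One cross-attention step: the latent state is the query, the frozen
    # token representations serve as keys and values.
    scale = math.sqrt(len(latent))
    weights = softmax([dot(latent, t) / scale for t in tokens])
    context = [sum(w * t[j] for w, t in zip(weights, tokens)) for j in range(len(latent))]
    # Residual update of the latent state.
    return [z + c for z, c in zip(latent, context)], weights

def predict(tokens, latent0, w, b, steps=3):
    latent = list(latent0)
    for _ in range(steps):
        latent, _ = refine(latent, tokens)
    # Linear regressor maps the final latent state to a scalar estimate.
    return dot(w, latent) + b

tokens = [[1.0, 0.0], [0.0, 1.0]]  # toy stand-ins for frozen LLM token features
print(predict(tokens, latent0=[0.0, 0.0], w=[1.0, 1.0], b=0.0))
```

Only the latent, the attention projections, and the regressor would be trained in such a head, which is why the trainable parameter count can stay small while the backbone remains frozen.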

Keywords: LLM regression, latent iterative state, parameter-efficient, frozen LLM, cross-attention, text regression, predictive head, lightweight architecture

112. ❌ Embarrassingly Simple Self-Distillation Improves Code Generation

Authors: Ruixiang Zhang, Richard He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, Yizhe Zhang Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01193v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

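Scoring rationale: The paper's core topic is improving the code generation ability of large language models (LLMs) through simple self-distillation (SSD) followed by supervised fine-tuning (SFT), which is a post-training technique. The method involves model self-improvement, so it is somewhat related to "Self-Correction OR Self-Improvement OR Self-Reflection", though that is not its focus. Other keywords such as MoE, SLMs, Scaling Laws, and RLHF are not covered.

!!! tip deepseek-chat TL;DR

The paper studies how simple self-distillation, using only the LLM's own raw outputs, can improve code generation. On LiveCodeBench v6, the method raises Qwen3-30B-Instruct from 42.4% to 55.3% pass@1.

Abstract (Chinese translation)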

大型语言模型（LLM）能否仅利用其自身的原始输出——无需验证器、教师模型或强化学习——来提升代码生成能力？我们通过简单的自蒸馏（SSD）方法给出了肯定答案：该方法以特定的温度和截断配置从模型中采样生成解决方案，随后通过标准监督微调在这些样本上进行训练。在LiveCodeBench v6基准测试中，SSD将Qwen3-30B-Instruct的pass@1准确率从42.4%提升至55.3%，且提升效果主要集中在更难的问题上；该方法在4B、8B和30B规模的Qwen与Llama系列模型（包括指令微调版和思维链变体）中均展现出良好的泛化能力。为理解这种简单方法有效的原因，我们追溯其性能增益至LLM解码过程中的“精度-探索”矛盾，并证明SSD能够以上下文相关的方式重塑词元分布：在需要精确性的场景中抑制干扰性的分布尾部，同时在需要探索性的场景中保留有益的多样性。综上所述，SSD为提升LLM代码生成能力提供了一条互补的后训练路径。

Abstract

Can a large language model (LLM) improve at code generation using only its own raw outputs, without a verifier, a teacher model, or reinforcement learning? We answer in the affirmative with simple self-distillation (SSD): sample solutions from the model with certain temperature and truncation configurations, then fine-tune on those samples with standard supervised fine-tuning. SSD improves Qwen3-30B-Instruct from 42.4% to 55.3% pass@1 on LiveCodeBench v6, with gains concentrating on harder problems, and it generalizes across Qwen and Llama models at 4B, 8B, and 30B scale, including both instruct and thinking variants. To understand why such a simple method can work, we trace these gains to a precision-exploration conflict in LLM decoding and show that SSD reshapes token distributions in a context-dependent way, suppressing distractor tails where precision matters while preserving useful diversity where exploration matters. Taken together, SSD offers a complementary post-training direction for improving LLM code generation.

Keywords: large language models, code generation, self-distillation, supervised fine-tuning, post-training, LLM decoding, token distributions, LiveCodeBench
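
The SSD abstract above mentions sampling with "certain temperature and truncation configurations" without naming them. As a hedged illustration of what such a sampling step can look like, the sketch below applies temperature scaling and top-k truncation to a toy logit vector. Top-k is one possible truncation (an assumption), and the values `temperature=0.7` and `top_k=2` are hypothetical.

```python
import math
import random

def truncated_probs(logits, temperature=1.0, top_k=None):
    """Temperature-scale logits, keep the top_k largest, and renormalise."""
    scaled = [l / temperature for l in logits]
    keep = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k is not None:
        keep = keep[:top_k]
    m = max(scaled[i] for i in keep)
    exps = {i: math.exp(scaled[i] - m) for i in keep}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def sample_token(logits, rng, temperature=1.0, top_k=None):
    """Draw one token id from the truncated distribution."""
    probs = truncated_probs(logits, temperature, top_k)
    ids = list(probs)
    return rng.choices(ids, weights=[probs[i] for i in ids], k=1)[0]

logits = [2.0, 1.0, 0.1, -3.0]           # toy next-token logits
probs = truncated_probs(logits, temperature=0.7, top_k=2)
rng = random.Random(0)                   # fixed seed for repeatability
samples = [sample_token(logits, rng, temperature=0.7, top_k=2) for _ in range(20)]
print(probs, samples)
```

Truncation removes the low-probability tail before sampling, so tokens 2 and 3 can never be drawn here. In the SSD recipe, whole solutions sampled this way would then become the supervised fine-tuning set.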

113. ❌ S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models

Authors: Jack Young Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01168v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	15.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

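Scoring rationale: The paper's core contribution is S0 tuning, a parameter-efficient fine-tuning method designed for hybrid recurrent-attention models (e.g., Qwen3.5-4B, FalconH1-7B) that adapts the model with zero inference overhead by optimizing the initial state matrix of each recurrent layer. It is therefore highly relevant to "PEFT OR LoRA OR Parameter-efficient Fine-tuning" (15 points), since S0 tuning is a new PEFT method and is compared directly against LoRA. It is relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points) because the experiments run on LLMs (Qwen3.5-4B, FalconH1-7B), and to "Post-training OR Supervised Fine-tuning OR SFT" (10 points) because S0 tuning is a supervised fine-tuning technique trained on HumanEval solutions. Other keywords such as MoE, SLMs, Scaling Laws, RAG, and CoT are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes S0 tuning, a zero-inference-overhead parameter-efficient fine-tuning method that optimizes the initial state matrix of each recurrent layer in hybrid recurrent-attention models. It outperforms LoRA by +10.8 pp on HumanEval, is statistically indistinguishable from LoRA on FalconH1-7B, and transfers to MATH-500 and GSM8K but not to Spider.

Abstract (Chinese translation)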

使用约48个经过执行验证的HumanEval训练样本，对每个循环层仅调优单个初始状态矩阵（零推理开销），在HumanEval基准上以+10.8个百分点（p < 0.001）的表现优于LoRA方法。该技术被我们称为S0调优，其核心是在冻结全部模型权重的同时，优化每个循环层的一个状态矩阵。在Qwen3.5-4B（GatedDeltaNet混合架构）上，S0调优将贪婪解码的pass@1指标提升了+23.6 +/- 1.7个百分点（10次随机种子实验）。在FalconH1-7B（Mamba-2混合架构）上，S0达到71.8% +/- 1.3，LoRA达到71.4% +/- 2.4（3次随机种子），在当前样本量下统计无差异，且无需权重合并。跨领域迁移能力在MATH-500（+4.8个百分点，p = 0.00002，8次种子）和GSM8K（+2.8个百分点，p = 0.0003，10次种子）上表现显著；而在文本到SQL基准（Spider）上未观察到迁移效应，这与轨迹导向机制的解释一致。在纯Transformer架构（Qwen2.5-3B）上进行的prefix-tuning对照实验显示，所有九种测试配置均导致性能下降-13.9个百分点。在Qwen3.5上，一种逐步状态偏移变体达到+27.1个百分点的提升，优于S0和LoRA，但需承担每步推理开销。综合结果表明，当已验证监督数据稀缺时，循环状态初始化是混合语言模型中一种强大的零推理开销参数高效微调（PEFT）界面。调优后的状态文件约48 MB；任务切换无需权重合并或模型重载。代码与库：https://github.com/jackyoung27/s0-tuning。

Abstract

Using roughly 48 execution-verified HumanEval training solutions, tuning a single initial state matrix per recurrent layer, with zero inference overhead, outperforms LoRA by +10.8 pp (p < 0.001) on HumanEval. The method, which we call S0 tuning, optimizes one state matrix per recurrent layer while freezing all model weights. On Qwen3.5-4B (GatedDeltaNet hybrid), S0 tuning improves greedy pass@1 by +23.6 +/- 1.7 pp (10 seeds). On FalconH1-7B (Mamba-2 hybrid), S0 reaches 71.8% +/- 1.3 and LoRA reaches 71.4% +/- 2.4 (3 seeds), statistically indistinguishable at this sample size while requiring no weight merging. Cross-domain transfer is significant on MATH-500 (+4.8 pp, p = 0.00002, 8 seeds) and GSM8K (+2.8 pp, p = 0.0003, 10 seeds); a text-to-SQL benchmark (Spider) shows no transfer, consistent with the trajectory-steering mechanism. A prefix-tuning control on a pure Transformer (Qwen2.5-3B) degrades performance by -13.9 pp under all nine configurations tested. On Qwen3.5, a per-step state-offset variant reaches +27.1 pp, above both S0 and LoRA but with per-step inference cost. Taken together, the results show that recurrent state initialization is a strong zero-inference-overhead PEFT surface for hybrid language models when verified supervision is scarce. The tuned state is a ~48 MB file; task switching requires no weight merging or model reload. Code and library: https://github.com/jackyoung27/s0-tuning.

Keywords: S0 tuning, parameter-efficient fine-tuning, hybrid recurrent-attention models, zero inference overhead, state matrix optimization, HumanEval, LoRA comparison, cross-domain transfer
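
The core idea in the S0 tuning abstract above, training only a recurrent layer's initial state while every weight stays frozen, can be illustrated with a toy example. The sketch below is not the released library: it replaces the state matrix with a single scalar, and the recurrence constants `a` and `b`, the learning rate, and the squared-error objective are all hypothetical.

```python
def run_recurrence(s0, inputs, a=0.9, b=0.5):
    """Frozen toy recurrence s_t = a * s_{t-1} + b * x_t; returns the final state."""
    s = s0
    for x in inputs:
        s = a * s + b * x
    return s

def tune_initial_state(inputs, target, steps=200, lr=0.5, a=0.9, b=0.5):
    """Gradient descent on the initial state only; a and b are never updated."""
    s0 = 0.0
    grad_scale = a ** len(inputs)        # d(final state) / d(s0)
    for _ in range(steps):
        err = run_recurrence(s0, inputs, a, b) - target
        s0 -= lr * 2.0 * err * grad_scale  # gradient of the squared error
    return s0

inputs = [1.0, -1.0, 0.5]
target = 2.0
s0 = tune_initial_state(inputs, target)
final = run_recurrence(s0, inputs)
print(s0, final)
```

Because only `s0` changes, the forward pass at inference time is identical in cost to the untuned model, which is the sense in which the abstract speaks of zero inference overhead.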

114. ❌ CARE: Privacy-Compliant Agentic Reasoning with Evidence Discordance

Authors: Haochen Liu, Weien Li, Rui Song, Zeyu Li, Chun Jason Xue, Xiao-Yang Liu, Sam Nallaperuma, Xue Liu, Ye Yuan Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01113v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

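Scoring rationale: The paper studies the use of LLMs in medical decision-making and proposes the CARE framework to handle conflicting evidence. Highly relevant keywords: LLMs (the core technology), LLM Agents (it proposes an agentic reasoning framework), AI for Science (medical application setting), and Chain of Thought/System 2 Thinking (it involves multi-stage reasoning). Other keywords such as MoE, SLMs, and Scaling Laws are unrelated to the paper.

!!! tip deepseek-chat TL;DR

The paper examines how existing LLM systems lose decision-making performance in ICU settings when patient-reported symptoms conflict with medical signs. It proposes CARE, a privacy-compliant multi-stage agentic reasoning framework, which outperforms baseline methods on the MIMIC-DOS dataset.

Abstract (Chinese translation)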

大型语言模型系统正日益被用于支持高风险决策，但当可用证据存在内部不一致时，其性能通常会下降。这种场景在现实世界的医疗环境中普遍存在，例如患者报告的症状与医学体征相矛盾。为研究此问题，我们提出了MIMIC-DOS数据集，用于重症监护病房环境下短期器官功能障碍恶化的预测。该数据集源自广泛认可的公开电子健康记录数据集MIMIC-IV，并专门构建于存在体征与症状不一致的病例。这一设定对现有基于大型语言模型的方法构成了重大挑战，单次推理的大型语言模型和智能体流程往往难以调和此类矛盾信号。为解决该问题，我们提出了CARE：一个多阶段、符合隐私规范的智能体推理框架。在该框架中，远程大型语言模型通过生成结构化类别和状态转移来提供指导，而无需访问敏感患者数据；同时，本地大型语言模型利用这些类别和状态转移来支持证据获取和最终决策。实证结果表明，与多种基线设置相比，CARE在所有关键指标上均实现了更优性能，表明其能够在保护隐私的同时更稳健地处理矛盾的临床证据。

Abstract

Large language model (LLM) systems are increasingly used to support high-stakes decision-making, but they typically perform worse when the available evidence is internally inconsistent. Such a scenario exists in real-world healthcare settings, with patient-reported symptoms contradicting medical signs. To study this problem, we introduce MIMIC-DOS, a dataset for short-horizon organ dysfunction worsening prediction in the intensive care unit (ICU) setting. We derive this dataset from the widely recognized MIMIC-IV, a publicly available electronic health record dataset, and construct it exclusively from cases in which discordance between signs and symptoms exists. This setting poses a substantial challenge for existing LLM-based approaches, with single-pass LLMs and agentic pipelines often struggling to reconcile such conflicting signals. To address this problem, we propose CARE: a multi-stage privacy-compliant agentic reasoning framework in which a remote LLM provides guidance by generating structured categories and transitions without accessing sensitive patient data, while a local LLM uses these categories and transitions to support evidence acquisition and final decision-making. Empirically, CARE achieves stronger performance across all key metrics compared to multiple baseline settings, showing that CARE can more robustly handle conflicting clinical evidence while preserving privacy.

Keywords: Large Language Models, Agentic Reasoning, Healthcare Decision-making, Evidence Discordance, Privacy Compliance, MIMIC-DOS Dataset, ICU Organ Dysfunction, Multi-stage Framework

115. ❌ Narrative Fingerprints: Multi-Scale Author Identification via Novelty Curve Dynamics

Authors: Fred Zimmerman, Hilmar AI Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01073v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

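Scoring rationale: The paper studies author identification by analyzing text features through information-theoretic novelty curves. It belongs to computational linguistics/text analysis and does not involve large models, deep learning techniques, or scientific applications, so it is unrelated to all scoring keywords.

!!! tip deepseek-chat TL;DR

By analyzing the dynamics of information-theoretic novelty curves, the study finds that authors have identifiable "fingerprints" at both the book and chapter level, enabling authorship attribution well above chance.

Abstract (Chinese translation)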

我们通过信息论新颖性曲线检验作者是否在其已发表作品中具有独特的“指纹”特征。基于两个语料库——Books3（52,796本书籍，759位符合条件作者）和PG-19（28,439本书籍，1,821位符合条件作者）——的研究发现，作者的声音会在文本新颖性的展开过程中留下可测量的痕迹。该信号具有多尺度特征：在书籍层面，标量动态特征（平均新颖度、变化速度、波动幅度与迂回程度）能以显著高于随机概率的水平识别43%的作者；在章节层面，滑动窗口中的SAX（符号化聚合近似）主题模式实现了30倍于随机概率的归属判定，远超书籍层面主导的标量特征。这些信号具有互补性而非冗余性。研究表明，虽然指纹特征部分与体裁混杂，但约四分之一的作者在相同体裁内仍保持稳定指纹。经典作家（如马克·吐温、简·奥斯汀、吉卜林）的指纹强度与现代作家相当，表明该现象并非当代出版惯例的产物。

Abstract

We test whether authors have characteristic “fingerprints” in the information-theoretic novelty curves of their published works. Working with two corpora – Books3 (52,796 books, 759 qualifying authors) and PG-19 (28,439 books, 1,821 qualifying authors) – we find that authorial voice leaves measurable traces in how novelty unfolds across a text. The signal is multi-scale: at book level, scalar dynamics (mean novelty, speed, volume, circuitousness) identify 43% of authors significantly above chance; at chapter level, SAX motif patterns in sliding windows achieve 30x-above-chance attribution, far exceeding the scalar features that dominate at book level. These signals are complementary, not redundant. We show that the fingerprint is partly confounded with genre but persists within-genre for approximately one-quarter of authors. Classical authors (Twain, Austen, Kipling) show fingerprints comparable in strength to modern authors, suggesting the phenomenon is not an artifact of contemporary publishing conventions.

Keywords: author identification, novelty curves, information-theoretic, text analysis, multi-scale, Books3 corpus, PG-19 corpus, SAX motif patterns
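
The abstract above relies on SAX (Symbolic Aggregate approXimation) motifs in sliding windows. As a rough illustration of how a numeric curve becomes a symbolic word, the sketch below implements textbook SAX with a four-letter alphabet: z-normalise, average over equal segments (PAA), then bin by Gaussian quartile breakpoints. The window size, segment count, alphabet size, and the toy `novelty` values are assumptions and are not taken from the paper.

```python
import statistics

# Quartile breakpoints of the standard normal, used for a 4-symbol alphabet.
BREAKPOINTS = [-0.6745, 0.0, 0.6745]

def sax_word(series, n_segments, alphabet="abcd"):
    """Convert a numeric series into a SAX word of length n_segments."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    z = [(x - mean) / std for x in series] if std > 0 else [0.0] * len(series)
    seg_len = len(z) // n_segments       # assumes the length divides evenly
    paa = [statistics.fmean(z[i * seg_len:(i + 1) * seg_len])
           for i in range(n_segments)]
    # Symbol index = number of breakpoints the segment mean exceeds.
    return "".join(alphabet[sum(v > bp for bp in BREAKPOINTS)] for v in paa)

def sliding_sax(series, window, n_segments):
    """SAX words for every sliding window of the given length."""
    return [sax_word(series[i:i + window], n_segments)
            for i in range(len(series) - window + 1)]

novelty = [0.1, 0.2, 0.1, 0.2, 0.8, 0.9, 0.8, 0.9]   # toy novelty curve
word = sax_word(novelty, n_segments=4)
motifs = sliding_sax(novelty, window=4, n_segments=2)
print(word, motifs)
```

Counting how often each word appears across an author's chapters would give a motif histogram, one plausible form of chapter-level feature.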

116. ❌ Uncertainty-Aware Variational Reward Factorization via Probabilistic Preference Bases for LLM Personalization

Authors: Gyuseok Lee, Wonbin Kweon, Zhenrui Yue, SeongKu Kang, Jiawei Han, Dong Wang Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00997v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	8.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

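Scoring rationale: The paper studies reward factorization for LLM personalization, which falls within LLM alignment and RLHF. It is highly relevant to "Large Language Models" (10 points) because it explicitly studies LLM personalization, and to "Instruction Tuning OR Alignment OR Value Alignment" (10 points) because reward factorization is a core alignment technique. It is relevant to "RLHF OR RLAIF OR Direct Preference Optimization OR DPO" (8 points) because reward modeling is a key component of RLHF and the paper improves reward factorization. Other keywords such as MoE, SFT, and RAG are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes Variational Reward Factorization (VRF), an uncertainty-aware framework that uses probabilistic preference bases and variational distributions to personalize LLMs more accurately and reliably. It outperforms existing methods on three benchmarks.

Abstract (Chinese translation)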

奖励因子分解通过将奖励分解为共享基函数与用户特定权重，实现了对大语言模型（LLM）的个性化适配。然而，现有方法仅依据稀疏数据孤立地将用户权重估计为确定性点值，导致推断结果不准确且不可靠。本文提出变分奖励因子分解（Variational Reward Factorization, VRF），这是一个不确定性感知框架，它将每个用户的偏好表示为共享偏好空间中的变分分布。VRF通过变分编码器推断用户分布，借助与共享概率基之间的Wasserstein距离匹配推导权重，并通过方差衰减损失函数降低不确定性估计的影响。在三个基准测试中，VRF在已见与未见用户、少样本场景以及不同不确定性水平下均优于所有基线方法，其优势进一步延伸至下游对齐任务。

Abstract

Reward factorization personalizes large language models (LLMs) by decomposing rewards into shared basis functions and user-specific weights. Yet, existing methods estimate user weights from scarce data in isolation and as deterministic points, leading to inaccurate and unreliable inference. We introduce Variational Reward Factorization (VRF), an uncertainty-aware framework that represents each user’s preferences as a variational distribution in a shared preference space. VRF infers user distributions via a variational encoder, derives weights through Wasserstein distance matching with shared probabilistic bases, and downweights uncertain estimates through a variance-attenuated loss. On three benchmarks, VRF outperforms all baselines across seen and unseen users, few-shot scenarios, and varying uncertainty levels, with gains extending to downstream alignment.

Keywords: Reward Factorization, LLM Personalization, Uncertainty-Aware, Variational Distribution, Probabilistic Preference Bases, Wasserstein Distance, Few-shot Scenarios, Downstream Alignment
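
The VRF abstract above derives user weights by Wasserstein distance matching against probabilistic bases. A minimal sketch of one way this could work is shown below, assuming diagonal Gaussians for both the user distribution and the bases; for that case the squared 2-Wasserstein distance has the closed form $\lVert\mu_1-\mu_2\rVert^2 + \lVert\sigma_1-\sigma_2\rVert^2$. Turning distances into weights with a softmax is an assumption, as are the function names and toy values.

```python
import math

def w2_squared(mu1, sigma1, mu2, sigma2):
    """Squared 2-Wasserstein distance between two diagonal Gaussians."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    std_term = sum((a - b) ** 2 for a, b in zip(sigma1, sigma2))
    return mean_term + std_term

def basis_weights(user, bases, temperature=1.0):
    """Softmax over negative distances: closer bases get larger weights."""
    mu_u, sigma_u = user
    logits = [-w2_squared(mu_u, sigma_u, mu, sigma) / temperature
              for mu, sigma in bases]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

user = ([0.0, 0.0], [1.0, 1.0])          # (mean, std) of a toy user posterior
bases = [([0.0, 0.0], [1.0, 1.0]),       # identical to the user
         ([3.0, 4.0], [1.0, 1.0])]       # far from the user
weights = basis_weights(user, bases)
print(weights)
```

A user-specific reward would then be the weighted sum of the basis rewards; the variance-attenuated loss mentioned in the abstract is not modeled here.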

117. ❌ Phase transition on a context-sensitive random language model with short range interactions

Authors: Yuma Toji, Jun Takahashi, Vwani Roychowdhury, Hideyuki Miyahara Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00947v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

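Scoring rationale: The paper studies the statistical-mechanical properties of random language models, in particular phase transitions. It sits at the intersection of theoretical physics and computational linguistics rather than applied large-model or deep learning research. The "language model" in the paper is an abstract statistical-mechanics model, not a modern neural LLM, so it is unrelated to the vast majority of keywords (large-model techniques, training methods, inference optimization, applications, and so on). It is only loosely related to "AI for Science", since the work belongs to scientific computing/theoretical physics but is not a typical AI for Science application such as bioinformatics.

!!! tip deepseek-chat TL;DR

The study constructs a context-sensitive random language model with short-range interactions and shows through numerical simulation that a phase transition still occurs without long-range interactions, indicating that phase transitions in language models arise from the intrinsic nature of language rather than from long-range interactions.

Abstract (Chinese translation)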

自E. DeGiuli提出随机语言模型以来[Phys. Rev. Lett. 122, 128301]，语言模型已从统计力学的视角被深入研究。近期，研究通过数值模拟在符号间具有长程相互作用的模型中揭示了Berezinskii–Kosterlitz–Thouless相变的存在。在统计力学中，长程相互作用可诱发相变已是长期认知。因此，语言模型中观测到的相变究竟是否源于传统自旋模型中所缺乏的真正语言特性，这一问题尚未明晰。本研究构建了一个具有短程相互作用的随机语言模型，并数值探究了其统计特性。该模型属于乔姆斯基层级中的上下文敏感文法类别，允许显式引用上下文。我们发现，即使模型仅引用那些长度相对于句子长度保持恒定的上下文，相变依然会发生。这一结果表明，语言模型中的有限温度相变确实是由语言的内在本质所引发，而非由长程相互作用导致。

Abstract

Since the random language model was proposed by E. DeGiuli [Phys. Rev. Lett. 122, 128301], language models have been investigated intensively from the viewpoint of statistical mechanics. Recently, the existence of a Berezinskii–Kosterlitz–Thouless transition was numerically demonstrated in models with long-range interactions between symbols. In statistical mechanics, it has long been known that long-range interactions can induce phase transitions. Therefore, it has remained unclear whether phase transitions observed in language models originate from genuinely linguistic properties that are absent in conventional spin models. In this study, we construct a random language model with short-range interactions and numerically investigate its statistical properties. Our model belongs to the class of context-sensitive grammars in the Chomsky hierarchy and allows explicit reference to contexts. We find that a phase transition occurs even when the model refers only to contexts whose length remains constant with respect to the sentence length. This result indicates that finite-temperature phase transitions in language models are genuinely induced by the intrinsic nature of language, rather than by long-range interactions.

Keywords: random language model, phase transition, short-range interactions, context-sensitive grammar, statistical mechanics, Berezinskii-Kosterlitz-Thouless transition, numerical simulation, Chomsky hierarchy

118. ❌ Positional Cognitive Specialization: Where Do LLMs Learn To Comprehend and Speak Your Language?

Authors: Luis Frentzen Salim, Lun-Wei Ku, Hsing-Kuo Kenneth Pao Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00923v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

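Scoring rationale: The paper studies how LLMs learn new languages (language adaptation) and proposes CogSym, which achieves efficient adaptation by fine-tuning only a subset of layers. Highly relevant keywords: LLMs (the core object of study), Post-training/SFT (fine-tuning methods), PEFT/LoRA (compared with CogSym, with consistent results), and Mechanistic Interpretability (training dynamics studied through layer ablation). Moderately relevant: Pre-training (the training process is involved but is not the focus). The remaining keywords are unrelated to the paper.

!!! tip deepseek-chat TL;DR

The paper studies how LLMs learn new languages during training through positional cognitive specialization (perception and production) and proposes CogSym, a layer-wise heuristic: fine-tuning only the outermost 25% of layers comes within 2-3% of full fine-tuning performance.

Abstract (Chinese translation)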

使大型语言模型（LLMs）适应新语言是一个成本高昂且不透明的过程。理解语言模型如何习得新语言及多语言能力是实现高效适应的关键。先前关于多语言可解释性的研究主要关注训练完成的模型如何处理多语言指令，而对其在训练期间习得新语言的机制尚未深入探索。我们通过两种功能性认知专门化视角——语言感知（输入理解）与语言产出（输出生成），研究了仅解码器架构变换器模型在此过程中的训练动态。通过对低资源语言的实验，我们通过从模型输入和输出方向进行层消融扫描，展示了语言模型中不同区域如何形成感知与产出的专门化。基于观察到的专门化模式，我们提出了CogSym——一种分层启发式方法，该方法通过仅微调少数早期和晚期层来实现有效适应。实验表明，仅微调最外层25%的层即可在下游任务上达到与全参数微调基线相差2-3%以内的性能。CogSym与LoRA等适配器方法性能相当，展现出超越全参数微调的泛化能力。这些发现为深入理解LLMs如何学习新语言提供了见解，并推动语言模型向更易获取和更具包容性的方向发展。

Abstract

Adapting large language models (LLMs) to new languages is an expensive and opaque process. Understanding how language models acquire new languages and multilingual abilities is key to achieve efficient adaptation. Prior work on multilingual interpretability research focuses primarily on how trained models process multilingual instructions, leaving unexplored the mechanisms through which they acquire new languages during training. We investigate these training dynamics on decoder-only transformers through the lens of two functional cognitive specializations: language perception (input comprehension) and production (output generation). Through experiments on low-resource languages, we demonstrate how perceptual and productive specialization emerges in different regions of a language model by running layer ablation sweeps from the model’s input and output directions. Based on the observed specialization patterns, we propose CogSym, a layer-wise heuristic that enables effective adaptation by exclusively fine-tuning a few early and late layers. We show that tuning only the 25% outermost layers achieves downstream task performance within 2-3% deviation from the full fine-tuning baseline. CogSym yields consistent performance with adapter methods such as LoRA, showcasing generalization beyond full fine-tuning. These findings provide insights to better understand how LLMs learn new languages and push toward accessible and inclusive language modeling.

Keywords: Large Language Models, Language Adaptation, Multilingual Interpretability, Training Dynamics, Layer-wise Fine-tuning, CogSym, Parameter-efficient Fine-tuning, Decoder-only Transformers
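
The CogSym abstract above reports tuning only the 25% outermost layers, meaning a few early and a few late ones. The sketch below shows one simple way to pick such layer indices. Splitting the budget evenly between the input and output ends is an assumption, as are the function name and the 32-layer example; the abstract does not state the exact split.

```python
def outermost_layers(n_layers, fraction=0.25):
    """Return 0-based indices of the earliest and latest layers to fine-tune."""
    n_tuned = max(2, round(n_layers * fraction))
    n_early = n_tuned // 2               # even split: an assumption
    n_late = n_tuned - n_early
    early = list(range(n_early))
    late = list(range(n_layers - n_late, n_layers))
    return early + late

selected = outermost_layers(32)
# Boolean mask: True means the layer is trainable, False means frozen.
trainable = [i in selected for i in range(32)]
print(selected)
```

In a training script, the mask would decide which layers have gradients enabled, with every middle layer left frozen.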

119. ❌ GPT-NL Public Corpus: A Permissively Licensed, Dutch-First Dataset for LLM Pre-training

Authors: Jesse van Oort, Frank Brinkkemper, Erik de Graaf, Bram Vanroy, Saskia Lensink Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00920v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

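Scoring rationale: The paper's core contribution is the GPT-NL Public Corpus, a large Dutch corpus for LLM pre-training containing 36B Dutch tokens plus a large volume of tokens in other languages, all permissively licensed. Because it directly addresses data preparation for LLM pre-training, it is highly relevant to "Large Language Models" and "Pre-training" (10 points each). The paper states that data is collected and evaluated to support lawful, useful, and non-harmful language models, which touches on data quality indirectly, so it is somewhat related to "Scaling Laws AND Data Quality" (5 points). It does not address the techniques or applications behind other keywords such as MoE, SFT, RAG, inference acceleration, or AI for Science, which score 0.

!!! tip deepseek-chat TL;DR

The paper builds the GPT-NL Public Corpus, the largest permissively licensed Dutch corpus, containing 36B Dutch tokens plus large amounts of data in other languages, to provide lawful, useful, and non-harmful data for LLM pre-training.

Abstract (Chinese translation)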

我们推出GPT-NL公共语料库——这是目前许可最开放、规模最大的荷兰语资源集合。该语料库包含21个纯荷兰语数据集，总计包含360亿个经过预处理的荷兰语词汇单元，这些内容未出现于任何其他大语言模型预训练语料库。此外，该语料库还整合了约2070亿英语词汇单元、2320亿代码词汇单元及480亿德语/丹麦语词汇单元，这些数据源自现有数据集并经过合规性筛选。本语料库既收录了来自Common Corpus、Common Crawl等大型现有语料库的精选数据，也包含了全新构建的荷兰语专项数据集。多数新建的荷兰语数据集由与机构合作收集的内容或经合成增强的内容构成。所有数据的收集与评估均以促进开发合法、实用且无害的（商业）语言模型为目标。GPT-NL公共语料库的全部数据均来自采用开放许可的数据集，并经过整理后以CC-BY许可协议重新发布。完整数据集已在Hugging Face Hub平台公开提供。

Abstract

We present the GPT-NL Public Corpus, the biggest permissively licensed corpus of Dutch language resources. The GPT-NL Public Corpus contains 21 Dutch-only collections totalling 36B preprocessed Dutch tokens not present in any other LLM pretraining corpus. Additionally, the corpus includes roughly 207B English, 232B Code, and 48B German/Danish tokens taken from existing sets which we further curated for compliance. This corpus includes curated data from large existing corpora like Common Corpus and Common Crawl, as well as newly created Dutch-specific collections. Most newly created Dutch collections consist of content collected in collaboration with organisations or synthetically augmented content. All data is collected and evaluated with the aim of facilitating the creation of (commercial) language models that are lawful, useful and non-harmful. All data included in the GPT-NL Public Corpus is sourced from datasets with permissive licensing and is curated and redistributed under a CC-BY license. The full dataset is publicly available on the Hugging Face Hub.

Keywords: Dutch language corpus, LLM pre-training, permissive licensing, multilingual data, data curation, public dataset, Hugging Face Hub, token preprocessing
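
The token counts quoted in the abstract above can be combined into a rough language mix. The short calculation below does that arithmetic. The English, code, and German/Danish figures are given in the abstract as approximate ("roughly"), so the total and the shares are approximate as well.

```python
# Token counts in billions, as quoted in the abstract.
tokens_b = {"Dutch": 36, "English": 207, "Code": 232, "German/Danish": 48}

total_b = sum(tokens_b.values())
shares = {name: count / total_b for name, count in tokens_b.items()}

for name, share in shares.items():
    print(f"{name}: {tokens_b[name]}B tokens, {share:.1%} of the corpus")
print(f"Total: about {total_b}B tokens")
```

On these figures the Dutch-only collections make up roughly 7% of the corpus by token count, even though the corpus is described as Dutch-first.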

120. ❌ Benchmarking and Mechanistic Analysis of Vision-Language Models for Cross-Depiction Assembly Instruction Alignment

Authors: Zhuchenyang Liu, Yao Zhang, Yu Xiao Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00913v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

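Scoring rationale: The paper studies how Vision-Language Models (VLMs) perform on cross-depiction assembly instruction alignment, i.e. large models applied to a specific domain (vision-language understanding). It has some relevance to "Large Language Models" (VLMs extend LLMs), is highly relevant to "Alignment" (it studies instruction alignment strategies) and to "Mechanistic Interpretability" (it carries out a three-level mechanistic analysis), and has some relevance to "AI for Science" (it is applied AI research). Other keywords such as MoE, SFT and RAG are not addressed.

!!! tip deepseek-chat TL;DR

By building the IKEA-Bench benchmark and evaluating 19 VLMs, the paper studies vision-language models on cross-depiction assembly instruction alignment. It finds that visual encoding is the main bottleneck for cross-depiction robustness and shows that adding text shifts models from visual to text-driven reasoning.

Abstract translation

2D assembly diagrams are often abstract and hard to follow, which creates a need for intelligent assistants that monitor assembly progress, detect errors and give step-by-step guidance. In mixed reality settings, such systems must recognize completed and ongoing steps from the camera feed and align them with the diagram instructions. Vision-language models show promise for this task, but they face a depiction gap because assembly diagrams and video frames share very few visual features. To assess this gap systematically, we build IKEA-Bench, a benchmark of 1,623 questions across 6 task types on 29 IKEA furniture products, and evaluate 19 vision-language models with 2B to 38B parameters under three alignment strategies. The main findings are: (1) assembly instruction understanding can be recovered through text, but text also degrades diagram-to-video alignment; (2) architecture family predicts alignment accuracy better than parameter count; (3) video understanding remains a hard bottleneck that no strategy removes. A three-level mechanistic analysis further shows that diagrams and video occupy disjoint subspaces in the vision Transformer, and that adding text shifts the model from visual to text-driven reasoning. These results point to visual encoding as the primary target for improving cross-depiction robustness. Project page: https://ryenhails.github.io/IKEA-Bench/

Abstract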

2D assembly diagrams are often abstract and hard to follow, creating a need for intelligent assistants that can monitor progress, detect errors, and provide step-by-step guidance. In mixed reality settings, such systems must recognize completed and ongoing steps from the camera feed and align them with the diagram instructions. Vision Language Models (VLMs) show promise for this task, but face a depiction gap because assembly diagrams and video frames share few visual features. To systematically assess this gap, we construct IKEA-Bench, a benchmark of 1,623 questions across 6 task types on 29 IKEA furniture products, and evaluate 19 VLMs (2B-38B) under three alignment strategies. Our key findings: (1) assembly instruction understanding is recoverable via text, but text simultaneously degrades diagram-to-video alignment; (2) architecture family predicts alignment accuracy more strongly than parameter count; (3) video understanding remains a hard bottleneck unaffected by strategy. A three-level mechanistic analysis further reveals that diagrams and video occupy disjoint ViT subspaces, and that adding text shifts models from visual to text-driven reasoning. These results identify visual encoding as the primary target for improving cross-depiction robustness. Project page: https://ryenhails.github.io/IKEA-Bench/

Keywords: Vision-Language Models, Cross-Depiction Alignment, Assembly Instructions, Benchmark Evaluation, Mechanistic Analysis, Visual Encoding, IKEA-Bench, Instruction Understanding

121. ❌ Agentic Tool Use in Large Language Models

Authors: Jinchao Hu, Meizhi Zhong, Kehai Chen, Xuefeng Bai, Min Zhang Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00835v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

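Scoring rationale: "Agentic Tool Use in Large Language Models" is a survey of how large language models use tools when acting as autonomous agents. It is highly relevant to three keywords: 1) "Large Language Models OR LLMs OR Foundation Models" (10 points), since the title and abstract focus explicitly on LLMs; 2) "LLM Agents OR Autonomous Agents OR Agentic Workflow" (10 points), since it studies the deployment of LLMs as autonomous agents; 3) "Tool Use OR Function Calling OR API Tool Use" (10 points), since its core is an analysis of tool-use methods and paradigms. Other keywords, such as MoE, quantization, inference acceleration and AI for science applications, are not covered and score 0.

!!! tip deepseek-chat TL;DR

This survey groups LLM tool-use methods into three paradigms (plug-and-play prompting, supervised tool learning, and reward-driven tool policy learning), analyzes their methods, strengths and failure modes, and aims to address the fragmentation of research in this area with a more structured, evolutionary view.

Abstract translation

Large language models are increasingly deployed as autonomous agents, but their real-world effectiveness depends on reliable tools for information retrieval, computation and external action. Existing research remains fragmented across tasks, tool types and training settings, and lacks a unified view of how tool-use methods differ and evolve. This paper organizes the literature into three paradigms: plug-and-play prompting, supervised tool learning, and reward-driven tool policy learning. It analyzes their methods, strengths and failure modes, reviews the evaluation landscape and identifies key challenges, with the aim of consolidating this fragmented research and offering a more structured, evolutionary view of agentic tool use.

Abstract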

Large language models are increasingly being deployed as autonomous agents yet their real world effectiveness depends on reliable tools for information retrieval, computation and external action. Existing studies remain fragmented across tasks, tool types, and training settings, lacking a unified view of how tool-use methods differ and evolve. This paper organizes the literature into three paradigms: prompting as plug-and-play, supervised tool learning and reward-driven tool policy learning, analyzes their methods, strengths and failure modes, reviews the evaluation landscape and highlights key challenges, aiming to address this fragmentation and provide a more structured evolutionary view of agentic tool use.

Keywords: Large Language Models, Autonomous Agents, Tool Use, Agentic Tool Use, Supervised Tool Learning, Reward-driven Tool Policy, Evaluation Landscape, Fragmentation

Authors: Patrick Amadeus Irawan, Erland Hilman Fuadi, Shanu Kumar, Alham Fikri Aji, Yova Kementchedjhieva Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00829v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	10.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

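Scoring rationale: The paper studies the loss of linguistic capability that occurs when pretrained language models (LLMs) are adapted into vision-language models (VLMs), and proposes a recovery method based on knowledge distillation. Highly relevant keywords: 1) "Large Language Models OR LLMs OR Foundation Models" (10 points), since the paper directly studies how pretrained language models behave under cross-modal adaptation; 2) "KV Cache Compression OR Linear Attention OR FlashAttention" (10 points), since the paper introduces "layer-wise KV-cache sharing" as the key technique that enables cross-modal distillation. Moderately relevant keywords: 1) "Pre-training OR Continual Pre-training OR Domain Adaptation" (5 points), since it involves cross-modal adaptation of pretrained models; 2) "Post-training OR Supervised Fine-tuning OR SFT" (5 points), since it involves task-specific fine-tuning; 3) "PEFT OR LoRA OR Parameter-efficient Fine-tuning" (5 points), since the proposed LinguDistill method is classed as a parameter-efficient adaptation method. The remaining keywords have no direct connection to the paper.

!!! tip deepseek-chat TL;DR

The paper addresses the loss of linguistic capability that occurs when pretrained language models are adapted into vision-language models. It proposes LinguDistill, a selective cross-modal distillation method based on layer-wise KV-cache sharing, which recovers roughly 10% of the performance lost on language benchmarks without adding extra modules, while preserving visual grounding on multimodal tasks.

Abstract translation

Adapting pretrained language models (LMs) into vision-language models (VLMs) can weaken their original linguistic capability because of the representation shift and cross-modal interference introduced during multimodal adaptation. This loss is hard to recover, even with targeted task-specific fine-tuning using standard objectives. Earlier recovery methods usually add extra modules as intermediate alignment layers to maintain or isolate modality-specific subspaces, which increases architectural complexity, adds parameters at inference time, and limits flexibility across models and settings. We propose LinguDistill, an adapter-free distillation method that restores linguistic capability by using the original frozen LM as the teacher. We overcome the key challenge of enabling vision-conditioned teacher supervision by introducing layer-wise KV-cache sharing, which exposes the teacher to the student's multimodal representations without changing the architecture of either model. We then selectively distill the teacher's strong linguistic signal on language-intensive data to recover language capability while preserving the student's visual grounding on multimodal tasks. Experiments show that LinguDistill recovers about 10% of the performance lost on language and knowledge benchmarks while keeping comparable performance on vision-heavy tasks. Our study shows that linguistic capability can be recovered without additional modules, which offers an efficient and practical solution to modality-specific degradation in multimodal models.

Abstract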

Adapting pretrained language models (LMs) into vision-language models (VLMs) can degrade their native linguistic capability due to representation shift and cross-modal interference introduced during multimodal adaptation. Such loss is difficult to recover, even with targeted task-specific fine-tuning using standard objectives. Prior recovery approaches typically introduce additional modules that act as intermediate alignment layers to maintain or isolate modality-specific subspaces, which increases architectural complexity, adds parameters at inference time, and limits flexibility across models and settings. We propose LinguDistill, an adapter-free distillation method that restores linguistic capability by utilizing the original frozen LM as a teacher. We overcome the key challenge of enabling vision-conditioned teacher supervision by introducing layer-wise KV-cache sharing, which exposes the teacher to the student’s multimodal representations without modifying the architecture of either model. We then selectively distill the teacher’s strong linguistic signal on language-intensive data to recover language capability, while preserving the student’s visual grounding on multimodal tasks. As a result, LinguDistill recovers $\sim$10% of the performance lost on language and knowledge benchmarks, while maintaining comparable performance on vision-heavy tasks. Our findings demonstrate that linguistic capability can be recovered without additional modules, providing an efficient and practical solution to modality-specific degradation in multimodal models.

Keywords: Vision-Language Models, Linguistic Capability Recovery, Cross-Modal Distillation, KV-Cache Sharing, Multimodal Adaptation, Parameter-Efficient Fine-tuning, Knowledge Distillation, Representation Shift
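
The abstract does not spell out the distillation objective, so the following is only a generic sketch of what "selective" distillation could look like: a standard KL(teacher || student) loss over softened output distributions, averaged only over examples flagged as language-intensive. The function names, the `is_language_heavy` flag and the temperature parameter are assumptions for illustration; the layer-wise KV-cache sharing that the paper relies on is not modelled here.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def selective_distill_loss(teacher_logits, student_logits, is_language_heavy,
                           temperature=1.0):
    """Mean KL(teacher || student) over the flagged examples only.

    Unflagged examples (e.g. vision-heavy ones) contribute nothing, so the
    student's behaviour on them is not pulled towards the teacher.
    """
    losses = [
        kl_divergence(softmax(t, temperature), softmax(s, temperature))
        for t, s, flagged in zip(teacher_logits, student_logits, is_language_heavy)
        if flagged
    ]
    return sum(losses) / len(losses) if losses else 0.0
```

In this sketch the mask is what makes the loss selective: setting a flag to `False` removes that example from the distillation term entirely.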

123. ❌ Multimodal Language Models Cannot Spot Spatial Inconsistencies

Authors: Om Khangaonkar, Hadi J. Rad, Hamed Pirsiavash Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00799v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	5.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

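Scoring rationale: The paper studies the limits of multimodal large language models (MLLMs) in reasoning about spatial consistency, i.e. large models applied to visual understanding. The core keyword is "Large Language Models" (10 points), because the paper explicitly studies MLLMs. Other relevant keywords: "Chain of Thought" and "System 2 Thinking" (5 points each), which concern multi-step and in-depth reasoning; "Hallucination Mitigation" (5 points), which relates to model factuality and truthfulness; "Mechanistic Interpretability" (5 points), which relates to interpretability analysis; and "World Models" (5 points), which relates to understanding of the physical world. The remaining keywords have no direct connection to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper finds that multimodal large language models perform significantly worse than humans at identifying 3D motion inconsistencies in multi-view scenes, which reveals a fragile and incomplete understanding of the spatial structure of the physical world.

Abstract translation

Spatial consistency is a fundamental property of the visual world and a key requirement for models that aim to understand physical reality. Despite recent progress, multimodal large language models (MLLMs) still often struggle to reason about 3D geometry across multiple views. Rather than asking models to describe scene attributes, we introduce a more challenging task: given two views of the same scene, identify the object that violates 3D motion consistency. We propose a simple and scalable method for generating realistic, spatially inconsistent image pairs from multi-view scenes, which makes it possible to evaluate this capability systematically. Our results show that state-of-the-art MLLMs perform significantly worse than human observers and vary substantially across scene attributes, which reveals a fragile and incomplete understanding of 3D structure. We hope our findings underline the need for approaches that develop a more deeply grounded understanding of the physical world.

Abstract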

Spatial consistency is a fundamental property of the visual world and a key requirement for models that aim to understand physical reality. Despite recent advances, multimodal large language models (MLLMs) often struggle to reason about 3D geometry across multiple views. Rather than asking models to describe scene attributes, we introduce a more challenging task: given two views of the same scene, identify the object that violates 3D motion consistency. We propose a simple and scalable method for generating realistic, spatially inconsistent image pairs from multi-view scenes, enabling systematic evaluation of this capability. Our results show that state-of-the-art MLLMs significantly underperform human observers and exhibit substantial variability across different scene attributes, revealing a fragile and incomplete understanding of 3D structure. We hope our findings underscore the need for approaches that develop a more deeply grounded understanding of the physical world.

Keywords: multimodal large language models, spatial consistency, 3D geometry, motion consistency, visual reasoning, model evaluation, physical world understanding, multi-view scenes

124. ❌ Valency Classification of Mapudungun Verbal Roots. Established by the language’s own morphotactics

Authors: Andrés Chandía Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00789v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

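Scoring rationale: The paper studies the valency classification of verbs in Mapudungun. It belongs to linguistics, relies entirely on morphological analysis, and involves no large models, deep learning, AI techniques or related technical principles. All keywords concern large-model technology, AI applications or related methodology, whereas this paper is a purely linguistic analysis, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

Based on the morphological structure of Mapudungun itself, the paper classifies the valency of roots already confirmed to be verbal, with the aim of improving a morphological analyser and advancing the theoretical understanding of valency in Mapuche verb forms.

Abstract translation

In previous work, we carried out a lexical (re)categorisation, or a confirmation of the existing category, of roots identified as verbal in order to determine their original category accurately. Building on this, the present paper gives an account of the valency classification of those Mapudungun roots confirmed to be verbal, based on the language's own morphotactics; specifically, by examining the permitted and restricted combinations of various suffixes with roots or verbal stems in the Mapuche verb form. As with all work so far, the results presented here aim to improve the morphological analyser (Dungupeyum), with all verified findings incorporated into the system. From a theoretical perspective, we also hope to contribute to the recognition and understanding of valency-related issues in Mapuche verb forms.

Abstract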

In the previous work, a lexical (re)categorisation – or confirmation of the given category – of roots identified as verbal was undertaken to determine their original category accurately. Building on this, the present paper offers an account of the valency classification of those Mapudungun roots confirmed to be verbal, using the language’s own morphotactics; specifically, by examining the permissible and restricted combinations of various suffixes with roots or verbal stems in the Mapuche verb form. As with all work conducted thus far, the results presented here aim to improve the morphological analyser (Dungupeyum) with all verified findings incorporated into the system. From a theoretical perspective, we also hope to contribute to the recognition and understanding of issues related to the valency of Mapuche verb forms.

Keywords: Mapudungun, verbal roots, valency classification, morphotactics, morphological analyser, Mapuche verb forms, suffix combinations, Dungupeyum

125. ❌ From Early Encoding to Late Suppression: Interpreting LLMs on Character Counting Tasks

Authors: Ayan Datta, Mounika Marreddy, Alexander Mehler, Zhixue Zhao, Radhika Mamidi Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00778v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

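Scoring rationale: The paper studies why large language models (LLMs) fail on symbolic reasoning tasks such as character counting, and uses mechanistic analysis (for example activation patching and attention head tracing) to reveal structured interference in the internal computation graph. It is therefore highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points) and to "Mechanistic Interpretability OR Explainable AI" (10 points), because it uses interpretability methods to analyze the model's internal mechanisms. Other keywords such as MoE, SFT, RAG and quantization are not addressed and score 0.

!!! tip deepseek-chat TL;DR

The paper studies why large language models fail on simple symbolic tasks such as character counting. It finds that models encode character information correctly internally, but negative circuits in later layers suppress it, which produces wrong outputs and shows that structured interference is the root of the errors.

Abstract translation

Large language models (LLMs) excel on complex benchmarks yet fail on elementary symbolic tasks such as character counting. Although this limitation has been noted, its internal causes remain unclear. We use character counting (for example, "How many p's are in apple?") as a minimal, controlled probe that separates token-level reasoning from higher-level confounds. In this setting we find a consistent phenomenon across modern architectures, including LLaMA, Qwen and Gemma: models often compute the correct answer internally but fail to express it at the output layer.
Through a mechanistic analysis that combines probing classifiers, activation patching, logit lens analysis and attention head tracing, we show that character-level information is encoded in early and mid-layer representations. This information is then attenuated by a small set of components in later layers, especially the MLPs of the penultimate and final layers. We identify these components as negative circuits: subnetworks that downweight the correct signal in favor of higher-probability but incorrect outputs.
Our results lead to two contributions. First, we show that symbolic reasoning failures in LLMs are not caused by missing representations or insufficient scale, but by structured interference inside the model's computation graph. This explains why such errors persist and can worsen under scaling and instruction tuning. Second, we provide evidence that the LLM forward pass implements a form of competitive decoding: correct and incorrect hypotheses coexist and are dynamically reweighted, and the final output is determined by suppression as much as by amplification.
These findings have implications for interpretability and robustness: simple symbolic reasoning exposes weaknesses in modern LLMs and underlines the need for design strategies that ensure information is both encoded and reliably used.

Abstract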

Large language models (LLMs) exhibit failures on elementary symbolic tasks such as character counting in a word, despite excelling on complex benchmarks. Although this limitation has been noted, the internal reasons remain unclear. We use character counting (e.g., “How many p’s are in apple?”) as a minimal, controlled probe that isolates token-level reasoning from higher-level confounds. Using this setting, we uncover a consistent phenomenon across modern architectures, including LLaMA, Qwen, and Gemma: models often compute the correct answer internally yet fail to express it at the output layer. Through mechanistic analysis combining probing classifiers, activation patching, logit lens analysis, and attention head tracing, we show that character-level information is encoded in early and mid-layer representations. However, this information is attenuated by a small set of components in later layers, especially the penultimate and final layer MLP. We identify these components as negative circuits: subnetworks that downweight correct signals in favor of higher-probability but incorrect outputs. Our results lead to two contributions. First, we show that symbolic reasoning failures in LLMs are not due to missing representations or insufficient scale, but arise from structured interference within the model’s computation graph. This explains why such errors persist and can worsen under scaling and instruction tuning. Second, we provide evidence that LLM forward passes implement a form of competitive decoding, in which correct and incorrect hypotheses coexist and are dynamically reweighted, with final outputs determined by suppression as much as by amplification. These findings carry implications for interpretability and robustness: simple symbolic reasoning exposes weaknesses in modern LLMs, underscoring need for design strategies that ensure information is encoded and reliably used.

Keywords: Large Language Models, LLMs, character counting, mechanistic interpretability, activation patching, negative circuits, symbolic reasoning, competitive decoding
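
To make the "computed internally, suppressed late" pattern concrete, here is a toy logit-lens sketch: each layer's hidden state is decoded with the output (unembedding) matrix and the top token per layer is reported. Everything in it is hypothetical: the three-token vocabulary, the identity unembedding matrix and the hidden states are invented toy values, not data from the paper, and the final layer norm that a real logit lens usually applies is omitted.

```python
def logit_lens(hidden_states, unembed, vocab):
    """Decode every layer's hidden state with the unembedding matrix.

    hidden_states: one vector per layer (list of lists of floats)
    unembed:       one row per vocabulary entry
    vocab:         token strings, aligned with the rows of `unembed`
    """
    decoded = []
    for hidden in hidden_states:
        logits = [sum(w * h for w, h in zip(row, hidden)) for row in unembed]
        best = max(range(len(logits)), key=logits.__getitem__)
        decoded.append(vocab[best])
    return decoded


# Toy setup for "How many p's are in apple?" (correct answer: "2").
vocab = ["1", "2", "3"]
unembed = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_hidden_states = [
    [0.1, 0.2, 0.1],  # early layer: weak preference for "2"
    [0.0, 0.9, 0.3],  # middle layer: strong preference for "2"
    [0.2, 0.5, 0.8],  # late layer: the "2" signal is downweighted
]

print(logit_lens(toy_hidden_states, unembed, vocab))
```

With these toy values the correct token leads in the early and middle layers and loses to a wrong token in the last one, which is the shape of the failure the abstract describes.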

126. ❌ From Baselines to Preferences: A Comparative Study of LoRA/QLoRA and Preference Optimization for Mental Health Text Classification

Authors: Mihael Arcan Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00773v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	15.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

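Scoring rationale: The paper studies LoRA/QLoRA parameter-efficient fine-tuning (PEFT) and preference optimization (DPO and others) for mental health text classification, so it is highly relevant to the "PEFT/LoRA" and "RLHF/DPO" keywords (10-15 points). It also touches on large-model applications (5 points), context windows (5 points), quantization (5 points, through QLoRA) and AI for science applications (5 points). Other keywords such as MoE, Scaling Laws and RAG are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper systematically compares LoRA/QLoRA parameter-efficient fine-tuning with preference optimization methods such as DPO on mental health text classification. It finds that the effect of optimization depends heavily on the specific method and configuration, and proposes a practical framework that moves from baselines to selective use of preference optimization.

Abstract translation

Mental health text classification has rapidly adopted modern adaptation methods, yet practical guidance on which optimization strategy to use, when, and why remains limited. This paper presents a systematic comparative study of optimization pathways for a joint mental health classification task, moving from strong vanilla baselines to progressively more specialized techniques. We first establish classical and encoder baselines, then examine parameter-efficient supervised fine-tuning with LoRA/QLoRA under multiple objective and optimization settings, and finally evaluate preference-based optimization (DPO, ORPO and KTO), including class-rebalanced training. Rather than emphasizing a single top score, we focus on methodological insight: how performance changes with objective formulation, adapter choice, optimizer behavior, context window design and class-balance interventions. The results show that optimization effects depend heavily on the method: some approaches deliver stable, transferable gains, while others are sensitive to configuration and data balance. Preference optimization in particular varies widely across objectives, which indicates that method selection matters more than simply adding a preference-training stage. The central contribution is a clear optimization path for mental health NLP: start from transparent baselines, apply controlled fine-tuning, and use preference optimization selectively where its gains can be demonstrated. This gives a reproducible, practically grounded framework for choosing effective training strategies beyond architecture choice alone.

Abstract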

Mental health text classification has rapidly adopted modern adaptation methods, yet practical guidance on which optimization strategy to use, when, and why remains limited. This paper presents a systematic comparative study of optimization pathways for a joint mental-health classification task, moving from strong vanilla baselines to progressively more specialized techniques. We first establish classical and encoder references, then examine parameter-efficient supervised fine-tuning with LoRA/QLoRA under multiple objective and optimization settings, and finally evaluate preference-based optimization with DPO, ORPO, and KTO, including class-rebalanced training. Rather than emphasizing a single headline score, we focus on methodological insight: how performance changes with objective formulation, adapter choice, optimizer behavior, context windowing, and class-balance intervention. The results show that optimization effects are highly method-dependent: some approaches deliver stable, transferable gains, while others are sensitive to configuration and data balance. Preference optimization, in particular, exhibits large variation across objectives, indicating that method selection is more consequential than simply adding a preference-training stage. The central contribution is a clear optimization narrative for mental health NLP: start from transparent baselines, apply controlled tuning, and use preference optimization selectively where its gains are demonstrable. This provides a reproducible and practically grounded framework for choosing effective training strategies beyond architecture choice alone.

Keywords: LoRA, QLoRA, DPO, preference optimization, mental health text classification, parameter-efficient fine-tuning, supervised fine-tuning, optimization strategies
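
For readers less familiar with the adapters compared here, the low-rank update behind LoRA can be sketched in a few lines. The frozen weight `W` is left untouched and a trainable product `B @ A`, scaled by `alpha / r`, is added to it; QLoRA additionally stores the frozen `W` in a quantized 4-bit form, which this sketch does not model. The function names and toy matrices are illustrative and do not come from the paper.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Compute h = W x + (alpha / r) * B (A x).

    W: frozen weight, shape d_out x d_in
    A: trainable down-projection, shape r x d_in
    B: trainable up-projection, shape d_out x r (zero-initialised in LoRA)
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def lora_trainable_params(d_in, d_out, r):
    """Number of trainable parameters in A and B combined."""
    return r * (d_in + d_out)


# Toy example: 2x2 identity weight with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[1.0], [0.0]]
print(lora_forward(W, A, B, [2.0, 3.0], alpha=2.0))
```

The parameter count shows why the approach is called parameter-efficient: for a 4096 x 4096 layer, a rank-8 adapter trains 65,536 values instead of about 16.8 million.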

127. ❌ Stochastic Attention: Connectome-Inspired Randomized Routing for Expressive Linear-Time Attention

Authors: Zehao Jin, Yanan Sui Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00754v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	5.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	8.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

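Scoring rationale: The paper proposes Stochastic Attention, an attention mechanism inspired by the fruit-fly whole-brain connectome, as a drop-in enhancement for sliding-window attention (SWA). Its core innovation is randomly permuting the token sequence to extend the expressivity of efficient attention, which counts as an innovation in the technical principles of large models. It is highly relevant to the following keywords: (1) 'Large Language Models' (validated on Qwen3-8B and Qwen3-30B-A3B); (2) 'Pre-training' (language models pre-trained from scratch); (3) 'KV Cache Compression OR Linear Attention OR FlashAttention' (it falls under efficient attention mechanisms). It is partly related to 'Mixture of Experts', 'Context Window Extension', 'Speculative Decoding', and 'AI for Science' (model architecture, inference efficiency, biological inspiration). Other keywords, such as alignment, fine-tuning, and agents, are not covered.

!!! tip deepseek-chat TL;DR

Inspired by the fruit-fly brain connectome, the paper proposes Stochastic Attention, which randomly permutes the token sequence so that sliding-window attention gains global reach. It outperforms SWA in both pre-training and training-free inference while keeping the $O(nw)$ per-layer cost, which is linear in sequence length for a fixed window.

Abstract (Chinese translation)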

果蝇全脑连接组包含超过13万个神经元，其连接概率仅为0.02%，却实现了平均仅4.4跳的最短路径。尽管该网络在回路层面高度结构化，其长程连接却广泛分布于各脑区，起到随机捷径的作用，从而实现高效的全局通信。受此启发，我们提出随机注意力（Stochastic Attention，SA），这是一种对滑动窗口注意力（Sliding-Window Attention，SWA）的直接增强方法：它在窗口注意力计算前对令牌序列施加随机排列，并在计算后恢复原始顺序。这一操作在保持每层$O(nw)$计算复杂度不变的前提下，将固定的局部窗口转化为随机的全局窗口。随着网络深度增加，独立采样的排列会产生指数级增长的感受野，仅需$O(\log_w n)$层即可实现全序列覆盖，而SWA需要$O(n/w)$层。我们在两种场景中验证了SA的有效性：一是在从头预训练语言模型中，门控SA与SWA的组合取得了最佳的平均零样本准确率；二是在Qwen3-8B和Qwen3-30B-A3B模型上进行免训练推理，SA在相近的计算预算下持续优于SWA，并达到或超越了混合块注意力（Mixture of Block Attention）的性能。这些结果表明，受连接组启发的随机路由是一种实用的基础模块，能够提升高效注意力的表达能力，并与现有的线性和稀疏注意力方法形成互补。

Abstract

The whole-brain connectome of a fruit fly comprises over 130K neurons connected with a probability of merely 0.02%, yet achieves an average shortest path of only 4.4 hops. Despite being highly structured at the circuit level, the network’s long-range connections are broadly distributed across brain regions, functioning as stochastic shortcuts that enable efficient global communication. Inspired by this observation, we propose Stochastic Attention (SA), a drop-in enhancement for sliding-window attention (SWA) that applies a random permutation to the token sequence before windowed attention and restores the original order afterward. This transforms the fixed local window into a stochastic global one within the same $O(nw)$ per-layer budget. Through depth, independently sampled permutations yield exponentially growing receptive fields, achieving full sequence coverage in $O(\log_w n)$ layers versus $O(n/w)$ for SWA. We validate SA in two settings: pre-training language models from scratch, where a gated SA + SWA combination achieves the best average zero-shot accuracy, and training-free inference on Qwen3-8B and Qwen3-30B-A3B, where SA consistently outperforms SWA and matches or exceeds Mixture of Block Attention at comparable compute budgets. These results suggest that connectome-inspired stochastic routing is a practical primitive for improving the expressivity of efficient attention, complementary to existing linear and sparse approaches.

Keywords: Stochastic Attention, sliding-window attention, connectome-inspired, efficient attention, random permutation, receptive field, linear-time attention, language models
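The permute, attend, and restore steps described in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes a single head, a non-causal window of half-width `w`, no learned projections, and a fresh uniform permutation per call. The function names `window_attention` and `stochastic_attention` are hypothetical, and the abstract does not say how causal masking or the gated SA + SWA combination is handled.

```python
import math
import random


def window_attention(q, k, v, w):
    """Non-causal sliding-window attention: position i attends to j with |i - j| <= w."""
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        idx = range(max(0, i - w), min(n, i + w + 1))
        # Scaled dot-product scores inside the window only: O(w) work per position.
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d) for j in idx]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(wt * v[j][c] for wt, j in zip(weights, idx)) / total
                    for c in range(len(v[0]))])
    return out


def stochastic_attention(q, k, v, w, rng):
    """Permute tokens, run windowed attention, then restore the original order."""
    n = len(q)
    perm = list(range(n))
    rng.shuffle(perm)  # perm[new_pos] = old_pos
    out_perm = window_attention([q[p] for p in perm],
                                [k[p] for p in perm],
                                [v[p] for p in perm], w)
    out = [None] * n
    for new_pos, old_pos in enumerate(perm):
        out[old_pos] = out_perm[new_pos]  # inverse permutation
    return out
```

Each call still scores at most `n * (2w + 1)` query-key pairs, so the per-layer cost stays at $O(nw)$; only the set of neighbours each token sees changes. When the window covers the whole sequence, the result matches full attention for any permutation, which is a convenient sanity check.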

128. ❌ LangMARL: Natural Language Multi-Agent Reinforcement Learning

Authors: Huaiyuan Yao, Longchao Da, Xiaoou Liu, Charles Fleming, Tianlong Chen, Hua Wei Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00722v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

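Scoring rationale: The paper studies how LLM agents evolve coordination strategies in multi-agent environments. It is highly relevant to 'Large Language Models' and 'LLM Agents' (10 points each) because it directly studies LLM agents on multi-agent tasks, and to 'Multi-agent Systems' (10 points) because it focuses on multi-agent coordination and credit assignment. It is partly related to 'Mechanistic Interpretability' (5 points): the paper reports improved interpretability, but this is not its core technical contribution. Other keywords, such as MoE, SFT, and RAG, are not mentioned in the abstract and are unrelated (0 points).

!!! tip deepseek-chat TL;DR

The paper addresses the difficulty LLM agents have in autonomously evolving coordination strategies in dynamic multi-agent environments. It proposes LangMARL, a framework that brings credit assignment and policy gradient evolution into the language space, and reports improved sample efficiency, interpretability, and generalization.

Abstract (Chinese translation)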

大语言模型（LLM）智能体在动态环境中难以自主演化协作策略，其主要原因在于粗粒度的全局结果掩盖了局部策略优化所需的因果信号。我们将此瓶颈识别为一个多智能体信用分配问题，该问题在经典多智能体强化学习（MARL）中已有长期研究，但在基于LLM的系统中仍未得到充分解决。基于此观察，我们提出了LangMARL框架，该框架将合作式MARL中的信用分配与策略梯度演化机制引入语言空间。LangMARL实现了智能体层面的语言信用分配，开创了在语言空间中进行策略改进的梯度演化方法，并从回放轨迹中总结任务相关的因果关系，以提供密集反馈并改善稀疏奖励下的收敛性。在多种合作式多智能体任务上的大量实验表明，该方法提升了样本效率、可解释性，并展现出强大的泛化能力。

Abstract

Large language model (LLM) agents struggle to autonomously evolve coordination strategies in dynamic environments, largely because coarse global outcomes obscure the causal signals needed for local policy refinement. We identify this bottleneck as a multi-agent credit assignment problem, which has long been studied in classical multi-agent reinforcement learning (MARL) but remains underaddressed in LLM-based systems. Building on this observation, we propose LangMARL, a framework that brings credit assignment and policy gradient evolution from cooperative MARL into the language space. LangMARL introduces agent-level language credit assignment, pioneers gradient evolution in language space for policy improvement, and summarizes task-relevant causal relations from replayed trajectories to provide dense feedback and improve convergence under sparse rewards. Extensive experiments across diverse cooperative multi-agent tasks demonstrate improved sample efficiency, interpretability, and strong generalization.

Keywords: Large Language Models, Multi-Agent Reinforcement Learning, Credit Assignment, Policy Gradient, Cooperative Tasks, Sample Efficiency, Interpretability, Generalization

129. ❌ AfrIFact: Cultural Information Retrieval, Evidence Extraction and Fact Checking for African Languages

Authors: Israel Abebe Azime, Jesujoba Oluwadara Alabi, Crystina Zhang, Iffat Maab, Atnafu Lambebo Tonja, Tadesse Destaw Belay, Folasade Peace Alabi, Salomey Osei, Saminu Mohammad Aliyu, Nkechinyere Faith Aguobi, Bontu Fufa Balcha, Blessing Kudzaishe Sibanda, Davis David, Mouhamadane Mboup, Daud Abolade, Neo Putini, Philipp Slusallek, David Ifeoluwa Adelani, Dietrich Klakow Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00706v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

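Scoring rationale: The paper studies fact checking for African languages across three steps: information retrieval, evidence extraction, and fact checking. It is highly relevant to LLMs (8 points) because it evaluates LLMs on fact checking in African languages and runs fine-tuning experiments; to RAG (8 points) because information retrieval and evidence extraction are core components of RAG; and to factuality (10 points) because fact checking is the paper's central research problem. It is partly related to supervised fine-tuning (5 points), since task-specific fine-tuning improves accuracy, and to in-context learning (5 points), since the paper uses few-shot prompting. Other keywords, such as MoE, quantization, and inference acceleration, are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper tackles fact checking for African languages by building the AfrIFact dataset and evaluating embedding models and LLMs on cross-lingual retrieval and fact checking. Few-shot prompting improves AfriqueQwen-14B by up to 43%, and task-specific fine-tuning further improves fact-checking accuracy by up to 26%.

Abstract (Chinese translation)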

评估在线言论的真实性是一项复杂且具有现实意义的重要任务。当这些言论针对信息获取渠道有限的社群，且内容涉及医疗保健和文化等议题时，其影响尤为显著，特别是在资源匮乏的语言环境中。本研究推出了AfrIFact数据集，该数据集涵盖自动事实核查的必要步骤（即信息检索、证据提取与事实核查），涵盖十种非洲语言及英语。评估结果表明，即使在大型语料库或单一文档中，最佳嵌入模型仍缺乏跨语言检索能力，且文化类与新闻类文档比医疗领域文档更易被检索。我们发现大型语言模型在非洲语言中缺乏稳健的多语言事实核查能力，而少量示例提示（few-shot prompting）可将AfriqueQwen-14B模型的性能提升高达43%，针对特定任务的微调（fine-tuning）则能进一步将事实核查准确率提升达26%。这些发现连同我们发布的AfrIFact数据集，将推动资源匮乏环境下的信息检索、证据检索与事实核查相关研究。

Abstract

Assessing the veracity of a claim made online is a complex and important task with real-world implications. When these claims are directed at communities with limited access to information and the content concerns issues such as healthcare and culture, the consequences intensify, especially in low-resource languages. In this work, we introduce AfrIFact, a dataset that covers the necessary steps for automatic fact-checking (i.e., information retrieval, evidence extraction, and fact checking), in ten African languages and English. Our evaluation results show that even the best embedding models lack cross-lingual retrieval capabilities, and that cultural and news documents are easier to retrieve than healthcare-domain documents, both in large corpora and in single documents. We show that LLMs lack robust multilingual fact-verification capabilities in African languages, while few-shot prompting improves performance by up to 43% in AfriqueQwen-14B, and task-specific fine-tuning further improves fact-checking accuracy by up to 26%. These findings, along with our release of the AfrIFact dataset, encourage work on low-resource information retrieval, evidence retrieval, and fact checking.

Keywords: fact-checking, African languages, information retrieval, evidence extraction, low-resource languages, LLMs, few-shot prompting, fine-tuning

130. ❌ Common TF-IDF variants arise as key components in the test statistic of a penalized likelihood-ratio test for word burstiness

Authors: Zeyad Ahmed, Paul Sheridan, Michael McIsaac, Aitazaz A. Farooque Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00672v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

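Scoring rationale: The paper studies the statistical foundations of TF-IDF variants as components of a test statistic for word burstiness. It belongs to classical natural language processing and information retrieval and is unrelated to all scoring keywords, which focus on large models, deep learning techniques, and their applications. The paper does not involve large-model techniques, training methods, inference optimization, alignment, agent systems, or AI-for-science applications.

!!! tip deepseek-chat TL;DR

The paper uses statistical hypothesis testing to explain how TF-IDF variants arise as components of the test statistic of a penalized likelihood-ratio test for word burstiness. The term-weighting scheme derived from that statistic performs comparably to TF-IDF on document classification tasks.

Abstract (Chinese translation)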

TF-IDF是一种广泛应用于识别文档关键术语的经典公式。我们证明，类TF-IDF评分可自然源自捕获词突发性（亦称词过离散）的惩罚似然比检验统计量框架。在本研究中，备择假设通过采用带伽马精度参数惩罚项的贝塔-二项分布族对文档集建模，从而捕捉词突发性特征；与之相对，原假设假定词语在文档集中服从二项分布，这种建模方法无法解释词突发性现象。我们发现，基于该检验统计量衍生的术语加权方案在文档分类任务中表现出与TF-IDF相当的性能。本文从统计学视角为TF-IDF提供了新的理论阐释，并强调了假设检验框架在推进术语加权方案发展方面的潜力。

Abstract

TF-IDF is a classical formula that is widely used for identifying important terms within documents. We show that TF-IDF-like scores arise naturally from the test statistic of a penalized likelihood-ratio test setup capturing word burstiness (also known as word over-dispersion). In our framework, the alternative hypothesis captures word burstiness by modeling a collection of documents according to a family of beta-binomial distributions with a gamma penalty term on the precision parameter. In contrast, the null hypothesis assumes that words are binomially distributed in collection documents, a modeling approach that fails to account for word burstiness. We find that a term-weighting scheme given rise to by this test statistic performs comparably to TF-IDF on document classification tasks. This paper provides insights into TF-IDF from a statistical perspective and underscores the potential of hypothesis testing frameworks for advancing term-weighting scheme development.

Keywords: TF-IDF, word burstiness, penalized likelihood-ratio test, beta-binomial distribution, term-weighting scheme, document classification, statistical hypothesis testing
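For readers who want the baseline in concrete form, here is a minimal sketch of one textbook TF-IDF variant: raw term count times $\log(N / \mathrm{df})$. It is not the penalized likelihood-ratio statistic from the paper, whose exact form the abstract does not give. The function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Score each term in each tokenized document as count * log(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term at most once per document
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({term: count * math.log(n_docs / doc_freq[term])
                       for term, count in counts.items()})
    return scores


# Toy corpus (hypothetical): "a" is bursty in the first document, "b" appears everywhere.
example = tf_idf([["a", "b", "a"], ["b", "c"]])
```

A term that occurs in every document gets an IDF of zero. A term concentrated in few documents, which is the burstiness pattern the paper models, is weighted up.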

131. ❌ OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models

Authors: Han Zhu, Lingxuan Ye, Wei Kang, Zengwei Yao, Liyong Guo, Fangjun Kuang, Zhifeng Han, Weiji Zhuang, Long Lin, Daniel Povey Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00688v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

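Scoring rationale: The paper proposes OmniVoice, a massively multilingual zero-shot text-to-speech (TTS) model built on a diffusion language model-style architecture and covering more than 600 languages. Its core innovations are: (1) a novel diffusion language model-style discrete non-autoregressive (NAR) architecture that maps text directly to multi-codebook acoustic tokens, simplifying the conventional two-stage pipeline; (2) a full-codebook random masking strategy for efficient training; and (3) initialization from a pre-trained LLM to ensure high intelligibility. It is therefore highly relevant to 'Large Language Models' (8 points), because the model is initialized from a pre-trained LLM; relevant to 'Pre-training' (8 points), because it involves LLM initialization and training on a large-scale dataset; and partly related to 'Scaling Laws AND Data Quality' (5 points), because it uses a 581k-hour multilingual dataset, which touches on data scale and quality. Other keywords, such as MoE, SFT, RLHF, and RAG, are unrelated to the TTS architecture and training method and score 0.

!!! tip deepseek-chat TL;DR

The study proposes OmniVoice, a massively multilingual zero-shot text-to-speech model built on a diffusion language model-style architecture. With its new architecture, initialization from a pre-trained LLM, and a 581k-hour training set, it covers more than 600 languages, which the authors describe as the broadest coverage to date, and reports state-of-the-art performance.

Abstract (Chinese translation)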

我们推出OmniVoice，这是一个大规模多语言零样本文本转语音（TTS）模型，可扩展至超过600种语言。其核心是一种新颖的扩散语言模型风格离散非自回归（NAR）架构。与传统的离散NAR模型在复杂的两阶段（文本到语义到声学）流程中存在性能瓶颈不同，OmniVoice直接将文本映射到多码本声学标记。这种简化方法得益于两项关键技术创新：（1）用于高效训练的全码本随机掩码策略，以及（2）从预训练大语言模型（LLM）初始化以确保卓越的清晰度。通过利用完全从开源数据整理的581千小时多语言数据集，OmniVoice实现了迄今为止最广泛的语言覆盖范围，并在中文、英语及多样化的多语言基准测试中取得了最先进的性能。我们的代码和预训练模型已在https://github.com/k2-fsa/OmniVoice 公开提供。

Abstract

We present OmniVoice, a massive multilingual zero-shot text-to-speech (TTS) model that scales to over 600 languages. At its core is a novel diffusion language model-style discrete non-autoregressive (NAR) architecture. Unlike conventional discrete NAR models that suffer from performance bottlenecks in complex two-stage (text-to-semantic-to-acoustic) pipelines, OmniVoice directly maps text to multi-codebook acoustic tokens. This simplified approach is facilitated by two key technical innovations: (1) a full-codebook random masking strategy for efficient training, and (2) initialization from a pre-trained LLM to ensure superior intelligibility. By leveraging a 581k-hour multilingual dataset curated entirely from open-source data, OmniVoice achieves the broadest language coverage to date and delivers state-of-the-art performance across Chinese, English, and diverse multilingual benchmarks. Our code and pre-trained models are publicly available at https://github.com/k2-fsa/OmniVoice.

Keywords: multilingual text-to-speech, diffusion language model, zero-shot TTS, non-autoregressive architecture, acoustic tokens, pre-trained LLM initialization, large-scale dataset

132. ❌ English to Central Kurdish Speech Translation: Corpus Creation, Evaluation, and Orthographic Standardization

Authors: Mohammad Mohammadamini, Daban Q. Jaff, Josep Crego, Marie Tahon, Antoine Laurent Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00613v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

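Scoring rationale: The paper studies English-to-Central-Kurdish speech translation, including corpus creation, evaluation, and orthographic standardization. It covers automatic speech recognition (ASR), machine translation (MT), and speech-to-text translation (S2TT), training a Transformer model from scratch and fine-tuning a Seamless model. It is therefore only partly related to "Post-training OR Supervised Fine-tuning OR SFT" (5 points), because the paper fine-tunes a Seamless model. All other keywords are unrelated (0 points): the paper does not address large-model technical principles, reasoning, alignment, compression, or AI-for-science applications.

!!! tip deepseek-chat TL;DR

The paper creates KUTED, an English-to-Central-Kurdish speech translation corpus, and shows that orthographic standardization substantially improves translation performance. It improves the Seamless baseline by 3.0 BLEU on the FLEURS benchmark.

Abstract (Chinese translation)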

我们推出KUTED——一个针对中库尔德语的语音到文本翻译数据集，该数据集源自TED及TEDx演讲。该语料库包含9.1万句对，涵盖170小时的英语音频、165万英语词元及140万中库尔德语词元。我们在语音到文本翻译任务上对KUTED进行评估，发现正字法变异会显著降低库尔德语翻译质量，导致非标准输出。为解决此问题，我们提出一种系统性文本标准化方法，该方法显著提升了翻译性能并产生更一致的译文。在从TED演讲分离的测试集上，经微调的Seamless模型达到15.18 BLEU值，并在FLEURS基准测试中将Seamless基线提升了3.0 BLEU。我们还从头训练了一个Transformer模型，并评估了结合Seamless（自动语音识别）与NLLB（机器翻译）的级联系统。

Abstract

We present KUTED, a speech-to-text translation (S2TT) dataset for Central Kurdish, derived from TED and TEDx talks. The corpus comprises 91,000 sentence pairs, including 170 hours of English audio, 1.65 million English tokens, and 1.40 million Central Kurdish tokens. We evaluate KUTED on the S2TT task and find that orthographic variation significantly degrades Kurdish translation performance, producing nonstandard outputs. To address this, we propose a systematic text standardization approach that yields substantial performance gains and more consistent translations. On a test set separated from TED talks, a fine-tuned Seamless model achieves 15.18 BLEU, and we improve Seamless baseline by 3.0 BLEU on the FLEURS benchmark. We also train a Transformer model from scratch and evaluate a cascaded system that combines Seamless (ASR) with NLLB (MT).

Keywords: speech-to-text translation, Central Kurdish, corpus creation, orthographic standardization, Seamless model, Transformer model, BLEU score, FLEURS benchmark

133. ❌ Speech LLMs are Contextual Reasoning Transcribers

Authors: Keqi Deng, Ruchao Fan, Bo Ren, Yiming Wang, Jinyu Li Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00610v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

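Scoring rationale: The paper proposes CoT-ASR, which applies chain-of-thought reasoning to speech recognition; its core innovation is having the LLM first analyze the speech and generate contextual analysis before transcribing. It is therefore highly relevant to 'Large Language Models' and 'Chain of Thought' (10 points each) and partly related to 'System 2 Thinking' (5 points), because it involves a deliberate reasoning process. Other keywords, such as MoE, SFT, and RAG, are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes chain-of-thought ASR (CoT-ASR), in which the LLM first analyzes the speech context and then transcribes, addressing the underuse of LLM knowledge in speech recognition. Compared with standard LLM-based ASR, it reports a relative reduction of 8.7% in word error rate and 16.9% in entity error rate.

Abstract (Chinese translation)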

尽管已扩展至语音输入，在自动语音识别任务中有效利用大语言模型丰富的知识与上下文理解能力仍非易事，因为该任务主要涉及直接的语音到文本映射。为此，本文提出思维链语音识别方法，通过构建推理链使大语言模型能够先分析输入语音并生成上下文分析，从而充分发挥其生成能力。借助这种上下文推理，该方法随后进行更具信息量的语音识别，并在单次过程中同步完成推理与转写。此外，该方法天然支持用户引导的转录：在自主生成推理的设计基础上，还能无缝整合用户提供的上下文信息来指导转录，进一步扩展了自动语音识别的功能。为减少模态差异，本文引入基于CTC的模态适配器，利用CTC非空白标记概率对大语言模型嵌入向量进行加权，从而高效地将语音编码器输出与大语言模型的文本潜在空间对齐。实验表明，相较于基于大语言模型的标准语音识别系统，思维链语音识别方法在词错误率上实现了8.7%的相对降低，在实体错误率上实现了16.9%的相对降低。

Abstract

Despite extensions to speech inputs, effectively leveraging the rich knowledge and contextual understanding of large language models (LLMs) in automatic speech recognition (ASR) remains non-trivial, as the task primarily involves direct speech-to-text mapping. To address this, this paper proposes chain-of-thought ASR (CoT-ASR), which constructs a reasoning chain that enables LLMs to first analyze the input speech and generate contextual analysis, thereby fully exploiting their generative capabilities. With this contextual reasoning, CoT-ASR then performs more informed speech recognition and completes both reasoning and transcription in a single pass. Moreover, CoT-ASR naturally supports user-guided transcription: while designed to self-generate reasoning, it can also seamlessly incorporate user-provided context to guide transcription, further extending ASR functionality. To reduce the modality gap, this paper introduces a CTC-guided Modality Adapter, which uses CTC non-blank token probabilities to weight LLM embeddings, efficiently aligning speech encoder outputs with the LLM’s textual latent space. Experiments show that, compared to standard LLM-based ASR, CoT-ASR achieves a relative reduction of 8.7% in word error rate (WER) and 16.9% in entity error rate (EER).

Keywords: Chain-of-Thought ASR, Large Language Models, Automatic Speech Recognition, Contextual Reasoning, Modality Adapter, Word Error Rate, Entity Error Rate, Speech-to-Text
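The reported gains are relative reductions, which are easy to misread as absolute points. The sketch below shows how word error rate and a relative reduction are conventionally computed. It assumes whitespace tokenization and the standard Levenshtein definition of WER; the helper names are hypothetical, and the baseline value used in the example is illustrative only and not taken from the paper.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance (sub + del + ins) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,                            # deletion
                           cur[j - 1] + 1,                         # insertion
                           prev[j - 1] + (ref_word != hyp_word)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


def relative_reduction(baseline, improved):
    """Fraction by which an error rate drops relative to the baseline."""
    return (baseline - improved) / baseline


# Illustrative numbers only: a hypothetical baseline WER of 10.0% falling to 9.13%
# corresponds to the 8.7% relative reduction quoted in the abstract.
example = relative_reduction(10.0, 9.13)
```

Under this definition, an 8.7% relative reduction on a 10% baseline is a drop of under one absolute WER point.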

134. ❌ More Human, More Efficient: Aligning Annotations with Quantized SLMs

Authors: Jiayu Wang, Junyoung Lee Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00586v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

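Scoring rationale: The paper studies quantized fine-tuning of a small language model (SLM) for text annotation and is highly relevant to 'Small Language Models' (10 points), 'Quantization' (10 points), 'Post-training/SFT' (10 points), and 'Alignment' (10 points). LLMs appear as background and as the comparison baseline, so it is also related to 'Large Language Models' (8 points). The paper touches on data quality and factuality, giving a moderate relation to 'Scaling Laws AND Data Quality' (5 points) and 'Hallucination Mitigation' (5 points). Other keywords, such as MoE, RAG, and RLHF, are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper studies fine-tuning a quantized small language model (SLM) to replace proprietary large language models for text annotation and evaluation, achieving closer agreement with human annotators and better reproducibility.

Abstract (Chinese translation)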

随着大型语言模型（LLM）能力的进步，对指数级增长文本语料进行高质量标注的需求已超出人类处理能力，导致LLM在自动评估与标注中被广泛采用。然而，专有LLM常表现出系统性偏差，偏离人类专家共识，缺乏可复现性，并引发数据隐私担忧。本研究探讨了在有限人工标注数据上微调一个1.7B参数的量化小型语言模型，使其成为高度对齐、确定性的评估与标注工具的可行性。通过实施自定义的多维评估框架及简单的数据增强与正则化技术，所提方法在标注者间一致性指标（克里彭多夫α系数提升0.23分）上优于性能最优的现有专有LLM。我们还在独立的情感分类任务上验证了该训练流程的泛化能力。结果表明，针对特定任务的对齐与高效的4位量化微调技术，为评估与标注任务提供了优于专有模型的开源替代方案。我们的微调方法已公开于https://github.com/jylee-k/slm-judge。

Abstract

As Large Language Model (LLM) capabilities advance, the demand for high-quality annotation of exponentially increasing text corpora has outpaced human capacity, leading to the widespread adoption of LLMs in automatic evaluation and annotation. However, proprietary LLMs often exhibit systematic biases that diverge from human expert consensus, lacks reproducibility, and raises data privacy concerns. Our work examines the viability of finetuning a quantized Small Language Model of 1.7B parameter size on limited human-annotated data to serve as a highly aligned, deterministic evaluator and annotator. By implementing a custom, multi-dimensional rubric framework and simple augmentation and regularization techniques, the proposed approach achieves higher inter-annotator agreement (0.23 points increase in Krippendorff’s $α$) than the best performing state-of-the-art proprietary LLM. We also demonstrate the generalizability of the proposed training pipeline on a separate emotion classification task. The results show that task-specific alignment and efficient 4-bit quantized fine-tuning provide superior open-source alternative to using proprietary models for evaluation and annotation. Our finetuning approach is publicly available at https://github.com/jylee-k/slm-judge.

Keywords: Small Language Models, Quantization, Fine-tuning, Alignment, Annotation, Evaluation, 4-bit quantization, Human-annotated data
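To make "4-bit quantization" concrete, the sketch below shows a generic symmetric absmax scheme that maps floats to integers in the signed 4-bit range. The abstract does not say which quantization scheme the authors use, so this is an assumed illustration with hypothetical function names, not their pipeline. Practical setups typically quantize in blocks with one scale per block rather than per tensor.

```python
def quantize_int4(weights):
    """Symmetric absmax quantization of a list of floats to integers in [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0  # largest magnitude maps to +/-7
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from 4-bit codes and the shared scale."""
    return [c * scale for c in codes]


# Toy weights (hypothetical): each value is stored as one of 16 levels plus a shared scale.
codes, scale = quantize_int4([0.7, -0.3, 0.12, 0.0])
restored = dequantize(codes, scale)
```

Every weight lands within half a quantization step of its original value, so the reconstruction error is bounded by `scale / 2`. That bounded error is why storing 4 bits per weight instead of 16 or 32 can preserve most of a model's behaviour.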

Authors: Taihei Shiotani, Masahiro Kaneko, Naoaki Okazaki Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00568v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

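Scoring rationale: The paper's core topic is evaluating social bias in LLMs, in particular detecting bias within the reasoning process, so it is highly relevant to 'Large Language Models' and 'Chain of Thought' (10 points each). It also addresses fairness evaluation, which is related to 'Instruction Tuning OR Alignment OR Value Alignment' (8 points) because alignment includes bias reduction. Other keywords such as MoE, SLMs, Scaling Laws, and Pre-training have no direct connection to this evaluation-benchmark study and receive 0 points.

!!! tip deepseek-chat TL;DR

This study builds JUBAKU-v2, an attribution-theory-based benchmark for social biases specific to Japanese culture. It detects group-attribution bias in the reasoning of LLMs, and experiments show that it detects performance differences between models more sensitively than existing benchmarks.

Abstract translation (Chinese)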

在提升大型语言模型（LLM）的公平性时，评估植根于特定语言区域文化背景的社会偏见至关重要。然而，现有的大多数日语基准测试严重依赖英语数据的翻译，这未必能提供适合日本文化的评估。此外，这些基准仅评估结论中的偏见，未能捕捉推理过程中潜藏的偏见。在本研究中，基于社会心理学中的归因理论，我们构建了一个新的数据集“JUBAKU-v2”，该数据集在固定结论的同时，评估了推理中将行为归因于内群体与外群体时存在的偏见。该数据集包含216个反映日本特有文化偏见的实例。实验结果证实，与现有基准相比，该数据集能更敏感地检测出不同模型之间的性能差异。

Abstract

In enhancing the fairness of Large Language Models (LLMs), evaluating social biases rooted in the cultural contexts of specific linguistic regions is essential. However, most existing Japanese benchmarks heavily rely on translating English data, which does not necessarily provide an evaluation suitable for Japanese culture. Furthermore, they only evaluate bias in the conclusion, failing to capture biases lurking in the reasoning. In this study, based on attribution theory in social psychology, we constructed a new dataset, "JUBAKU-v2," which evaluates the bias in attributing behaviors to in-groups and out-groups within reasoning while fixing the conclusion. This dataset consists of 216 examples reflecting cultural biases specific to Japan. Experimental results verified that it can detect performance differences across models more sensitively than existing benchmarks.

Keywords: Large Language Models, social bias evaluation, reasoning bias, attribution theory, Japanese cultural context, benchmark dataset, fairness, JUBAKU-v2

136. ❌ MF-QAT: Multi-Format Quantization-Aware Training for Elastic Inference

Authors: Zifei Xu, Sayeh Sharify, Hesham Mostafa Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00529v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

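Scoring rationale: The paper focuses on quantization-aware training (QAT), specifically multi-format QAT, to enable elastic inference. Its core contributions are a training method that lets a single model retain its performance across multiple quantization formats, and the Slice-and-Scale conversion procedure, which supports adjusting precision dynamically at runtime. The paper is highly relevant to 'Quantization OR Model Compression OR Low-bit Weights' (10 points) because quantization is its central theme, and moderately related to 'Speculative Decoding OR Inference Acceleration' (5 points) because elastic inference involves acceleration techniques. Other keywords (such as LLMs, MoE, and Alignment) are not mentioned in the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper studies multi-format quantization-aware training (MF-QAT). It proposes a training method that makes a single model robust to multiple quantization formats, and designs the Slice-and-Scale conversion procedure, which adjusts precision at runtime without retraining and thereby supports elastic inference deployment.

Abstract translation (Chinese)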

量化感知训练通常针对单一目标数值格式进行，而实际部署往往需要根据硬件支持或运行时约束在推理时选择数值精度。本研究探讨多格式量化感知训练，即训练单一模型使其对多种量化格式均保持鲁棒性。研究发现，多格式量化感知训练在各目标精度上均可匹配单格式训练效果，从而获得一个在不同格式间整体表现优异的模型，甚至能泛化至训练时未见的格式。为实现实际部署，我们针对MXINT与MXFP格式提出切片缩放转换方法，该方法可将高精度表示转换为低精度格式而无需重新训练。在此基础上，我们构建了一套完整流程：（1）通过多格式量化感知训练模型，（2）存储单一锚点格式检查点（MXINT8/MXFP8），（3）支持在运行时动态转换为更低精度的MXINT或MXFP格式，且精度损失可忽略甚至为零。这些组件共同为弹性精度缩放提供了实用路径，使得能够在多样化部署目标中根据推理需求灵活选择运行时格式。

Abstract

Quantization-aware training (QAT) is typically performed for a single target numeric format, while practical deployments often need to choose numerical precision at inference time based on hardware support or runtime constraints. We study multi-format QAT, where a single model is trained to be robust across multiple quantization formats. We find that multi-format QAT can match single-format QAT at each target precision, yielding one model that performs well overall across different formats, even formats that were not seen during training. To enable practical deployment, we propose the Slice-and-Scale conversion procedure for both MXINT and MXFP that converts a high-precision representation into lower-precision formats without re-training. Building on this, we introduce a pipeline that (i) trains a model with multi-format QAT, (ii) stores a single anchor format checkpoint (MXINT8/MXFP8), and (iii) allows on-the-fly conversion to lower MXINT or MXFP formats at runtime with negligible, or no, additional accuracy degradation. Together, these components provide a practical path to elastic precision scaling and allow selecting the runtime format at inference time across diverse deployment targets.

Keywords: Quantization-aware training, Multi-format QAT, Elastic inference, Model compression, MXINT, MXFP, Slice-and-Scale conversion, Runtime format selection
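The abstract does not spell out how Slice-and-Scale works, so the following is only a guess at the general idea for an MXINT-style block, not the authors' procedure. It assumes a block stores signed integer mantissas plus one shared power-of-two exponent, and that lowering precision means dropping low-order mantissa bits (with rounding) while raising the shared exponent to compensate. The names `slice_block` and `dequantize` are hypothetical.

```python
def slice_block(mantissas, shared_exp, src_bits=8, dst_bits=4):
    """Convert one block from src_bits to dst_bits without retraining.

    Assumed layout: value = mantissa * 2**shared_exp, mantissas are signed
    integers in two's-complement range for src_bits.
    """
    shift = src_bits - dst_bits
    if shift <= 0:
        raise ValueError("dst_bits must be smaller than src_bits")
    lo = -(1 << (dst_bits - 1))
    hi = (1 << (dst_bits - 1)) - 1
    sliced = []
    for m in mantissas:
        # Add half of the dropped range, then arithmetic-shift: round half up.
        q = (m + (1 << (shift - 1))) >> shift
        sliced.append(max(lo, min(hi, q)))  # clip to the narrower range
    # Each remaining mantissa step is 2**shift larger, so bump the exponent.
    return sliced, shared_exp + shift


def dequantize(mantissas, shared_exp):
    """Recover real values from mantissas and the shared exponent."""
    return [m * 2.0 ** shared_exp for m in mantissas]


# Hypothetical 8-bit block with shared exponent -6.
block8 = [100, -100, 127, -128, 8, -8]
block4, exp4 = slice_block(block8, shared_exp=-6)
approx = dequantize(block4, exp4)
```

Under these assumptions the 8-bit checkpoint acts as the anchor, and any narrower format is derived on the fly by integer arithmetic alone, which matches the pipeline shape the abstract describes.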

137. ❌ TR-ICRL: Test-Time Rethinking for In-Context Reinforcement Learning

Authors: Wenxuan Jiang, Yuxin Zuo, Zijian Zhang, Xuecheng Wu, Zining Fan, Wenxuan Liu, Li Chen, Xiaoyu Li, Xuezhi Cao, Xiaolong Jin, Ninghao Liu Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00438v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

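Scoring rationale: The paper proposes the TR-ICRL framework, whose core is In-Context Reinforcement Learning (ICRL). It directly involves LLMs and In-context Learning, so those two keywords are highly relevant (10 points each). The framework achieves self-improvement through retrieval, candidate generation, majority-vote pseudo-labels, reward feedback, and iterative refinement, which relates it to Self-Correction/Self-Improvement and LLM Agents (8 points each). The retrieval step is somewhat related to RAG (5 points), and the iterative refinement involves the reasoning process, relating it to Chain of Thought and System 2 Thinking (5 points each). The evaluation on MedQA and AIME2024 involves scientific-domain applications, relating it to AI for Science (5 points). Other keywords such as MoE, Scaling Laws, and Pre-training are not mentioned in the paper or not directly relevant and receive 0 points.

!!! tip deepseek-chat TL;DR

Addressing the challenge of reward estimation in In-Context Reinforcement Learning, the paper proposes the TR-ICRL framework, which uses retrieval, pseudo-label generation, and iterative refinement to substantially improve LLM performance on reasoning and knowledge-intensive tasks, for example a 21.23% average improvement for Qwen2.5-7B on MedQA.

Abstract translation (Chinese)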

情境强化学习（In-Context Reinforcement Learning, ICRL）使得大语言模型（Large Language Models, LLMs）能够在上下文窗口内直接根据外部奖励进行在线学习。然而，ICRL的一个核心挑战在于奖励估计，因为模型在推理过程中通常无法获取真实标签。为应对这一局限，我们提出了用于情境强化学习的测试时再思考框架（Test-Time Rethinking for In-Context Reinforcement Learning, TR-ICRL），这是一个专为推理与知识密集型任务设计的新型ICRL框架。TR-ICRL的运行机制如下：首先，针对给定查询，从一个无标签的评估集中检索出最相关的实例。在每一轮ICRL迭代中，LLM为每个检索到的实例生成一组候选答案。接着，通过多数投票从该组答案中得出一个伪标签。此标签随后作为代理，用于提供奖励信息并生成形成性反馈，从而引导LLM进行迭代优化。最终，这些合成的上下文信息与原始查询结合，形成一个综合提示，并通过最后一轮多数投票确定答案。我们在主流的推理与知识密集型任务上评估了TR-ICRL，结果表明其带来了显著的性能提升。值得注意的是，TR-ICRL将Qwen2.5-7B在MedQA上的平均表现提升了21.23%，在AIME2024上甚至提升了137.59%。广泛的消融实验与分析进一步验证了我们方法的有效性与鲁棒性。代码已发布于https://github.com/pangpang-xuan/TR_ICRL。

Abstract

In-Context Reinforcement Learning (ICRL) enables Large Language Models (LLMs) to learn online from external rewards directly within the context window. However, a central challenge in ICRL is reward estimation, as models typically lack access to ground truth during inference. To address this limitation, we propose Test-Time Rethinking for In-Context Reinforcement Learning (TR-ICRL), a novel ICRL framework designed for both reasoning and knowledge-intensive tasks. TR-ICRL operates by first retrieving the most relevant instances from an unlabeled evaluation set for a given query. During each ICRL iteration, the LLM generates a set of candidate answers for every retrieved instance. Next, a pseudo-label is derived from this set through majority voting. This label then serves as a proxy to give reward messages and generate formative feedback, guiding the LLM through iterative refinement. In the end, this synthesized contextual information is integrated with the original query to form a comprehensive prompt, with the answer determined through a final round of majority voting. TR-ICRL is evaluated on mainstream reasoning and knowledge-intensive tasks, where it demonstrates significant performance gains. Remarkably, TR-ICRL improves Qwen2.5-7B by 21.23% on average on MedQA and even 137.59% on AIME2024. Extensive ablation studies and analyses further validate the effectiveness and robustness of our approach. Our code is available at https://github.com/pangpang-xuan/TR_ICRL.

Keywords: In-Context Reinforcement Learning, Large Language Models, Test-Time Rethinking, Reward Estimation, Iterative Refinement, Majority Voting, Reasoning Tasks, Knowledge-intensive Tasks
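The pseudo-labeling step described in the abstract (majority voting over candidate answers, then using the winner as a proxy reward signal) can be sketched as follows. This is an illustrative reading of the abstract, not code from the linked repository; the function names and the binary 0/1 reward are assumptions.

```python
from collections import Counter


def pseudo_label(candidates):
    """Return the most frequent candidate answer and its vote share."""
    counts = Counter(candidates)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(candidates)


def reward_messages(candidates, label):
    """Assumed proxy reward: 1 if a candidate matches the pseudo-label, else 0."""
    return [(answer, 1 if answer == label else 0) for answer in candidates]


# Hypothetical candidate answers sampled for one retrieved, unlabeled instance.
candidates = ["B", "A", "B", "C", "B"]
label, share = pseudo_label(candidates)
rewards = reward_messages(candidates, label)
```

The vote share is a cheap confidence signal: a low share suggests the pseudo-label is unreliable, which is one reason a framework like this might iterate rather than trust a single round.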

138. ❌ Locally Confident, Globally Stuck: The Quality-Exploration Dilemma in Diffusion Language Models

Authors: Liancheng Fang, Aiwei Liu, Henry Peng Zou, Yankai Chen, Enze Ma, Leyi Pan, Chunyu Miao, Wei-Chieh Huang, Xue Liu, Philip S. Yu Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00375v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

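Scoring rationale: The paper studies decoding strategies for diffusion large language models (dLLMs), focusing on the trade-off between exploring reasoning paths and generation quality. It is highly relevant to 'Large Language Models' (10 points) because it specifically studies diffusion LLMs. It is highly relevant to 'Chain of Thought' and 'System 2 Thinking' (10 points each) because it studies reasoning-path exploration, multi-step reasoning, and in-depth reasoning, and evaluates on reasoning benchmarks such as MATH500 and AIME. Other keywords such as MoE, SLMs, Scaling Laws, training methods, alignment, RAG, acceleration techniques, agents, and quantization are not covered and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper studies the quality-exploration dilemma created by flexible decoding order in diffusion large language models, characterizes an optimal distribution that balances quality and exploration, and uses an Independent Metropolis-Hastings sampler to achieve a better trade-off on several reasoning benchmarks.

Abstract translation (Chinese)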

扩散大语言模型（dLLMs）在理论上允许以任意顺序进行词元解码，这种灵活性可能使其比自回归（AR）大语言模型具备更丰富的推理路径探索能力。然而在实践中，随机顺序解码往往会损害生成质量。为缓解此问题，低置信度重掩码技术通过优先选择高置信度词元来提升单样本生成质量（例如Pass@$1$），但同时也抑制了探索性，并限制了多样本收益（例如Pass@$k$），从而形成了根本性的质量-探索困境。本文对这一困境给出了统一解释：我们证明低置信度重掩码技术虽然能改进对质量的短视代理指标，但可证明地约束了所诱导序列分布的熵。为突破这一局限，我们刻画了显式平衡质量与探索的最优分布特征，并开发了一种简单的独立Metropolis–Hastings采样器，在解码过程中近似逼近该目标分布。在包括MATH500、AIME24/25、HumanEval和MBPP在内的一系列推理基准测试中，实验表明我们的方法相比随机重掩码和低置信度重掩码，能实现更优的探索-质量权衡。

Abstract

Diffusion large language models (dLLMs) theoretically permit token decoding in arbitrary order, a flexibility that could enable richer exploration of reasoning paths than autoregressive (AR) LLMs. In practice, however, random-order decoding often hurts generation quality. To mitigate this, low-confidence remasking improves single-sample quality (e.g., Pass@$1$) by prioritizing confident tokens, but it also suppresses exploration and limits multi-sample gains (e.g., Pass@$k$), creating a fundamental quality–exploration dilemma. In this paper, we provide a unified explanation of this dilemma. We show that low-confidence remasking improves a myopic proxy for quality while provably constraining the entropy of the induced sequence distribution. To overcome this limitation, we characterize the optimal distribution that explicitly balances quality and exploration, and develop a simple Independent Metropolis–Hastings sampler that approximately targets this distribution during decoding. Experiments across a range of reasoning benchmarks including MATH500, AIME24/25, HumanEval, and MBPP show that our approach yields a better exploration-quality tradeoff than both random and low-confidence remasking.

Keywords: diffusion large language models, decoding order, reasoning paths, quality-exploration dilemma, low-confidence remasking, Independent Metropolis-Hastings, reasoning benchmarks, sequence distribution
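For readers unfamiliar with the sampler named in the abstract, here is a generic Independent Metropolis-Hastings loop on a toy three-state target. It only illustrates the textbook algorithm: the target weights, the uniform proposal, and all names are made up for the example, and the paper's actual target distribution over token sequences is not given in the abstract.

```python
import random


def independent_mh(target_weight, propose, proposal_prob, n_steps, rng):
    """Independent Metropolis-Hastings: proposals ignore the current state.

    target_weight may be unnormalized; proposal_prob must be the proposal's
    probability mass for a state.
    """
    x = propose(rng)
    samples = []
    for _ in range(n_steps):
        y = propose(rng)
        # Importance weights w = target / proposal decide acceptance.
        w_x = target_weight(x) / proposal_prob(x)
        w_y = target_weight(y) / proposal_prob(y)
        if rng.random() < min(1.0, w_y / w_x):
            x = y  # accept the proposal
        samples.append(x)  # on rejection the current state is repeated
    return samples


# Toy setup (assumed): unnormalized target 1:2:7 over three states,
# uniform proposal standing in for whatever base decoder proposes.
STATES = [0, 1, 2]
TARGET = {0: 1.0, 1: 2.0, 2: 7.0}

rng = random.Random(0)
samples = independent_mh(
    target_weight=TARGET.__getitem__,
    propose=lambda r: r.choice(STATES),
    proposal_prob=lambda s: 1.0 / len(STATES),
    n_steps=20000,
    rng=rng,
)
freq_2 = samples.count(2) / len(samples)
```

The appeal of the independent variant is that proposals can come from an existing sampler unchanged, with the accept/reject step alone steering the chain toward the desired target.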

139. ❌ Signals: Trajectory Sampling and Triage for Agentic Interactions

Authors: Shuguang Chen, Adil Hafeez, Salman Paracha Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00356v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

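Scoring rationale: The paper's core topic is LLM-based agentic application systems, with a focus on sampling and triaging agent interaction trajectories. It is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points) because it explicitly studies LLM-based agentic applications. It is highly relevant to "LLM Agents OR Autonomous Agents OR Agentic Workflow" (10 points) because it studies the management of agent interaction trajectories. It is highly relevant to "Tool Use OR Function Calling OR API Tool Use" (10 points) because its experiments run on τ-bench, a benchmark for evaluating tool-augmented agents. Other keywords such as MoE, quantization, inference acceleration, and alignment are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes a lightweight, signal-based framework for efficiently triaging interaction trajectories of LLM-based agents. On the τ-bench benchmark it reaches an 82% informativeness rate, compared with 54% for random sampling.

Abstract translation (Chinese)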

基于大语言模型的智能体应用日益依赖包含规划、行动执行与环境反馈的多步骤交互循环。尽管此类系统目前已大规模部署，但在部署后对其进行改进仍具挑战性。智能体轨迹数据量庞大且具有非确定性，通过人工审核或辅助大语言模型逐条审查的方式效率低下且成本高昂。我们提出一种轻量级、基于信号的智能体交互轨迹分类框架。该方法从实时交互中计算廉价且广泛适用的信号，并将其作为结构化属性附加于轨迹之上以进行分类筛选，从而在不影响在线智能体行为的前提下识别可能蕴含信息的交互。我们将信号组织成一个粗粒度分类体系，涵盖交互（错位、停滞、脱离、满意度）、执行（失败、循环）与环境（资源耗尽）等维度，该体系设计为无需调用模型即可计算。在广泛使用的工具增强智能体评估基准τ-bench上进行的受控标注研究表明，基于信号的采样实现了82%的信息获取率，而启发式过滤和随机采样分别为74%和54%，且每条信息轨迹的效率提升达1.52倍。该优势在不同奖励层级和任务领域中均保持稳健，证实信号能真正提升单条轨迹的信息价值，而非仅过度采样明显失败案例。这些结果表明，轻量级信号可作为智能体系统的实用采样基础设施，并为偏好数据构建与部署后优化提供了可行路径。

Abstract

Agentic applications based on large language models increasingly rely on multi-step interaction loops involving planning, action execution, and environment feedback. While such systems are now deployed at scale, improving them post-deployment remains challenging. Agent trajectories are voluminous and non-deterministic, and reviewing each one, whether through human review or auxiliary LLMs, is slow and cost-prohibitive. We propose a lightweight, signal-based framework for triaging agentic interaction trajectories. Our approach computes cheap, broadly applicable signals from live interactions and attaches them as structured attributes for trajectory triage, identifying interactions likely to be informative without affecting online agent behavior. We organize signals into a coarse-grained taxonomy spanning interaction (misalignment, stagnation, disengagement, satisfaction), execution (failure, loop), and environment (exhaustion), designed for computation without model calls. In a controlled annotation study on $τ$-bench, a widely used benchmark for tool-augmented agent evaluation, we show that signal-based sampling achieves an 82% informativeness rate compared to 74% for heuristic filtering and 54% for random sampling, with a 1.52x efficiency gain per informative trajectory. The advantage is robust across reward strata and task domains, confirming that signals provide genuine per-trajectory informativeness gains rather than merely oversampling obvious failures. These results show that lightweight signals can serve as practical sampling infrastructure for agentic systems, and suggest a path toward preference data construction and post-deployment optimization.

Keywords: agentic interactions, trajectory sampling, signal-based triage, LLM agents, tool-augmented agents, post-deployment optimization, τ-bench, informativeness rate
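To make "signals computed without model calls" concrete, the sketch below derives two execution signals from the abstract's taxonomy (failure, loop) using plain rules over a tool-call log. The log schema (`name`, `args`, `ok`), the loop threshold, and the function name are all hypothetical; the paper's actual signal definitions are not given in the abstract. The last lines check one piece of arithmetic: the reported 1.52x gain appears consistent with the ratio of the signal-based and random informativeness rates, although the abstract does not state how the figure was computed.

```python
def execution_signals(tool_calls, loop_threshold=3):
    """Rule-based execution signals for one trajectory.

    tool_calls: list of dicts with assumed keys 'name', 'args', 'ok'.
    'loop' fires when the same call repeats loop_threshold times in a row.
    """
    signals = {"failure": False, "loop": False}
    run_length = 0
    previous = None
    for call in tool_calls:
        if not call["ok"]:
            signals["failure"] = True
        current = (call["name"], call["args"])
        run_length = run_length + 1 if current == previous else 1
        if run_length >= loop_threshold:
            signals["loop"] = True
        previous = current
    return signals


# Hypothetical trajectory: the agent repeats one search, then a booking fails.
trajectory = [
    {"name": "search", "args": "q=flights", "ok": True},
    {"name": "search", "args": "q=flights", "ok": True},
    {"name": "search", "args": "q=flights", "ok": True},
    {"name": "book", "args": "id=7", "ok": False},
]
flags = execution_signals(trajectory)

# Ratio of the reported informativeness rates (signal-based vs. random).
gain_vs_random = round(0.82 / 0.54, 2)
```

Because such flags are attached as structured attributes after the fact, they can rank trajectories for review without touching the live agent.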

140. ❌ Agent Q-Mix: Selecting the Right Action for LLM Multi-Agent Systems through Reinforcement Learning

Authors: Eric Hanchen Jiang, Levina Li, Rui Sun, Xiao Liang, Yubei Li, Yuchen Wu, Haozheng Luo, Hengli Li, Zhi Zhang, Zhaolu Kang, Kai-Wei Chang, Ying Nian Wu Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00344v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

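Scoring rationale: The paper's core topic is topology optimization for LLM multi-agent systems, using a reinforcement learning framework (QMIX) to decide how agents are selected and connected. It is therefore highly relevant to 'Large Language Models', 'LLM Agents', and 'Multi-agent Systems' (10 points each). The paper involves complex problem solving and reasoning tasks, which is somewhat related to 'Chain of Thought' and 'System 2 Thinking' (5 points each). Other keywords such as MoE, SLMs, training methods, efficiency optimization, and scientific applications are not mentioned in the abstract and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes Agent Q-Mix, a reinforcement learning framework for the topology-optimization problem of how to select and connect agents in LLM multi-agent systems. Across seven benchmarks it achieves the highest average accuracy together with superior token efficiency.

Abstract translation (Chinese)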

大语言模型（LLM）在完成各类任务中展现出卓越性能。然而，解决复杂问题通常需要多个智能体协同工作，这引发了一个根本性问题：如何有效地选择并连接这些智能体。本文提出 Agent Q-Mix，一个将拓扑选择重构为合作式多智能体强化学习（MARL）问题的强化学习框架。该方法利用QMIX价值分解学习去中心化的通信决策，其中每个智能体从一组通信动作中进行选择，这些动作共同诱导出轮次式通信图。其核心在于，Agent Q-Mix在“中心化训练与去中心化执行”（CTDE）范式下，结合了拓扑感知的图神经网络（GNN）编码器、门控循环单元（GRU）记忆模块以及每个智能体的Q值头。该框架优化了一个平衡任务准确性与令牌成本的奖励函数。在编码、推理和数学领域的七个核心基准测试中，与现有方法相比，Agent Q-Mix取得了最高的平均准确率，同时展现出更优的令牌效率和针对智能体故障的鲁棒性。值得注意的是，在以Gemini-3.1-Flash-Lite为骨干的极具挑战性的“人类终极考试”（HLE）上，Agent Q-Mix实现了20.8%的准确率，超越了微软智能体框架（19.2%）和LangGraph（19.2%），随后是AutoGen和OpenClaw的Lobster。这些结果凸显了通过学习的、去中心化的拓扑优化在推进多智能体推理边界方面的有效性。

Abstract

Large Language Models (LLMs) have shown remarkable performance in completing various tasks. However, solving complex problems often requires the coordination of multiple agents, raising a fundamental question: how to effectively select and interconnect these agents. In this paper, we propose Agent Q-Mix, a reinforcement learning framework that reformulates topology selection as a cooperative Multi-Agent Reinforcement Learning (MARL) problem. Our method learns decentralized communication decisions using QMIX value factorization, where each agent selects from a set of communication actions that jointly induce a round-wise communication graph. At its core, Agent Q-Mix combines a topology-aware GNN encoder, GRU memory, and per-agent Q-heads under a Centralized Training with Decentralized Execution (CTDE) paradigm. The framework optimizes a reward function that balances task accuracy with token cost. Across seven core benchmarks in coding, reasoning, and mathematics, Agent Q-Mix achieves the highest average accuracy compared to existing methods while demonstrating superior token efficiency and robustness against agent failure. Notably, on the challenging Humanity’s Last Exam (HLE) using Gemini-3.1-Flash-Lite as a backbone, Agent Q-Mix achieves 20.8% accuracy, outperforming Microsoft Agent Framework (19.2%) and LangGraph (19.2%), followed by AutoGen and Lobster by OpenClaw. These results underscore the effectiveness of learned, decentralized topology optimization in pushing the boundaries of multi-agent reasoning.

Keywords: Large Language Models, Multi-agent Systems, Reinforcement Learning, QMIX, Topology Selection, Decentralized Communication, Agent Coordination, Token Efficiency
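The key property behind QMIX value factorization is that the joint value is monotone in every agent's own Q-value, so each agent can act greedily on its own Q-head and still maximize the joint value. The sketch below shows that property with a deliberately simplified, single-layer linear mixer; real QMIX generates non-negative mixing weights with state-conditioned hypernetworks. The reward form and the `token_penalty` coefficient are assumptions meant only to illustrate "accuracy balanced against token cost".

```python
from itertools import product


def mix_q_values(agent_qs, weights, bias=0.0):
    """Monotonic mixer: abs() keeps weights non-negative, so the joint
    value never decreases when any single agent's Q-value increases."""
    return sum(abs(w) * q for w, q in zip(weights, agent_qs)) + bias


def greedy_joint_action(per_agent_q):
    """Decentralized execution: each agent takes the argmax of its own Q-head."""
    return tuple(max(range(len(q)), key=q.__getitem__) for q in per_agent_q)


def reward(correct, tokens_used, token_penalty=1e-4):
    """Assumed reward shape: task success minus a linear token cost."""
    return (1.0 if correct else 0.0) - token_penalty * tokens_used


# Hypothetical Q-heads: agent 0 has two communication actions, agent 1 has three.
per_agent_q = [[0.1, 0.9], [0.5, 0.2, 0.3]]
weights = [-2.0, 0.5]  # raw mixer weights; abs() is applied inside the mixer

greedy = greedy_joint_action(per_agent_q)
brute_force = max(
    product(range(2), range(3)),
    key=lambda a: mix_q_values(
        [per_agent_q[0][a[0]], per_agent_q[1][a[1]]], weights
    ),
)
```

In this toy case the per-agent greedy choice and the exhaustive search over joint actions agree, which is what the monotonicity constraint is designed to guarantee.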

141. ❌ Large Language Models in the Abuse Detection Pipeline

Authors: Suraj Kath, Sanket Badhe, Preet Shah, Ashwin Sampathkumar, Shivani Gupta Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2604.00323v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

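Scoring rationale: The paper is a survey of LLM applications in abuse detection, centered on how LLMs are integrated into the four stages of the Abuse Detection Lifecycle (ADL), so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). It does not cover specific technical details, methods, or applications for the other keywords, such as MoE, SLMs, training techniques, inference optimization, agent systems, or AI for science, so those keywords all receive 0 points.

!!! tip deepseek-chat TL;DR

This survey examines how large language models are integrated into the abuse detection lifecycle, analyzing their applications, strengths, and challenges across four stages (label generation, detection, review, and auditing) along with future research directions.

Abstract translation (Chinese)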

在线滥用行为日益复杂，涵盖毒性言论、骚扰、操纵和欺诈等多种形式。依赖静态分类器和人工密集标注的传统机器学习方法难以适应不断演变的威胁模式与精细化的政策要求。大型语言模型（LLM）引入了情境推理、政策解读、解释生成和跨模态理解等新能力，使其能够支持现代安全系统的多个环节。本综述以生命周期为视角，系统分析了LLM如何融入滥用检测生命周期（Abuse Detection Lifecycle, ADL）——我们将其划分为四个阶段：（I）标注与特征生成，（II）检测，（III）审核与申诉，以及（IV）审计与治理。针对每个阶段，我们综合了新兴研究与行业实践，突出生产部署中的架构设计考量，并剖析LLM驱动方法的优势与局限。最后，我们概述了延迟、成本效益、确定性、对抗鲁棒性和公平性等关键挑战，并探讨了未来研究方向，以期使LLM成为大规模滥用检测与治理系统中可靠、可问责的组成部分。

Abstract

Online abuse has grown increasingly complex, spanning toxic language, harassment, manipulation, and fraudulent behavior. Traditional machine-learning approaches dependent on static classifiers and labor-intensive labeling struggle to keep pace with evolving threat patterns and nuanced policy requirements. Large Language Models introduce new capabilities for contextual reasoning, policy interpretation, explanation generation, and cross-modal understanding, enabling them to support multiple stages of modern safety systems. This survey provides a lifecycle-oriented analysis of how LLMs are being integrated into the Abuse Detection Lifecycle (ADL), which we define across four stages: (I) Label & Feature Generation, (II) Detection, (III) Review & Appeals, and (IV) Auditing & Governance. For each stage, we synthesize emerging research and industry practices, highlight architectural considerations for production deployment, and examine the strengths and limitations of LLM-driven approaches. We conclude by outlining key challenges, including latency, cost-efficiency, determinism, adversarial robustness, and fairness, and discuss future research directions needed to operationalize LLMs as reliable, accountable components of large-scale abuse-detection and governance systems.

Keywords: Large Language Models, Abuse Detection, Online Abuse, Contextual Reasoning, Policy Interpretation, Safety Systems, Governance, Survey

142. ❌ Frege in the Flesh: Biolinguistics and the Neural Enforcement of Syntactic Structures

Authors: Elliot Murphy Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2604.00291v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

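Scoring rationale: The paper is about biolinguistics. It discusses the theoretical foundations of language as a biological organ, evolutionary explanation, and neural mechanisms, and does not involve large models, deep learning, or any AI technique. All keywords focus on large-model technology and its applications, whereas this is a purely theoretical study at the intersection of linguistics, cognitive science, and neuroscience, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

This chapter discusses how biolinguistics treats syntactic structure (in particular the MERGE operation) as a natural mathematical object, and argues that this formal characterization constrains and guides both evolutionary explanation and research on neural mechanisms.

Abstract translation (Chinese)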

生物语言学（Biolinguistics）是一门跨学科科学研究，旨在探索人类语言的生物学基础、演化过程及遗传依据。它将语言视为一种先天的生物器官或心智官能，而非文化工具，并对行为主义将人类语言习得视为基于刺激-反应联结的观点提出挑战。提取其最核心的要素，生物语言学认真对待这一观点：语言的数学化、代数化模型捕捉到了世界的某种自然属性。句法结构构建操作“合并”（MERGE）被认为为科学界提供了一个“自然的真实关节”、“（新的）自然面向”（Mukherji 2010），而不仅仅是一个形式化的人工产物。这种语言的数学理论进而被认为能够为生物学家、遗传学家和神经科学家提供更清晰的指引，以探索语言的本质。本章的论证分四步展开。首先，我澄清生物语言学的研究对象：不是言语、交际或通用的序列处理，而是生成具有层级结构表达式的内部计算系统。其次，我认为这种形式化描述对于演化解释至关重要，因为不同的句法概念意味着需要解释的内容具有不同的标准。第三，我提出，一个足够明确的句法代数描述会对候选的神经机制施加实质性的约束。最后，我探讨了近期的神经计算研究如何开始将这些约束转化为可通过经验检验的假说，同时也指出了当前研究纲领的推测性和可修正性。

Abstract

Biolinguistics is the interdisciplinary scientific study of the biological foundations, evolution, and genetic basis of human language. It treats language as an innate biological organ or faculty of the mind, rather than a cultural tool, and it challenges a behaviorist conception of human language acquisition as being based on stimulus-response associations. Extracting its most essential component, it takes seriously the idea that mathematical, algebraic models of language capture something natural about the world. The syntactic structure-building operation of MERGE is thought to offer the scientific community a “real joint of nature”, “a (new) aspect of nature” (Mukherji 2010), not merely a formal artefact. This mathematical theory of language is then seen as being able to offer biologists, geneticists and neuroscientists clearer instructions for how to explore language. The argument of this chapter proceeds in four steps. First, I clarify the object of inquiry for biolinguistics: not speech, communication, or generic sequence processing, but the internal computational system that generates hierarchically structured expressions. Second, I argue that this formal characterization matters for evolutionary explanation, because different conceptions of syntax imply different standards of what must be explained. Third, I suggest that a sufficiently explicit algebraic account of syntax places non-trivial constraints on candidate neural mechanisms. Finally, I consider how recent neurocomputational work begins to transform these constraints into empirically tractable hypotheses, while also noting the speculative and revisable character of the present program.

Keywords: biolinguistics, syntactic structures, MERGE, neural mechanisms, evolutionary explanation, computational system, hierarchical structure, algebraic models

143. ❌ Asymmetric Actor-Critic for Multi-turn LLM Agents

Authors: Shuli Jiang, Zhaoyang Zhang, Yi Zhang, Shuo Yang, Wei Xia, Stefano Soatto Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2604.00304v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	8.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

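Scoring rationale: The paper proposes an asymmetric actor-critic framework for multi-turn LLM agents, in which a large proprietary LLM serves as the actor and a small open-source model serves as the critic that supervises it at runtime. The relevant keywords are: "Large Language Models" (the paper's core object of study), "Small Language Models" (the small model used as the critic), "Post-training" (fine-tuning of the critic), "Self-Correction" (behavior correction through critic supervision), "LLM Agents" (the paper studies multi-turn conversational agents), and "Multi-agent Systems" (the actor and critic form a multi-agent system). Other keywords, such as MoE, Scaling Laws, RAG, and Tool Use, are not covered in the paper.

!!! tip deepseek-chat TL;DR

The paper proposes an asymmetric actor-critic framework in which a small open-source model supervises a large proprietary LLM during multi-turn conversations, significantly improving agent reliability and task success.

Abstract translation (Chinese)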

大型语言模型（LLM）展现出强大的推理与对话能力，但在多轮交互中确保其行为可靠性仍具挑战。在许多现实应用中，智能体必须在无法重试的单次场景中取得成功。现有方法要么依赖需多次尝试的反思或事后评估机制，要么假设模型可完全训练而无法利用专有LLM。我们提出一种用于可靠对话智能体的非对称行动者-评判者框架：由强大的专有LLM担任行动者，同时由较小的开源评判者提供运行时监督，监控行动者行为并在同一交互轨迹内实施干预。与基于训练的演员-评判者方法不同，本框架对在开放域对话环境中运行的固定行动者进行监督。该设计利用了生成与验证的不对称性：高质量生成需要大型模型，而有效监督常可通过较小模型实现。我们进一步提出一种数据生成流程，能在不修改行动者的前提下为评判者微调提供监督信号。在$τ$-bench和UserBench上的实验表明，相较于强大的单智能体基线方法，本方案显著提升了可靠性与任务成功率。此外，轻量级开源评判者在监督角色中可媲美甚至超越大型专有模型，且评判者微调相比多种前沿方法带来额外性能提升。

Abstract

Large language models (LLMs) exhibit strong reasoning and conversational abilities, but ensuring reliable behavior in multi-turn interactions remains challenging. In many real-world applications, agents must succeed in one-shot settings where retries are impossible. Existing approaches either rely on reflection or post-hoc evaluation, which require additional attempts, or assume fully trainable models that cannot leverage proprietary LLMs. We propose an asymmetric actor-critic framework for reliable conversational agents. A powerful proprietary LLM acts as the actor, while a smaller open-source critic provides runtime supervision, monitoring the actor’s actions and intervening within the same interaction trajectory. Unlike training-based actor-critic methods, our framework supervises a fixed actor operating in open-ended conversational environments. The design leverages a generation-verification asymmetry: while high-quality generation requires large models, effective oversight can often be achieved by smaller ones. We further introduce a data generation pipeline that produces supervision signals for critic fine-tuning without modifying the actor. Experiments on $τ$-bench and UserBench show that our approach significantly improves reliability and task success over strong single-agent baselines. Moreover, lightweight open-source critics rival or surpass larger proprietary models in the critic role, and critic fine-tuning yields additional gains over several state-of-the-art methods.

Keywords: asymmetric actor-critic, multi-turn LLM agents, large language models, runtime supervision, critic fine-tuning, reliability improvement, open-source critic, conversational agents
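
The runtime supervision described in the abstract (a critic that monitors each actor action and intervenes within the same trajectory) might look roughly like the loop below. This is only a sketch under assumptions: the abstract does not specify how the critic intervenes, so the feedback-then-revise protocol, the `supervised_turn` function, and the toy actor and critic are all hypothetical stand-ins for real model calls.

```python
from typing import Callable, List, Optional

# Hypothetical signatures: an actor maps (history, critic feedback) to an action,
# a critic maps (history, proposed action) to feedback, or None to approve.
Actor = Callable[[List[str], Optional[str]], str]
Critic = Callable[[List[str], str], Optional[str]]


def supervised_turn(actor: Actor, critic: Critic, history: List[str],
                    max_revisions: int = 1) -> str:
    """Produce one action; the critic may ask for revisions before it is sent."""
    action = actor(history, None)
    for _ in range(max_revisions):
        feedback = critic(history, action)
        if feedback is None:  # critic approves, no intervention needed
            break
        action = actor(history, feedback)  # the fixed actor revises in-context
    return action


# Toy stand-ins for the large proprietary actor and the small open-source critic.
def toy_actor(history: List[str], feedback: Optional[str]) -> str:
    return "refund issued" if feedback is None else "ask user for order id"


def toy_critic(history: List[str], action: str) -> Optional[str]:
    if action == "refund issued" and not any("order id" in m for m in history):
        return "Confirm the order id before issuing a refund."
    return None
```

The actor is never retrained here: the critic only shapes what the actor sees before its action is committed, which matches the abstract's point that supervision applies to a fixed actor.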

144. ❌ REM-CTX: Automated Peer Review via Reinforcement Learning with Auxiliary Context

Authors: Pawin Taechoyotin, Daniel E. Acuna Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2604.00248v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

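Scoring rationale: REM-CTX uses reinforcement learning (GRPO) to train an 8B-parameter language model for automated peer review, so its core concern is the application of large language models (LLMs) to science. It is therefore highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10), because the paper explicitly uses and trains an 8B-parameter LLM. The experiments cover manuscripts in computer, biological, and physical sciences, which places the paper within "AI for Science OR Bioinformatics OR Cheminformatics", also highly relevant (10). Other keywords, such as MoE, SFT, RAG, and quantization, are not covered or mentioned, so they all score 0.

!!! tip deepseek-chat TL;DR

The paper studies how reinforcement learning combined with auxiliary context (such as figures and external scholarly signals) can improve automated peer review. The proposed REM-CTX system achieves the highest overall review quality on manuscripts from several scientific fields, outperforming baselines that use larger commercial models.

Abstract translation (Chinese)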

当前大多数自动化同行评审系统仅依赖稿件文本内容，未能充分利用图表等视觉元素及外部学术信号。本文提出REM-CTX系统，该系统通过引入关联感知的奖励函数，将辅助上下文信息整合到评审生成过程中。REM-CTX采用分组相对策略优化（Group Relative Policy Optimization, GRPO）训练一个80亿参数的语言模型，结合了多维度质量奖励与两项关联奖励——这些奖励函数明确鼓励模型输出与辅助上下文保持对齐。在计算机科学、生物科学和物理科学领域的稿件实验中，REM-CTX在六个基线系统中取得了最高的综合评审质量，其表现优于使用更大规模商业模型的其他系统，并在质量与上下文关联度指标上均超越次优的强化学习基线。消融实验证实两项关联奖励具有互补性：每项奖励能针对性提升其目标关联度指标，同时保持所有质量维度不衰减，而完整模型在所有部分变体模型中表现最优。对训练动态的分析表明，在训练过程中“批判性”维度与其他指标呈负相关，这提示未来研究应将多维度奖励进行分组处理以优化评审生成。

Abstract

Most automated peer review systems rely on textual manuscript content alone, leaving visual elements such as figures and external scholarly signals underutilized. We introduce REM-CTX, a reinforcement-learning system that incorporates auxiliary context into the review generation process via correspondence-aware reward functions. REM-CTX trains an 8B-parameter language model with Group Relative Policy Optimization (GRPO) and combines a multi-aspect quality reward with two correspondence rewards that explicitly encourage alignment with auxiliary context. Experiments on manuscripts across Computer, Biological, and Physical Sciences show that REM-CTX achieves the highest overall review quality among six baselines, outperforming other systems with substantially larger commercial models, and surpassing the next-best RL baseline across both quality and contextual grounding metrics. Ablation studies confirm that the two correspondence rewards are complementary: each selectively improves its targeted correspondence reward while preserving all quality dimensions, and the full model outperforms all partial variants. Analysis of training dynamics reveals that the criticism aspect is negatively correlated with other metrics during training, suggesting that future studies should group multi-dimension rewards for review generation.

Keywords: automated peer review, reinforcement learning, language model, auxiliary context, Group Relative Policy Optimization, review generation, scientific manuscripts, correspondence rewards
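
The core idea of GRPO, scoring each sampled review relative to the other samples for the same manuscript, can be sketched in a few lines. This is a minimal illustration rather than the REM-CTX implementation: the abstract does not say how the quality reward and the two correspondence rewards are combined, so the weighted sum in `combined_reward` and its weights are assumptions, and the use of the population standard deviation is one common choice among several.

```python
from statistics import mean, pstdev
from typing import List


def combined_reward(quality: float, corr_a: float, corr_b: float,
                    w_quality: float = 1.0, w_a: float = 1.0,
                    w_b: float = 1.0) -> float:
    """Hypothetical weighted sum of one quality and two correspondence rewards."""
    return w_quality * quality + w_a * corr_a + w_b * corr_b


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because advantages are centered within the group, a review is reinforced only when it beats its siblings, so no separate value network is needed.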

145. ❌ LLM Essay Scoring Under Holistic and Analytic Rubrics: Prompt Effects and Bias

Authors: Filip J. Kucia, Anirban Chakraborty, Anna Wróblewska Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2604.00259v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

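Scoring rationale: The paper studies LLMs in educational assessment (essay scoring) and directly involves "Large Language Models" and "Instruction Tuning" (it explicitly refers to instruction-tuned LLMs), so both score 10. As an application of AI to education, it has some connection to "AI for Science" (education can be viewed as a subfield of scientific applications), but this is not central, so it scores 5. Other keywords, such as MoE, Scaling Laws, RLHF, and RAG, are neither mentioned in nor relevant to the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper systematically evaluates how well instruction-tuned large language models agree with human raters on essay scoring. The models do well on holistic scoring but show a stable negative bias on analytic scoring, especially for lower-order concerns such as grammar, and the paper proposes a deployment strategy based on bias correction from small samples.

Abstract translation (Chinese)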

尽管利用大型语言模型（LLM）进行教育评估的兴趣日益增长，但其与人工评分的一致性程度仍不明确。本研究对指令微调后的LLM在三个开放式作文评分数据集（ASAP 2.0、ELLIPSE和DREsS）上进行了系统评估，涵盖整体性评分与分析性评分。我们分析了模型与人类共识评分的一致性、方向性偏差以及偏差估计的稳定性。结果显示，在整体性评分上，强大的开源权重模型与人类评分者达到了中等到高度的一致性（二次加权卡帕系数约为0.6），但这种一致性并未均匀体现在分析性评分中。我们尤其观察到，在低阶关注特征（如语法与规范）上存在显著且稳定的负向偏差，这意味着模型对这些特征的评分往往比人类评分者更为严苛。同时发现，在多维度分析性评分中，简洁的关键词提示通常优于长篇量规式提示。为量化检测这些系统性偏差所需的数据量，我们计算了使平均偏差的95%自助置信区间排除零值所需的最小样本量。分析表明，低阶关注特征的偏差通常可通过极小的验证集检测到，而高阶关注特征则通常需要更大的样本量。这些发现支持一种“偏差校正优先”的部署策略：无需依赖原始零样本评分，亦无需大规模微调，仅需通过少量人工标注的偏差估计集即可估算并校正系统性评分偏移。

Abstract

Despite growing interest in using Large Language Models (LLMs) for educational assessment, it remains unclear how closely they align with human scoring. We present a systematic evaluation of instruction-tuned LLMs across three open essay-scoring datasets (ASAP 2.0, ELLIPSE, and DREsS) that cover both holistic and analytic scoring. We analyze agreement with human consensus scores, directional bias, and the stability of bias estimates. Our results show that strong open-weight models achieve moderate to high agreement with humans on holistic scoring (Quadratic Weighted Kappa about 0.6), but this does not transfer uniformly to analytic scoring. In particular, we observe large and stable negative directional bias on Lower-Order Concern (LOC) traits, such as Grammar and Conventions, meaning that models often score these traits more harshly than human raters. We also find that concise keyword-based prompts generally outperform longer rubric-style prompts in multi-trait analytic scoring. To quantify the amount of data needed to detect these systematic deviations, we compute the minimum sample size at which a 95% bootstrap confidence interval for the mean bias excludes zero. This analysis shows that LOC bias is often detectable with very small validation sets, whereas Higher-Order Concern (HOC) traits typically require much larger samples. These findings support a bias-correction-first deployment strategy: instead of relying on raw zero-shot scores, systematic score offsets can be estimated and corrected using small human-labeled bias-estimation sets, without requiring large-scale fine-tuning.

Keywords: Large Language Models, essay scoring, instruction tuning, human alignment, analytic scoring, bias detection, educational assessment, prompt effects
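
The minimum-sample-size analysis described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes bias is defined as model score minus human score (so negative means the model is harsher), uses a percentile bootstrap, and grows the validation set by taking prefixes of the data. The function names and the candidate sizes are hypothetical.

```python
import random
from statistics import mean
from typing import List, Optional, Sequence, Tuple


def bootstrap_ci_mean(values: Sequence[float], n_boot: int = 2000,
                      alpha: float = 0.05,
                      rng: Optional[random.Random] = None) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    n = len(values)
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def min_sample_size_excluding_zero(biases: List[float],
                                   sizes: Sequence[int]) -> Optional[int]:
    """Smallest candidate size whose 95% CI for the mean bias excludes zero."""
    for n in sizes:
        lo, hi = bootstrap_ci_mean(biases[:n])
        if lo > 0 or hi < 0:
            return n
    return None  # bias not detectable at any of the candidate sizes
```

Once the interval excludes zero, the estimated mean bias can be subtracted from raw model scores, which is the bias-correction-first strategy the abstract recommends.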

146. ❌ FGR-ColBERT: Identifying Fine-Grained Relevance Tokens During Retrieval

Authors: Antonín Jarolím, Martin Fajčík Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2604.00242v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

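Scoring rationale: The paper improves a retrieval model (ColBERT) by distilling fine-grained relevance signals from an LLM to raise retrieval quality. This falls within retrieval-augmented generation, so "Retrieval-Augmented Generation" is highly relevant (10). It uses an LLM for knowledge distillation, which relates to "Large Language Models" (8). The proposed FGR-ColBERT model (110M parameters) is far smaller than Gemma 2 (27B), reflecting a trend toward smaller models, so "Small Language Models" has some relevance (5). Other keywords, such as MoE, Scaling Laws, and Alignment, are unrelated to the paper (0).

!!! tip deepseek-chat TL;DR

The paper proposes FGR-ColBERT, a modified retrieval model that distills fine-grained relevance signals from a large language model to make document retrieval more precise. It achieves better fine-grained relevance performance than the much larger model while preserving retrieval efficiency.

Abstract translation (Chinese)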

文档检索能够识别相关文档，但无法提供细粒度的证据线索，例如具体相关的文本片段。一种可能的解决方案是在检索后应用大语言模型（LLM），但这会引入显著的计算开销并限制实际部署。我们提出了FGR-ColBERT，这是一种对ColBERT检索模型的改进，它将从LLM中提炼出的细粒度相关性信号直接整合到检索函数中。在MS MARCO数据集上的实验表明，FGR-ColBERT（1.1亿参数）在词元级F1分数上达到了64.5，超过了Gemma 2（270亿参数）的62.8，尽管其模型规模大约小了245倍。同时，它保持了检索有效性（相对Recall@50为99%），并且依然高效，与原始ColBERT相比仅产生约1.12倍的延迟开销。

Abstract

Document retrieval identifies relevant documents but does not provide fine-grained evidence cues, such as specific relevant spans. A possible solution is to apply an LLM after retrieval; however, this introduces significant computational overhead and limits practical deployment. We propose FGR-ColBERT, a modification of ColBERT retrieval model that integrates fine-grained relevance signals distilled from an LLM directly into the retrieval function. Experiments on MS MARCO show that FGR-ColBERT (110M) achieves a token-level F1 of 64.5, exceeding the 62.8 of Gemma 2 (27B), despite being approximately 245 times smaller. At the same time, it preserves retrieval effectiveness (99% relative Recall@50) and remains efficient, incurring only a ~1.12x latency overhead compared to the original ColBERT.

Keywords: document retrieval, fine-grained relevance, retrieval model, knowledge distillation, LLM, ColBERT, token-level F1, efficiency
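
For readers unfamiliar with the metric, token-level F1 compares the set of tokens a model marks as relevant against a gold set. The sketch below assumes tokens are identified by their positions in the document and that F1 is computed per document; the paper's exact protocol (tokenizer, averaging) is not given in the abstract, and `token_f1` is a hypothetical helper. The last line checks the "approximately 245 times smaller" figure from the stated parameter counts.

```python
from typing import Iterable


def token_f1(predicted: Iterable[int], gold: Iterable[int]) -> float:
    """F1 between predicted and gold sets of relevant token positions."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0  # nothing to find and nothing predicted
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Size ratio implied by the abstract: Gemma 2 (27B) vs FGR-ColBERT (110M).
size_ratio = 27e9 / 110e6
```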

147. ❌ A Taxonomy of Programming Languages for Code Generation

Authors: Nishat Raihan, Christian Newman, Marcos Zampieri Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2604.00239v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

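Scoring rationale: The paper classifies programming languages by resource availability to support the evaluation of multilingual code-generation LLMs. It explicitly names "large language models (LLMs)" as the core motivation and application context, so that keyword scores 10. The other keywords cover specific large-model techniques (such as MoE, quantization, and inference acceleration), training methods (such as pre-training, fine-tuning, and alignment), application scenarios (such as AI for Science), or advanced capabilities (such as reasoning and agents), none of which the paper addresses, so they score 0.

!!! tip deepseek-chat TL;DR

Addressing the uneven distribution of resources across programming languages, the paper presents the first reproducible resource taxonomy of programming languages, grouping 646 languages into four tiers and providing a principled framework for dataset curation and evaluation of multilingual code-generation LLMs.

Abstract translation (Chinese)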

全球7000多种语言在自然语言处理资源可获得性方面存在显著差异，这推动学界系统性地按资源丰富程度对语言进行分类（Joshi等人，2020）。编程语言领域存在类似的不均衡现象，但尚未建立代码资源的层级分类体系。随着大语言模型生成代码的能力日益增强，建立此类分类体系变得至关重要。为填补这一空白，我们首次提出可复现的编程语言资源分类方法，将646种语言划分为四个层级。研究表明，仅占1.9%的语言（第三层级：高资源）占据了七大语料库中74.6%的代码标记，而71.7%的语言（第零层级：稀缺资源）仅贡献了1.0%的标记。通过对层级内不平等性、离散度和分布偏度的统计分析，我们证实这种不均衡现象既极端又具有系统性。本研究成果为多语言大语言模型的数据集构建和层级感知评估提供了理论框架。

Abstract

The world’s 7,000+ languages vary widely in the availability of resources for NLP, motivating efforts to systematically categorize them by their degree of resourcefulness (Joshi et al., 2020). A similar disparity exists among programming languages (PLs); however, no resource-tier taxonomy has been established for code. As large language models (LLMs) grow increasingly capable of generating code, such a taxonomy becomes essential. To fill this gap, we present the first reproducible PL resource classification, grouping 646 languages into four tiers. We show that only 1.9% of languages (Tier 3, High) account for 74.6% of all tokens in seven major corpora, while 71.7% of languages (Tier 0, Scarce) contribute just 1.0%. Statistical analyses of within-tier inequality, dispersion, and distributional skew confirm that this imbalance is both extreme and systematic. Our results provide a principled framework for dataset curation and tier-aware evaluation of multilingual LLMs.

Keywords: programming languages, code generation, large language models, resource classification, multilingual evaluation, dataset curation, taxonomy

148. ❌ Do Language Models Know When They’ll Refuse? Probing Introspective Awareness of Safety Boundaries

Authors: Tanay Gondil Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2604.00228v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

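Scoring rationale: The paper studies the self-knowledge of large language models (LLMs), specifically whether a model can predict its own refusal behavior. This directly involves the keywords for LLMs, self-reflection/self-improvement, safety alignment, factuality/truthfulness, and explainable AI. It evaluates frontier models including Claude, GPT, and Llama and uses signal detection theory to analyze their introspective sensitivity, making it novel research on how large models work. Other keywords, such as MoE, quantization, inference acceleration, and AI-for-science applications, are unrelated to the paper.

!!! tip deepseek-chat TL;DR

The study examines whether large language models can accurately predict when they will refuse harmful requests. All tested models show high introspective sensitivity, but sensitivity drops substantially at safety boundaries, and confidence scores provide a practical signal for safety-critical deployments.

Abstract translation (Chinese)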

大型语言模型经过训练能够拒绝有害请求，但它们能否在生成回应前准确预测自身的拒绝行为？我们通过系统性研究探讨该问题：让模型首先预测其拒绝倾向，随后在全新语境中生成实际回应。我们在涵盖300类请求的3754个数据点上评估了四个前沿模型：Claude Sonnet 4、Claude Sonnet 4.5、GPT-5.2和Llama 3.1 405B。运用信号检测理论（Signal Detection Theory, SDT）分析发现，所有模型均表现出较高的内省敏感性（d’ = 2.4-3.5），但在安全边界处敏感性显著下降。Claude系列呈现代际改进（Sonnet 4.5准确率95.7% vs Sonnet 4准确率93.0%），而GPT-5.2准确率较低（88.9%）且行为波动更大。Llama 405B虽具有高敏感性，但表现出强烈的拒绝偏见和较差的校准度，导致整体准确率偏低（80.0%）。主题分析显示武器相关查询始终是内省预测最困难的领域。关键的是，置信度评分提供了可操作的信号：对于校准良好的模型，仅采用高置信度预测可实现98.3%的准确率，这为安全关键场景下的置信度路由部署提供了实用方案。

Abstract

Large language models are trained to refuse harmful requests, but can they accurately predict when they will refuse before responding? We investigate this question through a systematic study where models first predict their refusal behavior, then respond in a fresh context. Across 3754 datapoints spanning 300 requests, we evaluate four frontier models: Claude Sonnet 4, Claude Sonnet 4.5, GPT-5.2, and Llama 3.1 405B. Using signal detection theory (SDT), we find that all models exhibit high introspective sensitivity (d’ = 2.4-3.5), but sensitivity drops substantially at safety boundaries. We observe generational improvement within Claude (Sonnet 4.5: 95.7 percent accuracy vs Sonnet 4: 93.0 percent), while GPT-5.2 shows lower accuracy (88.9 percent) with more variable behavior. Llama 405B achieves high sensitivity but exhibits strong refusal bias and poor calibration, resulting in lower overall accuracy (80.0 percent). Topic-wise analysis reveals weapons-related queries are consistently hardest for introspection. Critically, confidence scores provide actionable signal: restricting to high-confidence predictions yields 98.3 percent accuracy for well-calibrated models, enabling practical confidence-based routing for safety-critical deployments.

Keywords: Large Language Models, Introspective Awareness, Safety Boundaries, Refusal Behavior, Signal Detection Theory, Model Calibration, Confidence-based Routing, Safety-critical Deployments
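
The sensitivity index d' reported in the abstract comes from signal detection theory: it is the difference between the z-scored hit rate and the z-scored false-alarm rate. The sketch below assumes a "hit" means the model predicted a refusal and then did refuse, and a "false alarm" means it predicted a refusal but then complied; the log-linear correction (adding 0.5 to each cell) is a common convention for avoiding infinite z-scores and may not be what the paper uses.

```python
from statistics import NormalDist
from typing import Tuple

_z = NormalDist().inv_cdf  # inverse CDF of the standard normal


def sdt_rates(hits: int, misses: int, false_alarms: int,
              correct_rejections: int) -> Tuple[float, float]:
    """Hit and false-alarm rates with a log-linear correction."""
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    return hit_rate, fa_rate


def d_prime(hits: int, misses: int, false_alarms: int,
            correct_rejections: int) -> float:
    """Sensitivity: how well predicted refusals separate from predicted compliance."""
    hit_rate, fa_rate = sdt_rates(hits, misses, false_alarms, correct_rejections)
    return _z(hit_rate) - _z(fa_rate)


def criterion(hits: int, misses: int, false_alarms: int,
              correct_rejections: int) -> float:
    """Response bias c; negative values mean a tendency to predict refusal."""
    hit_rate, fa_rate = sdt_rates(hits, misses, false_alarms, correct_rejections)
    return -0.5 * (_z(hit_rate) + _z(fa_rate))
```

Separating sensitivity from bias is what lets the abstract describe a model as highly sensitive yet poorly calibrated: a strong refusal bias shifts the criterion without changing d'.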

149. ❌ Do LLMs Know What Is Private Internally? Probing and Steering Contextual Privacy Norms in Large Language Model Representations

Authors: Haoran Wang, Li Xiong, Kai Shu Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2604.00209v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

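Scoring rationale: The paper studies how LLMs internally encode contextual privacy norms (highly relevant to "Large Language Models") and analyzes this through interpretability methods (highly relevant to "Mechanistic Interpretability"). It also touches on privacy leakage and behavioral alignment (some relevance to "Hallucination Mitigation", "Self-Correction", and "Instruction Tuning"). Other keywords, such as MoE, quantization, and inference acceleration, are not covered in the paper.

!!! tip deepseek-chat TL;DR

The paper investigates whether LLMs internally encode contextual privacy norms. It finds that the contextual integrity (CI) parameters are represented as linearly separable directions, yet models still leak private information; the proposed CI-parametric steering reduces privacy violations more effectively.

Abstract translation (Chinese)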

大型语言模型（LLM）正日益被部署于高风险场景中，但它们经常违反情境隐私，在人类会审慎行事的场合泄露私人信息。这引发了一个根本性问题：LLM是否在内部编码了情境隐私规范？如果是，为何违规行为仍然持续？我们基于情境完整性（CI）理论，首次对情境隐私作为LLM中的结构化潜在表征进行了系统性研究。通过对多个模型的探测，我们发现决定规范的三个CI参数（信息类型、接收者与传递原则）在激活空间中被编码为线性可分且功能独立的方向。尽管存在这种内部结构，模型在实践中仍会泄露私人信息，这揭示了概念表征与模型行为之间存在明显差距。为弥合这一差距，我们提出了CI参数化引导方法，该方法可沿每个CI维度独立进行干预。这种结构化控制相比整体性引导，能更有效、更可预测地减少隐私违规行为。我们的研究结果表明，情境隐私失效源于表征与行为之间的错位，而非认知缺失；利用CI的组合结构能够实现更可靠的情境隐私控制，这为改进LLM的情境隐私理解指明了潜在路径。

Abstract

Large language models (LLMs) are increasingly deployed in high-stakes settings, yet they frequently violate contextual privacy by disclosing private information in situations where humans would exercise discretion. This raises a fundamental question: do LLMs internally encode contextual privacy norms, and if so, why do violations persist? We present the first systematic study of contextual privacy as a structured latent representation in LLMs, grounded in contextual integrity (CI) theory. Probing multiple models, we find that the three norm-determining CI parameters (information type, recipient, and transmission principle) are encoded as linearly separable and functionally independent directions in activation space. Despite this internal structure, models still leak private information in practice, revealing a clear gap between concept representation and model behavior. To bridge this gap, we introduce CI-parametric steering, which independently intervenes along each CI dimension. This structured control reduces privacy violations more effectively and predictably than monolithic steering. Our results demonstrate that contextual privacy failures arise from misalignment between representation and behavior rather than missing awareness, and that leveraging the compositional structure of CI enables more reliable contextual privacy control, shedding light on potential improvement of contextual privacy understanding in LLMs.

Keywords: Large Language Models, Contextual Privacy, Contextual Integrity, Privacy Norms, Representation Probing, Activation Space, Privacy Violations, Steering Interventions
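
The notions of a "linearly separable direction" and of steering along it can be made concrete with a toy example. The sketch below uses a generic difference-of-means direction and additive steering on small hand-written vectors; it is not the paper's CI-parametric method, whose probing and intervention details are not given in the abstract, and all function names are hypothetical.

```python
from math import sqrt
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of a list of activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def unit_direction(positive: Sequence[Vector], negative: Sequence[Vector]) -> Vector:
    """Unit vector pointing from the negative-class mean to the positive-class mean."""
    diff = [p - q for p, q in zip(mean_vector(positive), mean_vector(negative))]
    norm = sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def project(activation: Vector, direction: Vector) -> float:
    """Scalar readout: how far the activation lies along the direction."""
    return sum(a * d for a, d in zip(activation, direction))


def steer(activation: Vector, direction: Vector, alpha: float) -> Vector:
    """Shift the activation by alpha along one direction, leaving others untouched."""
    return [a + alpha * d for a, d in zip(activation, direction)]


# Toy 2-D activations for contexts with private vs public information types.
private_acts = [[2.0, 0.1], [1.8, -0.1]]
public_acts = [[-2.0, 0.0], [-1.8, 0.2]]
info_type_direction = unit_direction(private_acts, public_acts)
```

With one such direction per CI parameter (information type, recipient, transmission principle), each could be shifted independently, which is the intuition behind intervening along each dimension separately rather than with a single monolithic vector.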

150. ❌ Polish phonology and morphology through the lens of distributional semantics

Authors: Paula Orzechowska, R. Harald Baayen Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2604.00174v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

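Scoring rationale: This paper studies a linguistic question about Polish (how phonology and morphology relate to distributional semantics) using conventional statistical and computational techniques (t-SNE, Linear Discriminant Analysis, Linear Discriminative Learning). It does not involve large models, deep learning, or AI for Science. Every keyword concerns the principles, training methods, or applications of large models, or scientific AI applications, whereas this paper is purely computational linguistics and linguistics research, so it is unrelated to all of them.

!!! tip deepseek-chat TL;DR

This study examines how the phonological and morphological structure of Polish words relates to their meaning in distributional semantic space. It finds that semantic vectors predict phonotactic complexity and morphosyntactic categories, and that they support highly accurate computational modeling of comprehension and production.

Abstract (Chinese translation)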

本研究运用分布语义学探讨波兰语词汇的音系与形态结构同其意义之间的关系。在本次分析中，我们探究包含辅音丛的词汇形式属性与其意义之间是否存在关联：复杂词汇的音系及形态音系结构是否在语义空间中有所映照？我们以波兰语为研究对象展开探讨，该语言具有复杂的形态系统及大量由形态驱动的辅音丛。通过采用t-SNE、线性判别分析与线性判别学习等统计与计算技术，我们证明——除了编码丰富的形态句法信息（如时态、数、格）外——语义向量还能捕捉诸如音素串等亚词汇语言单位的信息。首先，仅通过词嵌入即可预测波兰语中的音位组合复杂度、形态配列透明度以及广泛的形态句法范畴（格、性、体、时态、数），而无需任何关于词汇形式的信息。其次，我们认为基于词嵌入的判别性词汇模型之所以能对语言理解与产出作出高度精准的计算建模预测，正是因为语义空间中存在大量与形式空间结构高度同构的信息。

Abstract

This study investigates the relationship between the phonological and morphological structure of Polish words and their meanings using Distributional Semantics. In the present analysis, we ask whether there is a relationship between the form properties of words containing consonant clusters and their meanings. Is the phonological and morphonological structure of complex words mirrored in semantic space? We address these questions for Polish, a language characterized by non-trivial morphology and an impressive inventory of morphologically-motivated consonant clusters. We use statistical and computational techniques, such as t-SNE, Linear Discriminant Analysis and Linear Discriminative Learning, and demonstrate that – apart from encoding rich morphosyntactic information (e.g. tense, number, case) – semantic vectors capture information on sub-lexical linguistic units such as phoneme strings. First, phonotactic complexity, morphotactic transparency, and a wide range of morphosyntactic categories available in Polish (case, gender, aspect, tense, number) can be predicted from embeddings without requiring any information about the forms of words. Second, we argue that computational modelling with the discriminative lexicon model using embeddings can provide highly accurate predictions for comprehension and production, exactly because of the existence of extensive information in semantic space that is to a considerable extent isomorphic with structure in the form space.

Keywords: Polish phonology, Polish morphology, distributional semantics, consonant clusters, semantic vectors, computational modeling, discriminative lexicon model, morphosyntactic categories

151. ❌ ParetoBandit: Budget-Paced Adaptive Routing for Non-Stationary LLM Serving

Authors: Annette Taberner-Miller Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2604.00136v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

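Scoring rationale: The paper focuses on an adaptive routing system for LLM serving (ParetoBandit), centering on LLM deployment, cost-quality trade-offs, and online adaptation. It is therefore highly relevant only to 'Large Language Models OR LLMs OR Foundation Models' (10 points), since it deals directly with LLM serving portfolios and routing. The other keywords (MoE, SLMs, training techniques, reasoning methods, AI for Science, and so on) are not mentioned in the title or abstract and are unrelated to the paper, so each scores 0.

!!! tip deepseek-chat TL;DR

The paper proposes ParetoBandit, an adaptive router built on cost-aware contextual bandits for budget-controlled multi-model routing in non-stationary LLM serving. It adapts online, integrates new models at runtime, and keeps mean per-request cost from exceeding the budget target by more than 0.4%.

Abstract (Chinese translation)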

生产级大语言模型服务通常依赖横跨约530倍成本区间的多模型组合，其路由决策需在质量与成本间进行权衡。这种权衡具有非稳态特性：服务提供商会调整定价、模型质量可能静默退化，且新模型必须在不中断服务的情况下完成集成。本文提出ParetoBandit——一个基于成本感知上下文赌博机算法的开源自适应路由器，首次实现了同时执行美元计价的预算控制、在线适应动态变化以及在运行时无缝接入新模型三大功能。
ParetoBandit通过三种机制解决上述挑战。在线原始-对偶预算调节器可在无限请求流中执行单请求成本上限控制，以闭环调控替代离线惩罚参数调优；基于充分统计量的几何遗忘机制能够快速适应价格与质量波动，同时利用离线先验信息进行冷启动；热插拔注册表允许运维人员在运行时动态增删模型，每个新模型会经历短暂的强制探索阶段，随后仅通过实际流量即可利用UCB选择机制发现其质量-成本生态位。
我们在包含三个模型的组合上使用1,824条提示词对ParetoBandit进行了四种部署场景的评估。在七种预算上限设置下，平均单请求成本始终未超过目标值0.4%以上。当环境发生变化时，系统展现出适应性：最昂贵模型降价一个数量级可带来最高+0.071的质量提升，静默质量退化现象能在预算内被检测并重路由。冷启动模型仅需约142步即可实现有效部署且未突破成本上限。该路由器具备判别能力而非盲目采用：昂贵模型受预算门控约束，低质量模型在有限探索后即被弃用。端到端路由延迟在CPU上为9.8毫秒——不足典型推理时间的0.4%——其中路由决策本身仅耗时22.5微秒。

Abstract

Production LLM serving often relies on multi-model portfolios spanning a ~530x cost range, where routing decisions trade off quality against cost. This trade-off is non-stationary: providers revise pricing, model quality can regress silently, and new models must be integrated without downtime. We present ParetoBandit, an open-source adaptive router built on cost-aware contextual bandits that is the first to simultaneously enforce dollar-denominated budgets, adapt online to such shifts, and onboard new models at runtime. ParetoBandit closes these gaps through three mechanisms. An online primal-dual budget pacer enforces a per-request cost ceiling over an open-ended stream, replacing offline penalty tuning with closed-loop control. Geometric forgetting on sufficient statistics enables rapid adaptation to price and quality shifts while bootstrapping from offline priors. A hot-swap registry lets operators add or remove models at runtime, with a brief forced-exploration phase for each newcomer, after which UCB selection discovers its quality-cost niche from live traffic alone. We evaluate ParetoBandit across four deployment scenarios on 1,824 prompts routed through a three-model portfolio. Across seven budget ceilings, mean per-request cost never exceeds the target by more than 0.4%. When conditions shift, the system adapts: an order-of-magnitude price cut on the costliest model yields up to +0.071 quality lift, and a silent quality regression is detected and rerouted within budget. A cold-started model reaches meaningful adoption within ~142 steps without breaching the cost ceiling. The router discriminates rather than blindly adopting: expensive models are budget-gated and low-quality models rejected after bounded exploration. End-to-end routing latency is 9.8ms on CPU – less than 0.4% of typical inference time – with the routing decision itself taking just 22.5us.

Keywords: LLM serving, adaptive routing, budget pacing, contextual bandits, multi-model portfolios, cost-quality trade-off, online adaptation, runtime model integration
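
The three mechanisms named in the ParetoBandit abstract (budget pacing, geometric forgetting, UCB selection after forced exploration) can be illustrated with a toy router. This is a minimal sketch, not the paper's implementation: it is non-contextual, and the `Arm` and `PacedRouter` classes, the normalized costs, the fixed quality values, and the `gamma` and `eta` settings are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Arm:
    """Discounted statistics for one model (hypothetical structure)."""
    cost: float               # normalized cost per request, assumed known
    pulls: float = 0.0        # discounted pull count
    quality_sum: float = 0.0  # discounted sum of observed quality

    def mean_quality(self) -> float:
        return self.quality_sum / self.pulls if self.pulls > 0 else 0.0


class PacedRouter:
    def __init__(self, arms, budget, gamma=0.99, eta=0.5):
        self.arms = arms      # dict: model name -> Arm
        self.budget = budget  # target mean cost per request
        self.gamma = gamma    # geometric forgetting factor (assumed value)
        self.eta = eta        # dual step size (assumed value)
        self.lam = 0.0        # dual variable: current price of spending

    def select(self) -> str:
        # Forced exploration: try any model that has never been used.
        for name, arm in self.arms.items():
            if arm.pulls == 0:
                return name
        total = sum(arm.pulls for arm in self.arms.values())

        def score(arm: Arm) -> float:
            # UCB on quality, penalized by the current price of spending.
            bonus = math.sqrt(2.0 * math.log(total + 1.0) / arm.pulls)
            return arm.mean_quality() + bonus - self.lam * arm.cost

        return max(self.arms, key=lambda name: score(self.arms[name]))

    def update(self, name: str, quality: float) -> None:
        # Geometric forgetting: old evidence fades so shifts are noticed.
        for arm in self.arms.values():
            arm.pulls *= self.gamma
            arm.quality_sum *= self.gamma
        chosen = self.arms[name]
        chosen.pulls += 1.0
        chosen.quality_sum += quality
        # Dual ascent: raise the price when over budget, lower it when under.
        self.lam = max(0.0, self.lam + self.eta * (chosen.cost - self.budget))


def simulate(steps: int = 2000) -> float:
    """Route a stream with fixed, made-up qualities; return the mean cost."""
    true_quality = {"cheap": 0.6, "premium": 0.9}
    router = PacedRouter(
        {"cheap": Arm(cost=0.1), "premium": Arm(cost=1.0)}, budget=0.3
    )
    spent = 0.0
    for _ in range(steps):
        name = router.select()
        router.update(name, true_quality[name])
        spent += router.arms[name].cost
    return spent / steps
```

Because the dual variable grows whenever a request costs more than the target, the premium model is used only as often as the budget allows. In this toy setup the mean cost from `simulate()` settles close to the 0.3 target, which replaces a hand-tuned cost penalty with closed-loop control.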

152. ❌ Oblivion: Self-Adaptive Agentic Memory Control through Decay-Driven Activation

Authors: Ashish Rana, Chia-Chien Hung, Qumeng Sun, Julian Martin Kunkel, Carolin Lawrence Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2604.00131v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

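Scoring rationale: The paper's core contribution is Oblivion, a memory control framework for LLM agents that achieves adaptive memory management through decay-driven activation. Highly relevant keywords: LLM Agents (the core subject), Retrieval-Augmented Generation (memory retrieval mechanisms), and Large Language Models (the underlying models). Moderately relevant keywords: Context Window Extension (handling long histories), Chain of Thought and System 2 Thinking (reasoning processes), and Self-Correction (memory reinforcement and adaptation). The remaining keywords are unrelated to the paper.

!!! tip deepseek-chat TL;DR

To address the memory interference and latency that LLM agents face as interaction histories grow, the paper proposes Oblivion, a memory control framework that uses decay-driven activation for adaptive memory access and reinforcement. On dynamic long-horizon interaction benchmarks, it balances learning and forgetting effectively.

Abstract (Chinese translation)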

人类记忆通过选择性遗忘进行适应：经验随时间推移可及性降低，但可通过强化或情境线索重新激活。相比之下，基于记忆增强的大语言模型智能体依赖“持续在线”检索与“扁平化”记忆存储，随着历史记录增长易引发高干扰与延迟。我们提出Oblivion记忆控制框架，将遗忘视为由衰减驱动的可及性降低而非显式删除。Oblivion将记忆控制解耦为读取路径与写入路径：读取路径根据智能体不确定性及记忆缓冲区充足性决定何时调用记忆，避免冗余的持续在线访问；写入路径通过强化对形成响应有贡献的记忆来决定强化内容。二者协同实现分层记忆组织，在保持持久高层策略的同时动态加载所需细节。我们在静态与动态长程交互基准测试中进行评估，结果表明Oblivion能动态调节记忆访问与强化机制，在情境变化中平衡学习与遗忘，凸显记忆控制对实现高效大语言模型智能体推理的关键作用。源代码发布于https://github.com/nec-research/oblivion。

Abstract

Human memory adapts through selective forgetting: experiences become less accessible over time but can be reactivated by reinforcement or contextual cues. In contrast, memory-augmented LLM agents rely on “always-on” retrieval and “flat” memory storage, causing high interference and latency as histories grow. We introduce Oblivion, a memory control framework that casts forgetting as decay-driven reductions in accessibility, not explicit deletion. Oblivion decouples memory control into read and write paths. The read path decides when to consult memory, based on agent uncertainty and memory buffer sufficiency, avoiding redundant always-on access. The write path decides what to strengthen, by reinforcing memories contributing to forming the response. Together, this enables hierarchical memory organization that maintains persistent high-level strategies while dynamically loading details as needed. We evaluate on both static and dynamic long-horizon interaction benchmarks. Results show that Oblivion dynamically adapts memory access and reinforcement, balancing learning and forgetting under shifting contexts, highlighting that memory control is essential for effective LLM-agentic reasoning. The source code is available at https://github.com/nec-research/oblivion.

Keywords: LLM agents, memory control, forgetting mechanism, retrieval-augmented generation, adaptive memory access, long-horizon interaction, decay-driven activation, hierarchical memory organization
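
The idea in the Oblivion abstract of forgetting as reduced accessibility rather than deletion can be sketched with an exponentially decaying activation score. This is an illustrative toy, not the released Oblivion code: the `DecayMemory` class, the exponential decay form, and every threshold below are assumptions, and the read gate uses only an uncertainty value (the abstract also mentions memory buffer sufficiency, which is omitted here).

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryItem:
    text: str
    strength: float = 1.0  # raised whenever the memory is reinforced
    last_used: int = 0     # time step of the last write or reinforcement


class DecayMemory:
    """Toy memory store: items fade from view but are never deleted."""

    def __init__(self, decay_rate=0.1, threshold=0.3, uncertainty_gate=0.5):
        self.decay_rate = decay_rate              # assumed decay constant
        self.threshold = threshold                # minimum activation to surface
        self.uncertainty_gate = uncertainty_gate  # below this, skip memory
        self.items = []

    def add(self, text: str, now: int) -> MemoryItem:
        item = MemoryItem(text=text, strength=1.0, last_used=now)
        self.items.append(item)
        return item

    def activation(self, item: MemoryItem, now: int) -> float:
        # Accessibility decays exponentially with time since last use.
        age = now - item.last_used
        return item.strength * math.exp(-self.decay_rate * age)

    def read(self, now: int, uncertainty: float) -> list:
        # Read path: consult memory only when the agent is uncertain enough.
        if uncertainty < self.uncertainty_gate:
            return []
        return [m for m in self.items if self.activation(m, now) >= self.threshold]

    def reinforce(self, item: MemoryItem, now: int, boost: float = 0.5) -> None:
        # Write path: strengthen a memory that contributed to the response.
        item.strength = self.activation(item, now) + boost
        item.last_used = now
```

A memory that has decayed below `threshold` stops appearing in `read` results but stays in `items`, so a later `reinforce` call can bring it back into view.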

153. ❌ Hierarchical Chain-of-Thought Prompting: Enhancing LLM Reasoning Performance and Efficiency

Authors: Xingshuai Huang, Derek Li, Bahareh Nikpour, Parsa Omidi Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2604.00130v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	15.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

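Scoring rationale: The paper's core contribution is an improvement to Chain-of-Thought (CoT) prompting, Hierarchical Chain-of-Thought (Hi-CoT), so it is highly relevant to 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' (15 points). It explicitly studies the reasoning ability of LLMs, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). Hi-CoT aims to improve logical coherence in complex multi-step reasoning, which gives it some relevance to 'System 2 Thinking OR Slow Thinking OR In-depth Reasoning' (8 points). CoT prompting falls under in-context learning, which gives it some relevance to 'In-context Learning OR Many-shot Learning' (5 points). Other keywords, such as MoE, quantization, and alignment, are not covered by the paper and score 0.

!!! tip deepseek-chat TL;DR

To address the redundancy and suboptimal performance of conventional Chain-of-Thought prompting in complex multi-step reasoning, the paper proposes Hierarchical Chain-of-Thought (Hi-CoT) prompting. Experiments show that it improves LLM accuracy on mathematical reasoning tasks (by 6.2% on average) while shortening reasoning traces (by 13.9%).

Abstract (Chinese translation)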

思维链提示显著提升了大语言模型的推理能力。然而，传统的思维链通常依赖于非结构化、扁平化的推理链条，存在冗余且性能欠佳。本研究提出了分层思维链提示，这是一种专为应对复杂多步推理挑战而设计的结构化推理范式。分层思维链通过交替进行指令规划和分步执行，将推理过程分解为层次化的子步骤。这种分解使大语言模型能够更好地管理长推理跨度并保持逻辑连贯性。在多种大语言模型和数学推理基准上的广泛评估表明，与思维链提示相比，分层思维链持续将平均准确率提升了6.2%（在某些模型和任务上最高达61.4%），同时将推理轨迹长度减少了13.9%。我们进一步证明，当模型严格遵循分层结构时，准确率和效率达到最大化。我们的代码公开于 https://github.com/XingshuaiHuang/Hi-CoT。

Abstract

Chain-of-Thought (CoT) prompting has significantly improved the reasoning capabilities of large language models (LLMs). However, conventional CoT often relies on unstructured, flat reasoning chains that suffer from redundancy and suboptimal performance. In this work, we introduce Hierarchical Chain-of-Thought (Hi-CoT) prompting, a structured reasoning paradigm specifically designed to address the challenges of complex, multi-step reasoning. Hi-CoT decomposes the reasoning process into hierarchical substeps by alternating between instructional planning and step-by-step execution. This decomposition enables LLMs to better manage long reasoning horizons and maintain logical coherence. Extensive evaluations across diverse LLMs and mathematical reasoning benchmarks show that Hi-CoT consistently improves average accuracy by 6.2% (up to 61.4% on certain models and tasks) while reducing reasoning trace length by 13.9% compared to CoT prompting. We further show that accuracy and efficiency are maximized when models strictly adhere to the hierarchical structure. Our code is available at https://github.com/XingshuaiHuang/Hi-CoT.

Keywords: Chain-of-Thought, Hierarchical Chain-of-Thought, LLM reasoning, multi-step reasoning, mathematical reasoning, prompting, reasoning efficiency, logical coherence

154. ❌ One Panel Does Not Fit All: Case-Adaptive Multi-Agent Deliberation for Clinical Prediction

Authors: Yuxing Lu, Yushuhong Lin, Jason Zhang Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2604.00085v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

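Scoring rationale: The paper studies large language models (LLMs) applied to clinical prediction, which falls under AI for Science, so the corresponding keywords score high. It proposes the CAMP framework, whose core method involves Multi-agent Systems and LLM Agents, each scoring 10. The paper addresses reasoning over complex cases and touches on Chain of Thought, System 2 Thinking, and Self-Correction, but these are not central, so each scores 5. It also emphasizes transparent decision audits, which gives it some relevance to Hallucination Mitigation and Explainable AI (5 points each). Other keywords, such as MoE, SFT, and RAG, are not covered by the paper and score 0.

!!! tip deepseek-chat TL;DR

To address case-level heterogeneity in LLM-based clinical prediction, the paper proposes CAMP, a case-adaptive multi-agent deliberation framework. Through dynamic specialist-panel assembly, three-valued voting, and hybrid routing, it outperforms baseline methods on diagnostic prediction on MIMIC-IV while providing transparent decision audits.

Abstract (Chinese translation)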

应用于临床预测的大语言模型存在病例层面的异质性：简单病例能产生一致的输出，而复杂病例在轻微提示变动下会产生分歧性预测。现有的单智能体策略从单一角色条件分布中采样，多智能体框架则采用固定角色配合简单多数投票，丢弃了分歧中的诊断信号。我们提出CAMP（病例自适应多智能体专家组），其中主治医师智能体根据每个病例的诊断不确定性动态组建定制化的专科医生小组。每位专科医生通过三值投票（保留/拒绝/中立）评估候选诊断，从而在其专业领域外实现有原则的弃权。混合路由机制将每个诊断导向三种路径：强共识路径、回退至主治医师判断路径，或基于证据的仲裁路径——该路径更注重论证质量而非票数统计。在MIMIC-IV数据集上针对诊断预测和简要病程生成的实验中，基于四种大语言模型架构的CAMP均持续优于强基线模型，同时比大多数竞争性多智能体方法消耗更少的计算量，其投票记录和仲裁轨迹为决策过程提供了透明的审计依据。

Abstract

Large language models applied to clinical prediction exhibit case-level heterogeneity: simple cases yield consistent outputs, while complex cases produce divergent predictions under minor prompt changes. Existing single-agent strategies sample from one role-conditioned distribution, and multi-agent frameworks use fixed roles with flat majority voting, discarding the diagnostic signal in disagreement. We propose CAMP (Case-Adaptive Multi-agent Panel), where an attending-physician agent dynamically assembles a specialist panel tailored to each case’s diagnostic uncertainty. Each specialist evaluates candidates via three-valued voting (KEEP/REFUSE/NEUTRAL), enabling principled abstention outside one’s expertise. A hybrid router directs each diagnosis through strong consensus, fallback to the attending physician’s judgment, or evidence-based arbitration that weighs argument quality over vote counts. On diagnostic prediction and brief hospital course generation from MIMIC-IV across four LLM backbones, CAMP consistently outperforms strong baselines while consuming fewer tokens than most competing multi-agent methods, with voting records and arbitration traces offering transparent decision audits.

Keywords: Large Language Models, Clinical Prediction, Multi-agent Systems, Case-Adaptive, Diagnostic Uncertainty, Transparent Decision Audits, MIMIC-IV, CAMP Framework
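
The three-valued voting and hybrid routing described in the CAMP abstract can be sketched as a small decision function. The abstract names the three routes but not the rules that choose between them, so the conditions below (including the `min_support` parameter and the route labels) are assumptions for illustration only, not the authors' criteria.

```python
from collections import Counter
from enum import Enum


class Vote(Enum):
    KEEP = "KEEP"
    REFUSE = "REFUSE"
    NEUTRAL = "NEUTRAL"  # abstention outside the specialist's expertise


def route(votes, min_support: int = 2) -> str:
    """Pick a route for one candidate diagnosis (hypothetical rules)."""
    counts = Counter(votes)
    keep, refuse = counts[Vote.KEEP], counts[Vote.REFUSE]

    # Specialists who voted disagree: weigh the arguments, not the counts.
    if keep > 0 and refuse > 0:
        return "arbitration"
    # Enough specialists agree and none object: accept the consensus.
    if keep >= min_support:
        return "consensus_keep"
    if refuse >= min_support:
        return "consensus_refuse"
    # Too few opinions either way: defer to the attending physician.
    return "attending_fallback"
```

Keeping `NEUTRAL` separate from `REFUSE` is what lets abstentions reduce the evidence for a consensus without counting against the candidate, which flat majority voting cannot express.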

155. ❌ Terminal Agents Suffice for Enterprise Automation

Authors: Patrice Bechard, Orlando Marquez Ayala, Emily Chen, Jordan Skelton, Sagar Davasam, Srinivas Sunkara, Vikas Yadav, Sai Rajeswar Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2604.00073v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

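Scoring rationale: The paper studies agent architectures for enterprise automation. Its core involves LLM Agents and Tool Use / API Tool Use (interacting directly with platform APIs through a terminal), so it is highly relevant to both keywords (10 points each). It mentions "strong foundation models", which makes it relevant to Large Language Models / Foundation Models (10 points). Other keywords, such as MoE, Scaling Laws, training methods, inference optimization, and scientific AI applications, are not covered by the paper (0 points).

!!! tip deepseek-chat TL;DR

The paper examines whether complex agent architectures are necessary for enterprise automation. It finds that a coding agent equipped only with a terminal and a filesystem, interacting directly with platform APIs, solves many enterprise tasks more effectively than more complex agent architectures.

Abstract (Chinese translation)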

近年来，构建能够与数字平台交互以自主执行有意义的企业任务的智能体日益受到关注。目前已探索的方法包括基于模型上下文协议（Model Context Protocol，MCP）等抽象框架构建的工具增强型智能体，以及通过图形界面操作的网页智能体。然而，考虑到其成本与运维开销，此类复杂智能体系统是否确有必要仍不明确。我们认为，仅配备终端和文件系统的编程智能体通过直接与平台API交互，能够更有效地解决众多企业任务。我们在多样化的真实世界系统中验证了这一假设，并证明此类底层终端智能体的表现与更复杂的智能体架构相当或更优。我们的研究结果表明，简单的程序化接口结合强大的基础模型，已足以实现实际的企业自动化需求。

Abstract

There has been growing interest in building agents that can interact with digital platforms to execute meaningful enterprise tasks autonomously. Among the approaches explored are tool-augmented agents built on abstractions such as Model Context Protocol (MCP) and web agents that operate through graphical interfaces. Yet, it remains unclear whether such complex agentic systems are necessary given their cost and operational overhead. We argue that a coding agent equipped only with a terminal and a filesystem can solve many enterprise tasks more effectively by interacting directly with platform APIs. We evaluate this hypothesis across diverse real-world systems and show that these low-level terminal agents match or outperform more complex agent architectures. Our findings suggest that simple programmatic interfaces, combined with strong foundation models, are sufficient for practical enterprise automation.

Keywords: enterprise automation, LLM agents, terminal agents, API interaction, coding agent, agentic systems, foundation models, programmatic interfaces

156. ❌ TRACE: High-Fidelity 3D Scene Editing via Tangible Reconstruction and Geometry-Aligned Contextual Video Masking

Authors: Jiyuan Hu, Zechuan Zhang, Zongxin Yang, Yi Yang Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01207v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

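Scoring rationale: TRACE focuses on 3D scene editing, using 3D Gaussian Splatting (3DGS) and video diffusion models for high-fidelity scene transformation. Its core areas are computer vision, 3D reconstruction, and video generation. All scoring keywords concern large language models (LLMs) and related techniques (training methods, inference optimization, alignment, agent systems, and so on) or specific scientific AI applications (such as bioinformatics). The paper does not mention LLMs, the technical principles of deep learning models, or AI applications in scientific domains, and is unrelated to every keyword, so all keywords score 0.

!!! tip deepseek-chat TL;DR

The paper proposes TRACE, a framework that uses 3D geometry anchoring and contextual video masking to address the shortcomings of existing 3D scene editing methods in fine-grained local manipulation and structure preservation. It achieves high-fidelity, spatio-temporally stable automated 3D scene editing.

Abstract (Chinese translation)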

本文提出TRACE，一种基于网格引导的3D高斯泼溅（3DGS）编辑框架，能够实现自动化、高保真的场景变换。通过将视频扩散模型与显式三维几何结构锚定，TRACE首次实现了细粒度的部件级操控（例如局部姿态调整或组件替换），同时保持核心主体的结构完整性，这一能力在现有编辑方法中普遍缺失。我们的方法包含三个关键阶段：（1）多视图三维锚点合成：利用在我们构建的MV-TRACE数据集（首个专注于场景连贯物体添加与修改的多视图一致性数据集）上训练的稀疏视图编辑器，生成空间一致的三维锚点；（2）实体几何锚定：通过两阶段配准确保插入网格与3DGS场景间的精确空间同步；（3）上下文视频掩蔽：将三维投影整合至自回归视频生成流程，实现时间稳定、符合物理规律的渲染效果。大量实验表明，TRACE在编辑多样性与结构完整性方面均显著优于现有方法。

Abstract

We present TRACE, a mesh-guided 3DGS editing framework that achieves automated, high-fidelity scene transformation. By anchoring video diffusion with explicit 3D geometry, TRACE uniquely enables fine-grained, part-level manipulation, such as local pose shifting or component replacement, while preserving the structural integrity of the central subject, a capability largely absent in existing editing methods. Our approach comprises three key stages: (1) Multi-view 3D-Anchor Synthesis, which leverages a sparse-view editor trained on our MV-TRACE dataset (the first multi-view consistent dataset dedicated to scene-coherent object addition and modification) to generate spatially consistent 3D-anchors; (2) Tangible Geometry Anchoring (TGA), which ensures precise spatial synchronization between inserted meshes and the 3DGS scene via two-phase registration; and (3) Contextual Video Masking (CVM), which integrates 3D projections into an autoregressive video pipeline to achieve temporally stable, physically-grounded rendering. Extensive experiments demonstrate that TRACE consistently outperforms existing methods, especially in editing versatility and structural integrity.

Keywords: 3D scene editing, 3D Gaussian Splatting, mesh-guided framework, tangible geometry anchoring, contextual video masking, multi-view consistency, structural integrity, video diffusion

157. ❌ Open-Set Supervised 3D Anomaly Detection: An Industrial Dataset and a Generalisable Framework for Unknown Defects

Authors: Hanzhe Liang, Luocheng Zhang, Junyang Xia, HanLiang Zhou, Bingyang Guo, Yingxi Xie, Can Gao, Ruiyun Yu, Jinbao Wang, Pan Li Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01171v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于3D点云异常检测，提出了Open-Industry数据集和Open3D-AD框架，属于计算机视觉和工业AI应用领域。论文内容与绝大多数关键词（如LLM、MoE、RLHF、RAG等）完全无关，因为这些关键词涉及大语言模型、训练技术、推理优化、智能体等特定方向。唯一略有相关的是’AI for Science OR Bioinformatics OR Cheminformatics’，因为论文涉及工业制造中的AI应用（可视为AI在工业科学领域的应用），但并非核心生物信息学或化学信息学，因此给5分（有一定关联）。加权总分计算为5.0分，远低于动态及格分26.6分。

!!! tip deepseek-chat TL;DR

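The paper studies open-set supervised 3D anomaly detection and proposes the Open-Industry dataset and the Open3D-AD framework, which models the probability density distributions of normal and anomalous data to identify unknown defects effectively, and validates its performance on multiple datasets.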

Abstract translation (Chinese)

尽管自监督三维异常检测通常认为获取高精度点云的计算成本高昂，但在实际制造场景中，收集有限数量的异常样本往往是可行的。因此，我们研究开放集监督三维异常检测，该模型仅使用正常样本和少量已知异常样本进行训练，旨在测试时识别未知异常。我们提出了Open-Industry——一个高质量的工业数据集，包含15个类别，每个类别包含从生产线上收集的五种真实异常类型。我们首先将通用的开放集异常检测方法进行调整，以更好地适应三维点云输入。在此基础上，我们提出了Open3D-AD，这是一种面向点云的方法，利用正常样本、模拟异常和部分观测到的真实异常来建模正常数据与异常数据的概率密度分布。随后，我们引入了一种简单的对应分布子采样方法，以减少正常分布与非正常分布之间的重叠，从而实现更强的双分布建模。基于这些贡献，我们建立了一个全面的基准测试，并在Open-Industry以及现有数据集（包括Real3D-AD和Anomaly-ShapeNet）上对所提方法进行了广泛评估。基准测试结果与消融研究证明了Open3D-AD的有效性，并进一步揭示了开放集监督三维异常检测的潜力。

Abstract

Although self-supervised 3D anomaly detection assumes that acquiring high-precision point clouds is computationally expensive, in real manufacturing scenarios it is often feasible to collect a limited number of anomalous samples. Therefore, we study open-set supervised 3D anomaly detection, where the model is trained with only normal samples and a small number of known anomalous samples, aiming to identify unknown anomalies at test time. We present Open-Industry, a high-quality industrial dataset containing 15 categories, each with five real anomaly types collected from production lines. We first adapt general open-set anomaly detection methods to accommodate 3D point cloud inputs better. Building upon this, we propose Open3D-AD, a point-cloud-oriented approach that leverages normal samples, simulated anomalies, and partially observed real anomalies to model the probability density distributions of normal and anomalous data. Then, we introduce a simple Correspondence Distributions Subsampling to reduce the overlap between normal and non-normal distributions, enabling stronger dual distributions modeling. Based on these contributions, we establish a comprehensive benchmark and evaluate the proposed method extensively on Open-Industry as well as established datasets including Real3D-AD and Anomaly-ShapeNet. Benchmark results and ablation studies demonstrate the effectiveness of Open3D-AD and further reveal the potential of open-set supervised 3D anomaly detection.

Keywords: 3D anomaly detection, open-set supervised, point cloud, industrial dataset, Open-Industry, Open3D-AD, probability density distributions, unknown defects

158. ❌ Toward Personalized Darts Training: A Data-Driven Framework Based on Skeleton-Based Biomechanical Analysis and Motion Modeling

Authors: Zhantao Chen, Dongyi He, Jin Fang, Xi Chen, Yisuo Liu, Xiaozhen Zhong, Xuejun Hu Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01130v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

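Scoring rationale: The paper presents a data-driven framework for personalized darts training based on skeleton-based biomechanical analysis and motion modeling, at the intersection of computer vision, motion analysis, and biomechanics. It does not involve large language models, the principles of deep learning techniques, or novel applications of large models in other domains. Of all the keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" has some relevance (5 points), because the paper applies AI to sports science; however, the paper does not use large models or deep learning, relying instead on traditional computer vision and data analysis methods. The other 26 keywords all relate to large-model techniques and are entirely unrelated to the paper (0 points).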

!!! tip deepseek-chat TL;DR

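The paper proposes a data-driven framework for personalized darts training based on skeleton-based biomechanical analysis and motion modeling. By generating personalized optimal throwing trajectories and a motion deviation diagnosis model, it shifts evaluation from deviation from a uniform standard to deviation from an individual's optimal control range.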

Abstract translation (Chinese)

随着体育训练日益数据化，主要依赖经验和视觉观察的传统飞镖训练方法已难以满足高精度、目标导向型运动的需求。尽管先前研究强调了出手参数、关节运动与协调性在飞镖投掷中的重要性，但多数量化方法仍局限于局部变量、单一出手指标或静态模板匹配。这些方法对个性化训练的支持有限，且常忽略有用的动作变异性。本文提出一种数据驱动的飞镖训练辅助系统。该系统构建了一个涵盖动作捕捉、特征建模与个性化反馈的闭环框架。研究使用Kinect 2.0深度传感器与光学相机在无标记条件下采集飞镖投掷数据，从三环节协调性、出手速度、多关节角度配置和姿势稳定性四个生物力学维度提取了十八项运动学特征。系统开发了两个核心模块：一是融合历史高质量样本与最小加加速度准则的个性化最优投掷轨迹模型，二是基于Z分数与分层逻辑的动作偏差诊断与建议模型。研究共采集了专业与非专业运动员的2,396次投掷样本。结果表明，该系统能生成符合人体自然运动的平滑个性化参考轨迹。案例研究表明，系统可检测躯干稳定性不足、肘部位移异常及速度控制失衡等问题，并提供针对性建议。该框架将飞镖评估从偏离统一标准转向偏离个体最优控制范围，提升了飞镖及其他高精度目标运动训练的个性化程度与可解释性。

Abstract

As sports training becomes more data-driven, traditional dart coaching based mainly on experience and visual observation is increasingly inadequate for high-precision, goal-oriented movements. Although prior studies have highlighted the importance of release parameters, joint motion, and coordination in dart throwing, most quantitative methods still focus on local variables, single-release metrics, or static template matching. These approaches offer limited support for personalized training and often overlook useful movement variability. This paper presents a data-driven dart training assistance system. The system creates a closed-loop framework spanning motion capture, feature modeling, and personalized feedback. Dart-throwing data were collected in markerless conditions using a Kinect 2.0 depth sensor and an optical camera. Eighteen kinematic features were extracted from four biomechanical dimensions: three-link coordination, release velocity, multi-joint angular configuration, and postural stability. Two modules were developed: a personalized optimal throwing trajectory model that combines historical high-quality samples with the minimum jerk criterion, and a motion deviation diagnosis and recommendation model based on z-scores and hierarchical logic. A total of 2,396 throwing samples from professional and non-professional athletes were collected. Results show that the system generates smooth personalized reference trajectories consistent with natural human movement. Case studies indicate that it can detect poor trunk stability, abnormal elbow displacement, and imbalanced velocity control, then provide targeted recommendations. The framework shifts dart evaluation from deviation from a uniform standard to deviation from an individual’s optimal control range, improving personalization and interpretability for darts training and other high-precision target sports.

Keywords: personalized darts training, skeleton-based biomechanical analysis, motion modeling, data-driven framework, kinematic features, optimal throwing trajectory, motion deviation diagnosis, sports training

159. ❌ ReinDriveGen: Reinforcement Post-Training for Out-of-Distribution Driving Scene Generation

Authors: Hao Zhang, Lue Fan, Weikang Bian, Zehuan Wu, Lewei Lu, Zhaoxiang Zhang, Hongsheng Li Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01129v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

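Scoring rationale: The paper focuses on autonomous driving scene generation. Its core contribution is an RL-based post-training strategy that improves the generation quality of a video diffusion model under out-of-distribution conditions. This is highly relevant to the keyword "Post-training OR Supervised Fine-tuning OR SFT" (10 points), because the paper explicitly proposes a post-training method. None of the other keywords appear in the title or abstract, and they are unrelated to the paper's focus on computer vision, autonomous driving, and reinforcement learning applications, so they receive 0 points.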

!!! tip deepseek-chat TL;DR

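The paper proposes the ReinDriveGen framework, which uses an RL-based post-training strategy to address the challenge of editing and generating realistic driving videos under out-of-distribution conditions, and achieves state-of-the-art results on novel ego viewpoint synthesis.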

Abstract translation (Chinese)

本文提出ReinDriveGen框架，该框架实现了对动态驾驶场景的完全可控性，允许用户自由编辑交通参与者的运动轨迹，以模拟安全关键性边缘场景，如前车碰撞、车辆漂移、失控旋转、行人乱穿马路及自行车切道等。我们的方法从多帧激光雷达（LiDAR）数据构建动态三维点云场景，引入车辆补全模块以从局部观测重建完整的360°几何结构，并将编辑后的场景渲染为二维条件图像，用以指导视频扩散模型合成逼真的驾驶视频。由于此类编辑场景必然超出训练数据分布范围，我们进一步提出基于强化学习（RL）的后训练策略，结合成对偏好模型与成对奖励机制，从而在无真实标注监督的情况下，实现对分布外场景的鲁棒性质量提升。大量实验表明，ReinDriveGen在编辑驾驶场景上优于现有方法，并在新颖的自主视角合成任务中取得了最先进的性能。

Abstract

We present ReinDriveGen, a framework that enables full controllability over dynamic driving scenes, allowing users to freely edit actor trajectories to simulate safety-critical corner cases such as front-vehicle collisions, drifting cars, vehicles spinning out of control, pedestrians jaywalking, and cyclists cutting across lanes. Our approach constructs a dynamic 3D point cloud scene from multi-frame LiDAR data, introduces a vehicle completion module to reconstruct full 360° geometry from partial observations, and renders the edited scene into 2D condition images that guide a video diffusion model to synthesize realistic driving videos. Since such edited scenarios inevitably fall outside the training distribution, we further propose an RL-based post-training strategy with a pairwise preference model and a pairwise reward mechanism, enabling robust quality improvement under out-of-distribution conditions without ground-truth supervision. Extensive experiments demonstrate that ReinDriveGen outperforms existing approaches on edited driving scenarios and achieves state-of-the-art results on novel ego viewpoint synthesis.

Keywords: driving scene generation, reinforcement learning, post-training, out-of-distribution, video diffusion model, point cloud, LiDAR, autonomous driving

160. ❌ ProTPS: Prototype-Guided Text Prompt Selection for Continual Learning

Authors: Jie Mei, Li-Leng Peng, Keith Fuller, Jenq-Neng Hwang Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01116v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

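Scoring rationale: The paper "ProTPS: Prototype-Guided Text Prompt Selection for Continual Learning" focuses on text-prompt-based methods for continual learning. It proposes a prototype-guided text prompt selection technique and evaluates it in class-incremental and cross-dataset continual learning settings as well as on a real-world marine species dataset. Its core is continual learning within computer vision and machine learning, specifically text-prompt-based methods for mitigating catastrophic forgetting. All of the given keywords relate directly to large models (LLMs), the principles of deep learning techniques, or AI applications in science, whereas this work belongs to traditional computer vision and machine learning and does not involve large-model techniques, innovations in deep learning principles, or AI applications in science (such as bioinformatics). Therefore, all keywords receive a relevance of 0.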

!!! tip deepseek-chat TL;DR

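The paper proposes Prototype-guided Text Prompt Selection (ProTPS) to address catastrophic forgetting in continual learning. Experiments in class-incremental and cross-dataset settings and on a real-world marine species dataset show that it outperforms existing methods.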

Abstract translation (Chinese)

在持续学习中，基于文本提示的方法利用文本编码器和可学习的提示词，对随时间顺序到达的类别进行语义特征编码。现有研究面临的一个共同挑战是如何学习独特的文本提示词——这些提示词隐式地承载新类别的语义信息，从而使新到达类别的语义特征与已训练类别的特征不发生重叠，进而缓解灾难性遗忘问题。为应对这一挑战，我们提出了一种新颖的方法“原型引导的文本提示选择（Prototype-guided Text Prompt Selection, ProTPS）”，旨在有意识地增加训练灵活性，从而促进独特文本提示词的学习。具体而言，我们的ProTPS学习类别特定的视觉原型和文本提示词。视觉原型引导每个类别的文本提示词选择与学习。我们首先在类别增量（Class Incremental, CI）设定和跨数据集持续（Cross-Datasets Continual, CDC）学习设定下评估ProTPS。由于我们的方法达到了接近理论上限的性能，我们进一步收集了一个真实世界数据集Marine112，其中包含在六年时间跨度内采集的112种海洋物种，旨在为该领域带来新的挑战。Marine112真实地适用于类别与域增量（Class and Domain Incremental, CDI）学习设定，且处于自然长尾分布状态。在三种设定下的实验结果表明，我们的ProTPS相较于近期先进方法具有优越性能。本论文的实现代码与Marine112数据集将在论文录用后公开。

Abstract

For continual learning, text-prompt-based methods leverage text encoders and learnable prompts to encode semantic features for sequentially arrived classes over time. A common challenge encountered by existing works is how to learn unique text prompts, which implicitly carry semantic information of new classes, so that the semantic features of newly arrived classes do not overlap with those of trained classes, thereby mitigating the catastrophic forgetting problem. To address this challenge, we propose a novel approach "Prototype-guided Text Prompt Selection (ProTPS)" to intentionally increase the training flexibility thus encouraging the learning of unique text prompts. Specifically, our ProTPS learns class-specific vision prototypes and text prompts. Vision prototypes guide the selection and learning of text prompts for each class. We first evaluate our ProTPS in both class incremental (CI) setting and cross-datasets continual (CDC) learning setting. Because our ProTPS achieves performance close to the upper bounds, we further collect a real-world dataset with 112 marine species collected over a span of six years, named Marine112, to bring new challenges to the community. Marine112 is authentically suited for the class and domain incremental (CDI) learning setting and is under natural long-tail distribution. The results under three settings show that our ProTPS performs favorably against the recent state-of-the-art methods. The implementation code and Marine112 dataset will be released upon the acceptance of our paper.

Keywords: Continual Learning, Text Prompt Selection, Prototype-Guided, Catastrophic Forgetting, Class Incremental Learning, Cross-Datasets Continual Learning, Marine112 Dataset, Vision Prototypes

161. ❌ ReMoGen: Real-time Human Interaction-to-Reaction Generation via Modular Learning from Diverse Data

Authors: Yaoqin Ye, Yiteng Xu, Qin Sun, Xinge Zhu, Yujing Sun, Yuexin Ma Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01082v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

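Scoring rationale: ReMoGen focuses on human motion generation in computer vision and graphics, specifically real-time interaction-to-reaction generation. Although it uses deep learning techniques (such as a modular learning framework, a motion prior, and Meta-Interaction modules), its core content is entirely unrelated to the provided keyword list. All of the keywords are specific to large language models (LLMs) and related techniques (such as training methods, inference optimization, alignment, and agent systems), whereas this paper studies motion generation from visual and geometric inputs and involves no language models, text processing, or LLM techniques. Therefore, all keywords receive a relevance of 0.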

!!! tip deepseek-chat TL;DR

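The paper proposes the ReMoGen framework to address the challenges of fragmented data and low latency in real-time human interaction-to-reaction generation, using modular learning and frame-wise refinement to produce high-quality, coherent, and responsive motion.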

Abstract translation (Chinese)

现实环境中的人类行为本质上是交互性的，个体的运动由周围智能体与场景共同塑造。这种能力对于虚拟化身、交互式动画和人机协作等应用至关重要。本研究聚焦实时人类交互至反应生成任务，即从动态多源线索（包括他人动作、场景几何结构及可选的高层语义输入）中生成主体未来的运动。该任务面临两大根本性挑战：（一）交互数据有限且碎片化，分散在异构的单人、人-人及人-场景数据域中；（二）需要在持续在线交互过程中生成低延迟且高保真的运动响应。为应对这些挑战，我们提出ReMoGen（反应运动生成框架），这是一个面向实时交互至反应生成的模块化学习框架。ReMoGen利用从大规模单人运动数据集中学习的通用运动先验，并通过独立训练的元交互模块将其适配至目标交互域，从而在数据稀缺和异构监督条件下实现鲁棒的泛化能力。为支持响应式在线交互，ReMoGen采用片段级生成策略，并配备轻量级的帧级片段优化模块，该模块能在帧级别融入新观测到的交互线索，在不依赖昂贵全序列推理的前提下，同步提升系统响应速度与时间连贯性。在人-人、人-场景及多模态交互场景中的大量实验表明，ReMoGen能够生成高质量、连贯且响应迅速的运动反应，并在多样化的交互情境中展现出有效的泛化能力。

Abstract

Human behaviors in real-world environments are inherently interactive, with an individual’s motion shaped by surrounding agents and the scene. Such capabilities are essential for applications in virtual avatars, interactive animation, and human-robot collaboration. We target real-time human interaction-to-reaction generation, which generates the ego’s future motion from dynamic multi-source cues, including others’ actions, scene geometry, and optional high-level semantic inputs. This task is fundamentally challenging due to (i) limited and fragmented interaction data distributed across heterogeneous single-person, human-human, and human-scene domains, and (ii) the need to produce low-latency yet high-fidelity motion responses during continuous online interaction. To address these challenges, we propose ReMoGen (Reaction Motion Generation), a modular learning framework for real-time interaction-to-reaction generation. ReMoGen leverages a universal motion prior learned from large-scale single-person motion datasets and adapts it to target interaction domains through independently trained Meta-Interaction modules, enabling robust generalization under data-scarce and heterogeneous supervision. To support responsive online interaction, ReMoGen performs segment-level generation together with a lightweight Frame-wise Segment Refinement module that incorporates newly observed cues at the frame level, improving both responsiveness and temporal coherence without expensive full-sequence inference. Extensive experiments across human-human, human-scene, and mixed-modality interaction settings show that ReMoGen produces high-quality, coherent, and responsive reactions, while generalizing effectively across diverse interaction scenarios.

Keywords: human motion generation, interaction-to-reaction, real-time generation, modular learning, motion prior, online interaction, heterogeneous data, frame-wise refinement

162. ❌ ProOOD: Prototype-Guided Out-of-Distribution 3D Occupancy Prediction

Authors: Yuheng Zhang, Mengfei Duan, Kunyu Peng, Yuhang Wang, Di Wen, Danda Pani Paudel, Luc Van Gool, Kailun Yang Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01081v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

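Scoring rationale: ProOOD focuses on 3D semantic occupancy prediction and OOD detection, which belong to computer vision and autonomous driving. It does not involve large language models, innovations in the principles of deep learning techniques, or AI-for-science applications. All of the keywords relate to large models, deep learning techniques, or AI for Science, whereas this paper studies a 3D vision task, so all keywords receive a relevance of 0.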

!!! tip deepseek-chat TL;DR

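The paper proposes ProOOD, a lightweight method that uses prototype-guided semantic imputation, tail mining, and EchoOOD scoring to address long-tailed class bias and OOD inputs in 3D semantic occupancy prediction for autonomous driving, achieving state-of-the-art performance on multiple datasets.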

Abstract translation (Chinese)

三维语义占据预测是自动驾驶领域的核心任务，但现有方法易受长尾类别偏差和分布外（OOD）输入的影响，常将异常点过度自信地归类为稀有类别。本文提出ProOOD，一种轻量级即插即用方法，它将原型引导的优化与免训练的OOD评分机制相结合。ProOOD包含三个部分：（i）原型引导的语义填补，利用类别一致的特征填充被遮挡区域；（ii）原型引导的尾部挖掘，通过增强稀有类别的表征来抑制OOD吸收；（iii）EchoOOD，该方法融合局部逻辑一致性与局部及全局原型匹配，以生成可靠的体素级OOD评分。在五个数据集上的大量实验表明，ProOOD在分布内三维占据预测和OOD检测任务上均达到了最先进的性能。在SemanticKITTI数据集上，其整体平均交并比（mIoU）超越基线方法+3.57%，尾部类别mIoU提升+24.80%；在VAA-KITTI数据集上，其AuPRCr指标提高了19.34个百分点，且在各基准测试中均取得稳定增益。这些改进为安全关键的城市驾驶场景提供了更校准的占据估计和更可靠的OOD检测能力。源代码已公开于https://github.com/7uHeng/ProOOD。

Abstract

3D semantic occupancy prediction is central to autonomous driving, yet current methods are vulnerable to long-tailed class bias and out-of-distribution (OOD) inputs, often overconfidently assigning anomalies to rare classes. We present ProOOD, a lightweight, plug-and-play method that couples prototype-guided refinement with training-free OOD scoring. ProOOD comprises (i) prototype-guided semantic imputation that fills occluded regions with class-consistent features, (ii) prototype-guided tail mining that strengthens rare-class representations to curb OOD absorption, and (iii) EchoOOD, which fuses local logit coherence with local and global prototype matching to produce reliable voxel-level OOD scores. Extensive experiments on five datasets demonstrate that ProOOD achieves state-of-the-art performance on both in-distribution 3D occupancy prediction and OOD detection. On SemanticKITTI, it surpasses baselines by +3.57% mIoU overall and +24.80% tail-class mIoU; on VAA-KITTI, it improves AuPRCr by +19.34 points, with consistent gains across benchmarks. These improvements yield more calibrated occupancy estimates and more reliable OOD detection in safety-critical urban driving. The source code is publicly available at https://github.com/7uHeng/ProOOD.

Keywords: 3D semantic occupancy prediction, out-of-distribution detection, prototype-guided refinement, autonomous driving, long-tailed class bias, voxel-level OOD scoring, SemanticKITTI, VAA-KITTI

163. ❌ A global dataset of continuous urban dashcam driving

Authors: Md Shadab Alam, Olena Bazilinska, Pavlo Bazilinskyy Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01044v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

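Scoring rationale: The paper introduces CROWD, a global urban dashcam dataset, and focuses on data collection, annotation, and benchmarking for computer vision tasks such as object detection and tracking. All of the scoring keywords concern large models, the principles of deep learning techniques, or novel AI applications in science; this paper touches on none of these topics, so all keywords receive a relevance of 0.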

!!! tip deepseek-chat TL;DR

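The study builds CROWD, a global urban dashcam dataset containing more than 20,000 hours of continuous driving video segments, to support cross-domain robustness and interaction analysis, and provides machine-generated detection and tracking annotations to lower the barrier for benchmarking.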

Abstract translation (Chinese)

我们推出CROWD（City Road Observations With Dashcams）数据集，这是一个从公开YouTube视频中人工筛选并分割出的、由普通分钟级、时间连续、未经编辑、前置视角的城市行车记录仪片段构成的手工标注数据集。CROWD旨在通过优先选择日常驾驶场景，并明确排除碰撞事故、事故后果及其他经过编辑或聚焦突发事件的内容，以支持跨领域鲁棒性与交互分析。该版本包含51,753条片段记录，总时长20,275.56小时（对应42,032个视频），覆盖全球六大有人居住大洲（非洲、亚洲、欧洲、北美洲、南美洲和大洋洲）238个国家及地区的7,103个命名居民点，每条片段均含人工标注的时段（日间/夜间）与车辆类型标签。为降低基准测试门槛，我们提供了基于YOLOv11x模型生成的、涵盖全部80个MS-COCO类别的逐片段机器检测CSV文件（例如行人、自行车、摩托车、小汽车、公交车、卡车、交通信号灯、停车标志等），并附有片段级多目标跟踪数据（BoT-SORT）。CROWD以视频标识符配合片段时间边界及衍生注释的形式发布，可在不重新分发原始视频的前提下支持可复现的研究。

Abstract

We introduce CROWD (City Road Observations With Dashcams), a manually curated dataset of ordinary, minute scale, temporally contiguous, unedited, front facing urban dashcam segments screened and segmented from publicly available YouTube videos. CROWD is designed to support cross-domain robustness and interaction analysis by prioritising routine driving and explicitly excluding crashes, crash aftermath, and other edited or incident-focused content. The release contains 51,753 segment records spanning 20,275.56 hours (42,032 videos), covering 7,103 named inhabited places in 238 countries and territories across all six inhabited continents (Africa, Asia, Europe, North America, South America and Oceania), with segment level manual labels for time of day (day or night) and vehicle type. To lower the barrier for benchmarking, we provide per-segment CSV files of machine-generated detections for all 80 MS-COCO classes produced with YOLOv11x, together with segment-local multi-object tracks (BoT-SORT); e.g. person, bicycle, motorcycle, car, bus, truck, traffic light, stop sign, etc. CROWD is distributed as video identifiers with segment boundaries and derived annotations, enabling reproducible research without redistributing the underlying videos.

Keywords: dashcam dataset, urban driving, computer vision, object detection, multi-object tracking, cross-domain robustness, benchmarking, YOLOv11x

164. ❌ ONE-SHOT: Compositional Human-Environment Video Synthesis via Spatial-Decoupled Motion Injection and Hybrid Context Integration

Authors: Fengyuan Yang, Luying Huang, Jiazhi Guan, Quanwei Yang, Dongwei Pan, Jianglin Fu, Haocheng Feng, Wei He, Kaisiyuan Wang, Hang Zhou, Angela Yao Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01043v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

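Scoring rationale: The paper studies human-environment synthesis with Video Foundation Models (VFMs), which belongs to computer vision and video generation. It has no direct connection to the large language models (LLMs) and related techniques in the keyword list (such as MoE, Scaling Laws, RLHF, and RAG). The only relevant keyword is "PEFT OR LoRA OR Parameter-efficient Fine-tuning", because the paper mentions a "parameter-efficient framework" but does not explicitly use LoRA or another specific PEFT technique, so it receives 5 points (some relevance). The other keywords are not involved and receive 0 points.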

!!! tip deepseek-chat TL;DR

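The paper proposes the ONE-SHOT framework, which uses spatial-decoupled motion injection and hybrid context integration to address the challenge of fine-grained, independent editing of subjects and scenes with video foundation models, achieving better structural control and generative diversity.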

Abstract translation (Chinese)

视频基础模型（VFMs）的最新进展彻底改变了以人为中心的视频合成领域，然而对主体与场景进行细粒度且独立的编辑仍是一个关键挑战。近期尝试通过刚性的三维几何组合来融入更丰富的环境控制，往往在精确控制与生成灵活性之间面临显著权衡。此外，繁重的三维预处理仍限制了实际应用的可扩展性。本文提出ONE-SHOT，一个参数高效的组合式人-环境视频生成框架。我们的核心见解是将生成过程分解为解耦的信号。具体而言，我们引入了一种规范空间注入机制，通过交叉注意力将人体动态与环境线索分离。我们还提出了动态接地旋转位置编码（Dynamic-Grounded-RoPE），这是一种新颖的位置嵌入策略，可在无需任何启发式三维对齐的情况下，在不同空间域之间建立空间对应关系。为支持长时序合成，我们引入了混合上下文集成机制，以在分钟级生成过程中保持主体与场景的一致性。实验表明，我们的方法显著优于现有先进技术，为视频合成提供了卓越的结构控制与创意多样性。项目已发布于：https://martayang.github.io/ONE-SHOT/。

Abstract

Recent advances in Video Foundation Models (VFMs) have revolutionized human-centric video synthesis, yet fine-grained and independent editing of subjects and scenes remains a critical challenge. Recent attempts to incorporate richer environment control through rigid 3D geometric compositions often encounter a stark trade-off between precise control and generative flexibility. Furthermore, the heavy 3D pre-processing still limits practical scalability. In this paper, we propose ONE-SHOT, a parameter-efficient framework for compositional human-environment video generation. Our key insight is to factorize the generative process into disentangled signals. Specifically, we introduce a canonical-space injection mechanism that decouples human dynamics from environmental cues via cross-attention. We also propose Dynamic-Grounded-RoPE, a novel positional embedding strategy that establishes spatial correspondences between disparate spatial domains without any heuristic 3D alignments. To support long-horizon synthesis, we introduce a Hybrid Context Integration mechanism to maintain subject and scene consistency across minute-level generations. Experiments demonstrate that our method significantly outperforms state-of-the-art methods, offering superior structural control and creative diversity for video synthesis. Our project has been available on: https://martayang.github.io/ONE-SHOT/.

Keywords: Video Foundation Models, human-environment video synthesis, spatial-decoupled motion injection, canonical-space injection, Dynamic-Grounded-RoPE, Hybrid Context Integration, parameter-efficient framework, compositional video generation

165. ❌ Foundation Model-guided Iteratively Prompting and Pseudo-Labeling for Partially Labeled Medical Image Segmentation

Authors: Qiaochu Zhao, Wei Wei, David Horowitz, Richard Bakst, Yading Yuan Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01038v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

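Scoring rationale: The paper addresses the partial-label problem in medical image segmentation and proposes an iterative framework that combines a foundation model with a trainable segmentation network. It is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points) because it explicitly uses a frozen foundation model as the generalist. It is also highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points) because it focuses on medical image segmentation, a bioinformatics / AI-for-science application. The remaining keywords (MoE, SLMs, Scaling Laws, the various training techniques, inference optimization, agent systems, and so on) are not covered in the paper and therefore score 0.

!!! tip deepseek-chat TL;DR

To address the performance loss caused by partial labels in medical image segmentation, the paper proposes IPnP, an iterative prompting and pseudo-labeling framework in which a trainable segmentation network collaborates with a frozen foundation model. On the public AMOS dataset it consistently improves segmentation over prior methods and approaches the fully labeled reference; it is also evaluated on a private dataset of 210 head-and-neck cancer patients.

Abstract translation (Chinese)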

自动化医学图像分割在拥有全标注数据的情况下已取得显著进展。然而，特定临床站点的优先需求与高昂的人工标注成本常导致扫描图像仅标注了部分器官，从而产生部分标注问题并降低模型性能。为解决此问题，我们提出IPnP（迭代提示与伪标注框架），用于部分标注的医学图像分割。IPnP通过可训练的分割网络（专家模型）与冻结的基础模型（通用模型）协同合作，迭代生成并优化未标注器官的伪标签，逐步恢复全器官监督。在采用模拟部分标注设置的公开数据集AMOS上，IPnP相较于现有方法持续提升分割性能，并接近全标注基准模型的表现。我们进一步在包含210例头颈癌患者的私有部分标注数据集上进行评估，验证了本方法在真实临床场景中的有效性。

Abstract

Automated medical image segmentation has achieved remarkable progress with fully labeled data. However, site-specific clinical priorities and the high cost of manual annotation often yield scans with only a subset of organs labeled, leading to the partially labeled problem that degrades performance. To address this issue, we propose IPnP, an Iteratively Prompting and Pseudo-labeling framework, for partially labeled medical image segmentation. IPnP iteratively generates and refines pseudo-labels for unlabeled organs through collaboration between a trainable segmentation network (specialist) and a frozen foundation model (generalist), progressively recovering full-organ supervision. On the public dataset AMOS with the simulated partial-label setting, IPnP consistently improves segmentation performance over prior methods and approaches the performance of the fully labeled reference. We further evaluate on a private, partially labeled dataset of 210 head-and-neck cancer patients and demonstrate our effectiveness in real-world clinical settings.

Keywords: medical image segmentation, partially labeled data, foundation model, pseudo-labeling, iterative prompting, AMOS dataset, head-and-neck cancer, clinical application

166. ❌ Sub-metre Lunar DEM Generation and Validation from Chandrayaan-2 OHRC Multi-View Imagery Using Open-Source Photogrammetry

Authors: Aaranay Aadi, Jai Singla, Nitant Dube, Oleg Alexandrov Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01032v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

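Scoring rationale: The paper focuses on generating sub-metre lunar digital elevation models (DEMs) from Chandrayaan-2 OHRC multi-view imagery with open-source photogrammetry, which belongs to planetary science and remote sensing. It covers image processing, geometric analysis, point cloud generation, and DEM validation, and does not involve large models, deep learning, the principles of AI techniques, or AI applications in science. All scoring keywords concern large models, deep learning techniques, and their applications, whereas this paper is purely a remote-sensing data-processing study, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The study presents the first generation of sub-metre lunar digital elevation models from unpaired multi-view imagery of the Chandrayaan-2 Orbiter High Resolution Camera using an exclusively open-source photogrammetry pipeline. Validation against LRO NAC reference terrain yields a vertical RMSE of 5.85 m and a horizontal accuracy better than 30 cm.

Abstract translation (Chinese)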

月球表面的高分辨率数字高程模型（Digital Elevation Models, DEMs）对于表面移动规划、着陆点特征描述以及行星科学研究至关重要。月船二号（Chandrayaan-2）搭载的轨道器高分辨率相机（Orbiter High Resolution Camera, OHRC）通过获取分辨率约为每像素20-30厘米的全色影像，具备当前所有在役月球轨道成像设备中最优的地面采样能力。本研究首次提出了一种完全基于开源流程的方法，利用OHRC多视角影像生成亚米级DEM。通过对影像元数据进行几何分析，运用基线高度比（B/H ratio）计算和交会角估计，从非成对的OHRC存档数据中识别出候选立体像对。随后应用密集立体匹配与光线三角测量法生成点云，并在五个地理分布不同的月球区域将其格网化为有效空间分辨率介于约24至54厘米之间的DEM。通过以月球勘测轨道飞行器窄角相机（Lunar Reconnaissance Orbiter Narrow Angle Camera, NAC）数字地形模型为基准进行迭代最近点（Iterative Closest Point, ICP）配准，并实施恒定偏差偏移校正，确立了绝对高程一致性。以NAC参考地形为标准的验证结果显示，垂直方向均方根误差为5.85米（基于OHRC原始分辨率），通过平面特征匹配评估的水平精度优于30厘米。

Abstract

High-resolution digital elevation models (DEMs) of the lunar surface are essential for surface mobility planning, landing site characterization, and planetary science. The Orbiter High Resolution Camera (OHRC) on board Chandrayaan-2 has the best ground sampling capabilities of any lunar orbital imaging currently in use by acquiring panchromatic imagery at a resolution of roughly 20-30 cm per pixel. This work presents, for the first time, the generation of sub-metre DEMs from OHRC multi-view imagery using an exclusively open-source pipeline. Candidate stereo pairs are identified from non-paired OHRC archives through geometric analysis of image metadata, employing baseline-to-height (B/H) ratio computation and convergence angle estimation. Dense stereo correspondence and ray triangulation are then applied to generate point clouds, which are gridded into DEMs at effective spatial resolutions between approximately 24 and 54 cm across five geographically distributed lunar sites. Absolute elevation consistency is established through Iterative Closest Point (ICP) alignment against Lunar Reconnaissance Orbiter Narrow Angle Camera (NAC) Digital Terrain Models, followed by constant-bias offset correction. Validation against NAC reference terrain yields a vertical RMSE of 5.85 m (at native OHRC resolution), and a horizontal accuracy of less than 30 cm assessed by planimetric feature matching.

Keywords: lunar DEM, Chandrayaan-2 OHRC, open-source photogrammetry, multi-view imagery, sub-metre resolution, stereo correspondence, point cloud generation, terrain validation
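
The abstract above reports a vertical RMSE measured after a constant-bias offset correction against reference terrain. As a rough illustration of that last validation step only, the sketch below removes a constant vertical offset and computes the RMSE of the residuals. It assumes the two height lists are already co-registered sample by sample, and it uses the median difference as the bias estimate; both the function name and that choice of estimator are assumptions, not details taken from the paper.

```python
import math
from statistics import median


def vertical_rmse_after_bias_removal(dem_heights, ref_heights):
    """Remove a constant vertical offset, then return (offset, rmse) in metres."""
    if len(dem_heights) != len(ref_heights) or not dem_heights:
        raise ValueError("inputs must be non-empty and of equal length")
    # Per-sample height differences between the generated DEM and the reference.
    diffs = [d - r for d, r in zip(dem_heights, ref_heights)]
    # Assumption: the median difference serves as a robust constant-bias estimate.
    offset = median(diffs)
    residuals = [e - offset for e in diffs]
    rmse = math.sqrt(sum(e * e for e in residuals) / len(residuals))
    return offset, rmse


# Hypothetical co-registered height samples in metres (not data from the paper).
dem = [102.0, 98.5, 101.0, 99.5, 100.0]
ref = [100.0, 97.0, 99.0, 97.5, 98.5]
offset, rmse = vertical_rmse_after_bias_removal(dem, ref)
print(f"offset = {offset:.2f} m, RMSE = {rmse:.3f} m")
```

A mean-based offset would work equally well on clean data; the median is simply less sensitive to a few gross matching errors.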

167. ❌ Diff3R: Feed-forward 3D Gaussian Splatting with Uncertainty-aware Differentiable Optimization

Authors: Yueh-Cheng Liu, Jozef Hladký, Matthias Nießner, Angela Dai Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01030v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

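Scoring rationale: The paper belongs to computer vision and graphics, specifically 3D Gaussian Splatting (3DGS), and proposes Diff3R, a new framework that combines feed-forward prediction with test-time optimization. Its core contribution is improving 3D reconstruction through a differentiable optimization layer, the Implicit Function Theorem, and an uncertainty model. The keywords all concern large models, the principles of deep learning techniques, or AI applications in science, none of which this paper addresses, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper proposes the Diff3R framework, which integrates a differentiable 3D Gaussian Splatting optimization layer into training so that the network learns to predict an optimal initialization for test-time optimization, combining the fast inference of feed-forward models with the high-quality rendering of per-scene optimization in sparse-view settings.

Abstract translation (Chinese)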

三维高斯泼溅（3DGS）领域的最新进展呈现出两个主要方向：前馈模型能够在稀疏视角设置下实现快速推理，而逐场景优化方法虽能生成高质量渲染结果，但计算成本高昂。为融合两者的优势，我们提出了Diff3R这一创新框架，它显式地构建了前馈预测与测试时优化之间的桥梁。通过将可微分3DGS优化层直接嵌入训练循环，我们的网络学习预测测试时优化的最佳初始化状态，而非传统的零样本结果。为克服反向传播过程中优化步骤带来的计算负担，我们提出基于隐函数定理计算梯度，并采用专为3DGS优化设计的可扩展无矩阵预处理共轭梯度（PCG）求解器。此外，我们通过自适应控制优化过程中参数允许的调整幅度，将数据驱动的不确定性模型融入优化流程。该方法有效缓解了欠约束区域的过拟合问题，并增强了对输入异常值的鲁棒性。由于我们提出的优化层与模型无关，实验证明其可无缝集成到现有前馈式3DGS架构中，适用于姿态给定与姿态自由两类方法，显著提升了测试时优化的性能。

Abstract

Recent advances in 3D Gaussian Splatting (3DGS) present two main directions: feed-forward models offer fast inference in sparse-view settings, while per-scene optimization yields high-quality renderings but is computationally expensive. To combine the benefits of both, we introduce Diff3R, a novel framework that explicitly bridges feed-forward prediction and test-time optimization. By incorporating a differentiable 3DGS optimization layer directly into the training loop, our network learns to predict an optimal initialization for test-time optimization rather than a conventional zero-shot result. To overcome the computational cost of backpropagating through the optimization steps, we propose computing gradients via the Implicit Function Theorem and a scalable, matrix-free PCG solver tailored for 3DGS optimization. Additionally, we incorporate a data-driven uncertainty model into the optimization process by adaptively controlling how much the parameters are allowed to change during optimization. This approach effectively mitigates overfitting in under-constrained regions and increases robustness against input outliers. Since our proposed optimization layer is model-agnostic, we show that it can be seamlessly integrated into existing feed-forward 3DGS architectures for both pose-given and pose-free methods, providing improvements for test-time optimization.

Keywords: 3D Gaussian Splatting, feed-forward models, differentiable optimization, uncertainty-aware, test-time optimization, Implicit Function Theorem, matrix-free PCG solver, pose-given and pose-free methods
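
The abstract above describes computing gradients through an optimization layer via the Implicit Function Theorem instead of backpropagating through every optimization step. The sketch below shows that idea on a one-dimensional toy problem, not on 3DGS. The inner objective, the proximal weight `a` (a stand-in for an uncertainty term that limits how far the parameter may move from the predicted initialization `theta`), and all function names are assumptions made for illustration. In one dimension the linear solve reduces to a division, whereas the paper uses a matrix-free PCG solver for the high-dimensional case.

```python
def solve_inner(theta, a, iters=50):
    """Newton's method for x* = argmin_x g(x) + 0.5 * a * (x - theta)**2,
    with the toy data term g(x) = 0.25 * x**4 - x (an assumption)."""
    x = theta  # start from the predicted initialization
    for _ in range(iters):
        grad = x**3 - 1.0 + a * (x - theta)  # first derivative of the inner objective
        hess = 3.0 * x**2 + a                # second derivative, positive when a > 0
        x -= grad / hess
    return x


def outer_loss(theta, a, target):
    """Outer loss evaluated at the inner optimum."""
    return 0.5 * (solve_inner(theta, a) - target) ** 2


def ift_gradient(theta, a, target):
    """dL/dtheta via the Implicit Function Theorem, without unrolling Newton steps."""
    x_star = solve_inner(theta, a)
    hess = 3.0 * x_star**2 + a   # d2f/dx2 at the optimum
    cross = -a                   # d2f/(dx dtheta) for the proximal term
    dx_dtheta = -cross / hess    # IFT: dx*/dtheta = -(d2f/dx2)^-1 * d2f/(dx dtheta)
    return (x_star - target) * dx_dtheta


# Check the analytic gradient against a central finite difference.
theta, a, target, h = 0.5, 2.0, 1.5, 1e-5
analytic = ift_gradient(theta, a, target)
numeric = (outer_loss(theta + h, a, target) - outer_loss(theta - h, a, target)) / (2 * h)
print(analytic, numeric)
```

A larger `a` pulls the optimum toward `theta` and makes `dx*/dtheta` approach 1, which matches the intuition that a confident initialization should change little during test-time optimization.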

168. ❌ Forecasting Motion in the Wild

Authors: Neerja Thakkar, Shiry Ginosar, Jacob Walker, Jitendra Malik, Joao Carreira, Carl Doersch Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01015v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

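Scoring rationale: The paper 'Forecasting Motion in the Wild' belongs to computer vision, specifically motion forecasting of animal behavior, and uses dense point trajectories as visual tokens together with a diffusion transformer. The scoring keywords all relate directly to large models, the principles of deep learning techniques, or AI applications in science, whereas the paper's core content (visual representation, trajectory forecasting, diffusion models) has no direct connection to them, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper proposes dense point trajectories as visual tokens for forecasting the complex motion patterns of animals in the wild. Using 300 hours of curated animal video, a diffusion transformer achieves category-agnostic, data-efficient prediction and outperforms existing baselines.

Abstract translation (Chinese)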

视觉智能需要预测智能体的未来行为，然而视觉系统缺乏对运动与行为的通用表征。我们提出将密集点轨迹作为行为视觉标记，这是一种解耦运动与外观、并能泛化至各类非刚性智能体（如野外动物）的结构化中层表征。基于此抽象，我们设计了一个扩散变换器模型，该模型对无序轨迹集合进行建模，并显式推理遮挡关系，从而实现对复杂运动模式的连贯预测。为进行大规模评估，我们构建了300小时的无约束动物视频数据集，包含鲁棒的镜头检测与相机运动补偿。实验表明，基于轨迹标记的预测方法实现了类别无关且数据高效的预测，其性能优于当前最先进的基线模型，并能泛化至稀有物种与特殊形态，为野外环境中的预测性视觉智能奠定了基础。

Abstract

Visual intelligence requires anticipating the future behavior of agents, yet vision systems lack a general representation for motion and behavior. We propose dense point trajectories as visual tokens for behavior, a structured mid-level representation that disentangles motion from appearance and generalizes across diverse non-rigid agents, such as animals in-the-wild. Building on this abstraction, we design a diffusion transformer that models unordered sets of trajectories and explicitly reasons about occlusion, enabling coherent forecasts of complex motion patterns. To evaluate at scale, we curate 300 hours of unconstrained animal video with robust shot detection and camera-motion compensation. Experiments show that forecasting trajectory tokens achieves category-agnostic, data-efficient prediction, outperforms state-of-the-art baselines, and generalizes to rare species and morphologies, providing a foundation for predictive visual intelligence in the wild.

Keywords: motion forecasting, dense point trajectories, diffusion transformer, animal behavior, visual intelligence, occlusion reasoning, category-agnostic prediction, wild video analysis

169. ❌ AutoMIA: Improved Baselines for Membership Inference Attack via Agentic Self-Exploration

Authors: Ruhao Liu, Weiqi Huang, Qi Li, Xinchao Wang Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01014v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

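Scoring rationale: The paper proposes the AutoMIA framework, which recasts membership inference attacks as an automated process of self-exploration and strategy evolution. Its core is an agentic framework (highly relevant to LLM Agents), and it involves self-exploration and strategy evolution (somewhat related to Self-Correction). It mainly targets membership inference attacks on large models (LLMs) and is therefore relevant to LLMs. Other keywords, such as MoE, SLMs, training methods, inference acceleration, and AI for science, are not covered.

!!! tip deepseek-chat TL;DR

The paper proposes the AutoMIA framework, which improves membership inference attacks on large models through an agentic approach of self-exploration and strategy evolution, matching or outperforming state-of-the-art baselines while eliminating the need for manual feature engineering.

Abstract translation (Chinese)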

成员推理攻击（Membership Inference Attacks, MIAs）作为一种基础性审计工具，用于评估机器学习模型中训练数据的泄露风险。然而，现有方法主要依赖于静态、人工设计的启发式规则，其缺乏适应性，在跨不同大模型迁移时往往导致性能欠佳。本研究提出AutoMIA，一种智能体框架，将成员推理重新构建为一个自我探索与策略演化的自动化过程。给定高层场景描述，AutoMIA通过生成可执行的logits层面策略来自主探索攻击空间，并借助闭环评估反馈逐步优化这些策略。通过将抽象策略推理与底层执行解耦，我们的框架实现了对攻击搜索空间的系统性、模型无关的遍历。大量实验表明，AutoMIA在无需人工特征工程的情况下，始终达到或超越当前最优基线方法的性能。

Abstract

Membership Inference Attacks (MIAs) serve as a fundamental auditing tool for evaluating training data leakage in machine learning models. However, existing methodologies predominantly rely on static, handcrafted heuristics that lack adaptability, often leading to suboptimal performance when transferred across different large models. In this work, we propose AutoMIA, an agentic framework that reformulates membership inference as an automated process of self-exploration and strategy evolution. Given high-level scenario specifications, AutoMIA self-explores the attack space by generating executable logits-level strategies and progressively refining them through closed-loop evaluation feedback. By decoupling abstract strategy reasoning from low-level execution, our framework enables a systematic, model-agnostic traversal of the attack search space. Extensive experiments demonstrate that AutoMIA consistently matches or outperforms state-of-the-art baselines while eliminating the need for manual feature engineering.

Keywords: Membership Inference Attack, AutoMIA, agentic framework, self-exploration, strategy evolution, large models, model-agnostic, training data leakage
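
The abstract above contrasts AutoMIA with static, handcrafted heuristics that operate on model logits. To make "logits-level strategy" concrete, the sketch below implements one such handcrafted baseline, the classic loss-threshold rule: a sample whose average negative log-likelihood falls below a threshold is guessed to be a training member. This is a common baseline, not a strategy generated by AutoMIA; the function names, the example logits, and the threshold value are all hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def mean_nll(token_logits, token_ids):
    """Average negative log-likelihood of the observed tokens."""
    losses = [-log_softmax(row)[t] for row, t in zip(token_logits, token_ids)]
    return sum(losses) / len(losses)


def is_member(token_logits, token_ids, threshold):
    """Loss-threshold rule: a low loss suggests the sample was seen during training."""
    return mean_nll(token_logits, token_ids) < threshold


# Hypothetical logits over a 3-token vocabulary for a 2-token sequence.
confident = [[5.0, 0.0, 0.0], [0.0, 6.0, 0.0]]  # model strongly predicts the true tokens
uniform = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]    # model has no preference
tokens = [0, 1]
threshold = 0.5  # hypothetical; in practice calibrated on known non-members

print(is_member(confident, tokens, threshold), is_member(uniform, tokens, threshold))
```

The weakness the abstract points to is visible here: the threshold and the choice of statistic are fixed by hand and may not transfer across models.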

170. ❌ PDA: Text-Augmented Defense Framework for Robust Vision-Language Models against Adversarial Image Attacks

Authors: Jingning Xu, Haochen Luo, Chen Liu Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01010v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

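Scoring rationale: The paper studies PDA, an adversarial defense framework for vision-language models (VLMs), and focuses on improving robustness against adversarial image attacks. The scoring keywords all target the technical principles, training methods, inference optimization, and application paradigms of text-only large language models (LLMs), whereas this paper's core is a defense method for multimodal vision-language models; it does not address specific technical points such as LLM architecture, training techniques, inference acceleration, alignment methods, or agents. Although VLMs usually contain a language-model component, the paper does not explore technical innovations in LLMs themselves, nor does it involve AI applications in science. Consequently, no keyword is directly related to the paper's content, and all score 0.

!!! tip deepseek-chat TL;DR

To address the vulnerability of vision-language models to adversarial image attacks, the paper proposes PDA, a training-free text-augmentation defense framework that improves robustness at inference time through prompt paraphrasing, question decomposition, and consistency aggregation. It achieves consistent robustness gains on multiple benchmarks while maintaining competitive clean accuracy.

Abstract translation (Chinese)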

视觉语言模型（VLMs）易受对抗性图像扰动的影响。现有基于针对特定任务对抗样本的对抗训练方法计算成本高昂，且往往难以泛化至未见过的攻击类型。为应对这些局限，我们提出释义-分解-聚合（Paraphrase-Decomposition-Aggregation, PDA）框架，这是一种无需训练、利用文本增强来提升视觉语言模型在多种对抗性图像攻击下鲁棒性的防御方法。PDA在测试阶段完全通过提示词释义、问题分解和一致性聚合进行操作，无需对底层模型进行任何修改。为平衡鲁棒性与效率，我们将PDA实例化为若干不变组件，在保留大部分鲁棒性增益的同时降低推理成本。在多种视觉语言模型架构以及视觉问答、分类和描述任务的基准测试上的实验表明，PDA能针对各类对抗性扰动取得一致的鲁棒性提升，同时保持具有竞争力的干净样本准确率，从而为视觉语言模型的推理阶段建立了一个通用、强大且实用的防御框架。

Abstract

Vision-language models (VLMs) are vulnerable to adversarial image perturbations. Existing works based on adversarial training against task-specific adversarial examples are computationally expensive and often fail to generalize to unseen attack types. To address these limitations, we introduce Paraphrase-Decomposition-Aggregation (PDA), a training-free defense framework that leverages text augmentation to enhance VLM robustness under diverse adversarial image attacks. PDA performs prompt paraphrasing, question decomposition, and consistency aggregation entirely at test time, thus requiring no modification on the underlying models. To balance robustness and efficiency, we instantiate PDA as invariants that reduce the inference cost while retaining most of its robustness gains. Experiments on multiple VLM architectures and benchmarks for visual question answering, classification, and captioning show that PDA achieves consistent robustness gains against various adversarial perturbations while maintaining competitive clean accuracy, establishing a generic, strong and practical defense framework for VLMs during inference.

Keywords: Vision-Language Models, Adversarial Image Attacks, Robustness Defense, Training-free Framework, Text Augmentation, Prompt Paraphrasing, Question Decomposition, Consistency Aggregation
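
The abstract above names consistency aggregation as the final step of PDA but does not spell out the aggregation rule. One simple way such a step could work is a majority vote over the answers a model returns for several paraphrases of the same question, as sketched below. The normalization, the agreement ratio, and the example answers are assumptions for illustration, not the paper's procedure.

```python
from collections import Counter


def aggregate_answers(answers):
    """Return (majority answer, agreement ratio) over normalized answers."""
    # Normalize case and whitespace so trivially different strings count as one answer.
    normalized = [a.strip().lower() for a in answers if a.strip()]
    if not normalized:
        raise ValueError("no non-empty answers to aggregate")
    winner, count = Counter(normalized).most_common(1)[0]
    return winner, count / len(normalized)


# Hypothetical model outputs for four paraphrases of "What animal is in the image?"
answers = ["A dog", "a dog ", "a cat", "A DOG"]
winner, agreement = aggregate_answers(answers)
print(winner, agreement)
```

A low agreement ratio could serve as a signal that the image may be adversarially perturbed, though whether PDA uses agreement this way is not stated in the abstract.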

171. ❌ Customizing Large Vision Model-Guided Low-Rank Approximation for Ground-Roll Denoise

Authors: Jiacheng Liao, Feng Qian, Ziyin Fan, Yongjian Guo Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00998v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

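Scoring rationale: The paper proposes a seismic data denoising method guided by a large vision model, which is an application of AI in science (seismic data processing). Its core is using a promptable large vision model to extract semantic priors and embedding them into a low-rank inverse problem to achieve training-free denoising. This is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (8 points), because seismic data processing is an important application area in Earth science and falls within AI for Science. The other keywords mainly concern the technical principles, training methods, inference optimization, and agent systems of large language models (LLMs), whereas this paper uses a vision model rather than a language model and does not involve MoE, scaling laws, fine-tuning, alignment, RAG, attention optimization, chain of thought, agents, quantization, or similar techniques, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a large-vision-model-guided low-rank approximation method for removing ground-roll noise from seismic data, achieving training-free, spatially adaptive denoising that preserves the continuity of reflection signals.

Abstract translation (Chinese)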

地滚波是陆地地震勘探与垂直地震剖面（VSP）数据中主要的相干噪声源，严重掩盖反射波同相轴并影响后续成像与解释效果。传统的衰减方法，包括变换域滤波、稀疏表示和深度学习，常受限于适应性不足、信号泄漏或对标注训练数据的依赖，尤其在强信号-噪声重叠区域更为突出。为解决这些挑战，本文提出一种无需训练的框架，将地滚波衰减重新定义为语义引导的信号分离问题。具体而言，该框架采用可提示的大视觉模型，通过将地震道集转换为视觉表征，并借助文本或图像提示定位地滚波主导区域，从而提取高层语义先验。生成的语义响应被转化为连续的软掩膜，并将其嵌入到一个掩膜约束的低秩反演公式中，以实现空间自适应的噪声压制与反射波保持的重建。为进一步求解所提出的反演问题，本文开发了一种基于交替方向乘子法（ADMM）的高效求解器，能够在无需任务特定训练或人工标注的情况下，实现稳定且物理一致的信号恢复。在合成与实测VSP数据集上的大量实验表明，该方法在保持反射连续性与波形保真度的同时，实现了优越的地滚波衰减效果，其性能持续优于代表性的变换域滤波与隐式神经表示方法。

Abstract

Ground-roll is a dominant source of coherent noise in land and vertical seismic profiling (VSP) data, severely masking reflection events and degrading subsequent imaging and interpretation. Conventional attenuation methods, including transform-domain filtering, sparse representation, and deep learning, often suffer from limited adaptability, signal leakage, or dependence on labeled training data, especially under strong signal-noise overlap. To address these challenges, we propose a training-free framework that reformulates ground-roll attenuation as a semantic-guided signal separation problem. Specifically, a promptable large vision model is employed to extract high-level semantic priors by converting seismic gathers into visual representations and localizing ground-roll-dominant regions via text or image prompts. The resulting semantic response is transformed into a continuous soft mask, which is embedded into a mask-conditioned low-rank inverse formulation to enable spatially adaptive suppression and reflection-preserving reconstruction. An efficient alternating direction method of multipliers (ADMM)-based solver is further developed to solve the proposed inverse problem, enabling stable and physically consistent signal recovery without requiring task-specific training or manual annotation. Extensive experiments on both synthetic and field VSP datasets demonstrate that the proposed method achieves superior ground-roll attenuation while preserving reflection continuity and waveform fidelity, consistently outperforming representative transform-domain filtering and implicit neural representation methods.

Keywords: ground-roll attenuation, large vision model, semantic-guided separation, low-rank approximation, training-free framework, seismic data processing, ADMM solver, VSP datasets

172. ❌ Maximizing T2-Only Prostate Cancer Localization from Expected Diffusion Weighted Imaging

Authors: Weixi Yi, Yipei Wang, Wen Yan, Hanyuan Zhang, Natasha Thorley, Alexander Ng, Shonit Punwani, Fernando Bianco, Mark Emberton, Veeru Kasivisvanathan, Dean C. Barratt, Shaheer U. Saeed, Yipeng Hu Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00985v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0
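
The pass threshold shown next to each score can be reproduced from the formula in this report's configuration (pass score = 5.0 + 0.8 × total weight, with 27 keywords of weight 1.0). The snippet below is a minimal sketch of that arithmetic only. The helper `passes` and the choice of `>=` for the comparison are assumptions, and the per-keyword score formula is not stated in the report, so it is not modelled here.

```python
# Pass threshold as stated in the report configuration:
#   pass score = 5.0 + 0.8 * total weight
BASE = 5.0
PER_WEIGHT = 0.8
total_weight = 27 * 1.0  # 27 keywords, each with weight 1.0

pass_score = BASE + PER_WEIGHT * total_weight


def passes(score: float, threshold: float = pass_score) -> bool:
    """Hypothetical helper: assumes a paper passes when score >= threshold."""
    return score >= threshold


print(round(pass_score, 1))  # 26.6
print(passes(0.0))           # False
```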

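Scoring rationale: The paper focuses on medical image analysis (prostate cancer localization) and applies machine learning methods (a generative model and an expectation-maximization algorithm) to MRI data. The keywords all concern the technical principles of large models and deep learning or their scientific applications, but the paper does not involve large models (LLMs/foundation models), MoE, SLMs, scaling laws, pre-training/post-training, alignment, RLHF, PEFT, RAG, context extension, attention optimization, reasoning methods, agent systems, tool use, multi-agent systems, quantization, inference acceleration, hallucination mitigation, interpretability, world models, model merging, or in-context learning. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper applies AI to biomedicine (medical image analysis), but its core contribution is not large-model technology, so it receives a relevance of 5 (some relevance). All other keywords are unrelated and receive 0.

!!! tip deepseek-chat TL;DR

The study proposes a new method for prostate cancer localization that uses only T2-weighted MRI at inference. It combines an expectation-maximization algorithm with a generative model to exploit the DWI images available during training as a latent modality, and on internal and external datasets it performs competitively with or better than multi-sequence baselines.

Abstract translation (Chinese)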

多参数磁共振成像日益被推荐作为检测和定位前列腺癌的一线无创方法，其至少需要扩散加权成像和T2加权成像序列。早期仅使用T2加权图像的机器学习尝试在分割放射科医生标注的病灶方面已展现出有前景的诊断性能。这种单一模态的仅T2方法通过降低获取其他序列所需的成本与专业要求，带来了显著的临床效益。本研究探讨了一个更具挑战性的应用：在推断阶段仅使用T2加权图像，但依据独立的组织病理学标签来定位个体癌症。我们将扩散加权图像构建为一种潜在模态（在训练阶段易于获取），在仅以T2加权图像作为输入的情况下，对局部Barzell分区的癌症存在性进行分类。在由此产生的期望最大化算法中，一个潜在模态生成器（通过基于流匹配的生成模型实现）在E步中近似潜在扩散加权图像的后验分布，而在M步中，癌症定位器与生成模型同步优化，以最大化癌症存在的期望似然。所提出的方法为从特权扩散加权模态中学习提供了一个新颖的理论框架，与缺乏训练扩散加权图像的方法或现有的特权学习及不完整模态框架相比，实现了更优的癌症定位性能。所提出的仅T2方法在使用多输入序列的基线方法对比中表现出相当或更优的性能（例如，在患者层面F1分数提升14.4%，在分区层面QWK提升5.3%，优于T2加权+扩散加权成像基线）。我们利用来自4,133名具有组织病理学验证标签的前列腺癌患者的内部和外部数据集进行了定量评估。

Abstract

Multiparametric MRI is increasingly recommended as a first-line noninvasive approach to detect and localize prostate cancer, requiring at minimum diffusion-weighted (DWI) and T2-weighted (T2w) MR sequences. Early machine learning attempts using only T2w images have shown promising diagnostic performance in segmenting radiologist-annotated lesions. Such uni-modal T2-only approaches deliver substantial clinical benefits by reducing costs and expertise required to acquire other sequences. This work investigates an arguably more challenging application using only T2w at inference, but to localize individual cancers based on independent histopathology labels. We formulate DWI images as a latent modality (readily available during training) to classify cancer presence at local Barzell zones, given only T2w images as input. In the resulting expectation-maximization algorithm, a latent modality generator (implemented using a flow matching-based generative model) approximates the latent DWI image posterior distribution in the E-steps, while in M-steps a cancer localizer is simultaneously optimized with the generative model to maximize the expected likelihood of cancer presence. The proposed approach provides a novel theoretical framework for learning from a privileged DWI modality, yielding superior cancer localization performance compared to approaches that lack training DWI images or existing frameworks for privileged learning and incomplete modalities. The proposed T2-only methods perform competitively or better than baseline methods using multiple input sequences (e.g., improving the patient-level F1 score by 14.4% and zone-level QWK by 5.3% over the T2w+DWI baseline). We present quantitative evaluations using internal and external datasets from 4,133 prostate cancer patients with histopathology-verified labels.

Keywords: prostate cancer localization, T2-weighted MRI, diffusion-weighted imaging, expectation-maximization algorithm, generative model, privileged modality learning, medical image analysis, histopathology labels
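
For readers unfamiliar with latent-variable EM, the textbook bound behind the E- and M-steps described in the abstract can be written as follows. This is a generic sketch, not the paper's exact objective. The symbols are assumptions introduced here: $x$ is the T2w image, $z$ the latent DWI image, $y$ the zone-level cancer label, $q$ the distribution produced by the latent modality generator, and $\theta$ the localizer parameters. Whether the localizer conditions on both $x$ and $z$ is also an assumption.

$$
\log p_\theta(y \mid x) = \log \int p_\theta(y \mid x, z)\, p(z \mid x)\, dz \;\ge\; \mathbb{E}_{q(z)}\big[\log p_\theta(y \mid x, z)\big] - \mathrm{KL}\big(q(z)\,\|\,p(z \mid x)\big)
$$

In this generic form, the E-step moves $q$ toward the posterior $p(z \mid x, y)$, which tightens the bound, and the M-step updates $\theta$ to maximize the expected log-likelihood $\mathbb{E}_{q(z)}[\log p_\theta(y \mid x, z)]$.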

173. ❌ ACT Now: Preempting LVLM Hallucinations via Adaptive Context Integration

Authors: Bei Yan, Yuecong Min, Jie Zhang, Shiguang Shan, Xilin Chen Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00983v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

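Scoring rationale: The paper addresses hallucination in large vision-language models (LVLMs) and proposes ACT, a training-free inference intervention method. Relevance by keyword: (1) "Large Language Models OR LLMs OR Foundation Models" (8): relevant, because LVLMs are a visual extension of LLMs; (2) "Hallucination Mitigation OR Factuality OR Truthfulness" (10): highly relevant, because this is the paper's core research problem; (3) "Self-Correction OR Self-Improvement OR Self-Reflection" (5): some relevance, because ACT corrects hallucinations through adaptive context integration; (4) "Mechanistic Interpretability OR Explainable AI" (5): some relevance, because the paper involves attention-mechanism analysis and vision-language alignment. The remaining keywords are unrelated to the paper and receive 0.

!!! tip deepseek-chat TL;DR

The paper targets the severe hallucination problem in large vision-language models (LVLMs) and proposes ACT, a training-free inference intervention method that significantly reduces hallucinations through adaptive context integration and achieves competitive results on multiple benchmarks.

Abstract translation (Chinese)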

大型视觉语言模型（LVLMs）普遍存在严重的幻觉问题。现有的缓解策略主要依赖于孤立、单步的状态来增强视觉聚焦或抑制过强的语言先验。然而，这些静态方法忽视了生成过程中动态的上下文变化，难以纠正已形成的信息损失。为应对这一局限，我们提出了自适应上下文集成方法（Adaptive Context inTegration, ACT），这是一种无需训练、通过自适应集成上下文信息来缓解幻觉的推理干预方法。具体而言，我们首先提出视觉上下文探索，该方法利用时空分析来自适应地放大负责视觉探索的注意力头。为进一步促进视觉-语言对齐，我们提出了语义上下文聚合，该方法通过边缘化潜在的语义查询来有效聚合视觉证据，从而解决由词元预测的离散性所导致的信息损失问题。在多种LVLMs上进行的大量实验表明，ACT能显著减少幻觉，并在判别性和生成性基准测试中取得具有竞争力的结果，成为一种在不损害基础生成能力前提下的鲁棒且高度自适应的解决方案。

Abstract

Large Vision-Language Models (LVLMs) frequently suffer from severe hallucination issues. Existing mitigation strategies predominantly rely on isolated, single-step states to enhance visual focus or suppress strong linguistic priors. However, these static approaches neglect dynamic context changes across the generation process and struggle to correct inherited information loss. To address this limitation, we propose Adaptive Context inTegration (ACT), a training-free inference intervention method that mitigates hallucination through the adaptive integration of contextual information. Specifically, we first propose visual context exploration, which leverages spatio-temporal profiling to adaptively amplify attention heads responsible for visual exploration. To further facilitate vision-language alignment, we propose semantic context aggregation that marginalizes potential semantic queries to effectively aggregate visual evidence, thereby resolving the information loss caused by the discrete nature of token prediction. Extensive experiments across diverse LVLMs demonstrate that ACT significantly reduces hallucinations and achieves competitive results on both discriminative and generative benchmarks, acting as a robust and highly adaptable solution without compromising fundamental generation capabilities.

Keywords: Large Vision-Language Models, Hallucination Mitigation, Adaptive Context Integration, Training-free Inference, Visual Context Exploration, Semantic Context Aggregation, Vision-Language Alignment, Attention Heads

174. ❌ DLWM: Dual Latent World Models enable Holistic Gaussian-centric Pre-training in Autonomous Driving

Authors: Yiyao Zhu, Ying Xue, Haiming Zhang, Guangfeng Jiang, Wending Zhou, Xu Yan, Jiantao Gao, Yingjie Cai, Bingbing Liu, Zhen Li, Shaojie Shen Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00969v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

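Scoring rationale: The paper targets autonomous driving and proposes DLWM, a dual latent world model paradigm for Gaussian-centric pre-training. It is unrelated to most keywords, which mainly concern large language models (LLMs) and related techniques (fine-tuning, alignment, reasoning, agents, and so on), whereas this paper studies world models in computer vision and autonomous driving. It is, however, highly relevant to "Pre-training OR Continual Pre-training OR Domain Adaptation" (10), because it explicitly proposes a two-stage pre-training method, and to "World Models AND General World Models" (10), because its core innovation is the Dual Latent World Models. It has some relevance to "AI for Science OR Bioinformatics OR Cheminformatics" (5): autonomous driving can be viewed as an application of AI in engineering and robotics and so falls under a broad reading of "AI for Science", but it is not part of the core bioinformatics or cheminformatics subfields.

!!! tip deepseek-chat TL;DR

The paper proposes DLWM, a dual latent world model paradigm for autonomous driving that builds a holistic 3D-Gaussian-centric representation through two-stage pre-training and achieves significant performance gains on 3D occupancy perception, 4D occupancy forecasting, and motion planning.

Abstract translation (Chinese)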

基于视觉的自动驾驶技术因其低成本与卓越性能备受关注。相较于稠密鸟瞰图或稀疏查询模型，以高斯为中心的方法通过三维语义高斯描述场景，形成了一种全面而稀疏的表征。本文提出DLWM，一种专为自动驾驶设计的双潜在世界模型新范式，通过两阶段实现以高斯为中心的整体预训练。在第一阶段，DLWM通过自监督重建多视角语义与深度图像，从查询中预测三维高斯。在获得细粒度上下文特征后，第二阶段分别训练两个潜在世界模型进行时序特征学习：包括为下游占据感知与预测任务设计的高斯流引导潜在预测，以及为运动规划设计的自车规划引导潜在预测。在SurroundOcc和nuScenes基准测试中的大量实验表明，DLWM在以高斯为中心的三维占据感知、四维占据预测及运动规划任务中均取得显著性能提升。

Abstract

Vision-based autonomous driving has gained much attention due to its low cost and excellent performance. Compared with dense BEV (Bird's Eye View) or sparse query models, the Gaussian-centric method offers a comprehensive yet sparse representation by describing the scene with 3D semantic Gaussians. In this paper, we introduce DLWM, a novel paradigm with Dual Latent World Models specifically designed to enable holistic Gaussian-centric pre-training in autonomous driving using two stages. In the first stage, DLWM predicts 3D Gaussians from queries by reconstructing multi-view semantic and depth images in a self-supervised manner. Equipped with fine-grained contextual features, in the second stage, two latent world models are trained separately for temporal feature learning, including Gaussian-flow-guided latent prediction for downstream occupancy perception and forecasting tasks, and ego-planning-guided latent prediction for motion planning. Extensive experiments on the SurroundOcc and nuScenes benchmarks demonstrate that DLWM shows significant performance gains across Gaussian-centric 3D occupancy perception, 4D occupancy forecasting and motion planning tasks.

Keywords: Autonomous Driving, World Models, Pre-training, 3D Gaussians, Occupancy Perception, Motion Planning, Latent Prediction, Self-supervised Learning

175. ❌ Enhancing Gradient Inversion Attacks in Federated Learning via Hierarchical Feature Optimization

Authors: Hao Fang, Wenbo Yu, Bin Chen, Xuan Wang, Shu-Tao Xia, Qing Liao, Ke Xu Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00955v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

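Scoring rationale: The paper studies gradient inversion attacks in federated learning, using a GAN as prior knowledge for data reconstruction, and belongs to the privacy and security area. The keywords all concern large models, the technical principles of deep learning, or scientific applications, whereas this paper focuses on privacy attacks against federated learning and does not involve large-model technology, training methods, inference optimization, alignment, agent systems, or AI-for-science applications, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes GIFD, a gradient inversion attack that reconstructs sensitive data in federated learning through hierarchical feature optimization. It addresses the limited expressiveness and generalizability of existing methods and achieves pixel-level reconstruction across a range of scenarios.

Abstract translation (Chinese)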

联邦学习（Federated Learning, FL）已成为一种极具吸引力的隐私保护分布式机器学习范式，允许多个客户端通过向中央服务器传输本地计算的梯度来协作训练全局模型，而无需暴露其私有数据。然而，近期研究发现，联邦学习系统中交换的梯度同样存在隐私泄露风险，例如攻击者可以利用预训练的生成对抗网络（Generative Adversarial Networks, GAN）作为先验知识，反转共享梯度以重建敏感数据。然而，现有攻击方法仅在GAN模型的潜在空间中进行梯度反转，这限制了其表达能力和泛化性。为应对这些挑战，我们提出了一种基于特征域的梯度反转方法（Gradient Inversion over Feature Domains, GIFD），该方法解构GAN模型并在其中间层中搜索分层特征。我们并非仅对初始潜在编码进行优化，而是逐步改变优化层，从初始潜在空间逐渐过渡到更接近输出图像的中间层。此外，我们设计了一种正则化器，通过在搜索范围中添加一个小的${l_1}$球约束来避免生成不真实的图像。我们还将GIFD扩展到分布外（Out-of-Distribution, OOD）设置，从而弱化了GAN训练集与联邦学习任务数据服从相同分布这一假设。进一步，我们考虑了标签不一致这一更具挑战性的OOD场景，并提出了一种标签映射技术作为有效解决方案。大量实验表明，我们的方法能够实现像素级重建，并在多种联邦学习场景中优于现有基线方法。

Abstract

Federated Learning (FL) has emerged as a compelling paradigm for privacy-preserving distributed machine learning, allowing multiple clients to collaboratively train a global model by transmitting locally computed gradients to a central server without exposing their private data. Nonetheless, recent studies find that the gradients exchanged in the FL system are also vulnerable to privacy leakage, e.g., an attacker can invert shared gradients to reconstruct sensitive data by leveraging pre-trained generative adversarial networks (GAN) as prior knowledge. However, existing attacks simply perform gradient inversion in the latent space of the GAN model, which limits their expression ability and generalizability. To tackle these challenges, we propose Gradient Inversion over Feature Domains (GIFD), which disassembles the GAN model and searches the hierarchical features of the intermediate layers. Instead of optimizing only over the initial latent code, we progressively change the optimized layer, from the initial latent space to intermediate layers closer to the output images. In addition, we design a regularizer to avoid unrealistic image generation by adding a small $\ell_1$ ball constraint to the searching range. We also extend GIFD to the out-of-distribution (OOD) setting, which weakens the assumption that the training sets of GANs and FL tasks obey the same data distribution. Furthermore, we consider the challenging OOD scenario of label inconsistency and propose a label mapping technique as an effective solution. Extensive experiments demonstrate that our method can achieve pixel-level reconstruction and outperform competitive baselines across a variety of FL scenarios.

Keywords: Federated Learning, Gradient Inversion Attack, Privacy Leakage, Generative Adversarial Networks, Hierarchical Feature Optimization, Out-of-Distribution, Data Reconstruction, Privacy-Preserving
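
The GIFD abstract mentions restricting the search range with a small $\ell_1$ ball. The abstract does not say how that constraint is enforced, so the following is only a sketch of one common option: Euclidean projection onto an $\ell_1$ ball using the standard sort-based method. The function names `project_l1_ball` and `clip_to_l1_ball`, the flat-list representation of a feature, and the idea of centering the ball on a reference feature are all assumptions made for illustration.

```python
def project_l1_ball(v, radius):
    """Euclidean projection of v onto {w : ||w||_1 <= radius} (sort-based method)."""
    if radius <= 0:
        raise ValueError("radius must be positive")
    abs_v = [abs(x) for x in v]
    if sum(abs_v) <= radius:
        return list(v)  # already inside the ball, nothing to do
    u = sorted(abs_v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        candidate = (cumulative - radius) / j
        if u_j - candidate > 0:
            theta = candidate  # threshold from the last index that still qualifies
    # Soft-threshold every coordinate by theta, keeping its sign.
    return [(1.0 if x >= 0 else -1.0) * max(abs(x) - theta, 0.0) for x in v]


def clip_to_l1_ball(feature, center, radius):
    """Hypothetical helper: keep `feature` within an l1 ball around `center`."""
    offset = [f - c for f, c in zip(feature, center)]
    projected = project_l1_ball(offset, radius)
    return [c + o for c, o in zip(center, projected)]


w = project_l1_ball([3.0, -1.0, 0.5], radius=2.0)
constrained = clip_to_l1_ball([3.0, -1.0, 0.5], [1.0, 1.0, 1.0], radius=1.0)
```

In an optimization loop, a projection like this would typically be applied after each gradient step so that the searched feature never drifts far from its reference point.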

176. ❌ YieldSAT: A Multimodal Benchmark Dataset for High-Resolution Crop Yield Prediction

Authors: Miro Miranda, Deepak Pathak, Patrick Helber, Benjamin Bischke, Hiba Najjar, Francisco Mena, Cristhian Sanchez, Akshay Pai, Diego Arenas, Matias Valdenegro-Toro, Marcela Charfuelan, Marlon Nuske, Andreas Dengel Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00940v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

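Scoring rationale: The paper addresses crop yield prediction in agriculture, presents the multimodal dataset YieldSAT, and compares various deep learning models and data fusion architectures. Its core is the application of computer vision and remote sensing to agricultural science, which falls under AI for Science. However, it does not involve large language models (LLMs), model architecture innovations (such as MoE or quantization), training methods (such as pre-training, fine-tuning, or alignment), reasoning techniques (such as RAG or CoT), agent systems, or any other topic directly tied to large-model technology. Therefore, apart from "AI for Science OR Bioinformatics OR Cheminformatics", which receives 5 (some relevance) for the scientific application, all other keywords receive 0 (unrelated).

!!! tip deepseek-chat TL;DR

The paper presents YieldSAT, a large, high-quality, multimodal dataset for high-resolution crop yield prediction. By comparing deep learning models and data fusion architectures, it demonstrates the potential of framing the problem as a pixel regression task, and it explores a domain-informed deep ensemble approach to mitigate distribution shift in real-world data.

Abstract translation (Chinese)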

作物产量预测需要大量数据来训练可扩展模型。然而，产量预测数据集的构建受到数据获取成本高昂、数据质量参差不齐以及数据隐私法规的限制。因此，现有数据集往往稀缺、质量较低，或仅限于区域层面或单一作物类型，这阻碍了可扩展的数据驱动解决方案的发展。本研究发布了YieldSAT——一个用于高分辨率作物产量预测的大规模、高质量多模态数据集。YieldSAT覆盖了阿根廷、巴西、乌拉圭和德国等多个国家的不同气候区，包含玉米、油菜、大豆和小麦等主要作物类型，涉及2,173个经专家筛选的田块。该数据集总计提供超过1,220万个产量样本，每个样本的空间分辨率为10米。每个田块均配有多光谱卫星影像，共包含113,555张带标注的卫星图像，并辅以辅助环境数据。通过比较多种深度学习模型与数据融合架构，我们证明了将大规模高分辨率作物产量预测作为像素回归任务的潜力。此外，我们强调了真实世界条件下地面实况数据中存在的严重分布偏移所带来的开放挑战。为缓解此问题，我们探索了一种领域知识引导的深度集成方法，该方法展现出显著的性能提升。数据集可通过https://yieldsat.github.io/获取。

Abstract

Crop yield prediction requires substantial data to train scalable models. However, creating yield prediction datasets is constrained by high acquisition costs, heterogeneous data quality, and data privacy regulations. Consequently, existing datasets are scarce, low in quality, or limited to regional levels or single crop types, hindering the development of scalable data-driven solutions. In this work, we release YieldSAT, a large, high-quality, and multimodal dataset for high-resolution crop yield prediction. YieldSAT spans various climate zones across multiple countries, including Argentina, Brazil, Uruguay, and Germany, and includes major crop types, including corn, rapeseed, soybeans, and wheat, across 2,173 expert-curated fields. In total, over 12.2 million yield samples are available, each with a spatial resolution of 10 m. Each field is paired with multispectral satellite imagery, resulting in 113,555 labeled satellite images, complemented by auxiliary environmental data. We demonstrate the potential of large-scale and high-resolution crop yield prediction as a pixel regression task by comparing various deep learning models and data fusion architectures. Furthermore, we highlight open challenges arising from severe distribution shifts in the ground truth data under real-world conditions. To mitigate this, we explore a domain-informed Deep Ensemble approach that exhibits significant performance gains. The dataset is available at https://yieldsat.github.io/.

Keywords: crop yield prediction, multimodal dataset, satellite imagery, deep learning, data fusion, domain adaptation, agricultural AI, pixel regression
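
The YieldSAT abstract refers to a Deep Ensemble approach for pixel-level yield regression. The abstract does not describe how the ensemble is made domain-informed, so the sketch below shows only the plain aggregation step of a generic deep ensemble: averaging per-pixel predictions across members and using their variance as a rough disagreement signal. The function name `ensemble_predict` and the toy numbers are assumptions, not values from the paper.

```python
from statistics import mean, pvariance


def ensemble_predict(member_predictions):
    """Aggregate per-pixel predictions from several ensemble members.

    member_predictions: one list per model, each holding that model's
    predicted yield for the same sequence of pixels.
    Returns (per-pixel mean, per-pixel population variance).
    """
    per_pixel = list(zip(*member_predictions))  # regroup values by pixel
    means = [mean(values) for values in per_pixel]
    variances = [pvariance(values) for values in per_pixel]
    return means, variances


# Toy example: two hypothetical members, two pixels.
means, variances = ensemble_predict([[1.0, 2.0], [3.0, 2.0]])
```

A high variance at a pixel indicates that the members disagree there, which is one simple way to flag inputs that may fall outside the training distribution.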

177. ❌ EmoScene: A Dual-space Dataset for Controllable Affective Image Generation

Authors: Li He, Longtai Zhang, Wenqiang Zhang, Yan Wang, Lizhe Qi Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00933v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

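Scoring rationale: The paper focuses on affective control of text-to-image diffusion models. It builds a dual-space dataset covering affective dimensions and perceptual attributes and proposes a lightweight baseline method. The given keywords all relate directly to large language models (LLMs) or the technical principles of deep learning, whereas the object of study here is diffusion models (a type of generative model), not LLMs. The paper does not involve LLM techniques, training methods, inference optimization, alignment, agent systems, model compression, or AI-for-science applications, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper builds EmoScene, a large-scale dual-space emotion dataset, to address the difficulty text-to-image diffusion models have in controlling scene semantics and fine-grained affective tone, and it demonstrates affect-controllable image generation with a shallow cross-attention modulation baseline.

Abstract translation (Chinese)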

文本到图像扩散模型已实现较高的视觉保真度，但对场景语义与细粒度情感基调的精确控制仍具挑战性。人类视觉情感产生于语境意义（包括效价、唤醒度和支配度）与感知线索（如色彩协调、亮度对比、纹理变化、曲率和空间布局）的快速整合。然而，当前的文本到图像模型很少在统一表征中同时呈现情感维度与感知要素，这限制了其合成具有连贯且细腻情感意图场景的能力。为弥补这一不足，我们构建了EmoScene——一个大规模双空间情感数据集，该数据集联合编码情感维度与感知属性，并以辅助标注形式提供语境语义。EmoScene包含超过三百个真实场景类别的120万张图像，每张图像均标注有离散情感标签、连续VAD（效价-唤醒度-支配度）值、感知描述符及文本描述。多空间分析揭示了离散情感在VAD空间中的分布规律，以及情感如何与场景级感知因子形成系统性关联。为建立评估基准，我们提供了一个轻量级参考基线，通过浅层交叉注意力调制将双空间控制注入冻结的扩散模型主干，以此作为双空间监督所实现情感可控性的可复现验证方案。

Abstract

Text-to-image diffusion models have achieved high visual fidelity, yet precise control over scene semantics and fine-grained affective tone remains challenging. Human visual affect arises from the rapid integration of contextual meaning, including valence, arousal, and dominance, with perceptual cues such as color harmony, luminance contrast, texture variation, curvature, and spatial layout. However, current text-to-image models rarely represent affective and perceptual factors within a unified representation, which limits their ability to synthesize scenes with coherent and nuanced emotional intent. To address this gap, we construct EmoScene, a large-scale dual-space emotion dataset that jointly encodes affective dimensions and perceptual attributes, with contextual semantics provided as supporting annotations. EmoScene contains 1.2M images across more than three hundred real-world scene categories, each annotated with discrete emotion labels, continuous VAD values, perceptual descriptors and textual captions. Multi-space analyses reveal how discrete emotions occupy the VAD space and how affect systematically correlates with scene-level perceptual factors. To benchmark EmoScene, we provide a lightweight reference baseline that injects dual-space controls into a frozen diffusion backbone via shallow cross-attention modulation, serving as a reproducible probe of affect controllability enabled by dual-space supervision.

Keywords: affective image generation, text-to-image diffusion models, dual-space emotion dataset, VAD values, perceptual attributes, scene semantics, cross-attention modulation, emotion controllability

178. ❌ Autoregressive Appearance Prediction for 3D Gaussian Avatars

Authors: Michael Steiner, Zhang Chen, Alexander Richard, Vasu Agrawal, Markus Steinberger, Michael Zollhöfer Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00928v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

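Scoring rationale: The paper focuses on applying 3D Gaussian Splatting and a spatial MLP to human avatar modeling, aiming at high-fidelity, stable avatar driving. Its core techniques belong to computer vision, graphics, and 3D reconstruction rather than innovations in large language models (LLMs) or the technical principles of deep learning. The scoring keywords all relate to LLMs, deep learning principles, or AI applications in science, and the paper's content has no direct connection to them. Every keyword therefore has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a human avatar model based on 3D Gaussian Splatting and a spatial MLP. By autoregressively predicting an appearance latent, it addresses the overfitting and instability that arise when similar poses in the training data correspond to different appearances, achieving high-fidelity, stable avatar driving.

Abstract translation (Chinese)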

要实现逼真且沉浸式的人类数字化身体验，需要捕捉精细的、个性化的细节，例如衣物与头发动态、细微的面部表情以及特征性运动模式。这通常需要大规模、高质量的数据集，但当极其相似的姿态对应不同外观时，此类数据集往往会引入模糊性和虚假相关性。在训练过程中拟合这些细节的模型容易过拟合，并在面对新姿态时产生不稳定、突兀的外观变化。我们提出了一种基于空间多层感知机（MLP）主干网络的3D高斯溅射（3D Gaussian Splatting）化身模型，该模型同时以姿态和外观隐变量（appearance latent）为条件。该隐变量在训练期间由编码器学习得到，形成一个紧凑的表示，这既提升了重建质量，也有助于消除姿态驱动渲染中的歧义。在驱动阶段，我们的预测器以自回归方式推断该隐变量，从而产生时间上平滑的外观演变并提升稳定性。总体而言，我们的方法为实现高保真、稳定的化身驱动提供了一条鲁棒且实用的路径。

Abstract

A photorealistic and immersive human avatar experience demands capturing fine, person-specific details such as cloth and hair dynamics, subtle facial expressions, and characteristic motion patterns. Achieving this requires large, high-quality datasets, which often introduce ambiguities and spurious correlations when very similar poses correspond to different appearances. Models that fit these details during training can overfit and produce unstable, abrupt appearance changes for novel poses. We propose a 3D Gaussian Splatting avatar model with a spatial MLP backbone that is conditioned on both pose and an appearance latent. The latent is learned during training by an encoder, yielding a compact representation that improves reconstruction quality and helps disambiguate pose-driven renderings. At driving time, our predictor autoregressively infers the latent, producing temporally smooth appearance evolution and improved stability. Overall, our method delivers a robust and practical path to high-fidelity, stable avatar driving.

Keywords: 3D Gaussian Splatting, avatar model, spatial MLP, appearance latent, autoregressive prediction, temporal smoothness, high-fidelity, stable driving

179. ❌ ProCap: Projection-Aware Captioning for Spatial Augmented Reality

Authors: Zimo Cao, Yuchen Deng, Haibin Ling, Bingyao Huang Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00912v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

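Scoring rationale: ProCap focuses on applying vision-language models (VLMs) to spatial augmented reality (SAR), addressing the semantic distinction between virtual and physical scenes. It involves computer vision (e.g., segmentation, retrieval) and dataset construction, but no innovation in the technical principles of large models (e.g., LLM architecture, training methods, inference optimization) and no application of AI to science (e.g., bioinformatics). All keywords relate directly to large-model techniques or scientific AI applications, whereas the core of this paper is adapting VLMs to a specific AR scenario, which has no direct connection to those keywords.

!!! tip deepseek-chat TL;DR

The paper proposes the ProCap framework, which uses visual segmentation and region-aware retrieval to resolve the semantic confusion between virtual and physical scenes in spatial augmented reality, and introduces the RGBP dataset and a dual-captioning evaluation protocol.

Abstract (Chinese translation)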

空间增强现实（SAR）通过投影仪将数字内容直接投射至物理场景，无需头戴显示器即可创造沉浸式体验。然而，要使SAR支持智能交互（如场景推理或用户查询应答），系统必须从语义上区分物理场景与投射内容。标准的视觉语言模型（VLMs）难以处理这种虚实模糊性，常混淆两种情境。为解决此问题，我们提出ProCap——一种显式解耦投射内容与物理场景的新型框架。ProCap采用两阶段流程：首先通过自动分割在视觉上分离虚拟层与物理层；随后利用区域感知检索避免因投影变形导致的语义上下文歧义。为此，我们构建了首个大规模SAR语义基准数据集RGBP（RGB + Projections），包含65个多样化物理场景及超过18万项投射内容，并配有密集的解耦标注。最后，我们建立了基于任务特定标记的双重描述评估协议，以独立评估物理场景与投射内容的描述质量。实验表明，ProCap为未来SAR研究提供了坚实的语义基础。项目源代码、预训练模型及RGBP数据集已发布于项目页面：https://ZimoCao.github.io/ProCap/。

Abstract

Spatial augmented reality (SAR) directly projects digital content onto physical scenes using projectors, creating immersive experience without head-mounted displays. However, for SAR to support intelligent interaction, such as reasoning about the scene or answering user queries, it must semantically distinguish between the physical scene and the projected content. Standard Vision Language Models (VLMs) struggle with this virtual-physical ambiguity, often confusing the two contexts. To address this issue, we introduce ProCap, a novel framework that explicitly decouples projected content from physical scenes. ProCap employs a two-stage pipeline: first it visually isolates virtual and physical layers via automated segmentation; then it uses region-aware retrieval to avoid ambiguous semantic context due to projection distortion. To support this, we present RGBP (RGB + Projections), the first large-scale SAR semantic benchmark dataset, featuring 65 diverse physical scenes and over 180,000 projections with dense, decoupled annotations. Finally, we establish a dual-captioning evaluation protocol using task-specific tokens to assess physical scene and projection descriptions independently. Our experiments show that ProCap provides a robust semantic foundation for future SAR research. The source code, pre-trained models and the RGBP dataset are available on the project page: https://ZimoCao.github.io/ProCap/.

Keywords: Spatial Augmented Reality, Vision Language Models, Virtual-Physical Ambiguity, Projection-Aware Captioning, RGBP Dataset, Dual-Captioning Evaluation, Semantic Decoupling, Region-Aware Retrieval

180. ❌ JAMMEval: A Refined Collection of Japanese Benchmarks for Reliable VLM Evaluation

Authors: Issa Sugiura, Koki Maeda, Shuhei Kurita, Yusuke Oda, Daisuke Kawahara, Naoaki Okazaki Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00909v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

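Scoring rationale: This paper focuses on building evaluation benchmarks for vision-language models (VLMs), specifically refining and improving the quality of datasets for Japanese VQA tasks. Although it touches on large-model evaluation, all keywords target the technical principles, training methods, inference optimization, and application scenarios of text-based large language models (LLMs), whereas the paper studies evaluation benchmarks for VLMs, a different model type and technical area. The paper does not address any LLM-related technical principles, training methods, inference optimization, or specific applications, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper addresses quality problems in Japanese visual question answering (VQA) benchmarks by systematically refining seven existing datasets through two rounds of human annotation to build the JAMMEval evaluation collection. Experiments show that the refined benchmarks evaluate vision-language models more reliably and reduce evaluation variance.

Abstract (Chinese translation)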

可靠评估对于视觉语言模型（VLMs）的发展至关重要。然而，日语视觉问答（VQA）基准数据集相较于其英语对应版本，经历的迭代优化明显不足。因此，许多现有基准存在诸如问题表述模糊、答案错误以及无需视觉基础即可解答的实例等问题，这削弱了评估的可靠性，并在模型比较中导致误导性结论。为应对这些局限，我们提出了JAMMEval——一个经过精炼的日语基准数据集集合，旨在实现可靠的VLM评估。该数据集通过两轮人工标注，对七个现有日语基准数据集进行了系统性优化，从而提升了数据质量和评估可靠性。在我们的实验中，我们使用JAMMEval对开源权重模型和专有VLMs进行了评估，并分析了近期模型在日语VQA任务上的能力。我们进一步证明了优化工作的有效性：优化后的基准数据集产生的评估分数能更好地反映模型能力，表现出更低的运行间方差，并提升了区分不同能力水平模型的能力。我们将公开数据集和代码，以推动VLMs的可靠评估。

Abstract

Reliable evaluation is essential for the development of vision-language models (VLMs). However, Japanese VQA benchmarks have undergone far less iterative refinement than their English counterparts. As a result, many existing benchmarks contain issues such as ambiguous questions, incorrect answers, and instances that can be solved without visual grounding, undermining evaluation reliability and leading to misleading conclusions in model comparisons. To address these limitations, we introduce JAMMEval, a refined collection of Japanese benchmarks for reliable VLM evaluation. It is constructed by systematically refining seven existing Japanese benchmark datasets through two rounds of human annotation, improving both data quality and evaluation reliability. In our experiments, we evaluate open-weight and proprietary VLMs on JAMMEval and analyze the capabilities of recent models on Japanese VQA. We further demonstrate the effectiveness of our refinement by showing that the resulting benchmarks yield evaluation scores that better reflect model capability, exhibit lower run-to-run variance, and improve the ability to distinguish between models of different capability levels. We release our dataset and code to advance reliable evaluation of VLMs.

Keywords: vision-language models, VLM evaluation, Japanese benchmarks, VQA, data quality, human annotation, model capability, evaluation reliability

181. ❌ IDDM: Identity-Decoupled Personalized Diffusion Models with a Tunable Privacy-Utility Trade-off

Authors: Linyan Dai, Xinwei Zhang, Haoyang Li, Qingqing Ye, Haibo Hu Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00903v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

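Scoring rationale: The paper focuses on privacy protection for personalized text-to-image diffusion models (e.g., DreamBooth, LoRA) and proposes IDDM to balance identity linkability against generation utility. It is highly relevant only to the keyword “PEFT OR LoRA OR Parameter-efficient Fine-tuning” (10 points), because the paper explicitly names LoRA as a representative technique for personalized diffusion models and involves parameter-efficient fine-tuning. No other keyword is mentioned in the title or abstract or relates to the paper's content, so each is scored 0.

!!! tip deepseek-chat TL;DR

To address identity leakage when outputs of personalized text-to-image diffusion models are shared on social media, the paper proposes IDDM, which uses identity-decoupled optimization to reduce identity linkability under authorized personalization while preserving high-quality generation.

Abstract (Chinese translation)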

个性化文本到图像扩散模型（如DreamBooth、LoRA）允许用户通过少量参考照片合成高保真虚拟形象以用于社交表达。然而，一旦这些生成内容在社交媒体平台（如Instagram、Facebook）上被分享，它们可能通过人脸识别系统与真实用户关联，从而引发身份追踪与画像分析。现有防御方法主要遵循反个性化策略，即通过干扰模型微调来保护公开发布的参考照片。尽管这类方法能有效防范未经授权的个性化使用，但未能解决另一种实际场景：即个性化过程已获授权，但由此产生的公开输出仍会泄露身份信息。
为解决这一问题，我们提出一种新的防御设定，称为模型侧输出免疫，其目标是构建一个支持授权个性化、同时降低公开生成内容身份可关联性的个性化模型，并通过可调节的隐私-效用权衡机制来适应多样化的隐私需求。为此，我们提出身份解耦个性化扩散模型（Identity-Decoupled personalized Diffusion Models, IDDM），这是一种集成身份解耦机制到个性化流程中的模型侧防御方法。具体而言，IDDM采用交替优化流程，将短时个性化更新与身份解耦数据优化交错进行，并通过两阶段调度策略平衡身份可关联性抑制与生成效用。在多个数据集、多样化文本提示及前沿人脸识别系统上的大量实验表明，IDDM在保持高质量个性化生成的同时，能持续降低身份可关联性。

Abstract

Personalized text-to-image diffusion models (e.g., DreamBooth, LoRA) enable users to synthesize high-fidelity avatars from a few reference photos for social expression. However, once these generations are shared on social media platforms (e.g., Instagram, Facebook), they can be linked to the real user via face recognition systems, enabling identity tracking and profiling. Existing defenses mainly follow an anti-personalization strategy that protects publicly released reference photos by disrupting model fine-tuning. While effective against unauthorized personalization, they do not address another practical setting in which personalization is authorized, but the resulting public outputs still leak identity information. To address this problem, we introduce a new defense setting, termed model-side output immunization, whose goal is to produce a personalized model that supports authorized personalization while reducing the identity linkability of public generations, with tunable control over the privacy-utility trade-off to accommodate diverse privacy needs. To this end, we propose Identity-Decoupled personalized Diffusion Models (IDDM), a model-side defense that integrates identity decoupling into the personalization pipeline. Concretely, IDDM follows an alternating procedure that interleaves short personalization updates with identity-decoupled data optimization, using a two-stage schedule to balance identity linkability suppression and generation utility. Extensive experiments across multiple datasets, diverse prompts, and state-of-the-art face recognition systems show that IDDM consistently reduces identity linkability while preserving high-quality personalized generation.

Keywords: personalized diffusion models, identity linkability, privacy-utility trade-off, LoRA, face recognition, model-side defense, identity decoupling, personalization

182. ❌ Super-Resolving Coarse-Resolution Weather Forecasts With Flow Matching

Authors: Aymeric Delefosse, Anastase Charantonis, Dominique Béréziat Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00897v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

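Scoring rationale: The paper studies machine-learning weather forecasting models and uses flow matching for generative super-resolution, which is an application of AI to science (meteorology). Among all keywords, only “AI for Science OR Bioinformatics OR Cheminformatics” is somewhat relevant (5 points), because it covers applications of AI in scientific fields and the paper is a meteorological application. The other keywords concern large language model (LLM) techniques, training methods, inference optimization, alignment, agent systems, and so on, which are entirely unrelated to the paper's topics of machine-learning weather forecasting and super-resolution, so each is scored 0.

!!! tip deepseek-chat TL;DR

The paper proposes a flow-matching-based generative super-resolution framework that raises the spatial resolution of coarse-resolution weather forecasts. It preserves large-scale structure while generating physically consistent small-scale variability, and achieves probabilistic forecast skill at 0.25° resolution that is competitive with an operational ensemble baseline at modest additional training cost.

Abstract (Chinese translation)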

基于机器学习的天气预报模型现已超越最先进的数值天气预报系统，但在高空间分辨率下训练和运行这些模型仍面临高昂的计算成本。我们提出一种模块化框架，通过将学习式生成性超分辨率作为粗分辨率预报轨迹的后处理步骤，实现预报过程与空间分辨率的解耦。我们将超分辨率构建为一个随机逆问题，采用残差公式以在重建未解析变异性的同时保持大尺度结构。该模型完全基于再分析数据通过流匹配进行训练，并应用于全球中期预报。我们通过以下两方面进行评估：（一）设计一致性：将超分辨率预报重新粗化并与原始粗分辨率轨迹对比；（二）高分辨率预报质量：使用标准集合验证指标和谱诊断方法。结果表明，超分辨率方法在重新粗化后能保持大尺度结构和方差，引入物理一致的小尺度变异性，并在0.25°分辨率上相对于业务化集合基准取得了具有竞争力的概率预报技巧，同时相较于端到端高分辨率预报仅需适度的额外训练成本。

Abstract

Machine learning-based weather forecasting models now surpass state-of-the-art numerical weather prediction systems, but training and operating these models at high spatial resolution remains computationally expensive. We present a modular framework that decouples forecasting from spatial resolution by applying learned generative super-resolution as a post-processing step to coarse-resolution forecast trajectories. We formulate super-resolution as a stochastic inverse problem, using a residual formulation to preserve large-scale structure while reconstructing unresolved variability. The model is trained with flow matching exclusively on reanalysis data and is applied to global medium-range forecasts. We evaluate (i) design consistency by re-coarsening super-resolved forecasts and comparing them to the original coarse trajectories, and (ii) high-resolution forecast quality using standard ensemble verification metrics and spectral diagnostics. Results show that super-resolution preserves large-scale structure and variance after re-coarsening, introduces physically consistent small-scale variability, and achieves competitive probabilistic forecast skill at 0.25° resolution relative to an operational ensemble baseline, while requiring only a modest additional training cost compared with end-to-end high-resolution forecasting.

Keywords: weather forecasting, super-resolution, flow matching, generative model, machine learning, spatial resolution, probabilistic forecast, reanalysis data
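
As a rough illustration of the "design consistency" check mentioned in the abstract above, the sketch below re-coarsens a super-resolved field and compares it with the original coarse field. Everything in it is an assumption made for illustration and is not taken from the paper: the block-averaging re-coarsening operator, the nearest-neighbour upsampling, the zero-block-mean projection of the residual, and the names `coarsen`, `upsample`, and `super_resolve` are all hypothetical. In the paper the residual is produced by a model trained with flow matching; here a random array stands in for it.

```python
import numpy as np


def coarsen(field: np.ndarray, factor: int) -> np.ndarray:
    """Block-average a 2D field by `factor` along each axis (assumed re-coarsening operator)."""
    h, w = field.shape
    assert h % factor == 0 and w % factor == 0, "shape must be divisible by factor"
    return field.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))


def upsample(field: np.ndarray, factor: int) -> np.ndarray:
    """Nearest-neighbour upsampling: repeat every coarse cell `factor` times per axis."""
    return np.repeat(np.repeat(field, factor, axis=0), factor, axis=1)


def super_resolve(coarse: np.ndarray, residual: np.ndarray, factor: int) -> np.ndarray:
    """Residual formulation (illustrative): upsampled coarse field plus small-scale detail.

    The residual is projected so that its mean over every coarse block is zero,
    which keeps the large-scale structure unchanged after re-coarsening.
    """
    residual = residual - upsample(coarsen(residual, factor), factor)
    return upsample(coarse, factor) + residual


rng = np.random.default_rng(0)
factor = 4
coarse = rng.normal(size=(8, 16))                    # hypothetical coarse forecast field
raw_residual = rng.normal(scale=0.3, size=(32, 64))  # stand-in for a generative sample
fine = super_resolve(coarse, raw_residual, factor)

# Design-consistency check: re-coarsen and compare with the original coarse field.
max_err = float(np.abs(coarsen(fine, factor) - coarse).max())
print(f"max re-coarsening error: {max_err:.2e}")
```

Under these assumptions the re-coarsening error is at floating-point level by construction. For a learned model the same comparison would act as a diagnostic, not a guarantee.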

183. ❌ Adversarial Attenuation Patch Attack for SAR Object Detection

Authors: Yiming Zhang, Weibo Qin, Feng Wang Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00887v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

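Scoring rationale: This paper studies adversarial attacks on synthetic aperture radar (SAR) object detection and proposes an Adversarial Attenuation Patch (AAP) method based on energy-constrained optimization. Its subject belongs to computer vision and adversarial machine learning and is entirely unrelated to the scoring keywords, which all center on large models, their underlying deep learning techniques, and AI applications in science. The paper does not cover large models, language models, model training, inference optimization, AI agents, or scientific AI applications.

!!! tip deepseek-chat TL;DR

The paper proposes an adversarial attenuation patch attack against SAR object detection systems, uses energy-constrained optimization to balance attack effectiveness and stealthiness, and shows potential for physical realization.

Abstract (Chinese translation)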

深度神经网络在合成孔径雷达目标检测任务中已展现出卓越性能，但仍易受对抗性攻击影响。现有的SAR专用攻击方法虽能有效欺骗检测器，却常引入明显扰动，且大多局限于数字域，忽视了攻击SAR系统时物理实现的约束。本文提出一种新颖的对抗性衰减补丁方法，该方法采用能量约束优化策略并结合基于衰减的部署框架，在攻击效能与隐蔽性之间实现了无缝平衡。更重要的是，AAP通过与信号级电子干扰机制相契合，展现出强大的物理实现潜力。实验结果表明，AAP能有效降低检测性能，同时保持高度不可感知性，并在不同模型间表现出良好的可迁移性。本研究为SAR目标检测系统的对抗性攻击提供了物理基础视角，有助于设计更隐蔽且实际可部署的攻击策略。源代码已公开于https://github.com/boremycin/SAAP。

Abstract

Deep neural networks have demonstrated excellent performance in SAR target detection tasks but remain susceptible to adversarial attacks. Existing SAR-specific attack methods can effectively deceive detectors; however, they often introduce noticeable perturbations and are largely confined to the digital domain, neglecting physical implementation constraints for attacking SAR systems. In this paper, a novel Adversarial Attenuation Patch (AAP) method is proposed that employs an energy-constrained optimization strategy coupled with an attenuation-based deployment framework to achieve a seamless balance between attack effectiveness and stealthiness. More importantly, AAP exhibits strong potential for physical realization by aligning with signal-level electronic jamming mechanisms. Experimental results show that AAP effectively degrades detection performance while preserving high imperceptibility, and shows favorable transferability across different models. This study provides a physically grounded perspective for adversarial attacks on SAR target detection systems and facilitates the design of more covert and practically deployable attack strategies. The source code is made available at https://github.com/boremycin/SAAP.

Keywords: Adversarial Attack, SAR Object Detection, Attenuation Patch, Energy-constrained Optimization, Physical Realization, Transferability, Stealthiness, Electronic Jamming

184. ❌ A 4D Representation for Training-Free Agentic Reasoning from Monocular Laparoscopic Video

Authors: Maximilian Fehrentz, Nicolas Stellwag, Robert Wiebe, Nicole Thorisch, Fabian Grob, Patrick Remerscheid, Ken-Joel Simmoteit, Benjamin D. Killeen, Christian Heiliger, Nassir Navab Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00867v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

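Scoring rationale: The paper proposes a surgical agent framework built on a 4D representation. Its core is a multimodal large language model (MLLM) acting as an agent that performs spatiotemporal reasoning without fine-tuning. It is therefore highly relevant to “Large Language Models”, “LLM Agents”, “Tool Use”, “Chain of Thought”, and “System 2 Thinking” (10 points each). The paper also focuses on surgical AI, an application of “AI for Science” in the biomedical domain (10 points). Other keywords such as MoE, quantization, RAG, and RLHF are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The study proposes a surgical agent framework built on a 4D representation. By combining 2D multimodal large language models with 3D computer vision models, it enables spatiotemporal reasoning and agentic decision-making on surgical video without additional training, and significantly improves spatiotemporal understanding.

Abstract (Chinese translation)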

时空推理是人工智能在软组织手术中的一项基础能力，为智能辅助系统和自主机器人技术铺平了道路。尽管二维视觉语言模型在理解手术视频方面展现出日益广阔的前景，但手术场景的空间复杂性表明，推理系统可能受益于显式的四维表征。本文提出一个框架，旨在基于显式四维表征为手术智能体配备时空推理工具，使人工智能系统能够将其自然语言推理锚定于时间和三维空间之中。通过利用点追踪、深度估计和分割模型，我们构建了一个具有时空一致的手术器械与组织语义的连贯四维模型。随后，一个多模态大语言模型作为智能体，直接基于从显式四维表征（如轨迹）中提取的工具进行推理，而无需任何微调。我们在一个包含134个临床相关问题的新数据集上评估了我们的方法，发现通用推理主干与我们的四维表征相结合，能显著提升时空理解能力并实现四维锚定。我们证明，时空智能可以从二维多模态大语言模型和三维计算机视觉模型中“组装”而来，无需额外训练。代码、数据及示例可在 https://tum-ai.github.io/surg4d/ 获取。

Abstract

Spatiotemporal reasoning is a fundamental capability for artificial intelligence (AI) in soft tissue surgery, paving the way for intelligent assistive systems and autonomous robotics. While 2D vision-language models show increasing promise at understanding surgical video, the spatial complexity of surgical scenes suggests that reasoning systems may benefit from explicit 4D representations. Here, we propose a framework for equipping surgical agents with spatiotemporal tools based on an explicit 4D representation, enabling AI systems to ground their natural language reasoning in both time and 3D space. Leveraging models for point tracking, depth, and segmentation, we develop a coherent 4D model with spatiotemporally consistent tool and tissue semantics. A Multimodal Large Language Model (MLLM) then acts as an agent on tools derived from the explicit 4D representation (e.g., trajectories) without any fine-tuning. We evaluate our method on a new dataset of 134 clinically relevant questions and find that the combination of a general purpose reasoning backbone and our 4D representation significantly improves spatiotemporal understanding and allows for 4D grounding. We demonstrate that spatiotemporal intelligence can be “assembled” from 2D MLLMs and 3D computer vision models without additional training. Code, data, and examples are available at https://tum-ai.github.io/surg4d/

Keywords: 4D representation, surgical agents, Multimodal Large Language Model (MLLM), spatiotemporal reasoning, training-free, agentic reasoning, laparoscopic video, AI for surgery

185. ❌ Shape Representation using Gaussian Process mixture models

Authors: Panagiotis Sapoutzoglou, George Terzakis, Georgios Floros, Maria Pateraki Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00862v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

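Scoring rationale: The paper proposes a new 3D shape representation based on Gaussian Process mixture models, within computer vision and geometry processing. Its core techniques are Gaussian Processes (GPs) and mixture models, used to learn continuous directional distance fields from sparse point clouds. All of the given scoring keywords relate directly to large language models (LLMs), deep learning technical principles, AI for Science applications, or the associated training and inference techniques. The paper's subject matter (3D shape representation, Gaussian Processes, geometric modeling) is entirely different from the keyword topics (such as LLMs, MoE, Scaling Laws, Alignment, RAG, and Agents), with no overlap or connection. Every keyword therefore receives a relevance score of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a lightweight, object-specific functional shape representation based on Gaussian Process mixture models that efficiently and accurately represents complex 3D geometry from sparse point clouds.

Abstract (Chinese translation)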

传统的显式三维表示（如点云和网格）需要大量存储空间以捕捉精细几何细节，并依赖复杂的索引系统进行表面查询，这使得函数式表示成为一种高效、紧凑且连续的替代方案。本研究提出一种新颖的、面向特定物体的函数式形状表示方法，该方法采用高斯过程混合模型对表面几何进行建模。与依赖计算密集型神经架构的方法不同，本方法通过高斯过程从稀疏采样的点云中学习连续方向距离场，实现了轻量化建模。我们通过在策略性参考点处锚定局部高斯过程先验来捕捉复杂拓扑结构，这些参考点可通过任意结构分解方法（如骨架化、基于距离的聚类）灵活提取。在ShapeNetCore和IndustryShapes数据集上的大量实验表明，本方法能够高效且精确地表示复杂几何结构。

Abstract

Traditional explicit 3D representations, such as point clouds and meshes, demand significant storage to capture fine geometric details and require complex indexing systems for surface lookups, making functional representations an efficient, compact, and continuous alternative. In this work, we propose a novel, object-specific functional shape representation that models surface geometry with Gaussian Process (GP) mixture models. Rather than relying on computationally heavy neural architectures, our method is lightweight, leveraging GPs to learn continuous directional distance fields from sparsely sampled point clouds. We capture complex topologies by anchoring local GP priors at strategic reference points, which can be flexibly extracted using any structural decomposition method (e.g. skeletonization, distance-based clustering). Extensive evaluations on the ShapeNetCore and IndustryShapes datasets demonstrate that our method can efficiently and accurately represent complex geometries.

Keywords: 3D shape representation, Gaussian Process mixture models, functional representation, directional distance fields, sparse point clouds, geometric modeling, ShapeNetCore, IndustryShapes

186. ❌ Sparkle: A Robust and Versatile Representation for Point Cloud based Human Motion Capture

Authors: Yiming Ren, Yujing Sun, Aoru Xue, Kwok-Yan Lam, Yuexin Ma Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00857v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

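Scoring rationale: The paper focuses on representation learning for point-cloud-based human motion capture and proposes a structured representation that combines skeletal joints with surface anchors. It involves computer vision, geometry processing, and motion analysis, but not large language models, innovations in deep learning technical principles, or any technique named in the scoring keywords (such as MoE, RLHF, RAG, or quantization). All keywords relate directly to large-model techniques, training methods, inference optimization, alignment, agent systems, and the like, whereas the paper's research area (point cloud motion capture) has no connection to them, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper addresses the difficulty of balancing expressiveness and robustness in point-cloud-based human motion capture. It proposes Sparkle, a structured representation that explicitly disentangles internal kinematic structure from external surface geometry, and reports state-of-the-art accuracy, robustness, and generalization.

Abstract (Chinese translation)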

基于点云的运动捕捉技术利用其丰富的空间几何特性与隐私保护感知优势，但如何从噪声干扰、非结构化的点云数据中学习鲁棒表征仍具挑战。现有方法常面临基于点的方法（几何细节丰富但噪声敏感）与基于骨架的方法（鲁棒但过度简化）之间的艰难权衡。本文致力于解决核心问题：如何构建一种能兼顾表达力与鲁棒性的人体运动捕捉表征。我们提出Sparkle——一种通过显式运动学-几何分解统一骨骼关节点与表面锚点的结构化表征。我们的框架SparkleMotion通过嵌入几何连续性与运动学约束的层次化模块学习该表征。通过显式解耦内部运动学结构与外部表面几何，SparkleMotion不仅在精度上达到最优性能，更在严重域偏移、噪声及遮挡条件下展现出关键的鲁棒性与泛化能力。大量实验验证了我们在多种传感器类型及复杂现实场景中的优越性。

Abstract

Point cloud-based motion capture leverages rich spatial geometry and privacy-preserving sensing, but learning robust representations from noisy, unstructured point clouds remains challenging. Existing approaches face a difficult trade-off between point-based methods (geometrically detailed but noisy) and skeleton-based ones (robust but oversimplified). We address the fundamental challenge: how to construct an effective representation for human motion capture that can balance expressiveness and robustness. In this paper, we propose Sparkle, a structured representation unifying skeletal joints and surface anchors with explicit kinematic-geometric factorization. Our framework, SparkleMotion, learns this representation through hierarchical modules embedding geometric continuity and kinematic constraints. By explicitly disentangling internal kinematic structure from external surface geometry, SparkleMotion achieves state-of-the-art performance not only in accuracy but crucially in robustness and generalization under severe domain shifts, noise, and occlusion. Extensive experiments demonstrate our superiority across diverse sensor types and challenging real-world scenarios.

Keywords: Point Cloud, Human Motion Capture, Representation Learning, Kinematic-Geometric Factorization, Robustness, Generalization, Sparkle, SparkleMotion

187. ❌ Perturb-and-Restore: Simulation-driven Structural Augmentation Framework for Imbalance Chromosomal Anomaly Detection

Authors: Yilan Zhang, Hanbiao Chen, Changchun Yang, Yuetan Chu, Siyuan Chen, Jing Wu, Jingdong Hu, Na Li, Junkai Su, Yuxuan Chen, Ao Xu, Xin Gao, Aihua Yin Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00854v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

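Score rationale: The paper applies deep learning to chromosomal anomaly detection, which falls within bioinformatics, so it is highly relevant to the keyword 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points): it directly applies AI to a biomedical problem. The other keywords concern large-model principles, training methods, inference optimization, and agent systems. The paper mentions no large models, LLMs, MoE, training techniques (pre-training, fine-tuning, alignment), reasoning methods (CoT, attention optimization), or agent concepts, so they all receive 0.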
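
The report configuration gives a per-keyword cap of 15 and a passing threshold of 5.0 + 0.8 × total weight. The sketch below shows one way a score table like the one above could be reduced to a total and compared with that threshold. The per-keyword rule (weight × relevance, then capped) and the function names are assumptions made for illustration; the report does not state the exact rule.

```python
# Hypothetical scoring helpers; the per-keyword rule is an assumption.
MAX_PER_KEYWORD = 15.0  # cap stated in the report configuration


def keyword_score(weight: float, relevance: float) -> float:
    """Assumed rule: weight times relevance (0 to 10), capped per keyword."""
    return min(weight * relevance, MAX_PER_KEYWORD)


def passing_score(total_weight: float) -> float:
    """Threshold formula taken from the report configuration."""
    return 5.0 + 0.8 * total_weight


def total_score(rows) -> float:
    """rows: iterable of (weight, relevance) pairs, one per keyword."""
    return sum(keyword_score(w, r) for w, r in rows)


# The table above: 26 keywords at relevance 0 and one at relevance 10,
# all listed with weight 0.0.
rows = [(0.0, 0.0)] * 26 + [(0.0, 10.0)]
threshold = passing_score(27.0)  # 27 keywords configured with weight 1.0 each
print(total_score(rows), round(threshold, 1), total_score(rows) >= threshold)
```

Under this assumed rule, a weight of 0.0 yields a score of 0.0 whatever the relevance, which matches the last column of the table.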

!!! tip deepseek-chat TL;DR

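The paper proposes Perturb-and-Restore, a simulation-driven structural augmentation framework that addresses data imbalance and scarcity in chromosomal anomaly detection. Experiments show average gains over existing methods of 8.92% in sensitivity, 8.89% in precision, and 13.79% in F1-score.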

Abstract

Detecting structural chromosomal abnormalities is crucial for accurate diagnosis and management of genetic disorders. However, collecting sufficient structural abnormality data is extremely challenging and costly in clinical practice, and not all abnormal types can be readily collected. As a result, deep learning approaches face significant performance degradation due to the severe imbalance and scarcity of abnormal chromosome data. To address this challenge, we propose Perturb-and-Restore (P&R), a simulation-driven structural augmentation framework that effectively alleviates data imbalance in chromosome anomaly detection. The P&R framework comprises two key components: (1) Structure Perturbation and Restoration Simulation, which generates synthetic abnormal chromosomes by perturbing chromosomal banding patterns of normal chromosomes followed by a restoration diffusion network that reconstructs continuous chromosome content and edges, thus eliminating reliance on rare abnormal samples; and (2) Energy-guided Adaptive Sampling, an energy score-based online selection strategy that dynamically prioritizes high-quality synthetic samples by referencing the energy distribution of real samples. To evaluate our method, we construct a comprehensive structural anomaly dataset consisting of over 260,000 chromosome images, including 4,242 abnormal samples spanning 24 categories. Experimental results demonstrate that the P&R framework achieves state-of-the-art (SOTA) performance, surpassing existing methods with an average improvement of 8.92% in sensitivity, 8.89% in precision, and 13.79% in F1-score across all categories.
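
The three metrics quoted in the abstract have standard definitions, sketched below from per-class counts of true positives, false positives, and false negatives. The equal-weight (macro) average over classes is an assumption about what "across all categories" means, and the counts are made up for illustration; none of this is taken from the paper's evaluation code.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Share of truly abnormal samples that were flagged (recall)."""
    return tp / (tp + fn) if tp + fn else 0.0


def precision(tp: int, fp: int) -> float:
    """Share of flagged samples that are truly abnormal."""
    return tp / (tp + fp) if tp + fp else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and sensitivity."""
    p, r = precision(tp, fp), sensitivity(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0


def macro_average(counts, metric) -> float:
    """Average a per-class metric with equal weight per class (assumed)."""
    return sum(metric(c) for c in counts) / len(counts)


# Made-up (tp, fp, fn) counts for three abnormality classes.
counts = [(40, 10, 10), (5, 5, 15), (18, 2, 6)]
print(macro_average(counts, lambda c: sensitivity(c[0], c[2])))
print(macro_average(counts, lambda c: precision(c[0], c[1])))
print(macro_average(counts, lambda c: f1_score(*c)))
```

With a macro average, a rare class with few samples moves the result as much as a common one, which is why the choice of averaging matters on an imbalanced dataset.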

Keywords: chromosomal anomaly detection, data imbalance, structural augmentation, simulation-driven framework, diffusion network, energy-guided sampling, bioinformatics, deep learning

188. ❌ MotionGrounder: Grounded Multi-Object Motion Transfer via Diffusion Transformer

Authors: Samuel Teodoro, Yun Chen, Agus Gunawan, Soo Ye Kim, Jihyong Oh, Munchurl Kim Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00853v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

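Score rationale: The paper studies multi-object motion transfer for video generation with a Diffusion Transformer, which belongs to computer vision and generative modeling. It has no direct connection to any scoring keyword (all of which focus on LLM principles, training methods, inference optimization, or applications), so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes MotionGrounder, the first Diffusion Transformer-based framework for motion transfer with multi-object controllability. Its Flow-based Motion Signal and Object-Caption Alignment Loss improve spatial alignment and semantic consistency, and it outperforms existing baselines in quantitative, qualitative, and human evaluations.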

Abstract

Motion transfer enables controllable video generation by transferring temporal dynamics from a reference video to synthesize a new video conditioned on a target caption. However, existing Diffusion Transformer (DiT)-based methods are limited to single-object videos, restricting fine-grained control in real-world scenes with multiple objects. In this work, we introduce MotionGrounder, a DiT-based framework that is the first to handle motion transfer with multi-object controllability. Our Flow-based Motion Signal (FMS) in MotionGrounder provides a stable motion prior for target video generation, while our Object-Caption Alignment Loss (OCAL) grounds object captions to their corresponding spatial regions. We further propose a new Object Grounding Score (OGS), which jointly evaluates (i) spatial alignment between source video objects and their generated counterparts and (ii) semantic consistency between each generated object and its target caption. Our experiments show that MotionGrounder consistently outperforms recent baselines across quantitative, qualitative, and human evaluations.

Keywords: Motion Transfer, Diffusion Transformer, Multi-object Controllability, Video Generation, Object Grounding, Spatial Alignment, Semantic Consistency

189. ❌ Disentangling to Re-couple: Resolving the Similarity-Controllability Paradox in Subject-Driven Text-to-Image Generation

Authors: Shuang Li, Chao Deng, Hang Chen, Liqun Liu, Zhenyu Hu, Te Cao, Mengge Xue, Yuan Chen, Peng Shu, Huan Yu, Jie Jiang Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00849v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

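Score rationale: The paper focuses on subject-driven editing in text-to-image generation and proposes DisCo, a framework that resolves the similarity-controllability paradox by disentangling and then re-coupling visual and textual information, achieving high-fidelity subject preservation with precise textual control. Although it involves deep learning (diffusion models in particular) and reinforcement learning, every keyword concerns either LLMs and related techniques (MoE, scaling laws, alignment, RAG, reasoning, agents, quantization) or AI applications in specific scientific fields (such as bioinformatics). The paper studies generative models in computer vision and involves neither, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper addresses the similarity-controllability paradox in subject-driven text-to-image generation. It proposes DisCo, which disentangles and then re-couples visual and textual information to achieve high-fidelity subject preservation and precise textual control at the same time, and reports state-of-the-art results.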

Abstract

Subject-Driven Text-to-Image (T2I) Generation aims to preserve a subject’s identity while editing its context based on a text prompt. A core challenge in this task is the “similarity-controllability paradox”, where enhancing textual control often degrades the subject’s fidelity, and vice versa. We argue this paradox stems from the ambiguous role of text prompts, which are often tasked with describing both the subject and the desired modifications, leading to conflicting signals for the model. To resolve this, we propose DisCo, a novel framework that first Disentangles and then re-Couples visual and textual information. First, our textual-visual decoupling module isolates the sources of information: subject identity is extracted exclusively from the reference image with the entity word of the subject, while the text prompt is simplified to contain only the modification command, in which the subject is referred to by general pronouns, eliminating descriptive ambiguity. However, this strict separation can lead to unnatural compositions between the subject and its contexts. We address this by designing a dedicated reward signal and using reinforcement learning to seamlessly recouple the visually-defined subject and the textually-generated context. Our approach effectively resolves the paradox, enabling simultaneous high-fidelity subject preservation and precise textual control. Extensive experiments demonstrate that our method achieves state-of-the-art performance, producing highly realistic and coherent images.

Keywords: Subject-Driven Text-to-Image Generation, similarity-controllability paradox, disentangling, recoupling, reinforcement learning, visual-textual decoupling, high-fidelity subject preservation, textual control

190. ❌ Video Patch Pruning: Efficient Video Instance Segmentation via Early Token Reduction

Authors: Patrick Glandorf, Thomas Norrenbrock, Bodo Rosenhahn Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00827v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

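Score rationale: The paper belongs to computer vision. It studies early token pruning for Vision Transformers in video instance segmentation, covering video processing, computational efficiency, and sparsification. Every scoring keyword concerns LLMs, innovations in deep learning principles, or AI for science; the paper touches none of these, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a video patch pruning framework that uses temporal prior knowledge to sparsify the early layers of Vision Transformers efficiently. On video instance segmentation it removes up to 60% of patches while keeping performance stable.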

Abstract

Vision Transformers (ViTs) have demonstrated state-of-the-art performance in several benchmarks, yet their high computational costs hinder their practical deployment. Patch Pruning offers significant savings, but existing approaches restrict token reduction to deeper layers, leaving early-stage compression unexplored. This limits their potential for holistic efficiency. In this work, we present a novel Video Patch Pruning framework (VPP) that integrates temporal prior knowledge to enable efficient sparsity within early ViT layers. Our approach is motivated by the observation that prior features extracted from deeper layers exhibit strong foreground selectivity. Therefore, we propose a fully differentiable module for temporal mapping to accurately select the most relevant patches in early network stages. Notably, the proposed method enables a patch reduction of up to 60% in dense prediction tasks, exceeding the capabilities of conventional image-based patch pruning, which typically operates around a 30% patch sparsity. VPP excels in the high-sparsity regime, sustaining remarkable performance even when patch usage is reduced below 55%. Specifically, it preserves stable results with a maximal performance drop of 0.6% on the Youtube-VIS 2021 dataset.
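
To make the idea of patch sparsity concrete, here is a minimal sketch of score-based patch selection. It uses a hard top-k cut, which is a simplification and not the fully differentiable temporal mapping module described in the abstract. The function name `prune_patches` and the per-patch scores (standing in for a foreground prior carried over from an earlier frame) are hypothetical.

```python
import math


def prune_patches(scores, keep_ratio):
    """Return the indices of the patches to keep, in their original order.

    scores: one relevance score per patch (hypothetical foreground prior).
    keep_ratio: fraction of patches to keep, in (0, 1].
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, math.ceil(len(scores) * keep_ratio))
    # Rank patch indices from highest to lowest score, then keep the top ones.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Ten patches; keeping 40% corresponds to a 60% patch reduction.
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.05, 0.6, 0.4, 0.0]
kept = prune_patches(scores, keep_ratio=0.4)
print(kept, f"{1 - len(kept) / len(scores):.0%} of patches removed")
```

Only the kept indices would be passed on to later layers, so the savings grow the earlier in the network the cut is made.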

Keywords: Video Patch Pruning, Vision Transformers, Video Instance Segmentation, Token Reduction, Computational Efficiency, Sparsity, Temporal Prior Knowledge, Early Network Stages

191. ❌ Continual Vision-Language Learning for Remote Sensing: Benchmarking and Analysis

Authors: Xingxing Weng, Ruifeng Ni, Chao Pang, XiangYu Hao, Yishan Wang, Xiaokang Zhang, Wei Xu, Gui-Song Xia Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00820v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

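Score rationale: The paper studies continual learning for remote sensing vision-language models (RS VLMs), proposes the CLeaRS benchmark, and evaluates the limitations of existing methods. It is unrelated to most keywords, which target technical details of text-only LLMs (MoE, quantization, inference acceleration) or specific applications (tool calling, multi-agent systems). Only two keywords are relevant: (1) 'Pre-training OR Continual Pre-training OR Domain Adaptation': the paper explicitly studies continual learning, treated here as a subfield of continual pre-training and domain adaptation that aims to let models adapt to new tasks and modalities without forgetting, hence 8 points (highly relevant but not the core topic). (2) 'AI for Science OR Bioinformatics OR Cheminformatics': remote sensing is an application area of AI for Science that involves analyzing and interpreting scientific data, hence 8 points (relevant, but not core bioinformatics or cheminformatics). Other keywords such as LLMs, SFT, and RAG are not covered, because the paper focuses on vision-language models rather than text-only LLMs and does not discuss those techniques.

!!! tip deepseek-chat TL;DR

The paper targets catastrophic forgetting in remote sensing vision-language models as they continually adapt to new tasks and modalities. It proposes CLeaRS, a comprehensive continual learning benchmark, and finds experimentally that existing methods have limited effectiveness in handling task, instruction, and modality transitions.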

Abstract

Current remote sensing vision-language models (RS VLMs) demonstrate impressive performance in image interpretation but rely on static training data, limiting their ability to accommodate continuously emerging sensing modalities and downstream tasks. This exposes a fundamental challenge: enabling RS VLMs to continually adapt without catastrophic forgetting. Despite its practical importance, the continual learning capability of RS VLMs remains underexplored, and no dedicated benchmark currently exists. In this work, we present CLeaRS, a comprehensive benchmark for continual vision-language learning in remote sensing. CLeaRS comprises 10 curated subsets with over 207k image-text pairs, spanning diverse interpretation tasks, sensing modalities, and application scenarios. We further define three evaluation protocols: long-horizon, modality-incremental, and task-incremental settings, to systematically assess continual adaptation. Extensive benchmarking of diverse vision-language models reveals catastrophic forgetting across all settings. Moreover, representative continual learning methods, when adapted to RS VLMs, exhibit limited effectiveness in handling task, instruction, and modality transitions. Our findings underscore the need for developing continual learning methods tailored to RS VLMs.
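
Catastrophic forgetting is often quantified with an average forgetting measure: for each earlier task, the best score it ever reached minus its score after the final task. The sketch below computes that measure from a matrix of evaluation results. The abstract does not say which metric CLeaRS reports, so treat this as a common convention rather than the benchmark's definition; the accuracy values are made up.

```python
def average_forgetting(acc):
    """Average forgetting over all tasks except the last one.

    acc[i][j] is the score on task j measured after training on tasks 0..i.
    Entries with j > i (tasks not yet seen) are ignored.
    """
    n = len(acc)
    if n < 2:
        return 0.0  # nothing can be forgotten with a single task
    drops = []
    for j in range(n - 1):
        # Best score task j reached at any point before the final stage.
        best_earlier = max(acc[i][j] for i in range(j, n - 1))
        drops.append(best_earlier - acc[n - 1][j])
    return sum(drops) / len(drops)


# Made-up scores for three tasks learned in sequence.
acc = [
    [0.80, 0.00, 0.00],
    [0.70, 0.75, 0.00],
    [0.60, 0.65, 0.78],
]
print(round(average_forgetting(acc), 3))
```

A value near zero means earlier tasks kept their performance; larger values indicate stronger forgetting.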

Keywords: continual learning, vision-language models, remote sensing, catastrophic forgetting, benchmark, modality-incremental, task-incremental, image-text pairs

192. ❌ Multicentric thrombus segmentation using an attention-based recurrent network with gradual modality dropout

Authors: Sofia Vargas-Ibarra, Vincent Vigneron, Hichem Maaref, Sonia Garcia-Salicetti Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00817v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

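Score rationale: The paper focuses on thrombus segmentation in medical imaging, using an attention-based recurrent network with gradual modality dropout; it is an application of AI in biomedicine. The keywords concern large models, deep learning principles, or general AI methods, whereas this is domain-specific applied research that does not involve large-model techniques, training methods, inference optimization, alignment, or agent systems. Only 'AI for Science OR Bioinformatics OR Cheminformatics' is partly relevant, since the work involves biomedical imaging, but this is not the keyword's core scope, hence 5 points; the remaining keywords are unrelated and receive 0.

!!! tip deepseek-chat TL;DR

The study proposes an attention-based recurrent network trained with gradual modality dropout to detect and segment small thrombi in 3D brain scans. It reports detection rates above 90% on single-center data and about 80% on multi-center data, and argues that the method transfers to other small-lesion medical imaging tasks.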

Abstract

Detecting and delineating tiny targets in 3D brain scans is a central yet under-addressed challenge in medical imaging. In ischemic stroke, for instance, the culprit thrombus is small, low-contrast, and variably expressed across modalities (e.g., susceptibility-weighted T2 blooming, diffusion restriction on DWI/ADC), while real-world multi-center data introduce domain shifts, anisotropy, and frequent missing sequences. We introduce a methodology that couples an attention-based recurrent segmentation network (UpAttLLSTM) with a training schedule that progressively increases the difficulty of hetero-modal learning through gradual modality dropout. UpAttLLSTM aggregates context across slices via recurrent units (2.5D) and uses attention gates to fuse complementary cues across available sequences, making it robust to anisotropy and class imbalance. Gradual modality dropout systematically simulates site heterogeneity, noise, and missing modalities during training, acting as both augmentation and regularization to improve multi-center generalization. On a monocentric cohort, our approach detects thrombi in >90% of cases with a Dice score of 0.65. In a multi-center setting with missing modalities, it achieves about 80% detection with a Dice score around 0.35. Beyond stroke, the proposed methodology directly transfers to other small-lesion tasks in 3D medical imaging where targets are scarce, subtle, and modality-dependent.
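
The Dice score quoted in the abstract measures the overlap between a predicted mask and the reference mask: twice the size of their intersection divided by the sum of their sizes. A minimal sketch on flattened binary masks follows. Returning 1.0 when both masks are empty is a convention assumed here, and the toy masks are made up.

```python
def dice_score(pred, target):
    """Dice coefficient for two equal-length sequences of 0/1 voxel labels."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy flattened masks with three foreground voxels each, two of them shared.
pred = [0, 1, 1, 0, 0, 1]
target = [0, 1, 0, 0, 1, 1]
print(round(dice_score(pred, target), 3))
```

Because the denominator counts only foreground voxels, a single misplaced voxel changes the score much more for a tiny target than for a large one, so a lesion can be detected and still score a modest Dice value.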

Keywords: thrombus segmentation, attention-based recurrent network, gradual modality dropout, 3D medical imaging, multi-center generalization, small lesion detection, ischemic stroke, domain shift

193. ❌ Revisiting Human-in-the-Loop Object Retrieval with Pre-Trained Vision Transformers

Authors: Kawtar Zaher, Olivier Buisson, Alexis Joly Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00809v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

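Score rationale: The paper studies a human-in-the-loop object retrieval system built on pre-trained Vision Transformers (ViTs). It belongs to computer vision and is not about LLMs or innovations in deep learning principles. Only 'Pre-training OR Continual Pre-training OR Domain Adaptation' is partly relevant (5 points), because the system uses pre-trained ViT models. The paper does not involve LLM techniques, MoE, reasoning methods, alignment, fine-tuning, AI for science, or the other keywords, which therefore receive 0.

!!! tip deepseek-chat TL;DR

The paper revisits human-in-the-loop object retrieval with pre-trained Vision Transformers, uses an active learning strategy to refine the image retrieval process, and compares the trade-offs among representation strategies on multi-object datasets.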

Abstract

Building on existing approaches, we revisit Human-in-the-Loop Object Retrieval, a task that consists of iteratively retrieving images containing objects of a class-of-interest, specified by a user-provided query. Starting from a large unlabeled image collection, the aim is to rapidly identify diverse instances of an object category relying solely on the initial query and the user’s Relevance Feedback, with no prior labels. The retrieval process is formulated as a binary classification task, where the system continuously learns to distinguish between relevant and non-relevant images to the query, through iterative user interaction. This interaction is guided by an Active Learning loop: at each iteration, the system selects informative samples for user annotation, thereby refining the retrieval performance. This task is particularly challenging in multi-object datasets, where the object of interest may occupy only a small region of the image within a complex, cluttered scene. Unlike object-centered settings where global descriptors often suffice, multi-object images require more adapted, localized descriptors. In this work, we formulate and revisit the Human-in-the-Loop Object Retrieval task by leveraging pre-trained ViT representations, and addressing key design questions, including which object instances to consider in an image, what form the annotations should take, how Active Selection should be applied, and which representation strategies best capture the object’s features. We compare several representation strategies across multi-object datasets highlighting trade-offs between capturing the global context and focusing on fine-grained local object details. Our results offer practical insights for the design of effective interactive retrieval pipelines based on Active Learning for object class retrieval.
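
The feedback loop described in the abstract (binary relevance classification refined by active selection) can be sketched as follows. This is a toy illustration under stated assumptions: a nearest-centroid scorer stands in for the paper's classifier, 2-D tuples stand in for ViT descriptors, `oracle` simulates the user's relevance feedback, and uncertainty sampling is one possible active selection rule. All names are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def make_scorer(features, labels):
    """Build a relevance scorer from the images labelled so far."""
    pos_c = centroid([features[i] for i, y in labels.items() if y])
    neg = [features[i] for i, y in labels.items() if not y]
    if not neg:
        # No negative feedback yet: score by closeness to the positives only.
        return lambda i: -math.dist(features[i], pos_c)
    neg_c = centroid(neg)
    return lambda i: math.dist(features[i], neg_c) - math.dist(features[i], pos_c)


def active_retrieval(features, query_idx, oracle, rounds=3, batch=2):
    """Return all image indices ranked by relevance after the feedback rounds."""
    labels = {query_idx: True}  # the initial query is the only known positive
    for _ in range(rounds):
        unlabeled = [i for i in range(len(features)) if i not in labels]
        if not unlabeled:
            break
        score = make_scorer(features, labels)
        # Uncertainty sampling: ask about the images whose score is nearest 0.
        unlabeled.sort(key=lambda i: abs(score(i)))
        for i in unlabeled[:batch]:
            labels[i] = oracle(i)  # simulated user marks relevant or not
    score = make_scorer(features, labels)
    return sorted(range(len(features)), key=score, reverse=True)


# Toy data: indices 0-2 are relevant, the rest are not.
features = [(0, 0), (0.2, 0.1), (0.1, 0.3), (5, 5), (5.2, 4.9), (4.8, 5.1), (2.4, 2.6)]
relevant = {0, 1, 2}
ranking = active_retrieval(features, query_idx=0, oracle=lambda i: i in relevant)
print(ranking[:3])
```

For multi-object images, the abstract's point is that the feature vectors themselves matter: a single global descriptor per image may miss a small object, which is why localized descriptors are compared.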

Keywords: Human-in-the-Loop Object Retrieval, Pre-trained Vision Transformers, Active Learning, Relevance Feedback, Multi-object Datasets, Image Retrieval, Binary Classification, Localized Descriptors

194. ❌ Compact Keyframe-Optimized Multi-Agent Gaussian Splatting SLAM

Authors: Monica M. Q. Li, Pierre-Yves Lajoie, Jialiang Liu, Giovanni Beltrame Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00804v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

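Scoring rationale: The paper focuses on compressing 3D Gaussian Splatting maps and optimizing communication in multi-robot SLAM systems, and is unrelated to the vast majority of the large-model and deep-learning keywords. The only relevant keyword is 'Multi-agent Systems OR Agent Coordination', because the paper studies map merging and coordination in multi-robot systems, but it is not based on large models or deep-learning techniques. The paper belongs to robotics and computer vision rather than large-model research.

!!! tip deepseek-chat TL;DR

The paper proposes an improved multi-robot RGB-D Gaussian Splatting SLAM framework that compacts redundant 3D Gaussians and optimizes loop-closure detection, reducing transmitted data by up to 85-95% while preserving map quality.

Abstract translation (Chinese)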

高效的多智能体三维建图对于在未知环境中运行的机器人团队至关重要，但稠密地图表示会阻碍其在受限通信链路上的实时交换。在多智能体同步定位与建图（SLAM）系统中，通常依赖中央服务器来融合和优化各智能体生成的局部地图。然而，共享这些大规模地图表示——特别是由高斯泼溅（Gaussian Splatting）等近期方法生成的地图——在带宽受限的实际场景中成为瓶颈。本文提出一种改进的多智能体RGB-D高斯泼溅SLAM框架，该框架在保持地图保真度的同时降低了通信负载。首先，我们在SLAM系统中引入了一个压缩步骤，以移除冗余的三维高斯元，且不降低渲染质量。其次，我们的方法无需初始猜测即可执行集中式闭环检测计算，其运行包含两种模式：一种是纯渲染深度模式，除三维高斯元外无需任何额外数据；另一种是相机深度模式，该模式包含轻量级深度图像以提升配准精度并执行额外的高斯元剪枝。在合成数据集和真实数据集上的评估表明，与两种模式下的现有先进方法相比，本方法传输数据量减少了85-95%，使得基于三维高斯的多智能体SLAM更接近实际场景的部署应用。代码：https://github.com/lemonci/coko-slam

Abstract

Efficient multi-agent 3D mapping is essential for robotic teams operating in unknown environments, but dense representations hinder real-time exchange over constrained communication links. In multi-agent Simultaneous Localization and Mapping (SLAM), systems typically rely on a centralized server to merge and optimize the local maps produced by individual agents. However, sharing these large map representations, particularly those generated by recent methods such as Gaussian Splatting, becomes a bottleneck in real-world scenarios with limited bandwidth. We present an improved multi-agent RGB-D Gaussian Splatting SLAM framework that reduces communication load while preserving map fidelity. First, we incorporate a compaction step into our SLAM system to remove redundant 3D Gaussians, without degrading the rendering quality. Second, our approach performs centralized loop closure computation without initial guess, operating in two modes: a pure rendered-depth mode that requires no data beyond the 3D Gaussians, and a camera-depth mode that includes lightweight depth images for improved registration accuracy and additional Gaussian pruning. Evaluation on both synthetic and real-world datasets shows up to 85-95% reduction in transmitted data compared to state-of-the-art approaches in both modes, bringing 3D Gaussian multi-agent SLAM closer to practical deployment in real-world scenarios. Code: https://github.com/lemonci/coko-slam

Keywords: multi-agent SLAM, Gaussian Splatting, 3D mapping, communication reduction, loop closure, map compaction, real-time systems, robotic teams

195. ❌ HICT: High-precision 3D CBCT reconstruction from a single X-ray

Authors: Wen Ma, Jiaxiang Liu, Zikai Xiao, Ziyang Wang, Feng Yang, Zuozhu Liu Venue/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00792v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

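Scoring rationale: The HiCT paper focuses on medical imaging. It proposes a two-stage deep-learning framework that reconstructs high-precision 3D CBCT from a single panoramic X-ray, and builds the large-scale XCT dataset. The work is an application of AI in biomedicine (specifically dental imaging), so it is related only to the keyword 'AI for Science OR Bioinformatics OR Cheminformatics' (relevance 5), since dental imaging can be viewed as a niche application of bioinformatics or AI for Science. However, the paper's core content does not involve large language models (LLMs), model architectures (e.g. MoE), training techniques (e.g. pre-training, fine-tuning, alignment), inference optimization (e.g. quantization, acceleration), agent systems, or theory (e.g. scaling laws, chain of thought), so it is unrelated to all other keywords (relevance 0).

!!! tip deepseek-chat TL;DR

The paper proposes HiCT, a two-stage deep-learning framework that reconstructs high-precision, geometrically consistent 3D CBCT volumes from a single low-dose panoramic X-ray, addressing the high radiation dose and cost of conventional CBCT. It builds a large-scale dataset and reports state-of-the-art performance in experiments.

Abstract translation (Chinese)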

精确的三维牙科成像对诊断与治疗规划至关重要，然而锥形束计算机断层扫描（CBCT）的高辐射剂量与成本限制了其普及性。从单张低剂量全景X射线重建三维体积是一种前景广阔的替代方案，但由于几何不一致性与精度有限，该技术仍面临挑战。我们提出HiCT——一个两阶段框架：首先通过视频扩散模型从单张全景图像生成几何一致的多视角投影，随后利用基于射线的动态注意力网络及X射线采样策略，从投影中重建高保真CBCT。为支持此研究，我们构建了XCT数据集，该大规模数据集整合了公开CBCT数据与500组配对的全景X射线-CBCT病例。大量实验表明，HiCT实现了最先进的性能，能够生成精确且几何一致的重建结果，满足临床使用需求。

Abstract

Accurate 3D dental imaging is vital for diagnosis and treatment planning, yet CBCT’s high radiation dose and cost limit its accessibility. Reconstructing 3D volumes from a single low-dose panoramic X-ray is a promising alternative but remains challenging due to geometric inconsistencies and limited accuracy. We propose HiCT, a two-stage framework that first generates geometrically consistent multi-view projections from a single panoramic image using a video diffusion model, and then reconstructs high-fidelity CBCT from the projections using a ray-based dynamic attention network and an X-ray sampling strategy. To support this, we built XCT, a large-scale dataset combining public CBCT data with 500 paired PX-CBCT cases. Extensive experiments show that HiCT achieves state-of-the-art performance, delivering accurate and geometrically consistent reconstructions for clinical use.

Keywords: 3D CBCT reconstruction, single panoramic X-ray, video diffusion model, ray-based dynamic attention network, X-ray sampling strategy, dental imaging, geometric consistency, large-scale dataset

196. ❌ An Approach to Enriching Surgical Video Datasets for Fine-Grained Spatial-Temporal Understanding of Vision-Language Models

Authors: Lennart Maack, Alexander Schlaefer Venue/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00784v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	8.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

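Scoring rationale: The paper mainly studies surgical video dataset generation and the application of vision-language models to surgery, which falls under AI for Science (highly relevant, 10). It discusses LLM-based dataset generation (LLMs, 5), attends to dataset quality (Scaling Laws AND Data Quality, 5), uses fine-tuning (SFT, 8), and evaluates in-context learning (In-context Learning, 8). Other keywords such as MoE, quantization, and inference acceleration are unrelated to the paper (0).

!!! tip deepseek-chat TL;DR

To address the lack of fine-grained spatial-temporal relationships in surgical video datasets, the paper proposes the SurgSTU-Pipeline for generating high-quality datasets, and shows that fine-tuning and in-context learning improve vision-language models' spatial-temporal understanding of surgical videos.

Abstract translation (Chinese)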

手术视频理解是推进计算机辅助手术的关键前提。尽管视觉语言模型近期已被应用于手术领域，但现有的手术视觉语言数据集在捕捉和评估复杂的交错时空动态方面存在不足。由于依赖昂贵的人工标注或使用大语言模型生成易出错的内容，创建能够准确表征手术视频中细粒度时空关系的大规模数据集具有挑战性。为填补这一空白，我们提出了SurgSTU-Pipeline，这是一个具有确定性的生成流程，其特点是通过时空连续性过滤机制，可靠地创建用于细粒度时空多模态理解的手术数据集。将该流程应用于公开可用的手术数据集后，我们构建了SurgSTU数据集，包含7515个视频片段，并通过15万个细粒度时空问答样本进行了密集扩展。我们的综合评估表明，尽管最先进的通用视觉语言模型在零样本设置下表现不佳，但其时空能力可以通过上下文学习得到提升。在SurgSTU训练数据集上微调的视觉语言模型在所有时空任务中取得了最高性能，验证了该数据集在提升视觉语言模型对手术视频时空理解能力方面的有效性。代码将公开提供。

Abstract

Surgical video understanding is a crucial prerequisite for advancing Computer-Assisted Surgery. While vision-language models (VLMs) have recently been applied to the surgical domain, existing surgical vision-language datasets lack in capturing and evaluating complex, interleaved spatial-temporal dynamics. Creating large scale datasets that accurately represent fine-grained spatial-temporal relationships in surgical videos is challenging due to costly manual annotations or error-prone generation using large language models. To address this gap, we introduce the SurgSTU-Pipeline, a deterministic generation pipeline featuring temporal and spatial continuity filtering to reliably create surgical datasets for fine-grained spatial-temporal multimodal understanding. Applying this pipeline to publicly available surgical datasets, we create the SurgSTU dataset, comprising 7515 video clips densely extended with 150k fine-grained spatial-temporal question-answer samples. Our comprehensive evaluation shows that while state-of-the-art generalist VLMs struggle in zero-shot settings, their spatial-temporal capabilities can be improved through in-context learning. A fine-tuned VLM on the SurgSTU training dataset achieves highest performance among all spatial-temporal tasks, validating the dataset’s efficacy to improve spatial-temporal understanding of VLMs in surgical videos. Code will be made publicly available.

Keywords: surgical video understanding, vision-language models, fine-grained spatial-temporal, dataset generation, in-context learning, supervised fine-tuning, AI for surgery, multimodal understanding

197. ❌ Using predefined vector systems to speed up neural network multimillion class classification

Authors: Nikita Gabdullin, Ilya Androsov Venue/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00779v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

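Scoring rationale: The paper studies how to accelerate neural-network classification over many classes, using predefined vector systems and latent-space geometry to reduce label-prediction complexity from O(n) to O(1), with up to 11.6x acceleration. All scoring keywords relate to large models, deep-learning technical principles, or scientific applications, whereas this paper focuses on accelerating conventional neural-network classification and does not involve large models, MoE, quantization, inference acceleration, alignment, RAG, or any other listed technique, nor scientific applications such as bioinformatics. All keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes a method that uses predefined vector systems and latent-space geometry to accelerate neural-network classification over many classes, reducing label-prediction complexity from O(n) to O(1). Experiments show up to 11.6x overall acceleration.

Abstract translation (Chinese)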

神经网络（NNs）中的标签预测具有与类别数量成正比的O(n)复杂度。这在使用全连接层以及基于一组类别原型进行余弦相似度计算的分类任务中均成立。本文表明，若神经网络潜在空间（Latent Space, LS）的几何结构已知且具备特定性质，标签预测的复杂度可被显著降低。这是通过将标签预测与在向量系统中进行O(1)复杂度的最近聚类中心搜索相关联而实现的，该向量系统被用作潜在空间配置（Latent Space Configuration, LSC）的目标。所提出的方法仅需找到嵌入向量中几个最大值和最小值的索引，因而计算效率极高。我们证明，该方法不会改变神经网络训练精度的计算结果。我们还在多个数据集上测量了神经网络推理和标签预测不同计算阶段所需的时间。实验表明，与传统方法相比，所提出的方法可实现高达11.6倍的整体加速。此外，该方法具备独特性质，能够预测新类别的存在。

Abstract

Label prediction in neural networks (NNs) has O(n) complexity proportional to the number of classes. This holds true for classification using fully connected layers and cosine similarity with some set of class prototypes. In this paper we show that if NN latent space (LS) geometry is known and possesses specific properties, label prediction complexity can be significantly reduced. This is achieved by associating label prediction with the O(1) complexity closest cluster center search in a vector system used as target for latent space configuration (LSC). The proposed method only requires finding indexes of several largest and lowest values in the embedding vector making it extremely computationally efficient. We show that the proposed method does not change NN training accuracy computational results. We also measure the time required by different computational stages of NN inference and label prediction on multiple datasets. The experiments show that the proposed method allows to achieve up to 11.6 times overall acceleration over conventional methods. Furthermore, the proposed method has unique properties which allow to predict the existence of new classes.

Keywords: neural network classification, latent space geometry, computational efficiency, vector systems, label prediction acceleration, O(1) complexity, multimillion class classification, inference acceleration
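
The abstract above says that label prediction only requires finding the indexes of a few largest and lowest values in the embedding vector. The sketch below illustrates that idea under one stated assumption: that the predefined vector system consists of the centers `e_i - e_j` (one entry +1, one entry -1, all others 0) in a `d`-dimensional latent space, giving `d*(d-1)` classes. The abstract does not specify the vector system, so this choice and the function names are hypothetical. For these centers, cosine similarity with an embedding `z` is proportional to `z[i] - z[j]`, so the nearest center is found from the argmax and argmin of `z` without iterating over classes.

```python
import math
import random
from itertools import permutations
from typing import List, Tuple


def predict_fast(z: List[float]) -> Tuple[int, int]:
    """Nearest center e_i - e_j: i is the argmax of z, j is the argmin.

    Cost depends on the embedding size d only, not on the number of classes.
    """
    i = max(range(len(z)), key=z.__getitem__)
    j = min(range(len(z)), key=z.__getitem__)
    return i, j


def predict_brute_force(z: List[float]) -> Tuple[int, int]:
    """Reference: cosine similarity against every one of the d*(d-1) centers."""
    norm = math.sqrt(sum(v * v for v in z)) * math.sqrt(2.0)
    return max(
        permutations(range(len(z)), 2),
        key=lambda ij: (z[ij[0]] - z[ij[1]]) / norm,
    )


rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(8)]  # hypothetical embedding, d = 8
print(predict_fast(z), predict_brute_force(z))  # both name the same class (i, j)
```

With `d = 8` the brute-force search already compares against 56 centers, while the fast path makes two passes over 8 values; the gap widens quadratically in `d` for this particular vector system.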

198. ❌ PrivHAR-Bench: A Graduated Privacy Benchmark Dataset for Video-Based Action Recognition

Authors: Samar Ansari Venue/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00761v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

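Scoring rationale: The paper focuses on video privacy protection and action recognition. It belongs to computer vision rather than to innovation in large models or deep-learning technical principles. Its content (a privacy-preserving dataset, action recognition, the privacy-utility trade-off) is unrelated to the vast majority of keywords (e.g. LLM, MoE, RLHF, RAG). The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', since action recognition can be viewed as an application of AI in science- or health-related fields. However, the paper does not explicitly mention bioinformatics or cheminformatics, so it is given a relevance of 5 (some relation). No other keyword is directly related, and each receives 0.

!!! tip deepseek-chat TL;DR

The paper introduces PrivHAR-Bench, a multi-tier privacy-preserving benchmark dataset for video action recognition that standardizes evaluation of the trade-off between privacy strength and recognition utility, and experimentally validates an interpretable degradation curve of recognition accuracy across privacy levels.

Abstract translation (Chinese)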

现有关于隐私保护人体活动识别（Human Activity Recognition, HAR）的研究通常采用二元范式进行评估：清晰视频与单一隐私变换的对比。这种范式限制了跨方法可比性，并模糊了隐私强度与识别效用之间的微妙关系。我们提出了 PrivHAR-Bench，一个多层级基准数据集，旨在标准化基于视频的行为识别中隐私-效用权衡的评估。PrivHAR-Bench 应用了一系列渐进的视觉隐私变换：从轻量级空间模糊到加密块置换，并针对人体关节多样性精选了15个活动类别。总计1,932个源视频中的每一个均被分布在9个隐私强度递增的平行层级中，同时提供了背景移除的变体，以将人体运动特征的贡献与场景上下文偏置隔离开来。我们提供了无损帧序列、逐帧边界框、带有关节级置信度分数的估计姿态关键点、标准化的基于分组的训练/测试划分，以及一个计算识别准确率和隐私度量的评估工具包。使用R3D-18模型进行的实证验证显示，各层级间存在可测量且可解释的性能衰减曲线：层级内准确率从88.8%（清晰视频）下降至53.5%（加密且背景移除），而跨域准确率则骤降至4.8%。这确立了PrivHAR-Bench作为一个受控基准，用于在标准化条件下比较不同的隐私保护HAR方法。该数据集、生成流程及评估代码均已公开。

Abstract

Existing research on privacy-preserving Human Activity Recognition (HAR) typically evaluates methods against a binary paradigm: clear video versus a single privacy transformation. This limits cross-method comparability and obscures the nuanced relationship between privacy strength and recognition utility. We introduce PrivHAR-Bench, a multi-tier benchmark dataset designed to standardize the evaluation of the Privacy-Utility Trade-off in video-based action recognition. PrivHAR-Bench applies a graduated spectrum of visual privacy transformations: from lightweight spatial obfuscation to cryptographic block permutation, to a curated subset of 15 activity classes selected for human articulation diversity. Each of the 1,932 source videos is distributed across 9 parallel tiers of increasing privacy strength, with additional background-removed variants to isolate the contribution of human motion features from contextual scene bias. We provide lossless frame sequences, per-frame bounding boxes, estimated pose keypoints with joint-level confidence scores, standardized group-based train/test splits, and an evaluation toolkit computing recognition accuracy and privacy metrics. Empirical validation using R3D-18 demonstrates a measurable and interpretable degradation curve across tiers, with within-tier accuracy declining from 88.8% (clear) to 53.5% (encrypted, background-removed) and cross-domain accuracy collapsing to 4.8%, establishing PrivHAR-Bench as a controlled benchmark for comparing privacy-preserving HAR methods under standardized conditions. The dataset, generation pipeline, and evaluation code are publicly available.

Keywords: Privacy-preserving Human Activity Recognition, Video-based action recognition, Privacy-Utility Trade-off, Benchmark dataset, Visual privacy transformations, Graduated privacy spectrum, R3D-18 validation, Standardized evaluation

199. ❌ A Benchmark of State-Space Models vs. Transformers and BiLSTM-based Models for Historical Newspaper OCR

Authors: Merveilles Agbeti-messan, Thierry Paquet, Clément Chatelain, Pierrick Tranouez, Stéphane Nicolas Venue/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00725v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

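Scoring rationale: The paper studies OCR for historical newspapers, mainly comparing State-Space Models (SSMs) with Transformer and BiLSTM models on OCR. It focuses on a specific application of computer vision and sequence modeling and does not involve large language models (LLMs), innovation in deep-learning technical principles, or AI applications in science. All keywords relate to large language models, deep-learning technical principles, or scientific AI applications, whereas this paper compares sequence models for OCR, a conventional computer-vision task, and is unrelated to the given keywords.

!!! tip deepseek-chat TL;DR

The paper studies State-Space Models (such as Mamba) as an alternative to Transformers and BiLSTMs for historical newspaper OCR, finding that Mamba-based models keep competitive accuracy while halving inference time and showing better memory scaling.

Abstract translation (Chinese)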

历史报纸的端到端光学字符识别（OCR）仍具挑战性，因为模型必须处理长文本序列、退化的印刷质量以及复杂的版面布局。尽管基于Transformer的识别器主导了当前研究，但其二次复杂度限制了高效的段落级转录和大规模部署。本研究探索了线性时间状态空间模型（SSMs），特别是Mamba，作为基于Transformer的OCR序列建模的可扩展替代方案。
我们提出了据我们所知首个基于SSM的OCR架构，该架构结合了CNN视觉编码器与双向及自回归的Mamba序列建模，并进行了大规模基准测试，比较了SSM与基于Transformer和BiLSTM的识别器。在相同的训练条件下，我们评估了多种解码策略（CTC、自回归和非自回归），同时对比了强大的神经基线模型（VAN、DAN、DANIEL）以及广泛使用的现成OCR引擎（PERO-OCR、Tesseract OCR、TrOCR、Gemini）。
在卢森堡国家图书馆的历史报纸数据（新发布的>99%验证黄金标准标注）上进行的实验，以及对Fraktur和Antiqua字体的跨数据集测试表明，所有神经模型均实现了较低的错误率（约2%字符错误率），使得计算效率成为主要区分因素。基于Mamba的模型在保持竞争力的准确率的同时，将推理时间减半，并展现出更优的内存扩展性（1000字符时增长1.26倍 vs 2.30倍）；在严重退化的段落级别，其字符错误率达到6.07%，而DAN模型为5.24%，同时推理速度仍快2.05倍。
我们公开了代码、训练模型和标准化评估协议，以支持可重复研究，并为大规模文化遗产OCR实践提供指导。

Abstract

End-to-end OCR for historical newspapers remains challenging, as models must handle long text sequences, degraded print quality, and complex layouts. While Transformer-based recognizers dominate current research, their quadratic complexity limits efficient paragraph-level transcription and large-scale deployment. We investigate linear-time State-Space Models (SSMs), specifically Mamba, as a scalable alternative to Transformer-based sequence modeling for OCR. We present to our knowledge, the first OCR architecture based on SSMs, combining a CNN visual encoder with bi-directional and autoregressive Mamba sequence modeling, and conduct a large-scale benchmark comparing SSMs with Transformer- and BiLSTM-based recognizers. Multiple decoding strategies (CTC, autoregressive, and non-autoregressive) are evaluated under identical training conditions alongside strong neural baselines (VAN, DAN, DANIEL) and widely used off-the-shelf OCR engines (PERO-OCR, Tesseract OCR, TrOCR, Gemini). Experiments on historical newspapers from the Bibliothèque nationale du Luxembourg, with newly released >99% verified gold-standard annotations, and cross-dataset tests on Fraktur and Antiqua lines, show that all neural models achieve low error rates (~2% CER), making computational efficiency the main differentiator. Mamba-based models maintain competitive accuracy while halving inference time and exhibiting superior memory scaling (1.26x vs 2.30x growth at 1000 chars), reaching 6.07% CER at the severely degraded paragraph level compared to 5.24% for DAN, while remaining 2.05x faster. We release code, trained models, and standardized evaluation protocols to enable reproducible research and guide practitioners in large-scale cultural heritage OCR.

Keywords: State-Space Models, Mamba, OCR, historical newspapers, Transformer, BiLSTM, sequence modeling, computational efficiency
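
The abstract above reports results as character error rate (CER), for example about 2% at line level and 6.07% versus 5.24% at the degraded paragraph level. As a reminder of what that number measures, here is a minimal sketch of the common definition: Levenshtein edit distance between hypothesis and reference, divided by the reference length. The abstract does not state the paper's exact normalization or text preprocessing, so this definition and the example strings are assumptions.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate relative to the reference transcription."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical OCR output that drops one character from a 10-character word.
print(cer("Luxembourg", "Luxemburg"))
```

Under this definition a 2% CER corresponds to roughly one character error per 50 reference characters.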

200. ❌ TTA-Vid: Generalized Test-Time Adaptation for Video Reasoning

Authors: Soumya Shamarao Jahagirdar, Edson Araujo, Anna Kukleva, M. Jehanzeb Mirza, Saurabhchand Bhati, Samuel Thomas, Brian Kingsbury, Rogerio Feris, James R. Glass, Hilde Kuehne Venue/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00696v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

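Scoring rationale: The paper focuses on test-time adaptation for video reasoning and is unrelated to most large-model keywords. It has some relation to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5), because it involves pretrained models and domain adaptation, and to 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' (5), because it mentions 'step-by-step reasoning'. No other keyword is covered.

!!! tip deepseek-chat TL;DR

The paper proposes TTA-Vid, a video-reasoning adaptation method based on test-time reinforcement learning that, without labeled data, uses adaptive frame selection and reward-driven model updates to outperform existing methods on several video reasoning tasks.

Abstract translation (Chinese)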

近期视频推理模型在时序与多模态理解方面展现出强劲性能，但其依赖于大规模监督数据与多阶段训练流程，导致训练成本高昂且难以适应新领域。本研究利用视频-语言数据上的测试时强化学习范式，使预训练模型能够在测试时无需显式标注即可适应输入的视频样本。所提出的视频测试时自适应方法包含两个同步工作的组件：（1）一种测试时自适应机制，在推理阶段对多个帧子集执行逐步推理，随后利用跨不同帧子集计算得到的基于批感知频率的奖励作为伪真值来更新模型。实验表明，仅使用数据集中单个批次甚至单一样本训练得到的模型，能够在测试时泛化至整个数据集乃至跨数据集。由于自适应过程完全在测试时进行，本方法无需真实标注或专用训练数据划分。此外，我们提出一种基于多臂赌博机策略的自适应帧选择方法，该策略通过同一奖励机制引导，学习优先选择信息量高的帧。评估结果显示，TTA-Vid 在多种视频推理任务中均能取得稳定提升，其性能可超越当前基于大规模数据训练的最先进方法。这凸显了测试时强化学习在时序多模态理解领域的潜力。

Abstract

Recent video reasoning models have shown strong results on temporal and multimodal understanding, yet they depend on large-scale supervised data and multi-stage training pipelines, making them costly to train and difficult to adapt to new domains. In this work, we leverage the paradigm of Test-Time Reinforcement Learning on video-language data to allow for adapting a pretrained model to incoming video samples at test-time without explicit labels. The proposed test-time adaptation for video approach (TTA-Vid) combines two components that work simultaneously: (1) a test-time adaptation that performs step-by-step reasoning at inference time on multiple frame subsets. We then use a batch-aware frequency-based reward computed across different frame subsets as pseudo ground truth to update the model. It shows that the resulting model trained on a single batch or even a single sample from a dataset, is able to generalize at test-time to the whole dataset and even across datasets. Because the adaptation occurs entirely at test time, our method requires no ground-truth annotations or dedicated training splits. Additionally, we propose a multi-armed bandit strategy for adaptive frame selection that learns to prioritize informative frames, guided by the same reward formulation. Our evaluation shows that TTA-Vid yields consistent improvements across various video reasoning tasks and is able to outperform current state-of-the-art methods trained on large-scale data. This highlights the potential of test-time reinforcement learning for temporal multimodal understanding.

Keywords: Test-Time Adaptation, Video Reasoning, Reinforcement Learning, Temporal Multimodal Understanding, Frame Selection, Pseudo Ground Truth, Domain Adaptation, Batch-Aware Reward
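
The abstract above describes a frequency-based reward computed across different frame subsets and used as pseudo ground truth. A minimal sketch of that general idea follows, assuming the model emits one discrete answer per frame subset and that each answer is rewarded by how often it occurs. The function name, the answer format, and the omission of the "batch-aware" component are all simplifying assumptions; this is not the authors' reward formulation.

```python
from collections import Counter
from typing import Dict, List


def frequency_rewards(answers: List[str]) -> Dict[str, float]:
    """Reward each distinct answer by its relative frequency across frame subsets."""
    counts = Counter(answers)
    total = len(answers)
    return {answer: count / total for answer, count in counts.items()}


# Hypothetical answers produced for the same question from five frame subsets.
answers = ["B", "B", "A", "B", "C"]
rewards = frequency_rewards(answers)
pseudo_label = max(rewards, key=rewards.get)  # most frequent answer acts as pseudo ground truth
print(rewards, pseudo_label)
```

Because the reward comes from agreement among the model's own outputs, no ground-truth annotation is needed, which matches the label-free setting the abstract describes.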

201. ❌ TP-Seg: Task-Prototype Framework for Unified Medical Lesion Segmentation

Authors: Jiawei Xu, Qiangqiang Zhou, Dandan Zhu, Yong Chen, Yugen Yi, Xiaoqi Zhao Venue/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00684v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

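Rationale: TP-Seg targets medical image segmentation and proposes a task-prototype framework for unified medical lesion segmentation. The work belongs to computer vision and medical image analysis and mainly concerns deep learning applied to medical imaging. Of all the keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" is relevant, because medical image segmentation is an application of AI to science (specifically biomedicine). The other 26 keywords all concern LLM-related techniques, methods, or applications, whereas this paper does not involve language models, text generation, alignment, reasoning, agents, or other LLM concepts, so their relevance is 0.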

!!! tip deepseek-chat TL;DR

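The paper proposes TP-Seg, a task-prototype framework that addresses feature entanglement and gradient interference in unified medical lesion segmentation, and it outperforms specialized, general, and unified segmentation methods on 8 different medical lesion segmentation tasks.

Abstract translation (Chinese)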

构建一个使用单一参数集、能高效处理多种医学病灶分割任务的统一模型，已成为AI辅助诊断的关键目标。现有的统一分割方法通常依赖于跨异构任务和模态的共享编码器，这常导致特征纠缠、梯度干扰及病灶区分能力欠佳。本研究提出TP-Seg，一种用于统一医学病灶分割的任务原型框架。一方面，任务条件适配器通过双路径专家结构有效平衡共享表征与任务特定表征，实现对不同医学影像模态与病灶类型的自适应特征提取；另一方面，原型引导的任务解码器引入可学习的任务原型作为语义锚点，并采用交叉注意力机制实现对任务特定前景与背景语义的细粒度建模。在涵盖多种影像模态的8个不同医学病灶分割任务中，TP-Seg在无需复杂技巧的情况下，始终优于专用、通用及统一分割方法，展现出强大的泛化能力、可扩展性及临床适用性。

Abstract

Building a unified model with a single set of parameters to efficiently handle diverse types of medical lesion segmentation has become a crucial objective for AI-assisted diagnosis. Existing unified segmentation approaches typically rely on shared encoders across heterogeneous tasks and modalities, which often leads to feature entanglement, gradient interference, and suboptimal lesion discrimination. In this work, we propose TP-Seg, a task-prototype framework for unified medical lesion segmentation. On one hand, the task-conditioned adapter effectively balances shared and task-specific representations through a dual-path expert structure, enabling adaptive feature extraction across diverse medical imaging modalities and lesion types. On the other hand, the prototype-guided task decoder introduces learnable task prototypes as semantic anchors and employs a cross-attention mechanism to achieve fine-grained modeling of task-specific foreground and background semantics. Without bells and whistles, TP-Seg consistently outperforms specialized, general and unified segmentation methods across 8 different medical lesion segmentation tasks covering multiple imaging modalities, demonstrating strong generalization, scalability and clinical applicability.

Keywords: medical lesion segmentation, unified model, task-prototype framework, dual-path expert structure, cross-attention mechanism, medical imaging modalities, generalization, clinical applicability
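
The abstract mentions learnable task prototypes used as semantic anchors through a cross-attention mechanism. The sketch below illustrates generic scaled dot-product cross-attention in plain Python, assuming pixel features act as queries and two prototypes (foreground and background) act as both keys and values. The vectors, the dimensionality, and the choice of which side supplies the queries are illustrative assumptions, not details taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, prototypes):
    """Scaled dot-product attention with prototypes as keys and values.

    Returns (outputs, weights): one output vector and one weight row per query.
    """
    dim = len(prototypes[0])
    outputs, weights = [], []
    for q in queries:
        scores = [
            sum(qi * pi for qi, pi in zip(q, p)) / math.sqrt(dim)
            for p in prototypes
        ]
        w = softmax(scores)
        out = [
            sum(w[k] * prototypes[k][d] for k in range(len(prototypes)))
            for d in range(dim)
        ]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


# Hypothetical 2-D prototypes: index 0 = foreground, index 1 = background.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
# Hypothetical pixel features: the first resembles foreground, the second background.
pixel_features = [[2.0, 0.0], [0.0, 2.0]]
outputs, weights = cross_attention(pixel_features, prototypes)
print(weights)
```

Each pixel feature is pulled toward the prototype it most resembles, which is the sense in which a prototype can serve as a semantic anchor.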

202. ❌ MoonAnything: A Vision Benchmark with Large-Scale Lunar Supervised Data

Authors: Clémentine Grethen, Yuang Shi, Simone Gasparini, Géraldine Morin Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00682v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

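Rationale: MoonAnything focuses on building a computer vision dataset and benchmark for lunar surface perception, covering physically-based rendering, 3D reconstruction, pose estimation, and related vision tasks. It is entirely unrelated to the vast majority of the large-model and deep-learning keywords (e.g., LLM, MoE, RLHF, RAG). The only possibly relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper applies AI to science (planetary science/astronomy), but this is not its core content, so it receives 5 points (some relevance). The other 26 keywords are not touched on and receive 0 points.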

!!! tip deepseek-chat TL;DR

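To address the lack of geometric and photometric supervision data in lunar exploration, the paper introduces the MoonAnything benchmark. It comprises two sub-datasets, LunarGeo and LunarPhoto, which together provide comprehensive supervision over more than 130K samples, support tasks such as 3D reconstruction and reflectance estimation, and offer a testbed for algorithms under low-texture, high-contrast conditions.

Abstract translation (Chinese)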

对月球表面的精确感知是现代月球探测任务的关键。然而，由于缺乏同时提供几何与光度监督的数据集，开发基于学习的鲁棒感知系统受到阻碍。现有月球数据集通常缺失几何真值、光度真实性、光照多样性或大规模覆盖中的至少一项。本文介绍了MoonAnything，这是一个基于真实月球地形、采用基于物理渲染构建的统一基准数据集，首次在大规模条件下提供了多种光照下的全面几何与光度监督。该基准包含两个互补的子数据集：i) LunarGeo提供立体图像及其对应的稠密深度图与相机标定参数，支持三维重建与姿态估计；ii) LunarPhoto采用空间变化双向反射分布函数（BRDF）模型生成具有照片级真实感的图像，并提供真实太阳配置下的多光照渲染，支持反射率估计与光照鲁棒感知。这些数据集共同提供了超过13万个样本及全面监督。除月球应用外，MoonAnything为低纹理、高对比度条件下的算法提供了独特场景与挑战性测试平台，可适用于其他无大气天体并具备更广泛的泛化潜力。我们使用先进方法建立了性能基线，并发布完整数据集及生成工具以支持社区扩展：https://github.com/clementinegrethen/MoonAnything。

Abstract

Accurate perception of lunar surfaces is critical for modern lunar exploration missions. However, developing robust learning-based perception systems is hindered by the lack of datasets that provide both geometric and photometric supervision. Existing lunar datasets typically lack either geometric ground truth, photometric realism, illumination diversity, or large-scale coverage. In this paper, we introduce MoonAnything, a unified benchmark built on real lunar topography with physically-based rendering, providing the first comprehensive geometric and photometric supervision under diverse illumination at large scale. The benchmark comprises two complementary sub-datasets: i) LunarGeo provides stereo images with corresponding dense depth maps and camera calibration, enabling 3D reconstruction and pose estimation; ii) LunarPhoto provides photorealistic images using a spatially-varying BRDF model, along with multi-illumination renderings under real solar configurations, enabling reflectance estimation and illumination-robust perception. Together, these datasets offer over 130K samples with comprehensive supervision. Beyond lunar applications, MoonAnything offers a unique setting and challenging testbed for algorithms under low-textured, high-contrast conditions; it applies to other airless celestial bodies and could generalize beyond them. We establish baselines using state-of-the-art methods and release the complete dataset along with generation tools to support community extension: https://github.com/clementinegrethen/MoonAnything.

Keywords: lunar perception, geometric supervision, photometric supervision, physically-based rendering, 3D reconstruction, reflectance estimation, dataset benchmark, airless celestial bodies

203. ❌ CL-VISTA: Benchmarking Continual Learning in Video Large Language Models

Authors: Haiyang Guo, Yichen Shi, Fei Zhu, Wenzhuo Liu, Hongbo Zhao, Fanhu Zeng, Shijie Ma, Da-Han Wang, Xu-Yao Zhang Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00677v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

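Rationale: The paper's core subject is continual learning for Video-LLMs, which is highly relevant to "Large Language Models" (10 points) because it directly studies video large language models. It is also related to "Pre-training" (8 points): the paper notes that existing benchmarks rely on models without large-scale pre-training, and it evaluates whether CL methods enhance foundational intelligence, which touches on adapting pre-trained models. Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and are unrelated to the paper's topic (0 points).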

!!! tip deepseek-chat TL;DR

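The paper proposes the CL-VISTA benchmark for evaluating continual learning in Video-LLMs and finds that existing methods face a fundamental trade-off among mitigating catastrophic forgetting, generalization, and computational and memory overhead.

Abstract translation (Chinese)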

视频大语言模型（Video-LLMs）需要持续学习以适应非平稳的真实世界数据。然而，现有基准测试在评估现代基础模型方面存在不足：许多仍依赖未经大规模预训练的模型，且主流基准通常将单一数据集划分为多个子任务，导致任务冗余度高，并在预训练的Video-LLMs上表现出可忽略的遗忘现象。为应对这些局限，我们提出了CL-VISTA——一个专为Video-LLMs持续视频理解设计的基准测试。通过精心构建涵盖感知、理解与推理的8项多样化任务，CL-VISTA引发显著的分布偏移，从而有效暴露灾难性遗忘问题。为系统评估持续学习方法，我们建立了一个包含6种不同协议的综合评估框架，涵盖三个关键维度：性能、计算效率与内存占用。值得注意的是，性能维度引入了通用视频理解评估，以检验持续学习方法是否真正增强了基础智能，还是仅导致任务特异性过拟合。对10种主流持续学习方法的广泛测试揭示了一个根本性权衡：没有任何一种方法能在所有维度上实现全面优势。那些成功缓解灾难性遗忘的方法往往以牺牲泛化能力为代价，或产生难以承受的计算与内存开销。我们希望CL-VISTA能为推进多模态基础模型的持续学习研究提供关键见解。

Abstract

Video Large Language Models (Video-LLMs) require continual learning to adapt to non-stationary real-world data. However, existing benchmarks fall short of evaluating modern foundation models: many still rely on models without large-scale pre-training, and prevailing benchmarks typically partition a single dataset into sub-tasks, resulting in high task redundancy and negligible forgetting on pre-trained Video-LLMs. To address these limitations, we propose CL-VISTA, a benchmark tailored for continual video understanding of Video-LLMs. By curating 8 diverse tasks spanning perception, understanding, and reasoning, CL-VISTA induces substantial distribution shifts that effectively expose catastrophic forgetting. To systematically assess CL methods, we establish a comprehensive evaluation framework comprising 6 distinct protocols across 3 critical dimensions: performance, computational efficiency, and memory footprint. Notably, the performance dimension incorporates a general video understanding assessment to assess whether CL methods genuinely enhance foundational intelligence or merely induce task-specific overfitting. Extensive benchmarking of 10 mainstream CL methods reveals a fundamental trade-off: no single approach achieves universal superiority across all dimensions. Methods that successfully mitigate catastrophic forgetting tend to compromise generalization or incur prohibitive computational and memory overheads. We hope CL-VISTA provides critical insights for advancing continual learning in multimodal foundation models.

Keywords: Video Large Language Models, Continual Learning, Benchmark, Catastrophic Forgetting, Generalization, Computational Efficiency, Memory Footprint, Multimodal Foundation Models
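
The abstract centers on exposing catastrophic forgetting but does not spell out how it is measured. As a rough illustration, the sketch below computes the average forgetting measure commonly used in the continual learning literature: for each earlier task, the best accuracy reached before the final stage minus the accuracy after the final stage. Whether CL-VISTA uses this exact definition in its six protocols is an assumption, and the accuracy matrix is made up.

```python
def average_forgetting(acc):
    """Average forgetting over all tasks except the last.

    acc[i][j] is the accuracy on task j after training on tasks 0..i.
    Only entries with i >= j are read.
    """
    n = len(acc)
    if n < 2:
        return 0.0
    drops = []
    for j in range(n - 1):
        # Best accuracy on task j at any stage before the final one.
        best_earlier = max(acc[i][j] for i in range(j, n - 1))
        drops.append(best_earlier - acc[n - 1][j])
    return sum(drops) / len(drops)


# Hypothetical accuracies for a sequence of three tasks.
acc = [
    [0.80, 0.00, 0.00],
    [0.70, 0.75, 0.00],
    [0.60, 0.65, 0.90],
]
forgetting = average_forgetting(acc)
print(round(forgetting, 3))
```

In this made-up example task 0 loses 0.20 and task 1 loses 0.10, so the average forgetting is about 0.15.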

204. ❌ When AI and Experts Agree on Error: Intrinsic Ambiguity in Dermatoscopic Images

Authors: Loris Cino, Pier Luigi Mazzeo, Alessandro Martella, Giulia Radi, Renato Rossi, Cosimo Distante Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00651v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

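Rationale: The paper studies the use of AI (specifically convolutional neural networks, CNNs) in dermatological diagnosis and examines AI and expert diagnostic failures caused by intrinsic image complexity; it is applied AI research in biomedicine. It does not involve large language models (LLMs), novel deep learning techniques, or any of the specific techniques among the scoring keywords (e.g., MoE, SFT, RAG). It is only somewhat related to the last keyword, "AI for Science OR Bioinformatics OR Cheminformatics", because it applies AI to medicine (a field adjacent to bioinformatics), but this is not a core innovation or technical study, so it receives 5 points. All other keywords are entirely unrelated and receive 0 points.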

!!! tip deepseek-chat TL;DR

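The paper studies how the intrinsic complexity of dermatoscopic images leads AI models and human experts to fail systematically on the same cases, and it identifies image quality as a primary driver of this dual failure.

Abstract translation (Chinese)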

将人工智能（AI），特别是卷积神经网络（Convolutional Neural Networks, CNNs）整合到皮肤病学诊断中，展现出巨大的临床潜力。现有文献主要将算法性能与人类专家进行基准比较，而本研究则采用了一个新颖的视角，探究皮肤镜图像的内在复杂性。通过对多种CNN架构进行严格实验，我们分离出一个在所有模型中被系统性误判的图像子集——这一现象在统计学上被证明超出了随机误差范围。为了确定这些失败案例是源于算法偏差还是图像本身固有的视觉模糊性，皮肤科专家独立评估了这些具有挑战性的病例以及一个对照组。结果显示，在AI误判的图像上，人类诊断性能出现了显著下降。首先，与真实标签的一致性急剧下降：困难图像的科恩卡帕系数（Cohen’s kappa）仅为0.08，而对照组为0.61。其次，我们观察到专家共识严重恶化；医生间的评估者间信度（inter-rater reliability）从对照组图像的中等一致性（弗莱斯卡帕系数 Fleiss kappa = 0.456）下降到困难病例的仅微弱一致性（弗莱斯卡帕系数 = 0.275）。我们确定图像质量是导致这双重系统性失败的主要驱动因素。为促进透明度和可重复性，所有数据、代码及训练模型均已公开。

Abstract

The integration of artificial intelligence (AI), particularly Convolutional Neural Networks (CNNs), into dermatological diagnosis demonstrates substantial clinical potential. While existing literature predominantly benchmarks algorithmic performance against human experts, our study adopts a novel perspective by investigating the intrinsic complexity of dermatoscopic images. Through rigorous experimentation with multiple CNN architectures, we isolated a subset of images systematically misclassified across all models, a phenomenon statistically proven to exceed random chance. To determine if these failures stem from algorithmic biases or inherent visual ambiguity, expert dermatologists independently evaluated these challenging cases alongside a control group. The results revealed a collapse in human diagnostic performance on the AI-misclassified images. First, agreement with ground-truth labels plummeted, with Cohen’s kappa dropping to a mere 0.08 for the difficult images, compared to 0.61 for the control group. Second, we observed a severe deterioration in expert consensus; inter-rater reliability among physicians fell from moderate concordance (Fleiss kappa = 0.456) on control images to only modest agreement (Fleiss kappa = 0.275) on difficult cases. We identified image quality as a primary driver of these dual systematic failures. To promote transparency and reproducibility, all data, code, and trained models have been made publicly available.

Keywords: Artificial Intelligence, Convolutional Neural Networks, Dermatological Diagnosis, Image Complexity, Diagnostic Failure, Expert Agreement, Image Quality, Transparency
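
The agreement figures in the abstract are kappa statistics. As a reminder of what they measure, the sketch below computes Cohen's kappa, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement between two label sequences and p_e is the agreement expected by chance from their marginal frequencies. The label sequences are made-up binary diagnoses, not the study's data, and Fleiss' kappa (the multi-rater variant also reported) is not covered here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal frequencies per class.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical ground truth and one rater's diagnoses (1 = malignant, 0 = benign).
ground_truth = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
rater = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
kappa = cohens_kappa(ground_truth, rater)
print(round(kappa, 3))
```

Here the rater agrees on 8 of 10 cases while chance agreement is 0.5, giving a kappa of about 0.6. A value near 0.08, as reported for the difficult images, means agreement with the ground truth is barely better than chance.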

205. ❌ LiPS: Lightweight Panoptic Segmentation for Resource-Constrained Robotics

Authors: Calvin Galagain, Martyna Poreba, François Goulette, Cyrill Stachniss Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00634v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

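Rationale: The paper focuses on panoptic segmentation in computer vision, specifically the design of a lightweight model for resource-constrained robotic platforms. Although it involves model efficiency optimization (e.g., lowering computational demands and raising throughput), all keywords relate directly to large language models (LLMs), deep learning techniques, or AI applications in science, whereas this paper studies a visual perception model and has no direct connection to them.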

!!! tip deepseek-chat TL;DR

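The paper proposes LiPS, a lightweight panoptic segmentation method for efficient computation on resource-constrained robotic platforms. It keeps accuracy comparable to much heavier baselines while achieving up to 4.5× higher throughput and requiring nearly 6.8× fewer computations.

Abstract translation (Chinese)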

全景分割是机器人感知的关键赋能技术，它将语义理解与对象级推理相统一。然而，先进模型日益增长的复杂性使其难以部署在移动机器人等资源受限平台上。我们提出了一种名为LiPS的新方法，通过轻量级设计应对高效计算全景分割的挑战，该方法在保留基于查询的解码机制的同时，引入了精简的特征提取与融合路径。其目标是在显著降低计算需求的同时，提供强大的全景分割性能。在标准基准测试上的评估表明，LiPS达到了与计算量更大的基线模型相当的精度，同时以每秒帧数衡量的吞吐量最高提升至4.5倍，所需计算量减少近6.8倍。这种高效性使LiPS成为现代全景分割模型与现实世界机器人应用之间高度相关的桥梁。

Abstract

Panoptic segmentation is a key enabler for robotic perception, as it unifies semantic understanding with object-level reasoning. However, the increasing complexity of state-of-the-art models makes them unsuitable for deployment on resource-constrained platforms such as mobile robots. We propose a novel approach called LiPS that addresses the challenge of efficient-to-compute panoptic segmentation with a lightweight design that retains query-based decoding while introducing a streamlined feature extraction and fusion pathway. It aims at providing a strong panoptic segmentation performance while substantially lowering the computational demands. Evaluations on standard benchmarks demonstrate that LiPS attains accuracy comparable to much heavier baselines, while providing up to 4.5× higher throughput, measured in frames per second, and requiring nearly 6.8 times fewer computations. This efficiency makes LiPS a highly relevant bridge between modern panoptic models and real-world robotic applications.

Keywords: panoptic segmentation, lightweight model, robotic perception, resource-constrained platforms, computational efficiency, query-based decoding, feature extraction, throughput improvement

206. ❌ DirectFisheye-GS: Enabling Native Fisheye Input in Gaussian Splatting with Cross-View Joint Optimization

Authors: Zhengxian Yang, Fei Xie, Xutao Xue, Rui Zhang, Taicheng Huang, Yang Liu, Mengqi Ji, Tao Yu Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00648v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

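Rationale: The paper focuses on 3D scene reconstruction in computer vision, specifically integrating a fisheye camera model into the 3D Gaussian Splatting (3DGS) framework and proposing a cross-view joint optimization strategy to address edge artifacts. It covers computer graphics, camera models, optimization algorithms, and 3D reconstruction, but does not involve large language models (LLMs), novel deep learning techniques, AI for Science, or any large-model topic among the scoring keywords. All keywords relate directly to large-model techniques, training methods, inference optimization, alignment, agent systems, and similar topics, whereas this paper improves an algorithm for a specific vision task, so the relevance of every keyword is 0.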

!!! tip deepseek-chat TL;DR

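The paper addresses artifacts that arise when fisheye images are used directly as input to the 3D Gaussian Splatting (3DGS) framework, caused by distortion at image edges and insufficient cross-view optimization. By integrating a fisheye camera model and proposing a feature-overlap-driven cross-view joint optimization strategy, it enables training on fisheye images without preprocessing and improves reconstruction quality.

Abstract translation (Chinese)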

三维高斯泼溅（3D Gaussian Splatting, 3DGS）能够从日常图像中实现高效的三维场景重建，并支持实时、高保真度的渲染，极大地推动了虚拟现实（VR）/增强现实（AR）应用的发展。鱼眼相机凭借其更宽的视场（Field of View, FOV），有望以更少的输入实现高质量重建，近来备受关注。然而，由于3DGS依赖于光栅化，大多数涉及鱼眼相机输入的后续研究在训练前会先对图像进行去畸变处理，这带来了两个问题：1）图像边缘的黑边会导致信息丢失，抵消了鱼眼相机大视场的优势；2）去畸变过程中的拉伸与插值重采样会将每个像素的值扩散到更大区域，稀释了细节密度——导致3DGS过度拟合这些低频区域，产生模糊和漂浮伪影。在本工作中，我们将鱼眼相机模型集成到原始3DGS框架中，使得无需预处理即可直接使用原始鱼眼图像进行训练。尽管建模正确，我们观察到重建场景在图像边缘仍存在漂浮物：畸变向边缘逐渐增大，而3DGS原有的每迭代随机选择视角优化策略忽略了高斯分布（Gaussian）的跨视角关联性，导致产生极端形状（如过大或过长），从而降低了重建质量。为解决此问题，我们提出了一种基于特征重叠驱动的跨视角联合优化策略，该策略在多个视角间建立了一致的几何与光度约束——这项技术同样适用于现有的基于针孔相机（pinhole-camera）的流程。我们的DirectFisheye-GS在公开数据集上达到或超越了最先进的性能。

Abstract

3D Gaussian Splatting (3DGS) has enabled efficient 3D scene reconstruction from everyday images with real-time, high-fidelity rendering, greatly advancing VR/AR applications. Fisheye cameras, with their wider field of view (FOV), promise high-quality reconstructions from fewer inputs and have recently attracted much attention. However, since 3DGS relies on rasterization, most subsequent works involving fisheye camera inputs first undistort images before training, which introduces two problems: 1) Black borders at image edges cause information loss and negate the fisheye’s large FOV advantage; 2) Undistortion’s stretch-and-interpolate resampling spreads each pixel’s value over a larger area, diluting detail density and causing 3DGS to overfit these low-frequency zones, producing blur and floating artifacts. In this work, we integrate a fisheye camera model into the original 3DGS framework, enabling native fisheye image input for training without preprocessing. Despite correct modeling, we observed that the reconstructed scenes still exhibit floaters at image edges: distortion increases toward the periphery, and 3DGS’s original per-iteration random-selecting-view optimization ignores the cross-view correlations of a Gaussian, leading to extreme shapes (e.g., oversized or elongated) that degrade reconstruction quality. To address this, we introduce a feature-overlap-driven cross-view joint optimization strategy that establishes consistent geometric and photometric constraints across views, a technique equally applicable to existing pinhole-camera-based pipelines. Our DirectFisheye-GS matches or surpasses state-of-the-art performance on public datasets.

Keywords: 3D Gaussian Splatting, fisheye camera, cross-view optimization, 3D scene reconstruction, image distortion, real-time rendering, VR/AR applications, geometric constraints
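
The abstract notes that undistortion stretches pixels over a larger area and that distortion grows toward the image periphery. A small numeric sketch can make this concrete, assuming the equidistant fisheye model (r = f·θ) against the pinhole model (r = f·tan θ). The abstract does not say which fisheye model the paper uses, so the equidistant choice and all function names are assumptions for illustration only.

```python
import math


def pinhole_radius(theta, focal):
    """Image radius of a ray at angle theta (radians) from the optical axis."""
    if not 0.0 <= theta < math.pi / 2:
        raise ValueError("pinhole projection requires 0 <= theta < pi/2")
    return focal * math.tan(theta)


def equidistant_fisheye_radius(theta, focal):
    """Image radius under the equidistant fisheye model (assumed here)."""
    return focal * theta


def radial_stretch(theta):
    """Radial stretch when resampling equidistant fisheye pixels to pinhole.

    Equals d(tan theta)/d(theta) = 1 / cos(theta)^2.
    """
    return 1.0 / math.cos(theta) ** 2


for degrees in (0, 30, 60, 80):
    theta = math.radians(degrees)
    print(degrees, round(radial_stretch(theta), 2))
```

Under these assumptions the stretch is 1 at the image center and about 4 at 60° off-axis, and it diverges as the angle approaches 90°, which is why undistorted images must be cropped or padded and why detail is diluted near the edges.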

207. ❌ Fluently Lying: Adversarial Robustness Can Be Substrate-Dependent

Authors: Daye Kang, Hyeongboo Baek Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00605v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

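Rationale: The paper studies the adversarial robustness of spiking neural network (SNN) object detectors, specifically a phenomenon in which detection count and accuracy decouple under adversarial attack (termed Quality Corruption). Its topic belongs to computer vision, adversarial machine learning, and neuromorphic computing, and is entirely unrelated to the scoring keywords, all of which focus on large language models, deep learning techniques, and their applications in science. The paper does not involve large models, language models, training methods, inference techniques, alignment, compression, agent systems, or AI for Science.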

!!! tip deepseek-chat TL;DR

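The paper finds that, under adversarial attack, a spiking neural network object detector can exhibit "Quality Corruption", in which detection count is preserved while accuracy collapses. It argues that this failure mode can be substrate-dependent and reports that the five standard defense components tested fail to detect or mitigate it.

Abstract translation (Chinese)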

当前用于监控和防御对抗攻击下目标检测器的主要工具均假设：当检测精度下降时，检测数量会同步减少。这种关联性此前仅为假设，未经实际测量验证。我们报告了在单一模型中观察到的反例：在标准PGD攻击下，脉冲神经网络目标检测器EMS-YOLO在保持70%以上检测框数量的同时，其mAP指标从0.528急剧下降至0.042。我们将这种数量保持而精度崩溃的现象定义为质量腐化，以区别于非定向攻击评估中占主导的抑制现象。通过对四种SNN架构和两种威胁模型的测试，质量腐化仅出现在四种检测器中的EMS-YOLO上。在该模型上，所有五种标准防御组件均未能检测或缓解质量腐化，这表明现有防御体系可能依赖于基于单一计算基底校准的共享假设。据我们所知，这些结果首次证明了对抗性失效模式可能具有基底依赖性。

Abstract

The primary tools used to monitor and defend object detectors under adversarial attack assume that when accuracy degrades, detection count drops in tandem. This coupling was assumed, not measured. We report a counterexample observed on a single model: under standard PGD, EMS-YOLO, a spiking neural network (SNN) object detector, retains more than 70% of its detections while mAP collapses from 0.528 to 0.042. We term this count-preserving accuracy collapse Quality Corruption (QC), to distinguish it from the suppression that dominates untargeted evaluation. Across four SNN architectures and two threat models (l-infinity and l-2), QC appears only in one of the four detectors tested (EMS-YOLO). On this model, all five standard defense components fail to detect or mitigate QC, suggesting the defense ecosystem may rely on a shared assumption calibrated on a single substrate. These results provide, to our knowledge, the first evidence that adversarial failure modes can be substrate-dependent.

Keywords: adversarial robustness, spiking neural network, object detector, quality corruption, substrate-dependent, PGD attack, defense failure, EMS-YOLO
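
The abstract's central point is that monitors which assume detection count drops with accuracy can miss a count-preserving collapse. The sketch below illustrates that blind spot with a hypothetical count-based alarm. The mAP values (0.528 and 0.042) come from the abstract; the detection counts, the 0.5 retention threshold, and the function names are invented for illustration and do not describe any of the defense components evaluated in the paper.

```python
def count_retention(clean_count, attacked_count):
    """Fraction of detections that survive the attack."""
    if clean_count <= 0:
        raise ValueError("clean_count must be positive")
    return attacked_count / clean_count


def count_based_alarm(clean_count, attacked_count, min_retention=0.5):
    """Hypothetical monitor that fires only when detections are suppressed."""
    return count_retention(clean_count, attacked_count) < min_retention


# mAP before and after standard PGD, as reported in the abstract.
clean_map, attacked_map = 0.528, 0.042
# Hypothetical counts consistent with "retains more than 70% of its detections".
clean_count, attacked_count = 100, 72

alarm = count_based_alarm(clean_count, attacked_count)
relative_map_drop = (clean_map - attacked_map) / clean_map
print(alarm, round(relative_map_drop, 3))
```

With these numbers the alarm stays silent even though mAP falls by roughly 92% in relative terms, which is the kind of silent failure the authors call Quality Corruption.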

208. ❌ TALENT: Target-aware Efficient Tuning for Referring Image Segmentation

Authors: Shuo Jin, Siyue Yu, Bingfeng Zhang, Chao Yao, Meiqin Liu, Jimin Xiao Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00609v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

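Rationale: The paper focuses on the Referring Image Segmentation task in computer vision and proposes TALENT, a new framework based on parameter-efficient fine-tuning (PEFT), to address the non-target activation problem. Its core contribution is a PEFT method for a vision-language multimodal task, so it is highly relevant only to the keyword "PEFT OR LoRA OR Parameter-efficient Fine-tuning" (10 points): the paper explicitly studies parameter-efficient tuning (PET) and proposes a new target-aware efficient tuning method. The other keywords mainly concern large language models, reasoning, alignment, AI applications in science, and similar topics, and have no direct connection to this vision paper, so they receive 0 points.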

!!! tip deepseek-chat TL;DR

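The paper proposes the TALENT framework, which uses target-aware efficient tuning to address the non-target activation problem in Referring Image Segmentation, and it outperforms existing methods across various metrics (e.g., 2.5% mIoU gains on the G-Ref val set).

Abstract translation (Chinese)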

指代图像分割旨在根据自然语言描述分割特定目标。近年来，参数高效调优（PET）已成为一种前景广阔的研究范式。然而，现有的基于PET的方法常面临视觉特征无法聚焦于文本所指的目标实例，反而会激活同类但无关物体的问题。我们分析并量化了这一问题，将其定义为“非目标激活”（NTA）现象。为解决此问题，我们提出了一个新颖框架TALENT，该框架采用面向目标的高效调优策略用于基于PET的指代图像分割。具体而言，我们首先提出一种校正代价聚合器（RCA），以高效聚合文本指代特征。随后，为将“非目标激活”校准为准确的目标激活，我们采用了目标感知学习机制（TLM），包含上下文成对一致性学习和以目标为中心的对比学习。前者利用句子级文本特征实现对指代对象的整体理解，并构建文本指代亲和力图以优化视觉特征的语义关联；后者则进一步强化目标定位能力，以区分特定实例并抑制与其他无关实例的关联。这两个目标协同工作，有效解决了“非目标激活”问题。大量实验评估表明，TALENT在多项指标上均优于现有方法（例如在G-Ref验证集上实现了2.5%的mIoU提升）。我们的代码将在以下地址开源：https://github.com/Kimsure/TALENT。

摘要 (Abstract)

Referring image segmentation aims to segment specific targets based on a natural text expression. Recently, parameter-efficient tuning (PET) has emerged as a promising paradigm. However, existing PET-based methods often suffer from the fact that visual features can’t emphasize the text-referred target instance but activate co-category yet unrelated objects. We analyze and quantify this problem, terming it the ‘non-target activation’ (NTA) issue. To address this, we propose a novel framework, TALENT, which utilizes target-aware efficient tuning for PET-based RIS. Specifically, we first propose a Rectified Cost Aggregator (RCA) to efficiently aggregate text-referred features. Then, to calibrate ‘NTA’ into accurate target activation, we adopt a Target-aware Learning Mechanism (TLM), including contextual pairwise consistency learning and target-centric contrastive learning. The former uses the sentence-level text feature to achieve a holistic understanding of the referent and constructs a text-referred affinity map to optimize the semantic association of visual features. The latter further enhances target localization to discover the distinct instance while suppressing associations with other unrelated ones. The two objectives work in concert and address ‘NTA’ effectively. Extensive evaluations show that TALENT outperforms existing methods across various metrics (e.g., 2.5% mIoU gains on G-Ref val set). Our codes will be released at: https://github.com/Kimsure/TALENT.

关键词: Referring Image Segmentation, Parameter-efficient Tuning, Target-aware Learning, Non-target Activation, Visual-Language Models, Contrastive Learning, Semantic Association, Multi-modal Learning

209. ❌ KG-CMI: Knowledge graph enhanced cross-Mamba interaction for medical visual question answering

作者: Xianyao Zheng, Hong Yu, Hui Cui, Changming Sun, Xiangyu Li, Ran Su, Leyi Wei, Jia Zhou, Junbo Wang, Qiangguo Jin 期刊/来源: arxiv 发布日期: 2026-04-01 arXiv链接: http://arxiv.org/abs/2604.00601v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

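Scoring rationale: The paper focuses on medical visual question answering (Med-VQA) and proposes a cross-modal interaction framework that combines a knowledge graph with the Mamba architecture. Its core is an AI application in the medical domain, in the direction of bioinformatics and medical informatics. It is therefore highly relevant only to 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points), since that keyword directly covers AI applications in biomedicine. The paper does not address LLM principles, training methods (pre-training, fine-tuning, alignment), inference optimization, agent systems, or other general AI techniques, so all other keywords score 0.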

!!! tip deepseek-chat TL;DR

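To address the difficulty of fully exploiting domain knowledge and handling open-ended answers in medical visual question answering, the paper proposes a knowledge graph enhanced cross-Mamba interaction framework (KG-CMI), which outperforms existing state-of-the-art methods on three Med-VQA datasets (VQA-RAD, SLAKE, and OVQA).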

摘要翻译

医学视觉问答（Med-VQA）是临床决策支持和远程医疗中的关键多模态任务。现有方法未能充分利用领域特定的医学知识，导致难以准确关联医学图像中的病灶特征与关键诊断标准。此外，基于分类的方法通常依赖于预定义的答案集合，将Med-VQA视为简单的分类问题限制了其适应自由形式答案多样性的能力，并可能忽略答案中的细节语义信息。为解决这些挑战，我们提出了一种知识图谱增强的交叉Mamba交互（KG-CMI）框架，该框架包含细粒度跨模态特征对齐（FCFA）模块、知识图谱嵌入（KGE）模块、跨模态交互表示（CMIR）模块以及自由形式答案增强的多任务学习（FAMT）模块。KG-CMI通过图结构有效整合专业医学知识，学习图像与文本的跨模态特征表示，建立病灶特征与疾病知识之间的关联。此外，FAMT利用开放式问题的辅助知识，提升了模型在开放式Med-VQA任务中的能力。实验结果表明，KG-CMI在三个Med-VQA数据集（即VQA-RAD、SLAKE和OVQA）上均优于现有的先进方法。我们还进行了可解释性实验，进一步验证了该框架的有效性。

摘要 (Abstract)

Medical visual question answering (Med-VQA) is a crucial multimodal task in clinical decision support and telemedicine. Recent methods fail to fully leverage domain-specific medical knowledge, making it difficult to accurately associate lesion features in medical images with key diagnostic criteria. Additionally, classification-based approaches typically rely on predefined answer sets. Treating Med-VQA as a simple classification problem limits its ability to adapt to the diversity of free-form answers and may overlook detailed semantic information in those answers. To address these challenges, we propose a knowledge graph enhanced cross-Mamba interaction (KG-CMI) framework, which consists of a fine-grained cross-modal feature alignment (FCFA) module, a knowledge graph embedding (KGE) module, a cross-modal interaction representation (CMIR) module, and a free-form answer enhanced multi-task learning (FAMT) module. The KG-CMI learns cross-modal feature representations for images and texts by effectively integrating professional medical knowledge through a graph, establishing associations between lesion features and disease knowledge. Moreover, FAMT leverages auxiliary knowledge from open-ended questions, improving the model’s capability for open-ended Med-VQA. Experimental results demonstrate that KG-CMI outperforms existing state-of-the-art methods on three Med-VQA datasets, i.e., VQA-RAD, SLAKE, and OVQA. Additionally, we conduct interpretability experiments to further validate the framework’s effectiveness.

关键词: Medical visual question answering, Knowledge graph, Cross-modal interaction, Mamba, Free-form answer, Multimodal learning, Clinical decision support

210. ❌ Towards Viewpoint-Robust End-to-End Autonomous Driving with 3D Foundation Model Priors

作者: Hiroki Hashimoto, Hiromichi Goto, Hiroyuki Sugai, Hiroshi Kera, Kazuhiko Kawamoto 期刊/来源: arxiv 发布日期: 2026-04-01 arXiv链接: http://arxiv.org/abs/2604.00597v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

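Scoring rationale: The paper studies autonomous driving, using a 3D foundation model to supply geometric priors that make trajectory planning more robust; it mainly involves computer vision, 3D perception, and autonomous-driving techniques. All scoring keywords focus on large language models (LLMs) and related techniques (training methods, inference optimization, alignment, agents, and so on), whereas this paper does not involve language models or text processing at all, so every keyword is scored 0.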

!!! tip deepseek-chat TL;DR

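The paper proposes a method that uses geometric priors from a 3D foundation model to make end-to-end autonomous-driving trajectory planning more robust to camera viewpoint changes; experiments show clear improvements under pitch and height perturbations.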

摘要翻译

在相机视角变化下实现鲁棒的轨迹规划对于可扩展的端到端自动驾驶至关重要。然而，现有模型通常高度依赖于训练期间所见的相机视角。我们研究了一种无需数据增强的方法，该方法利用来自三维基础模型（3D foundation model）的几何先验。该方法将基于深度估计得到的逐像素三维位置作为位置嵌入（positional embeddings）注入模型，并通过交叉注意力机制融合中间几何特征。在VR-Drive相机视角扰动基准测试上的实验表明，在大多数扰动条件下性能下降有所减少，尤其在俯仰角（pitch）和高度扰动下改进明显。在纵向平移（longitudinal translation）下的增益较小，这表明需要对相机视角变化具有更强鲁棒性的、更独立于视角的集成方法。

摘要 (Abstract)

Robust trajectory planning under camera viewpoint changes is important for scalable end-to-end autonomous driving. However, existing models often depend heavily on the camera viewpoints seen during training. We investigate an augmentation-free approach that leverages geometric priors from a 3D foundation model. The method injects per-pixel 3D positions derived from depth estimates as positional embeddings and fuses intermediate geometric features through cross-attention. Experiments on the VR-Drive camera viewpoint perturbation benchmark show reduced performance degradation under most perturbation conditions, with clear improvements under pitch and height perturbations. Gains under longitudinal translation are smaller, suggesting that more viewpoint-agnostic integration is needed for robustness to camera viewpoint changes.
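
The step of deriving per-pixel 3D positions from depth estimates can be illustrated with a minimal sketch. It assumes a standard pinhole camera model, which the abstract does not specify; the function name `backproject_depth` and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical and chosen only for illustration.

```python
def backproject_depth(depth, fx, fy, cx, cy):
    """Map each pixel (u, v) with depth z to a camera-frame point (x, y, z).

    ASSUMPTION: pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy); the paper's actual camera model is not given here.
    """
    points = []
    for v, row in enumerate(depth):
        out_row = []
        for u, z in enumerate(row):
            x = (u - cx) * z / fx  # horizontal offset scaled by depth
            y = (v - cy) * z / fy  # vertical offset scaled by depth
            out_row.append((x, y, z))
        points.append(out_row)
    return points


# Toy 2x2 depth map (metres) with made-up intrinsics.
depth = [[2.0, 2.0], [4.0, 4.0]]
points = backproject_depth(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

In a model such as the one described, the resulting per-pixel coordinates would then be encoded as positional embeddings; that encoding step is not sketched here because the abstract gives no details about it.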

关键词: autonomous driving, 3D foundation model, viewpoint robustness, trajectory planning, geometric priors, camera viewpoint perturbation, positional embeddings, cross-attention

作者: Junhee Lee, Minseok Kim, Hwanjo Heo, Seungwon Woo, Jinwoo Kim 期刊/来源: arxiv 发布日期: 2026-04-01 arXiv链接: http://arxiv.org/abs/2604.00592v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	3.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

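Scoring rationale: The paper uses vision-language models (VLMs) to detect harassment in social VR, an application of large models to a specific domain. Keyword relevance: (1) the paper uses VLMs, which fall under large models but differ from pure LLMs/foundation models, so 3 points; (2) the paper explicitly fine-tunes VLMs, which is highly relevant to supervised fine-tuning, so 8 points; (3) other keywords such as MoE, Scaling Laws, RLHF, and RAG are not addressed, so 0 points; (4) the application domain is social VR safety, which does not fall under the bioinformatics or cheminformatics side of AI for Science, so 0 points.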

!!! tip deepseek-chat TL;DR

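The paper proposes HarassGuard, a VLM-based system that uses fine-tuning and contextual reasoning to detect harassment in social VR from visual input alone, reaching accuracies of up to 88.09% in binary classification and 68.85% in multi-class classification.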

摘要翻译

社交虚拟现实（VR）平台为用户提供了沉浸式的社交体验，但也使其面临严重的在线骚扰风险。现有的安全措施大多为被动响应，而能在事件发生期间主动检测骚扰行为的解决方案通常依赖于敏感的生物识别数据，引发了隐私担忧。本文提出HarassGuard，一种基于视觉语言模型（VLM）的系统，仅利用视觉输入来检测社交VR中的物理骚扰行为。我们构建了一个经伦理审查委员会（IRB）批准的骚扰视觉数据集，应用提示工程，并通过考虑社交VR中的上下文信息对VLM进行微调，以检测骚扰行为。实验结果表明，与最先进的基线模型（即LSTM/CNN、Transformer）相比，HarassGuard取得了具有竞争力的性能，在二分类任务中准确率最高达到88.09%，在多分类任务中达到68.85%。值得注意的是，HarassGuard在达到同等性能的同时，所需的微调样本量显著减少（200个对比1,115个），在上下文推理和隐私保护检测方面展现出独特优势。

摘要 (Abstract)

Social Virtual Reality (VR) platforms provide immersive social experiences but also expose users to serious risks of online harassment. Existing safety measures are largely reactive, while proactive solutions that detect harassment behavior during an incident often depend on sensitive biometric data, raising privacy concerns. In this paper, we present HarassGuard, a vision-language model (VLM) based system that detects physical harassment in social VR using only visual input. We construct an IRB-approved harassment vision dataset, apply prompt engineering, and fine-tune VLMs to detect harassment behavior by considering contextual information in social VR. Experimental results demonstrate that HarassGuard achieves competitive performance compared to state-of-the-art baselines (i.e., LSTM/CNN, Transformer), reaching an accuracy of up to 88.09% in binary classification and 68.85% in multi-class classification. Notably, HarassGuard matches these baselines while using significantly fewer fine-tuning samples (200 vs. 1,115), offering unique advantages in contextual reasoning and privacy-preserving detection.

关键词: Vision-Language Models, Social Virtual Reality, Harassment Detection, Fine-tuning, Contextual Reasoning, Privacy-preserving, Computer Vision, Multimodal AI

212. ❌ FecalFed: Privacy-Preserving Poultry Disease Detection via Federated Learning

作者: Tien-Yu Chi 期刊/来源: arxiv 发布日期: 2026-04-01 arXiv链接: http://arxiv.org/abs/2604.00559v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

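Scoring rationale: The paper focuses on applying computer vision and federated learning to agricultural disease detection; it does not involve large language models, innovations in deep-learning fundamentals, or any of the specific techniques named in the scoring keywords (MoE, RLHF, RAG, and so on). The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies AI to bioinformatics (avian influenza detection) and to a scientific domain (agriculture); this is the application setting rather than the core contribution, hence 5 points (some relevance). The other keywords are unrelated and score 0.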

!!! tip deepseek-chat TL;DR

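The paper proposes FecalFed, a privacy-preserving federated learning framework that detects poultry diseases from fecal images. After deduplicating the dataset and simulating non-IID conditions, it reaches 90.31% accuracy, close to the centralized upper bound of 95.10%, without centralizing sensitive data.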

摘要翻译

高致病性禽流感（HPAI）与地方性家禽疾病的早期检测对全球粮食安全至关重要。尽管计算机视觉模型在基于粪便图像进行疾病分类方面表现出色，但规模化部署这些系统却受限于农场数据隐私担忧与机构数据孤岛问题。此外，现有开源农业数据集普遍存在严重且未记录的数据污染。本文提出 FecalFed，一种用于家禽疾病分类的隐私保护联邦学习框架。我们首先整理并发布了 poultry-fecal-fl 数据集，该数据集经过严格去重，包含四个疾病类别的 8,770 张独立图像，揭示并消除了流行公共数据集中高达 46.89% 的重复率。为模拟真实农业环境，我们在高度异构、非独立同分布（non-IID）条件下（狄利克雷分布参数 α=0.5）评估 FecalFed。虽然孤立单农场训练在此数据异构性下性能崩溃，仅获得 64.86% 的准确率，但我们的联邦学习方法无需集中敏感数据即可恢复性能。具体而言，采用服务器端自适应优化（FedAdam）与 Swin-Small 架构实现了 90.31% 的准确率，接近集中式训练上限 95.10%。此外，我们证明边缘优化的 Swin-Tiny 模型保持了 89.74% 的高度竞争力性能，为农场禽类疾病监测建立了一个高效、隐私优先的实践蓝图。

摘要 (Abstract)

Early detection of highly pathogenic avian influenza (HPAI) and endemic poultry diseases is critical for global food security. While computer vision models excel at classifying diseases from fecal imaging, deploying these systems at scale is bottlenecked by farm data privacy concerns and institutional data silos. Furthermore, existing open-source agricultural datasets frequently suffer from severe, undocumented data contamination. In this paper, we introduce FecalFed, a privacy-preserving federated learning framework for poultry disease classification. We first curate and release poultry-fecal-fl, a rigorously deduplicated dataset of 8,770 unique images across four disease classes, revealing and eliminating a 46.89% duplication rate in popular public repositories. To simulate realistic agricultural environments, we evaluate FecalFed under highly heterogeneous, non-IID conditions (Dirichlet α=0.5). While isolated single-farm training collapses under this data heterogeneity, yielding only 64.86% accuracy, our federated approach recovers performance without centralizing sensitive data. Specifically, utilizing server-side adaptive optimization (FedAdam) with a Swin-Small architecture achieves 90.31% accuracy, closely approaching the centralized upper bound of 95.10%. Furthermore, we demonstrate that an edge-optimized Swin-Tiny model maintains highly competitive performance at 89.74%, establishing a highly efficient, privacy-first blueprint for on-farm avian disease monitoring.
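
The non-IID setup mentioned in the abstract (Dirichlet α=0.5) is commonly simulated by splitting each class across clients in proportions drawn from a Dirichlet distribution. The sketch below shows that common recipe, not the authors' code; the function name `dirichlet_partition`, the client count, and the toy labels are assumptions made for illustration.

```python
import random


def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients with per-class Dirichlet proportions.

    ASSUMPTION: this is the widely used label-skew recipe; the paper's exact
    partitioning procedure is not described in the abstract.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    clients = [[] for _ in range(num_clients)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # A Dirichlet(alpha) sample is a vector of normalised Gamma(alpha, 1) draws.
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(gammas)
        start, cumulative = 0, 0.0
        for c in range(num_clients):
            cumulative += gammas[c] / total
            # The last client takes the remainder so no index is dropped.
            end = len(indices) if c == num_clients - 1 else round(cumulative * len(indices))
            clients[c].extend(indices[start:end])
            start = end
    return clients


# Toy data: 200 samples over four classes, split across five hypothetical farms.
labels = [i % 4 for i in range(200)]
parts = dirichlet_partition(labels, num_clients=5, alpha=0.5)
```

Smaller values of `alpha` concentrate each class on fewer clients, which is the kind of heterogeneity that makes isolated single-farm training hard.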

关键词: federated learning, privacy-preserving, poultry disease detection, computer vision, data contamination, non-IID data, Swin Transformer, agricultural AI

213. ❌ STAR: Mitigating Cascading Errors in Spatial Reasoning via Turn-point Alignment and Segment-level DPO

作者: Pukun Zhao, Longxiang Wang, Chen Chen, Peicheng Wang, Fanqing Zhou, Runze Li, Haojian Huang 期刊/来源: arxiv 发布日期: 2026-04-01 arXiv链接: http://arxiv.org/abs/2604.00558v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

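Scoring rationale: The paper's core topic is the use of large language models (LLMs) for spatial reasoning. It directly involves the keywords LLMs, supervised fine-tuning (SFT), and Direct Preference Optimization (DPO), all of which are highly relevant (10 points). Its focus on mitigating reasoning errors in complex topologies relates to multi-step reasoning (CoT) and in-depth reasoning (System 2 Thinking) (8 points). Self-Correction is likewise scored 10 in the table, in line with the abstract's stated goal of refining self-correction. Other keywords such as MoE, SLMs, and RAG are not mentioned in the abstract and score 0.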
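
The arithmetic behind the verdict can be sketched as follows. The pass mark uses the formula stated in the report's configuration (5.0 + 0.8 × total weight). The per-keyword rule `weight * relevance` is an assumption inferred from the table columns, not something the report states; the function names are hypothetical.

```python
def pass_threshold(total_weight, base=5.0, slope=0.8):
    """Pass mark from the report's configuration: 5.0 + 0.8 * total weight."""
    return base + slope * total_weight


def total_score(rows):
    """Sum per-keyword scores.

    ASSUMPTION: each keyword contributes weight * relevance; the report
    does not spell out its scoring rule.
    """
    return sum(weight * relevance for weight, relevance in rows)


# Non-zero relevance rows from the table above, with the listed weight of 0.0.
star_rows = [(0.0, 10.0), (0.0, 10.0), (0.0, 10.0), (0.0, 8.0), (0.0, 8.0), (0.0, 10.0)]

threshold = pass_threshold(27.0)  # 27 keywords at weight 1.0 in the configuration
score = total_score(star_rows)
passed = score >= threshold
```

Under this assumed rule, a weight of 0.0 on every row yields a total of 0.0 however high the relevance values are, which matches the 0.0 / 26.6 shown for this paper.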

!!! tip deepseek-chat TL;DR

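To address the cascading errors that large language models tend to produce in structured spatial navigation, the paper proposes STAR, a two-stage framework grounded on topological anchors. It combines supervised fine-tuning with spatial-aware segment-level Direct Preference Optimization and achieves state-of-the-art performance among open-source models.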

摘要翻译

结构化空间导航是评估大语言模型空间推理能力的核心基准。现有范式如思维可视化在复杂拓扑结构中易产生级联错误。为解决此问题，我们提出基于拓扑锚点的两阶段框架STAR，并引入包含人类启发性转向点标注的RedMaze-23K数据集。第一阶段通过监督微调帮助模型内化空间语义并剪枝冗余路径；第二阶段采用空间感知分段直接偏好优化方法，以提升长视野导航中的自我修正能力。实验表明STAR在开源模型中达到最先进性能：其320亿参数版本超越DeepSeek-V3（29.27%对25.00%），达到GPT-4性能的82.4%。

摘要 (Abstract)

Structured spatial navigation is a core benchmark for Large Language Models (LLMs) spatial reasoning. Existing paradigms like Visualization-of-Thought (VoT) are prone to cascading errors in complex topologies. To solve this, we propose STAR, a two-stage framework grounded on topological anchors, and introduce the RedMaze-23K dataset with human-inspired turnpoint annotations. The first stage uses supervised fine-tuning to help models internalize spatial semantics and prune redundant paths. The second adopts Spatial-aware Segment-level Direct Preference Optimization (SDPO) to refine self-correction in long-horizon navigation. Experiments show STAR achieves state-of-the-art performance among open-source models: its 32B variant outperforms DeepSeek-V3 (29.27% vs. 25.00%) and reaches 82.4% of GPT-4’s performance.
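
For readers unfamiliar with the preference objective that the second stage builds on, here is a minimal sketch of the standard DPO loss for a single preference pair. It is not the paper's segment-level SDPO, whose details are not given in the abstract; the function name, argument names, and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of log-probabilities.

    loss = -log(sigmoid(beta * margin)), where margin compares how much the
    policy has moved towards the chosen response relative to the reference.
    ASSUMPTION: plain sequence-level DPO, not the paper's SDPO variant.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # Numerically stable softplus(-z) = log(1 + exp(-z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# The loss is lower when the policy favours the chosen response.
good = dpo_loss(-1.0, -5.0, -2.0, -2.0)
bad = dpo_loss(-5.0, -1.0, -2.0, -2.0)
```

A segment-level variant would presumably apply a comparison of this kind to parts of a trajectory rather than to whole responses, but how STAR defines and weights segments is not stated here.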

关键词: Large Language Models, Spatial Reasoning, Supervised Fine-tuning, Direct Preference Optimization, Self-correction, Cascading Errors, Topological Anchors, Navigation

214. ❌ Multi-Camera View Scaling for Data-Efficient Robot Imitation Learning

作者: Yichen Xie, Yixiao Wang, Shuqi Zhao, Cheng-En Wu, Masayoshi Tomizuka, Jianwen Xie, Hao-Shu Fang 期刊/来源: arxiv 发布日期: 2026-04-01 arXiv链接: http://arxiv.org/abs/2604.00557v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

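Scoring rationale: The paper studies data efficiency in robot imitation learning, generating pseudo-demonstrations by scaling the number of camera views; it belongs to robot vision and imitation learning. The keywords all relate directly to large language models, deep-learning fundamentals, and AI-for-science applications, none of which this paper addresses, so every keyword is scored 0.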

!!! tip deepseek-chat TL;DR

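The paper proposes generating pseudo-demonstrations by scaling camera views, which improves the data efficiency and generalization of robot imitation learning; it reports significant gains over single-view baselines in both simulation and real-world manipulation tasks.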

摘要翻译

机器人模仿学习策略的泛化能力从根本上受限于专家示范数据的多样性，而在不同环境中收集示范数据在实践中成本高昂且困难重重。本文提出一种实用框架，通过在示范收集过程中扩展相机视角来利用固有的场景多样性，无需额外人力投入。该方法并非采集更多轨迹，而是利用多个同步相机视角从每条专家轨迹生成伪示范数据，从而丰富训练数据分布并提升视觉表征的视角不变性。我们分析了不同动作空间与视角扩展机制的交互关系，并证明相机空间表征能进一步增强多样性。此外，我们提出一种多视角动作聚合方法，使单视角策略在部署阶段也能受益于多相机系统。在仿真和真实机器人操作任务中的大量实验表明，相较于单视角基线方法，本框架在数据效率和泛化能力方面均取得显著提升。研究结果表明，扩展相机视角为模仿学习提供了实用且可扩展的解决方案，该方法仅需最低限度的额外硬件配置，并能与现有模仿学习算法无缝集成。项目网站详见 https://yichen928.github.io/robot_multiview。

摘要 (Abstract)

The generalization ability of imitation learning policies for robotic manipulation is fundamentally constrained by the diversity of expert demonstrations, while collecting demonstrations across varied environments is costly and difficult in practice. In this paper, we propose a practical framework that exploits inherent scene diversity without additional human effort by scaling camera views during demonstration collection. Instead of acquiring more trajectories, multiple synchronized camera perspectives are used to generate pseudo-demonstrations from each expert trajectory, which enriches the training distribution and improves viewpoint invariance in visual representations. We analyze how different action spaces interact with view scaling and show that camera-space representations further enhance diversity. In addition, we introduce a multiview action aggregation method that allows single-view policies to benefit from multiple cameras during deployment. Extensive experiments in simulation and real-world manipulation tasks demonstrate significant gains in data efficiency and generalization compared to single-view baselines. Our results suggest that scaling camera views provides a practical and scalable solution for imitation learning, which requires minimal additional hardware setup and integrates seamlessly with existing imitation learning algorithms. The website of our project is https://yichen928.github.io/robot_multiview.

关键词: imitation learning, robot manipulation, multi-camera views, data efficiency, pseudo-demonstrations, viewpoint invariance, visual representations, action aggregation

215. ❌ TF-SSD: A Strong Pipeline via Synergic Mask Filter for Training-free Co-salient Object Detection

作者: Zhijin He, Shuo Jin, Siyue Yu, Shuwei Wu, Bingfeng Zhang, Li Yu, Jimin Xiao 期刊/来源: arxiv 发布日期: 2026-04-01 arXiv链接: http://arxiv.org/abs/2604.00549v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

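Scoring rationale: The paper studies Co-salient Object Detection (CoSOD), a computer-vision task, and uses Vision Foundation Models (VFMs) such as SAM and DINO to build a training-free method, TF-SSD. Its core is a vision task and the application of vision models, which is unrelated to the great majority of the keywords (these mainly target LLMs and related techniques). The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies AI to computer vision (which can be viewed as a scientific application in a broad sense), but it is not bioinformatics or cheminformatics, hence 5 points (some relevance). No other keyword is addressed.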

!!! tip deepseek-chat TL;DR

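The paper proposes TF-SSD, a training-free co-salient object detection method that combines SAM and DINO; experiments show it outperforms existing methods (for example, a 13.7% gain over the recent training-free method).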

摘要翻译

协同显著目标检测（CoSOD）旨在分割在一组相关图像中持续出现的显著目标。尽管近期基于训练的方法取得了显著进展，但它们仍受限于闭集数据集，并表现出有限的泛化能力。然而，很少有研究探索视觉基础模型（VFMs）在解决CoSOD任务上的潜力，这些模型展现出强大的泛化能力和稳健的显著性理解。本文研究并利用VFMs进行CoSOD，进一步提出一种新颖的无训练方法TF-SSD，通过SAM与DINO的协同实现。具体而言，我们首先利用SAM生成全面的原始候选区域，作为候选掩码池。随后，我们引入一个质量掩码生成器来过滤冗余掩码，从而获得一个精炼的掩码集合。由于该生成器基于SAM构建，其本质上缺乏对显著性的语义理解。为此，我们采用了一个图像内显著性过滤器，利用DINO的注意力图来识别单张图像中视觉上显著的掩码。此外，为了将显著性理解扩展到图像组之间，我们提出了一个图像间原型选择器，通过计算跨图像原型之间的相似度得分，选择得分最高的掩码。这些选定的掩码将作为CoSOD的最终预测结果。大量实验表明，我们的TF-SSD方法优于现有方法（例如，相较于近期无训练方法提升了13.7%）。代码可在https://github.com/hzz-yy/TF-SSD获取。

摘要 (Abstract)

Co-salient Object Detection (CoSOD) aims to segment salient objects that consistently appear across a group of related images. Despite the notable progress achieved by recent training-based approaches, they still remain constrained by the closed-set datasets and exhibit limited generalization. However, few studies explore the potential of Vision Foundation Models (VFMs) to address CoSOD, which demonstrate a strong generalized ability and robust saliency understanding. In this paper, we investigate and leverage VFMs for CoSOD, and further propose a novel training-free method, TF-SSD, through the synergy between SAM and DINO. Specifically, we first utilize SAM to generate comprehensive raw proposals, which serve as a candidate mask pool. Then, we introduce a quality mask generator to filter out redundant masks, thereby acquiring a refined mask set. Since this generator is built upon SAM, it inherently lacks semantic understanding of saliency. To this end, we adopt an intra-image saliency filter that employs DINO’s attention maps to identify visually salient masks within individual images. Moreover, to extend saliency understanding across group images, we propose an inter-image prototype selector, which computes similarity scores among cross-image prototypes to select masks with the highest score. These selected masks serve as final predictions for CoSOD. Extensive experiments show that our TF-SSD outperforms existing methods (e.g., 13.7% gains over the recent training-free method). Codes are available at https://github.com/hzz-yy/TF-SSD.

关键词: Co-salient Object Detection, Training-free, Vision Foundation Models, SAM, DINO, Mask Filter, Synergy, Generalization

216. ❌ Reliev3R: Relieving Feed-forward Reconstruction from Multi-View Geometric Annotations

作者: Youyu Chen, Junjun Jiang, Yueru Luo, Kui Jiang, Xianming Liu, Xu Yan, Dave Zhenyu Chen 期刊/来源: arxiv 发布日期: 2026-04-01 arXiv链接: http://arxiv.org/abs/2604.00548v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

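Scoring rationale: The paper studies a weakly supervised training method for 3D reconstruction (Reliev3R). Its core aim is to reduce reliance on multi-view geometric annotations by training on monocular relative depths and sparse image correspondences. It is unrelated to most keywords and only partly related to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points), because the method uses zero-shot predictions from pretrained models, although the paper itself does not focus on pre-training techniques. The remaining keywords cover large-model techniques, alignment, reasoning, agents, and similar topics, none of which apply.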

!!! tip deepseek-chat TL;DR

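The paper proposes Reliev3R, a weakly supervised paradigm for training feed-forward reconstruction models from scratch without costly multi-view geometric annotations. By using zero-shot predictions from pretrained models, it reaches performance comparable to fully supervised models.

Abstract (Chinese translation)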

随着近期技术进步，前馈重建模型（FFRMs）在重建质量及对多下游任务的适应性方面展现出巨大潜力。然而，其对多视图几何标注（如三维点云图与相机位姿）的过度依赖，使得FFRMs的全监督训练方案难以规模化。本文提出Reliev3R，一种弱监督训练范式，可在无需昂贵多视图几何标注的情况下从头训练FFRMs。该方法通过减轻对几何传感数据与计算密集的运动结构预处理流程的依赖，直接从预训练模型的零样本预测所生成的单目相对深度与图像稀疏对应关系中提取三维知识。Reliev3R的核心在于设计了模糊感知相对深度损失与基于三角测量的重投影损失，以促进多视图几何一致性的监督。在数据量较少的情况下从头训练时，Reliev3R的性能可媲美全监督的同类模型，为低成本三维重建监督与可扩展FFRMs的发展迈出了重要一步。

Abstract

With recent advances, Feed-forward Reconstruction Models (FFRMs) have demonstrated great potential in reconstruction quality and adaptiveness to multiple downstream tasks. However, the excessive reliance on multi-view geometric annotations, e.g., 3D point maps and camera poses, makes the fully-supervised training scheme of FFRMs difficult to scale up. In this paper, we propose Reliev3R, a weakly-supervised paradigm for training FFRMs from scratch without cost-prohibitive multi-view geometric annotations. Relieving the reliance on geometric sensory data and compute-exhaustive structure-from-motion preprocessing, our method draws 3D knowledge directly from monocular relative depths and image sparse correspondences given by zero-shot predictions of pretrained models. At the core of Reliev3R, we design an ambiguity-aware relative depth loss and a trigonometry-based reprojection loss to facilitate supervision for multi-view geometric consistency. Training from scratch with less data, Reliev3R catches up with its fully-supervised sibling models, taking a step towards low-cost 3D reconstruction supervision and scalable FFRMs.

Keywords: Feed-forward Reconstruction Models, weakly-supervised training, 3D reconstruction, multi-view geometric annotations, relative depth, sparse correspondences, ambiguity-aware loss, trigonometry-based reprojection
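The abstract does not spell out Reliev3R's reprojection loss, so the following is only a generic sketch of what a reprojection error measures: a predicted 3D point is projected into two views and compared with the matched pixel in each. The pinhole camera without rotation, the focal length, the baseline, and all function names are assumptions made for illustration, not the paper's formulation.

```python
import math

Point3 = tuple[float, float, float]
Pixel = tuple[float, float]


def project(point: Point3, focal: float, center: Pixel, cam_x: float = 0.0) -> Pixel:
    """Pinhole projection for a camera at (cam_x, 0, 0) looking down +z, no rotation."""
    x, y, z = point
    return (focal * (x - cam_x) / z + center[0], focal * y / z + center[1])


def reprojection_error(point: Point3, observed: Pixel, focal: float,
                       center: Pixel, cam_x: float = 0.0) -> float:
    """Pixel distance between the projected point and the observed match."""
    u, v = project(point, focal, center, cam_x)
    return math.hypot(u - observed[0], v - observed[1])


# Hypothetical two-view setup: same intrinsics, second camera shifted along x.
FOCAL, CENTER, BASELINE = 500.0, (320.0, 240.0), 0.2
true_point = (0.4, -0.1, 2.0)
obs_a = project(true_point, FOCAL, CENTER, 0.0)       # sparse correspondence, view A
obs_b = project(true_point, FOCAL, CENTER, BASELINE)  # sparse correspondence, view B

predicted = (0.4, -0.1, 2.5)  # hypothetical prediction with the wrong depth
loss = (reprojection_error(predicted, obs_a, FOCAL, CENTER, 0.0)
        + reprojection_error(predicted, obs_b, FOCAL, CENTER, BASELINE))
print(f"summed reprojection error: {loss:.2f} px")
```

A prediction that agrees with both correspondences yields zero error, which is why such a term can act as a multi-view consistency signal without ground-truth 3D points.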

217. ❌ Neuropsychiatric Deviations From Normative Profiles: An MRI-Derived Marker for Early Alzheimer’s Disease Detection

Authors: Synne Hjertager Osenbroch, Lisa Ramona Rosvold, Yao Lu, Alvaro Fernandez-Quilez Venue/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00545v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

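Scoring rationale: The paper focuses on early detection of Alzheimer's disease using deep learning (a 3D convolutional neural network) and belongs to medical image analysis and neuroscience. Although it applies AI to science (medicine), the other keywords are specific to large language models (LLMs) and related techniques (MoE, SFT, RLHF, RAG, reasoning, agents, quantization, and so on), and the paper mentions no LLM, language model, or related technique. The only loosely related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', since the paper applies AI to biomedicine (neuroscience), but it is not a core match (the paper uses no bioinformatics or cheminformatics methods). That keyword therefore receives 5 points (some relevance) and all others receive 0 (unrelated).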

!!! tip deepseek-chat TL;DR

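The study develops a deep learning-based normative modelling framework that uses structural MRI to identify atypical neuropsychiatric symptom burden as a biomarker for early Alzheimer's disease detection, with predictive accuracy comparable to a cerebrospinal fluid biomarker.

Abstract (Chinese translation)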

神经精神症状（Neuropsychiatric symptoms, NPS），如抑郁和淡漠，在阿尔茨海默病（Alzheimer’s disease, AD）中十分常见，且常早于认知衰退出现。由于NPS与疾病进展相关且具有非侵入性特点，其评估有望成为早期检测标志物。然而，现有工具无法区分NPS是正常衰老的一部分还是AD的早期征兆，这限制了其应用价值。本文提出一种基于深度学习的规范建模框架，旨在通过结构磁共振成像识别非典型的NPS负担。我们利用阿尔茨海默病神经影像倡议计划中认知稳定参与者的数据，训练了一个三维卷积神经网络，以学习大脑解剖结构与神经精神症状问卷（Neuropsychiatric Inventory Questionnaire, NPIQ）评分之间的映射关系。预测评分与观测评分之间的偏差被定义为NPIQ评分偏离度（Divergence from NPIQ scores, DNPI）。较高的DNPI与未来转化为AD相关（校正后比值比=2.5；p < 0.01），其预测准确性可与脑脊液AB42相媲美（受试者工作特征曲线下面积分别为0.74与0.75）。我们的方法为早期AD检测提供了一种可扩展、非侵入性的策略。

Abstract

Neuropsychiatric symptoms (NPS) such as depression and apathy are common in Alzheimer’s disease (AD) and often precede cognitive decline. NPS assessments hold promise as early detection markers due to their correlation with disease progression and their non-invasive nature. Yet current tools cannot distinguish whether NPS are part of aging or early signs of AD, limiting their utility. We present a deep learning-based normative modelling framework to identify atypical NPS burden from structural MRI. A 3D convolutional neural network was trained on cognitively stable participants from the Alzheimer’s Disease Neuroimaging Initiative, learning the mapping between brain anatomy and Neuropsychiatric Inventory Questionnaire (NPIQ) scores. Deviations between predicted and observed scores defined the Divergence from NPIQ scores (DNPI). Higher DNPI was associated with future AD conversion (adjusted OR=2.5; p < 0.01) and achieved predictive accuracy comparable to cerebrospinal fluid AB42 (AUC=0.74 vs 0.75). Our approach supports scalable, non-invasive strategies for early AD detection.

Keywords: Alzheimer’s disease, neuropsychiatric symptoms, deep learning, normative modelling, structural MRI, early detection, 3D convolutional neural network, biomarker
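The deviation score described in the abstract can be illustrated with a small sketch. It assumes the deviation is taken as observed minus predicted NPIQ score (the abstract does not state the sign convention) and evaluates it with a rank-based AUC. All numbers below are synthetic toy values chosen for illustration, not data from the paper, and the function names are hypothetical.

```python
def deviation_scores(observed, predicted):
    """Per-participant deviation; the sign convention (observed - predicted) is assumed."""
    return [o - p for o, p in zip(observed, predicted)]


def auc(scores, labels):
    """Rank-based AUC: chance that a positive case outranks a negative one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic toy values, for illustration only.
observed_npiq = [2.0, 5.0, 1.0, 6.0, 0.0, 4.0]   # questionnaire scores
predicted_npiq = [2.5, 2.0, 1.5, 3.0, 1.0, 3.5]  # scores a normative model might predict from MRI
converted = [0, 1, 0, 0, 0, 1]                   # 1 = later conversion (toy labels)

dnpi = deviation_scores(observed_npiq, predicted_npiq)
print("DNPI:", dnpi)
print("AUC:", auc(dnpi, converted))
```

The point of the normative setup is that the model is fit on cognitively stable participants, so a large deviation flags symptom burden that brain anatomy alone does not account for.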

218. ❌ TRiGS: Temporal Rigid-Body Motion for Scalable 4D Gaussian Splatting

Authors: Suwoong Yeom, Joonsik Nam, Seunggyu Choi, Lucas Yunkyu Lee, Sangmin Kim, Jaesik Park, Joonsoo Kim, Kugjin Yun, Kyeongbo Kong, Sukju Kang Venue/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00538v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

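Scoring rationale: The paper addresses 4D dynamic scene reconstruction (4D Gaussian Splatting) in computer vision and graphics, proposing a continuous geometric transformation representation (TRiGS) to tackle temporal fragmentation and memory growth. The scoring keywords all concern large language models, deep learning principles, and AI-for-science applications, whereas this work improves a computer vision/graphics algorithm and involves no large models, deep learning innovations, or AI-for-science applications, so every keyword receives a relevance of 0.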

!!! tip deepseek-chat TL;DR

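The paper targets the temporal fragmentation and unbounded memory growth of existing 4D Gaussian Splatting methods for dynamic scene reconstruction. It proposes TRiGS, which uses unified, continuous geometric transformations to improve temporal consistency and scalability, and significantly outperforms prior work on long video sequences.

Abstract (Chinese translation)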

近期4D高斯泼溅（4DGS）方法在动态场景重建方面取得了显著成果，但通常依赖于分段线性速度近似和短时间窗口。这种割裂的建模方式导致了严重的时间碎片化，迫使基元被反复消除和再生以追踪复杂的非线性动态。这种临时性近似消除了物体的长期时间一致性，并不可避免地引发高斯元数量的激增，从而阻碍了方法在长视频序列上的可扩展性。为解决这一问题，我们提出了TRiGS——一种利用统一、连续几何变换的新型4D表示方法。通过整合$SE(3)$变换、分层贝塞尔残差和可学习的局部锚点，TRiGS为独立基元建立了几何一致的刚性运动模型。这种连续表达形式保持了时间一致性，并有效抑制了内存的无限制增长。大量实验表明，TRiGS在标准基准测试中实现了高保真度渲染，同时能够独特地扩展到长视频序列（例如600至1200帧）而不会产生严重的内存瓶颈，在时间稳定性方面显著超越了现有方法。

Abstract

Recent 4D Gaussian Splatting (4DGS) methods achieve impressive dynamic scene reconstruction but often rely on piecewise linear velocity approximations and short temporal windows. This disjointed modeling leads to severe temporal fragmentation, forcing primitives to be repeatedly eliminated and regenerated to track complex nonlinear dynamics. This makeshift approximation eliminates the long-term temporal identity of objects and causes an inevitable proliferation of Gaussians, hindering scalability to extended video sequences. To address this, we propose TRiGS, a novel 4D representation that utilizes unified, continuous geometric transformations. By integrating $SE(3)$ transformations, hierarchical Bezier residuals, and learnable local anchors, TRiGS models geometrically consistent rigid motions for individual primitives. This continuous formulation preserves temporal identity and effectively mitigates unbounded memory growth. Extensive experiments demonstrate that TRiGS achieves high fidelity rendering on standard benchmarks while uniquely scaling to extended video sequences (e.g., 600 to 1200 frames) without severe memory bottlenecks, significantly outperforming prior works in temporal stability.

Keywords: 4D Gaussian Splatting, dynamic scene reconstruction, temporal fragmentation, rigid-body motion, continuous geometric transformations, scalability, temporal stability, memory growth
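To make the ingredients named in the abstract concrete, the sketch below evaluates a Bezier curve with De Casteljau's algorithm and applies a rigid transform to a point, then combines the two into one continuous trajectory. How TRiGS actually composes $SE(3)$ transforms, Bezier residuals, and anchors is not stated in the abstract; the composition in `primitive_position`, the restriction to rotation about the z axis, and all names are assumptions made for illustration.

```python
import math


def bezier(control_points, t):
    """Evaluate a Bezier curve at t in [0, 1] with De Casteljau's algorithm."""
    pts = [tuple(p) for p in control_points]
    while len(pts) > 1:
        pts = [tuple((1 - t) * a + t * b for a, b in zip(p, q))
               for p, q in zip(pts, pts[1:])]
    return pts[0]


def rigid_transform_z(point, angle, translation):
    """Rigid motion restricted to rotation about z, followed by a translation."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = point
    rotated = (c * x - s * y, s * x + c * y, z)
    return tuple(r + d for r, d in zip(rotated, translation))


def primitive_position(anchor_offset, t, angular_velocity, velocity, residual_ctrl):
    """Assumed composition: rigid motion of an anchor offset plus a Bezier residual."""
    base = rigid_transform_z(anchor_offset, angular_velocity * t,
                             tuple(v * t for v in velocity))
    residual = bezier(residual_ctrl, t)
    return tuple(b + r for b, r in zip(base, residual))


# One continuous trajectory over normalized time, with no per-window re-spawning.
ctrl = [(0.0, 0.0, 0.0), (0.0, 0.1, 0.0), (0.0, 0.0, 0.0)]
for t in (0.0, 0.5, 1.0):
    print(t, primitive_position((1.0, 0.0, 0.0), t, math.pi / 2, (0.0, 0.0, 1.0), ctrl))
```

Because position is a closed-form function of time, the same primitive can be queried at any frame, which is the property the abstract credits for preserving temporal identity.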

219. ❌ FreqPhys: Repurposing Implicit Physiological Frequency Prior for Robust Remote Photoplethysmography

Authors: Wei Qian, Dan Guo, Jinxing Zhou, Bochao Zou, Zitong Yu, Meng Wang Venue/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00534v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

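Scoring rationale: FreqPhys focuses on remote photoplethysmography (rPPG) and recovers physiological signals with a frequency-guided deep learning framework. Of the 27 keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" is partly related (5 points), because rPPG is a biomedical AI application involving physiological monitoring and signal processing, although the paper does not explicitly mention bioinformatics or cheminformatics. The other keywords concern large language models (LLMs), model training, inference optimization, agent systems, and so on. The paper sits entirely at the intersection of computer vision and biomedical engineering and involves no large-model techniques or related innovations, so those keywords receive 0. The weighted total is computed as 5.0 (a single keyword scoring 5 with weight 1.0).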
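The scoring arithmetic can be reproduced with a small sketch. The pass-threshold formula (5.0 + 0.8 × total weight) and the per-keyword cap of 15 come from the report configuration; the per-keyword rule `weight × relevance` is an assumption that happens to match the tabulated rows, and all function names are hypothetical. The table lists a weight of 0.0 for every keyword while the rationale refers to a weight of 1.0, so the sketch evaluates both readings; either way the total stays below the threshold.

```python
MAX_SCORE_PER_KEYWORD = 15.0  # per-keyword cap from the report configuration


def passing_score(total_weight: float) -> float:
    """Pass threshold as configured: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def keyword_score(weight: float, relevance: float) -> float:
    """Assumed rule: weight * relevance (0-10 scale), capped at the per-keyword maximum."""
    return min(weight * relevance, MAX_SCORE_PER_KEYWORD)


def total_score(rows: list[tuple[float, float]]) -> float:
    """Sum keyword scores over (weight, relevance) rows."""
    return sum(keyword_score(w, r) for w, r in rows)


threshold = passing_score(27.0)  # 27 keywords, each configured with weight 1.0
as_tabulated = total_score([(0.0, 5.0)] + [(0.0, 0.0)] * 26)  # weights as shown in the table
as_described = total_score([(1.0, 5.0)] + [(1.0, 0.0)] * 26)  # weights as stated in the rationale

print(f"threshold={threshold:.1f} tabulated={as_tabulated} described={as_described}")
```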

!!! tip deepseek-chat TL;DR

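The paper proposes FreqPhys, a frequency-guided remote photoplethysmography framework that uses physiological frequency priors and cross-domain representation learning to suppress motion artifacts and illumination fluctuations, significantly improving physiological signal recovery under challenging motion conditions.

Abstract (Chinese translation)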

远程光电容积描记术（rPPG）通过从面部视频中捕捉细微的肤色变化，实现了非接触式生理监测。然而，现有方法主要依赖于时域建模，使其易受运动伪影和光照波动的影响，微弱的生理线索容易被噪声淹没。为应对这些挑战，我们提出了FreqPhys——一个频率引导的rPPG框架，该框架显式利用生理频率先验知识以实现鲁棒的信号恢复。具体而言，FreqPhys首先应用生理带通滤波模块以抑制带外干扰，随后结合生理频谱调制与自适应频谱选择，以增强与脉搏相关的频率成分，同时抑制残留的带内噪声。跨域表征学习模块进一步将这些频谱先验与深层时域特征相融合，以捕获信息丰富的时空依赖性。最后，一个频率感知的条件扩散过程逐步重建高保真rPPG信号。在六个基准数据集上的大量实验表明，FreqPhys相较于现有最先进方法取得了显著提升，尤其在具有挑战性的运动条件下。这凸显了显式建模生理频率先验的重要性。源代码将予以公开。

Abstract

Remote photoplethysmography (rPPG) enables contactless physiological monitoring by capturing subtle skin-color variations from facial videos. However, most existing methods predominantly rely on time-domain modeling, making them vulnerable to motion artifacts and illumination fluctuations, where weak physiological clues are easily overwhelmed by noise. To address these challenges, we propose FreqPhys, a frequency-guided rPPG framework that explicitly leverages physiological frequency priors for robust signal recovery. Specifically, FreqPhys first applies a Physiological Bandpass Filtering module to suppress out-of-band interference, and then performs Physiological Spectrum Modulation together with adaptive spectral selection to emphasize pulse-related frequency components while suppressing residual in-band noise. A Cross-domain Representation Learning module further fuses these spectral priors with deep time-domain features to capture informative spatial–temporal dependencies. Finally, a frequency-aware conditional diffusion process progressively reconstructs high-fidelity rPPG signals. Extensive experiments on six benchmarks demonstrate that FreqPhys yields significant improvements over state-of-the-art approaches, particularly under challenging motion conditions. This highlights the importance of explicitly modeling physiological frequency priors. The source code will be released.

Keywords: remote photoplethysmography, rPPG, frequency-guided framework, physiological frequency priors, motion artifacts, spectral modulation, cross-domain representation learning, diffusion process
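The idea behind a physiological band-pass step can be sketched with a plain frequency-domain mask. This is not the paper's module: the 0.7 to 3.0 Hz band (roughly 42 to 180 beats per minute), the 30 fps frame rate, the synthetic test signal, and all function names are assumptions, and a naive DFT is used only to keep the example dependency-free.

```python
import cmath
import math


def dft(samples):
    """Naive discrete Fourier transform (O(N^2)); adequate for short clips."""
    n = len(samples)
    return [sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples))
            for k in range(n)]


def inverse_dft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(c * cmath.exp(2j * math.pi * k * i / n) for k, c in enumerate(spectrum)).real / n
            for i in range(n)]


def bandpass(samples, sample_rate, low_hz, high_hz):
    """Zero every frequency bin outside [low_hz, high_hz], then invert."""
    n = len(samples)
    spectrum = dft(samples)
    for k in range(n):
        freq = min(k, n - k) * sample_rate / n  # bins above n/2 mirror negative frequencies
        if not low_hz <= freq <= high_hz:
            spectrum[k] = 0j
    return inverse_dft(spectrum)


def dominant_frequency(samples, sample_rate):
    """Frequency in Hz of the strongest non-DC bin."""
    n = len(samples)
    magnitudes = [abs(c) for c in dft(samples)[: n // 2 + 1]]
    peak = max(range(1, len(magnitudes)), key=magnitudes.__getitem__)
    return peak * sample_rate / n


FPS = 30.0  # assumed video frame rate
N = 300     # 10-second clip
raw = [math.sin(2 * math.pi * 1.2 * i / FPS)          # pulse at 1.2 Hz (72 bpm)
       + 3.0 * math.sin(2 * math.pi * 0.1 * i / FPS)  # slow illumination drift
       + 2.0 * math.sin(2 * math.pi * 6.0 * i / FPS)  # fast motion-like interference
       for i in range(N)]
filtered = bandpass(raw, FPS, 0.7, 3.0)
print(dominant_frequency(raw, FPS), dominant_frequency(filtered, FPS))
```

A fixed mask like this only removes out-of-band interference; noise that falls inside the pulse band survives, which is the gap the abstract's spectrum modulation and adaptive selection steps are said to address.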

220. ❌ AceTone: Bridging Words and Colors for Conditional Image Grading

Authors: Tianren Ma, Mingxiang Liao, Xijin Zhang, Qixiang Ye Venue/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00530v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

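Scoring rationale: AceTone addresses conditional image color grading in computer vision and image processing, using a VQ-VAE, a vision-language model, and reinforcement learning. Although it involves generative models and reinforcement learning, the keywords explicitly target large language models (LLMs) and related techniques (MoE, SFT, RLHF, RAG, CoT, agents, and so on) or AI applications in specific scientific fields (such as bioinformatics). The paper does not cover language models, text generation, reasoning, alignment, agents, or AI-for-science applications, so every keyword scores 0.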

!!! tip deepseek-chat TL;DR

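The paper proposes AceTone, a new method for conditional image color grading that generates 3D-LUTs from text prompts or reference images. It is built on a VQ-VAE, a vision-language model, and reinforcement learning, and outperforms existing methods in both quantitative performance and human evaluation.

Abstract (Chinese translation)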

色彩影响我们对图像风格与情感的解读。现有的调色方法多依赖于局部重着色或固定滤镜库，难以泛化至多样创作意图或契合人类审美偏好。本研究提出AceTone，首次在统一框架内实现多模态条件调色的方法。AceTone将调色构建为生成式色彩转换任务，通过模型直接依据文本提示或参考图像生成对应的三维查找表（3D-LUT）。我们开发了基于矢量量化变分自编码器（VQ-VAE）的标记化器，将$3\times32^3$维度的LUT向量压缩为64个离散标记，并保持$\Delta E<2$的色彩保真度。进一步构建了大规模数据集AceTone-800K，训练视觉语言模型以预测LUT标记，并通过强化学习使输出结果与感知保真度及美学标准对齐。实验表明，AceTone在文本引导与参考引导的调色任务中均达到最优性能，相较于现有方法将学习感知图像块相似度（LPIPS）指标提升最高达50%。人工评估证实AceTone生成的结果视觉愉悦且风格协调，为语言驱动、审美对齐的色彩调校开辟了新路径。

Abstract

Color affects how we interpret image style and emotion. Previous color grading methods rely on patch-wise recoloring or fixed filter banks, struggling to generalize across creative intents or align with human aesthetic preferences. In this study, we propose AceTone, the first approach that supports multimodal conditioned color grading within a unified framework. AceTone formulates grading as a generative color transformation task, where a model directly produces 3D-LUTs conditioned on text prompts or reference images. We develop a VQ-VAE-based tokenizer which compresses a $3\times32^3$ LUT vector to 64 discrete tokens with $\Delta E<2$ fidelity. We further build a large-scale dataset, AceTone-800K, and train a vision-language model to predict LUT tokens, followed by reinforcement learning to align outputs with perceptual fidelity and aesthetics. Experiments show that AceTone achieves state-of-the-art performance on both text-guided and reference-guided grading tasks, improving LPIPS by up to 50% over existing methods. Human evaluations confirm that AceTone’s results are visually pleasing and stylistically coherent, demonstrating a new pathway toward language-driven, aesthetic-aligned color grading.

Keywords: color grading, conditional image transformation, 3D-LUT generation, VQ-VAE, vision-language model, reinforcement learning, aesthetic alignment, multimodal conditioning
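For readers unfamiliar with 3D-LUTs, the sketch below builds a $32^3$ lookup table from a color transform, applies it to a pixel, and works out the sizes quoted in the abstract. The `cooler` grade, the nearest-neighbor lookup (production pipelines typically interpolate between grid points), and all names are assumptions for illustration; the sketch does not reproduce AceTone's tokenizer.

```python
LUT_SIZE = 32  # grid points per color axis, matching the 3 x 32^3 LUT in the abstract


def build_lut(transform, size=LUT_SIZE):
    """Sample an RGB -> RGB function on a size^3 grid, stored as a flat list of triples."""
    last = size - 1
    return [transform(r / last, g / last, b / last)
            for r in range(size) for g in range(size) for b in range(size)]


def apply_lut(lut, pixel, size=LUT_SIZE):
    """Nearest-neighbor lookup for one RGB pixel with channels in [0, 1]."""
    r, g, b = (min(size - 1, max(0, round(c * (size - 1)))) for c in pixel)
    return lut[(r * size + g) * size + b]


def cooler(r, g, b):
    """Hypothetical grade for illustration: damp the red channel by 20%."""
    return (0.8 * r, g, b)


lut = build_lut(cooler)
values_per_lut = 3 * len(lut)            # 3 * 32**3 scalar values
values_per_token = values_per_lut // 64  # scalars represented per token if 64 tokens encode the LUT
graded = apply_lut(lut, (1.0, 0.0, 0.0))
print(values_per_lut, values_per_token, graded)
```

The arithmetic shows why tokenization matters here: a model that emits 64 discrete tokens stands in for 98,304 scalar LUT entries.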

221. ❌ Learnability-Guided Diffusion for Dataset Distillation

Authors: Jeffrey A. Chan-Santiago, Mubarak Shah Venue/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00519v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

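Scoring rationale: The paper concerns diffusion models and dataset distillation in computer vision, studying how learnability guidance can generate synthetic datasets with less redundancy and better training efficiency. The scoring keywords relate to large language models (LLMs) and associated techniques (MoE, SFT, RLHF, RAG, inference acceleration, alignment, and so on) or to AI for Science (bioinformatics, cheminformatics). The paper involves no LLM techniques, innovations in large-model principles, or scientific AI applications, so every keyword receives a relevance of 0.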

!!! tip deepseek-chat TL;DR

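The paper addresses the redundant training signals that existing diffusion-based dataset distillation methods produce. It proposes Learnability-Guided Diffusion (LGD), which builds the synthetic dataset incrementally and guides sample generation with learnability scores, reducing redundancy by 39.1% and achieving state-of-the-art results on ImageNet-1K and other datasets.

Abstract (Chinese translation)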

在大规模数据集上训练机器学习模型成本高昂且耗时。数据集蒸馏通过创建小型合成数据集来解决这一问题，使其能达到与完整数据集相同的性能。近期方法利用扩散模型生成蒸馏数据，其策略包括促进多样性或匹配训练梯度。然而，现有方法会产生冗余的训练信号，即样本传递重叠的信息。实验表明，蒸馏数据集的不相交子集捕获了80-90%的重叠信号。这种冗余源于优化视觉多样性或平均训练动态时未考虑样本间的相似性，导致数据集中多个样本共享相似信息而非互补知识。我们提出可学习性驱动的数据集蒸馏方法，通过连续阶段逐步构建合成数据集。该方法从少量样本开始，训练模型并根据可学习性分数生成新样本——该分数用于识别当前模型能从哪些信息中学习，从而形成自适应课程。我们引入了可学习性引导扩散（Learnability-Guided Diffusion, LGD），该方法在平衡当前模型训练效用与参考模型下有效性的同时，生成与课程对齐的样本。我们的方法将冗余度降低了39.1%，促进了训练阶段间的专业化分工，并在ImageNet-1K（60.1%）、ImageNette（87.2%）和ImageWoof（72.9%）数据集上取得了最先进的结果。代码已发布于项目页面https://jachansantiago.github.io/learnability-guided-distillation/。

Abstract

Training machine learning models on massive datasets is expensive and time-consuming. Dataset distillation addresses this by creating a small synthetic dataset that achieves the same performance as the full dataset. Recent methods use diffusion models to generate distilled data, either by promoting diversity or matching training gradients. However, existing approaches produce redundant training signals, where samples convey overlapping information. Empirically, disjoint subsets of distilled datasets capture 80-90% overlapping signals. This redundancy stems from optimizing visual diversity or average training dynamics without accounting for similarity across samples, leading to datasets where multiple samples share similar information rather than complementary knowledge. We propose learnability-driven dataset distillation, which constructs synthetic datasets incrementally through successive stages. Starting from a small set, we train a model and generate new samples guided by learnability scores that identify what the current model can learn from, creating an adaptive curriculum. We introduce Learnability-Guided Diffusion (LGD), which balances training utility for the current model with validity under a reference model to generate curriculum-aligned samples. Our approach reduces redundancy by 39.1%, promotes specialization across training stages, and achieves state-of-the-art results on ImageNet-1K (60.1%), ImageNette (87.2%), and ImageWoof (72.9%). Our code is available on our project page https://jachansantiago.github.io/learnability-guided-distillation/.

Keywords: dataset distillation, diffusion models, learnability-guided diffusion, synthetic dataset, training redundancy, curriculum learning, ImageNet-1K, computer vision

222. ❌ RT-GS: Gaussian Splatting with Reflection and Transmittance Primitives

Authors: Kunnong Zeng, Chensheng Peng, Yichen Xie, Masayoshi Tomizuka, Cem Yuksel Venue/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00509v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

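Scoring rationale: RT-GS belongs to computer vision and graphics. It improves Gaussian Splatting by integrating a microfacet material model and ray tracing to jointly model specular reflection and transmittance. This is 3D scene reconstruction and rendering work and is unrelated to all scoring keywords, which center on large models, deep learning principles, and their applications. The paper involves no language models, training methods, inference optimization, alignment techniques, agent systems, or scientific AI applications.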

!!! tip deepseek-chat TL;DR

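The paper addresses Gaussian Splatting's difficulty in simultaneously modeling specular reflections and the appearance of objects behind semi-transparent surfaces. It proposes RT-GS, which uses separate Gaussian primitives and differentiable ray tracing to achieve realistic novel view synthesis of reflective and transparent surfaces in complex environments.

Abstract (Chinese translation)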

高斯泼溅（Gaussian Splatting）是重建漫反射场景的有力工具，但在同时建模镜面反射与半透明表面后方物体的外观方面存在困难。这些镜面反射与透射现象对于逼真的新视角合成至关重要，而现有方法未能充分结合底层物理过程来模拟它们。为解决这一问题，我们提出了RT-GS，一个将微表面材质模型与光线追踪相结合的统一框架，以在高斯泼溅中联合建模镜面反射与透射。我们通过为反射和透射分别使用独立的高斯基元来实现这一目标，从而能够同时建模远距离反射并重建透明表面后的物体。我们采用可微分光线追踪框架来获取镜面反射与透射的外观表现。实验表明，我们的方法在复杂环境中成功生成了反射效果并恢复了透明表面后的物体，在镜面光交互显著的场景中，相比先前方法取得了显著的定性提升。

Abstract

Gaussian Splatting is a powerful tool for reconstructing diffuse scenes, but it struggles to simultaneously model specular reflections and the appearance of objects behind semi-transparent surfaces. These specular reflections and transmittance are essential for realistic novel view synthesis, and existing methods do not properly incorporate the underlying physical processes to simulate them. To address this issue, we propose RT-GS, a unified framework that integrates a microfacet material model and ray tracing to jointly model specular reflection and transmittance in Gaussian Splatting. We accomplish this by using separate Gaussian primitives for reflections and transmittance, which allow modeling distant reflections and reconstructing objects behind transparent surfaces concurrently. We utilize a differentiable ray tracing framework to obtain the specular reflection and transmittance appearance. Our experiments demonstrate that our method successfully produces reflections and recovers objects behind transparent surfaces in complex environments, achieving significant qualitative improvements over prior methods where these specular light interactions are prominent.

Keywords: Gaussian Splatting, specular reflection, transmittance, novel view synthesis, microfacet material model, ray tracing, differentiable rendering, 3D reconstruction
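Two building blocks that a microfacet-plus-ray-tracing pipeline commonly relies on are the mirror reflection direction and a normal distribution function. The abstract does not say which microfacet distribution RT-GS uses; the GGX (Trowbridge-Reitz) form below is assumed as a common choice, and the function names are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def reflect(direction, normal):
    """Mirror an incoming direction about a unit surface normal: r = d - 2 (d . n) n."""
    d_n = dot(direction, normal)
    return tuple(d - 2.0 * d_n * n for d, n in zip(direction, normal))


def ggx_ndf(n_dot_h, alpha):
    """GGX normal distribution: D = alpha^2 / (pi * ((n.h)^2 * (alpha^2 - 1) + 1)^2)."""
    a2 = alpha * alpha
    denom = n_dot_h * n_dot_h * (a2 - 1.0) + 1.0
    return a2 / (math.pi * denom * denom)


incoming = (1.0, -1.0, 0.0)  # ray heading down onto a floor
normal = (0.0, 1.0, 0.0)
print("reflected ray:", reflect(incoming, normal))
for alpha in (1.0, 0.5, 0.1):
    # Smaller alpha (smoother surface) concentrates the lobe around the mirror direction.
    print(alpha, ggx_ndf(1.0, alpha))
```

The reflected direction is where a ray tracer would look up distant reflections, while the distribution controls how sharply that reflection is focused.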

223. ❌ RegFormer: Transferable Relational Grounding for Efficient Weakly-Supervised Human-Object Interaction Detection

Authors: Jihwan Park, Chanhyeong Yang, Jinyoung Park, Taehoon Song, Hyunwoo J. Kim Venue/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00507v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

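Scoring rationale: RegFormer addresses human-object interaction detection in computer vision with a Transformer-based weakly supervised method. Although the scoring guidance allows discretionary credit for applications of large models in other domains, the paper's core content (vision Transformers, weakly supervised learning, human-object interaction detection) has no direct connection to the scoring keywords, which center on large language model techniques, training methods, inference optimization, and similar topics. The paper involves no large language models, innovations in deep learning principles, or scientific applications, so every keyword receives a relevance of 0.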

!!! tip deepseek-chat TL;DR

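The paper proposes RegFormer, a Transformer module that addresses the computational inefficiency and false positives of weakly supervised human-object interaction detection. It enables direct transfer from image-level reasoning to instance-level reasoning and achieves performance comparable to fully supervised models.

Abstract (Chinese translation)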

弱监督人物-物体交互检测对于可扩展的场景理解至关重要，因为它仅通过图像级标注学习交互关系。由于缺乏定位信号，先前研究通常依赖外部物体检测器生成候选对，再通过成对推理推断其交互关系。然而，该框架因枚举大量实例对产生的高计算成本而难以扩展。此外，非交互组合产生的误报会阻碍准确的实例级人物-物体交互推理。为解决这些问题，我们提出关系定位变换器——一种用于高效精准人物-物体交互推理的通用交互识别模块。在图像级监督下，该模块利用空间定位信号作为推理过程的引导，并促进局部感知的交互学习。通过学习局部化交互线索，我们的模块能够区别人物、物体及其交互关系，实现从图像级交互推理到精准高效实例级推理的直接迁移，无需额外训练。大量实验与分析表明，该模块能有效学习实例级交互推理的空间线索，以高效率运行，甚至达到与全监督模型相当的性能。代码已开源：https://github.com/mlvlab/RegFormer。

Abstract

Weakly-supervised Human-Object Interaction (HOI) detection is essential for scalable scene understanding, as it learns interactions from only image-level annotations. Due to the lack of localization signals, prior works typically rely on an external object detector to generate candidate pairs and then infer their interactions through pairwise reasoning. However, this framework often struggles to scale due to the substantial computational cost incurred by enumerating numerous instance pairs. In addition, it suffers from false positives arising from non-interactive combinations, which hinder accurate instance-level HOI reasoning. To address these issues, we introduce Relational Grounding Transformer (RegFormer), a versatile interaction recognition module for efficient and accurate HOI reasoning. Under image-level supervision, RegFormer leverages spatially grounded signals as guidance for the reasoning process and promotes locality-aware interaction learning. By learning localized interaction cues, our module distinguishes humans, objects, and their interactions, enabling direct transfer from image-level interaction reasoning to precise and efficient instance-level reasoning without additional training. Our extensive experiments and analyses demonstrate that RegFormer effectively learns spatial cues for instance-level interaction reasoning, operates with high efficiency, and even achieves performance comparable to fully supervised models. Our code is available at https://github.com/mlvlab/RegFormer.

Keywords: Weakly-supervised Human-Object Interaction detection, Relational Grounding Transformer, image-level supervision, instance-level reasoning, spatial cues, efficient HOI reasoning, transferable interaction recognition, locality-aware learning

224. ❌ PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training

Authors: Weifu Fu, Jinyang Li, Bin-Bin Gao, Jialin Li, Yuhuan Lin, Hanqiu Deng, Wenbing Tao, Yong Liu, Chengjie Wang Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00503v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

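Scoring rationale: The paper focuses on open-set object detection (OSOD) in computer vision. It proposes PET-DINO, a universal detector that supports both text and visual prompts, and designs prompt-enriched training strategies. Although it involves multiple modalities (vision and text) and training strategies, every keyword targets large language models (LLMs) and related techniques (fine-tuning, reasoning, alignment, agents, and so on), whereas this paper studies a visual object detection model (built on the DINO architecture), not a language model. All keywords are therefore unrelated to the paper's content and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses two problems in open-set object detection: the difficulty of aligning text representations with complex visual concepts, and the scarcity of image-text pairs for rare categories. It proposes PET-DINO, a universal detector supporting both text and visual prompts. With a new prompt generation module and new training strategies, it shows competitive zero-shot detection across multiple prompt-based detection protocols.

Abstract translation (Chinese)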

开放集目标检测（OSOD）能够识别超出固定类别的新颖类别，但面临文本表征与复杂视觉概念对齐困难以及稀有类别图像-文本对稀缺的挑战，导致其在专业领域或复杂物体检测中性能欠佳。现有的视觉提示方法部分解决了这些问题，但通常涉及复杂的多模态设计和多阶段优化，延长了开发周期。此外，数据驱动的OSOD模型的有效训练策略仍待深入探索。为应对这些挑战，我们提出PET-DINO——一种支持文本与视觉提示的通用检测器。我们提出的对齐友好型视觉提示生成模块（Alignment-Friendly Visual Prompt Generation, AFVPG）基于先进的文本提示检测器构建，解决了文本表征引导的局限性并缩短了开发周期。我们引入了两种提示增强训练策略：迭代层级的批内并行提示（Intra-Batch Parallel Prompting, IBP）和整体训练层级的动态记忆驱动提示（Dynamic Memory-Driven Prompting, DMD）。这些策略实现了对多提示路径的同步建模，促进了与多样化实际使用场景的并行对齐。综合实验表明，PET-DINO在各种基于提示的检测协议中展现出具有竞争力的零样本目标检测能力。这些优势可归因于基于继承的设计理念和提示增强训练策略，它们对构建有效的通用目标检测器起到了关键作用。项目页面：https://fuweifuvtoo.github.io/pet-dino。

Abstract

Open-Set Object Detection (OSOD) enables recognition of novel categories beyond fixed classes but faces challenges in aligning text representations with complex visual concepts and the scarcity of image-text pairs for rare categories. This results in suboptimal performance in specialized domains or with complex objects. Recent visual-prompted methods partially address these issues but often involve complex multi-modal designs and multi-stage optimizations, prolonging the development cycle. Additionally, effective training strategies for data-driven OSOD models remain largely unexplored. To address these challenges, we propose PET-DINO, a universal detector supporting both text and visual prompts. Our Alignment-Friendly Visual Prompt Generation (AFVPG) module builds upon an advanced text-prompted detector, addressing the limitations of text representation guidance and reducing the development cycle. We introduce two prompt-enriched training strategies: Intra-Batch Parallel Prompting (IBP) at the iteration level and Dynamic Memory-Driven Prompting (DMD) at the overall training level. These strategies enable simultaneous modeling of multiple prompt routes, facilitating parallel alignment with diverse real-world usage scenarios. Comprehensive experiments demonstrate that PET-DINO exhibits competitive zero-shot object detection capabilities across various prompt-based detection protocols. These strengths can be attributed to inheritance-based philosophy and prompt-enriched training strategies, which play a critical role in building an effective generic object detector. Project page: https://fuweifuvtoo.github.io/pet-dino.

Keywords: Open-Set Object Detection, Visual Prompt, Text Prompt, Zero-shot Detection, Prompt-enriched Training, Alignment-Friendly Visual Prompt Generation, Grounding DINO, Universal Detector

225. ❌ PC-SAM: Patch-Constrained Fine-Grained Interactive Road Segmentation in High-Resolution Remote Sensing Images

Authors: Chengcheng Lv, Rushi Li, Mincheng Wu, Xiufang Shi, Zhenyu Wen, Shibo He Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00495v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

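Scoring rationale: PC-SAM focuses on road segmentation in remote sensing images, a computer vision task. It proposes a unified framework that combines fully automatic and interactive segmentation, and designs a fine-tuning strategy for the SAM model to enable fine-grained local refinement. The work is unrelated to the vast majority of the large-model keywords (LLMs, MoE, RLHF, RAG, and so on), which mainly concern the principles, training methods, and inference optimization of large models in natural language processing. Only two keywords receive 5 points: (1) 'Post-training OR Supervised Fine-tuning OR SFT': the paper explicitly mentions 'carefully designing a fine-tuning strategy', which falls under model fine-tuning; this is related but not central (the core is the segmentation framework design, not the fine-tuning technique itself). (2) 'AI for Science OR Bioinformatics OR Cheminformatics': remote sensing image analysis can be viewed as an application of AI to science (Earth observation); this is related but is not typical bioinformatics or cheminformatics. No other keyword is touched on.

!!! tip deepseek-chat TL;DR

Targeting the challenges of road segmentation in high-resolution remote sensing images, the paper proposes PC-SAM, a unified framework that combines fully automatic and interactive segmentation through a purpose-designed fine-tuning strategy. It achieves better road mask segmentation than fully automatic models and supports flexible local mask refinement.

Abstract translation (Chinese)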

从遥感图像中获取的道路掩膜能有效支撑多种下游任务。近年来，多数研究聚焦于提升全自动分割模型在此任务上的性能，并取得了显著进展。然而，当前的全自动方法在识别某些具有挑战性的道路段时仍存在不足，常产生误报和漏报区域。此外，全自动分割不支持对感兴趣区域的局部分割或对现有掩膜的精细化修正。尽管SAM模型作为交互式分割模型被广泛应用且在自然图像上表现优异，但在遥感道路分割中效果欠佳，无法支持细粒度的局部优化。为应对这些局限，我们提出了PC-SAM，它将全自动道路分割与交互式分割整合在统一框架内。通过精心设计的微调策略，点提示的影响被约束在对应图像块内，克服了原始SAM无法进行精细局部修正的缺陷，实现了细粒度的交互式掩膜优化。在多个代表性遥感道路分割数据集上的大量实验表明，结合点提示使用时，PC-SAM在道路掩膜分割上显著优于当前最先进的全自动模型，同时提供了灵活的局部掩膜优化与局部道路分割功能。代码将在https://github.com/Cyber-CCOrange/PC-SAM公开。

Abstract

Road masks obtained from remote sensing images effectively support a wide range of downstream tasks. In recent years, most studies have focused on improving the performance of fully automatic segmentation models for this task, achieving significant gains. However, current fully automatic methods are still insufficient for identifying certain challenging road segments and often produce false positive and false negative regions. Moreover, fully automatic segmentation does not support local segmentation of regions of interest or refinement of existing masks. Although the SAM model is widely used as an interactive segmentation model and performs well on natural images, it shows poor performance in remote sensing road segmentation and cannot support fine-grained local refinement. To address these limitations, we propose PC-SAM, which integrates fully automatic road segmentation and interactive segmentation within a unified framework. By carefully designing a fine-tuning strategy, the influence of point prompts is constrained to their corresponding patches, overcoming the inability of the original SAM to perform fine local corrections and enabling fine-grained interactive mask refinement. Extensive experiments on several representative remote sensing road segmentation datasets demonstrate that, when combined with point prompts, PC-SAM significantly outperforms state-of-the-art fully automatic models in road mask segmentation, while also providing flexible local mask refinement and local road segmentation. The code will be available at https://github.com/Cyber-CCOrange/PC-SAM.

Keywords: road segmentation, remote sensing images, interactive segmentation, SAM model, fine-tuning, patch-constrained, mask refinement, high-resolution

226. ❌ ARGS: Auto-Regressive Gaussian Splatting via Parallel Progressive Next-Scale Prediction

Authors: Quanyuan Ruan, Kewei Shi, Jiabao Lei, Xifeng Gao, Xiaoguang Han Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00494v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

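Scoring rationale: ARGS focuses on 3D object generation, a computer vision task, and proposes an auto-regressive framework based on Gaussian splatting for parallel, progressive multi-scale prediction. Although it involves an auto-regressive framework, a transformer architecture, and hierarchical generation, its core is 3D geometric representation and generation, not innovation in large language models or in the underlying principles of deep learning. All scoring keywords relate to topics such as large language models, alignment, reasoning, agents, and AI applications in science, and have no direct connection to this paper's 3D visual generation research, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes the auto-regressive Gaussian splatting (ARGS) framework, which addresses controllable multi-scale detail prediction in 3D object generation and generates multi-scale Gaussian representations efficiently and with visual fidelity.

Abstract translation (Chinese)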

用于二维图像下一尺度预测的自回归框架已展现出通过逐步细化粗糙输入来生成多样且复杂内容的强大潜力。然而，将这一范式扩展到三维物体生成领域仍基本处于未探索状态。本文提出了自回归高斯泼溅（ARGS, auto-regressive Gaussian splatting）框架，该框架能够根据细节层级并行进行下一尺度预测以完成生成。我们提出了一种高斯简化策略，并通过反转该简化过程来指导下一尺度的生成。得益于层次化树结构的使用，生成过程仅需 $\mathcal{O}(\log n)$ 步，其中 $n$ 为点数。此外，我们提出了一种基于树的变换器（tree-based transformer），以自回归方式预测树结构，允许叶节点关注其内部祖先节点，从而增强结构一致性。大量实验表明，我们的方法能够有效生成具有可控细节层级、视觉保真度及可管理时间消耗的多尺度高斯表示。

Abstract

Auto-regressive frameworks for next-scale prediction of 2D images have demonstrated strong potential for producing diverse and sophisticated content by progressively refining a coarse input. However, extending this paradigm to 3D object generation remains largely unexplored. In this paper, we introduce auto-regressive Gaussian splatting (ARGS), a framework for making next-scale predictions in parallel for generation according to levels of detail. We propose a Gaussian simplification strategy and reverse the simplification to guide next-scale generation. Benefiting from the use of hierarchical trees, the generation process requires only $\mathcal{O}(\log n)$ steps, where $n$ is the number of points. Furthermore, we propose a tree-based transformer to predict the tree structure auto-regressively, allowing leaf nodes to attend to their internal ancestors to enhance structural consistency. Extensive experiments demonstrate that our approach effectively generates multi-scale Gaussian representations with controllable levels of detail, visual fidelity, and a manageable time consumption budget.
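The logarithmic step count claimed in the abstract can be illustrated with a small sketch. It assumes a balanced hierarchy in which every next-scale step expands each node into a fixed number of children (two by default). The branching factor and the helper name `next_scale_steps` are hypothetical, since the abstract does not specify how ARGS actually builds its tree.

```python
def next_scale_steps(n_points: int, branching: int = 2) -> int:
    """Count coarse-to-fine steps needed to grow from 1 node to at least n_points leaves."""
    if n_points < 1:
        raise ValueError("n_points must be positive")
    if branching < 2:
        raise ValueError("branching must be at least 2")
    steps, nodes = 0, 1
    while nodes < n_points:
        nodes *= branching  # assumption: all nodes at the current scale expand in parallel
        steps += 1
    return steps


for n in (1, 1_000, 1_000_000):
    print(n, next_scale_steps(n))
```

Under this assumption, growing the point count a thousandfold (from 1,000 to 1,000,000) only doubles the number of steps, from 10 to 20.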

Keywords: Auto-regressive Gaussian splatting, 3D object generation, Next-scale prediction, Hierarchical trees, Tree-based transformer, Multi-scale Gaussian representations, Controllable levels of detail, Parallel progressive generation

227. ❌ All Roads Lead to Rome: Incentivizing Divergent Thinking in Vision-Language Models

Authors: Xinyu Tian, Shu Zou, Zhaoyuan Yang, Mengqi He, Peter Tu, Jing Zhang Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00479v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

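Scoring rationale: The paper studies how reinforcement learning (notably GRPO) affects reasoning in Vision-Language Models (VLMs), finding that RL models tend toward deep but narrow reasoning, whereas base models exhibit broader and more diverse thinking patterns. It mainly concerns the analysis and improvement of reasoning mechanisms (Chain of Thought / System 2 Thinking) and does not involve the specific techniques behind the other keywords, such as LLMs, MoE, SFT, or RAG. Therefore only the two reasoning-related keywords (Chain of Thought and System 2 Thinking) are rated highly relevant (10 points); all other keywords are unrelated (0 points).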
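The table arithmetic can be sketched to show how a relevance of 10.0/10 can still produce a score of 0.0. This is a minimal sketch, not the report generator's code, which this document does not show. It assumes the per-keyword score is weight × (relevance / 10) × the per-keyword maximum of 15; the pass threshold of 5.0 + 0.8 × total weight is taken from the report configuration. Function names are hypothetical.

```python
MAX_PER_KEYWORD = 15.0  # per-keyword cap stated in the report configuration


def keyword_score(weight: float, relevance: float) -> float:
    """Assumed formula: weight x (relevance / 10) x per-keyword cap."""
    return weight * (relevance / 10.0) * MAX_PER_KEYWORD


def pass_threshold(total_weight: float) -> float:
    """Pass-threshold formula from the report configuration."""
    return 5.0 + 0.8 * total_weight


# (weight, relevance) as listed for the two reasoning keywords in the table above.
rows = [(0.0, 10.0), (0.0, 10.0)]
total = sum(keyword_score(w, r) for w, r in rows)
threshold = pass_threshold(27.0)  # 27 keywords with a configured total weight of 27.0
print(total, round(threshold, 1))
```

With a listed weight of 0.0, the product is zero whatever the relevance, which matches the 0.0 total shown for this entry.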

!!! tip deepseek-chat TL;DR

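The paper studies the collapse of reasoning diversity that reinforcement learning (notably GRPO) causes in Vision-Language Models, and proposes Multi-Group Policy Optimization (MUPO) to incentivize divergent thinking, demonstrating its effectiveness on benchmarks.

Abstract translation (Chinese)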

近期研究表明，强化学习（Reinforcement Learning, RL），尤其是群体相对策略优化（Group Relative Policy Optimization, GRPO），能够从本质上激发并增强视觉语言模型（Vision-Language Models, VLMs）的推理能力。然而，尽管前景广阔，驱动强化学习模型有效性的内在机制及其局限性仍未得到充分探索。本文指出，强化学习模型与基础模型之间存在根本性的行为差异：前者倾向于进行深入但狭窄的推理，而基础模型尽管在单一路径上的推理不够精细，却展现出更广泛、更多元的思维模式。通过对训练动态的进一步分析，我们发现GRPO容易发生多样性崩溃，导致模型过早收敛于有限的推理策略子集，同时丢弃了大部分潜在的替代方案，从而陷入局部最优并导致可扩展性差。为解决这一问题，我们提出了多群体策略优化（Multi-Group Policy Optimization, MUPO），这是一种简单而有效的方法，旨在激励模型针对多个解决方案进行发散性思考，并在多个基准测试中验证了其有效性。项目页面：https://xytian1008.github.io/MUPO/

Abstract

Recent studies have demonstrated that Reinforcement Learning (RL), notably Group Relative Policy Optimization (GRPO), can intrinsically elicit and enhance the reasoning capabilities of Vision-Language Models (VLMs). However, despite the promise, the underlying mechanisms that drive the effectiveness of RL models as well as their limitations remain underexplored. In this paper, we highlight a fundamental behavioral distinction between RL and base models, where the former engages in deeper yet narrow reasoning, while base models, despite less refined along individual path, exhibit broader and more diverse thinking patterns. Through further analysis of training dynamics, we show that GRPO is prone to diversity collapse, causing models to prematurely converge to a limited subset of reasoning strategies while discarding the majority of potential alternatives, leading to local optima and poor scalability. To address this, we propose Multi-Group Policy Optimization (MUPO), a simple yet effective approach designed to incentivize divergent thinking across multiple solutions, and demonstrate its effectiveness on established benchmarks. Project page: https://xytian1008.github.io/MUPO/
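For context, GRPO scores each sampled response relative to the other responses drawn for the same prompt. The sketch below shows only that group-relative advantage normalization, in its commonly described form (reward minus group mean, divided by group standard deviation). It is not MUPO, whose details the abstract does not give; the function name and the `eps` constant are illustrative.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same prompt."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # population standard deviation of the group
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one prompt; 1.0 marks a correct answer.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Responses above the group mean receive positive advantages and the rest receive negative ones, so the update signal depends only on how a response compares with its own group.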

Keywords: Vision-Language Models, Reinforcement Learning, Group Relative Policy Optimization, reasoning diversity, divergent thinking, Multi-Group Policy Optimization, training dynamics, benchmark evaluation

228. ❌ Learning and Generating Mixed States Prepared by Shallow Channel Circuits

Authors: Fangjun Hu, Christian Kokail, Milan Kornjača, Pedro L. S. Lopes, Weiyuan Gong, Sheng-Tao Wang, Xun Gao, Stefan Ostermann Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01197v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

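Scoring rationale: The paper studies the learning and generation of quantum states in quantum information, focusing on mixed states prepared by shallow channel circuits; it belongs to quantum computing. All keywords target large language models (LLMs) and deep learning techniques, covering specific directions such as model architecture, training methods, inference optimization, alignment, and applications. The paper does not involve LLMs, deep learning, or conventional AI techniques at all; only its final sentence mentions classical diffusion models as an analogy, which is not the core of the research. Therefore every keyword scores 0 except 'AI for Science', which receives 5 points because the paper belongs to quantum computing (which can be viewed as an AI-for-science application).

!!! tip deepseek-chat TL;DR

The paper studies how to efficiently learn, from measurement data, quantum mixed states prepared by shallow channel circuits. It proves that any mixed state in the trivial phase can be learned and approximately generated with polynomial (or quasi-polynomial) sample complexity and runtime, assuming constant (or polylogarithmic) circuit depth and gate locality. This provides a theoretical foundation for quantum generative models based on shallow channel circuits.

Abstract translation (Chinese)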

从测量数据中学习量子态是量子信息与计算复杂性领域的核心问题。本研究探讨了在有限维格点上学习生成混合态的问题。受近期物质混合态相研究进展的启发，我们聚焦于平凡相中的任意态。若存在一个浅层制备信道电路，使得在整个制备过程中保持局域可逆性，则该态属于平凡相。我们证明，仅通过测量访问即可高效学习此类中的任意混合态。具体而言，给定未知平凡相混合态的若干副本，我们的算法能够输出一个浅层局域信道电路，该电路能以迹距离近似生成该态。在电路深度与门局域性为常数（或多对数）的假设下，样本复杂度与运行时间在量子比特数上呈多项式（或拟多项式）增长。重要的是，学习者并未获知原始制备电路，仅依赖其存在性。我们的结果为基于浅层信道电路的量子生成模型提供了结构基础。在经典极限下，该框架也启发了一种高效经典扩散模型算法，其仅需多项式开销的训练与生成成本。

Abstract

Learning quantum states from measurement data is a central problem in quantum information and computational complexity. In this work, we study the problem of learning to generate mixed states on a finite-dimensional lattice. Motivated by recent developments in mixed state phases of matter, we focus on arbitrary states in the trivial phase. A state belongs to the trivial phase if there exists a shallow preparation channel circuit under which local reversibility is preserved throughout the preparation. We prove that any mixed state in this class can be efficiently learned from measurement access alone. Specifically, given copies of an unknown trivial phase mixed state, our algorithm outputs a shallow local channel circuit that approximately generates this state in trace distance. The sample complexity and runtime are polynomial (or quasi-polynomial) in the number of qubits, assuming constant (or polylogarithmic) circuit depth and gate locality. Importantly, the learner is not given the original preparation circuit and relies only on its existence. Our results provide a structural foundation for quantum generative models based on shallow channel circuits. In the classical limit, our framework also inspires an efficient algorithm for classical diffusion models using only a polynomial overhead of training and generation.
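The abstract measures approximation quality in trace distance. As a reminder of what that metric computes, here is a minimal NumPy sketch of the standard definition $T(\rho, \sigma) = \tfrac{1}{2}\lVert \rho - \sigma \rVert_1$, evaluated through the eigenvalues of the Hermitian difference. It illustrates the metric only, not the paper's learning algorithm; the function and variable names are illustrative.

```python
import numpy as np


def trace_distance(rho: np.ndarray, sigma: np.ndarray) -> float:
    """T(rho, sigma) = 0.5 * ||rho - sigma||_1 for two density matrices."""
    diff = rho - sigma
    # The difference of two Hermitian matrices is Hermitian, so eigvalsh applies;
    # its trace norm is the sum of the absolute eigenvalues.
    eigenvalues = np.linalg.eigvalsh(diff)
    return 0.5 * float(np.sum(np.abs(eigenvalues)))


pure_zero = np.array([[1.0, 0.0], [0.0, 0.0]])  # |0><0|
pure_one = np.array([[0.0, 0.0], [0.0, 1.0]])   # |1><1|
maximally_mixed = np.eye(2) / 2.0

print(trace_distance(pure_zero, maximally_mixed))
print(trace_distance(pure_zero, pure_one))
```

Orthogonal pure states sit at the maximum distance of 1, while a pure single-qubit state is at distance 0.5 from the maximally mixed state.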

Keywords: quantum state learning, mixed states, shallow channel circuits, trivial phase, quantum generative models, sample complexity, trace distance, diffusion models

229. ❌ Automated Detection of Multiple Sclerosis Lesions on 7-tesla MRI Using U-net and Transformer-based Segmentation

Authors: Michael Maynord, Minghui Liu, Cornelia Fermüller, Seongjin Choi, Yuxin Zeng, Shishir Dahal, Daniel M. Harrison Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00469v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

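Scoring rationale: The paper focuses on medical image segmentation, using Transformer-based models (UNETR and SegFormer) to detect multiple sclerosis lesions. All keywords relate to large language models (LLMs) or general deep learning principles, but the paper does not address LLMs, MoE, scaling laws, training methods, alignment, inference optimization, agent systems, model compression, or similar topics. The only related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper belongs to biomedical AI applications; this is not its core contribution, so it receives 5 points (some relevance). All other keywords are unrelated and score 0.

!!! tip deepseek-chat TL;DR

The study addresses the challenge of automatically segmenting multiple sclerosis lesions on 7T MRI. By training Transformer-based models on native high-resolution data, it outperforms the classical LST-LPA tool (voxel-wise Dice 0.61 versus 0.39), particularly for small-lesion detection.

Abstract translation (Chinese)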

超高场强7特斯拉（7T）磁共振成像提升了多发性硬化（MS）白质病变（WML）的可视化效果，但其对比度与伪影特征与1.5-3T成像存在显著差异，这表明广泛使用的自动化分割工具可能无法直接迁移应用。我们分析了7T FLAIR序列扫描数据，并基于病变分割工具（LST）的输出结果生成参考WML掩膜，再经专家手动修正。作为外部对照，我们应用了最初基于低场强数据开发的LST-LPA算法以及较新的LST-AI集成模型。随后，我们在多种分辨率（0.5×0.5×0.5 mm³、1.0×1.0×1.0 mm³及1.5×1.5×2.0 mm³）的7T FLAIR数据上训练了基于Transformer架构的3D UNETR和SegFormer模型，并采用BraTS 2023框架的体素级与病变级指标对所有方法进行评估。在原始0.5×0.5×0.5 mm³分辨率的保留测试集上，基于7T数据训练的Transformer模型与LST-AI取得了相当的吻合度，同时能检测出传统方法遗漏的额外小病灶，但其代价是存在一定的边界变异性和偶发的伪影相关误检。在独立的7T测试集上，我们最佳的Transformer模型（SegFormer）取得了体素级Dice系数0.61和病变级Dice系数0.20，优于传统LST-LPA工具（Dice系数0.39，病变级Dice系数0.02）。在降采样图像上训练的模型性能有所下降，这凸显了原始7T分辨率对于小病灶检测的价值。通过开源我们基于7T训练的模型（https://github.com/maynord/7T-MS-lesion-segmentation），我们旨在为超高场强MS研究中的自动化病变量化提供一个可复现、即用型工具资源。

Abstract

Ultra-high field 7-tesla (7T) MRI improves visualization of multiple sclerosis (MS) white matter lesions (WML) but differs sufficiently in contrast and artifacts from 1.5-3T imaging, suggesting that widely used automated segmentation tools may not translate directly. We analyzed 7T FLAIR scans and generated reference WML masks from Lesion Segmentation Tool (LST) outputs followed by expert manual revision. As external comparators, we applied LST-LPA and the more recent LST-AI ensemble, both originally developed on lower-field data. We then trained 3D UNETR and SegFormer transformer-based models on 7T FLAIR at multiple resolutions (0.5×0.5×0.5 mm$^3$, 1.0×1.0×1.0 mm$^3$, and 1.5×1.5×2.0 mm$^3$) and evaluated all methods using voxel-wise and lesion-wise metrics from the BraTS 2023 framework. On the held-out test set at native 0.5×0.5×0.5 mm$^3$ resolution, 7T-trained transformers achieved competitive overlap with LST-AI while recovering additional small lesions that were missed by classical methods, at the cost of some boundary variability and occasional artifact-related false positives. On a held-out 7T test set, our best transformer model (SegFormer) achieved a voxel-wise Dice of 0.61 and lesion-wise Dice of 0.20, improving on the classical LST-LPA tool (Dice 0.39, lesion-wise Dice 0.02). Performance decreased for models trained on downsampled images, underscoring the value of native 7T resolution for small-lesion detection. By releasing our 7T-trained models, we aim to provide a reproducible, ready-to-use resource for automated lesion quantification in ultra-high field MS research (https://github.com/maynord/7T-MS-lesion-segmentation).
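The voxel-wise Dice values quoted above follow the standard overlap formula $\mathrm{Dice} = 2\,|P \cap T| / (|P| + |T|)$. The sketch below computes it on flattened binary masks. It covers only the voxel-wise overlap, not the lesion-wise variant from the BraTS 2023 framework, and returning 1.0 when both masks are empty is a convention assumed here, not something the abstract states.

```python
def dice_coefficient(pred, truth):
    """Dice = 2 * |pred AND truth| / (|pred| + |truth|) over binary voxel labels."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    overlap = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks agree perfectly.
    return 1.0 if total == 0 else 2.0 * overlap / total


# Flattened toy masks: 1 marks a lesion voxel.
pred = [1, 1, 0, 0, 1, 0]
truth = [1, 0, 0, 0, 1, 1]
print(dice_coefficient(pred, truth))
```

Here two of the three predicted voxels overlap with two of the three reference voxels, giving 4/6 ≈ 0.67.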

Keywords: multiple sclerosis, lesion segmentation, 7T MRI, transformer, UNETR, SegFormer, automated detection, medical imaging

230. ❌ NeuroDDAF: Neural Dynamic Diffusion-Advection Fields with Evidential Fusion for Air Quality Forecasting

Authors: Prasanjit Dey, Soumyabrata Dev, Angela Meyer, Bianca Schoen-Phelan Journal/Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01175v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

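Scoring rationale: The paper focuses on air quality forecasting and proposes NeuroDDAF, a hybrid framework that combines physical models with neural networks. Its core topics are spatiotemporal forecasting, physics-informed machine learning, graph neural networks, Neural ODEs, and uncertainty quantification. All keywords relate directly to large language models (LLMs), model training and alignment techniques, inference optimization, agent systems, and so on, none of which this paper addresses. The only slightly related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', since air quality forecasting is an AI application in environmental science; however, the paper does not use these terms explicitly and does not go into bioinformatics or cheminformatics, so it receives 5 points (some relevance). All other keywords are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes NeuroDDAF, a physics-informed neural network framework for handling nonlinear spatiotemporal dynamics and uncertainty quantification in air quality forecasting. On four urban datasets it achieves higher forecasting accuracy than baseline models, along with better generalization.

Abstract translation (Chinese)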

精准的空气质量预测对于保护公众健康及指导环境政策至关重要，但由于非线性时空动态、风驱输送以及区域间的分布偏移，该任务仍面临挑战。基于物理的模型具有可解释性，但计算成本高昂且常依赖限制性假设；而纯数据驱动模型虽可能达到较高精度，却往往缺乏鲁棒性与校准后的不确定性量化能力。为应对这些局限，我们提出神经动态扩散-平流场（NeuroDDAF），一种融合物理知识的预测框架，将神经表示学习与开放系统输送建模相统一。NeuroDDAF整合了以下组件：（i）GRU-图注意力编码器，用于捕捉时间动态与风场感知的空间交互；（ii）带有可学习残差的傅里叶域扩散-平流模块；（iii）风场调制的潜变量神经常微分方程，用于建模时变连接性下的连续时间演化过程；（iv）证据融合机制，可自适应结合物理引导预测与神经预测，同时量化不确定性。在四个城市数据集（北京、深圳、天津及安科纳）上进行的1至3天预测实验表明，NeuroDDAF在包括AirPhyNet在内的强基线模型上均取得更优性能，长期预测的均方根误差（RMSE）和平均绝对误差（MAE）分别降低达9.7%与9.4%。在北京数据集上，NeuroDDAF在1天预测中达到41.63微克/立方米的RMSE，在3天预测中达到48.88微克/立方米的RMSE，代表了所有对比方法中的最佳性能。此外，NeuroDDAF提升了跨城市泛化能力，并产生校准良好的不确定性估计，集成方差分析及不同风况下的案例研究证实了这一点。

Abstract

Accurate air quality forecasting is crucial for protecting public health and guiding environmental policy, yet it remains challenging due to nonlinear spatiotemporal dynamics, wind-driven transport, and distribution shifts across regions. Physics-based models are interpretable but computationally expensive and often rely on restrictive assumptions, whereas purely data-driven models can be accurate but may lack robustness and calibrated uncertainty. To address these limitations, we propose Neural Dynamic Diffusion-Advection Fields (NeuroDDAF), a physics-informed forecasting framework that unifies neural representation learning with open-system transport modeling. NeuroDDAF integrates (i) a GRU-Graph Attention encoder to capture temporal dynamics and wind-aware spatial interactions, (ii) a Fourier-domain diffusion-advection module with learnable residuals, (iii) a wind-modulated latent Neural ODE to model continuous-time evolution under time-varying connectivity, and (iv) an evidential fusion mechanism that adaptively combines physics-guided and neural forecasts while quantifying uncertainty. Experiments on four urban datasets (Beijing, Shenzhen, Tianjin, and Ancona) across 1-3 day horizons show that NeuroDDAF consistently outperforms strong baselines, including AirPhyNet, achieving up to 9.7% reduction in RMSE and 9.4% reduction in MAE on long-term forecasts. On the Beijing dataset, NeuroDDAF attains an RMSE of 41.63 $μ$g/m$^3$ for 1-day prediction and 48.88 $μ$g/m$^3$ for 3-day prediction, representing the best performance among all compared methods. In addition, NeuroDDAF improves cross-city generalization and yields well-calibrated uncertainty estimates, as confirmed by ensemble variance analysis and case studies under varying wind conditions.
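Component (ii) works in the Fourier domain, where a constant-coefficient diffusion-advection equation has a closed-form update per mode. The sketch below is a simplified 1D periodic illustration of that idea, assuming a constant wind speed `v` and diffusivity `D`. It omits the learnable residuals and is not NeuroDDAF's implementation; all names and parameter values are illustrative.

```python
import numpy as np


def diffusion_advection_step(u0, v, D, t, length):
    """Evolve u_t + v * u_x = D * u_xx on a periodic 1D grid for time t."""
    n = u0.shape[0]
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=length / n)  # angular wavenumbers
    u_hat = np.fft.fft(u0)
    # Each Fourier mode decays by exp(-D k^2 t) and is phase-shifted by exp(-i v k t).
    u_hat_t = u_hat * np.exp((-1j * v * k - D * k**2) * t)
    return np.real(np.fft.ifft(u_hat_t))


n, length = 64, 64.0
x = np.arange(n) * (length / n)
u0 = np.exp(-0.5 * ((x - 20.0) / 3.0) ** 2)  # a pollutant puff centered at x = 20

advected = diffusion_advection_step(u0, v=1.0, D=0.0, t=5.0, length=length)
diffused = diffusion_advection_step(u0, v=1.0, D=0.5, t=5.0, length=length)
print(int(np.argmax(advected)), bool(diffused.max() < u0.max()))
```

With `D = 0` the puff is simply transported five grid cells downwind; with `D > 0` it also spreads and its peak drops, while the total mass stays the same because the zero-wavenumber mode is left unchanged.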

关键词: air quality forecasting, physics-informed neural networks, spatiotemporal dynamics, graph attention networks, neural ODE, evidential fusion, uncertainty quantification, diffusion-advection modeling
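
The Fourier-domain diffusion-advection idea in component (ii) can be illustrated with a minimal sketch. It assumes a 1-D periodic domain with constant diffusivity and wind speed, and it omits the learnable residuals, the graph structure, and everything else specific to NeuroDDAF. The function names (`dft`, `idft`, `spectral_step`) are hypothetical. For the equation u_t + c u_x = D u_xx, each Fourier mode with wavenumber k is simply multiplied by exp(-(D k² + i c k) Δt).

```python
import cmath
import math


def dft(values):
    """Naive O(n^2) discrete Fourier transform (adequate for small grids)."""
    n = len(values)
    return [
        sum(values[j] * cmath.exp(-2j * math.pi * k * j / n) for j in range(n))
        for k in range(n)
    ]


def idft(coeffs):
    """Inverse of dft above."""
    n = len(coeffs)
    return [
        sum(coeffs[k] * cmath.exp(2j * math.pi * k * j / n) for k in range(n)) / n
        for j in range(n)
    ]


def spectral_step(u, diffusivity, wind_speed, dt, length=1.0):
    """Advance u_t + c u_x = D u_xx by dt on a periodic grid, mode by mode."""
    n = len(u)
    u_hat = dft(u)
    evolved = []
    for k in range(n):
        m = k if k <= n // 2 else k - n  # signed mode index
        wave = 2.0 * math.pi * m / length
        factor = cmath.exp(-(diffusivity * wave ** 2 + 1j * wind_speed * wave) * dt)
        evolved.append(u_hat[k] * factor)
    return [z.real for z in idft(evolved)]


# Hypothetical pollutant profile on 8 grid cells.
profile = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0]
print(spectral_step(profile, diffusivity=0.01, wind_speed=1.0, dt=0.1))
```

With zero diffusivity the step is a pure shift of the profile downwind, and with any diffusivity the total mass (the zero-frequency mode) is unchanged, which gives two quick sanity checks for this kind of module.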

231. ❌ Safe learning-based control via function-based uncertainty quantification

作者: Abdullah Tokmak, Toni Karvonen, Thomas B. Schön, Dominik Baumann 期刊/来源: arxiv 发布日期: 2026-04-01 arXiv链接: http://arxiv.org/abs/2604.01173v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究的是基于学习的控制方法中的不确定性量化问题，采用场景方法构建不确定性管并应用于安全贝叶斯优化，在Furuta摆上进行实验验证。所有关键词均与大语言模型、深度学习技术原理、AI科学应用等主题相关，而本文专注于控制理论、不确定性量化和安全优化，未涉及任何大模型或深度学习技术，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于场景方法的不确定性量化技术，用于构建学习型控制系统中未知函数的高概率不确定性管，并将其集成到安全贝叶斯优化算法中，成功应用于Furuta摆的控制参数安全调优。

摘要翻译

在安全关键系统中部署基于学习的控制方法时，不确定性量化至关重要。通常，这通过构建不确定性管道来实现，该管道能以高概率包含目标未知函数（例如奖励函数、约束函数或底层动力学模型）。然而，现有的不确定性量化方法通常依赖于对未知函数的限制性假设，例如函数范数或利普希茨常数的已知界限，且难以处理不连续性。在本文中，我们将未知函数建模为随机函数，可从中生成独立同分布的实现样本，并基于场景方法构建不确定性管道。该管道仅依赖于采样实现，且以高概率成立。我们将这些不确定性管道集成到一种安全的贝叶斯优化算法中，并利用该算法在真实的Furuta摆（Furuta pendulum）上安全地调整控制参数。

摘要 (Abstract)

Uncertainty quantification is essential when deploying learning-based control methods in safety-critical systems. This is commonly realized by constructing uncertainty tubes that enclose the unknown function of interest, e.g., the reward and constraint functions or the underlying dynamics model, with high probability. However, existing approaches for uncertainty quantification typically rely on restrictive assumptions on the unknown function, such as known bounds on functional norms or Lipschitz constants, and struggle with discontinuities. In this paper, we model the unknown function as a random function from which independent and identically distributed realizations can be generated, and construct uncertainty tubes via the scenario approach that hold with high probability and rely solely on the sampled realizations. We integrate these uncertainty tubes into a safe Bayesian optimization algorithm, which we then use to safely tune control parameters on a real Furuta pendulum.

关键词: uncertainty quantification, learning-based control, scenario approach, safe Bayesian optimization, Furuta pendulum, safety-critical systems, random function modeling
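
A sample-based uncertainty tube can be sketched as follows. This is a deliberately simplified illustration, not the paper's construction: it takes the pointwise min/max envelope of i.i.d. realizations and does not reproduce the scenario-approach probability guarantee. The random function in `sample_realization` (a sine with random amplitude plus a jump at a random location) is a hypothetical example chosen to include a discontinuity.

```python
import math
import random


def sample_realization(rng, grid):
    """One draw of a hypothetical random function with a random jump."""
    amplitude = rng.uniform(0.5, 1.5)
    jump_at = rng.uniform(0.4, 0.6)
    return [amplitude * math.sin(x) + (1.0 if x > jump_at else 0.0) for x in grid]


def envelope_tube(realizations):
    """Pointwise min/max envelope over the sampled realizations."""
    lower = [min(values) for values in zip(*realizations)]
    upper = [max(values) for values in zip(*realizations)]
    return lower, upper


rng = random.Random(0)
grid = [i / 20 for i in range(21)]
samples = [sample_realization(rng, grid) for _ in range(200)]
lower, upper = envelope_tube(samples)

# Check how much of a fresh, unseen realization the tube covers.
fresh = sample_realization(rng, grid)
inside = sum(lo <= y <= hi for lo, y, hi in zip(lower, fresh, upper))
print(f"{inside}/{len(grid)} grid points of a fresh draw fall inside the tube")
```

The envelope uses only the sampled realizations, so it needs no norm or Lipschitz bound and is unaffected by the jump. How many samples are required for a given confidence level is exactly what the scenario approach quantifies and is left out here.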

232. ❌ Bridging the Simulation-to-Experiment Gap with Generative Models using Adversarial Distribution Alignment

作者: Kai Nelson, Tobias Kreiman, Sergey Levine, Aditi S. Krishnapriyan 期刊/来源: arxiv 发布日期: 2026-04-01 arXiv链接: http://arxiv.org/abs/2604.01169v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 该论文提出了一种名为Adversarial Distribution Alignment (ADA)的方法，旨在通过生成模型弥合模拟与实验数据之间的分布差距。论文的核心是生成模型在科学领域的应用，特别是物理科学和生物信息学（蛋白质数据）。因此，它与‘AI for Science OR Bioinformatics OR Cheminformatics’高度相关（10分）。论文提到‘pre-training a generative model’，这与‘Pre-training OR Continual Pre-training OR Domain Adaptation’有一定关联（5分），因为它涉及在模拟数据上预训练生成模型，然后与实验数据对齐，这类似于领域适应。其他关键词主要涉及大语言模型（LLMs）的特定技术、训练方法、推理、代理、优化等，而本文专注于通用的生成模型（非LLM）在科学数据对齐中的应用，因此这些关键词得分为0分。

!!! tip deepseek-chat TL;DR

该研究解决了科学中模拟与实验数据之间的分布差距问题，提出了一种对抗性分布对齐方法，通过在模拟数据上预训练生成模型并与实验观测对齐，成功应用于分子和蛋白质数据验证。

摘要翻译

科学与工程领域的一个根本性挑战在于仿真与实验之间的差距。尽管我们通常掌握物理定律的先验知识，但这些物理定律对于复杂系统往往难以精确求解。此类系统通常通过仿真器进行建模，而仿真器会引入计算近似。与此同时，实验测量能更真实地反映现实世界，但实验数据通常仅包含部分反映系统完整潜在状态的观测结果。我们提出一种数据驱动的分布对齐框架，通过在完全观测（但不完美）的仿真数据上预训练生成模型，再将其与实验数据中部分（但真实）的观测结果进行对齐，从而弥合这一仿真-实验差距。虽然我们的方法具有领域无关性，但我们通过引入对抗性分布对齐（Adversarial Distribution Alignment, ADA）方法，将研究立足于物理科学领域。该方法将原子位置的生成模型——初始训练于模拟的玻尔兹曼分布——与实验观测的分布进行对齐。我们证明，即使存在多个潜在相关的可观测量，我们的方法仍能恢复目标可观测分布。我们还在合成数据、分子数据及实验蛋白质数据上对框架进行了实证验证，结果表明该方法能使生成模型与多样化的可观测量实现对齐。代码发布于 https://kaityrusnelson.com/ada/。

摘要 (Abstract)

A fundamental challenge in science and engineering is the simulation-to-experiment gap. While we often possess prior knowledge of physical laws, these physical laws can be too difficult to solve exactly for complex systems. Such systems are commonly modeled using simulators, which impose computational approximations. Meanwhile, experimental measurements more faithfully represent the real world, but experimental data typically consists of observations that only partially reflect the system’s full underlying state. We propose a data-driven distribution alignment framework that bridges this simulation-to-experiment gap by pre-training a generative model on fully observed (but imperfect) simulation data, then aligning it with partial (but real) observations of experimental data. While our method is domain-agnostic, we ground our approach in the physical sciences by introducing Adversarial Distribution Alignment (ADA). This method aligns a generative model of atomic positions – initially trained on a simulated Boltzmann distribution – with the distribution of experimental observations. We prove that our method recovers the target observable distribution, even with multiple, potentially correlated observables. We also empirically validate our framework on synthetic, molecular, and experimental protein data, demonstrating that it can align generative models with diverse observables. Our code is available at https://kaityrusnelson.com/ada/.

关键词: generative models, simulation-to-experiment gap, adversarial distribution alignment, domain-agnostic, physical sciences, Boltzmann distribution, molecular data, protein data

233. ❌ Reasoning Shift: How Context Silently Shortens LLM Reasoning

作者: Gleb Rodionov 期刊/来源: arxiv 发布日期: 2026-04-01 arXiv链接: http://arxiv.org/abs/2604.01161v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLMs在复杂上下文环境中的推理行为变化，与‘Large Language Models’高度相关（10分），直接研究‘Chain of Thought’推理（10分）和‘System 2 Thinking’（8分），涉及‘Self-Correction’行为（8分）。论文探讨上下文管理问题，与‘Context Window Extension’（5分）、‘LLM Agents’（5分）和‘In-context Learning’（5分）有一定关联。其他关键词如MoE、量化、对齐等未在研究中涉及，得0分。

!!! tip deepseek-chat TL;DR

该研究发现LLMs在不同上下文条件下（如冗长无关信息、多轮对话、复杂任务子问题）会产生更短的推理轨迹，并伴随自我验证行为减少，这可能影响复杂任务的性能。

摘要翻译

展现出测试时扩展行为的大型语言模型（LLMs），例如生成长推理链和自我验证，在复杂、长期的推理任务中已表现出卓越性能。然而，这些推理行为的鲁棒性仍未得到充分探究。为此，我们对多种推理模型在三种场景下进行了系统性评估：（1）添加了冗长无关上下文的问题；（2）包含独立任务的多轮对话场景；（3）作为复杂任务中子任务呈现的问题。我们观察到一个有趣的现象：与问题单独呈现时生成的推理链相比，在不同上下文条件下，推理模型针对同一问题倾向于生成短得多的推理链（缩短幅度可达50%）。更细粒度的分析表明，这种压缩与自我验证及不确定性管理行为（如双重检查）的减少相关。虽然这种行为转变不会影响简单问题的解决性能，但它可能对更具挑战性的任务表现造成影响。我们希望我们的发现能引起更多对推理模型鲁棒性以及LLMs与基于LLM的智能体上下文管理问题的关注。

摘要 (Abstract)

Large language models (LLMs) exhibiting test-time scaling behavior, such as extended reasoning traces and self-verification, have demonstrated remarkable performance on complex, long-term reasoning tasks. However, the robustness of these reasoning behaviors remains underexplored. To investigate this, we conduct a systematic evaluation of multiple reasoning models across three scenarios: (1) problems augmented with lengthy, irrelevant context; (2) multi-turn conversational settings with independent tasks; and (3) problems presented as a subtask within a complex task. We observe an interesting phenomenon: reasoning models tend to produce much shorter reasoning traces (up to 50%) for the same problem under different context conditions compared to the traces produced when the problem is presented in isolation. A finer-grained analysis reveals that this compression is associated with a decrease in self-verification and uncertainty management behaviors, such as double-checking. While this behavioral shift does not compromise performance on straightforward problems, it might affect performance on more challenging tasks. We hope our findings draw additional attention to both the robustness of reasoning models and the problem of context management for LLMs and LLM-based agents.

关键词: Large Language Models, reasoning behavior, context management, self-verification, reasoning traces, LLM robustness, test-time scaling, uncertainty management
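
The two quantities the abstract reports, relative shortening of the reasoning trace and the drop in self-verification, can be measured with a small sketch like the one below. The whitespace tokenization and the cue list in `VERIFICATION_MARKERS` are assumptions made for illustration; the paper's actual measurement protocol is not described in the abstract.

```python
# Hypothetical cue phrases standing in for self-verification behavior.
VERIFICATION_MARKERS = ("double-check", "let me verify", "wait")


def trace_stats(trace):
    """Count whitespace-separated tokens and verification cues in a trace."""
    text = trace.lower()
    return {
        "tokens": len(text.split()),
        "verifications": sum(text.count(marker) for marker in VERIFICATION_MARKERS),
    }


def relative_shortening(isolated, in_context):
    """Fraction by which the in-context trace is shorter than the isolated one."""
    base = trace_stats(isolated)["tokens"]
    return 1.0 - trace_stats(in_context)["tokens"] / base


# Hypothetical traces for the same problem under two conditions.
isolated_trace = "12 times 7 is 84 wait let me verify 7 times 12 is 84 so add 6 to get 90"
context_trace = "12 times 7 is 84 so add 6 to get 90"

print(relative_shortening(isolated_trace, context_trace))
print(trace_stats(isolated_trace)["verifications"], trace_stats(context_trace)["verifications"])
```

Substring matching on cue phrases is crude (it would also count "wait" inside "await"), so a real analysis would need a more careful classifier of verification behavior.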

234. ❌ Property-Level Flood Risk Assessment Using AI-Enabled Street-View Lowest Floor Elevation Extraction and ML Imputation Across Texas

作者: Xiangpeng Li, Yu-Hsuan Ho, Sam D Brody, Ali Mostafavi 期刊/来源: arxiv 发布日期: 2026-04-01 arXiv链接: http://arxiv.org/abs/2604.01153v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文主要研究利用AI分析街景图像进行洪水风险评估，属于AI在科学领域的应用，但与评审背景中强调的大模型和深度学习技术原理创新关联较弱。论文中使用了AI（如Elev-Vision框架）和机器学习（如Random Forest、Gradient Boosting）方法，但未涉及大语言模型（LLMs）、深度学习架构创新（如MoE、Scaling Laws）、训练对齐技术（如RLHF、Instruction Tuning）、推理优化（如RAG、Quantization）或智能体系统等核心关键词。唯一相关的是“AI for Science OR Bioinformatics OR Cheminformatics”，因为论文将AI应用于环境科学和灾害风险评估，属于AI for Science范畴，但并非生物信息学或化学信息学，因此给予5分（有一定关联）。其他关键词均未在论文标题或摘要中体现，完全无关，评分为0分。加权总分为5.0，远低于动态及格分26.6，表明论文与评审关注的大模型和深度学习技术原理创新主题相关性很低。
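
The arithmetic in the rationale above can be checked with a short sketch. It assumes every keyword carries weight 1.0, as listed in the report configuration, and that the weighted total is the plain sum of weight × relevance. The report's actual scoring code is not shown, so both points are assumptions, and the function names are hypothetical.

```python
def passing_score(total_weight, base=5.0, per_weight=0.8):
    """Passing threshold from the report header: 5.0 + 0.8 x total weight."""
    return base + per_weight * total_weight


def weighted_total(relevance_by_keyword, weight=1.0):
    """Sum of weight x relevance over keywords (assumed scoring rule)."""
    return sum(weight * relevance for relevance in relevance_by_keyword.values())


# Only one keyword received a non-zero relevance for this paper.
relevance = {"AI for Science OR Bioinformatics OR Cheminformatics": 5.0}

total = weighted_total(relevance)
threshold = passing_score(total_weight=27.0)
print(total, round(threshold, 1), total >= threshold)
```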

!!! tip deepseek-chat TL;DR

该论文开发了一个基于AI街景图像分析和机器学习插补的三阶段流程，用于提取建筑物最低楼层高程并评估德克萨斯州房产级别的洪水风险，结果表明该方法可扩展并改进区域洪水风险表征。

摘要翻译

本文提出，基于人工智能的街景影像分析，结合性能门控的机器学习插补方法，为区域尺度洪水风险评估中生成建筑特定高程数据提供了一条可行路径。我们在德克萨斯州18个关注区域开发并应用了一套三阶段流程：（1）利用Elev-Vision框架从谷歌街景影像中提取最低楼层高程（LFE）及街道标高与最低楼层间的高程差（HDSL）；（2）基于16项地形、水文、地理和洪水暴露特征训练的随机森林与梯度提升模型，对缺失的HDSL值进行插补；（3）将生成的高程数据集与Fathom百年一遇淹没面及美国陆军工程兵团（USACE）水深-损失函数相结合，以估算单体建筑的室内洪水深度与预期损失。在12,241栋住宅建筑中，73.4%的地块可获得街景影像，其中49.0%（5,992栋建筑）成功实现了LFE/HDSL的直接提取。在交叉验证性能可靠的13个关注区域中保留了插补结果，所选模型的R平方值介于0.159至0.974之间；另有5个关注区域因性能不足被明确排除在预测范围外。结果表明，基于街景的高程测绘虽无法覆盖所有建筑，但其具备足够的可扩展性，能够超越灾害暴露分析，实现对室内淹没深度与预期损失的结构级估算，从而实质性改进区域洪水风险表征。在科学层面，本研究将LFE估算从试点性的概念验证推进为区域尺度的端到端工作流程；在实践层面，为缺乏完整高程证书但需要地块级信息以支持减灾、规划与洪水风险管理的行政辖区提供了一个可复制的框架。

摘要 (Abstract)

This paper argues that AI-enabled analysis of street-view imagery, complemented by performance-gated machine-learning imputation, provides a viable pathway for generating building-specific elevation data at regional scale for flood risk assessment. We develop and apply a three-stage pipeline across 18 areas of interest (AOIs) in Texas that (1) extracts the lowest floor elevation (LFE) and the height difference between street grade and the lowest floor (HDSL) from Google Street View imagery using the Elev-Vision framework, (2) imputes missing HDSL values with Random Forest and Gradient Boosting models trained on 16 terrain, hydrologic, geographic, and flood-exposure features, and (3) integrates the resulting elevation dataset with Fathom 1-in-100 year inundation surfaces and USACE depth-damage functions to estimate property-specific interior flood depth and expected loss. Across 12,241 residential structures, street-view imagery was available for 73.4% of parcels and direct LFE/HDSL extraction was successful for 49.0% (5,992 structures). Imputation was retained for 13 AOIs where cross-validated performance was defensible, with selected models achieving R-squared values from 0.159 to 0.974; five AOIs were explicitly excluded from prediction because performance was insufficient. The results show that street-view-based elevation mapping is not universally available for every property, but it is sufficiently scalable to materially improve regional flood-risk characterization by moving beyond hazard exposure to structure-level estimates of interior inundation and expected damage. Scientifically, the study advances LFE estimation from a pilot-scale proof of concept to a regional, end-to-end workflow. Practically, it offers a replicable framework for jurisdictions that lack comprehensive Elevation Certificates but need parcel-level information to support mitigation, planning, and flood-risk management.

关键词: flood risk assessment, street-view imagery, lowest floor elevation, machine learning imputation, property-specific elevation, regional scale, AI-enabled analysis, depth-damage functions
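
Stage (3) of the pipeline, going from elevations to interior flood depth and expected loss, can be sketched as below. The sketch assumes that LFE equals street-grade elevation plus HDSL and that all heights share one unit. The depth-damage curve and the property value are made-up numbers, not USACE or Fathom data, and the function names are hypothetical.

```python
def lowest_floor_elevation(street_grade, hdsl):
    """LFE as street-grade elevation plus the street-to-floor height difference."""
    return street_grade + hdsl


def interior_depth(water_surface, lfe):
    """Water depth above the lowest floor; zero if the floor stays dry."""
    return max(0.0, water_surface - lfe)


def damage_fraction(depth, curve):
    """Piecewise-linear interpolation on (depth, fraction) points sorted by depth."""
    if depth <= curve[0][0]:
        return curve[0][1]
    for (d0, f0), (d1, f1) in zip(curve, curve[1:]):
        if depth <= d1:
            return f0 + (f1 - f0) * (depth - d0) / (d1 - d0)
    return curve[-1][1]


# Hypothetical depth-damage curve and structure.
curve = [(0.0, 0.0), (1.0, 0.2), (2.0, 0.4)]
lfe = lowest_floor_elevation(street_grade=10.0, hdsl=0.5)
depth = interior_depth(water_surface=11.5, lfe=lfe)
loss = damage_fraction(depth, curve) * 200_000

print(lfe, depth, loss)
```

The same inundation surface gives very different losses depending on HDSL, which is why building-specific elevation matters beyond simple hazard-exposure mapping.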

235. ❌ Deep Reinforcement Learning for Robotic Manipulation under Distribution Shift with Bounded Extremum Seeking

作者: Shaifalee Saxena, Rafael Fierro, Alexander Scheinker 期刊/来源: arxiv 发布日期: 2026-04-01 arXiv链接: http://arxiv.org/abs/2604.01142v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究的是机器人操作中的深度强化学习（DDPG）与有界极值搜索（bounded ES）的混合控制方法，以应对分布偏移问题。所有关键词均与大模型、深度学习技术原理或AI for Science相关，但论文未涉及任何大模型（LLM）、语言模型、提示工程、对齐、推理、代理、压缩、幻觉缓解、可解释性、世界模型、模型合并、上下文学习或生物/化学信息学等主题。论文专注于传统的深度强化学习在机器人控制中的应用，与给定的大模型相关关键词完全无关。

!!! tip deepseek-chat TL;DR

该论文提出了一种结合深度强化学习（DDPG）和有界极值搜索（bounded ES）的混合控制器，以提升机器人在推和抓放任务中面对分布偏移（如目标变化、摩擦变化）时的鲁棒性，实验表明该方法在多种分布外设置下有效。

摘要翻译

强化学习在机器人操作任务中展现出优异性能，但当测试条件偏离训练分布时，习得策略的性能常出现退化。这一局限在接触密集型任务（如推物和抓放操作）中尤为突出，因为目标、接触条件或机器人动力学的变化可能在推理阶段使系统处于分布外状态。本文研究一种混合控制器，它将强化学习与有界极值搜索相结合，以提升此类条件下的鲁棒性。在所提出的方法中，深度确定性策略梯度（DDPG）策略在标准条件下针对机器人推物和抓放任务进行训练，随后在部署阶段与有界极值搜索（bounded ES）结合使用。强化学习策略提供快速的操作行为，而有界极值搜索则确保当操作条件偏离训练环境时，整体控制器对时变干扰具有鲁棒性。该复合控制器在多种分布外场景下进行评估，包括时变目标和空间变化的摩擦区域。

摘要 (Abstract)

Reinforcement learning has shown strong performance in robotic manipulation, but learned policies often degrade in performance when test conditions differ from the training distribution. This limitation is especially important in contact-rich tasks such as pushing and pick-and-place, where changes in goals, contact conditions, or robot dynamics can drive the system out-of-distribution at inference time. In this paper, we investigate a hybrid controller that combines reinforcement learning with bounded extremum seeking to improve robustness under such conditions. In the proposed approach, deep deterministic policy gradient (DDPG) policies are trained under standard conditions on the robotic pushing and pick-and-place tasks, and are then combined with bounded ES during deployment. The RL policy provides fast manipulation behavior, while bounded ES ensures robustness of the overall controller to time variations when operating conditions depart from those seen during training. The resulting controller is evaluated under several out-of-distribution settings, including time-varying goals and spatially varying friction patches.

关键词: Deep Reinforcement Learning, Robotic Manipulation, Distribution Shift, Bounded Extremum Seeking, DDPG, Robustness, Out-of-distribution, Hybrid Controller
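
The bounded extremum seeking component can be illustrated on its own with a scalar sketch. It uses the standard bounded-ES form dx/dt = sqrt(αω) cos(ωt + k C(x, t)), whose averaged behavior is a gradient descent on the measured cost C. The quadratic cost, the gains, and the Euler integration are assumptions for illustration. How the paper couples this update with the DDPG policy is not described in the abstract and is not modeled here.

```python
import math


def bounded_es(cost, x0, k=2.0, alpha=0.5, omega=200.0, dt=5e-4, horizon=10.0):
    """Minimize cost(x, t) using only cost measurements (no gradient)."""
    x, t = x0, 0.0
    rate = math.sqrt(alpha * omega)  # the update rate never exceeds this bound
    for _ in range(int(horizon / dt)):
        x += dt * rate * math.cos(omega * t + k * cost(x, t))
        t += dt
    return x


def quadratic_cost(x, t):
    """Hypothetical cost with its minimum at x = 1."""
    return (x - 1.0) ** 2


print(bounded_es(quadratic_cost, x0=0.0))
```

Because the cost enters only through the phase of the cosine, the parameter update stays bounded even if the measured cost is large or noisy. The parameter ends up oscillating around the minimizer with an amplitude of roughly sqrt(α/ω).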

236. ❌ Reconsidering Dependency Networks from an Information Geometry Perspective

作者: Kazuya Takabatake, Shotaro Akaho 期刊/来源: arxiv 发布日期: 2026-04-01 arXiv链接: http://arxiv.org/abs/2604.01117v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究依赖网络的理论基础，从信息几何角度分析伪吉布斯采样，属于概率图模型和统计学习理论领域。所有评分关键词均涉及大模型、深度学习及其相关技术（如训练方法、推理优化、对齐、应用等），而本文完全不涉及这些主题，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

本文从信息几何角度重新审视依赖网络，通过将伪吉布斯采样解释为m-投影，引入全条件散度并推导上界来定位平稳分布，并将结构和参数学习重新表述为可分解的优化问题，证明了学习模型分布随样本增加收敛到真实分布。

摘要翻译

依赖网络（Dependency networks, Heckerman et al., 2000）通过伪吉布斯采样（pseudo-Gibbs sampling）结合独立学习的局部条件分布，为建模多变量复杂系统提供了一个灵活框架。尽管其在计算上优于贝叶斯网络与马尔可夫网络，依赖网络的理论基础仍不完善，主要因为其模型分布——定义为伪吉布斯采样的平稳分布——缺乏闭式表达式。本文对伪吉布斯采样进行了信息几何分析，将每一步采样解释为向全条件流形（full conditional manifold）的 m-投影。基于此解释，我们引入了全条件散度（full conditional divergence），并推导出一个上界，用以刻画平稳分布在概率分布空间中的位置。随后，我们将结构学习与参数学习重新表述为可分解为各节点独立子问题的优化问题，并证明当训练样本数量趋于无穷时，所学模型分布收敛于真实潜在分布。实验证实，所提出的上界在实践中是紧致的。

摘要 (Abstract)

Dependency networks (Heckerman et al., 2000) provide a flexible framework for modeling complex systems with many variables by combining independently learned local conditional distributions through pseudo-Gibbs sampling. Despite their computational advantages over Bayesian and Markov networks, the theoretical foundations of dependency networks remain incomplete, primarily because their model distributions – defined as stationary distributions of pseudo-Gibbs sampling – lack closed-form expressions. This paper develops an information-geometric analysis of pseudo-Gibbs sampling, interpreting each sampling step as an m-projection onto a full conditional manifold. Building on this interpretation, we introduce the full conditional divergence and derive an upper bound that characterizes the location of the stationary distribution in the space of probability distributions. We then reformulate both structure and parameter learning as optimization problems that decompose into independent subproblems for each node, and prove that the learned model distribution converges to the true underlying distribution as the number of training samples grows to infinity. Experiments confirm that the proposed upper bound is tight in practice.

关键词: Dependency Networks, Information Geometry, Pseudo-Gibbs Sampling, Full Conditional Divergence, Stationary Distribution, Structure Learning, Parameter Learning, Convergence Analysis
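
Pseudo-Gibbs sampling and its stationary distribution can be made concrete for two binary variables. The sketch below builds the transition matrix of one fixed-order sweep from two local conditionals that were specified independently and need not agree with any joint distribution, then finds the stationary distribution by power iteration. The conditional tables are made-up numbers, and the information-geometric analysis of the paper is not reproduced.

```python
from itertools import product


def sweep_matrix(p1_given_x2, p2_given_x1):
    """One sweep: resample x1 given x2, then x2 given the new x1.

    p1_given_x2[v] is P(x1 = 1 | x2 = v); p2_given_x1[v] is P(x2 = 1 | x1 = v).
    """
    states = list(product((0, 1), repeat=2))
    matrix = {s: {t: 0.0 for t in states} for s in states}
    for (x1, x2) in states:
        for new_x1 in (0, 1):
            prob_x1 = p1_given_x2[x2] if new_x1 == 1 else 1.0 - p1_given_x2[x2]
            for new_x2 in (0, 1):
                on = p2_given_x1[new_x1]
                prob_x2 = on if new_x2 == 1 else 1.0 - on
                matrix[(x1, x2)][(new_x1, new_x2)] += prob_x1 * prob_x2
    return states, matrix


def stationary(states, matrix, iterations=500):
    """Power iteration from the uniform distribution."""
    dist = {s: 1.0 / len(states) for s in states}
    for _ in range(iterations):
        dist = {t: sum(dist[s] * matrix[s][t] for s in states) for t in states}
    return dist


# Hypothetical, independently learned local conditionals.
states, matrix = sweep_matrix({0: 0.2, 1: 0.9}, {0: 0.3, 1: 0.6})
print(stationary(states, matrix))
```

When the two conditionals do come from a single joint distribution, the procedure reduces to ordinary Gibbs sampling and the stationary distribution is that joint. Otherwise the stationary distribution generally has no closed form, which is the gap the paper's upper bound addresses.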

237. ❌ Inverse Design of Optical Multilayer Thin Films using Robust Masked Diffusion Models

作者: Jonas Schaible, Asena Karolin Özdemir, Charlotte Debus, Sven Burger, Achim Streit, Christiane Becker, Klaus Jäger, Markus Götz 期刊/来源: arxiv 发布日期: 2026-04-01 arXiv链接: http://arxiv.org/abs/2604.01106v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文提出OptoLlama，一个基于掩码扩散语言模型的大模型，用于光学多层薄膜的逆向设计，属于AI for Science（科学AI）领域，因此该关键词得10分。论文使用语言模型架构，因此与‘Large Language Models’相关，但并非核心研究LLM技术本身，得8分。其他关键词如MoE、SLMs、训练方法、推理优化、代理系统等均未涉及，得0分。

!!! tip deepseek-chat TL;DR

该研究提出了一种基于掩码扩散语言模型OptoLlama的光学多层薄膜逆向设计方法，能够从目标光谱中生成材料-厚度序列，相比现有基线方法将平均绝对光谱误差降低了2.9-3.45倍。

摘要翻译

光学多层膜堆栈的逆向设计旨在从期望的目标光谱反推层材料、厚度及排列顺序。由于设计空间庞大且解不具唯一性，这始终是一项长期挑战。我们提出OptoLlama，一种基于掩码扩散语言模型的逆向薄膜设计方法，可根据光学光谱进行设计。该方法将多层堆栈表示为材料-厚度标记序列，以反射率、吸收率和透射率光谱为条件生成结构，并学习从光学响应到结构的概率映射。在包含3000个目标光谱的代表性测试集上评估，OptoLlama将平均绝对光谱误差相较于最近邻模板基线降低了2.9倍，相较于最先进的数据驱动基线OptoGPT降低了3.45倍。针对设计目标和专家定义目标的案例研究表明，该模型能复现特征光谱曲线，并还原具有物理意义的堆栈结构模式，包括分布式布拉格反射器。这些成果确立了基于扩散的序列建模作为逆向光子设计的强大框架。

摘要 (Abstract)

Inverse design of optical multilayer stacks seeks to infer layer materials, thicknesses, and ordering from a desired target spectrum. It is a long-standing challenge due to the large design space and non-unique solutions. We introduce OptoLlama, a masked diffusion language model for inverse thin-film design from optical spectra. Representing multilayer stacks as sequences of material-thickness tokens, OptoLlama conditions generation on reflectance, absorptance, and transmittance spectra and learns a probabilistic mapping from optical response to structure. Evaluated on a representative test set of 3,000 targets, OptoLlama reduces the mean absolute spectral error by 2.9-fold relative to a nearest-neighbor template baseline and by 3.45-fold relative to the state-of-the-art data-driven baseline, called OptoGPT. Case studies on designed and expert-defined targets show that the model reproduces characteristic spectral features and recovers physically meaningful stack motifs, including distributed Bragg reflectors. These results establish diffusion-based sequence modeling as a powerful framework for inverse photonic design.

关键词: inverse design, optical multilayer thin films, masked diffusion model, language model, OptoLlama, spectral error reduction, photonic design, sequence modeling
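
The sequence representation and the evaluation metric mentioned in the abstract can be sketched as follows. The token format `Material_thickness` (one token per layer, thickness in nanometers) is a hypothetical choice; the abstract only says that stacks are represented as material-thickness tokens. The spectra are made-up values, and no optical simulation is included.

```python
def encode_stack(layers):
    """Turn (material, thickness_nm) layers into one token per layer."""
    return [f"{material}_{thickness_nm}" for material, thickness_nm in layers]


def decode_stack(tokens):
    """Inverse of encode_stack."""
    layers = []
    for token in tokens:
        material, thickness = token.rsplit("_", 1)
        layers.append((material, int(thickness)))
    return layers


def mean_absolute_spectral_error(target, predicted):
    """Mean absolute difference between two spectra on the same wavelength grid."""
    return sum(abs(t - p) for t, p in zip(target, predicted)) / len(target)


# Hypothetical alternating high/low index stack, as in a Bragg reflector.
stack = [("TiO2", 60), ("SiO2", 95), ("TiO2", 60), ("SiO2", 95)]
tokens = encode_stack(stack)
print(tokens)
print(mean_absolute_spectral_error([0.1, 0.5, 0.9], [0.2, 0.5, 0.7]))
```

Treating a stack as a token sequence is what lets a language-model style architecture generate designs, and the spectral error compares the spectrum of a generated stack against the target.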

238. ❌ Model-Based Learning of Near-Optimal Finite-Window Policies in POMDPs

作者: Philip Jordan, Maryam Kamgarpour 期刊/来源: arxiv 发布日期: 2026-04-01 arXiv链接: http://arxiv.org/abs/2604.01024v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

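Scoring rationale: The paper studies model-based learning of finite-window policies in partially observable Markov decision processes (POMDPs), which falls under reinforcement learning and decision theory. It does not touch on large language models, deep learning, AI for Science, or any of the listed technical keywords, so every keyword is unrelated to the paper's topic.

!!! tip deepseek-chat TL;DR

The paper studies how to learn near-optimal finite-window policies in tabular POMDPs by estimating a superstate MDP model, and gives sample complexity guarantees for estimation from a single trajectory.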

Abstract

We study model-based learning of finite-window policies in tabular partially observable Markov decision processes (POMDPs). A common approach to learning under partial observability is to approximate unbounded history dependencies using finite action-observation windows. This induces a finite-state Markov decision process (MDP) over histories, referred to as the superstate MDP. Once a model of this superstate MDP is available, standard MDP algorithms can be used to compute optimal policies, motivating the need for sample-efficient model estimation. Estimating the superstate MDP model is challenging because trajectories are generated by interaction with the original POMDP, creating a mismatch between the sampling process and target model. We propose a model estimation procedure for tabular POMDPs and analyze its sample complexity. Our analysis exploits a connection between filter stability and concentration inequalities for weakly dependent random variables. As a result, we obtain tight sample complexity guarantees for estimating the superstate MDP model from a single trajectory. Combined with value iteration, this yields approximately optimal finite-window policies for the POMDP.

Keywords: POMDPs, finite-window policies, model-based learning, superstate MDP, sample complexity, value iteration, filter stability, tabular POMDPs
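
To make the superstate MDP idea in the abstract concrete, here is a minimal sketch, not the authors' estimator. It assumes a superstate is the tuple of the last `window` (action, observation) pairs, uses plain empirical counts from one trajectory collected under a uniformly random policy, and then runs value iteration on the estimated model. All function names and the toy POMDP layout (`T`, `O`, `R`) are hypothetical.

```python
import numpy as np
from collections import defaultdict


def simulate(T, O, R, steps, rng):
    """Roll out one trajectory under a uniformly random behavior policy.

    T[a, s, s2]: transition probs, O[s, o]: observation probs, R[s, a]: reward.
    """
    n_actions, n_states, _ = T.shape
    n_obs = O.shape[1]
    s = int(rng.integers(n_states))
    traj = []
    for _ in range(steps):
        a = int(rng.integers(n_actions))
        r = float(R[s, a])
        s = int(rng.choice(n_states, p=T[a, s]))
        o = int(rng.choice(n_obs, p=O[s]))
        traj.append((a, o, r))
    return traj


def estimate_superstate_mdp(traj, window):
    """Empirical transition and reward model over (action, obs) windows."""
    counts = defaultdict(lambda: defaultdict(int))
    reward_sum = defaultdict(float)
    for t in range(window, len(traj)):
        z = tuple((a, o) for a, o, _ in traj[t - window:t])
        a, o, r = traj[t]
        z_next = z[1:] + ((a, o),)  # slide the window by one step
        counts[(z, a)][z_next] += 1
        reward_sum[(z, a)] += r
    P, R_hat = {}, {}
    for key, nxt in counts.items():
        n = sum(nxt.values())
        P[key] = {z2: c / n for z2, c in nxt.items()}
        R_hat[key] = reward_sum[key] / n
    return P, R_hat


def value_iteration(P, R_hat, gamma=0.9, iters=200):
    """Value iteration restricted to superstate-action pairs seen in the data."""
    actions = defaultdict(list)
    for z, a in P:
        actions[z].append(a)
    actions = dict(actions)
    states = set(actions) | {z2 for nxt in P.values() for z2 in nxt}
    V = {z: 0.0 for z in states}

    def q(z, a):
        return R_hat[(z, a)] + gamma * sum(p * V[z2] for z2, p in P[(z, a)].items())

    for _ in range(iters):
        V = {z: max((q(z, a) for a in actions.get(z, ())), default=0.0) for z in states}
    policy = {z: max(acts, key=lambda a: q(z, a)) for z, acts in actions.items()}
    return V, policy
```

The sketch ignores the mismatch between the sampling process and the target model that the abstract highlights; handling that mismatch is the paper's contribution and is not reproduced here.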

239. ❌ EmbedPart: Embedding-Driven Graph Partitioning for Scalable Graph Neural Network Training

Authors: Nikolai Merkel, Ruben Mayer, Volker Markl, Hans-Arno Jacobsen Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01000v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

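Scoring rationale: The paper focuses on graph partitioning for graph neural network (GNN) training and proposes EmbedPart, a fast partitioning method based on node embeddings. The scoring keywords all relate directly to large language models (LLMs), deep learning technical principles, or AI applications in science, whereas this paper addresses distributed training optimization for GNNs, a graph machine learning topic with no direct link to LLMs, MoE, alignment, reasoning, agents, quantization, or the other keywords. It also does not involve specific AI for Science applications such as bioinformatics or cheminformatics. All keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

The paper addresses the trade-off between speed and quality in graph partitioning for large-scale GNN training. It proposes EmbedPart, a fast partitioning method based on node embeddings, which achieves more than 100x speedup over Metis while maintaining competitive partitioning quality.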

Abstract

Graph Neural Networks (GNNs) are widely used for learning on graph-structured data, but scaling GNN training to massive graphs remains challenging. To enable scalable distributed training, graphs are divided into smaller partitions that are distributed across multiple machines such that inter-machine communication is minimized and computational load is balanced. In practice, existing partitioning approaches face a fundamental trade-off between partitioning overhead and partitioning quality. We propose EmbedPart, an embedding-driven partitioning approach that achieves both speed and quality. Instead of operating directly on irregular graph structures, EmbedPart leverages node embeddings produced during the actual GNN training workload and clusters these dense embeddings to derive a partitioning. EmbedPart achieves more than 100x speedup over Metis while maintaining competitive partitioning quality and accelerating distributed GNN training. Moreover, EmbedPart naturally supports graph updates and fast repartitioning, and can be applied to graph reordering to improve data locality and accelerate single-machine GNN training. By shifting partitioning from irregular graph structures to dense embeddings, EmbedPart enables scalable and high-quality graph data optimization.

Keywords: Graph Neural Networks, GNN training, graph partitioning, node embeddings, distributed training, scalability, EmbedPart, data locality
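
The core move described in the abstract, clustering dense node embeddings instead of working on the graph structure, can be illustrated with a small sketch. The abstract does not say which clustering algorithm EmbedPart uses, so plain k-means (Lloyd's algorithm) is an assumption here, and the sketch does no load balancing. The function names are hypothetical.

```python
import numpy as np


def kmeans_partition(emb, k, iters=20, seed=0):
    """Assign each node to one of k partitions by k-means on its embedding."""
    rng = np.random.default_rng(seed)
    # Start from k distinct nodes' embeddings as initial centers.
    centers = emb[rng.choice(len(emb), size=k, replace=False)].copy()
    assign = np.zeros(len(emb), dtype=int)
    for _ in range(iters):
        # Squared distance from every node to every center: shape (n, k).
        d2 = ((emb[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for c in range(k):
            members = emb[assign == c]
            if len(members):  # keep the old center if a cluster is empty
                centers[c] = members.mean(axis=0)
    return assign


def edge_cut_fraction(edges, assign):
    """Fraction of edges whose endpoints land in different partitions."""
    src, dst = edges[:, 0], edges[:, 1]
    return float((assign[src] != assign[dst]).mean())
```

A partitioner for real workloads would also need to bound partition sizes; `edge_cut_fraction` only measures the communication side of the trade-off.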

240. ❌ Rapid mixing in positively weighted restricted Boltzmann machines

Authors: Weiming Feng, Heng Guo, Minji Yang Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00963v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

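Scoring rationale: The paper analyzes the mixing time of Markov chain Monte Carlo sampling algorithms for restricted Boltzmann machines (RBMs), which belongs to classical probabilistic graphical models and statistical physics. None of the keywords, which cover large language models, innovations in deep learning techniques, and AI applications in science, relate to it, so all keyword relevance scores are 0.

!!! tip deepseek-chat TL;DR

The paper proves that the alternating-scan sampler for positively weighted restricted Boltzmann machines has polylogarithmic mixing time, and obtains new mixing time bounds up to the critical thresholds by analyzing the same chain and the Glauber dynamics for ferromagnetic two-spin systems.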

Abstract

We show polylogarithmic mixing time bounds for the alternating-scan sampler for positively weighted restricted Boltzmann machines. This is done via analysing the same chain and the Glauber dynamics for ferromagnetic two-spin systems, where we obtain new mixing time bounds up to the critical thresholds.

Keywords: restricted Boltzmann machines, mixing time, alternating-scan sampler, Glauber dynamics, ferromagnetic two-spin systems, critical thresholds, polylogarithmic mixing time
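
For readers unfamiliar with the chain being analyzed, the following sketch shows one common reading of an alternating-scan sampler for an RBM: resample the whole hidden layer given the visible layer, then the whole visible layer given the hidden layer. The {0, 1} unit encoding, the sigmoid conditionals, and the function names are assumptions for illustration; the paper's exact parameterization may differ. "Positively weighted" corresponds here to `W >= 0`.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def alternating_scan(W, b, c, steps, rng, v0=None):
    """Run `steps` >= 1 sweeps of layer-wise Gibbs updates on an RBM.

    W: (n_vis, n_hid) non-negative weights, b: visible biases, c: hidden biases.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    n_vis, n_hid = W.shape
    v = rng.integers(0, 2, size=n_vis) if v0 is None else v0.copy()
    for _ in range(steps):
        # Hidden units are conditionally independent given the visible layer.
        h = (rng.random(n_hid) < sigmoid(c + v @ W)).astype(int)
        # Visible units are conditionally independent given the hidden layer.
        v = (rng.random(n_vis) < sigmoid(b + W @ h)).astype(int)
    return v, h
```

The mixing time result in the abstract concerns how many such sweeps are needed before the state is close to the stationary distribution; the sketch does not estimate that quantity.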

241. ❌ Focal plane wavefront control with model-based reinforcement learning

Authors: Jalo Nousiainen, Iremsu Taskin, Markus Kasper, Gilles Orban De Xivry, Olivier Absil Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00993v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

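Scoring rationale: The paper focuses on adaptive optics control for astronomical imaging, using model-based reinforcement learning (the PO4NCPA algorithm) to correct non-common-path aberrations. Although it is an application of AI to science (astronomy), the technique is a specialized use of reinforcement learning for optical control and is unrelated to all of the large-model and deep learning keywords (LLMs, MoE, SFT, RAG, and so on). Only the last keyword, "AI for Science", has some relevance (5.0/10), because the paper does apply machine learning to a scientific problem (astronomical imaging), but it is not a typical application of large models or deep learning to science.

!!! tip deepseek-chat TL;DR

The paper proposes PO4NCPA, a model-based reinforcement learning method that automatically detects and corrects static and dynamic non-common-path aberrations in astronomical telescopes, and validates its effectiveness under a range of conditions through numerical simulation.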

Abstract

The direct imaging of potentially habitable exoplanets is one prime science case for high-contrast imaging instruments on extremely large telescopes. Most such exoplanets orbit close to their host stars, where their observation is limited by fast-moving atmospheric speckles and quasi-static non-common-path aberrations (NCPA). Conventional NCPA correction methods often use mechanical mirror probes, which compromise performance during operation. This work presents machine-learning-based NCPA control methods that automatically detect and correct both dynamic and static NCPA errors by leveraging sequential phase diversity. We extend previous work in reinforcement learning for AO to focal plane control. A new model-based RL algorithm, Policy Optimization for NCPAs (PO4NCPA), interprets the focal-plane image as input data and, through sequential phase diversity, determines phase corrections that optimize both non-coronagraphic and post-coronagraphic PSFs without prior system knowledge. Further, we demonstrate the effectiveness of this approach by numerically simulating static NCPA errors on a ground-based telescope and an infrared imager affected by water-vapor-induced seeing (dynamic NCPAs). Simulations show that PO4NCPA robustly compensates static and dynamic NCPAs. In static cases, it achieves near-optimal focal-plane light suppression with a coronagraph and near-optimal Strehl without one. With dynamic NCPAs, it matches the performance of the modal least-squares reconstruction combined with a 1-step delay integrator in these metrics. The method remains effective for the ELT pupil, vector vortex coronagraph, and under photon and background noise. PO4NCPA is model-free and can be directly applied to standard imaging as well as to any coronagraph. Its sub-millisecond inference times and performance also make it suitable for real-time low-order correction of atmospheric turbulence beyond HCI.

Keywords: focal plane wavefront control, model-based reinforcement learning, non-common-path aberrations, adaptive optics, exoplanet imaging, phase diversity, PO4NCPA, coronagraph

242. ❌ Differentially Private Manifold Denoising

Authors: Jiaqi Wu, Yiqing Sun, Zhigang Yao Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00942v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

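Scoring rationale: The paper focuses on applying differential privacy to manifold denoising, proposing an iterative framework that uses a sensitive reference dataset to correct noisy query points while providing rigorous privacy guarantees. The work belongs to privacy-preserving machine learning and has no direct link to deep learning, large-model technical principles, or AI for Science applications. The scoring keywords all center on large models, deep learning techniques, and their applications, whereas this paper involves no large-model architecture, training method, inference optimization, alignment technique, agent system, or scientific AI application, so all keyword relevance scores are 0.

!!! tip deepseek-chat TL;DR

The paper proposes a differentially private manifold denoising framework that protects the privacy of the reference dataset while using an iterative correction procedure to move noisy query points back toward the manifold, and provides rigorous privacy guarantees and a utility analysis.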

Abstract

We introduce a differentially private manifold denoising framework that allows users to exploit sensitive reference datasets to correct noisy, non-private query points without compromising privacy. The method follows an iterative procedure that (i) privately estimates local means and tangent geometry using the reference data under calibrated sensitivity, (ii) projects query points along the privately estimated subspace toward the local mean via corrective steps at each iteration, and (iii) performs rigorous privacy accounting across iterations and queries using $(\varepsilon,\delta)$-differential privacy (DP). Conceptually, this framework brings differential privacy to manifold methods, retaining sufficient geometric signal for downstream tasks such as embedding, clustering, and visualization, while providing formal DP guarantees for the reference data. Practically, the procedure is modular and scalable, separating DP-protected local geometry (means and tangents) from budgeted query-point updates, with a simple scheduler allocating privacy budget across iterations and queries. Under standard assumptions on manifold regularity, sampling density, and measurement noise, we establish high-probability utility guarantees showing that corrected queries converge toward the manifold at a non-asymptotic rate governed by sample size, noise level, bandwidth, and the privacy budget. Simulations and case studies demonstrate accurate signal recovery under moderate privacy budgets, illustrating clear utility-privacy trade-offs and providing a deployable DP component for manifold-based workflows in regulated environments without reengineering privacy systems.

Keywords: differential privacy, manifold denoising, privacy-preserving machine learning, local geometry estimation, utility-privacy trade-off, iterative correction, sensitive data, formal privacy guarantees
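
Step (i) of the procedure releases a mean under $(\varepsilon,\delta)$-DP and step (ii) moves a query toward it. The sketch below shows only the simplest version of those two pieces: the classical Gaussian mechanism applied to a norm-clipped mean of the full reference set, followed by one corrective step. It is not the paper's calibrated local estimator. It assumes replace-one neighbouring datasets, $0 < \varepsilon < 1$, and omits tangent estimation, bandwidth-based neighbourhoods, and privacy accounting across iterations. All names are hypothetical.

```python
import numpy as np


def clip_rows(X, C):
    """Scale each row so its Euclidean norm is at most C."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X * np.minimum(1.0, C / np.maximum(norms, 1e-12))


def private_mean(X, C, eps, delta, rng):
    """Gaussian-mechanism release of the clipped mean of the reference set."""
    if not (0.0 < eps < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("classical Gaussian mechanism needs 0 < eps < 1, 0 < delta < 1")
    n, d = X.shape
    # Replacing one clipped row moves the mean by at most 2C/n in L2 norm.
    sensitivity = 2.0 * C / n
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return clip_rows(X, C).mean(axis=0) + rng.normal(0.0, sigma, size=d)


def corrective_step(query, mean, step=0.5):
    """Move the query a fraction `step` of the way toward the released mean."""
    return query + step * (mean - query)
```

Each call to `private_mean` spends privacy budget, so an iterative scheme like the one in the abstract has to split $(\varepsilon,\delta)$ across iterations and queries; that scheduling is left out here.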

243. ❌ Multi-Mode Quantum Annealing for Variational Autoencoders with General Boltzmann Priors

Authors: Gilhan Kim, Daniel K. Park Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00919v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

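Scoring rationale: The paper combines variational autoencoders (VAEs) with Boltzmann machine priors and trains them using quantum annealing for sampling, which falls under generative models and quantum computing for machine learning. The keywords all focus on large language models (LLMs) and related techniques (training methods, inference optimization, alignment, agents, and so on). This paper does not involve LLMs or innovations in deep learning technical principles and is only weakly related to "AI for Science" (as an application of AI to scientific computing), so every keyword other than that one is rated 0.

!!! tip deepseek-chat TL;DR

The paper proposes a variational autoencoder with a Boltzmann machine prior (BM-VAE), trained with quantum annealing sampling in three operational modes. It reports faster convergence and lower reconstruction loss than a Gaussian-prior VAE and supports both unconditional and conditional generation.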

Abstract

Variational autoencoders (VAEs) learn compact latent representations of complex data, but their generative capacity is fundamentally constrained by the choice of prior distribution over the latent space. Energy-based priors offer a principled way to move beyond factorized assumptions and capture structured interactions among latent variables, yet training such priors at scale requires accurate and efficient sampling from intractable distributions. Here we present Boltzmann-machine–prior VAEs (BM-VAEs) trained using quantum annealing–based sampling in three distinct operational modes within a single generative system. During training, diabatic quantum annealing (DQA) provides unbiased Boltzmann samples for gradient estimation of the energy-based prior; for unconditional generation, slower quantum annealing (QA) concentrates samples near low-energy minima; for conditional generation, bias fields are added to direct sampling toward attribute-specific regions of the energy landscape (c-QA). Using up to 2000 qubits on a D-Wave Advantage2 processor, we demonstrate stable and efficient training across multiple datasets, with faster convergence and lower reconstruction loss than a Gaussian-prior VAE. The learned Boltzmann prior enables unconditional generation by sampling directly from the energy-based latent distribution, a capability that plain autoencoders lack, and conditional generation through latent biasing that leverages the learned pairwise interactions.

Keywords: Variational Autoencoders, Boltzmann Machine Prior, Quantum Annealing, Generative Models, Latent Representations, Energy-based Priors, D-Wave Processor, Conditional Generation

244. ❌ Generalization Bounds for Spectral GNNs via Fourier Domain Analysis

Authors: Vahan A. Martirosyan, Daniele Malitesta, Hugues Talbot, Jhony H. Giraldo, Fragkiskos D. Malliaros Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00918v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

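Scoring rationale: The paper is a theoretical analysis of spectral graph neural networks (spectral GNNs), focusing on generalization bounds and stability estimates in the graph Fourier domain; it is foundational theory for graph neural networks. The scoring keywords all concern large language models (LLMs) and related techniques (training methods, inference optimization, applications, and so on), whereas the paper involves no language models, large-model techniques, or applications of them in science, so all keyword relevance scores are 0.

!!! tip deepseek-chat TL;DR

The paper analyzes spectral GNNs in the graph Fourier domain, derives data-dependent, depth- and order-aware generalization bounds and stability estimates, and highlights practical design choices that avoid frequency amplification across layers.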

Abstract

Spectral graph neural networks learn graph filters, but their behavior with increasing depth and polynomial order is not well understood. We analyze these models in the graph Fourier domain, where each layer becomes an element-wise frequency update, separating the fixed spectrum from trainable parameters and making depth and order explicit. In this setting, we show that Gaussian complexity is invariant under the Graph Fourier Transform, which allows us to derive data-dependent, depth, and order-aware generalization bounds together with stability estimates. In the linear case, our bounds are tighter, and on real graphs, the data-dependent term correlates with the generalization gap across polynomial bases, highlighting practical choices that avoid frequency amplification across layers.

Keywords: Spectral Graph Neural Networks, Graph Fourier Transform, Generalization Bounds, Stability Estimates, Frequency Amplification, Polynomial Bases, Gaussian Complexity, Real Graphs
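
The statement that each layer "becomes an element-wise frequency update" can be checked numerically for a single linear polynomial filter. The sketch assumes a symmetric normalized Laplacian $L = U \Lambda U^\top$ and a monomial basis with hypothetical coefficients `theta`, so that $\sum_k \theta_k L^k x = U\,(p(\lambda) \odot U^\top x)$ with $p(\lambda) = \sum_k \theta_k \lambda^k$. It covers only this linear identity, not the paper's bounds.

```python
import numpy as np


def normalized_laplacian(A):
    """I - D^{-1/2} A D^{-1/2} for an adjacency matrix with no isolated nodes."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]


def filter_vertex_domain(L, x, theta):
    """Apply sum_k theta[k] * L^k to x by repeated matrix-vector products."""
    out = np.zeros_like(x, dtype=float)
    Lk_x = x.astype(float)
    for t in theta:
        out += t * Lk_x
        Lk_x = L @ Lk_x
    return out


def filter_fourier_domain(L, x, theta):
    """Same filter as an element-wise scaling of graph Fourier coefficients."""
    lam, U = np.linalg.eigh(L)           # fixed spectrum of the graph
    response = sum(t * lam ** k for k, t in enumerate(theta))
    return U @ (response * (U.T @ x))    # transform, scale per frequency, invert
```

In the Fourier form the spectrum `lam` is fixed by the graph and only `theta` is trainable, which is the separation the abstract refers to. Values of `response` above 1 in magnitude indicate frequencies that a layer amplifies.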

245. ❌ Orthogonal Learner for Estimating Heterogeneous Long-Term Treatment Effects

Authors: Haorui Ma, Dennis Frauen, Valentyn Melnychuk, Stefan Feuerriegel Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00915v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

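Scoring rationale: The paper focuses on estimating heterogeneous long-term treatment effects in causal inference and proposes new orthogonal learners (LT-O-learners) to handle low overlap in treatment or in long-term observation. It involves machine learning, statistical estimation, and causal inference, but not large language models, deep learning technical principles, AI for Science applications, or any of the specific techniques in the scoring keywords (MoE, RLHF, RAG, and so on). The keywords all relate to large models, deep learning techniques, or scientific AI applications, whereas this paper applies classical machine learning to causal inference, so all keyword relevance scores are 0.

!!! tip deepseek-chat TL;DR

The paper proposes LT-O-learners, a set of orthogonal learners that address unstable estimation of heterogeneous long-term treatment effects caused by low overlap in treatment or in long-term observation, and supports their robustness in low-overlap settings with theoretical analysis and experiments.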

Abstract

Estimation of heterogeneous long-term treatment effects (HLTEs) is widely used for personalized decision-making in marketing, economics, and medicine, where short-term randomized experiments are often combined with long-term observational data. However, HLTE estimation is challenging due to limited overlap in treatment or in observing long-term outcomes for certain subpopulations, which can lead to unstable HLTE estimates with large finite-sample variance. To address this challenge, we introduce the LT-O-learners (Long-Term Orthogonal Learners), a set of novel orthogonal learners for HLTE estimation. The learners are designed for the canonical HLTE setting that combines a short-term randomized dataset $\mathcal{D}_1$ with a long-term historical dataset $\mathcal{D}_2$. The key idea of our LT-O-Learners is to retarget the learning objective by introducing custom overlap weights that downweight samples with low overlap in treatment or in long-term observation. We show that the retargeted loss is equivalent to the weighted oracle loss and satisfies Neyman-orthogonality, which means our learners are robust to errors in the nuisance estimation. We further provide a general error bound for the LT-O-Learners and give the conditions under which quasi-oracle rate can be achieved. Finally, our LT-O-learners are model-agnostic and can thus be instantiated with arbitrary machine learning models. We conduct empirical evaluations on synthetic and semi-synthetic benchmarks to confirm the theoretical properties of our LT-O-Learners, especially the robustness in low-overlap settings. To the best of our knowledge, ours are the first orthogonal learners for HLTE estimation that are robust to low overlap that is common in long-term outcomes.

Keywords: heterogeneous long-term treatment effects, orthogonal learners, low overlap, causal inference, machine learning, robust estimation, treatment effect estimation, personalized decision-making

246. ❌ Event Embedding of Protein Networks: Compositional Learning of Biological Function

Authors: Antonin Sulc Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00911v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

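Scoring rationale: The paper studies a sequence embedding method (Event2Vec) for protein-protein interaction networks. It belongs to bioinformatics and is highly relevant to the keyword "AI for Science OR Bioinformatics OR Cheminformatics" (10 points). It does not touch on large models, the technical principles of deep learning, or any specific technique named in the other keywords (such as MoE, SFT, or RAG), so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The paper studies Event2Vec, a sequence embedding method that enforces compositional structure, on protein-protein interaction networks. It finds that compositionality substantially improves pathway coherence, functional analogy accuracy, and hierarchical pathway organization.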

Abstract

In this work, we study whether enforcing strict compositional structure in sequence embeddings yields meaningful geometric organization when applied to protein-protein interaction networks. Using Event2Vec, an additive sequence embedding model, we train 64-dimensional representations on random walks from the human STRING interactome, and compare against a DeepWalk baseline based on Word2Vec, trained on the same walks. We find that compositional structure substantially improves pathway coherence (30.2$\times$ vs 2.9$\times$ above random), functional analogy accuracy (mean similarity 0.966 vs 0.650), and hierarchical pathway organization, while geometric properties such as norm–degree anticorrelation are shared with or exceeded by the non-compositional baseline. These results indicate that enforced compositionality specifically benefits relational and compositional reasoning tasks in biological networks.

Keywords: protein-protein interaction networks, sequence embeddings, compositional structure, Event2Vec, biological networks, pathway coherence, functional analogy, hierarchical organization
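
The abstract above contrasts an additive (compositional) sequence embedding with a non-compositional baseline. A minimal sketch of what additivity buys, assuming only that a walk is represented by the sum of its node vectors: the toy graph, the hand-written vectors, and the names `random_walk` and `embed_walk` are hypothetical and not taken from the paper, and no training is performed.

```python
import random

def random_walk(adj, start, length, rng):
    """Sample a walk of up to `length` nodes from an adjacency dict."""
    walk = [start]
    while len(walk) < length:
        neighbours = adj[walk[-1]]
        if not neighbours:
            break
        walk.append(rng.choice(neighbours))
    return walk

def embed_walk(walk, vectors):
    """Additive composition: a walk's embedding is the sum of its node vectors."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for node in walk:
        for i, x in enumerate(vectors[node]):
            total[i] += x
    return total

# Hypothetical 3-protein chain and 2-dimensional vectors (illustration only).
adj = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
vectors = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}

rng = random.Random(0)
walk = random_walk(adj, "A", 4, rng)

# Under additive composition, embedding a walk equals summing the
# embeddings of any split of that walk into two parts.
left, right = walk[:2], walk[2:]
whole = embed_walk(walk, vectors)
parts = [a + b for a, b in zip(embed_walk(left, vectors), embed_walk(right, vectors))]
```

A Word2Vec-style model such as the DeepWalk baseline gives no such guarantee for sequences, which is the property the comparison in the abstract turns on.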

247. ❌ Fatigue-Aware Learning to Defer via Constrained Optimisation

Authors: Zheng Zhang, Cuong C. Nguyen, David Rosewarne, Kevin Wells, Gustavo Carneiro Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00904v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

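Scoring rationale: The paper studies learning to defer (L2D) in human-AI collaboration and proposes FALCON, a method that accounts for human fatigue, using a Constrained Markov Decision Process (CMDP) and PPO-Lagrangian training. Although it involves AI systems, its core is the deferral mechanism, human fatigue modelling, and constrained optimisation, which have no direct connection to the listed keywords on large models, deep learning techniques, or scientific applications. None of the keywords appears in the title or abstract, and no related technical concept is involved, so every relevance score is 0.

!!! tip deepseek-chat TL;DR

Existing learning-to-defer methods ignore human fatigue, which degrades performance. The paper proposes FALCON, a fatigue-aware learning-to-defer method that models human fatigue dynamics and uses constrained optimisation. It outperforms existing methods in human-AI collaboration and generalises zero-shot to experts with different fatigue patterns.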

Abstract

Learning to defer (L2D) enables human-AI cooperation by deciding when an AI system should act autonomously or defer to a human expert. Existing L2D methods, however, assume static human performance, contradicting well-established findings on fatigue-induced degradation. We propose Fatigue-Aware Learning to Defer via Constrained Optimisation (FALCON), which explicitly models workload-varying human performance using psychologically grounded fatigue curves. FALCON formulates L2D as a Constrained Markov Decision Process (CMDP) whose state includes both task features and cumulative human workload, and optimises accuracy under human-AI cooperation budgets via PPO-Lagrangian training. We further introduce FA-L2D, a benchmark that systematically varies fatigue dynamics from near-static to rapidly degrading regimes. Experiments across multiple datasets show that FALCON consistently outperforms state-of-the-art L2D methods across coverage levels, generalises zero-shot to unseen experts with different fatigue patterns, and demonstrates the advantage of adaptive human-AI collaboration over AI-only or human-only decision-making when coverage lies strictly between 0 and 1.

Keywords: Learning to Defer, Human-AI cooperation, Fatigue modeling, Constrained Markov Decision Process, PPO-Lagrangian, Adaptive collaboration, Zero-shot generalization, Benchmark FA-L2D
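
The abstract above combines two ideas: human accuracy that decays with cumulative workload, and a deferral budget enforced through a Lagrangian. The sketch below illustrates both with a generic dual-ascent loop. It is not PPO-Lagrangian and not the authors' implementation; the exponential fatigue curve, every constant, and the function names are assumptions made for illustration.

```python
import math

def human_accuracy(workload, base=0.95, floor=0.60, rate=0.05):
    """Hypothetical fatigue curve: accuracy decays from `base` towards `floor`."""
    return floor + (base - floor) * math.exp(-rate * workload)

def run_episode(ai_confidences, lam):
    """Defer when the fatigue-adjusted human accuracy, minus the price `lam`
    of using the human, still beats the AI's confidence."""
    workload = 0
    decisions = []
    for conf in ai_confidences:
        defer = human_accuracy(workload) - lam > conf
        decisions.append(defer)
        if defer:
            workload += 1  # each deferral adds to cumulative workload
    return decisions

def dual_update(lam, deferral_rate, budget, lr=0.5):
    """Dual ascent: raise the price when the budget is exceeded, never below 0."""
    return max(0.0, lam + lr * (deferral_rate - budget))

confidences = [0.7] * 20   # hypothetical AI confidence per task
budget = 0.25              # at most 25% of tasks may be deferred
lam = 0.0
for _ in range(50):
    decisions = run_episode(confidences, lam)
    lam = dual_update(lam, sum(decisions) / len(decisions), budget)
```

Because the state carries `workload`, the same task can be deferred early in an episode and handled by the AI later, which a static-accuracy L2D rule cannot express.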

248. ❌ Accurate and Scalable Matrix Mechanisms via Divide and Conquer

Authors: Guanlin He, Yingtai Xiao, Jiamu Bai, Xin Gu, Zeyu Ding, Wenpeng Yin, Daniel Kifer Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00868v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

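Scoring rationale: The paper studies a method for optimising matrix mechanisms in differential privacy (QuerySmasher), which falls under privacy-preserving techniques. It does not involve large models, deep learning, AI for Science, or any technical topic among the scoring keywords. All keywords concern the principles, training methods, inference optimisation, or application areas of large models, whereas this paper focuses on improving classical privacy-preserving algorithms, so every relevance score is 0.

!!! tip deepseek-chat TL;DR

The paper proposes QuerySmasher, a scalable divide-and-conquer matrix mechanism for accurate differentially private query answering, and proves that under sum squared error it can dominate existing methods (ResidualPlanner, ResidualPlanner+, and Weighted Fourier Factorizations).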

Abstract

Matrix mechanisms are often used to provide unbiased differentially private query answers when publishing statistics or creating synthetic data. Recent work has developed matrix mechanisms, such as ResidualPlanner and Weighted Fourier Factorizations, that scale to high dimensional datasets while providing optimality guarantees for workloads such as marginals and circular product queries. They operate by adding noise to a linearly independent set of queries that can compactly represent the desired workloads. In this paper, we present QuerySmasher, an alternative scalable approach based on a divide-and-conquer strategy. Given a workload that can be answered from various data marginals, QuerySmasher splits each query into sub-queries and re-assembles the pieces into mutually orthogonal sub-workloads. These sub-workloads represent small, low-dimensional problems that can be independently and optimally answered by existing low-dimensional matrix mechanisms. QuerySmasher then stitches these solutions together to answer queries in the original workload. We show that QuerySmasher subsumes prior work, like ResidualPlanner (RP), ResidualPlanner+ (RP+), and Weighted Fourier Factorizations (WFF). We prove that it can dominate those approaches, under sum squared error, for all workloads. We also experimentally demonstrate the scalability and accuracy of QuerySmasher.

Keywords: Matrix Mechanisms, Differential Privacy, QuerySmasher, Divide and Conquer, Scalability, Sum Squared Error, Workload Optimization, ResidualPlanner
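
For readers unfamiliar with the term, the generic matrix mechanism that the abstract above builds on can be sketched as follows: add noise to a set of strategy queries, reconstruct the data vector by least squares, then answer the workload from the reconstruction. This is a textbook-style illustration using NumPy, not QuerySmasher. The histogram, workload, identity strategy, and `sigma` are hypothetical, and `sigma` is not calibrated to any privacy budget here.

```python
import numpy as np

def matrix_mechanism(W, A, x, sigma, rng):
    """Answer workload W on data vector x through strategy A.

    Noise is added to the strategy answers A @ x, x is reconstructed with
    the pseudo-inverse of A, and the workload is answered from that
    reconstruction. A is assumed to have full column rank, so the result
    is unbiased.
    """
    noisy = A @ x + rng.normal(0.0, sigma, size=A.shape[0])
    x_hat = np.linalg.pinv(A) @ noisy
    return W @ x_hat

x = np.array([10.0, 20.0, 30.0, 40.0])   # hypothetical histogram
W = np.tril(np.ones((4, 4)))             # workload: all prefix sums
A = np.eye(4)                            # simplest strategy: noisy cell counts

rng = np.random.default_rng(0)
exact = W @ x
noiseless = matrix_mechanism(W, A, x, 0.0, rng)  # sanity check without noise
private = matrix_mechanism(W, A, x, 1.0, rng)    # noisy workload answers
```

The choice of `A` determines the error on `W`. The methods named in the abstract differ in how they choose and factor that strategy at scale.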

249. ❌ Policy Improvement Reinforcement Learning

Authors: Huaiyang Wang, Xiaojie Li, Deqing Wang, Haoyi Zhou, Zixuan Huang, Yaodong Yang, Jianxin Li, Yikun Ban Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00860v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

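Scoring rationale: The paper proposes the PIRL/PIPO framework, which targets reinforcement-learning optimisation in the post-training of large language models (LLMs). Its core concerns Post-training/SFT (10 points), RLHF/DPO-related techniques (10 points), and a self-correction mechanism (10 points). It explicitly targets improving the reasoning ability of LLMs (10 points) and is evaluated on tasks involving multi-step reasoning (5 points) and in-depth reasoning (5 points). Other keywords such as MoE, quantization, and RAG are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses blind updates in reinforcement-learning methods for LLM post-training. It proposes the Policy Improvement Reinforcement Learning (PIRL) framework and the Policy Improvement Policy Optimization (PIPO) algorithm, which self-corrects through closed-loop verification and improves stability and performance on mathematical reasoning benchmarks over GRPO and its variants.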

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become a central post-training paradigm for improving the reasoning capabilities of large language models. Yet existing methods share a common blind spot: they optimize policies based on instantaneous group-level or batch-level statistics without ever verifying whether the resulting update actually improved the model. This open-loop design – updating in isolation at each step, guided only by within-group (batch) reward signals – means optimization can drift or collapse with no mechanism to detect and correct these failures. We argue that the missing ingredient is policy improvement feedback: the ability to measure and optimize inter-iteration progress directly. To this end, we introduce Policy Improvement Reinforcement Learning (PIRL), a framework that replaces surrogate reward maximization with the explicit objective of maximizing cumulative policy improvement across iterations, and prove this temporal objective is perfectly aligned with maximizing final task performance. Building on PIRL, we propose Policy Improvement Policy Optimization (PIPO), which implements closed-loop optimization through retrospective verification. At each iteration, PIPO evaluates whether the previous update yielded genuine improvement against a sliding-window historical baseline, then actively reinforces beneficial updates and suppresses the harmful ones – transforming an open-loop process into a self-correcting one. We provide theoretical analysis showing that PIPO performs ascent on the PIRL objective in expectation, and experiments on mathematical reasoning benchmarks demonstrate improved stability and performance over GRPO and its variants.

Keywords: Reinforcement Learning, Large Language Models, Post-training, Policy Improvement, Self-correcting, Mathematical Reasoning, Closed-loop Optimization, Verification
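
The closed-loop idea in the abstract above, checking each update against a sliding-window historical baseline and then reinforcing or suppressing it, can be sketched in a few lines. The abstract does not give PIPO's actual objective, so the class `RetrospectiveVerifier`, the mean-of-window baseline, and the sign-based weight below are assumptions made for illustration only.

```python
from collections import deque

class RetrospectiveVerifier:
    """Compare each new evaluation score with the mean of the last `window` scores."""

    def __init__(self, window=3):
        self.history = deque(maxlen=window)

    def verify(self, score):
        # With no history there is nothing to compare against yet.
        if self.history:
            baseline = sum(self.history) / len(self.history)
            improvement = score - baseline
        else:
            improvement = 0.0
        self.history.append(score)
        return improvement

def update_weight(improvement):
    """Hypothetical rule: reinforce helpful updates (+1), suppress harmful ones (-1)."""
    if improvement > 0:
        return 1.0
    if improvement < 0:
        return -1.0
    return 0.0

verifier = RetrospectiveVerifier(window=3)
scores = [0.40, 0.45, 0.41, 0.50]   # hypothetical per-iteration accuracy
weights = [update_weight(verifier.verify(s)) for s in scores]
# weights -> [0.0, 1.0, -1.0, 1.0]
```

An open-loop method corresponds to always returning a weight of 1.0: the third update above, which fell below the running baseline, would be kept rather than suppressed.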

250. ❌ Optimal Brain Decomposition for Accurate LLM Low-Rank Approximation

Authors: Yuhang Li, Donghyun Lee, Ruokai Yin, Priyadarshini Panda Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00821v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

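Scoring rationale: The paper focuses on low-rank decomposition for LLMs. It directly concerns LLMs (10 points) and parameter-efficient fine-tuning (10 points), on the grounds that low-rank decomposition is a form of PEFT. Low-rank decomposition is also a model compression technique (10 points, awarded for the model-compression part of the "Quantization OR Model Compression OR Low-bit Weights" keyword) and can improve inference efficiency (5 points). The paper mentions use in fine-tuning (5 points) but does not cover other topics such as MoE, quantization itself, or alignment.

!!! tip deepseek-chat TL;DR

The paper proposes OBD-LLM, a low-rank decomposition method for LLM weight matrices that uses second-order Hessian information, and reports roughly 20-40% better results than the previous state-of-the-art decomposition method, SVD-LLM.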

Abstract

Low-rank decomposition has emerged as an important problem in Large Language Model (LLM) fine-tuning and inference. Through Singular Value Decomposition (SVD), the weight matrix can be factorized into low-rank spaces optimally. Previously, a common practice was to decompose the weight in the activation-whitened space, and then achieve satisfying results. In this work, we propose Optimal Brain Decomposition LLM (OBD-LLM), which studies the decomposition problem in the model space by utilizing second-order Hessian information. Through a rigorous Kronecker-factorization of the Hessian, we show that the decomposition needs to consider both input and output information of the layer, and achieves much better decomposition results compared to input only method. Our loss-aware decomposition method involves a bi-directional whitening on the weight matrix. As a result, OBD-LLM is a closed-form solution for the optimal decomposition of weights in the language model. Remarkably, we achieve ~20-40% better results than previous state-of-the-art decomposition methods, the SVD-LLM.

Keywords: Low-rank decomposition, Large Language Model, Singular Value Decomposition, Hessian information, Parameter-efficient fine-tuning, Model compression, Weight matrix factorization, Inference optimization
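
The "bi-directional whitening" mentioned in the abstract above can be illustrated with a standard weighted low-rank result: for invertible matrices $A$ and $B$, the rank-$r$ matrix minimising $\|A(W-\hat{W})B\|_F$ is $A^{-1}\,\mathrm{trunc}_r(AWB)\,B^{-1}$, by the Eckart-Young theorem applied to $AWB$. The NumPy sketch below uses random diagonal $A$ and $B$ as stand-ins. In the paper these factors come from a Kronecker factorization of the Hessian, which is not reproduced here, so this is not the authors' implementation.

```python
import numpy as np

def truncate_rank(M, r):
    """Best rank-r approximation of M in Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def whitened_low_rank(W, A, B, r):
    """Rank-r minimiser of ||A (W - W_hat) B||_F for invertible A and B."""
    core = truncate_rank(A @ W @ B, r)          # truncate in the whitened space
    return np.linalg.solve(A, core) @ np.linalg.inv(B)  # map back

def weighted_error(W, W_hat, A, B):
    return np.linalg.norm(A @ (W - W_hat) @ B)

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 5))                  # hypothetical weight matrix
A = np.diag(rng.uniform(0.5, 3.0, size=6))       # stand-in output-side factor
B = np.diag(rng.uniform(0.5, 3.0, size=5))       # stand-in input-side factor

plain = truncate_rank(W, 2)                      # SVD with no whitening
two_sided = whitened_low_rank(W, A, B, 2)        # whitened on both sides
```

Setting `A` to the identity recovers one-sided whitening, so the two-sided form can never do worse under the two-sided weighted error.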

251. ❌ Deconfounding Scores and Representation Learning for Causal Effect Estimation with Weak Overlap

Authors: Oscar Clivio, Alexander D’Amour, Alexander Franks, David Bruns-Smith, Chris Holmes, Avi Feller Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00811v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

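Scoring rationale: The paper focuses on statistical methods for causal effect estimation, in particular the overlap problem with high-dimensional features, and proposes a class of feature representations called "deconfounding scores". Its content belongs entirely to classical causal inference and statistical machine learning and does not involve large language models, deep learning techniques, AI applications, or related technical keywords. All scoring keywords relate to large models, deep learning, or AI techniques, whereas this paper studies a classical statistical causal inference problem, so every relevance score is 0.

!!! tip deepseek-chat TL;DR

For the overlap problem in causal effect estimation with high-dimensional features, the paper proposes a class of feature representations called "deconfounding scores". For a broad family of generalized linear models with Gaussian features, it shows that prognostic scores are overlap-optimal within this class.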

Abstract

Overlap, also known as positivity, is a key condition for causal treatment effect estimation. Many popular estimators suffer from high variance and become brittle when features differ strongly across treatment groups. This is especially challenging in high dimensions: the curse of dimensionality can make overlap implausible. To address this, we propose a class of feature representations called deconfounding scores, which preserve both identification and the target of estimation; the classical propensity and prognostic scores are two special cases. We characterize the problem of finding a representation with better overlap as minimizing an overlap divergence under a deconfounding score constraint. We then derive closed-form expressions for a class of deconfounding scores under a broad family of generalized linear models with Gaussian features and show that prognostic scores are overlap-optimal within this class. We conduct extensive experiments to assess this behavior empirically.

Keywords: causal effect estimation, overlap, positivity, deconfounding scores, prognostic scores, high-dimensional features, generalized linear models, Gaussian features

252. ❌ MIRANDA: MId-feature RANk-adversarial Domain Adaptation toward climate change-robust ecological forecasting with deep learning

Authors: Yuchang Jiang, Jan Dirk Wegner, Vivien Sainte Fare Garnot Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00800v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

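Scoring rationale: The paper applies deep learning to ecological forecasting (plant phenology), which falls under AI for Science and is highly relevant to the keyword "AI for Science OR Bioinformatics OR Cheminformatics" (8 points). Its core contribution is MIRANDA, a domain adaptation technique for climate-change-induced distribution shift, which is highly relevant to the keyword "Pre-training OR Continual Pre-training OR Domain Adaptation" (10 points). It does not involve large language models, MoE, reasoning, alignment, compression, agents, or other large-model techniques, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

Deep learning models for plant phenology prediction lose performance when climate change shifts the data distribution. The paper proposes MIRANDA, a mid-feature rank-adversarial domain adaptation method that improves robustness to climatic distribution shift and narrows the performance gap with mechanistic models.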

Abstract

Plant phenology modelling aims to predict the timing of seasonal phases, such as leaf-out or flowering, from meteorological time series. Reliable predictions are crucial for anticipating ecosystem responses to climate change. While phenology modelling has traditionally relied on mechanistic approaches, deep learning methods have recently been proposed as flexible, data-driven alternatives with often superior performance. However, mechanistic models tend to outperform deep networks when data distribution shifts are induced by climate change. Domain Adaptation (DA) techniques could help address this limitation. Yet, unlike standard DA settings, climate change induces a temporal continuum of domains and involves both a covariate and label shift, with warmer records and earlier start of spring. To tackle this challenge, we introduce Mid-feature Rank-adversarial Domain Adaptation (MIRANDA). Whereas conventional adversarial methods enforce domain invariance on final latent representations, an approach that does not explicitly address label shift, we apply adversarial regularization to intermediate features. Moreover, instead of a binary domain-classification objective, we employ a rank-based objective that enforces year-invariance in the learned meteorological representations. On a country-scale dataset spanning 70 years and comprising 67,800 phenological observations of 5 tree species, we demonstrate that, unlike conventional DA approaches, MIRANDA improves robustness to climatic distribution shifts and narrows the performance gap with mechanistic models.

Keywords: Domain Adaptation, Deep Learning, Climate Change, Phenology Modelling, Ecological Forecasting, Adversarial Learning, Distribution Shift, Robustness

253. ❌ Exploring Silent Data Corruption as a Reliability Challenge in LLM Training

Authors: Anton Altenbernd, Philipp Wiesner, Odej Kao Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00726v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

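Scoring rationale: The paper studies a hardware reliability problem in LLM pretraining (Silent Data Corruption) and directly concerns the keywords "Large Language Models" and "Pre-training". Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and are unrelated to the paper's topic.

!!! tip deepseek-chat TL;DR

The paper studies silent data corruption caused by hardware faults during LLM pretraining. Based on fault-injection experiments, it proposes a lightweight detection method and shows that recomputing the most recent training step upon detection can effectively mitigate the impact of these events.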

Abstract

As Large Language Models (LLMs) scale in size and complexity, the consequences of failures during training become increasingly severe. A major challenge arises from Silent Data Corruption (SDC): hardware-induced faults that bypass system-level detection mechanisms. SDC may behave like benign numerical noise, but can also cause harmful gradient corruption that leads to loss spikes, divergence, or stalled progress. This work provides a controlled study of how intermittent SDC affects LLM pretraining. Using targeted fault injection at the level of GPU matrix-multiply instructions, we characterize the sensitivity of different bit positions, kernel functions, and execution stages. Our analysis shows that locally originating faults can produce impactful corruption, including NaN propagation, short-lived spikes in loss, gradient norm, and attention logits, as well as persistent parameter divergence. Building on the observed corruption signatures, we propose a lightweight detection method that identifies potentially harmful parameter updates. Experiments on LLaMA models with 60M, 350M, and 1.3B parameters demonstrate that recomputing the most recent training step upon detection can effectively mitigate the impact of these events.

Keywords: Large Language Models, LLM training, Silent Data Corruption, fault injection, pretraining, reliability, GPU matrix-multiply, parameter divergence
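The abstract reports that sensitivity to faults depends on which bit position is hit. As a rough CPU-side illustration only (the paper injects faults at the level of GPU matrix-multiply instructions, which this sketch does not reproduce), the snippet below flips single bits of an IEEE 754 float32 value. The helper name `flip_bit` and the choice of float32 are assumptions made for this example.

```python
import math
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant) of a float32 and return the result."""
    # Reinterpret the float32 bytes as an unsigned 32-bit integer.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    # XOR toggles exactly the selected bit, then reinterpret as float32 again.
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


low_mantissa = flip_bit(1.0, 0)   # lowest mantissa bit: differs from 1.0 by 2**-23
sign = flip_bit(1.0, 31)          # sign bit: -1.0
top_exponent = flip_bit(1.0, 30)  # highest exponent bit: exponent is all ones, so inf

print(low_mantissa, sign, top_exponent, math.isinf(top_exponent))
```

A flip in a low mantissa bit resembles benign numerical noise, whereas a flip in a high exponent bit can turn a finite value into infinity, which is one way non-finite values can begin to propagate through later arithmetic.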

254. ❌ ActivityNarrated: An Open-Ended Narrative Paradigm for Wearable Human Activity Understanding

Authors: Lala Shakti Swarup Ray, Mengxi Liu, Alcina Pinto, Deepika Gurung, Daniel Geissler, Paul Lukowicz, Bo Zhou Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00767v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

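Scoring rationale: The paper focuses on human activity understanding with wearable devices and proposes an open-ended narrative modeling approach that aligns sensor data with natural-language descriptions. Its core contributions are a sensor-language alignment framework, retrieval-based evaluation, and a language-conditioned learning architecture, which place it in wearable computing and activity recognition. Of the 27 keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" is partly relevant (5 points), because the paper touches on AI applied to wearable health and behavioral science, although it is not core bioinformatics or cheminformatics. The other 26 keywords (such as LLMs, MoE, Scaling Laws, RLHF, RAG, CoT, Agents, and Quantization) are not mentioned in the title or abstract and are unrelated to the paper's content (0 points). The paper involves neither innovation in the technical principles of large models nor in-depth application of deep learning to science; its main contribution is a paradigm shift in activity recognition.

!!! tip deepseek-chat TL;DR

Addressing the limitations of closed-set classification in wearable human activity recognition, this paper proposes an open-ended narrative modeling framework that aligns sensor data with natural-language descriptions. It achieves robust understanding of open-vocabulary activities and substantially outperforms conventional baselines under cross-participant evaluation.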

Abstract (Chinese translation)

可穿戴人体活动识别技术虽稳步发展，但多数进展仍依赖于闭集分类范式，这限制了其在实际场景中的应用。现实中的人类活动具有开放性、非预设性、个性化及组合性等特点，其展开方式更接近叙事流而非固定类别的实例集合。我们认为，弥合这一差距并非仅需扩大数据集或模型规模，而需从根本上重构可穿戴活动识别的任务定义、监督方式与评估体系。本研究展示了如何在开放词汇环境下，通过对齐可穿戴传感器数据与自然语言描述来建模开放性活动叙事。我们的框架包含三个核心组成部分：首先，我们提出一种自然主义数据采集与标注流程，将多位置可穿戴传感与自由形式的、时间对齐的行为叙事描述相结合，使活动语义能够脱离预定义词汇表而自然涌现；其次，我们设计了一套基于检索的评估框架，用于衡量传感器数据与语言之间的语义对齐程度，从而在无需固定类别的前提下实现原则性评估，同时将闭集分类作为其特例涵盖其中；第三，我们提出一种语言条件学习架构，支持对可变长度传感器流及异构传感器布局进行传感器到文本的推理。实验表明，基于固定标签目标训练的模型在现实场景的变异下性能急剧下降，而开放词汇的传感器-语言对齐方法则能产生鲁棒且语义 grounded 的表征。一旦习得这种对齐关系，闭集活动识别即可转化为简单的下游任务。在跨被试评估中，本方法取得了65.3%的宏平均F1分数，而强闭集基线方法仅获得31-34%的结果。这些发现确立了开放性叙事建模作为现实世界可穿戴活动识别的实用且有效的理论基础。

Abstract

Wearable HAR has improved steadily, but most progress still relies on closed-set classification, which limits real-world use. In practice, human activity is open-ended, unscripted, personalized, and often compositional, unfolding as narratives rather than instances of fixed classes. We argue that addressing this gap does not require simply scaling datasets or models. It requires a fundamental shift in how wearable HAR is formulated, supervised, and evaluated. This work shows how to model open-ended activity narratives by aligning wearable sensor data with natural-language descriptions in an open-vocabulary setting. Our framework has three core components. First, we introduce a naturalistic data collection and annotation pipeline that combines multi-position wearable sensing with free-form, time-aligned narrative descriptions of ongoing behavior, allowing activity semantics to emerge without a predefined vocabulary. Second, we define a retrieval-based evaluation framework that measures semantic alignment between sensor data and language, enabling principled evaluation without fixed classes while also subsuming closed-set classification as a special case. Third, we present a language-conditioned learning architecture that supports sensor-to-text inference over variable-length sensor streams and heterogeneous sensor placements. Experiments show that models trained with fixed-label objectives degrade sharply under real-world variability, while open-vocabulary sensor-language alignment yields robust and semantically grounded representations. Once this alignment is learned, closed-set activity recognition becomes a simple downstream task. Under cross-participant evaluation, our method achieves 65.3% Macro-F1, compared with 31-34% for strong closed-set HAR baselines. These results establish open-ended narrative modeling as a practical and effective foundation for real-world wearable HAR.

Keywords: Wearable HAR, Open-ended activity narratives, Sensor-language alignment, Open-vocabulary setting, Retrieval-based evaluation, Language-conditioned learning, Naturalistic data collection
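The abstract describes a retrieval-based evaluation that measures semantic alignment between sensor data and language. One common way to score such alignment is recall@k over cosine similarities, sketched below. This is an assumption about the general technique, not the paper's actual protocol; the function names and the convention that sensor segment `i` is paired with description `i` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def recall_at_k(sensor_embeddings, text_embeddings, k):
    """Fraction of sensor segments whose paired description ranks in the top k.

    Assumes sensor_embeddings[i] is paired with text_embeddings[i].
    """
    hits = 0
    for i, sensor in enumerate(sensor_embeddings):
        # Rank all descriptions by similarity to this sensor segment.
        ranked = sorted(
            range(len(text_embeddings)),
            key=lambda j: cosine(sensor, text_embeddings[j]),
            reverse=True,
        )
        if i in ranked[:k]:
            hits += 1
    return hits / len(sensor_embeddings)


# Toy embeddings: the third pair is deliberately misaligned.
sensors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
texts = [[0.9, 0.1], [0.1, 0.9], [1.0, -1.0]]
print(recall_at_k(sensors, texts, 1), recall_at_k(sensors, texts, 3))
```

If the candidate texts are restricted to a fixed list of class names, the top-1 retrieval reduces to ordinary classification, which matches the abstract's remark that closed-set classification is a special case.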

255. ❌ Inverse-Free Sparse Variational Gaussian Processes

Authors: Stefano Cortinovis, Laurence Aitchison, Stefanos Eleftheriadis, Mark van der Wilk Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00697v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

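Scoring rationale: The paper studies sparse variational approximations for Gaussian processes (GPs), focusing on optimization algorithms and computational efficiency, and belongs to probabilistic modeling in classical machine learning. It does not involve large language models (LLMs), deep learning, the technical principles of large models, or AI applications in science, and is unrelated to every scoring keyword.

!!! tip deepseek-chat TL;DR

This paper proposes an inverse-free optimization method for sparse variational Gaussian processes. By improving the variational bound and deriving a natural-gradient update that requires only matrix multiplications, it improves training stability and convergence speed, and it achieves performance comparable to traditional methods on regression and classification benchmarks.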

Abstract (Chinese translation)

高斯过程（Gaussian processes, GPs）具有吸引人的特性，但大规模训练成本高昂。稀疏变分高斯过程（Sparse variational GP, SVGP）近似降低了计算开销，但仍依赖于核矩阵的Cholesky分解，难以适配低精度、大规模并行硬件。虽然可以通过引入辅助矩阵参数，构建仅依赖矩阵乘法（matmuls）的有效变分下界，但使用现成的一阶优化方法对其进行优化仍具挑战性。我们通过提出一个条件更好的下界，并为辅助参数推导出仅需矩阵乘法的自然梯度更新，显著提升了稳定性和收敛性，从而使无逆矩阵方法变得实用。我们进一步提供了简单的启发式策略，如步长调度和停止准则，使整体优化流程能无缝融入现有工作流。在回归和分类基准测试中，我们证明所提方法：1）可作为基于SVGP的模型（如深度高斯过程）的直接替代方案；2）能恢复与传统方法相近的性能；3）在充分调优时可比基线方法更快。

Abstract

Gaussian processes (GPs) offer appealing properties but are costly to train at scale. Sparse variational GP (SVGP) approximations reduce cost yet still rely on Cholesky decompositions of kernel matrices, ill-suited to low-precision, massively parallel hardware. While one can construct valid variational bounds that rely only on matrix multiplications (matmuls) via an auxiliary matrix parameter, optimising them with off-the-shelf first-order methods is challenging. We make the inverse-free approach practical by proposing a better-conditioned bound and deriving a matmul-only natural-gradient update for the auxiliary parameter, markedly improving stability and convergence. We further provide simple heuristics, such as step-size schedules and stopping criteria, that make the overall optimisation routine fit seamlessly into existing workflows. Across regression and classification benchmarks, we demonstrate that our method 1) serves as a drop-in replacement in SVGP-based models (e.g., deep GPs), 2) recovers similar performance to traditional methods, and 3) can be faster than baselines when well tuned.

Keywords: Gaussian processes, sparse variational GP, inverse-free optimization, matrix multiplication, natural-gradient update, regression, classification, computational efficiency

256. ❌ Full-Gradient Successor Feature Representations

Authors: Ritish Shrirao, Aditya Priyadarshi, Raghuram Bharadwaj Diddigi Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00686v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

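Scoring rationale: The paper focuses on Successor Features and the Generalized Policy Improvement (GPI) framework in reinforcement learning (RL), proposing a new full-gradient optimization algorithm (FG-SFRQL) to improve sample efficiency and transfer performance. The content is entirely about algorithmic improvements in reinforcement learning and does not involve large language models (LLMs), deep learning principles, AI for Science applications, or any technique in the scoring keyword list. All keywords concern large language models, deep learning techniques, or specific scientific AI applications, whereas this paper is purely reinforcement learning algorithm research, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

This paper proposes Full-Gradient Successor Feature Representations Q-Learning (FG-SFRQL), which optimizes successor features by minimizing the full Mean Squared Bellman Error and achieves better sample efficiency and performance in multi-task transfer learning for reinforcement learning.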

Abstract (Chinese translation)

后继特征（Successor Features，SF）与广义策略改进（Generalized Policy Improvement，GPI）相结合，通过将环境动态与奖励函数解耦，为强化学习（Reinforcement Learning，RL）中的迁移学习提供了一个稳健的框架。然而，标准的SF学习方法通常依赖于半梯度时序差分（Temporal Difference，TD）更新。当与非线性函数逼近结合使用时，半梯度方法缺乏可靠的收敛保证，并可能导致不稳定性，这在多任务场景中尤为突出，因为准确的特征估计对于有效的GPI至关重要。受全梯度DQN（Full Gradient DQN）的启发，我们提出了全梯度后继特征表示Q学习（Full-Gradient Successor Feature Representations Q-Learning，FG-SFRQL），该算法通过最小化完整的均方贝尔曼误差来优化后继特征。与标准方法不同，我们的方法同时计算在线网络和目标网络参数对应的梯度。我们为FG-SFRQL提供了几乎必然收敛的理论证明，并通过实验证明，在离散和连续领域中，相较于半梯度基线方法，最小化完整残差能带来更优的样本效率和迁移性能。

Abstract

Successor Features (SF) combined with Generalized Policy Improvement (GPI) provide a robust framework for transfer learning in Reinforcement Learning (RL) by decoupling environment dynamics from reward functions. However, standard SF learning methods typically rely on semi-gradient Temporal Difference (TD) updates. When combined with non-linear function approximation, semi-gradient methods lack robust convergence guarantees and can lead to instability, particularly in the multi-task setting where accurate feature estimation is critical for effective GPI. Inspired by Full Gradient DQN, we propose Full-Gradient Successor Feature Representations Q-Learning (FG-SFRQL), an algorithm that optimizes the successor features by minimizing the full Mean Squared Bellman Error. Unlike standard approaches, our method computes gradients with respect to parameters in both the online and target networks. We provide a theoretical proof of almost-sure convergence for FG-SFRQL and demonstrate empirically that minimizing the full residual leads to superior sample efficiency and transfer performance compared to semi-gradient baselines in both discrete and continuous domains.

Keywords: Successor Features, Generalized Policy Improvement, Transfer Learning, Reinforcement Learning, Full Gradient, Temporal Difference, Multi-task Learning, Sample Efficiency
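The difference between semi-gradient and full-gradient updates named in the abstract can be seen on a much simpler problem than the paper's. The sketch below assumes a scalar value function with linear features and a single weight vector, and takes the loss to be half the squared Bellman error; it is not FG-SFRQL, and all function names are hypothetical.

```python
def td_error(w, phi, phi_next, reward, gamma):
    """Bellman error: reward + gamma * V(s') - V(s), with V(s) = w . phi(s)."""
    value = sum(wi * xi for wi, xi in zip(w, phi))
    value_next = sum(wi * xi for wi, xi in zip(w, phi_next))
    return reward + gamma * value_next - value


def semi_gradient(w, phi, phi_next, reward, gamma):
    """Gradient of 0.5 * delta**2 with the bootstrapped target held constant."""
    delta = td_error(w, phi, phi_next, reward, gamma)
    return [-delta * x for x in phi]


def full_gradient(w, phi, phi_next, reward, gamma):
    """Gradient of 0.5 * delta**2 that also differentiates through the target."""
    delta = td_error(w, phi, phi_next, reward, gamma)
    return [delta * (gamma * x_next - x) for x, x_next in zip(phi, phi_next)]


w = [0.5, -0.2]
phi, phi_next = [1.0, 0.0], [0.0, 1.0]
print(semi_gradient(w, phi, phi_next, reward=1.0, gamma=0.9))
print(full_gradient(w, phi, phi_next, reward=1.0, gamma=0.9))
```

The two gradients agree on the component tied to the current state's features and differ only in the term that flows through the next-state estimate, which the semi-gradient version ignores.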

257. ❌ Performance of Neural and Polynomial Operator Surrogates

Authors: Josephine Westermann, Benno Huber, Thomas O’Leary-Roseberry, Jakob Zech Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00689v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

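Scoring rationale: The paper studies the construction of surrogate operators for parameter-to-solution maps in parametric partial differential equations, comparing neural operators (including the Fourier neural operator) with polynomial surrogate methods (sparse-grid and tensor-train surrogates). It belongs to scientific computing and numerical analysis and is partly related to applying deep learning in science (AI for Science), but it does not involve specific large-model techniques such as large language models (LLMs), MoE, fine-tuning, alignment, inference optimization, or agents. All other keywords are unrelated to the paper, so every keyword scores 0 except "AI for Science", which scores 5.

!!! tip deepseek-chat TL;DR

This paper systematically compares neural operators and polynomial surrogate methods for parametric partial differential equations and finds that no single method is universally best: polynomial methods are more efficient for smooth inputs, the Fourier neural operator converges faster for rough inputs, and derivative-informed training improves data efficiency.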

Abstract (Chinese translation)

我们研究针对参数偏微分方程中参数-解映射构建代理算子的问题，该场景中重复的正向模型评估计算成本高昂。本文系统性地实证比较了神经算子代理方法（包括基于$L^2_μ$和$H^1_μ$目标训练的降基神经算子与傅里叶神经算子）与多项式代理方法（特别是降基稀疏网格代理和降基张量列车代理）。所有方法均在线性参数扩散问题和非线性参数超弹性问题上进行评估，测试输入场采用具有代数衰减谱系数的随机场，其衰减速率$s$可变。为确保公平比较，我们通过调整超参数生成代理模型集合，对比其成本与近似精度的帕累托前沿，并将成本分解为数据生成、模型构建和模型评估三个部分。研究结果表明，不存在普遍最优的方法。对于光滑输入场（$s \geq 2$），多项式代理方法具有显著更好的数据效率，其中稀疏网格代理的收敛速率与理论预测一致；对于粗糙输入场（$s \leq 1$），傅里叶神经算子展现出最快的收敛速率。基于导数信息的训练相较于标准$L^2_μ$训练能持续提升数据效率，当雅可比信息获取成本可控时，该方法为低数据量场景下的粗糙输入问题提供了具有竞争力的替代方案。这些发现凸显了根据问题正则性、精度需求及应用计算约束选择适配代理方法的重要性。

Abstract

We consider the problem of constructing surrogate operators for parameter-to-solution maps arising from parametric partial differential equations, where repeated forward model evaluations are computationally expensive. We present a systematic empirical comparison of neural operator surrogates, including a reduced-basis neural operator trained with $L^2_μ$ and $H^1_μ$ objectives and the Fourier neural operator, against polynomial surrogate methods, specifically a reduced-basis sparse-grid surrogate and a reduced-basis tensor-train surrogate. All methods are evaluated on a linear parametric diffusion problem and a nonlinear parametric hyperelasticity problem, using input fields with algebraically decaying spectral coefficients at varying rates of decay $s$. To enable fair comparisons, we analyze ensembles of surrogate models generated by varying hyperparameters and compare the resulting Pareto frontiers of cost versus approximation accuracy, decomposing cost into contributions from data generation, setup, and evaluation. Our results show that no single method is universally superior. Polynomial surrogates achieve substantially better data efficiency for smooth input fields ($s \geq 2$), with convergence rates for the sparse-grid surrogate in agreement with theoretical predictions. For rough inputs ($s \leq 1$), the Fourier neural operator displays the fastest convergence rates. Derivative-informed training consistently improves data efficiency over standard $L^2_μ$ training, providing a competitive alternative for rough inputs in the low-data regime when Jacobian information is available at reasonable cost. These findings highlight the importance of matching the surrogate methodology to the regularity of the problem as well as accuracy demands and computational constraints of the application.

Keywords: neural operators, polynomial surrogates, parametric PDEs, Fourier neural operator, sparse-grid surrogate, tensor-train surrogate, data efficiency, approximation accuracy
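The abstract compares methods through Pareto frontiers of cost versus approximation accuracy. A minimal sketch of extracting such a frontier from an ensemble of (cost, error) pairs is given below; the function name and the sample numbers are illustrative assumptions, not data from the paper.

```python
def pareto_front(points):
    """Return the (cost, error) pairs that no other pair dominates.

    A pair is kept only if its error is strictly lower than the error of
    every pair that costs the same or less.
    """
    front = []
    best_error = float("inf")
    # Sorting by cost (then error) lets one pass track the best error so far.
    for cost, error in sorted(points):
        if error < best_error:
            front.append((cost, error))
            best_error = error
    return front


# Hypothetical surrogate models: (training cost, approximation error).
models = [(1, 0.5), (2, 0.3), (2, 0.4), (3, 0.35), (4, 0.1)]
print(pareto_front(models))  # [(1, 0.5), (2, 0.3), (4, 0.1)]
```

Comparing frontiers rather than single tuned models avoids favoring whichever method happened to receive the better hyperparameter choice.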

258. ❌ Embedded Variational Neural Stochastic Differential Equations for Learning Heterogeneous Dynamics

Authors: Sandeep Kumar Samota, Reema Gupta, Snehashish Chakraverty Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00669v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

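Scoring rationale: The paper uses a Variational Neural Stochastic Differential Equation (V-NSDE) to model socioeconomic time-series data, which is an application of deep learning in science (social science). However, all scoring keywords explicitly target large language models (LLMs) and related techniques (such as MoE, RLHF, RAG, and quantization), and the paper involves no language models, no large-model techniques, and none of the bioinformatics or cheminformatics side of AI for Science. Its core is Neural SDEs and VAEs, which have no connection to the LLM techniques in the scoring keyword list.

!!! tip deepseek-chat TL;DR

This paper proposes a variational neural stochastic differential equation model for learning the complex spatiotemporal dynamics of socioeconomic data across the districts of Odisha, India, accurately modeling both trends and random fluctuations.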

Abstract (Chinese translation)

本研究探讨了随时间变化的社会经济因素相关复杂噪声数据建模的挑战，重点关注印度奥里萨邦各地区的此类数据。传统时间序列模型难以同时捕捉此类数据中的趋势与波动。为解决这一问题，本研究设计了一种变分神经随机微分方程（Variational Neural Stochastic Differential Equation, V-NSDE）模型，该模型融合了神经随机微分方程（Neural SDEs）的表达动力学特性与变分自编码器（Variational Autoencoders, VAEs）的生成能力。该模型包含编码器与解码器：编码器接收初始观测数据和地区嵌入向量，将其转换为高斯分布，以确定初始潜在状态的均值与对数方差；随后，所得潜在状态启动神经随机微分方程，该方程利用神经网络确定支配连续时间潜在动态的漂移函数与扩散函数。这些控制函数依赖于时间索引、潜在状态及地区嵌入向量，使模型能够学习各地区的独有特征。此后，通过概率解码器从潜在轨迹重构观测数据，解码器为每个时间步输出遵循高斯似然分布的均值与对数方差。通过在负对数似然（negative log-likelihood, nll）基础上增加KL散度正则项，证据下界（Evidence Lower Bound, ELBO）训练损失得到优化。实验结果表明，V-NSDE模型能有效学习随时间变化的复杂模式，生成包含不同地区清晰趋势与随机波动的现实性结果。

Abstract

This study examines the challenges of modeling complex and noisy data related to socioeconomic factors over time, with a focus on data from various districts in Odisha, India. Traditional time-series models struggle to capture both trends and variations together in this type of data. To tackle this, a Variational Neural Stochastic Differential Equation (V-NSDE) model is designed that combines the expressive dynamics of Neural SDEs with the generative capabilities of Variational Autoencoders (VAEs). This model uses an encoder and a decoder. The encoder takes the initial observations and district embeddings and translates them into a Gaussian distribution, which determines the mean and log-variance of the first latent state. Then the obtained latent state initiates the Neural SDE, which utilize neural networks to determine the drift and diffusion functions that govern continuous-time latent dynamics. These governing functions depend on the time index, latent state, and district embedding, which help the model learn the unique characteristics specific to each district. After that, using a probabilistic decoder, the observations are reconstructed from the latent trajectory. The decoder outputs a mean and log-variance for each time step, which follows the Gaussian likelihood. The Evidence Lower Bound (ELBO) training loss improves by adding a KL-divergence regularization term to the negative log-likelihood (nll). The obtained results demonstrate the effective learning of V-NSDE in recognizing complex patterns over time, yielding realistic outcomes that include clear trends and random fluctuations across different areas.

Keywords: Variational Neural Stochastic Differential Equations, V-NSDE, time-series modeling, socioeconomic data, district embeddings, Neural SDEs, Variational Autoencoders, heterogeneous dynamics
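The abstract's pipeline (a Gaussian initial latent state, an SDE with drift and diffusion functions, a Gaussian decoder, and a loss made of negative log-likelihood plus a KL term) can be outlined with scalar quantities. The sketch below is only a schematic under stated assumptions: a standard normal prior on the initial latent state, an Euler-Maruyama discretization, and hand-written drift and diffusion callables in place of neural networks. None of these choices or names come from the paper.

```python
import math
import random


def gaussian_nll(x, mean, log_var):
    """Negative log-likelihood of x under N(mean, exp(log_var))."""
    return 0.5 * (math.log(2 * math.pi) + log_var + (x - mean) ** 2 / math.exp(log_var))


def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, exp(log_var)) || N(0, 1) ); the N(0, 1) prior is an assumption."""
    return 0.5 * (math.exp(log_var) + mean ** 2 - 1.0 - log_var)


def euler_maruyama_step(z, t, dt, drift, diffusion, rng):
    """One discretized SDE step: dz = drift * dt + diffusion * dW."""
    noise = rng.gauss(0.0, 1.0)
    return z + drift(z, t) * dt + diffusion(z, t) * math.sqrt(dt) * noise


def negative_elbo(observations, means, log_vars, z0_mean, z0_log_var):
    """Reconstruction NLL summed over time steps plus the KL regularizer."""
    nll = sum(gaussian_nll(x, m, lv) for x, m, lv in zip(observations, means, log_vars))
    return nll + kl_to_standard_normal(z0_mean, z0_log_var)


rng = random.Random(0)
z = 1.0
for step in range(5):
    # Placeholder dynamics: mean-reverting drift with constant diffusion.
    z = euler_maruyama_step(z, step * 0.1, 0.1, lambda z, t: -z, lambda z, t: 0.2, rng)
print(z)
```

In the model described by the abstract, the drift and diffusion functions would additionally take a district embedding as input, which this scalar sketch omits.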

259. ❌ On rankings in multiplayer games with an application to the game of Whist

Authors: Alexis Coyette, Charles Modera, Candy Sonveaux, Judicaël Mohet, François-Grégoire Bierwart, Sylverio Pool Marquez, Jarod Ketcha Kouakep, Cédric Simal, Komlan Fiagbe, Violaine Piengeon, Martin Moriamé, Justine Bodart, Marie Dorchain, Maxime Lucas, Rommel Tchinda Djeudjo, Gianluca Peri, Eve Tilman Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00641v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

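Scoring rationale: The paper studies ranking in multiplayer games, proposing an extension of the Bradley-Terry model and applying it to the card game Whist. All keywords concern large models, deep learning, AI principles, or scientific AI applications, whereas this paper applies a classical statistical model to game theory and is unrelated to deep learning and large-model techniques, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

This paper proposes an extended Bradley-Terry model for ranking in multiplayer games and validates the method on synthetic datasets and a real card-game dataset.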

Abstract (Chinese translation)

我们提出了一种针对多人游戏的布拉德利-特里模型新扩展，并将纽曼[1]近期提出的一种算法适配于该模型。我们通过合成数据集与一个真实的纸牌游戏数据集，展示了所提方法的应用效果。

Abstract

We propose a novel extension of the Bradley-Terry model to multiplayer games and adapt a recent algorithm by Newman [1] to our model. We demonstrate the use of our proposed method on synthetic datasets and on a real dataset of games of cards.

Keywords: Bradley-Terry model, multiplayer games, ranking, Whist, card games, statistical model, synthetic datasets, real dataset
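For background, the standard two-player Bradley-Terry model assigns each player a positive strength and sets the probability that player i beats player j to s_i / (s_i + s_j). The sketch below fits those strengths with the classic fixed-point (minorization-maximization) iteration. It shows neither the paper's multiplayer extension nor the Newman algorithm the abstract mentions, and the function name is hypothetical.

```python
def fit_bradley_terry(wins, iterations=200):
    """Estimate pairwise Bradley-Terry strengths, normalized to sum to 1.

    wins[i][j] is the number of times player i beat player j. Every player
    needs at least one win and one loss for the estimates to stay positive
    and finite.
    """
    n = len(wins)
    strengths = [1.0] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each opponent contributes (games played) / (sum of current strengths).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom)
        scale = sum(updated)
        strengths = [s / scale for s in updated]
    return strengths


# Player 0 beat player 1 three times and lost once.
print(fit_bradley_terry([[0, 3], [1, 0]]))
```

With three wins against one loss, the fitted strengths come out at a 3:1 ratio, so the model's win probability for player 0 equals the observed win rate of 0.75.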

260. ❌ Chameleons do not Forget: Prompt-Based Online Continual Learning for Next Activity Prediction

Authors: Marwan Hassani, Tamara Verbeek, Sjoerd van Straten Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00653v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

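Scoring rationale: The paper studies continual learning in predictive process monitoring (PPM) and proposes the CNAPwP method for next activity prediction. It mainly concerns classical machine learning and deep learning applied to business process analysis and does not involve large models (LLMs), large-model principles (such as MoE, quantization, or inference acceleration), large-model alignment training (such as RLHF or instruction tuning), large-model application techniques (such as RAG or agents), or scientific AI (such as bioinformatics). All keywords relate directly to large models or associated techniques, whereas this paper focuses on domain-specific process prediction and continual learning and neither uses nor discusses large-model techniques, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

Targeting catastrophic forgetting caused by concept drift in predictive process monitoring, this paper proposes CNAPwP, a prompt-based online continual learning method that achieves state-of-the-art or competitive next activity prediction accuracy on synthetic and real-world datasets.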

Abstract (Chinese translation)

预测性过程监控（Predictive Process Monitoring, PPM）致力于预测未来过程轨迹，包括下一活动预测。这在过程变化或面临不确定性的动态环境中至关重要。然而，现有框架通常假设静态环境，忽视了动态特性与概念漂移，导致灾难性遗忘问题——即仅关注新数据分布的训练会对先前学习的数据分布性能产生负面影响。持续学习旨在应对缓解灾难性遗忘等相关挑战。本文提出一种名为“基于提示的持续下一活动预测”（Continual Next Activity Prediction with Prompts, CNAPwP）的新方法，该方法将DualPrompt算法适配于下一活动预测任务，以提高预测准确性与适应性，同时减轻灾难性遗忘。我们引入了包含周期性概念漂移的新数据集，并提出一种任务特定的遗忘度量指标，用于衡量初始出现与后续任务出现之间的预测准确率差距。在代表多种周期性漂移场景的三个合成数据集与两个真实数据集上的广泛测试表明，与五种基线方法相比，CNAPwP取得了领先或具有竞争力的结果，证明了其在现实场景中的潜在适用性。本方法的开源实现、数据集及实验结果可通过以下链接获取：https://github.com/SvStraten/CNAPwP。

Abstract

Predictive process monitoring (PPM) focuses on predicting future process trajectories, including next activity predictions. This is crucial in dynamic environments where processes change or face uncertainty. However, current frameworks often assume a static environment, overlooking dynamic characteristics and concept drifts. This results in catastrophic forgetting, where training while focusing merely on new data distribution negatively impacts the performance on previously learned data distributions. Continual learning addresses, among others, the challenges related to mitigating catastrophic forgetting. This paper proposes a novel approach called Continual Next Activity Prediction with Prompts (CNAPwP), which adapts the DualPrompt algorithm for next activity prediction to improve accuracy and adaptability while mitigating catastrophic forgetting. We introduce new datasets with recurring concept drifts, alongside a task-specific forgetting metric that measures the prediction accuracy gap between initial occurrence and subsequent task occurrences. Extensive testing on three synthetic and two real-world datasets representing several setups of recurrent drifts shows that CNAPwP achieves SOTA or competitive results compared to five baselines, demonstrating its potential applicability in real-world scenarios. An open-source implementation of our method, together with the datasets and results, is available at: https://github.com/SvStraten/CNAPwP.

Keywords: Predictive Process Monitoring, Continual Learning, Next Activity Prediction, Catastrophic Forgetting, Concept Drift, Prompt-based Learning, DualPrompt Adaptation, Process Mining

261. ❌ Neural Ordinary Differential Equations for Modeling Socio-Economic Dynamics

Authors: Sandeep Kumar Samota, Snehashish Chakraverty, Narayan Sethi Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00632v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

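Scoring rationale: The paper uses Neural ODEs (neural ordinary differential equations) to model poverty dynamics in the Indian state of Odisha, an application of machine learning to the social sciences. It does not involve any LLM-related techniques, nor does it propose innovations in deep learning methodology. Of all the keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" is somewhat related (5 points), because the paper applies AI to social science (which can be viewed as science in a broad sense), although it does not touch on bioinformatics or cheminformatics. The remaining keywords concern LLM techniques, training methods, inference optimization, agent systems, and similar topics, and are unrelated, so they are scored 0.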

!!! tip deepseek-chat TL;DR

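The paper applies Neural Ordinary Differential Equations (Neural ODEs) to model poverty dynamics in the Indian state of Odisha from 2007 to 2020. The results indicate that the method accurately captures changes in socio-economic indicators and can give policymakers a reliable forecasting tool.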

摘要翻译

贫困是一个复杂的动态挑战，无法通过预定义的微分方程充分刻画。当前，人工智能机器学习方法已在现实世界动态系统建模中展现出巨大潜力。其中，神经常微分方程作为一种强大的数据驱动方法，能够直接从观测数据中学习连续时间动态。本章应用神经常微分方程框架分析印度奥里萨邦的贫困动态。具体而言，我们利用2007年至2020年经济发展与扶贫关键指标的时间序列数据。在神经常微分方程架构中，系统的时间梯度由多层感知机表示。通过数值常微分方程求解器对获得的神经动态系统进行积分，从而得到其随时间演化的轨迹。在反向传播过程中，采用伴随灵敏度方法进行训练期间的梯度计算，以实现通过常微分方程求解器的有效反向传播。训练完成的神经常微分方程模型能够高精度复现观测数据，这证明了该框架捕捉具体结构家庭贫困指标动态变化的能力。研究结果表明，神经常微分方程等机器学习方法可作为社会经济转型建模的有效工具，能够为政策制定者提供可靠预测，从而支持更科学、更有效的扶贫决策。

Abstract

Poverty is a complex dynamic challenge that cannot be adequately captured using predefined differential equations. Machine learning (ML) methods have demonstrated significant potential in modelling real-world dynamical systems. Among these, Neural Ordinary Differential Equations (Neural ODEs) have emerged as a powerful, data-driven approach for learning continuous-time dynamics directly from observations. This chapter applies the Neural ODE framework to analyze poverty dynamics in the Indian state of Odisha. Specifically, we utilize time-series data from 2007 to 2020 on the key indicators of economic development and poverty reduction. Within the Neural ODE architecture, the time derivative of the system state is represented by a multi-layer perceptron (MLP). The resulting neural dynamical system is integrated with a numerical ODE solver to obtain the trajectory of the system over time. During training, gradients are computed with the adjoint sensitivity method, which enables effective backpropagation through the ODE solver. The trained Neural ODE model reproduces the observed data with high accuracy. This demonstrates the capability of Neural ODEs to capture the dynamics of the poverty indicator of concrete-structured households. The results show that ML methods such as Neural ODEs can serve as effective tools for modeling socioeconomic transitions. They can provide policymakers with reliable projections, supporting more informed and effective decision-making for poverty alleviation.

Keywords: Neural Ordinary Differential Equations, poverty dynamics, socio-economic modeling, time-series data, machine learning, economic development, adjoint sensitivity method, policy decision-making
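
The forward pass described in the abstract (an MLP as the time derivative, integrated by a numerical ODE solver) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the weights in `params`, the one-dimensional state, the fixed-step RK4 solver, and the function names are all assumptions, and the adjoint-based training step is omitted.

```python
import math

def mlp_vector_field(h, params):
    """Tiny MLP f_theta(h) with one tanh hidden layer; returns dh/dt."""
    W1, b1, W2, b2 = params
    hidden = [math.tanh(sum(w * x for w, x in zip(row, h)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * z for w, z in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]

def rk4_step(f, h, dt, params):
    """One classical fourth-order Runge-Kutta step for dh/dt = f(h)."""
    k1 = f(h, params)
    k2 = f([x + 0.5 * dt * k for x, k in zip(h, k1)], params)
    k3 = f([x + 0.5 * dt * k for x, k in zip(h, k2)], params)
    k4 = f([x + dt * k for x, k in zip(h, k3)], params)
    return [x + dt / 6.0 * (a + 2 * b + 2 * c + d)
            for x, a, b, c, d in zip(h, k1, k2, k3, k4)]

def integrate(h0, t0, t1, n_steps, params):
    """Integrate the neural vector field from t0 to t1; returns all states."""
    dt = (t1 - t0) / n_steps
    h = list(h0)
    trajectory = [list(h)]
    for _ in range(n_steps):
        h = rk4_step(mlp_vector_field, h, dt, params)
        trajectory.append(h)
    return trajectory

# Hypothetical, untrained weights: 1 input -> 2 hidden units -> 1 output.
params = (
    [[0.5], [-0.3]],   # W1
    [0.0, 0.1],        # b1
    [[0.2, -0.4]],     # W2
    [0.0],             # b2
)

# One solver step per year over 2007-2020 (an arbitrary choice for illustration).
trajectory = integrate([0.3], 2007.0, 2020.0, 13, params)
```

In a trained model the parameters would be fitted so that `trajectory` matches the observed indicator values at the yearly time stamps.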

262. ❌ Predicting Dynamics of Ultra-Large Complex Systems by Inferring Governing Equations

Authors: Qi Shao, Duxin Chen, Jiawen Chen, Yujie Zeng, Athen Ma, Wenwu Yu, Vito Latora, Wei Lin Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00599v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

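Scoring rationale: The paper proposes the SIGN framework for inferring the governing equations of large networked systems from data, which falls under complex-systems prediction and AI-for-science applications. It is entirely unrelated to the vast majority of keywords (LLM techniques, training methods, inference optimization, alignment, agents, and so on). Only "AI for Science OR Bioinformatics OR Cheminformatics" is somewhat related (5 points), since the paper is applied to scientific domains such as climate and biological networks. However, it does not explicitly use bioinformatics or cheminformatics methods, and its core is graph neural networks and equation discovery rather than a typical scientific application of large models.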

!!! tip deepseek-chat TL;DR

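The paper proposes the Sparse Identification Graph Neural Network (SIGN) framework, which defines symbolic discovery as edge-level information in order to infer the governing equations of ultra-large complex networked systems from data. It achieves scalable, interpretable long-term prediction and is applied to a climate system with temperature data from 71,987 sea surface positions.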

摘要翻译

预测从气候到生物及技术网络等超大规模复杂系统的行为，是一个尚未解决的核心挑战。现有方法面临一个根本性的权衡：方程发现方法具有可解释性但难以扩展，而神经网络虽可扩展却如同黑箱运行，且长期预测可靠性往往下降。本文提出稀疏辨识图神经网络（Sparse Identification Graph Neural Network，简称SIGN）框架，通过从数据中推断大型网络系统的控制方程，从而克服了这一矛盾。通过将符号发现定义为边层级信息，SIGN将稀疏辨识的可扩展性与网络规模解耦，使得即使在大型系统中也能高效进行方程发现。SIGN能够研究超过10万个节点的网络，同时对噪声、稀疏采样和数据缺失保持鲁棒性。在包括耦合混沌振荡器、神经动力学和流行病传播在内的多种基准系统中，SIGN均能高精度还原控制方程，并维持准确的长期预测。将其应用于包含71,987个海表位置温度测量时间序列的数据集时，SIGN识别出一个紧凑的预测网络模型，并提前两年捕捉到大规模海表温度状况。通过实现在以往难以企及的规模上进行方程发现，SIGN为现实世界复杂系统的可解释且可靠的预测开辟了新路径。

Abstract

Predicting the behavior of ultra-large complex systems, from climate to biological and technological networks, is a central unsolved challenge. Existing approaches face a fundamental trade-off: equation discovery methods provide interpretability but fail to scale, while neural networks scale but operate as black boxes and often lose reliability over long times. Here, we introduce the Sparse Identification Graph Neural Network (SIGN), a framework that overcomes this divide by inferring the governing equations of large networked systems from data. By defining symbolic discovery as edge-level information, SIGN decouples the scalability of sparse identification from network size, enabling efficient equation discovery even in large systems. SIGN can handle networks with over 100,000 nodes while remaining robust to noise, sparse sampling, and missing data. Across diverse benchmark systems, including coupled chaotic oscillators, neural dynamics, and epidemic spreading, it recovers governing equations with high precision and sustains accurate long-term predictions. Applied to a dataset of temperature time series at 71,987 sea surface positions, SIGN identifies a compact predictive network model and captures large-scale sea surface temperature conditions up to two years in advance. By enabling equation discovery at previously inaccessible scales, SIGN opens a path toward interpretable and reliable prediction of real-world complex systems.

Keywords: complex systems, governing equations, graph neural network, sparse identification, networked systems, long-term prediction, interpretable prediction, sea surface temperature

263. ❌ Representation choice shapes the interpretation of protein conformational dynamics

Authors: Axel Giottonini, Thomas Lemmin Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00580v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

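Scoring rationale: The paper focuses on analyzing molecular dynamics simulations of protein conformational dynamics and proposes a new geometric feature representation (Orientation features) together with a comparison framework. None of the keywords on large models or on deep learning techniques and applications apply. Only the last keyword, "AI for Science OR Bioinformatics OR Cheminformatics", is somewhat related through the bioinformatics domain. Because the paper relies on traditional computational methods and feature engineering rather than large models or deep learning, it receives 5 points (somewhat related). All other keywords are entirely unrelated and score 0.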

!!! tip deepseek-chat TL;DR

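The study shows that, in protein molecular dynamics simulations, the choice of representation fundamentally affects the inferred conformational organization and dynamics. It proposes a new geometric feature representation and a systematic comparison framework to give a more complete understanding of the dynamics.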

摘要翻译

分子动力学模拟能够在原子层面提供详细的轨迹数据，但从这些高维数据中提取可解释且稳健的洞察仍具挑战性。在实践中，分析通常依赖于单一的构象表征方式。本文指出，表征方式的选择并非中性：它从根本上塑造了从相同模拟数据中推断出的构象组织、相似性关系以及表观转变过程。作为对现有表征方式的补充，我们引入了“取向特征”（Orientation features），这是一种基于几何、具有旋转感知能力的蛋白质骨架编码方法。我们将其与常用描述方式在三种动态体系下进行比较：快速折叠蛋白质、大规模结构域运动以及蛋白质-蛋白质结合过程。在这些体系中，我们发现不同的表征方式强调了构象空间的不同互补方面，没有任何单一表征方式能够完整呈现底层的动力学全貌。为促进系统比较，我们开发了 ManiProt 库，用于高效计算和分析多种蛋白质表征。我们的研究结果推动了一种基于比较、且具有表征方式意识的框架，以用于分子动力学模拟的解读。

Abstract

Molecular dynamics simulations provide detailed trajectories at the atomic level, but extracting interpretable and robust insights from these high-dimensional data remains challenging. In practice, analyses typically rely on a single representation. Here, we show that representation choice is not neutral: it fundamentally shapes the conformational organization, similarity relationships, and apparent transitions inferred from identical simulation data. To complement existing representations, we introduce Orientation features, a geometrically grounded, rotation-aware encoding of the protein backbone. We compare it against common descriptions across three dynamical regimes: fast-folding proteins, large-scale domain motions, and protein-protein association. Across these systems, we find that different representations emphasize complementary aspects of conformational space, and that no single representation provides a complete picture of the underlying dynamics. To facilitate systematic comparison, we developed ManiProt, a library for efficient computation and analysis of multiple protein representations. Our results motivate a comparative, representation-aware framework for the interpretation of molecular dynamics simulations.

Keywords: protein conformational dynamics, molecular dynamics simulations, representation choice, Orientation features, ManiProt library, conformational organization, protein backbone encoding, systematic comparison

264. ❌ Scenario theory for multi-criteria data-driven decision making

Authors: Simone Garatti, Lucrezia Manieri, Alessandro Falsone, Algo Carè, Marco C. Campi, Maria Prandini Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00553v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

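Scoring rationale: The paper studies scenario theory for multi-criteria data-driven decision making, which belongs to operations research, control theory, and decision science, with a focus on probabilistic robustness guarantees under uncertainty. Its content does not involve large language models, deep learning, AI model training, inference optimization, alignment techniques, agent systems, or AI-for-science applications. None of the keywords relate to the paper's topic, so every keyword has a relevance of 0.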

!!! tip deepseek-chat TL;DR

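For multi-criteria data-driven decision problems, the paper proposes a new scenario-theory framework that treats the risks of violating individual criteria collectively. This yields more accurate robustness guarantees than traditional approaches and a sharper quantification of the robustness level with which all criteria are satisfied simultaneously.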

摘要翻译

场景方法为在不确定性下设计解决方案提供了一个强大的数据驱动框架，并具备严格的概率鲁棒性保证。然而，现有理论主要关注基于单一数据集评估解决方案在单一适用性准则下的鲁棒性，而许多实际应用——包括多智能体决策问题——需要同时考虑多个准则，并基于多个数据集（每个准则对应一个数据集）评估其鲁棒性。本文针对多准则数据驱动决策问题发展了一套通用的场景理论。其核心创新在于对违反各准则相关风险的整体处理，相比直接套用标准结果的朴素方法，该方法能得出显著更精确的鲁棒性证明。相应地，这一方法能够更精确地量化所有准则同时满足的鲁棒性水平。所提出的框架广泛适用于多准则数据驱动决策问题，为不确定性下的设计提供了一种原则性、可扩展且理论坚实的方法论。

Abstract

The scenario approach provides a powerful data-driven framework for designing solutions under uncertainty with rigorous probabilistic robustness guarantees. Existing theory, however, primarily addresses assessing robustness with respect to a single appropriateness criterion for the solution based on a dataset, whereas many practical applications - including multi-agent decision problems - require the simultaneous consideration of multiple criteria and the assessment of their robustness based on multiple datasets, one per criterion. This paper develops a general scenario theory for multi-criteria data-driven decision making. A central innovation lies in the collective treatment of the risks associated with violations of individual criteria, which yields substantially more accurate robustness certificates than those derived from a naive application of standard results. In turn, this approach enables a sharper quantification of the robustness level with which all criteria are simultaneously satisfied. The proposed framework applies broadly to multi-criteria data-driven decision problems, providing a principled, scalable, and theoretically grounded methodology for design under uncertainty.

Keywords: scenario approach, multi-criteria decision making, data-driven decision making, robustness guarantees, uncertainty, probabilistic robustness, multi-agent decision problems, robustness certificates

265. ❌ Activation Saturation and Floquet Spectrum Collapse in Neural ODEs

Authors: Nikolaos M. Matzakos Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00543v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

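Scoring rationale: The paper studies the theoretical limits that activation saturation places on the dynamics of Neural ODEs, which belongs to deep learning theory. Its content centers on mathematical proofs and theoretical analysis and does not involve large language models (LLMs), model training techniques (such as pre-training, fine-tuning, or alignment), inference optimization, agent systems, model compression, or AI-for-science applications. All keywords relate directly to large-model techniques or their applications, whereas this paper studies theoretical properties of a basic neural network architecture, so every keyword has a relevance of 0.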

!!! tip deepseek-chat TL;DR

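The paper proves that, in Neural ODEs with saturating activations (such as tanh and sigmoid), activation saturation causes the Floquet spectrum to collapse, limiting both the contraction and the chaotic sensitivity of the system, and it validates the theory with numerical experiments.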

摘要翻译

我们证明，对于采用饱和激活函数（如$\tanh$、sigmoid等）的自洽神经微分方程$\dot{h}=f_θ(h)$，激活饱和会对其结构动力学施加限制：若多层感知机$f_θ$中有$q$个隐藏层在区域$U$上满足$|σ’|\leδ$，则输入雅可比矩阵将按$\norm{Df_θ(x)}\le C(U)$衰减（对于满足$\sup_{x}|σ’(x)|\le 1$的激活函数，例如$\tanh$与sigmoid，该式可简化为$C_Wδ^q$），从而迫使沿任意$T$周期轨道$γ\subset U$的所有弗洛凯（李雅普诺夫）指数落入区间$[-C(U),;C(U)]$内。这导致弗洛凯谱的坍缩：随着饱和程度加深（$δ\to 0$），所有指数被驱向零，从而同时限制了强收缩与混沌敏感性。该阻碍是结构性的——它在推理阶段约束了学习得到的向量场，且与训练质量无关。作为次要贡献，对于满足$σ’>0$的激活函数，通过饱和加权谱分解可得到更精细的界$\widetilde{C}(U)\le C(U)$，其改进效果在流层面上随$T$呈指数级放大。所有结果均在Stuart–Landau振子上进行了数值验证；这些界为$\tanh$-神经微分方程在Morris–Lecar神经元模型上经验性观测到的失效现象提供了理论解释。

Abstract

We prove that activation saturation imposes a structural dynamical limitation on autonomous Neural ODEs $\dot{h}=f_\theta(h)$ with saturating activations ($\tanh$, sigmoid, etc.): if $q$ hidden layers of the MLP $f_\theta$ satisfy $|\sigma'|\le\delta$ on a region $U$, the input Jacobian is attenuated as $\|Df_\theta(x)\|\le C(U)$ (for activations with $\sup_{x}|\sigma'(x)|\le 1$, e.g. $\tanh$ and sigmoid, this reduces to $C_W\delta^q$), forcing every Floquet (Lyapunov) exponent along any $T$-periodic orbit $\gamma\subset U$ into the interval $[-C(U),\;C(U)]$. This is a collapse of the Floquet spectrum: as saturation deepens ($\delta\to 0$), all exponents are driven to zero, limiting both strong contraction and chaotic sensitivity. The obstruction is structural – it constrains the learned vector field at inference time, independent of training quality. As a secondary contribution, for activations with $\sigma'>0$, a saturation-weighted spectral factorisation yields a refined bound $\widetilde{C}(U)\le C(U)$ whose improvement is amplified exponentially in $T$ at the flow level. All results are numerically illustrated on the Stuart–Landau oscillator; the bounds provide a theoretical explanation for the empirically observed failure of $\tanh$-NODEs on the Morris–Lecar neuron model.

Keywords: Neural ODEs, activation saturation, Floquet spectrum, Lyapunov exponents, saturating activations, dynamical limitations, theoretical analysis, Stuart-Landau oscillator
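
The Jacobian attenuation in the abstract can be checked numerically in the simplest case, a single tanh hidden layer ($q=1$) with $f(x) = W_2 \tanh(W_1 x + b_1)$, whose Jacobian is $W_2\,\mathrm{diag}(\sigma'(z))\,W_1$. The sketch below is only a sanity check under stated assumptions: the weights are made up, the Frobenius norm stands in for the paper's norm (so the constant plays the role of $C_W$ only loosely), and saturation is induced by a large bias.

```python
import math

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def frobenius(A):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(x * x for row in A for x in row))

def tanh_mlp_jacobian(W1, b1, W2, x):
    """Jacobian of f(x) = W2 tanh(W1 x + b1), plus the largest slope delta."""
    z = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    slopes = [1.0 - math.tanh(zi) ** 2 for zi in z]   # sigma'(z) for tanh
    scaled_W1 = [[s * w for w in row] for s, row in zip(slopes, W1)]
    return matmul(W2, scaled_W1), max(slopes)

# Hypothetical weights: 2 inputs -> 3 hidden units -> 2 outputs.
W1 = [[1.0, -2.0], [0.5, 1.5], [-1.0, 0.3]]
W2 = [[0.7, -0.2, 0.4], [0.1, 0.9, -0.6]]

def saturation_report(bias_scale):
    """Return (||J||_F, delta * ||W1||_F * ||W2||_F) at a fixed test point."""
    b1 = [bias_scale, -bias_scale, bias_scale]  # larger bias, deeper saturation
    J, delta = tanh_mlp_jacobian(W1, b1, W2, [0.1, -0.2])
    return frobenius(J), delta * frobenius(W1) * frobenius(W2)

for scale in (0.0, 2.0, 5.0):
    norm, bound = saturation_report(scale)
    print(f"bias scale {scale}: ||J||_F = {norm:.3e}, bound = {bound:.3e}")
```

As the bias pushes the hidden units into the flat part of tanh, both the Jacobian norm and the bound shrink toward zero, which is the mechanism the abstract says drives the Floquet exponents toward zero.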

266. ❌ Learning from Many and Adapting to the Unknown in Open-set Test Streams

Authors: Xiao Zhang, Juntao Lyu, Tianyu Hu, Qianchuan Zhao, Huimin Ma Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00533v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

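Scoring rationale: The paper's core topic is test-time adaptation (TTA) of LLMs on open-set test streams. It proposes SyCo, a parameter-efficient adaptation method that updates low-rank adapters. It is therefore highly related to "Large Language Models" (10 points) and to "PEFT/LoRA/Parameter-efficient Fine-tuning" (10 points), and somewhat related to "Pre-training/Domain Adaptation" (8 points, since it involves domain-adaptation concepts). Other keywords such as MoE, SLMs, Scaling Laws, SFT, RLHF, and RAG are neither mentioned in the abstract nor relevant to it, and receive 0 points.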

!!! tip deepseek-chat TL;DR

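To address the brittleness of large language models under continual distribution shift and unknown tasks in open-set test streams, the paper proposes SyCo, a biologically inspired, parameter-efficient adaptation method that significantly improves adaptation to unseen tasks and unseen data shifts across 18 NLP datasets.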

摘要翻译

大语言模型（LLMs）通过可复用的表征和灵活的推理能力实现跨任务泛化，但在实际部署中面对动态演化的任务和持续的分布漂移时仍显脆弱。一种常见方法是测试时适应（Test-Time Adaptation, TTA），现有方法通常基于人工设计的无监督目标在全参数空间更新模型，大多忽视了如何保护共享的源领域知识以及适应信号的可靠性。受果蝇记忆更新的分子信号级联机制启发，我们提出突触巩固（Synapse Consolidation, SyCo），一种参数高效的大语言模型适应方法。该方法在由问题理解、过程理解和源领域防护栏驱动的结构化TTA目标指导下，通过Rac1和MAPK通路更新低秩适配器。Rac1将可塑性限制在对源知识较不关键的尾部梯度子空间，从而在保护源表征的同时实现快速特化。MAPK采用分层控制器抑制噪声更新，并在非平稳数据流中巩固有效的适应。为模拟多源任务持续出现的真实部署场景，我们提出多源开放集适应（Multi-source Open-set Adaptation, MOA）设定：模型首先在多个带标注的源任务上训练，随后在开放、非平稳的无标注测试流上进行适应，这些测试流混合了已见和未见任务，且在标签和意图空间存在部分重叠。在18个NLP数据集和MOA设定下的实验表明，SyCo持续优于强基线方法，在未见任务适应上达到78.31%的准确率，在未见数据漂移上达到85.37%的准确率。

Abstract

Large Language Models (LLMs) generalize across tasks via reusable representations and flexible reasoning, yet remain brittle in real deployment under evolving tasks and continual distribution shift. A common approach is Test-Time Adaptation (TTA), but existing TTA methods update models with hand-designed unsupervised objectives over the full parameter space and mostly overlook the preservation of shared source knowledge and the reliability of adaptation signals. Drawing on the molecular signaling cascades of memory updating in Drosophila, we propose Synapse Consolidation (SyCo), a parameter-efficient LLM adaptation method that updates low-rank adapters through Rac1 and MAPK pathways under the guidance of a structured TTA objective driven by problem understanding, process understanding, and a source-domain guardrail. Rac1 confines plasticity to a tail-gradient subspace that is less critical for source knowledge, enabling rapid specialization while preserving source representations. MAPK uses a tiered controller to suppress noisy updates and consolidate useful adaptations under non-stationary streams. To model real deployments with multiple sources and continually emerging tasks, we introduce the Multi-source Open-set Adaptation (MOA) setting, where a model is trained on multiple labeled source tasks and then adapts on open, non-stationary unlabeled test streams that mix seen and unseen tasks with partial overlap in label and intent space. Across 18 NLP datasets and the MOA setting, SyCo consistently outperforms strong baselines, achieving 78.31% on unseen-task adaptation and 85.37% on unseen-data shifts.

Keywords: Large Language Models, Test-Time Adaptation, Parameter-efficient Fine-tuning, Open-set Adaptation, Low-rank Adapters, Multi-source Adaptation, Distribution Shift, Synapse Consolidation

267. ❌ Learning Shared Representations for Multi-Task Linear Bandits

Authors: Jiabin Lin, Shana Moothedath Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00531v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

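Scoring rationale: The paper studies representation learning in multi-task linear bandits, proposes an OFUL algorithm based on a shared low-dimensional representation, and provides theoretical guarantees and numerical validation. All scoring keywords concern large models, deep learning techniques, or scientific applications, whereas this paper focuses on classical multi-task reinforcement learning / online learning with linear models and spectral methods, and does not involve large models, deep learning, AI for Science, or related techniques. Every keyword therefore has a relevance of 0.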

!!! tip deepseek-chat TL;DR

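The paper proposes a new representation learning algorithm for multi-task linear bandits that, by sharing a low-dimensional representation, achieves a better cumulative regret bound than solving the tasks independently.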

摘要翻译

多任务表征学习是一种在相关任务间学习共享潜在表征的方法，它促进了知识迁移并提升了样本效率。本文针对线性赌博机问题提出了一种新颖的多任务表征学习方法。我们考虑一个包含T个并发的线性赌博机任务的环境，每个任务的特征维度为d，它们共享一个维度为r \ll min{d,T}的公共潜在表征，以捕捉其内在关联性。我们提出了一种新的“面对不确定性的线性乐观”（OFUL）算法，该算法利用共享的低秩表征，以样本高效的方式增强决策能力。我们的算法首先通过探索阶段收集数据，通过谱初始化估计共享模型，随后基于新构建的置信集进行OFUL学习。我们为置信集提供了理论保证，并证明了未知奖励向量以高概率位于置信集内。我们推导了累积遗憾界，并表明所提方法实现了\tilde{O}(\sqrt{drNT})的遗憾，相较于独立求解T个任务所得的\tilde{O}(dT\sqrt{N})遗憾有显著改进。我们进行了数值模拟，以验证算法在不同问题规模下的性能。

Abstract

Multi-task representation learning is an approach that learns shared latent representations across related tasks, facilitating knowledge transfer and improving sample efficiency. This paper introduces a novel approach to multi-task representation learning in linear bandits. We consider a setting with $T$ concurrent linear bandit tasks, each with feature dimension $d$, that share a common latent representation of dimension $r \ll \min\{d,T\}$, capturing their underlying relatedness. We propose a new Optimism in the Face of Uncertainty Linear (OFUL) algorithm that leverages shared low-rank representations to enhance decision-making in a sample-efficient manner. Our algorithm first collects data through an exploration phase, estimates the shared model via spectral initialization, and then conducts OFUL-based learning over a newly constructed confidence set. We provide theoretical guarantees for the confidence set and prove that the unknown reward vectors lie within the confidence set with high probability. We derive cumulative regret bounds and show that the proposed approach achieves a regret of $\tilde{O}(\sqrt{drNT})$, a significant improvement over solving the $T$ tasks independently, which results in a regret of $\tilde{O}(dT\sqrt{N})$. We performed numerical simulations to validate the performance of our algorithm for different problem sizes.

Keywords: multi-task representation learning, linear bandits, shared latent representations, OFUL algorithm, regret bounds, spectral initialization, sample efficiency, low-rank representations
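
To get a feel for the size of the improvement, the two regret rates from the abstract can be compared directly. The sketch below drops the constants and logarithmic factors hidden in $\tilde{O}(\cdot)$, so it only illustrates scaling; the problem sizes are invented, and reading $N$ as the number of rounds is an assumption, since the abstract does not define it.

```python
import math

def shared_representation_rate(d, r, N, T):
    """Leading term sqrt(d * r * N * T) of the shared-representation regret."""
    return math.sqrt(d * r * N * T)

def independent_tasks_rate(d, N, T):
    """Leading term d * T * sqrt(N) when the T tasks are solved independently."""
    return d * T * math.sqrt(N)

def improvement_factor(d, r, T):
    """Ratio of the two rates; N cancels, leaving sqrt(d * T / r)."""
    return math.sqrt(d * T / r)

# Hypothetical problem sizes: r is much smaller than min(d, T).
d, r, N, T = 100, 5, 10_000, 20

shared = shared_representation_rate(d, r, N, T)
independent = independent_tasks_rate(d, N, T)
print(shared, independent, independent / shared)
```

The ratio simplifies to $\sqrt{dT/r}$, so under these rates the gain grows with both the feature dimension and the number of tasks and shrinks as the shared representation gets larger.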

268. ❌ Lipschitz Dueling Bandits over Continuous Action Spaces

Authors: Mudit Sharma, Shweta Jain, Vaneet Aggarwal, Ganesh Ghalme Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00523v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

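Scoring rationale: The paper studies Lipschitz dueling bandits over continuous action spaces, which belongs to reinforcement learning / online learning and focuses on algorithm design and theoretical analysis. All of the given keywords concern large language models, deep learning techniques, AI-for-science applications, and similar topics, none of which the paper addresses. It does not mention language models, deep learning architectures, training methods, inference optimization, AI agents, or AI-for-science applications, so every keyword has a relevance of 0.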

!!! tip deepseek-chat TL;DR

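The paper is the first to study stochastic dueling bandits over continuous action spaces with Lipschitz structure. It proposes an algorithm based on round-based exploration and recursive region elimination guided by an adaptive reference arm, and proves a sublinear regret bound with only logarithmic space complexity.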

摘要翻译

我们首次研究了具有利普希茨结构的连续动作空间上的随机对决赌博机问题，其反馈机制为纯比较式反馈。尽管对决赌博机和利普希茨赌博机已分别得到研究，但二者的结合此前尚未被探索。我们提出了首个针对利普希茨对决赌博机的算法，该算法采用基于轮次的探索策略，并通过自适应参考臂引导的递归区域消除机制进行优化。我们针对相对反馈开发了新的分析工具，并证明了遗憾上界为 $\tilde O\left(T^{\frac{d_z+1}{d_z+2}}\right)$，其中 $d_z$ 为近优区域的缩放维度。此外，我们的算法在总时间跨度上仅需对数级存储空间，这是连续动作空间上任何赌博机算法可达到的最佳空间复杂度。

摘要 (Abstract)

We study for the first time, stochastic dueling bandits over continuous action spaces with Lipschitz structure, where feedback is purely comparative. While dueling bandits and Lipschitz bandits have been studied separately, their combination has remained unexplored. We propose the first algorithm for Lipschitz dueling bandits, using round-based exploration and recursive region elimination guided by an adaptive reference arm. We develop new analytical tools for relative feedback and prove a regret bound of $\tilde O\left(T^{\frac{d_z+1}{d_z+2}}\right)$, where $d_z$ is the zooming dimension of the near-optimal region. Further, our algorithm takes only logarithmic space in terms of the total time horizon, best achievable by any bandit algorithm over a continuous action space.

Keywords: dueling bandits, continuous action spaces, Lipschitz structure, relative feedback, regret bound, zooming dimension, round-based exploration, recursive region elimination
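
To make the regret bound in the abstract concrete, the sketch below evaluates the exponent $\frac{d_z+1}{d_z+2}$ of $T$ for a few zooming dimensions. The function name `regret_exponent` and the sample values of `d_z` are illustrative assumptions, and the polylogarithmic factors hidden by $\tilde O$ are ignored.

```python
from fractions import Fraction


def regret_exponent(d_z: int) -> Fraction:
    """Exponent of T in the stated bound, (d_z + 1) / (d_z + 2)."""
    if d_z < 0:
        raise ValueError("zooming dimension must be non-negative")
    return Fraction(d_z + 1, d_z + 2)


# Hypothetical zooming dimensions, chosen only for illustration.
exponents = {d: regret_exponent(d) for d in (0, 1, 2, 5, 10)}
for d, e in exponents.items():
    print(f"d_z={d:2d}  exponent={e} ~ {float(e):.3f}")
```

The exponent stays strictly below 1 for every finite $d_z$, which is what makes the bound sublinear, and it approaches 1 as $d_z$ grows.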

269. ❌ A Decoupled Basis-Vector-Driven Generative Framework for Dynamic Multi-Objective Optimization

Authors: Yaoming Yang, Shuai Wang, Bingdong Li, Peng Yang, Ke Tang Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00508v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

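Scoring rationale: This paper addresses dynamic multi-objective optimization, a topic in traditional machine learning / evolutionary algorithms, and proposes a generative framework based on the discrete wavelet transform and sparse dictionary learning. It does not involve large language models, deep learning, AI for Science, or any technical concept in the scoring keywords. All keywords concern large models, deep learning techniques, or scientific AI applications, whereas this paper studies an optimization algorithm, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

To address non-linear coupling, negative transfer, and cold start in dynamic multi-objective optimization, this paper proposes a decoupled basis-vector-driven generative framework (DB-GEN). By separating evolutionary trajectories, learning transferable basis vectors, and constructing a structured latent manifold, it performs zero-shot online inference without retraining or fine-tuning and improves tracking accuracy on a range of dynamic benchmarks.

Abstract (Chinese translation)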

动态多目标优化需要持续追踪移动的帕累托前沿。现有方法难以处理不规则突变与数据稀疏性问题，主要面临三大挑战：动态模式间的非线性耦合、过时历史数据导致的负迁移，以及环境切换时的冷启动问题。为解决这些难题，本文提出一种解耦基向量驱动的生成框架（DB-GEN）。首先，针对非线性耦合问题，该框架采用离散小波变换将进化轨迹分解为低频趋势与高频细节。其次，为缓解负迁移效应，框架通过稀疏字典学习迁移性基向量而非直接记忆历史实例，并在拓扑感知对比约束下重组这些基向量，从而构建结构化隐流形。最后，为克服冷启动问题，采用代理辅助搜索范式从该流形中采样初始种群。基于1.2亿个解进行预训练的DB-GEN可直接进行在线推理，无需重新训练或微调。这种零样本生成过程在毫秒级内完成，每次环境变化仅需约0.2秒。实验结果表明，相较于现有算法，DB-GEN在多种动态基准测试中显著提升了追踪精度。

Abstract

Dynamic multi-objective optimization requires continuous tracking of moving Pareto fronts. Existing methods struggle with irregular mutations and data sparsity, primarily facing three challenges: the non-linear coupling of dynamic modes, negative transfer from outdated historical data, and the cold-start problem during environmental switches. To address these issues, this paper proposes a decoupled basis-vector-driven generative framework (DB-GEN). First, to resolve non-linear coupling, the framework employs the discrete wavelet transform to separate evolutionary trajectories into low-frequency trends and high-frequency details. Second, to mitigate negative transfer, it learns transferable basis vectors via sparse dictionary learning rather than directly memorizing historical instances. Recomposing these bases under a topology-aware contrastive constraint constructs a structured latent manifold. Finally, to overcome the cold-start problem, a surrogate-assisted search paradigm samples initial populations from this manifold. Pre-trained on 120 million solutions, DB-GEN performs direct online inference without retraining or fine-tuning. This zero-shot generation process executes in milliseconds, requiring approximately 0.2 seconds per environmental change. Experimental results demonstrate that DB-GEN improves tracking accuracy across various dynamic benchmarks compared to existing algorithms.

Keywords: Dynamic Multi-Objective Optimization, Decoupled Basis-Vector-Driven Generative Framework, Discrete Wavelet Transform, Sparse Dictionary Learning, Zero-shot Generation, Online Inference, Pareto Front Tracking, Surrogate-assisted Search
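
The abstract says DB-GEN uses the discrete wavelet transform to split evolutionary trajectories into low-frequency trends and high-frequency details. The sketch below shows that kind of split with a one-level Haar transform on a toy one-dimensional trajectory. The Haar wavelet, the single decomposition level, and the toy data are assumptions for illustration; the abstract does not say which wavelet or depth the authors use.

```python
import numpy as np


def haar_dwt(signal):
    """One-level Haar DWT: returns (approximation, detail) coefficients."""
    x = np.asarray(signal, dtype=float)
    if x.size % 2 != 0:
        raise ValueError("signal length must be even")
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2.0)  # low-frequency trend
    detail = (even - odd) / np.sqrt(2.0)  # high-frequency detail
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt: rebuilds the original signal."""
    even = (approx + detail) / np.sqrt(2.0)
    odd = (approx - detail) / np.sqrt(2.0)
    out = np.empty(even.size * 2)
    out[0::2], out[1::2] = even, odd
    return out


# Toy trajectory (hypothetical): a linear drift plus alternating jitter.
t = np.arange(16)
trajectory = 0.5 * t + np.where(t % 2 == 0, 0.2, -0.2)

approx, detail = haar_dwt(trajectory)
reconstructed = haar_idwt(approx, detail)
```

Here `approx` carries the drift and `detail` carries the jitter, and the transform is lossless, so nothing is discarded by the decomposition itself.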

270. ❌ Scheduling LLM Inference with Uncertainty-Aware Output Length Predictions

Authors: Haoyu Zheng, Yongqiang Zhang, Fangcheng Fu, Xiaokai Zhou, Hao Luo, Hongchao Zhu, Yuanyuan Zhu, Hao Wang, Xiao Yan, Jiawei Jiang Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00499v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

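Scoring rationale: This paper focuses on scheduling for LLM inference; its core contribution is a scheduling method (the TIE scheduler) built on uncertainty-aware output length prediction. It is therefore highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points), since it studies LLM inference directly, and to "Speculative Decoding OR Inference Acceleration" (10 points), since it aims to speed up LLM inference (lower latency, higher throughput) through better scheduling. The other keywords (MoE, SFT, RAG, quantization, and so on) concern model architectures, training methods, application scenarios, or specific techniques that the paper does not cover, so they all score 0.

!!! tip deepseek-chat TL;DR

To address inefficient scheduling caused by uncertain output lengths in LLM inference, this paper proposes a scheduling method based on Tail Inflated Expectation (TIE). According to the abstract, the TIE scheduler reduces per-token latency by $2.31\times$ for online inference and improves throughput by $1.42\times$ for offline data generation.

Abstract (Chinese translation)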

在大型语言模型（LLM）推理调度中，采用最短作业优先（SJF）原则有利于通过优先处理输出长度较短的请求来避免队头（HOL）阻塞。现有方法通常为每个请求预测单一的输出长度以辅助调度。我们认为，这种点估计方式与LLM推理的随机解码过程并不匹配，因为输出长度本质上是不确定的，其实际值取决于采样到序列结束（EOS）标记的时机。因此，每个请求的输出长度应拟合为一个分布而非单一数值。通过对实际数据与随机解码过程的深入分析，我们发现输出长度服从重尾分布，并可用对数t分布进行拟合。基于此，我们提出一种称为尾部膨胀期望（Tail Inflated Expectation, TIE）的简单度量，以替代SJF调度中的输出长度。该度量通过对数t分布的尾部概率调整其期望值，从而兼顾请求可能生成长输出的风险。为评估所提出的TIE调度器，我们将其与三种强基线方法进行比较，结果表明：TIE在在线推理场景中将每令牌延迟降低了$2.31\times$，在离线数据生成场景中将吞吐量提升了$1.42\times$。

Abstract

To schedule LLM inference, the shortest job first (SJF) principle is favorable by prioritizing requests with short output lengths to avoid head-of-line (HOL) blocking. Existing methods usually predict a single output length for each request to facilitate scheduling. We argue that such a point estimate does not match the stochastic decoding process of LLM inference, where output length is uncertain by nature and determined by when the end-of-sequence (EOS) token is sampled. Hence, the output length of each request should be fitted with a distribution rather than a single value. With an in-depth analysis of empirical data and the stochastic decoding process, we observe that output length follows a heavy-tailed distribution and can be fitted with the log-t distribution. On this basis, we propose a simple metric called Tail Inflated Expectation (TIE) to replace the output length in SJF scheduling, which adjusts the expectation of a log-t distribution with its tail probabilities to account for the risk that a request generates long outputs. To evaluate our TIE scheduler, we compare it with three strong baselines, and the results show that TIE reduces the per-token latency by $2.31\times$ for online inference and improves throughput by $1.42\times$ for offline data generation.

Keywords: LLM inference, scheduling, output length prediction, uncertainty, shortest job first, Tail Inflated Expectation, latency reduction, throughput improvement
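
The abstract motivates shortest job first (SJF) by head-of-line blocking. The sketch below shows that effect on a single server that runs one request at a time, comparing arrival order with SJF order. The request lengths are hypothetical, the lengths are treated as known in advance, and real serving systems batch requests. The TIE metric itself is not reproduced because the abstract does not give its formula.

```python
def mean_completion_time(lengths):
    """Mean completion time when jobs run one at a time in the given order."""
    elapsed, total = 0, 0
    for length in lengths:
        elapsed += length  # this job finishes once all earlier jobs are done
        total += elapsed
    return total / len(lengths)


# Hypothetical output lengths (tokens), in arrival order; the long job is first.
output_lengths = [400, 20, 35, 10]

fcfs = mean_completion_time(output_lengths)        # first come, first served
sjf = mean_completion_time(sorted(output_lengths)) # shortest job first

print(f"FCFS mean completion: {fcfs}, SJF mean completion: {sjf}")
```

With the long request at the head of the queue, every short request waits behind it under FCFS. Sorting by length removes that wait, which is why a good estimate of output length matters for the scheduler.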

271. ❌ The Rashomon Effect for Visualizing High-Dimensional Data

Authors: Yiyang Sun, Haiyang Huang, Gaurav Rajesh Parikh, Cynthia Rudin Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00485v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

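Scoring rationale: This paper focuses on dimension reduction for visualizing high-dimensional data, in particular the concept of the Rashomon set. It does not involve large models, deep learning techniques, or AI applications in science. All keywords concern large-model techniques, training methods, inference optimization, alignment, agent systems, and similar topics, whereas this paper studies conventional dimension reduction for data visualization and is unrelated to them.

!!! tip deepseek-chat TL;DR

This paper studies the Rashomon effect in dimension reduction for high-dimensional data and proposes PCA-informed alignment, concept-alignment regularization, and a method for extracting common knowledge across the Rashomon set, in order to build more interpretable, robust, and goal-aligned visualizations.

Abstract (Chinese translation)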

降维（DR）本质上具有非唯一性：多种嵌入方式可以在保持高维数据结构的同时，在布局或几何形态上存在差异。本文正式定义了降维的“罗生门集合”（Rashomon set，即所有“优质”嵌入的集合），并展示了如何利用这种多重性来获得更强大且可信的表征。具体而言，我们追求三个目标。首先，我们引入基于主成分分析（PCA-informed）的对齐方法，引导嵌入向主成分方向靠拢，从而在不扭曲局部邻域结构的前提下使坐标轴具有可解释性。其次，我们设计了概念对齐正则化方法，将嵌入维度与外部知识（如类别标签或用户定义的概念）进行对齐。第三，我们提出一种从罗生门集合中提取共性知识的方法，通过识别可信且稳定的最近邻关系，构建出既能保持全局关系、又具有优化局部结构的精炼嵌入。通过超越单一嵌入并利用罗生门集合，我们为构建可解释、鲁棒且与目标对齐的可视化提供了一个灵活框架。

Abstract

Dimension reduction (DR) is inherently non-unique: multiple embeddings can preserve the structure of high-dimensional data equally well while differing in layout or geometry. In this paper, we formally define the Rashomon set for DR – the collection of 'good' embeddings – and show how embracing this multiplicity leads to more powerful and trustworthy representations. Specifically, we pursue three goals. First, we introduce PCA-informed alignment to steer embeddings toward principal components, making axes interpretable without distorting local neighborhoods. Second, we design concept-alignment regularization that aligns an embedding dimension with external knowledge, such as class labels or user-defined concepts. Third, we propose a method to extract common knowledge across the Rashomon set by identifying trustworthy and persistent nearest-neighbor relationships, which we use to construct refined embeddings with improved local structure while preserving global relationships. By moving beyond a single embedding and leveraging the Rashomon set, we provide a flexible framework for building interpretable, robust, and goal-aligned visualizations.

Keywords: Dimension Reduction, Rashomon Set, PCA-informed Alignment, Concept-alignment Regularization, High-dimensional Data Visualization, Nearest-neighbor Relationships, Interpretable Embeddings
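
The abstract's starting point is that many embeddings preserve structure equally well while differing in layout. The sketch below shows the simplest instance of that: rotating a 2-D embedding changes every coordinate but leaves all pairwise distances unchanged. The random stand-in embedding and the rotation angle are assumptions, and the paper's Rashomon set is presumably much broader than rotations alone.

```python
import numpy as np


def pairwise_distances(points):
    """Dense matrix of Euclidean distances between all rows of `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


rng = np.random.default_rng(0)
embedding = rng.normal(size=(50, 2))  # stand-in for a 2-D embedding

theta = np.pi / 3  # arbitrary rotation angle
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
rotated = embedding @ rotation.T

# Same distance structure, different coordinates.
same_structure = np.allclose(pairwise_distances(embedding),
                             pairwise_distances(rotated))
different_layout = not np.allclose(embedding, rotated)
```

Because any distance-based quality score cannot tell these two embeddings apart, the choice of axes is free, which is the freedom that an alignment step such as the PCA-informed one described in the abstract can use.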

272. ❌ Phase space integrity in neural network models of Hamiltonian dynamics: A Lagrangian descriptor approach

Authors: Abrari Noor Hasmi, Haralampos Hatzikirou, Hadi Susanto Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00473v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

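Scoring rationale: This paper studies neural network models of Hamiltonian systems and proposes Lagrangian descriptors as an evaluation framework. Its subject is scientific computing and physical system modeling, which relates to applications of deep learning in science, so it has some relevance to "AI for Science OR Bioinformatics OR Cheminformatics" (5 points). It does not cover large language models (LLMs), architectural innovations (such as MoE or SLMs), training techniques (such as pre-training, fine-tuning, or alignment), inference optimization, agent systems, or model compression, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

This paper proposes Lagrangian descriptors as a diagnostic framework for evaluating neural network models of Hamiltonian systems. It finds that physically constrained architectures preserve energy but can distort phase-space topology, whereas data-driven Reservoir Computing reproduces the homoclinic structure with high fidelity despite lacking explicit physical constraints.

Abstract (Chinese translation)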

我们提出拉格朗日描述子（Lagrangian Descriptors, LDs）作为一种诊断框架，用于评估哈密顿系统的神经网络模型，其能力超越传统的基于轨迹的度量标准。常规误差度量侧重于量化短期预测精度，但难以揭示全局几何结构（如轨道和分界线）的特性。由于系统本质差异，耗散系统中现有的评估工具并不适用于哈密顿动力学。通过构建以LD值加权的概率密度函数，我们将几何信息嵌入到适用于信息论比较的统计框架中。我们在两个经典系统中，对物理约束架构（SympNet、HénonNet、广义哈密顿神经网络）与数据驱动的储备池计算进行了基准测试。对于杜芬振子，所有模型在适度数据需求下均能恢复同宿轨道几何结构，但其在临界结构附近的精度存在差异。然而，在三模非线性薛定谔方程中，差异显著显现：辛架构虽能保持能量，但扭曲了相空间拓扑结构；而储备池计算尽管缺乏显式物理约束，却能以高保真度复现同宿结构。这些结果证明了基于LD的诊断方法不仅可用于评估预测性能，还能检验所学哈密顿模型的全局动力学完整性，具有重要价值。

Abstract

We propose Lagrangian Descriptors (LDs) as a diagnostic framework for evaluating neural network models of Hamiltonian systems beyond conventional trajectory-based metrics. Standard error measures quantify short-term predictive accuracy but provide little insight into global geometric structures such as orbits and separatrices. Existing evaluation tools in dissipative systems are inadequate for Hamiltonian dynamics due to fundamental differences in the systems. By constructing probability density functions weighted by LD values, we embed geometric information into a statistical framework suitable for information-theoretic comparison. We benchmark physically constrained architectures (SympNet, HénonNet, Generalized Hamiltonian Neural Networks) against data-driven Reservoir Computing across two canonical systems. For the Duffing oscillator, all models recover the homoclinic orbit geometry with modest data requirements, though their accuracy near critical structures varies. For the three-mode nonlinear Schrödinger equation, however, clear differences emerge: symplectic architectures preserve energy but distort phase-space topology, while Reservoir Computing, despite lacking explicit physical constraints, reproduces the homoclinic structure with high fidelity. These results demonstrate the value of LD-based diagnostics for assessing not only predictive performance but also the global dynamical integrity of learned Hamiltonian models.

Keywords: Lagrangian Descriptors, Hamiltonian systems, neural network models, phase space integrity, symplectic architectures, Reservoir Computing, geometric structures, dynamical systems
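
As a rough illustration of what a Lagrangian descriptor measures, the sketch below computes one common form of it, the arc length of a trajectory integrated forward and backward in time, for an unforced Duffing oscillator. The vector field $\dot x = p,\ \dot p = x - x^3$, the integration window `tau`, the step `dt`, and the RK4 integrator are all assumptions; the abstract does not state the exact LD definition or Duffing parameters the authors use.

```python
import numpy as np


def duffing(state):
    """Unforced, undamped Duffing vector field: x' = p, p' = x - x**3."""
    x, p = state
    return np.array([p, x - x ** 3])


def rk4_step(state, dt):
    """One classical Runge-Kutta step (dt may be negative)."""
    k1 = duffing(state)
    k2 = duffing(state + 0.5 * dt * k1)
    k3 = duffing(state + 0.5 * dt * k2)
    k4 = duffing(state + dt * k3)
    return state + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0


def lagrangian_descriptor(x0, p0, tau=5.0, dt=0.01):
    """Approximate arc length of the trajectory through (x0, p0) on [-tau, tau]."""
    steps = int(round(tau / dt))
    total = 0.0
    for direction in (1.0, -1.0):  # forward, then backward in time
        state = np.array([x0, p0], dtype=float)
        for _ in range(steps):
            new_state = rk4_step(state, direction * dt)
            total += np.linalg.norm(new_state - state)
            state = new_state
    return total


ld_center = lagrangian_descriptor(1.0, 0.0)   # equilibrium point: no motion
ld_generic = lagrangian_descriptor(0.5, 0.0)  # a point on a periodic orbit
```

Evaluating this quantity on a grid of initial conditions gives a scalar field whose sharp features trace structures such as separatrices. Comparing that field between a learned model and the true system is the kind of diagnostic the abstract describes.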

273. ❌ Convergence of Byzantine-Resilient Gradient Tracking via Probabilistic Edge Dropout

Authors: Amirhossein Dezhboro, Fateme Maleki, Arman Adibi, Erfan Amini, Jose E. Ramirez-Marquez Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00449v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

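Scoring rationale: This paper studies Byzantine-resilient gradient tracking for distributed optimization, which belongs to distributed machine learning / optimization and is unrelated to all scoring keywords (which center on large-model and deep learning techniques, training methods, and applications). It does not involve large models, deep learning, or AI for Science; it focuses on secure optimization algorithms for distributed systems.

!!! tip deepseek-chat TL;DR

This paper proposes GT-PD, a Byzantine-resilient gradient tracking algorithm built on probabilistic edge dropout, and a variant, GT-PD-L, that adds leaky integration. Both converge linearly to a bounded neighborhood in networks with malicious agents. On MNIST, GT-PD-L outperforms coordinate-wise trimmed mean by up to 4.3 percentage points under stealth attacks.

Abstract (Chinese translation)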

我们研究存在拜占庭代理（可能发送任意对抗性消息）的网络中的分布式优化问题。本文提出带概率边丢弃的梯度跟踪方法（Gradient Tracking with Probabilistic Edge Dropout, GT-PD），这是一种随机梯度跟踪方法，能够在对抗性通信环境下保持梯度跟踪的收敛特性。GT-PD结合了两个互补的防御层：一是通用的以自我为中心的投影操作，将每个接收到的消息裁剪到接收代理周围半径为$\tau$的球内；二是完全去中心化的概率丢弃规则，该规则由决策通道和跟踪通道中的双度量信任分数驱动。该设计在限制对抗性扰动的同时，保留了双随机混合结构，而这一性质在去中心化场景的鲁棒聚合中常会丢失。在完全隔离拜占庭代理（$p_b=0$）的情况下，GT-PD线性收敛到一个仅由随机梯度方差决定的邻域。对于部分隔离（$p_b>0$）的情况，我们进一步提出带概率边丢弃与泄漏积分的梯度跟踪方法（Gradient Tracking with Probabilistic Edge Dropout and Leaky Integration, GT-PD-L），该方法通过泄漏积分器控制由持续扰动引起的跟踪误差累积，并线性收敛到一个由随机方差及裁剪-泄漏比率决定的有界邻域。我们还证明，在采用双层丢弃（$p_h=1$）时，隔离拜占庭代理不会向诚实节点的共识动态引入额外方差。在MNIST数据集上针对符号翻转（Sign Flip）、ALIE及内积操纵（Inner Product Manipulation）攻击的实验表明，在隐蔽攻击下，GT-PD-L比逐坐标裁剪均值方法（coordinate-wise trimmed mean）的性能高出最多4.3个百分点。

Abstract

We study distributed optimization over networks with Byzantine agents that may send arbitrary adversarial messages. We propose Gradient Tracking with Probabilistic Edge Dropout (GT-PD), a stochastic gradient tracking method that preserves the convergence properties of gradient tracking under adversarial communication. GT-PD combines two complementary defense layers: a universal self-centered projection that clips each incoming message to a ball of radius $\tau$ around the receiving agent, and a fully decentralized probabilistic dropout rule driven by a dual-metric trust score in the decision and tracking channels. This design bounds adversarial perturbations while preserving the doubly stochastic mixing structure, a property often lost under robust aggregation in decentralized settings. Under complete Byzantine isolation ($p_b=0$), GT-PD converges linearly to a neighborhood determined solely by stochastic gradient variance. For partial isolation ($p_b>0$), we introduce Gradient Tracking with Probabilistic Edge Dropout and Leaky Integration (GT-PD-L), which uses a leaky integrator to control the accumulation of tracking errors caused by persistent perturbations and achieves linear convergence to a bounded neighborhood determined by the stochastic variance and the clipping-to-leak ratio. We further show that under two-tier dropout with $p_h=1$, isolating Byzantine agents introduces no additional variance into the honest consensus dynamics. Experiments on MNIST under Sign Flip, ALIE, and Inner Product Manipulation attacks show that GT-PD-L outperforms coordinate-wise trimmed mean by up to 4.3 percentage points under stealth attacks.

Keywords: Byzantine-resilient, gradient tracking, probabilistic edge dropout, distributed optimization, adversarial communication, stochastic gradient, linear convergence, trust score
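
The first defense layer in the abstract, a self-centered projection that clips each incoming message to a ball of radius $\tau$ around the receiving agent, can be sketched as a Euclidean projection. The function name `self_centered_clip`, the use of the Euclidean norm, and the sample vectors are assumptions; the trust-score dropout rule is not shown because the abstract does not specify it.

```python
import numpy as np


def self_centered_clip(own, incoming, tau):
    """Project `incoming` onto the ball of radius `tau` centered at `own`."""
    own = np.asarray(own, dtype=float)
    offset = np.asarray(incoming, dtype=float) - own
    norm = np.linalg.norm(offset)
    if norm <= tau:
        return own + offset              # already inside the ball: unchanged
    return own + offset * (tau / norm)   # pull back onto the ball's surface


own_state = np.array([0.0, 0.0])

# A nearby (plausibly honest) message passes through untouched.
honest = self_centered_clip(own_state, [0.3, 0.4], tau=1.0)

# A far-away (possibly adversarial) message keeps its direction but is
# limited to distance tau from the receiver.
adversarial = self_centered_clip(own_state, [30.0, 40.0], tau=1.0)
```

Whatever a Byzantine neighbor sends, its effect on the receiver per message is bounded by $\tau$, which is how the design bounds adversarial perturbations.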

274. ❌ Internal State-Based Policy Gradient Methods for Partially Observable Markov Potential Games

Authors: Wonseok Yang, Thinh T. Doan Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00433v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

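Scoring rationale: This paper studies multi-agent reinforcement learning in partially observable Markov potential games, proposes an internal state-based natural policy gradient method, and establishes a non-asymptotic convergence bound. It is unrelated to most keywords, which mainly concern large language models, deep learning techniques, and model training optimization, whereas this paper focuses on the theoretical analysis and algorithm design of multi-agent reinforcement learning. The only relevant keyword is "Multi-agent Systems OR Agent Coordination": the paper explicitly studies multi-agent systems (multi-agent reinforcement learning) and agent coordination (compressing information into an internal state to enable coordination), so it receives 10 points (highly relevant). Other keywords such as "LLM Agents" and "AI for Science" involve agents or scientific applications, but the paper uses neither large language models nor domain-specific scientific methods, so they score 0.

!!! tip deepseek-chat TL;DR

This paper studies multi-agent reinforcement learning in partially observable Markov potential games, proposes an internal state-based natural policy gradient method, and proves a non-asymptotic convergence bound. Simulations show consistent performance improvements over the setting that uses only the current observation.

Abstract (Chinese translation)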

本文研究了部分可观测马尔可夫势博弈中的多智能体强化学习问题。由于部分可观测性、信息分散性以及维度灾难，解决该问题具有挑战性。首先，为应对前两个挑战，我们利用公共信息框架，使智能体能够基于共享信息与局部信息采取行动。其次，为确保问题可处理性，我们研究了一种内部状态，该状态对累积信息进行压缩，防止其随时间无限增长。随后，我们实现了一种基于内部状态的自然策略梯度方法，以寻找马尔可夫势博弈的纳什均衡。我们的主要贡献是为该方法建立了非渐近收敛性界。该理论界可分解为两个可解释的部分：一个在标准马尔可夫势博弈中同样存在的统计误差项，以及一个捕捉有限状态控制器使用情况的近似误差项。最后，通过在多个部分可观测环境中的仿真实验表明，与仅使用当前观测信息的设定相比，所提出的采用有限状态控制器的方法在性能上实现了持续提升。

Abstract

This letter studies multi-agent reinforcement learning in partially observable Markov potential games. Solving this problem is challenging due to partial observability, decentralized information, and the curse of dimensionality. First, to address the first two challenges, we leverage the common information framework, which allows agents to act based on both shared and local information. Second, to ensure tractability, we study an internal state that compresses accumulated information, preventing it from growing unboundedly over time. We then implement an internal state-based natural policy gradient method to find Nash equilibria of the Markov potential game. Our main contribution is to establish a non-asymptotic convergence bound for this method. Our theoretical bound decomposes into two interpretable components: a statistical error term that also arises in standard Markov potential games, and an approximation error capturing the use of finite-state controllers. Finally, simulations across multiple partially observable environments demonstrate that the proposed method using finite-state controllers achieves consistent improvements in performance compared to the setting where only the current observation is used.

Keywords: multi-agent reinforcement learning, partially observable Markov potential games, internal state, natural policy gradient, Nash equilibria, non-asymptotic convergence bound, finite-state controllers, decentralized information

275. ❌ Denoising distances beyond the volumetric barrier

Authors: Han Huang, Pakawut Jiradilok, Elchanan Mossel Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00432v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

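Scoring rationale: This paper studies reconstructing the latent geometry of a Riemannian manifold from a random geometric graph, which belongs to computational geometry, manifold learning, and statistical inference. All scoring keywords relate directly to large language models, deep learning techniques, or AI applications in science, whereas this paper is pure mathematical and statistical theory and involves no large models, deep learning, AI techniques, or scientific AI applications. Every keyword therefore has a relevance of 0.

!!! tip deepseek-chat TL;DR

This paper proposes ORDER, a new method for reconstructing the latent geometry of a Riemannian manifold from a random geometric graph. In polynomial time it achieves a distance estimation precision that beats the volumetric barrier for dimensions $d > 5$, and the paper proves that the Gromov–Wasserstein distance between the reconstructed metric space and the true manifold converges at a rate matching the Wasserstein convergence rate of empirical measures.

Abstract (Chinese translation)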

我们研究从随机几何图重建 $d$ 维黎曼流形潜在几何结构的问题。尽管近期研究在从随机几何图（更一般地，从含噪距离）恢复流形方面取得了显著进展，但成对距离估计的精度一直受到体积障碍的根本性限制，即由流形上一般点与最近采样点之间距离通常为 $n^{-1/d}$ 量级这一事实产生的自然样本间距尺度 $n^{-1/d}$。本文提出一种新方法——正交环距离估计程序（ORDER），该方法可在多项式时间内实现高达 $n^{-2/(d+5)}$ 量级（含 $n$ 的多对数因子）的逐点距离估计精度。对于维度 $d > 5$ 的情况，该精度严格突破了体积障碍的限制。
由于获得了优于 $n^{-1/d}$ 的逐点精度，我们证明重建的度量测度空间与真实潜在流形之间的格罗莫夫-瓦瑟斯坦距离为 $n^{-1/d}$ 量级。这与经验测度的瓦瑟斯坦收敛速率相匹配，表明我们重建的图度量渐近等价于拥有采样点完整成对距离矩阵的效果。我们的结果在非常一般的设定下得到证明，该设定涵盖含噪成对距离的一般模型、稀疏随机几何图以及未知连接概率函数。

Abstract

We study the problem of reconstructing the latent geometry of a $d$-dimensional Riemannian manifold from a random geometric graph. While recent works have made significant progress in manifold recovery from random geometric graphs, and more generally from noisy distances, the precision of pairwise distance estimation has been fundamentally constrained by the volumetric barrier, namely the natural sample-spacing scale $n^{-1/d}$ coming from the fact that a generic point of the manifold typically lies at distance of order $n^{-1/d}$ from the nearest sampled point. In this paper, we introduce a novel approach, Orthogonal Ring Distance Estimation Routine (ORDER), which achieves a pointwise distance estimation precision of order $n^{-2/(d+5)}$ up to polylogarithmic factors in $n$ in polynomial time. This strictly beats the volumetric barrier for dimensions $d > 5$. As a consequence of obtaining pointwise precision better than $n^{-1/d}$, we prove that the Gromov–Wasserstein distance between the reconstructed metric measure space and the true latent manifold is of order $n^{-1/d}$. This matches the Wasserstein convergence rate of empirical measures, demonstrating that our reconstructed graph metric is asymptotically as good as having access to the full pairwise distance matrix of the sampled points. Our results are proven in a very general setting which includes general models of noisy pairwise distances, sparse random geometric graphs, and unknown connection probability functions.

Keywords: manifold recovery, random geometric graph, distance estimation, volumetric barrier, Gromov-Wasserstein distance, metric reconstruction, Riemannian manifold, noisy distances
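
The threshold $d > 5$ in the abstract follows directly from comparing the two rates. As a short check, ignoring the polylogarithmic factors and assuming $n > 1$ and $d \ge 1$:

$$
n^{-2/(d+5)} < n^{-1/d}
\iff \frac{2}{d+5} > \frac{1}{d}
\iff 2d > d + 5
\iff d > 5 .
$$

For example, at $d = 6$ the ORDER exponent is $2/11 \approx 0.182$ against a volumetric exponent of $1/6 \approx 0.167$, so the ORDER precision is the smaller of the two. At $d = 5$ both exponents equal $1/5$ and there is no gain.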

276. ❌ Shapley-Guided Neural Repair Approach via Derivative-Free Optimization

Authors: Xinyu Sun, Wanwei Liu, Haoang Chi, Tingyu Chen, Xiaoguang Mao, Shangwen Wang, Lei Bu, Jingyi Wang, Yang Tan, Zhenyi Qi Source: arXiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00422v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

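Scoring rationale: The paper (SHARPEN) focuses on repairing defects in deep neural networks (DNNs), such as backdoors, adversarial attacks, and unfairness, using interpretable Shapley-value-based fault localization and derivative-free CMA-ES optimization. All scoring keywords concern large language models (LLMs) or applications of deep learning to science, whereas the paper addresses general DNN repair and does not cover LLM techniques, training methods, inference optimization, alignment, agent systems, or AI-for-science applications. All keywords therefore score 0, and the paper has no direct connection to the review scope.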

!!! tip deepseek-chat TL;DR

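The paper proposes SHARPEN, which repairs defects in deep neural networks such as backdoors, adversarial attacks, and unfairness through interpretable Shapley-value-based fault localization and derivative-free optimization (CMA-ES), and outperforms baseline methods on multiple tasks.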

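Abstract translation

Deep neural networks are susceptible to defects such as backdoors, adversarial attacks, and unfairness, which undermine their reliability. Existing approaches mainly involve retraining, optimization, constraint solving, or search algorithms. However, most of them rely on gradient computation, which restricts their applicability to specific activation functions (e.g., ReLU), or they use search algorithms whose localization and repair processes are hard to interpret. These methods also tend to generalize poorly across multiple properties. We propose SHARPEN, which combines interpretable fault localization with a derivative-free optimization strategy. First, SHARPEN introduces a Deep SHAP-based localization strategy that quantifies the marginal contribution of each layer and each neuron to erroneous outputs. Specifically, it takes a hierarchical coarse-to-fine approach: layers are reranked by aggregated impact, and faulty neurons/filters are then located by analyzing activation differences between property-violating and benign states. Next, SHARPEN uses CMA-ES to repair the identified neurons. CMA-ES uses a covariance matrix to capture dependencies between variables, enabling gradient-free search and coordinated adjustment of coupled neurons. By combining interpretable localization with evolutionary optimization, SHARPEN achieves derivative-free repair across network architectures and is less sensitive to gradient anomalies and hyperparameters. We validate SHARPEN on three repair tasks. It balances property repair against accuracy preservation and outperforms baselines in backdoor removal (+10.56%), adversarial mitigation (+5.78%), and unfairness repair (+11.82%). Notably, SHARPEN handles diverse tasks, and its modular design is plug-and-play with different derivative-free optimizers, which makes it highly flexible.

Abstract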

DNNs are susceptible to defects like backdoors, adversarial attacks, and unfairness, undermining their reliability. Existing approaches mainly involve retraining, optimization, constraint-solving, or search algorithms. However, most methods rely on gradient calculations, restricting applicability to specific activation functions (e.g., ReLU), or use search algorithms with uninterpretable localization and repair. Furthermore, they often lack generalizability across multiple properties. We propose SHARPEN, integrating interpretable fault localization with a derivative-free optimization strategy. First, SHARPEN introduces a Deep SHAP-based localization strategy quantifying each layer’s and neuron’s marginal contribution to erroneous outputs. Specifically, a hierarchical coarse-to-fine approach reranks layers by aggregated impact, then locates faulty neurons/filters by analyzing activation divergences between property-violating and benign states. Subsequently, SHARPEN incorporates CMA-ES to repair identified neurons. CMA-ES leverages a covariance matrix to capture variable dependencies, enabling gradient-free search and coordinated adjustments across coupled neurons. By combining interpretable localization with evolutionary optimization, SHARPEN enables derivative-free repair across architectures, being less sensitive to gradient anomalies and hyperparameters. We demonstrate SHARPEN’s effectiveness on three repair tasks. Balancing property repair and accuracy preservation, it outperforms baselines in backdoor removal (+10.56%), adversarial mitigation (+5.78%), and unfairness repair (+11.82%). Notably, SHARPEN handles diverse tasks, and its modular design is plug-and-play with different derivative-free optimizers, highlighting its flexibility.

Keywords: DNN repair, fault localization, Shapley value, derivative-free optimization, CMA-ES, backdoor removal, adversarial mitigation, unfairness repair
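
The "marginal contribution" that the abstract attributes to each neuron is the quantity a Shapley value averages over all coalitions. The sketch below computes exact Shapley values for a three-neuron toy game; it is not Deep SHAP (which approximates these values for real networks) and not SHARPEN's code. The neuron names and the `toy_error` function are made up for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values; `value` maps a frozenset of players to a number."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of coalitions of size k that exclude player p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def toy_error(active):
    """Hypothetical 'erroneous output' produced by a set of active neurons."""
    error = 1.0 if "n1" in active else 0.0
    # n2 and n3 only contribute when both are active (a coupled pair).
    if "n2" in active and "n3" in active:
        error += 0.5
    return error


phi = shapley_values(["n1", "n2", "n3"], toy_error)

if __name__ == "__main__":
    print(phi)
```

The attributions sum to the total error of the full set (the efficiency property), and the coupled pair `n2`/`n3` splits its joint contribution evenly.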

277. ❌ Analytical characterisation of the Mi- and To-phases in HeMiTo dynamics: exponential growth and logistic saturation of toxic prion-like proteins

Authors: Johannes G. Borgqvist Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00871v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

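Scoring rationale: The paper studies a mathematical model (the HeMiTo framework) of prion-like protein propagation in neurodegenerative diseases and belongs to bioinformatics/computational biology. It is unrelated to most of the LLM/deep-learning keywords (scored 0). Only two keywords are relevant: (1) ‘Mechanistic Interpretability OR Explainable AI’ (5 points), because the paper mentions a ‘mechanistically interpretable growth rate’, which touches on mechanistic interpretation; and (2) ‘AI for Science OR Bioinformatics OR Cheminformatics’ (5 points), because studying protein dynamics falls under bioinformatics/scientific AI applications. However, the paper uses analytical modelling based on differential equations rather than LLMs or deep learning, so its relevance is limited.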

!!! tip deepseek-chat TL;DR

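Using analytical methods, the paper fully characterises the mixed-phase and toxic-phase dynamics of prion-like protein propagation in heterodimer models, derives explicit conditions for exponential growth of the toxic species, and links that growth to a logistic growth equation, laying a foundation for predictive modelling of disease progression.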

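Abstract translation

Prion-like propagation of misfolded proteins is a key mechanism in the progression of neurodegenerative diseases such as Alzheimer's disease. In previous work, we proposed the HeMiTo framework, which describes these prion-like dynamics for a class of heterodimer models in terms of three phases: the healthy (He), mixed (Mi), and toxic (To) phases. The He-phase was characterised analytically, the Mi-phase was described numerically, and the To-phase was inferred from linear stability analysis.
In this work, we provide a complete analytical characterisation of the Mi- and To-phases for this class of heterodimer models. We derive exact inner solutions governing the Mi-phase and match them with the outer solutions of the He-phase, which explains the concave-like behaviour of the healthy species and establishes explicit conditions for exponential growth of the toxic species with a mechanistically interpretable growth rate. In addition, we formalise a quasi-steady-state reduction near the toxic steady state and show that the dynamics reduce to a logistic growth equation, linking exponential growth to saturation.
Together, these results provide a unified, mechanistic description of prion-like dynamics across all stages of disease progression and lay a foundation for predictive modelling of biomarker trajectories.

Abstract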

Prion-like propagation of misfolded proteins is a key mechanism underlying the progression of neurodegenerative diseases such as Alzheimer’s disease. In previous work, we introduced the HeMiTo framework, describing these prion-like dynamics for a class of heterodimer models in terms of three phases: the healthy (He), mixed (Mi), and toxic (To) phases. While the He-phase was characterised analytically, the Mi-phase was described numerically and the To-phase was inferred from linear stability arguments. In this work, we provide a complete analytical characterisation of the Mi- and To-phases for our class of heterodimer models. We derive exact inner solutions governing the Mi-phase and match them with outer solutions from the He-phase, explaining the concave-like behaviour of the healthy species and establishing explicit conditions for exponential growth of the toxic species with a mechanistically interpretable growth rate. Furthermore, we formalise a quasi steady-state reduction near the toxic steady state and show that the dynamics reduce to a logistic growth equation, linking exponential growth to saturation. Together, these results provide a unified and mechanistic description of prion-like dynamics across all phases of disease progression and establish a foundation for predictive modelling of biomarker trajectories.

Keywords: prion-like propagation, neurodegenerative diseases, heterodimer models, analytical characterization, exponential growth, logistic saturation, toxic proteins, biomarker trajectories
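
The link between exponential growth and saturation mentioned in the abstract is a generic property of the logistic equation $\dot{x} = r x (1 - x/K)$. The sketch below uses its standard closed-form solution; the symbols `x0`, `r`, and `K` and their values are illustrative assumptions, not the paper's reduced model or parameters.

```python
import math


def logistic(t: float, x0: float, r: float, K: float) -> float:
    """Closed-form solution of dx/dt = r * x * (1 - x / K) with x(0) = x0."""
    return K / (1.0 + (K - x0) / x0 * math.exp(-r * t))


def exponential(t: float, x0: float, r: float) -> float:
    """Early-time approximation, valid while x is much smaller than K."""
    return x0 * math.exp(r * t)


if __name__ == "__main__":
    x0, r, K = 1e-6, 1.0, 1.0  # hypothetical values
    for t in (0.0, 2.0, 10.0, 14.0, 30.0):
        print(t, logistic(t, x0, r, K), exponential(t, x0, r))
```

For small `t` the two curves agree closely; once `x` approaches `K`, the logistic solution flattens while the exponential keeps growing.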

278. ❌ Evaluation of neuroCombat and deep learning harmonization for multi-site magnetic resonance neuroimaging in youth with prenatal alcohol exposure

Authors: Chloe Scholten, Elyssa M. McMaster, Adam M. Saunders, Michael E. Kim, Gaurav Rudravaram, Elias Levy, Bryce Geeraert, Lianrui Zuo, Simon Vandekar, Catherine Lebel, Bennett A. Landman Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2604.00251v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

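Scoring rationale: The paper compares a deep learning harmonization method (HACA3) with a statistical method (neuroCombat) for medical imaging (MRI) in a pediatric population, an application of AI in biomedicine. It is unrelated to the vast majority of keywords (e.g., LLM, MoE, RLHF, RAG), which concern large language models, training techniques, and inference optimization, whereas this paper focuses on deep learning models for medical image processing. The only relevant keyword is ‘AI for Science OR Bioinformatics OR Cheminformatics’, since the paper applies AI to biomedicine (neuroimaging); however, this is not a core innovation (the paper mainly compares existing methods), so it receives 5 points (some connection).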

!!! tip deepseek-chat TL;DR

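The study compares the deep learning method HACA3 and the statistical method neuroCombat at removing site-related variance from multi-site MRI data while preserving biologically relevant signal. It finds that the statistical method is more effective at reducing site-related variance, and that HACA3 must be combined with statistical methods to approach maximal preservation of biological signal.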

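Abstract translation

In studies of prevalent diseases and disorders such as prenatal alcohol exposure (PAE), multi-site data collection helps increase study sample sizes. However, multi-site studies introduce additional variability through heterogeneous collection materials, such as scanners and acquisition protocols, and this variability is confounded with biologically relevant signals. Neuroscientists typically apply statistical methods to image-derived metrics, such as region-of-interest volumes, after all image processing is complete in order to minimize site-related variance. HACA3, a deep learning harmonization method, makes it possible to harmonize image signals before metric quantification, but it has not yet been validated in a pediatric cohort. In a pediatric population (ages 7 to 21) with controls and PAE cases scanned on three different scanners, this study uses downstream MaCRUISE volume metrics to evaluate how well HACA3 removes site-related variance and preserves biologically relevant signal compared with the statistical method neuroCombat, and also pairs HACA3 processing with neuroCombat to test the efficacy of multiple harmonization methods. We find that HACA3 qualitatively improves inter-site contrast differences, but ANCOVA tests show that statistical methods remove more site-related variance from the MaCRUISE volume metrics; in this setting, HACA3 relies on follow-up statistical methods to approach maximal preservation of biological signal.

Abstract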

In cases of prevalent diseases and disorders, such as Prenatal Alcohol Exposure (PAE), multi-site data collection allows for increased study samples. However, multi-site studies introduce additional variability through heterogeneous collection materials, such as scanner and acquisition protocols, which confound with biologically relevant signals. Neuroscientists often utilize statistical methods on image-derived metrics, such as volume of regions of interest, after all image processing to minimize site-related variance. HACA3, a deep learning harmonization method, offers an opportunity to harmonize image signals prior to metric quantification; however, HACA3 has not yet been validated in a pediatric cohort. In this work, we investigate HACA3’s ability to remove site-related variance and preserve biologically relevant signal compared to a statistical method, neuroCombat, and pair HACA3 processing with neuroCombat to evaluate the efficacy of multiple harmonization methods in a pediatric (age 7 to 21) population across three unique scanners with controls and cases of PAE with downstream MaCRUISE volume metrics. We find that HACA3 qualitatively improves inter-site contrast variations, but statistical methods reduce greater site-related variance within the MaCRUISE volume metrics following an ANCOVA test, and HACA3 relies on follow-up statistical methods to approach maximal biological preservation in this context.

Keywords: deep learning harmonization, multi-site MRI, neuroimaging, prenatal alcohol exposure, HACA3, neuroCombat, site-related variance, biological signal preservation

279. ❌ Harmonization mitigates diffusion MRI scanner effects in infancy: insights from the HEALthy Brain and Childhood Development (HBCD) study

Authors: Elyssa M. McMaster, Gaurav Rudravaram, Michael E. Kim, Trent M. Schwartz, Chloe Scholten, Jongyeon Yoon, Adam M. Saunders, Andre T. S. Hucke, Karthik Ramadass, Emily M. Harriott, Steven L. Meisler, Simon N. Vandekar, Allen Newton, Seth A. Smith, Saikat Sengupta, Kathryn L. Humphreys, Sarah Osmundson, Daniel Moyer, Laurie E. Cutting, Bennett A. Landman Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2604.00246v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

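Scoring rationale: The paper studies scanner effects and harmonization methods for infant diffusion MRI (dMRI) data and belongs to medical image processing. Its content does not involve large models, the principles of deep learning techniques, AI-for-science applications, or any of the technical concepts in the scoring keywords. All keywords relate to large language models, deep learning techniques, or scientific applications of AI, whereas this paper is purely a methodological study in medical imaging, so every keyword has a relevance of 0.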

!!! tip deepseek-chat TL;DR

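The study analyzes scanner-model-related differences in diffusion MRI data from the HBCD Study and shows that ComBat-GAM harmonization effectively removes them, providing a data quality control method for large-scale brain development studies.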

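Abstract translation

The HEALthy Brain and Childhood Development (HBCD) Study is an ongoing longitudinal study that aims to understand population-level brain development; however, large-scale studies must overcome site-related variance while preserving biologically relevant signal. In addition to diffusion-weighted magnetic resonance images, the HBCD dataset provides researchers with analysis-ready derivatives, including scalar diffusion tensor (DTI) metrics in a predetermined set of fiber bundles. This study aims to characterize HBCD-specific site effects in diffusion MRI data, which have not yet been systematically reported. We investigate the sensitivity of HBCD bundle metrics to scanner-model-related variance and address this variance with ComBat-GAM harmonization across six scanner models in the current HBCD data release 1.1. After ComBat-GAM, we observe no statistically significant differences between the data distributions of any scanner models following false discovery rate correction, and Cohen's f effect sizes decrease for all metrics. Our work underscores the importance of rigorous harmonization in large-scale studies, and we recommend that future studies of HBCD data control for these effects.

Abstract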

The HEALthy Brain and Childhood Development (HBCD) Study is an ongoing longitudinal initiative to understand population-level brain maturation; however, large-scale studies must overcome site-related variance and preserve biologically relevant signal. In addition to diffusion-weighted magnetic resonance imaging images, the HBCD dataset offers analysis-ready derivatives for scientists to conduct their analysis, including scalar diffusion tensor (DTI) metrics in a predetermined set of bundles. The purpose of this study is to characterize HBCD-specific site effects in diffusion MRI data, which have not been systematically reported. In this work, we investigate the sensitivity of HBCD bundle metrics to scanner model-related variance and address these variations with ComBat-GAM harmonization within the current HBCD data release 1.1 across six scanner models. Following ComBat-GAM, we observe zero statistically significant differences between the distributions from any scanner model following FDR correction and reduce Cohen’s f effect sizes across all metrics. Our work underscores the importance of rigorous harmonization efforts in large-scale studies, and we encourage future investigations of HBCD data to control for these effects.

Keywords: diffusion MRI, scanner effects, harmonization, ComBat-GAM, HBCD study, brain development, data quality, longitudinal study
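
The abstract reports reduced Cohen's f effect sizes after harmonization. As a rough illustration of that metric, the sketch below computes Cohen's f for a one-way design as $\sqrt{SS_\text{between} / SS_\text{within}}$, before and after a deliberately naive per-scanner mean-centering. The centering step is a stand-in for illustration only and is not ComBat-GAM; the numbers are made up.

```python
import math


def mean(values):
    return sum(values) / len(values)


def cohens_f(groups):
    """Cohen's f for a one-way design: sqrt(SS_between / SS_within)."""
    all_values = [v for g in groups for v in g]
    grand = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return math.sqrt(ss_between / ss_within)


def center_by_group(groups):
    """Naive location adjustment: shift every group mean to the grand mean."""
    grand = mean([v for g in groups for v in g])
    return [[v - mean(g) + grand for v in g] for g in groups]


# Hypothetical metric values from two scanner models.
scanners = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]

if __name__ == "__main__":
    print("before:", cohens_f(scanners))
    print("after: ", cohens_f(center_by_group(scanners)))
```

Real harmonization methods must also preserve covariate effects such as age, which simple centering does not attempt.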

280. ❌ UCell: rethinking generalizability and scaling of bio-medical vision models

Authors: Nicholas Kuang, Vanessa Scalon, Ji Yu Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2604.00243v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

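Scoring rationale: The paper focuses on biomedical image analysis (single-cell segmentation), which falls under AI for Science/Bioinformatics and is highly relevant to the last keyword (10 points). However, it studies a computer vision model (UCell) rather than large language models (LLMs) or related techniques (e.g., MoE, SFT, RLHF, RAG), so all other keywords are irrelevant (0 points). The paper does not mention any of the designated expert authors.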

!!! tip deepseek-chat TL;DR

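For biomedical single-cell segmentation, the paper designs a parameter-efficient small model with a recursive structure (UCell, 10-30M parameters) that matches the performance of models 10-20 times larger and shows good generalization and few-shot fine-tuning adaptability, challenging the prevailing paradigm that model size must keep growing.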

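Abstract translation

Modern deep learning is a scale-centric field. Research has shown that, among models of similar architecture, larger models consistently outperform smaller ones. In many sub-domains of biomedical research, however, model scaling is limited by the amount of available training data and by the high cost of generating and validating additional high-quality data. Despite this practical hurdle, most current research still focuses on building larger foundation models, while the alternative of improving the capability of small models remains under-explored. Here we experiment with building models of 10 to 30 million parameters (tiny by modern standards) for the single-cell segmentation task. A key design choice is to incorporate a recursive structure into the model's forward computation graph, which yields a more parameter-efficient architecture. We find that, on multiple single-cell segmentation benchmarks, our small model UCell matches the performance of models 10 to 20 times its size and generalizes similarly to unseen out-of-domain data. More importantly, we find that UCell can be trained from scratch using only a set of microscopy imaging data, without relying on large-scale pretraining on natural images, which decouples model building from any external commercial interests. Finally, we examine and confirm UCell's adaptability through a wide range of one-shot and few-shot fine-tuning experiments on a variety of small datasets. The implementation is available at https://github.com/jiyuuchc/ucell.

Abstract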

The modern deep learning field is a scale-centric one. Larger models have been shown to consistently perform better than smaller models of similar architecture. In many sub-domains of biomedical research, however, the model scaling is bottlenecked by the amount of available training data, and the high cost associated with generating and validating additional high quality data. Despite the practical hurdle, the majority of the ongoing research still focuses on building bigger foundation models, whereas the alternative of improving the ability of small models has been under-explored. Here we experiment with building models with 10-30M parameters, tiny by modern standards, to perform the single-cell segmentation task. An important design choice is the incorporation of a recursive structure into the model’s forward computation graph, leading to a more parameter-efficient architecture. We found that for the single-cell segmentation, on multiple benchmarks, our small model, UCell, matches the performance of models 10-20 times its size, and with a similar generalizability to unseen out-of-domain data. More importantly, we found that ucell can be trained from scratch using only a set of microscopy imaging data, without relying on massive pretraining on natural images, and therefore decouples the model building from any external commercial interests. Finally, we examined and confirmed the adaptability of ucell by performing a wide range of one-shot and few-shot fine tuning experiments on a diverse set of small datasets. Implementation is available at https://github.com/jiyuuchc/ucell

Keywords: biomedical vision models, single-cell segmentation, small models, parameter-efficient architecture, generalizability, few-shot fine-tuning, model scaling, recursive structure
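
One way a recursive structure can save parameters is weight sharing: the same block is applied several times in the forward pass, so compute depth grows while the number of unique weights does not. The abstract does not describe UCell's actual architecture, so the sketch below is only parameter-count arithmetic for a hypothetical 3x3 convolution block; the channel width and depth are arbitrary.

```python
def conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights plus biases of a single k x k convolution layer."""
    return k * k * c_in * c_out + c_out


def stacked_params(channels: int, depth: int) -> int:
    """`depth` distinct conv layers, each with its own weights."""
    return depth * conv_params(channels, channels)


def recursive_params(channels: int, depth: int) -> int:
    """One conv layer reused `depth` times; depth adds no new weights."""
    return conv_params(channels, channels)


if __name__ == "__main__":
    channels, depth = 256, 8  # hypothetical sizes
    print("stacked:  ", stacked_params(channels, depth))
    print("recursive:", recursive_params(channels, depth))
```

In this toy setting the shared block has the same effective depth as the stack with one eighth of the weights; whether accuracy is preserved is an empirical question that the arithmetic cannot answer.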

281. ❌ Analytic nuclear gradients including oriented external electric fields in a molecule-fixed frame

Authors: Duc Anh Lai, Devin A. Matthews Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01189v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

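Scoring rationale: The paper studies computational chemistry methods for molecules in external electric fields and belongs to computational/quantum chemistry. It is unrelated to all of the LLM/deep-learning keywords (scored 0). The only possibly relevant keyword is ‘AI for Science OR Bioinformatics OR Cheminformatics’, because the paper involves computational chemistry modelling, a scientific computing application; however, it relies on traditional quantum chemistry calculations rather than AI/machine learning, so it receives 5 points (some connection, but not core).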

!!! tip deepseek-chat TL;DR

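The paper proposes two molecular reference frames (the principal axis frame and the local reference frame) for defining oriented electric fields within a molecule, derives and implements analytic nuclear gradients in the presence of external electric fields, and applies them to field-dependent geometry optimizations of cis- and trans-formanilide, validating the accuracy of the formalism and opening new opportunities for computational modelling of electric field-controlled chemistry.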

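Abstract translation

Electric field-assisted chemistry has attracted wide attention in recent years, particularly the use of oriented external electric fields to control molecular structure and reactivity. Such fields have been explored in many areas, including switching materials, nanoparticles, controllable catalysts, medicines, and clinical therapies. For flexible molecules, however, fields with a fixed direction in the laboratory frame are of limited use, because conformational changes can significantly alter the molecule's spatial orientation. This work proposes two molecular reference frames, the principal axis frame and the local reference frame, to define oriented electric fields within the molecular framework. These frames effectively eliminate ambiguity in the relative orientation between the applied field and the molecule. We derive and implement analytic nuclear gradients in the presence of an external field, with an initial application to field-dependent geometry optimizations of cis- and trans-formanilide. Analysis of the field-induced equilibrium structures reveals markedly different structural responses, validating the accuracy and robustness of the proposed formalism. The analytic gradient framework enables systematic study of molecular properties and reactivity under arbitrarily oriented electric fields and opens new avenues for computational simulation and rational design in electric field-controlled chemistry.

Abstract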

Electric field-assisted chemistry has attracted much attention in recent years, particularly in the context of oriented external electric fields for controlling molecular structure and reactivity. Such fields have been explored in a wide range of applications, including switching materials, nanoparticles, controllable catalysts, medicines, and clinical therapies. However, the use of fixed fields in the laboratory frame becomes ineffective for flexible molecules, as conformational changes can significantly alter their orientations. In this work, we propose two molecular reference frames – the principal axis frame and the local reference frame – to define oriented electric fields within the molecular framework. These coordinate systems powerfully eliminate ambiguities in the relative orientation between the applied field and the molecule. Analytic nuclear gradients in the presence of external electric fields are derived and implemented, with an initial application to field-dependent geometry optimizations of cis- and trans-formanilide. Analysis of the resulting field-induced equilibrium structures reveals distinct structural responses, validating the accuracy and robustness of the proposed formalism. The analytic gradient framework enables systematic investigations of molecular properties and reactivity under arbitrarily oriented electric fields, opening new opportunities for computational modeling and rational design in electric field-controlled chemistry.

Keywords: oriented external electric fields, molecular reference frames, analytic nuclear gradients, field-dependent geometry optimization, electric field-controlled chemistry, computational modeling, cis-trans formanilide, molecular structure and reactivity
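
The orientation ambiguity that motivates a molecule-fixed frame can be seen with the first-order dipole interaction $E = -\boldsymbol{\mu} \cdot \mathbf{F}$. In the sketch below, a rigid toy molecule is rotated about the z axis: a field fixed in the laboratory frame gives an energy that depends on the rotation angle, while a field defined in the molecule's own axes rotates with it and gives a constant energy. The dipole and field values are arbitrary, and the paper's principal axis and local reference frames are built from the molecular geometry rather than from a known rotation as assumed here.

```python
import math


def rotate_z(v, angle):
    """Rotate a 3-vector about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = v
    return (c * x - s * y, s * x + c * y, z)


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def interaction_energy(dipole, field):
    """First-order dipole-field interaction, E = -mu . F."""
    return -dot(dipole, field)


DIPOLE_MOL = (1.0, 0.0, 0.0)  # dipole expressed in molecule-fixed axes
FIELD = (0.01, 0.0, 0.0)      # field components (arbitrary units)


def energies(angle):
    """Return (lab-fixed-field energy, molecule-fixed-field energy)."""
    dipole_lab = rotate_z(DIPOLE_MOL, angle)
    e_lab_fixed = interaction_energy(dipole_lab, FIELD)
    # A field defined in the molecular frame co-rotates with the molecule.
    e_mol_fixed = interaction_energy(dipole_lab, rotate_z(FIELD, angle))
    return e_lab_fixed, e_mol_fixed


if __name__ == "__main__":
    for angle in (0.0, math.pi / 2, math.pi):
        print(round(angle, 3), energies(angle))
```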

282. ❌ High Performance Quantum Emulation for Chemistry Applications with Hyperion

Authors: Olivier Adjoua, Siwar Badreddine, César Feniou, Igor Chollet, Diata Traore, Guillaume Michel, Jean-Philip Piquemal Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.01176v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

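Scoring rationale: The paper focuses on the development of Hyperion, a quantum emulator for quantum chemistry simulation, and belongs to scientific computing. All keywords relate to deep learning and large-model techniques, which the paper does not involve. It has only a weak connection (5 points) to the last keyword, ‘AI for Science OR Bioinformatics OR Cheminformatics’, because quantum chemistry simulation can be viewed as a scientific computing application, though the paper does not explicitly use AI methods. All other keywords are irrelevant (0 points).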

!!! tip deepseek-chat TL;DR

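The paper develops Hyperion, a high-performance GPU-accelerated quantum emulator that uses a hybrid SV-MPS strategy to address memory and accuracy problems in quantum chemistry simulation, enabling emulation of 36-40 qubit systems with controlled approximations.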

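Abstract translation

The strategic demand for quantum hardware currently exceeds what near-term devices can provide, so high-performance software emulators are needed to validate new protocols. This paper introduces Hyperion, a massively parallel, GPU-accelerated quantum emulator designed to get past the classical memory wall inherent in strongly correlated quantum chemistry simulations. Hyperion uses custom-optimized sparse matrix-sparse vector (SpMspV) kernels to natively accelerate exact matrix-vector multiplications, enabling strictly accurate state-vector (SV) ADAPT-VQE simulations of up to 32 qubits on multi-node platforms. To scale beyond this hardware limit, we address a trade-off in pure matrix product state (MPS) emulators: standard compression causes severe truncation errors, while strict compression triggers an intractable explosion in tensor rank. We propose a novel partitioned emulation strategy, the SV-MPS method: by routing non-interacting terms to an exact sparse SV core and delegating interacting terms to the MPS engine, it achieves emulation of 36 to 40 qubits with controlled approximations. This partitioning significantly reduces GPU resource requirements and maintains robust accuracy across ADAPT-VQE iterations. Ultimately, Hyperion provides a high-fidelity platform for developing new quantum algorithms for chemistry, able to model realistic chemical systems at accuracies approaching the exact full configuration interaction (FCI) / complete basis set (CBS) limit.

Abstract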

The strategic demand for quantum hardware currently outpaces the availability of near-term devices, necessitating high-performance software emulators to validate novel protocols. We introduce Hyperion, a massively parallel, GPU-accelerated quantum emulator architected to bypass the classical memory walls inherent in strongly correlated quantum chemistry simulations. Hyperion leverages custom-optimized Sparse Matrix-Sparse Vector (SpMspV) kernels to natively accelerate exact matrix-vector multiplications, enabling strictly accurate State-Vector (SV) ADAPT-VQE simulations for up to 32 qubits on multi-node platforms. To scale beyond this hardware limit, we address the trade-off in pure Matrix Product State (MPS) emulators, where standard compression yields severe truncation errors and strict compression triggers intractable tensor rank explosions. We propose a novel partitioned emulation, namely the SV-MPS strategy: by routing non-interacting terms into an exact sparse SV core and delegating interacting terms to the MPS engine, this approach achieves emulation of 36 to 40 qubits with controlled approximations. This partitioning significantly reduces GPU resource requirements while maintaining robust accuracy across ADAPT-VQE iterations. Ultimately, Hyperion offers a high-fidelity platform dedicated to the development of new quantum algorithms for chemistry, enabling the modeling of realistic chemical systems at accuracies approaching the exact Full Configuration Interaction (FCI) / Complete Basis Set (CBS) limit.

Keywords: quantum emulator, GPU-accelerated, quantum chemistry, ADAPT-VQE, Matrix Product State, sparse matrix-vector multiplication, high-performance computing, Full Configuration Interaction
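
The "classical memory wall" in the abstract follows from the size of a state vector: $n$ qubits need $2^n$ complex amplitudes. The sketch below estimates the footprint of a dense state vector, assuming 16 bytes per amplitude (double-precision complex). This is a back-of-the-envelope bound, not a description of Hyperion's storage, which the abstract says relies on sparse kernels.

```python
def dense_statevector_bytes(n_qubits: int, bytes_per_amplitude: int = 16) -> int:
    """Memory for a dense state vector of 2**n_qubits complex amplitudes."""
    return (2 ** n_qubits) * bytes_per_amplitude


def to_gib(n_bytes: int) -> float:
    return n_bytes / 2 ** 30


if __name__ == "__main__":
    for n in (32, 36, 40):
        print(n, "qubits:", to_gib(dense_statevector_bytes(n)), "GiB")
```

Each extra qubit doubles the requirement, so going from 32 to 40 qubits multiplies the dense footprint by 256, which is the kind of growth a partitioned or compressed representation is meant to avoid.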

283. ❌ Thermodynamics-Informed Accurate pKa Prediction and Protonation State Generation in PlayMolecule AI

Authors: Francesco Pesce, Stephen Farr, Gianni de Fabritiis Journal/Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00841v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

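Scoring rationale: The paper focuses on pKa prediction and protonation-state generation for drug discovery, an application of AI to science (specifically cheminformatics/bioinformatics). Its core technology builds on the Uni-pKa framework and the Uni-Mol model and involves molecular representation learning, GPU-accelerated conformer generation, and thermodynamically consistent modeling, but it does not involve large language models (LLMs) or innovations in the underlying principles of deep learning. It is therefore highly relevant only to the keyword 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points); all other keywords are unrelated to the paper (0 points).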
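The pass mark of 26.6 in the score line above follows from the formula in the report's configuration section (5.0 + 0.8 × total weight, with 27 keywords of weight 1.0 each). A minimal sketch of that check is given below; the function names are hypothetical, and treating a score equal to the pass mark as passing is an assumption, since the report does not state how ties are handled.

```python
def pass_mark(total_weight: float, base: float = 5.0, slope: float = 0.8) -> float:
    """Pass mark as configured in the report: base + slope * total weight."""
    return round(base + slope * total_weight, 1)


def passes(score: float, total_weight: float) -> bool:
    """Assumption: a paper passes when its score is at least the pass mark."""
    return score >= pass_mark(total_weight)


if __name__ == "__main__":
    total_weight = 27 * 1.0  # 27 keywords, weight 1.0 each
    print(pass_mark(total_weight))    # pass mark used throughout the report
    print(passes(0.0, total_weight))  # this paper's total score is 0.0
```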

!!! tip deepseek-chat TL;DR

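The paper presents the AcepKa application, which builds on the Uni-pKa framework and the Uni-Mol model to deliver thermodynamically consistent, accurate pKa prediction and protonation-state generation, and which is integrated into the PlayMolecule AI platform to accelerate drug discovery.

Abstract (Chinese translation)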

准确预测酸解离常数（p$K_{\rm a}$）并确定优势质子化状态在药物发现中至关重要，它影响着溶解度、渗透性以及蛋白质-配体结合等分子性质。我们推出Acep$K_{\rm a}$，这是一个集成于PlayMolecule AI平台的先进应用。Acep$K_{\rm a}$建立在理论严谨的Uni-p$K_{\rm a}$框架之上，该框架将统计力学与表示学习相统一。通过模拟完整的质子化系综，而非将p$K_a$视为标量回归目标，Acep$K_{\rm a}$确保了耦合电离位点间的热力学一致性。我们描述了该应用的增强架构，其特点是采用重新训练的Uni-Mol主干网络，在标准基准测试中实现了最先进的性能。此外，我们详述了关键的工程进展。这包括AceConfgen——一个专有的GPU加速构象生成器，与英伟达的nvmolkit相比实现了约40倍的加速；一个用于直接质子化分子的高效推理引擎；以及一种将质子化状态应用于结合配体构象的3D感知模态。最后，我们讨论了Acep$K_{\rm a}$与PlayMolecule AI生态系统的集成，该生态系统是一个用于分子建模与药物发现的现代化人工智能辅助环境。

Abstract

Accurate prediction of acid dissociation constants (p$K_{\rm a}$) and the determination of dominant protonation states is critical in drug discovery, influencing molecular properties such as solubility, permeability, and protein-ligand binding. We present Acep$K_{\rm a}$, an advanced application integrated into the PlayMolecule AI platform. Acep$K_{\rm a}$ is built upon the theoretically rigorous Uni-p$K_{\rm a}$ framework, which unifies statistical mechanics with representation learning. By modeling the complete protonation ensemble rather than treating p$K_a$ as a scalar regression target, Acep$K_{\rm a}$ ensures thermodynamic consistency across coupled ionization sites. We describe the application’s enhanced architecture, which features a retrained Uni-Mol backbone achieving state-of-the-art performance on standard benchmarks. Furthermore, we detail critical engineering advancements. These include AceConfgen, a proprietary GPU-accelerated conformer generator that achieves a ~40x speed-up compared to NVIDIA’s nvmolkit, a streamlined inference engine to directly protonate molecules, and a 3D-aware modality for applying protonation states to bound ligand poses. Finally, we discuss the integration of Acep$K_{\rm a}$ into the PlayMolecule AI ecosystem, a modern AI-assisted environment for molecular modelling and drug discovery.

Keywords: pKa prediction, protonation state generation, drug discovery, Uni-pKa framework, Uni-Mol model, thermodynamic consistency, GPU-accelerated conformer generator, PlayMolecule AI platform
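The abstract above contrasts modeling the full protonation ensemble with treating pKa as a single regression target. As a textbook illustration of what thermodynamic consistency across coupled sites means, consider a diprotic acid with two ionizable sites: its four microscopic constants must close a thermodynamic cycle, and the macroscopic pKa values follow from them. This is not AcepKa's or Uni-pKa's implementation, and the microscopic pKa values below are invented for the example.

```python
import math


def first_macro_pka(pk_1: float, pk_2: float) -> float:
    """Macroscopic pKa1: either site may lose the first proton, so Ka1 = k1 + k2."""
    return -math.log10(10.0 ** -pk_1 + 10.0 ** -pk_2)


def second_macro_pka(pk_12: float, pk_21: float) -> float:
    """Macroscopic pKa2 over both singly protonated forms: 1/Ka2 = 1/k12 + 1/k21."""
    return math.log10(10.0 ** pk_12 + 10.0 ** pk_21)


def close_cycle(pk_1: float, pk_2: float, pk_12: float) -> float:
    """Fourth micro-pKa fixed by cycle closure: pk_1 + pk_12 = pk_2 + pk_21."""
    return pk_1 + pk_12 - pk_2


if __name__ == "__main__":
    # Hypothetical micro-pKa values: site 1 first (pk_1), site 2 first (pk_2),
    # and site 2 after site 1 (pk_12). pk_21 is then determined, not free.
    pk_1, pk_2, pk_12 = 4.0, 5.0, 7.0
    pk_21 = close_cycle(pk_1, pk_2, pk_12)
    pka1 = first_macro_pka(pk_1, pk_2)
    pka2 = second_macro_pka(pk_12, pk_21)
    # Both routes to the fully deprotonated form give the same total.
    print(pka1 + pka2, pk_1 + pk_12)
```

Predicting each site's pKa independently gives no guarantee that the cycle closes; deriving all constants from one set of microstate free energies enforces it by construction.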

284. ❌ Role of anisotropic electronic friction in laser-driven hydrogen recombination on copper

Authors: Alexander Spears, Wojciech G. Stark, Reinhard J. Maurer Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00709v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

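Scoring rationale: The paper studies the role of anisotropic electronic friction in laser-driven hydrogen recombination on copper surfaces. It belongs to computational chemistry/surface science and uses a machine-learning-enabled simulation framework. The keywords all target large language models, deep-learning principles, or AI applications, whereas the paper focuses on simulating a physicochemical process and does not touch on large-model techniques, training methods, inference optimization, or AI agents. The only potentially relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies machine learning to a scientific problem (computational chemistry); since this is not its core content, it receives 5 points (some relevance). All other keywords are entirely unrelated to the paper's topic and receive 0 points.

!!! tip deepseek-chat TL;DR

Using a machine-learning-enabled simulation framework, the study compares isotropic and anisotropic electronic friction models for laser-driven hydrogen recombination on copper. It finds that anisotropic friction strongly affects the energy-transfer rate and the reaction probabilities but has little effect on the final molecular energy distributions.

Abstract (Chinese translation)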

表面超快光驱动化学动力学受激发电子向振动自由度能量转移的调控。当这种非绝热能量转移呈现各向异性时，会导致动力学导向效应，从而影响分子反应概率或非热最终能量分布。本文采用机器学习辅助的模拟框架，比较了铜(111)晶面上激光驱动析氢反应中各向同性与各向异性电子摩擦模型。研究发现，虽然各向异性摩擦强烈决定了能量向吸附物的转移速率以及反应概率对光通量的依赖关系，但它对最终平动、振动和转动能量分布的影响甚微，因为这些分布主要受势能面在能垒区域的形貌所主导。

Abstract

Ultrafast light-driven chemical dynamics at surfaces are governed by energy transfer from excited electrons to vibrational degrees of freedom. When this nonadiabatic energy transfer is anisotropic, it can lead to dynamical steering effects that affect reaction probabilities or non-thermal final energy distributions in molecules. Here, we use a machine-learning-enabled simulation framework to compare isotropic and anisotropic models of electronic friction during laser-driven hydrogen evolution on the (111) facet of copper. While anisotropic friction strongly determines the rate of energy transfer into the adsorbate and the fluence dependence of reaction probabilities, it has little effect on final translational, vibrational and rotational energy distributions as these are mainly governed by the potential energy landscape at the barrier.

Keywords: ultrafast light-driven chemical dynamics, anisotropic electronic friction, laser-driven hydrogen recombination, copper surface, machine-learning-enabled simulation, nonadiabatic energy transfer, reaction probabilities, energy distribution

285. ❌ Stochastic GW with the Orthogonalized Projector Augmented Wave Method

Authors: Dimitri Bazile, Minh Nguyen, Yuji Kon, Tucker Allen, Daniel Neuhauser Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00522v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

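Scoring rationale: The paper combines the stochastic GW method (sGW) with the orthogonalized projector augmented-wave method (OPAW), a topic in computational materials science (computational physics/chemistry) that is unrelated to deep learning or large-model techniques. Among the keywords, only 'AI for Science OR Bioinformatics OR Cheminformatics' has some connection, since the paper concerns scientific computing (materials science), which falls within AI for Science in a broad sense. The paper itself uses no AI or machine-learning methods and relies instead on conventional computational physics, so it receives 5 points (some relevance). All other keywords are entirely unrelated to the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes a stochastic GW implementation built on the orthogonalized projector augmented-wave method (OPAW-sGW) for computing quasiparticle band gaps of materials. It yields accurate results on coarser real-space grids than conventional approaches.

Abstract (Chinese translation)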

我们介绍了基于正交化投影缀加波方法的随机GW方法（OPAW-sGW）。该实现能够在比传统模守恒赝势随机GW方法（NCPP-sGW）明显更粗的实空间网格上，获得精确的准粒子带隙。正交化PAW表示保持了形式上的全电子特性，并实现了格林函数与屏蔽库仑相互作用的随机采样。

Abstract

We introduce stochastic GW with the orthogonalized projector augmented-wave method (OPAW-sGW). This implementation enables accurate quasiparticle band gaps on significantly coarser real-space grids than norm-conserving pseudopotential sGW (NCPP-sGW). The orthogonalized PAW representation preserves the formal all-electron character and enables stochastic sampling of the Green’s function and screened Coulomb interaction.

Keywords: stochastic GW, orthogonalized projector augmented-wave method, OPAW-sGW, quasiparticle band gaps, real-space grids, Green's function, screened Coulomb interaction, computational materials science

286. ❌ Reliable and Efficient Automated Transition-State Searches with Machine-Learned Interatomic Potentials

Authors: Jonah Marks, Jonathon Vandezande, Joseph Gomes Source: arxiv Published: 2026-04-01 arXiv link: http://arxiv.org/abs/2604.00405v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

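Scoring rationale: The paper focuses on accelerating transition-state searches with machine-learned interatomic potentials (MLIPs), within computational chemistry and materials science. It is unrelated to the vast majority of keywords (large-model techniques, training methods, inference optimization, alignment, agents, and so on), which mainly target large language models in natural language processing. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the study applies machine learning to scientific computing (specifically chemistry and materials science) and thus falls under AI for Science; it receives 10 points (highly relevant, core content).

!!! tip deepseek-chat TL;DR

The study addresses the high computational cost of transition-state searches with density-functional theory (DFT). By systematically benchmarking combinations of several machine-learned interatomic potentials (MLIPs) with reaction-path-finding algorithms, it shows that MLIP-accelerated workflows reach near-DFT accuracy at very low cost (for example, MACE-OMol25 needs fewer than four DFT-gradient evaluations per reaction on organic systems, a 94-96% reduction), providing a practical tool for automated reaction discovery.

Abstract (Chinese translation)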

过渡态搜索是理解反应机理的核心，但密度泛函理论（DFT）的高计算成本限制了其在高通量催化剂和材料发现中的应用。机器学习原子间势（MLIPs）能以数量级更低的成本提供接近DFT的精度，但其在过渡态搜索中的可靠性仍未得到充分探索。本研究系统性地评估了混合过渡态搜索工作流程，该流程结合了六种可自由获取的势函数（MACE-OMol25、UMA-Small、UMA-Medium、eSEN-S、AIMNet2和GFN2-xTB）与两种反应路径寻找算法（冻结弦方法和爬坡图像微动弹性带法），测试范围涵盖58个多样化反应，包括小分子有机反应、聚合化学和过渡金属催化。我们发现，基于Open Molecules 2025数据集训练的模型表现出显著优越的性能，其中MACE-OMol25在有机体系上实现了96.6%的成功率，且每个反应所需的DFT梯度评估次数少于四次——相比传统的基于DFT的搜索，计算量减少了94-96%。在高精度DFT优化前，先在MLIP势能面上进行低精度优化，可在可靠性损失极小的前提下将计算成本降低三倍。对于过渡金属体系，UMA-Medium模型在分布内过渡金属配合物反应以及分布外有机金属C-H活化反应中均展现出良好的可迁移性。这些结果确立了MLIP加速的工作流程可作为自动化反应发现的实用工具，以传统成本的一小部分实现接近DFT的精度。

Abstract

Transition-state searches are central to understanding reaction mechanisms, but the high computational cost of density-functional theory (DFT) limits their application in high-throughput catalyst and materials discovery. Machine-learned interatomic potentials (MLIPs) offer near-DFT accuracy at orders-of-magnitude lower cost, yet their reliability for transition-state searches remains underexplored. Here, we systematically benchmark hybrid transition-state-search workflows combining six freely available potentials (MACE-OMol25, UMA-Small, UMA-Medium, eSEN-S, AIMNet2, and GFN2-xTB) with two reaction-path-finding algorithms (the freezing-string method and climbing-image nudged elastic band) across 58 diverse reactions spanning small organics, polymerization chemistry, and transition-metal catalysis. We find that models trained on the Open Molecules 2025 dataset exhibit markedly superior performance, with MACE-OMol25 achieving a 96.6% success rate while requiring fewer than four DFT-gradient evaluations per reaction on organic systems - a 94-96% reduction compared to conventional DFT-based searches. Low-level refinement on the MLIP surface before high-level DFT optimization reduces computational cost three-fold with minimal loss in reliability. For transition-metal systems, UMA-Medium demonstrates promising transferability to in-distribution transition metal complex reactions and out-of-distribution organometallic C-H activation. These results establish MLIP-accelerated workflows as practical tools for automated reaction discovery, enabling near-DFT accuracy at a fraction of traditional expense.

Keywords: transition-state searches, machine-learned interatomic potentials, density-functional theory, computational cost reduction, reaction discovery, catalyst discovery, materials discovery, benchmarking

287. ❌ Short-lived memory in multidimensional spectra encodes full signal evolution

Authors: Thomas Sayer, Ethan H. Fink, Zachary R. Wiethorn, Devin R. Williams, Anthony J. Dominic, Luke Guerrieri, Yi Ji, Veronica Policht, Jennifer Ogilvie, Gabriela Schlau-Cohen, Amber Krummel, Andrés Montoya-Castillo Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29814v2

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

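Scoring rationale: The paper develops a new spectroscopic technique, the spectral generalized master equation, to improve two-dimensional spectroscopy measurements; it belongs to experimental physics/chemical spectroscopy. Its content is unrelated to the vast majority of keywords (large models, deep learning, AI principles, and so on), which are therefore scored 0. The only potentially relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the research concerns scientific instrumentation and measurement and could be regarded as 'AI for Science' in a broad sense (technology applied in science), but the paper involves no artificial-intelligence or machine-learning methods, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

To address the complex apparatus, long measurement times, and risk of sample degradation in current multidimensional spectroscopy, the paper develops a technique called the spectral generalized master equation. It uses 2D spectra at short waiting times to accurately predict the full spectral evolution at arbitrary waiting times, greatly reducing experimental cost and improving temporal resolution.

Abstract (Chinese translation)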

超快多维光谱是能够探测复杂材料中电荷与能量流动、追踪化学动力学过程、乃至揭示关联物质中多体相互作用的强大工具。然而，当前的技术实现通常依赖于复杂装置和冗长的信号平均时间。因此，这些方法仅限于对少数样品进行细致的机理研究，难以广泛应用于分子系统的普适性表征或功能材料的性能优化。例如，随着等待时间的增加，二维光谱中统计噪声的收敛所需成本呈指数级上升，而长时间的激光照射也提高了样品损伤的概率。为应对这一根本性挑战，我们开发了一种新技术——谱学广义主方程（spectral generalized master equation, GME），该技术使得利用短等待时间的二维光谱来高时间分辨率地确定任意等待时间下二维光谱的完整演化成为可能。除了将实验成本降低数个数量级之外，我们的方法能精确消除统计噪声，减少时间平均的需求，同时规避了因等待时间延长而急剧增加的收敛成本。我们为谱学广义主方程提供了严格的理论基础，并通过理论生成和实验测得的二维电子光谱与二维红外光谱验证了其适用性。我们预期这一进展有望推动对当前多维光谱技术而言过于脆弱、难以研究的体系进行探索，并加速基于二维光谱的显微技术发展，从而在非均相环境中实现兼具高空间分辨率与高分辨激发动力学的观测能力。

Abstract

Ultrafast multidimensional spectroscopies are powerful tools that can access charge and energy flow in complex materials, shifting chemical kinetics, and even many-body interactions in correlated matter. However, current implementations typically involve complex apparatuses and long averaging times. As a result, these methods have been limited to detailed mechanistic investigations of a few samples, precluding the broad characterization of molecular systems and/or the optimization of material ones. For example, converging the statistical noise in 2D spectra becomes exponentially expensive with increasing waiting times, and extended laser exposure heightens the probability of sample degradation. We address this fundamental challenge by developing a new technique, the spectral generalized master equation (GME), that enables one to employ short-waiting time 2D spectra to determine the full evolution of 2D spectra over arbitrary waiting times with high temporal resolution. In addition to reducing the cost of experiments by multiple orders of magnitude, our approach accurately removes statistical noise, reducing the need for time averaging, while circumventing the increasing convergence costs with longer waiting times. We provide a rigorous theoretical footing for the spectral GME and demonstrate its applicability on theoretically generated and experimentally measured 2D electronic and 2D infrared spectra. We anticipate that this advance has the potential to enable the investigation of systems that are too delicate for study with current multidimensional spectroscopies and accelerate the progress of 2D spectroscopy-based microscopies that can offer highly resolved excitation dynamics with spatial resolution over heterogeneous environments.

Keywords: ultrafast multidimensional spectroscopy, 2D spectra, spectral generalized master equation, waiting time, temporal resolution, experimental cost reduction, statistical noise removal, sample degradation

288. ❌ Optical frequency comb Fourier transform spectroscopy of the CH$_2$$^{79}$Br$^{81}$Br, CH$_2$$^{79}$Br$_2$, and CH$_2$$^{81}$Br$_2$ isotopologues in the 1180-1210 cm$^{-1}$ region

Authors: Ibrahim Sadiek, Aleksandr A. Balashov, Adrian Hjältén, Michael Rey, Oleg Egorov, Aleksandra Foltynowicz Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2604.00244v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

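Scoring rationale: The paper is a high-resolution spectroscopic study of dibromomethane (CH2Br2) that measures absorption cross-sections with optical frequency comb Fourier transform spectroscopy and analyzes the rovibrational features of its isotopologues. It belongs entirely to experimental physical chemistry and spectroscopy and has no direct connection to any of the large-model, deep-learning, or AI-principles keywords. The only possible relevance is to 'AI for Science OR Bioinformatics OR Cheminformatics', because the research is in a scientific field (chemical spectroscopy); however, the paper uses no AI or machine-learning methods and relies on conventional spectral fitting and ab initio calculations, so it receives 5 points (some relevance). All other keywords are scored 0 (entirely unrelated).

!!! tip deepseek-chat TL;DR

Using optical frequency comb Fourier transform spectroscopy, the study reports the first high-resolution absorption cross-section of dibromomethane in the 1180-1210 cm⁻¹ region and analyzes the rovibrational features of three isotopologues with empirical fitting and ab initio calculations, providing accurate data for environmental monitoring and spectroscopic modeling.

Abstract (Chinese translation)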

二溴甲烷（CH$_2$Br$_2$）的定量光谱检测在环境监测、工作场所安全及系外行星研究中，因缺乏精确的吸收截面数据和严谨的光谱模型而受到限制。本文首次利用光学频率梳傅里叶变换光谱技术，测量了CH$_2$Br$_2$在1180-1210 cm$^{-1}$范围内的高分辨率（点间距6.3 MHz）吸收截面。该光谱区域主要受强CH$_2$摇摆（$ν$$_8$）基频振动主导，其强度约为3077 cm$^{-1}$附近C-H伸缩基频的50倍。测量解析了CH$_2$$^{79}$Br$^{81}$Br、CH$_2$$^{79}$Br$_2$和CH$_2$$^{81}$Br$_2$等同位素体的特定振转谱线特征，并通过两种方法对$ν$$_8$基频及重叠的$ν$$_4$+$ν$$_8$-$ν$$_4$热谱带振转跃迁进行了归属。首先，基于PGOPHER实现的经验性非线性最小二乘拟合为三种同位素体提供了高精度的谱线归属和光谱常数（包括精确的谱带原点、转动常数及四阶离心畸变参数），覆盖了高达K$_a$ = 25和J = 144的转动能级，平均均方根残差为0.00037 cm$^{-1}$（11.1 MHz）。与先前基于窄带（1.78 cm$^{-1}$）超冷光谱拟合获得的谱带参数（B. E. Brumfield等人，J. Mol. Spectrosc., 2011, 266, 57-62）相比，我们的拟合显著提升了测量光谱与模拟光谱的整体一致性。同时，采用基于从头算的有效哈密顿量方法对完整的振转多能级体系进行了建模，该方法涵盖了纯经验拟合无法处理的弱热谱带跃迁及多能级相互作用，并首次提供了CH$_2$Br$_2$在8 μm光谱区域内基于从头算的谱线强度数据。

Abstract

Quantitative spectroscopic detection of dibromomethane, CH$_2$Br$_2$, for environmental monitoring, workplace safety, and exoplanetary studies is limited by the lack of accurate absorption cross-section data and rigorous spectroscopic models. We report the first high-resolution (6.3 MHz point spacing) absorption cross-section of CH$_2$Br$_2$ in the 1180-1210 cm$^{-1}$ region measured using optical frequency comb Fourier transform spectroscopy. This region is dominated by the strong CH$_2$ wagging ($ν$$_8$) fundamental vibration, which is about 50 times stronger than the fundamental C-H stretch around 3077 cm$^{-1}$. The measurements resolve isotopologue-specific rovibrational features of CH$_2$$^{79}$Br$^{81}$Br, CH$_2$$^{79}$Br$_2$, and CH$_2$$^{81}$Br$_2$, and we assign rovibrational transitions of the $ν$$_8$ fundamental and the overlapping $ν$$_4$+$ν$$_8$-$ν$$_4$ hot bands using two methods. First, an empirical non-linear least square fit implemented in PGOPHER provides high-precision line assignment and spectroscopic constants, including accurate band origins, rotational constants, and quartic centrifugal distortion parameters, for the three isotopologues, covering rotational levels up to K$_a$ = 25 and J = 144, with an average RMS residual of 0.00037 cm$^{-1}$ (11.1 MHz). Compared with previously reported band parameters retrieved from a fit to narrowband (1.78 cm$^{-1}$) supersonically cooled spectra (B. E. Brumfield et al., J. Mol. Spectrosc., 2011, 266, 57-62), our fit provides much improved global agreement between measured and simulated spectra. In parallel, an ab initio-based effective Hamiltonian approach was used to model the complete rovibrational polyads, including weak hot-band transitions and polyad interactions inaccessible to purely empirical fits, and provided the first ab initio-based line intensities of CH$_2$Br$_2$ in the 8 $μ$m spectral region.

Keywords: CH2Br2, optical frequency comb Fourier transform spectroscopy, absorption cross-section, rovibrational transitions, isotopologues, ab initio calculation, spectroscopic modeling, environmental monitoring

289. ❌ Quantum Sensing with Triplet Pair States: A Theoretical Study

Authors: Maria Grazia Concilio, Yiwen Wang, Siyuan Wang, Xueqian Kong Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29509v2

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

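Scoring rationale: The paper studies quantum sensing, in the physics/quantum computing domain, and is entirely unrelated to all of the large-model and deep-learning keywords. It has only a weak connection to the last keyword, 'AI for Science OR Bioinformatics OR Cheminformatics', because quantum sensing can be viewed as a scientific application; since the paper uses no AI methods, it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

This theoretical study compares quantum-sensing architectures based on pentacene dimers and monomers. It finds that the dimer offers a superior interaction cross-section for detecting small ensembles of nuclear spins, and it lays a theoretical foundation for using high-spin multi-excitonic states as chemically tunable, high-sensitivity quantum probes.

Abstract (Chinese translation)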

分子量子传感器代表了在纳米尺度上检测核磁共振信号与交流磁场的前沿方向，其灵敏度有望达到单质子水平。尽管并五苯分子的三重态提供了一种可行的传感架构，但通过并五苯二聚体单线态裂变产生的三重态对态，有望借助纠缠实现更灵活的量子操控。本研究模拟了经由分子内单线态裂变产生的光激发并五苯二聚体中，自旋极化的五重态流形用于量子传感的性能。采用林德布拉德主方程方法，我们模拟了三重态对态在标准动态解耦序列（包括自旋回波、XY4和XY8）下的演化过程，并与传统的并五苯单体基准进行了直接性能比较。尽管两种架构在孤立单自旋检测中表现出相近的灵敏度，我们的研究结果表明，二聚体架构在检测小规模核自旋系综时具有更优的相互作用截面。通过对荧光调制推导出的解析表达式表明，灵敏度在低磁场条件下达到最优，并随传感协议中的脉冲数量增加而提升。本研究为利用高自旋多激子态作为化学可调谐的高灵敏度量子探针奠定了理论基础。

Abstract

Molecular quantum sensors represent a promising frontier for the detection of nuclear magnetic resonance signals and alternating current magnetic fields at the nanoscale, potentially reaching single-proton sensitivity. Although the triplet states of molecular pentacene provide a viable sensing architecture, the triplet pair states produced by singlet fission of pentacene dimers could enable more flexible quantum manipulations through entanglement. In this work, we model the quantum sensing efficacy of a spin-polarized quintet manifold in a photoexcited pentacene dimer generated via intramolecular singlet fission. Using a Lindblad master equation approach, we simulate the evolution of the triplet pair state under standard dynamical decoupling sequences, including spin echo, XY4, and XY8 and provide a direct performance comparison to the traditional pentacene monomer benchmark. While both architectures exhibit comparable sensitivity for isolated single-spin detection, our findings indicate that the dimer architecture provides a superior interaction cross-section for detecting small ensembles of nuclear spins. Analytical expressions derived for fluorescence modulation demonstrate that sensitivity is optimized in the low-magnetic field regime and scales with the number of pulses in the sensing protocol. This study establishes a theoretical baseline for utilizing high-spin multi-excitonic states as chemically tunable, high-sensitivity quantum probes.

Keywords: quantum sensing, triplet pair states, pentacene dimer, singlet fission, Lindblad master equation, dynamical decoupling, fluorescence modulation, high-spin multi-excitonic states
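The abstract names three standard dynamical decoupling sequences (spin echo, XY4, XY8). As a small illustration of how they differ, the sketch below lists the rotation axis and timing of each π pulse, using the textbook definitions of these sequences with equally spaced pulses. The function name, the pulse spacing, and the repetition counts are assumptions for the example; the paper's actual pulse parameters are not given in the abstract.

```python
from typing import List, Tuple

# Rotation axes of the pi pulses in one block of each sequence.
SPIN_ECHO: Tuple[str, ...] = ("X",)
XY4: Tuple[str, ...] = ("X", "Y", "X", "Y")
XY8: Tuple[str, ...] = XY4 + ("Y", "X", "Y", "X")


def pulse_schedule(block: Tuple[str, ...], repetitions: int, tau: float) -> List[Tuple[float, str]]:
    """Return (time, axis) pairs for equally spaced pi pulses.

    Pulses are separated by tau, with free evolution of tau/2 before the
    first pulse and after the last one, so the total duration is
    len(block) * repetitions * tau.
    """
    axes = block * repetitions
    return [((k + 0.5) * tau, axis) for k, axis in enumerate(axes)]


if __name__ == "__main__":
    # Hypothetical spacing of 1.0 (arbitrary time units).
    for name, block in (("spin echo", SPIN_ECHO), ("XY4", XY4), ("XY8", XY8)):
        print(name, pulse_schedule(block, repetitions=1, tau=1.0))
```

Repeating a block increases the number of pulses, which is the quantity the abstract reports sensitivity scaling with.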

Token usage statistics

Total: 931,307 tokens (input 644,346 / output 286,961)

Model	Input	Output	Total
deepseek-chat	516,321	286,961	803,282
glm-4.7	128,025	0	128,025
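The per-model rows in the table above can be cross-checked against the reported totals with a few lines of arithmetic. The sketch below copies the figures from the table; the dictionary layout and function name are only illustrative.

```python
# Per-model token counts copied from the table above.
usage = {
    "deepseek-chat": {"input": 516_321, "output": 286_961},
    "glm-4.7": {"input": 128_025, "output": 0},
}


def summarize(rows: dict) -> tuple:
    """Return (total input, total output, grand total) across all models."""
    total_in = sum(r["input"] for r in rows.values())
    total_out = sum(r["output"] for r in rows.values())
    return total_in, total_out, total_in + total_out


if __name__ == "__main__":
    total_in, total_out, grand_total = summarize(usage)
    print(f"input={total_in:,} output={total_out:,} total={grand_total:,}")
```

The sums agree with the reported line: 644,346 input, 286,961 output, 931,307 in total.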

📋 All papers

1. ✅ Can Large Language Models Self-Correct in Medical Question Answering? An Exploratory Study

2. ✅ HippoCamp: Benchmarking Contextual Agents on Personal Computers

3. ✅ Execution-Verified Reinforcement Learning for Optimization Modeling

4. ✅ Dual Optimal: Make Your LLM Peer-like with Dignity

5. ✅ MOON3.0: Reasoning-aware Multimodal Representation Learning for E-commerce Product Understanding

6. ✅ True (VIS) Lies: Analyzing How Generative AI Recognizes Intentionality, Rhetoric, and Misleadingness in Visualization Lies

7. ✅ Adapting Text LLMs to Speech via Multimodal Depth Up-Scaling

8. ✅ IWP: Token Pruning as Implicit Weight Pruning in Large Vision Language Models

9. ✅ Cost-Penalized Fitness in FMA-Orchestrated Mixture of Experts: Experimental Evidence for Molecular Memory in Domain Adaptation

10. ✅ EvolveTool-Bench: Evaluating the Quality of LLM-Generated Tool Libraries as Software Artifacts

11. ✅ Agent psychometrics: Task-level performance prediction in agentic coding benchmarks

12. ❌ TRIMS: Trajectory-Ranked Instruction Masked Supervision for Diffusion Language Models

13. ❌ A Survey of On-Policy Distillation for Large Language Models

14. ❌ Logarithmic Scores, Power-Law Discoveries: Disentangling Measurement from Coverage in Agent-Based Evaluation

15. ❌ Hierarchical Pre-Training of Vision Encoders with Large Language Models

16. ❌ Fast and Accurate Probing of In-Training LLMs’ Downstream Performances

17. ❌ Routing-Free Mixture-of-Experts

18. ❌ When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation

19. ❌ PHASOR: Anatomy- and Phase-Consistent Volumetric Diffusion for CT Virtual Contrast Enhancement

20. ❌ Multimodal Analysis of State-Funded News Coverage of the Israel-Hamas War on YouTube Shorts

21. ❌ GRASP: Gradient Realignment via Active Shared Perception for Multi-Agent Collaborative Optimization

22. ❌ The Recipe Matters More Than the Kitchen: Mathematical Foundations of the AI Weather Prediction Pipeline

23. ❌ LAtent Phase Inference from Short time sequences using SHallow REcurrent Decoders (LAPIS-SHRED)

24. ❌ $\texttt{YC-Bench}$: Benchmarking AI Agents for Long-Term Planning and Consistent Execution

25. ❌ CliffSearch: Structured Agentic Co-Evolution over Theory and Code for Scientific Algorithm Discovery

26. ❌ Neural Harmonic Textures for High-Quality Primitive Based Neural Reconstruction

27. ❌ Therefore I am. I Think

28. ❌ ORBIT: Scalable and Verifiable Data Generation for Search Agents on a Tight Budget

29. ❌ A ROS 2 Wrapper for Florence-2: Multi-Mode Local Vision-Language Inference for Robotic Systems

30. ❌ Screening Is Enough

31. ❌ Online Reasoning Calibration: Test-Time Training Enables Generalizable Conformal LLM Reasoning

32. ❌ AdaLoRA-QAT: Adaptive Low-Rank and Quantization-Aware Segmentation

33. ❌ Brainstacks: Cross-Domain Cognitive Capabilities via Frozen MoE-LoRA Stacks for Continual LLM Learning

34. ❌ Detecting Multi-Agent Collusion Through Multi-Agent Interpretability

35. ❌ Looking into a Pixel by Nonlinear Unmixing – A Generative Approach

36. ❌ Paper Reconstruction Evaluation: Evaluating Presentation and Hallucination in AI-written Papers

37. ❌ Trust and Reliance on AI in Education: AI Literacy and Need for Cognition as Moderators

38. ❌ Lightweight Prompt-Guided CLIP Adaptation for Monocular Depth Estimation

39. ❌ Adversarial Moral Stress Testing of Large Language Models

40. ❌ Temporal Dependencies in In-Context Learning: The Role of Induction Heads

41. ❌ Approximating Pareto Frontiers in Stochastic Multi-Objective Optimization via Hashing and Randomization

42. ❌ VibeGuard: A Security Gate Framework for AI-Generated Code

43. ❌ TRACE: Training-Free Partial Audio Deepfake Detection via Embedding Trajectory Analysis of Speech Foundation Models

44. ❌ Adversarial Attacks in AI-Driven RAN Slicing: SLA Violations and Recovery

45. ❌ Aligning Recommendations with User Popularity Preferences

46. ❌ Automated Framework to Evaluate and Harden LLM System Instructions against Encoding Attacks

47. ❌ Revision or Re-Solving? Decomposing Second-Pass Gains in Multi-LLM Pipelines

48. ❌ Transfer learning for nonparametric Bayesian networks

49. ❌ OrgAgent: Organize Your Multi-Agent System like a Company

50. ❌ Query-Conditioned Evidential Keyframe Sampling for MLLM-Based Long-Form Video Understanding

51. ❌ OmniMem: Autoresearch-Guided Discovery of Lifelong Multimodal Agent Memory

52. ❌ EgoSim: Egocentric World Simulator for Embodied Interaction Generation

53. ❌ Do Phone-Use Agents Respect Your Privacy?

54. ❌ Bridging Structured Knowledge and Data: A Unified Framework with Finance Applications

55. ❌ Flow-based Policy With Distributional Reinforcement Learning in Trajectory Optimization

56. ❌ WARP: Guaranteed Inner-Layer Repair of NLP Transformers

57. ❌ PsychAgent: An Experience-Driven Lifelong Learning Agent for Self-Evolving Psychological Counselor

58. ❌ Learning Quantised Structure-Preserving Motion Representations for Dance Fingerprinting

59. ❌ Representation Selection via Cross-Model Agreement using Canonical Correlation Analysis

60. ❌ Investigating Autonomous Agent Contributions in the Wild: Activity Patterns and Code Change over Time

61. ❌ Experience as a Compass: Multi-agent RAG with Evolving Orchestration and Agent Prompts

62. ❌ Beyond Symbolic Solving: Multi Chain-of-Thought Voting for Geometric Reasoning in Large Language Models