📊 arXiv Research Report (2026-04-02)

Generated: 2026-04-02 09:20:10 | Data source: arXiv

📌 Configuration

Keyword list (27 keywords, total weight 27.0)

Keyword	Weight	Type
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	Primary
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	Primary
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	Primary
“Scaling Laws” AND “Data Quality”	1.0	Primary
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	Primary
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	Primary
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	Primary
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	Primary
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	Primary
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	Primary
“Context Window Extension” OR “Long Context LLMs”	1.0	Primary
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	Primary
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	Primary
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	Primary
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	Primary
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	Primary
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	Primary
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	Primary
“Multi-agent Systems” OR “Agent Coordination”	1.0	Primary
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	Primary
“Speculative Decoding” OR “Inference Acceleration”	1.0	Primary
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	Primary
“Mechanistic Interpretability” OR “Explainable AI”	1.0	Primary
“World Models” AND “General World Models”	1.0	Primary
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	Primary
“In-context Learning” OR “Many-shot Learning”	1.0	Primary
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	Primary

Scoring settings

Maximum score per keyword: 15
Passing score formula: 5.0 + 0.8 × total weight
Current passing score: 26.6
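
The threshold arithmetic can be checked with a short Python sketch. It assumes that each keyword contributes weight × relevance, capped at the per-keyword maximum; the report only shows weights of 1.0, so that rule is a guess rather than the tool's documented behaviour. The function names are illustrative.

```python
def passing_score(total_weight, base=5.0, factor=0.8):
    """Passing threshold as stated in the report: 5.0 + 0.8 x total weight."""
    return base + factor * total_weight


def paper_score(rows, max_per_keyword=15.0):
    """Sum per-keyword scores.

    Assumption: each keyword contributes weight * relevance, capped at
    max_per_keyword. The tables only show weights of 1.0, so the exact
    rule cannot be confirmed from the report itself.
    """
    return sum(min(weight * relevance, max_per_keyword)
               for weight, relevance in rows)


# Non-zero (weight, relevance) rows of the first paper listed in this report.
first_paper_rows = [(1.0, 5.0), (1.0, 10.0), (1.0, 15.0), (1.0, 15.0), (1.0, 10.0)]

threshold = passing_score(27.0)          # 27 keywords, each with weight 1.0
score = paper_score(first_paper_rows)
print(round(threshold, 1), score, score >= threshold)
```

With 27 keywords of weight 1.0 the threshold matches the 26.6 stated above, and the first paper's rows sum to its reported 55.0.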

📈 Paper statistics

Total fetched: 316 papers
Passing papers: 12 (3.8%)

⭐ Detailed analysis of passing papers

1. Quantization with Unified Adaptive Distillation to enable multi-LoRA based one-for-all Generative Vi

Authors: Sowmya Vajrala, Aakash Parmar, Prasanna R, Sravanth Kodavanti, Manjunath Arveti, Srinivas Soumitri Miriyala, Ashok Senapati | Journal/Source: arXiv | Published: 2026-03-31 | arXiv link: http://arxiv.org/abs/2603.29535v1

Score: 55.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	10.0/10	10.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	15.0/10	15.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	15.0/10	15.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	10.0/10	10.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

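Rationale: The paper studies deployment optimization of Large Vision Models (LVMs) on edge devices, an application of large models in another domain (computer vision), and is technically novel. Highly relevant (15 points): "PEFT/LoRA" (LoRA-based parameter-efficient adaptation is central to the paper) and "Quantization/Model Compression" (it proposes the QUAD quantization method). Relevant (10 points): "Small Language Models/On-device AI" (it focuses on edge-device deployment) and "Speculative Decoding/Inference Acceleration" (it reports latency improvements). Partially relevant (5 points): "Large Language Models/Foundation Models" (it involves foundation large vision models). Other keywords such as MoE, Scaling Laws, and Alignment are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

The paper addresses the storage redundancy and runtime overhead of deploying multi-task large vision models on resource-constrained edge devices. By treating LoRA weights as runtime inputs and introducing the QUAD quantization-aware training strategy, it reduces memory footprint by up to 6x and improves latency by up to 4x.

Abstract (Chinese translation)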

生成式人工智能（Generative Artificial Intelligence, GenAI）的图像编辑、对象移除以及提示引导的图像转换等功能正日益集成到移动应用中。然而，由于大型视觉模型（Large Vision Models, LVMs）对内存和计算资源的高需求，在资源受限的设备上部署此类模型仍面临挑战。尽管低秩适配器（Low-Rank Adapters, LoRAs）能够实现参数高效的任务适应，但现有的移动端部署流程通常需要为每个LoRA单独编译模型二进制文件，并复制一份基础模型，从而导致存储冗余和运行时开销增加。本研究提出一个统一框架，旨在利用单一共享模型在边缘设备上实现多任务GenAI推理。我们的核心思路是将LoRA权重视为运行时输入，而非将其嵌入到编译后的模型图中，从而允许在运行时动态切换任务而无需重新编译。随后，为支持高效的设备端执行，我们引入了QUAD（量化统一自适应蒸馏，Quantization with Unified Adaptive Distillation），这是一种量化感知训练策略，能够在共享的量化配置下对齐多个LoRA适配器。我们通过一个与移动神经处理单元（NPUs）兼容的轻量级运行时栈实现了所提出的系统，并在多种芯片组上进行了评估。实验结果表明，该系统在保持多种GenAI任务高视觉质量的同时，内存占用最高可降低6倍，延迟最高可改善4倍。

Abstract

Generative Artificial Intelligence (GenAI) features such as image editing, object removal, and prompt-guided image transformation are increasingly integrated into mobile applications. However, deploying Large Vision Models (LVMs) for such tasks on resource-constrained devices remains challenging due to their high memory and compute requirements. While Low-Rank Adapters (LoRAs) enable parameter-efficient task adaptation, existing mobile deployment pipelines typically compile separate model binaries for each LoRA + a copy of the foundation model, resulting in redundant storage and increased runtime overhead. In this work, we present a unified framework for enabling multi-task GenAI inference on edge devices using a single shared model. Our key idea is to treat LoRA weights as runtime inputs rather than embedding them into the compiled model graph, allowing dynamic task switching at runtime without recompilation. Then, to support efficient on-device execution, we introduce QUAD (Quantization with Unified Adaptive Distillation), a quantization-aware training strategy that aligns multiple LoRA adapters under a shared quantization profile. We implement the proposed system with a lightweight runtime stack compatible with mobile NPUs and evaluate it across multiple chipsets. Experimental results demonstrate up to 6x and 4x reduction in memory footprint and latency improvements, respectively, while maintaining high visual quality across multiple GenAI tasks.

Keywords: Generative Vision Models, Edge Deployment, LoRA, Quantization, Model Compression, Inference Acceleration, Multi-task Learning, Mobile NPUs
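
The idea of passing LoRA weights as runtime inputs can be illustrated with a toy linear layer. This is only a sketch under stated assumptions: the function names, the task names, and the tiny matrices are hypothetical, and the paper's actual NPU runtime and QUAD quantization are not modelled here.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_linear(x, base_weight, lora_a, lora_b, scale=1.0):
    """Compute y = W x + scale * B (A x).

    The adapter (A, B) is an argument of the call instead of being merged
    into W, so one shared W can serve several tasks.
    """
    base_out = matvec(base_weight, x)
    low_rank_out = matvec(lora_b, matvec(lora_a, x))
    return [b + scale * d for b, d in zip(base_out, low_rank_out)]


# One shared 2x3 base weight (hypothetical values).
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]

# Rank-1 adapters per task: A is 1x3, B is 2x1 (hypothetical values).
adapters = {
    "object_removal": ([[1.0, 1.0, 1.0]], [[0.5], [0.0]]),
    "image_editing": ([[0.0, 0.0, 1.0]], [[0.0], [2.0]]),
}

x = [1.0, 2.0, 3.0]
outputs = {task: lora_linear(x, W, a, b) for task, (a, b) in adapters.items()}
print(outputs)
```

Switching tasks only changes which `(A, B)` pair is passed in; the base weight `W` is never copied or rebuilt, which is the storage argument the abstract makes.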

2. One-for-All: A Lightweight Stabilized and Parameter-Efficient Pre-trained LLM for Time Series Foreca

Authors: Prasanjit Dey, Soumyabrata Dev, Bianca Schoen-Phelan | Journal/Source: arXiv | Published: 2026-03-31 | arXiv link: http://arxiv.org/abs/2603.29756v1

Score: 51.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	5.0/10	5.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	8.0/10	8.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	8.0/10	8.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	10.0/10	10.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	5.0/10	5.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

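Rationale: The paper studies parameter-efficient fine-tuning (PEFT) of pre-trained large language models (LLMs) for time-series forecasting, so it is highly relevant to "Large Language Models" and "PEFT/LoRA" (10 points each). It involves adapting and fine-tuning a pre-trained model, so it is relevant to "Pre-training/Domain Adaptation" and "Post-training/SFT" (8 points each). It mentions edge-device deployment, which gives partial relevance to "Small Language Models/On-device AI" and "Quantization/Model Compression" (5 points each). Its applications in healthcare, finance, and similar domains relate to "AI for Science" (5 points). Other keywords such as MoE, RAG, and RLHF are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes One-for-All, a lightweight parameter-efficient fine-tuning approach built on rsLoRA, for adapting pre-trained LLMs to multivariate time-series forecasting. It substantially reduces trainable parameters and memory footprint while matching the forecasting accuracy of the models it is compared against.

Abstract (Chinese translation)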

我们致力于解决将预训练大语言模型（LLMs）适配于多元时间序列分析的挑战，该领域部署常受限于极高的计算与内存需求。我们的解决方案“One-for-All”引入了高斯秩稳定低秩适配器（rsLoRA），以实现对冻结大语言模型的参数高效微调。该方法虽受LoRA启发，但rsLoRA引入了一种基于数学理论的秩稳定机制，能够在低秩条件下实现可证明的梯度稳定性——这是现有参数高效微调（PEFT）方法中缺失的创新贡献。我们的框架将可训练的秩分解矩阵（秩为16）注入位置嵌入和输出层，同时保持自注意力权重固定。这一设计将可训练参数减少了6.8倍（对比TimesNet）、21倍（对比GPT4TS）和11.8倍（对比TIME-LLM），同时实现了168至1,776倍更小的内存占用（2.2MiB对比现有先进模型的340MiB-4.18GiB）。在六项时间序列任务上的严格评估表明，One-for-All实现了最优的效率-精度平衡：其参数效率比TimesNet高5.5倍（均方误差MSE=5.50），比GPT4TS高21倍，同时保持了同等的预测精度（MSE=0.33）。该框架的稳定性通过在多样化预测步长（96-720步）和数据集（ETT、Weather、M3、M4）上的一致性能得到验证，其参数量比传统Transformer模型减少98.3%。这些进展使得模型能够在医疗、金融和环境监测等边缘设备上部署，且不损失性能。

Abstract

We address the challenge of adapting pre-trained Large Language Models (LLMs) for multivariate time-series analysis, where their deployment is often hindered by prohibitive computational and memory demands. Our solution, One-for-All, introduces Gaussian Rank-Stabilized Low-Rank Adapters (rsLoRA) to enable parameter-efficient fine-tuning of frozen LLMs. While inspired by LoRA, rsLoRA introduces a mathematically grounded rank-stabilization mechanism that enables provable gradient stability at low ranks, a novel contribution absent in prior PEFT methods. Our framework injects trainable rank decomposition matrices (rank 16) into positional embeddings and output layers, while keeping self-attention weights fixed. This design reduces trainable parameters by 6.8× (vs. TimesNet), 21× (vs. GPT4TS), and 11.8× (vs. TIME-LLM), while achieving a 168-1,776× smaller memory footprint (2.2MiB vs. 340MiB-4.18GiB in SOTA models). Rigorous evaluation across six time-series tasks demonstrates that One-for-All achieves state-of-the-art efficiency-accuracy trade-offs: 5.5× higher parameter efficiency (MSE=5.50) than TimesNet and 21× better than GPT4TS, while matching their forecasting accuracy (MSE=0.33). The framework’s stability is validated through consistent performance across diverse horizons (96-720 steps) and datasets (ETT, Weather, M3, M4), with 98.3% fewer parameters than conventional transformers. These advances enable deployment on edge devices for healthcare, finance, and environmental monitoring without compromising performance.

Keywords: Large Language Models, Parameter-efficient Fine-tuning, LoRA, Time Series Forecasting, Pre-trained Models, Low-Rank Adaptation, Edge Deployment, Multivariate Analysis
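
The abstract does not spell out how its Gaussian rank-stabilized adapter is scaled. As a rough illustration only, the sketch below contrasts the conventional LoRA scaling factor alpha / r with the alpha / sqrt(r) factor used in earlier rank-stabilized LoRA work; whether One-for-All uses exactly this factor is an assumption. The `lora_param_count` helper and the 768-wide layer are hypothetical and only show why a rank-16 adapter is small compared with a full weight matrix.

```python
import math


def lora_scale(alpha, rank):
    """Conventional LoRA scaling: alpha / r."""
    return alpha / rank


def rank_stabilized_scale(alpha, rank):
    """Rank-stabilized scaling from earlier rsLoRA work: alpha / sqrt(r)."""
    return alpha / math.sqrt(rank)


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters of one adapter: A is r x d_in, B is d_out x r."""
    return rank * (d_in + d_out)


alpha = 16.0
for rank in (4, 16, 64):
    print(rank, lora_scale(alpha, rank), rank_stabilized_scale(alpha, rank))

# Hypothetical 768 x 768 layer with the rank of 16 mentioned in the abstract.
adapter_params = lora_param_count(768, 768, 16)
full_params = 768 * 768
print(adapter_params, full_params, full_params / adapter_params)
```

Under the conventional factor the adapter's contribution shrinks quickly as the rank grows, while the square-root factor decays more slowly; that difference is what "rank stabilization" usually refers to.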

3. Learning to Generate Formally Verifiable Step-by-Step Logic Reasoning via Structured Formal Intermed

Authors: Luoxin Chen, Yichi Zhou, Huishuai Zhang | Journal/Source: arXiv | Published: 2026-03-31 | arXiv link: http://arxiv.org/abs/2603.29500v1

Score: 50.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

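Rationale: The paper studies the reliability of large language models (LLMs) on complex multi-step reasoning tasks and proposes the PRoSFI reward method, which uses structured intermediate steps and formal verification to make reasoning more trustworthy. Highly relevant keywords: LLMs (the paper explicitly studies LLMs), RLHF/DPO (it uses a reinforcement-learning reward method), Chain of Thought/System 2 Thinking (it focuses on the multi-step reasoning process), and Hallucination Mitigation (it targets unreliable reasoning). Other keywords such as MoE, SLMs, Scaling Laws, and PEFT have no direct connection to the paper.

!!! tip deepseek-chat TL;DR

The paper targets unreliable intermediate steps in LLM multi-step reasoning. It proposes PRoSFI, a process reward method based on structured formal intermediate steps, which uses formal verification to steer the model toward machine-checkable reasoning steps and thereby improves the trustworthiness of its reasoning.

Abstract (Chinese translation)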

大型语言模型（LLMs）近期在复杂多步推理任务上展现出令人瞩目的性能，尤其是在经过结果奖励强化学习后训练的情况下（Guo et al. 2025）。然而，研究发现，结果奖励往往忽略存在缺陷的中间步骤，导致即使最终答案正确，其推理过程仍可能不可靠。为解决这种不可靠的推理问题，我们提出了PRoSFI（基于结构化形式中间步骤的过程奖励），这是一种新颖的奖励方法，可在不牺牲准确性的前提下提升推理的可靠性。模型并不直接生成形式化证明（这对于中等规模（7B）模型而言通常难以实现），而是输出与其自然语言推理相对应的结构化中间步骤。每一步随后由形式化验证器进行检验。只有完全通过验证的推理链才能获得高额奖励。形式化验证的整合引导模型逐步生成机器可检验的证明，从而产生更可信的最终答案。PRoSFI为训练可信赖的推理模型提供了一种简单而有效的途径。

Abstract

Large language models (LLMs) have recently demonstrated impressive performance on complex, multi-step reasoning tasks, especially when post-trained with outcome-rewarded reinforcement learning (Guo et al. 2025). However, it has been observed that outcome rewards often overlook flawed intermediate steps, leading to unreliable reasoning steps even when final answers are correct. To address this unreliable reasoning, we propose PRoSFI (Process Reward over Structured Formal Intermediates), a novel reward method that enhances reasoning reliability without compromising accuracy. Instead of generating formal proofs directly, which is rarely accomplishable for a modest-sized (7B) model, the model outputs structured intermediate steps aligned with its natural language reasoning. Each step is then verified by a formal prover. Only fully validated reasoning chains receive high rewards. The integration of formal verification guides the model towards generating step-by-step machine-checkable proofs, thereby yielding more credible final answers. PRoSFI offers a simple and effective approach to training trustworthy reasoning models.

Keywords: Large Language Models, Reasoning Reliability, Formal Verification, Process Reward, Structured Intermediates, Multi-step Reasoning, Reinforcement Learning, PRoSFI
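
The rule that only fully validated chains receive a high reward can be sketched as follows. This is a toy illustration, not the paper's implementation: `toy_verifier` stands in for a formal prover and only checks integer arithmetic, the reward values are placeholders, and requiring a correct final answer in addition to verified steps is an assumption.

```python
def chain_reward(steps, verify_step, answer_correct, high=1.0, low=0.0):
    """Return the high reward only if every structured step verifies.

    Assumption: the final answer must also be correct; the abstract only
    states that fully validated chains receive high rewards.
    """
    if not steps:
        return low
    all_verified = all(verify_step(step) for step in steps)
    return high if (all_verified and answer_correct) else low


def toy_verifier(step):
    """Hypothetical stand-in for a formal prover.

    A step is (a, op, b, claimed) and is accepted when `a op b == claimed`.
    """
    a, op, b, claimed = step
    actual = a + b if op == "+" else a * b
    return actual == claimed


sound_chain = [(2, "+", 3, 5), (5, "*", 2, 10)]
flawed_chain = [(2, "+", 3, 6), (5, "*", 2, 10)]  # first step is wrong

print(chain_reward(sound_chain, toy_verifier, answer_correct=True))
print(chain_reward(flawed_chain, toy_verifier, answer_correct=True))
```

The flawed chain still ends in the right answer, which an outcome-only reward would accept; the step-level check is what rejects it.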

4. SceneTeract: Agentic Functional Affordances and VLM Grounding in 3D Scenes

Authors: Léopold Maillard, Francis Engelmann, Tom Durand, Boxiao Pan, Yang You, Or Litany, Leonidas Guibas, Maks Ovsjanikov | Journal/Source: arXiv | Published: 2026-03-31 | arXiv link: http://arxiv.org/abs/2603.29798v1

Score: 43.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	8.0/10	8.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

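Rationale: The paper presents SceneTeract, a framework for verifying the functionality of 3D scenes in embodied AI. Its core is verifying an agent's interactions in a 3D environment, so it is highly relevant to "LLM Agents" (10 points). It uses Vision-Language Models (VLMs) for affordance prediction and reasoning, which gives partial relevance to "Large Language Models" (5 points), although it is not purely LLM research. It mentions using SceneTeract as a reward engine for "VLM post-training", which is relevant to "Post-training" (8 points). It decomposes complex activities into sequences of atomic actions and verifies them, which reflects multi-step and in-depth reasoning, so "Chain of Thought" and "System 2 Thinking" each receive 5 points. It reveals systematic mismatches between semantic confidence and physical feasibility in VLMs, which touches on factuality and explainability, so "Hallucination Mitigation" and "Explainable AI" each receive 5 points. Other keywords such as MoE, SLMs, Scaling Laws, RLHF, RAG, and Quantization have no direct connection to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes SceneTeract, a framework that verifies the functionality of 3D scenes under the constraints of a specific embodied agent by combining high-level semantic reasoning with low-level geometric checks. It uses the framework to evaluate functional failures in synthetic indoor environments and the ability of frontier vision-language models to predict functional affordances, and then uses it as a reward engine for VLM post-training to distill geometric constraints into reasoning models.

Abstract (Chinese translation)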

具身人工智能依赖于能够支持多样化用户进行有意义活动的交互式三维环境，然而评估这些环境的功能可供性仍是一个核心挑战。我们提出SceneTeract框架，用于在特定智能体约束下验证三维场景的功能性。我们的核心贡献是一个耦合高层语义推理与底层几何检查的具身验证引擎。SceneTeract将复杂活动分解为原子动作序列，并依据具身智能体配置文件，通过显式的物理与几何模拟，针对可访问性要求（如可达性、空间净空和可通行性）验证每个步骤。我们运用SceneTeract对以下两方面进行深入评估：（一）合成室内环境，揭示了阻碍基础交互的频繁功能失效现象；（二）前沿视觉语言模型在推理和预测功能可供性方面的能力，发现即使当前最强模型仍存在语义置信度与物理可行性之间的系统性错配。最后，我们将SceneTeract作为视觉语言模型后训练的奖励引擎，实现了几何约束向推理模型的可扩展蒸馏。我们开源SceneTeract验证套件及相关数据，以弥合具身三维场景理解中感知与物理现实之间的鸿沟。

Abstract

Embodied AI depends on interactive 3D environments that support meaningful activities for diverse users, yet assessing their functional affordances remains a core challenge. We introduce SceneTeract, a framework that verifies 3D scene functionality under agent-specific constraints. Our core contribution is a grounded verification engine that couples high-level semantic reasoning with low-level geometric checks. SceneTeract decomposes complex activities into sequences of atomic actions and validates each step against accessibility requirements (e.g., reachability, clearance, and navigability) conditioned on an embodied agent profile, using explicit physical and geometric simulations. We deploy SceneTeract to perform an in-depth evaluation of (i) synthetic indoor environments, uncovering frequent functional failures that prevent basic interactions, and (ii) the ability of frontier Vision-Language Models (VLMs) to reason about and predict functional affordances, revealing systematic mismatches between semantic confidence and physical feasibility even for the strongest current models. Finally, we leverage SceneTeract as a reward engine for VLM post-training, enabling scalable distillation of geometric constraints into reasoning models. We release the SceneTeract verification suite and data to bridge perception and physical reality in embodied 3D scene understanding.

Keywords: Embodied AI, 3D scene functionality, agent-specific constraints, grounded verification, Vision-Language Models (VLMs), functional affordances, post-training, geometric constraints
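
The step-by-step verification described in the abstract (atomic actions checked against accessibility requirements for a given agent profile) can be sketched with plain data classes. Everything here is hypothetical: the class names, the two checks, and the numeric thresholds are placeholders, and the paper's engine relies on explicit physical and geometric simulation rather than scalar comparisons.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentProfile:
    name: str
    reach_m: float        # maximum arm reach in metres
    body_width_m: float   # width needed to pass through a gap


@dataclass(frozen=True)
class AtomicAction:
    name: str
    target_distance_m: float  # distance from the standing spot to the target
    path_clearance_m: float   # narrowest passage on the way to that spot


def verify_action(action, agent):
    """Return the list of accessibility requirements this action violates."""
    failures = []
    if action.path_clearance_m < agent.body_width_m:
        failures.append("navigability")
    if action.target_distance_m > agent.reach_m:
        failures.append("reachability")
    return failures


def verify_activity(actions, agent):
    """An activity is feasible only if every atomic action passes."""
    report = {action.name: verify_action(action, agent) for action in actions}
    feasible = all(not failures for failures in report.values())
    return feasible, report


fetch_book = [
    AtomicAction("walk_to_shelf", target_distance_m=0.0, path_clearance_m=0.65),
    AtomicAction("grasp_book", target_distance_m=0.7, path_clearance_m=0.65),
]
adult = AgentProfile("standing_adult", reach_m=0.8, body_width_m=0.5)
wheelchair_user = AgentProfile("wheelchair_user", reach_m=0.6, body_width_m=0.7)

print(verify_activity(fetch_book, adult))
print(verify_activity(fetch_book, wheelchair_user))
```

The same scene passes for one agent profile and fails for another, which is the sense in which functionality is verified under agent-specific constraints.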

5. 6GAgentGym: Tool Use, Data Synthesis, and Agentic Learning for Network Management

Authors: Jiao Chen, Jianhua Tang, Xiaotong Yang, Zuohong Lv | Journal/Source: arXiv | Published: 2026-03-31 | arXiv link: http://arxiv.org/abs/2603.29656v1

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

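Rationale: The paper builds autonomous agents (LLM Agents) on large language models (LLMs) for 6G network management. The agents use tools (Tool Use) to interact with the environment in a closed loop and are trained with supervised fine-tuning (Supervised Fine-tuning) followed by reinforcement learning. It is therefore highly relevant to "Large Language Models", "LLM Agents", "Tool Use", and "Supervised Fine-tuning" (10 points each). Other keywords such as MoE, SLMs, Scaling Laws, RAG, and CoT are either not mentioned in the abstract or unrelated to the paper's topic, so they score 0.

!!! tip deepseek-chat TL;DR

The paper proposes the 6GAgentGym framework to address the lack of closed-loop interaction in 6G network management. It builds an interactive environment that supports tool use and trains LLM agents with supervised fine-tuning and reinforcement learning, enabling an 8B open-source model to reach an overall success rate comparable to GPT-5 on 6GAgentBench.

Abstract (Chinese translation)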

自主6G网络管理需要能够执行工具、观察由此产生的状态变化并相应调整决策的智能体。然而，现有基于静态问题或脚本化情景回放的基准测试并不支持此类闭环交互，将智能体局限于被动评估而无法从环境反馈中学习。本文提出6GAgentGym以提供闭环交互能力。该框架构建了一个包含42种类型工具的交互环境，其效果分类机制区分了只读观察与状态变更配置，并依托基于NS-3仿真数据校准的学习型实验模型作为支撑。6G-Forge通过迭代式自我指令生成技术，以NS-3仿真数据为种子生成闭环训练轨迹，并依据实验模型进行执行验证。通过对生成语料进行监督微调，继而结合在线闭环交互的强化学习，使一个80亿参数的开源模型在配套的6GAgentBench测试中达到与GPT-5相当的整体成功率，且在长周期任务上表现更优。这些组件共同为自主闭环网络管理提供了可行的技术路径。

Abstract

Autonomous 6G network management requires agents that can execute tools, observe the resulting state changes, and adapt their decisions accordingly. Existing benchmarks based on static questions or scripted episode replay, however, do not support such closed-loop interaction, limiting agents to passive evaluation without the ability to learn from environmental feedback. This paper presents 6GAgentGym to provide closed-loop capability. The framework provides an interactive environment with 42 typed tools whose effect classification distinguishes read-only observation from state-mutating configuration, backed by a learned Experiment Model calibrated on NS-3 simulation data. 6G-Forge bootstraps closed-loop training trajectories from NS-3 seeds via iterative Self-Instruct generation with execution verification against the Experiment Model. Supervised fine-tuning on the resulting corpus followed by reinforcement learning with online closed-loop interaction enables an 8B open-source model to achieve comparable overall success rate to GPT-5 on the accompanying 6GAgentBench, with stronger performance on long-horizon tasks. Together, these components provide a viable path toward autonomous, closed-loop network management.

Keywords: 6G network management, autonomous agents, tool use, closed-loop interaction, supervised fine-tuning, reinforcement learning, LLM agents, NS-3 simulation
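
The distinction between read-only observation tools and state-mutating configuration tools can be sketched with a small registry. The tool names, the state fields, and the `allow_mutation` guard are hypothetical; the paper's 42 tools and its learned Experiment Model are not reproduced here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Effect(Enum):
    READ_ONLY = "read_only"
    STATE_MUTATING = "state_mutating"


@dataclass(frozen=True)
class Tool:
    name: str
    effect: Effect
    run: Callable[[dict, dict], object]


def get_throughput(state, args):
    """Observation: reads the network state without changing it."""
    return state["throughput_mbps"]


def set_tx_power(state, args):
    """Configuration: changes the network state."""
    state["tx_power_dbm"] = args["dbm"]
    return "ok"


TOOLS = {
    "get_throughput": Tool("get_throughput", Effect.READ_ONLY, get_throughput),
    "set_tx_power": Tool("set_tx_power", Effect.STATE_MUTATING, set_tx_power),
}


def call_tool(state, name, args, allow_mutation):
    """Dispatch a tool call, refusing mutations when they are not allowed."""
    tool = TOOLS[name]
    if tool.effect is Effect.STATE_MUTATING and not allow_mutation:
        raise PermissionError(f"{name} would change network state")
    return tool.run(state, args)


network = {"throughput_mbps": 120.0, "tx_power_dbm": 23}
print(call_tool(network, "get_throughput", {}, allow_mutation=False))
print(call_tool(network, "set_tx_power", {"dbm": 20}, allow_mutation=True))
print(network)
```

Tagging each tool with its effect lets an environment replay observations freely while treating configuration calls as the steps that actually advance the closed loop.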

6. Self-Improving Code Generation via Semantic Entropy and Behavioral Consensus

Authors: Huan Zhang, Wei Cheng, Wei Hu | Journal/Source: arXiv | Published: 2026-03-31 | arXiv link: http://arxiv.org/abs/2603.29292v1

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	10.0/10	10.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

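Rationale: The paper studies self-improvement of large language models (LLMs) for code generation, so it is highly relevant to "Large Language Models" (10 points). The method involves supervised fine-tuning (SFT) and direct preference optimization (DPO), which makes it highly relevant to "Post-training" and "RLHF/DPO" (10 points each). Its core contribution is a self-improvement framework, which is highly relevant to "Self-Correction/Self-Improvement" (10 points). Other keywords such as MoE, Quantization, and RAG are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper studies how a code-generation LLM can improve itself without external supervision such as a teacher model or test oracles. It proposes curriculum construction based on code semantic entropy and consensus-driven direct preference optimization, and its experiments show that the method significantly improves code generation performance.

Abstract (Chinese translation)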

提升大型语言模型（LLM）的代码生成能力通常依赖于监督微调或偏好优化，这两种方法均需要昂贵的外部资源，例如强大的教师模型或可靠的测试单元。然而，在实际场景中，获取参考答案与测试预言机远比获取问题描述和测试输入更为困难。本文探讨了一个具有挑战性但现实的问题：代码语言模型能否在无法访问更优教师和测试预言机的情况下实现自我提升？为回答此问题，我们提出了ConSelf，一种基于两个核心理念的自改进方法。首先，我们引入了代码语义熵这一新指标，它通过评估程序行为的功能多样性来衡量问题层面的不确定性，从而能够构建包含最具可学性问题的课程。其次，我们提出了共识驱动的直接偏好优化，这是一种基于偏好的微调方法，通过行为共识对每个偏好对进行加权，从而减轻噪声自生成监督的影响。在多种基准测试和骨干LLM上的实验表明，ConSelf显著优于基线方法，验证了基于语义熵的课程构建和共识驱动优化在无需外部监督的情况下提升代码生成的有效性。

Abstract

Improving the code generation capabilities of large language models (LLMs) typically relies on supervised fine-tuning or preference optimization, both of which require costly external resources such as powerful teacher models or reliable test units. However, in real-world scenarios, it is much harder to obtain reference solutions and test oracles than problem descriptions and test inputs. In this paper, we tackle a challenging yet realistic question: Can a code language model improve itself without access to a superior teacher and a test oracle? To answer this, we propose ConSelf, a self-improving approach built upon two key ideas. First, we introduce code semantic entropy, a novel metric that measures problem-level uncertainty by assessing the functional diversity of program behaviors, enabling a curriculum construction with the most learnable problems. Second, we present consensus-driven direct preference optimization (Con-DPO), a preference-based fine-tuning method that weights each preference pair by its behavioral consensus, thereby mitigating the impact of noisy self-generated supervision. Experiments on various benchmarks and backbone LLMs demonstrate that ConSelf significantly outperforms baselines, validating the effectiveness of semantic entropy-based curriculum construction and consensus-driven optimization in improving code generation without external supervision.

Keywords: code generation, large language models, self-improving, semantic entropy, behavioral consensus, direct preference optimization, supervised fine-tuning, curriculum learning
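
One plausible reading of code semantic entropy is sketched below: sampled programs are grouped by their outputs on shared test inputs, and the Shannon entropy of the group sizes measures how much the samples disagree. The abstract does not give the exact definition, so this formula, the `consensus_weight` helper, and the toy candidate programs are all assumptions.

```python
import math
from collections import Counter


def behavior_signature(program, test_inputs):
    """Describe a program by its outputs (or error types) on the test inputs."""
    outputs = []
    for value in test_inputs:
        try:
            outputs.append(repr(program(value)))
        except Exception as exc:
            outputs.append("error:" + type(exc).__name__)
    return tuple(outputs)


def code_semantic_entropy(programs, test_inputs):
    """Shannon entropy over clusters of behaviourally identical programs."""
    clusters = Counter(behavior_signature(p, test_inputs) for p in programs)
    total = sum(clusters.values())
    return -sum((n / total) * math.log(n / total) for n in clusters.values())


def consensus_weight(program, programs, test_inputs):
    """Fraction of samples that behave like `program` (hypothetical weighting)."""
    signature = behavior_signature(program, test_inputs)
    matches = sum(1 for p in programs
                  if behavior_signature(p, test_inputs) == signature)
    return matches / len(programs)


# Four sampled candidates for "return the absolute value" (toy example).
candidates = [
    lambda x: abs(x),
    lambda x: x if x >= 0 else -x,
    lambda x: x,    # wrong for negative inputs
    lambda x: -x,   # wrong for positive inputs
]
inputs = [-2, 0, 3]

entropy = code_semantic_entropy(candidates, inputs)
weight = consensus_weight(candidates[0], candidates, inputs)
print(entropy, weight)
```

Only test inputs are needed, not expected outputs, which matches the setting in the abstract where test oracles are unavailable. A problem whose samples all agree has zero entropy, while one with widely scattered behaviours has high entropy; a curriculum could then favour problems between those extremes.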

7. Aligning Multimodal Sequential Recommendations via Robust Direct Preference Optimization with Sparse

Authors: Hejin Huang, Jusheng Zhang, Kaitong Cai, Jian Wang, Rong Pan | Journal/Source: arXiv | Published: 2026-03-31 | arXiv link: http://arxiv.org/abs/2603.29259v1

Score: 36.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	10.0/10	10.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	8.0/10	8.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

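Scoring rationale: The paper's core contribution studies Direct Preference Optimization (DPO) in recommender systems and uses a sparse Mixture-of-Experts (MoE) architecture, so it is highly relevant to “Direct Preference Optimization” and “Mixture of Experts” (10 points each). It touches on alignment techniques for large models, so it is partly relevant to “Large Language Models” and “Alignment” (8 points each). Other keywords such as SLMs, Scaling Laws, and Pre-training are not mentioned in the abstract and score 0.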
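The totals in these tables can be reproduced with a few lines of arithmetic. The sketch below assumes that each keyword's score is weight × relevance and that the passing threshold follows the formula stated in the report configuration (5.0 + 0.8 × total weight). The function names are hypothetical, since the report does not show its scoring code.

```python
def total_score(rows):
    """Sum weight * relevance over (weight, relevance) rows (assumed scoring rule)."""
    return sum(weight * relevance for weight, relevance in rows)


def passing_score(total_weight, base=5.0, factor=0.8):
    """Passing threshold from the report configuration: base + factor * total weight."""
    return base + factor * total_weight


# Non-zero rows of the table above, followed by the 23 keywords that scored 0.
rows = [(1.0, 8.0), (1.0, 10.0), (1.0, 8.0), (1.0, 10.0)] + [(1.0, 0.0)] * 23
score = total_score(rows)
threshold = passing_score(sum(weight for weight, _ in rows))
passed = score >= threshold
```

For the table above this reproduces the reported 36.0 against a threshold of 26.6.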

!!! tip deepseek-chat TL;DR

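The paper studies how an improved negative-selection strategy (stochastic sampling from a dynamic top-K candidate pool) strengthens Direct Preference Optimization for multimodal sequential recommendation under implicit feedback. Combined with a sparse MoE encoder for efficient capacity scaling, it reports up to a 5.25% NDCG@5 improvement on three Amazon benchmarks.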

Abstract

Preference-based alignment objectives have been widely adopted, from RLHF-style pairwise learning in large language models to emerging applications in recommender systems. Yet, existing work rarely examines how Direct Preference Optimization (DPO) behaves under implicit feedback, where unobserved items are not reliable negatives. We conduct systematic experiments on multimodal sequential recommendation to compare common negative-selection strategies and their interaction with DPO training. Our central finding is that a simple modification, replacing deterministic hard negatives with stochastic sampling from a dynamic top-K candidate pool, consistently improves ranking performance. We attribute its effectiveness to two factors: (1) reducing erroneous suppressive gradients caused by false negatives, and (2) retaining informative hard signals while smoothing optimization via controlled stochasticity. With an optional sparse Mixture-of-Experts encoder for efficient capacity scaling, RoDPO achieves up to 5.25% NDCG@5 on three Amazon benchmarks, with nearly unchanged inference cost.

Keywords: Direct Preference Optimization, DPO, Mixture-of-Experts, MoE, multimodal sequential recommendation, negative-selection strategies, implicit feedback, ranking performance
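The abstract's central change, replacing a deterministic hard negative with a random draw from a top-K pool, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the item scores, the pool size `k`, `beta`, and both function names are assumptions, and the loss shown is the standard pairwise DPO objective.

```python
import math
import random


def sample_negative(scores, positive, k, rng):
    """Draw one negative uniformly from the k highest-scored items other than the positive."""
    ranked = sorted(
        (item for item in scores if item != positive),
        key=lambda item: scores[item],
        reverse=True,
    )
    return rng.choice(ranked[:k])


def dpo_loss(policy_pos, policy_neg, ref_pos, ref_neg, beta=0.1):
    """DPO loss for one pair, given log-probabilities under the policy and a frozen reference."""
    margin = beta * ((policy_pos - ref_pos) - (policy_neg - ref_neg))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


# Hypothetical current model scores; recomputing them each step is what makes the pool "dynamic".
scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1, "e": 0.05}
rng = random.Random(0)
negative = sample_negative(scores, positive="a", k=3, rng=rng)
loss = dpo_loss(policy_pos=-1.0, policy_neg=-2.0, ref_pos=-1.5, ref_neg=-1.5)
```

A deterministic hard-negative strategy would always return the single top-ranked item, which under implicit feedback may be a false negative. Sampling from the pool spreads that risk while still drawing from hard candidates.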

8. DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA

Authors: Yi Chen, Yuying Ge, Hui Zhou, Mingyu Ding, Yixiao Ge, Xihui Liu Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29844v1

Score: 35.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	10.0/10	10.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

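Scoring rationale: The paper proposes DIAL, a framework that bridges high-level decision making and low-level motor execution through latent world modeling. At its core, a VLM (vision-language model) acts as System-2 for latent world modeling, and a System-1 policy decodes the intent. Keyword relevance: 1) “System 2 Thinking” and “World Models” are highly relevant (10 points each), because the paper explicitly uses System-2 for latent world modeling and synthesizes latent visual foresight; 2) “Large Language Models”, “Pre-training”, and “Post-training” are partly relevant (5 points each), because the paper builds on a pre-trained VLM and uses two-stage training (including fine-tuning) but does not focus directly on LLM techniques; 3) other keywords such as MoE, SLMs, alignment, and RAG are not mentioned or used in the paper and score 0.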

!!! tip deepseek-chat TL;DR

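Existing end-to-end Vision-Language-Action (VLA) models mainly use the Vision-Language Model (VLM) as a multimodal encoder, which underuses its high-level decision-making potential and makes training unstable. The paper proposes DIAL, which uses a differentiable latent intent bottleneck and two-stage training to reach state-of-the-art performance on robot manipulation tasks and to demonstrate zero-shot generalization.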

Abstract

The development of Vision-Language-Action (VLA) models has been significantly accelerated by pre-trained Vision-Language Models (VLMs). However, most existing end-to-end VLAs treat the VLM primarily as a multimodal encoder, directly mapping vision-language features to low-level actions. This paradigm underutilizes the VLM’s potential in high-level decision making and introduces training instability, frequently degrading its rich semantic representations. To address these limitations, we introduce DIAL, a framework bridging high-level decision making and low-level motor execution through a differentiable latent intent bottleneck. Specifically, a VLM-based System-2 performs latent world modeling by synthesizing latent visual foresight within the VLM’s native feature space; this foresight explicitly encodes intent and serves as the structural bottleneck. A lightweight System-1 policy then decodes this predicted intent together with the current observation into precise robot actions via latent inverse dynamics. To ensure optimization stability, we employ a two-stage training paradigm: a decoupled warmup phase where System-2 learns to predict latent futures while System-1 learns motor control under ground-truth future guidance within a unified feature space, followed by seamless end-to-end joint optimization. This enables action-aware gradients to refine the VLM backbone in a controlled manner, preserving pre-trained knowledge. Extensive experiments on the RoboCasa GR1 Tabletop benchmark show that DIAL establishes a new state-of-the-art, achieving superior performance with 10x fewer demonstrations than prior methods. Furthermore, by leveraging heterogeneous human demonstrations, DIAL learns physically grounded manipulation priors and exhibits robust zero-shot generalization to unseen objects and novel configurations during real-world deployment on a humanoid robot.

Keywords: Vision-Language-Action (VLA), Vision-Language Models (VLMs), latent world modeling, System-2, System-1, end-to-end training, robot manipulation, zero-shot generalization

9. Distilling Human-Aligned Privacy Sensitivity Assessment from Large Language Models

Authors: Gabriel Loiseau, Damien Sileo, Damien Riquet, Maxime Meyer, Marc Tommasi Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29497v1

Score: 34.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	8.0/10	8.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	8.0/10	8.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	8.0/10	8.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

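Scoring rationale: The paper's core contribution studies LLMs as privacy evaluators and uses knowledge distillation to transfer a large model's capability to small models, so it is highly relevant to “Large Language Models” (10 points) and relevant to “Small Language Models” (8 points). It trains classifiers with supervised fine-tuning, which relates to “Post-training” (8 points). Its goal is to align model assessments with human judgments, which relates to “Instruction Tuning” OR “Alignment” (8 points). Other keywords such as MoE, Scaling Laws, and RAG are not covered in the paper and score 0.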

!!! tip deepseek-chat TL;DR

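The paper addresses the high computational cost of using large language models as privacy evaluators. It distills the privacy assessment capability of Mistral Large 3 into lightweight encoder models, keeping strong agreement with human judgments while substantially reducing compute requirements.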

Abstract

Accurate privacy evaluation of textual data remains a critical challenge in privacy-preserving natural language processing. Recent work has shown that large language models (LLMs) can serve as reliable privacy evaluators, achieving strong agreement with human judgments; however, their computational cost and impracticality for processing sensitive data at scale limit real-world deployment. We address this gap by distilling the privacy assessment capabilities of Mistral Large 3 (675B) into lightweight encoder models with as few as 150M parameters. Leveraging a large-scale dataset of privacy-annotated texts spanning 10 diverse domains, we train efficient classifiers that preserve strong agreement with human annotations while dramatically reducing computational requirements. We validate our approach on human-annotated test data and demonstrate its practical utility as an evaluation metric for de-identification systems.

Keywords: privacy evaluation, large language models, knowledge distillation, lightweight models, human-aligned assessment, computational efficiency, de-identification systems, privacy-preserving NLP

10. An Empirical Study of Multi-Agent Collaboration for Automated Research

Authors: Yang Shen, Zhenyi Yi, Ziyi Zhao, Lijun Sun, Dongyang Li, Chin-Teng Lin, Yuhui Shi Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29632v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

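Scoring rationale: The paper's core contribution studies Multi-Agent Systems (MAS) for automated research, which maps directly to the “Multi-agent Systems” and “LLM Agents” keywords (10 points each). It explicitly describes the shift from single Large Language Models (LLMs) to multi-agent systems, so “Large Language Models” is also highly relevant (10 points). Other keywords, covering MoE, SLMs, training methods, reasoning techniques, compression, and AI-for-science applications, are not addressed in the paper and score 0.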

!!! tip deepseek-chat TL;DR

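The paper empirically compares multi-agent collaboration frameworks (a subagent architecture versus an agent team architecture) on automated machine learning optimization. The subagent mode is highly robust and high-throughput under strict time constraints, while the agent team mode achieves deeper theoretical alignment given a sufficient compute budget but is operationally more fragile.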

Abstract

As AI agents evolve, the community is rapidly shifting from single Large Language Models (LLMs) to Multi-Agent Systems (MAS) to overcome cognitive bottlenecks in automated research. However, the optimal multi-agent coordination framework for these autonomous agents remains largely unexplored. In this paper, we present a systematic empirical study investigating the comparative efficacy of distinct multi-agent structures for automated machine learning optimization. Utilizing a rigorously controlled, execution-based testbed equipped with Git worktree isolation and explicit global memory, we benchmark a single-agent baseline against two multi-agent paradigms: a subagent architecture (parallel exploration with post-hoc consolidation) and an agent team architecture (experts with pre-execution handoffs). By evaluating these systems under strictly fixed computational time budgets, our findings reveal a fundamental trade-off between operational stability and theoretical deliberation. The subagent mode functions as a highly resilient, high-throughput search engine optimal for broad, shallow optimizations under strict time constraints. Conversely, the agent team topology exhibits higher operational fragility due to multi-author code generation but achieves the deep theoretical alignment necessary for complex architectural refactoring given extended compute budgets. These empirical insights provide actionable guidelines for designing future autoresearch systems, advocating for dynamically routed architectures that adapt their collaborative structures to real-time task complexity.

Keywords: Multi-Agent Systems, Large Language Models, Automated Research, Agent Coordination, Empirical Study, Machine Learning Optimization, Subagent Architecture, Agent Team Architecture

11. Theory of Mind and Self-Attributions of Mentality are Dissociable in LLMs

Authors: Junsol Kim, Winnie Street, Roberta Rocca, Daine M. Korngiebel, Adam Waytz, James Evans, Geoff Keeling Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28925v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

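Scoring rationale: The paper's core contribution studies how safety fine-tuning of LLMs affects Theory of Mind (ToM) capabilities, which maps directly to “Large Language Models” and “Post-training” (safety fine-tuning is a form of post-training). “Instruction Tuning” OR “Alignment” and “Mechanistic Interpretability” are partly relevant, because the study involves alignment (suppressing harmful mind-attribution) and mechanistic analysis (representational similarity analysis). Other keywords such as MoE, Scaling Laws, RAG, and Agents are not mentioned in the abstract or are unrelated to the topic.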

!!! tip deepseek-chat TL;DR

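The study asks whether safety fine-tuning of Large Language Models (LLMs), which suppresses self-attributions of mind (such as claims of consciousness or emotion), degrades Theory of Mind (ToM) capabilities. It finds that the two are behaviorally and mechanistically dissociable, but that safety fine-tuned models attribute less mind to non-human animals and are less likely to exhibit spiritual belief.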

Abstract

Safety fine-tuning in Large Language Models (LLMs) seeks to suppress potentially harmful forms of mind-attribution such as models asserting their own consciousness or claiming to experience emotions. We investigate whether suppressing mind-attribution tendencies degrades intimately related socio-cognitive abilities such as Theory of Mind (ToM). Through safety ablation and mechanistic analyses of representational similarity, we demonstrate that LLM attributions of mind to themselves and to technological artefacts are behaviorally and mechanistically dissociable from ToM capabilities. Nevertheless, safety fine-tuned models under-attribute mind to non-human animals relative to human baselines and are less likely to exhibit spiritual belief, suppressing widely shared perspectives regarding the distribution and nature of non-human minds.

Keywords: Large Language Models, Safety Fine-tuning, Theory of Mind, Mind-attribution, Mechanistic Analysis, Representational Similarity, Consciousness, Alignment

12. Concept frustration: Aligning human concepts and machine representations

Authors: Enrico Parisini, Christopher J. Soelistyo, Ahab Isaac, Alessandro Barp, Christopher R. S. Banerji Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29654v1

Score: 28.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	8.0/10	8.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

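Scoring rationale: The paper's core contribution studies the alignment of human concepts with machine representations in foundation models, which falls under Explainable AI, so it is highly relevant to “Large Language Models” OR “Foundation Models” and to “Mechanistic Interpretability” OR “Explainable AI” (10 points each). The paper refers to “alignment” repeatedly, but in the sense of aligning concept representations rather than the “Value Alignment” of AI safety, hence 8 points. Other keywords, covering MoE, SLMs, training methods, reasoning techniques, compression, and application domains, are not addressed in the paper and score 0.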

!!! tip deepseek-chat TL;DR

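The paper proposes a geometric framework for detecting and resolving inconsistencies between human-interpretable concepts and a foundation model's internal representations (concept frustration), using task-aligned similarity measures to better align human and machine conceptual reasoning.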

Abstract

Aligning human-interpretable concepts with the internal representations learned by modern machine learning systems remains a central challenge for interpretable AI. We introduce a geometric framework for comparing supervised human concepts with unsupervised intermediate representations extracted from foundation model embeddings. Motivated by the role of conceptual leaps in scientific discovery, we formalise the notion of concept frustration: a contradiction that arises when an unobserved concept induces relationships between known concepts that cannot be made consistent within an existing ontology. We develop task-aligned similarity measures that detect concept frustration between supervised concept-based models and unsupervised representations derived from foundation models, and show that the phenomenon is detectable in task-aligned geometry while conventional Euclidean comparisons fail. Under a linear-Gaussian generative model we derive a closed-form expression for Bayes-optimal concept-based classifier accuracy, decomposing predictive signal into known-known, known-unknown and unknown-unknown contributions and identifying analytically where frustration affects performance. Experiments on synthetic data and real language and vision tasks demonstrate that frustration can be detected in foundation model representations and that incorporating a frustrating concept into an interpretable model reorganises the geometry of learned concept representations, to better align human and machine reasoning. These results suggest a principled framework for diagnosing incomplete concept ontologies and aligning human and machine conceptual reasoning, with implications for the development and validation of safe interpretable AI for high-risk applications.

Keywords: concept frustration, foundation models, interpretable AI, human-machine alignment, geometric framework, concept-based models, unsupervised representations, Bayes-optimal classifier

📋 Full paper list

1. ✅ Quantization with Unified Adaptive Distillation to enable multi-LoRA based one-for-all Generative Vision Models on edge

Score: 55.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	10.0/10	10.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	15.0/10	15.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	15.0/10	15.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	10.0/10	10.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

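The paper tackles the redundant storage and runtime overhead of deploying multi-task large vision models on resource-constrained edge devices. By treating LoRA weights as runtime inputs and introducing the QUAD quantization-aware training strategy, it reports up to a 6x reduction in memory footprint and up to a 4x latency improvement.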

Abstract

Generative Artificial Intelligence (GenAI) features such as image editing, object removal, and prompt-guided image transformation are increasingly integrated into mobile applications. However, deploying Large Vision Models (LVMs) for such tasks on resource-constrained devices remains challenging due to their high memory and compute requirements. While Low-Rank Adapters (LoRAs) enable parameter-efficient task adaptation, existing mobile deployment pipelines typically compile separate model binaries for each LoRA + a copy of the foundation model, resulting in redundant storage and increased runtime overhead. In this work, we present a unified framework for enabling multi-task GenAI inference on edge devices using a single shared model. Our key idea is to treat LoRA weights as runtime inputs rather than embedding them into the compiled model graph, allowing dynamic task switching at runtime without recompilation. Then, to support efficient on-device execution, we introduce QUAD (Quantization with Unified Adaptive Distillation), a quantization-aware training strategy that aligns multiple LoRA adapters under a shared quantization profile. We implement the proposed system with a lightweight runtime stack compatible with mobile NPUs and evaluate it across multiple chipsets. Experimental results demonstrate up to 6x and 4x reduction in memory footprint and latency improvements, respectively, while maintaining high visual quality across multiple GenAI tasks.

Keywords: Generative Vision Models, Edge Deployment, LoRA, Quantization, Model Compression, Inference Acceleration, Multi-task Learning, Mobile NPUs
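The idea of passing LoRA weights as runtime inputs can be illustrated with a toy linear layer. This is only a sketch under stated assumptions: the layer shape, the adapter values, and the function names are hypothetical, and a real deployment would run a compiled, quantized graph on an NPU rather than Python lists.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def forward(x, base_w, lora_a, lora_b, scale=1.0):
    """Compute y = W x + scale * B (A x); the adapter (A, B) is an argument, not baked into W."""
    base = matvec(base_w, x)
    delta = matvec(lora_b, matvec(lora_a, x))
    return [b + scale * d for b, d in zip(base, delta)]


base_w = [[1.0, 0.0], [0.0, 1.0]]            # shared base weights, loaded once
task_one = ([[1.0, 1.0]], [[1.0], [0.0]])    # rank-1 adapter (A, B) for a first task
task_two = ([[1.0, 0.0]], [[0.0], [2.0]])    # rank-1 adapter (A, B) for a second task

x = [1.0, 2.0]
y_one = forward(x, base_w, *task_one)
y_two = forward(x, base_w, *task_two)
```

Switching tasks only changes the adapter arguments. The base weights are never copied or modified, which is the property the abstract relies on to avoid one compiled binary per LoRA.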

2. ✅ One-for-All: A Lightweight Stabilized and Parameter-Efficient Pre-trained LLM for Time Series Forecasting

Authors: Prasanjit Dey, Soumyabrata Dev, Bianca Schoen-Phelan Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29756v1

Score: 51.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	5.0/10	5.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	8.0/10	8.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	8.0/10	8.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	10.0/10	10.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	5.0/10	5.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

!!! tip deepseek-chat TL;DR

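The paper proposes One-for-All, a lightweight parameter-efficient fine-tuning approach (rsLoRA) that adapts pre-trained large language models to multivariate time-series forecasting, substantially reducing parameter count and memory footprint while maintaining forecasting accuracy on par with state-of-the-art baselines.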

Abstract

We address the challenge of adapting pre-trained Large Language Models (LLMs) for multivariate time-series analysis, where their deployment is often hindered by prohibitive computational and memory demands. Our solution, One-for-All, introduces Gaussian Rank-Stabilized Low-Rank Adapters (rsLoRA) to enable parameter-efficient fine-tuning of frozen LLMs. While inspired by LoRA, rsLoRA introduces a mathematically grounded rank-stabilization mechanism that enables provable gradient stability at low ranks, a novel contribution absent in prior PEFT methods. Our framework injects trainable rank decomposition matrices (rank 16) into positional embeddings and output layers, while keeping self-attention weights fixed. This design reduces trainable parameters by 6.8$\times$ (vs. TimesNet), 21$\times$ (vs. GPT4TS), and 11.8$\times$ (vs. TIME-LLM), while achieving a 168-1,776$\times$ smaller memory footprint (2.2MiB vs. 340MiB-4.18GiB in SOTA models). Rigorous evaluation across six time-series tasks demonstrates that One-for-All achieves state-of-the-art efficiency-accuracy trade-offs: 5.5$\times$ higher parameter efficiency (MSE=5.50) than TimesNet and 21$\times$ better than GPT4TS, while matching their forecasting accuracy (MSE=0.33). The framework’s stability is validated through consistent performance across diverse horizons (96-720 steps) and datasets (ETT, Weather, M3, M4), with 98.3% fewer parameters than conventional transformers. These advances enable deployment on edge devices for healthcare, finance, and environmental monitoring without compromising performance.
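
To make the adapter arithmetic concrete, the sketch below compares the conventional LoRA scaling factor α/r with the rank-stabilized factor α/√r described in the rsLoRA literature, and counts the trainable parameters of a rank-16 adapter. The abstract does not specify the paper's Gaussian variant, so the scaling rule, the layer size (768 × 768), and α = 16 are assumptions chosen for illustration, not values from the paper.

```python
import math


def lora_scale(alpha: float, rank: int, rank_stabilized: bool) -> float:
    """Multiplier applied to the low-rank update B @ A.

    Conventional LoRA uses alpha / rank; rank-stabilized LoRA uses
    alpha / sqrt(rank) (assumed here, not taken from the paper).
    """
    return alpha / math.sqrt(rank) if rank_stabilized else alpha / rank


def adapter_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical layer size and alpha; rank 16 matches the abstract.
d_in, d_out, rank, alpha = 768, 768, 16, 16.0

full = d_in * d_out                       # parameters in the frozen weight
lora = adapter_params(d_in, d_out, rank)  # parameters actually trained

print(f"frozen weight: {full}, adapter: {lora}, ratio: {full / lora:.1f}x")
for r in (4, 16, 64):
    # Columns: rank, conventional scale, rank-stabilized scale
    print(r, lora_scale(alpha, r, False), lora_scale(alpha, r, True))
```

With α fixed, the conventional factor shrinks as 1/r while the rank-stabilized factor shrinks only as 1/√r, which is the motivation usually given for rank stabilization.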

Keywords: Large Language Models, Parameter-efficient Fine-tuning, LoRA, Time Series Forecasting, Pre-trained Models, Low-Rank Adaptation, Edge Deployment, Multivariate Analysis

3. ✅ Learning to Generate Formally Verifiable Step-by-Step Logic Reasoning via Structured Formal Intermediaries

Authors: Luoxin Chen, Yichi Zhou, Huishuai Zhang | Source: arxiv | Published: 2026-03-31 | arXiv link: http://arxiv.org/abs/2603.29500v1

Score: 50.0 / 26.6 ✅

Scoring details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0
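
The total of 50.0 is consistent with the table: each row's score equals weight × relevance, and the report configuration gives the passing threshold as 5.0 + 0.8 × total weight. The sketch below reproduces that arithmetic. The function names are hypothetical, and the rule that a paper passes when its score is at least the threshold is an assumption.

```python
def passing_score(weights, base=5.0, factor=0.8):
    """Threshold from the report configuration: base + factor * total weight."""
    return base + factor * sum(weights)


def paper_score(rows):
    """rows: iterable of (weight, relevance out of 10) pairs, one per keyword."""
    return sum(weight * relevance for weight, relevance in rows)


# 27 keywords, each with weight 1.0
weights = [1.0] * 27
# Five rows in the table have relevance 10.0; the other 22 rows are 0.0
relevances = [10.0] * 5 + [0.0] * 22

score = paper_score(zip(weights, relevances))
threshold = passing_score(weights)
passed = score >= threshold  # assumed pass rule

print(f"score={score:.1f} threshold={threshold:.1f} passed={passed}")
```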

!!! tip deepseek-chat TL;DR

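This paper addresses the unreliability of intermediate steps in LLM multi-step reasoning. It proposes PRoSFI (Process Reward over Structured Formal Intermediates), a process reward method that uses formal verification to steer the model toward machine-checkable reasoning steps, improving the trustworthiness of its reasoning.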

Abstract

Large language models (LLMs) have recently demonstrated impressive performance on complex, multi-step reasoning tasks, especially when post-trained with outcome-rewarded reinforcement learning (Guo et al. 2025). However, it has been observed that outcome rewards often overlook flawed intermediate steps, leading to unreliable reasoning steps even when final answers are correct. To address this unreliable reasoning, we propose PRoSFI (Process Reward over Structured Formal Intermediates), a novel reward method that enhances reasoning reliability without compromising accuracy. Instead of generating formal proofs directly, which is rarely accomplishable for a modest-sized (7B) model, the model outputs structured intermediate steps aligned with its natural language reasoning. Each step is then verified by a formal prover. Only fully validated reasoning chains receive high rewards. The integration of formal verification guides the model towards generating step-by-step machine-checkable proofs, thereby yielding more credible final answers. PRoSFI offers a simple and effective approach to training trustworthy reasoning models.
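
The all-or-nothing rule in the abstract ("only fully validated reasoning chains receive high rewards") can be sketched as follows. This is a toy illustration, not the paper's implementation: the step format, the arithmetic `toy_verifier` standing in for a formal prover, and the reward values 1.0 and 0.0 are all assumptions. How the reward combines with final-answer correctness is not stated in the abstract and is left out.

```python
from typing import Callable, Sequence, Tuple

Step = Tuple[int, int, int]  # toy step: (a, b, claimed value of a + b)


def toy_verifier(step: Step) -> bool:
    """Stand-in for a formal prover: accepts a step only if the claim holds."""
    a, b, claimed = step
    return a + b == claimed


def chain_reward(
    steps: Sequence[Step],
    verify: Callable[[Step], bool],
    high: float = 1.0,
    low: float = 0.0,
) -> float:
    """Return `high` only when every step verifies; an empty chain gets `low`."""
    if not steps:
        return low
    return high if all(verify(step) for step in steps) else low


valid_chain = [(1, 2, 3), (3, 4, 7)]
flawed_chain = [(1, 2, 3), (3, 4, 8)]  # second step is wrong

print(chain_reward(valid_chain, toy_verifier))   # 1.0
print(chain_reward(flawed_chain, toy_verifier))  # 0.0
```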

Keywords: Large Language Models, Reasoning Reliability, Formal Verification, Process Reward, Structured Intermediates, Multi-step Reasoning, Reinforcement Learning, PRoSFI

4. ✅ SceneTeract: Agentic Functional Affordances and VLM Grounding in 3D Scenes

Score: 43.0 / 26.6 ✅

Scoring details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	8.0/10	8.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

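Scoring rationale: The paper studies SceneTeract, a framework for verifying the functionality of 3D scenes in embodied AI. Its core concern is verifying agent interactions in 3D environments, so it is highly relevant to “LLM Agents” (10). It uses Vision-Language Models (VLMs) for affordance prediction and reasoning, which relates to “Large Language Models” (5), although it is not purely LLM research. It uses SceneTeract as a reward engine for “VLM post-training”, which relates to “Post-training” (8). It decomposes complex activities into sequences of atomic actions and verifies them, reflecting multi-step and in-depth reasoning, so “Chain of Thought” and “System 2 Thinking” each receive 5. It reveals systematic mismatches between semantic confidence and physical feasibility in VLMs, touching on factuality and explainability, so “Hallucination Mitigation” and “Explainable AI” each receive 5. Other keywords such as MoE, SLMs, Scaling Laws, RLHF, RAG, and Quantization have no direct connection to the paper and receive 0.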

!!! tip deepseek-chat TL;DR

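This paper proposes SceneTeract, a framework that verifies the functionality of 3D scenes under the constraints of a specific embodied agent by combining high-level semantic reasoning with low-level geometric checks. The authors use it to evaluate functional failures in synthetic indoor environments and the ability of frontier Vision-Language Models (VLMs) to predict functional affordances. They also use it as a reward engine for VLM post-training, distilling geometric constraints into reasoning models.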

Abstract

Embodied AI depends on interactive 3D environments that support meaningful activities for diverse users, yet assessing their functional affordances remains a core challenge. We introduce SceneTeract, a framework that verifies 3D scene functionality under agent-specific constraints. Our core contribution is a grounded verification engine that couples high-level semantic reasoning with low-level geometric checks. SceneTeract decomposes complex activities into sequences of atomic actions and validates each step against accessibility requirements (e.g., reachability, clearance, and navigability) conditioned on an embodied agent profile, using explicit physical and geometric simulations. We deploy SceneTeract to perform an in-depth evaluation of (i) synthetic indoor environments, uncovering frequent functional failures that prevent basic interactions, and (ii) the ability of frontier Vision-Language Models (VLMs) to reason about and predict functional affordances, revealing systematic mismatches between semantic confidence and physical feasibility even for the strongest current models. Finally, we leverage SceneTeract as a reward engine for VLM post-training, enabling scalable distillation of geometric constraints into reasoning models. We release the SceneTeract verification suite and data to bridge perception and physical reality in embodied 3D scene understanding.

Keywords: Embodied AI, 3D scene functionality, agent-specific constraints, grounded verification, Vision-Language Models (VLMs), functional affordances, post-training, geometric constraints

5. ✅ 6GAgentGym: Tool Use, Data Synthesis, and Agentic Learning for Network Management

Authors: Jiao Chen, Jianhua Tang, Xiaotong Yang, Zuohong Lv | Source: arxiv | Published: 2026-03-31 | arXiv link: http://arxiv.org/abs/2603.29656v1

Score: 40.0 / 26.6 ✅

Scoring details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

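Scoring rationale: The paper builds autonomous agents (LLM Agents) on large language models (LLMs) for 6G network management. The agents use tools (Tool Use) to interact with the environment in a closed loop, and are trained with supervised fine-tuning followed by reinforcement learning. The paper is therefore highly relevant to “Large Language Models”, “LLM Agents”, “Tool Use”, and “Supervised Fine-tuning” (10 each). Other keywords such as MoE, SLMs, Scaling Laws, RAG, and CoT are not mentioned in the abstract or are unrelated to the paper's topic, so they receive 0.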

!!! tip deepseek-chat TL;DR

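This paper presents 6GAgentGym, a framework that addresses the lack of closed-loop interaction in 6G network management benchmarks. It builds an interactive environment that supports tool use and trains LLM agents with supervised fine-tuning and reinforcement learning, enabling an 8B open-source model to reach an overall success rate comparable to GPT-5 on 6GAgentBench.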

Abstract

Autonomous 6G network management requires agents that can execute tools, observe the resulting state changes, and adapt their decisions accordingly. Existing benchmarks based on static questions or scripted episode replay, however, do not support such closed-loop interaction, limiting agents to passive evaluation without the ability to learn from environmental feedback. This paper presents 6GAgentGym to provide closed-loop capability. The framework provides an interactive environment with 42 typed tools whose effect classification distinguishes read-only observation from state-mutating configuration, backed by a learned Experiment Model calibrated on NS-3 simulation data. 6G-Forge bootstraps closed-loop training trajectories from NS-3 seeds via iterative Self-Instruct generation with execution verification against the Experiment Model. Supervised fine-tuning on the resulting corpus followed by reinforcement learning with online closed-loop interaction enables an 8B open-source model to achieve comparable overall success rate to GPT-5 on the accompanying 6GAgentBench, with stronger performance on long-horizon tasks. Together, these components provide a viable path toward autonomous, closed-loop network management.
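
The distinction between read-only observation and state-mutating configuration can be illustrated with a small typed-tool registry. This is a sketch under stated assumptions: `Effect`, `Tool`, `ToyEnv`, and the two transmit-power tools are hypothetical names invented here, not part of 6GAgentGym, and the dictionary state stands in for the learned Experiment Model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict, Iterable


class Effect(Enum):
    READ_ONLY = "read_only"            # observation only
    STATE_MUTATING = "state_mutating"  # changes the configuration


@dataclass(frozen=True)
class Tool:
    name: str
    effect: Effect
    run: Callable[[Dict[str, Any], Dict[str, Any]], Any]


class ToyEnv:
    """Tiny stand-in for an interactive environment; not the paper's API."""

    def __init__(self, state: Dict[str, Any], tools: Iterable[Tool]) -> None:
        self.state = dict(state)
        self.tools = {tool.name: tool for tool in tools}
        self.mutations = 0  # number of state-mutating calls so far

    def call(self, name: str, **args: Any) -> Any:
        tool = self.tools[name]
        if tool.effect is Effect.READ_ONLY:
            # Hand over a copy so an observation tool cannot alter the state.
            return tool.run(dict(self.state), args)
        self.mutations += 1
        return tool.run(self.state, args)


def get_tx_power(state, args):
    return state["tx_power_dbm"]


def set_tx_power(state, args):
    state["tx_power_dbm"] = args["value"]
    return state["tx_power_dbm"]


TOOLS = [
    Tool("get_tx_power", Effect.READ_ONLY, get_tx_power),
    Tool("set_tx_power", Effect.STATE_MUTATING, set_tx_power),
]

env = ToyEnv({"tx_power_dbm": 20}, TOOLS)
before = env.call("get_tx_power")
env.call("set_tx_power", value=23)
after = env.call("get_tx_power")
print(before, after, env.mutations)  # 20 23 1
```

Tagging each tool with its effect lets the environment count or gate configuration changes separately from observations, which is one plausible use of such a classification in a closed-loop setting.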

Keywords: 6G network management, autonomous agents, tool use, closed-loop interaction, supervised fine-tuning, reinforcement learning, LLM agents, NS-3 simulation

6. ✅ Self-Improving Code Generation via Semantic Entropy and Behavioral Consensus

Authors: Huan Zhang, Wei Cheng, Wei Hu | Source: arxiv | Published: 2026-03-31 | arXiv link: http://arxiv.org/abs/2603.29292v1

Score: 40.0 / 26.6 ✅

Scoring details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	10.0/10	10.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

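This paper studies how a code-generation LLM can improve itself without external supervision such as a teacher model or test oracles. It proposes curriculum construction based on code semantic entropy together with consensus-driven direct preference optimization, and its experiments show significant gains in code generation performance.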

Abstract

Improving the code generation capabilities of large language models (LLMs) typically relies on supervised fine-tuning or preference optimization, both of which require costly external resources such as powerful teacher models or reliable test units. However, in real-world scenarios, it is much harder to obtain reference solutions and test oracles than problem descriptions and test inputs. In this paper, we tackle a challenging yet realistic question: Can a code language model improve itself without access to a superior teacher and a test oracle? To answer this, we propose ConSelf, a self-improving approach built upon two key ideas. First, we introduce code semantic entropy, a novel metric that measures problem-level uncertainty by assessing the functional diversity of program behaviors, enabling a curriculum construction with the most learnable problems. Second, we present consensus-driven direct preference optimization (Con-DPO), a preference-based fine-tuning method that weights each preference pair by its behavioral consensus, thereby mitigating the impact of noisy self-generated supervision. Experiments on various benchmarks and backbone LLMs demonstrate that ConSelf significantly outperforms baselines, validating the effectiveness of semantic entropy-based curriculum construction and consensus-driven optimization in improving code generation without external supervision.
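
One way to read "code semantic entropy" is as the Shannon entropy over clusters of sampled programs that behave identically on the available test inputs. The sketch below follows that reading. It is an assumption about the metric, not the paper's definition: the clustering by exact output match, the natural-log entropy, and the `consensus` helper (the share of samples that agree with a given program, a possible weight for a preference pair) are all illustrative choices.

```python
import math
from collections import Counter
from typing import Any, Callable, Sequence, Tuple

Program = Callable[[Any], Any]


def behavior_signature(program: Program, test_inputs: Sequence[Any]) -> Tuple[str, ...]:
    """Outputs (or error type) of one program on every test input."""
    outputs = []
    for x in test_inputs:
        try:
            outputs.append(repr(program(x)))
        except Exception as exc:  # a crash is also a behavior
            outputs.append(f"error:{type(exc).__name__}")
    return tuple(outputs)


def behavior_clusters(programs: Sequence[Program], test_inputs: Sequence[Any]) -> Counter:
    """Count how many sampled programs share each behavior signature."""
    return Counter(behavior_signature(p, test_inputs) for p in programs)


def code_semantic_entropy(programs: Sequence[Program], test_inputs: Sequence[Any]) -> float:
    """Shannon entropy (nats) of the distribution over behavior clusters."""
    counts = behavior_clusters(programs, test_inputs)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def consensus(program: Program, programs: Sequence[Program], test_inputs: Sequence[Any]) -> float:
    """Fraction of samples whose behavior matches `program`."""
    counts = behavior_clusters(programs, test_inputs)
    return counts[behavior_signature(program, test_inputs)] / sum(counts.values())


# Four sampled "solutions" to an absolute-value task: two correct, two not.
samples = [abs, lambda x: x if x >= 0 else -x, lambda x: x, lambda x: x]
inputs = [-2, 0, 3]

entropy = code_semantic_entropy(samples, inputs)
print(round(entropy, 4))                # 0.6931 (two equally sized clusters)
print(consensus(abs, samples, inputs))  # 0.5
```

Note that no expected outputs are needed: the programs are compared only with each other, which matches the setting of test inputs without a test oracle.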

Keywords: code generation, large language models, self-improving, semantic entropy, behavioral consensus, direct preference optimization, supervised fine-tuning, curriculum learning

7. ✅ Aligning Multimodal Sequential Recommendations via Robust Direct Preference Optimization with Sparse MoE

Authors: Hejin Huang, Jusheng Zhang, Kaitong Cai, Jian Wang, Rong Pan | Source: arxiv | Published: 2026-03-31 | arXiv link: http://arxiv.org/abs/2603.29259v1

Score: 36.0 / 26.6 ✅

Scoring details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	10.0/10	10.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	8.0/10	8.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

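This paper studies Direct Preference Optimization under implicit feedback for multimodal sequential recommendation. It improves the negative-selection strategy by sampling stochastically from a dynamic top-K candidate pool and pairs this with a sparse MoE encoder for efficient capacity scaling, achieving NDCG@5 gains of up to 5.25% on three Amazon benchmarks.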

Abstract

Preference-based alignment objectives have been widely adopted, from RLHF-style pairwise learning in large language models to emerging applications in recommender systems. Yet, existing work rarely examines how Direct Preference Optimization (DPO) behaves under implicit feedback, where unobserved items are not reliable negatives. We conduct systematic experiments on multimodal sequential recommendation to compare common negative-selection strategies and their interaction with DPO training. Our central finding is that a simple modification, replacing deterministic hard negatives with stochastic sampling from a dynamic top-K candidate pool, consistently improves ranking performance. We attribute its effectiveness to two factors: (1) reducing erroneous suppressive gradients caused by false negatives, and (2) retaining informative hard signals while smoothing optimization via controlled stochasticity. With an optional sparse Mixture-of-Experts encoder for efficient capacity scaling, RoDPO achieves up to 5.25% NDCG@5 on three Amazon benchmarks, with nearly unchanged inference cost.
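
The central modification, replacing the single hardest negative with a random draw from a top-K pool, can be sketched as below. The function names, the score dictionary, and the exclusion of already-seen items are assumptions made for illustration. In training, the scores would come from the current model, which is what makes the pool dynamic.

```python
import random
from typing import Dict, Hashable, Iterable, List


def top_k_pool(
    scores: Dict[Hashable, float],
    positive: Hashable,
    k: int,
    seen: Iterable[Hashable] = (),
) -> List[Hashable]:
    """The k highest-scored items, excluding the positive and seen items."""
    excluded = {positive, *seen}
    candidates = [item for item in scores if item not in excluded]
    return sorted(candidates, key=lambda item: scores[item], reverse=True)[:k]


def hard_negative(scores, positive, seen=()):
    """Deterministic baseline: always pick the highest-scored candidate."""
    return top_k_pool(scores, positive, 1, seen)[0]


def sampled_negative(scores, positive, k, rng, seen=()):
    """Stochastic variant: draw uniformly from the top-k pool."""
    return rng.choice(top_k_pool(scores, positive, k, seen))


# Hypothetical model scores for five items; "a" is the observed positive.
scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.2, "e": 0.1}
rng = random.Random(0)

print(hard_negative(scores, positive="a"))                   # b
print(sampled_negative(scores, positive="a", k=3, rng=rng))  # one of b, c, d
```

If the highest-scored unobserved item is actually something the user would like (a false negative), the deterministic rule penalizes it on every step, whereas the sampled rule spreads the penalty across the pool.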

Keywords: Direct Preference Optimization, DPO, Mixture-of-Experts, MoE, multimodal sequential recommendation, negative-selection strategies, implicit feedback, ranking performance

8. ✅ DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA

Authors: Yi Chen, Yuying Ge, Hui Zhou, Mingyu Ding, Yixiao Ge, Xihui Liu | Source: arxiv | Published: 2026-03-31 | arXiv link: http://arxiv.org/abs/2603.29844v1

Score: 35.0 / 26.6 ✅

Scoring details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	10.0/10	10.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

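Existing end-to-end Vision-Language-Action (VLA) models use the Vision-Language Model (VLM) mainly as a multimodal encoder, which underutilizes its high-level decision-making potential and makes training unstable. This paper proposes DIAL, a framework with a differentiable latent intent bottleneck and two-stage training that achieves state-of-the-art performance on robot manipulation tasks and demonstrates zero-shot generalization.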

Abstract

The development of Vision-Language-Action (VLA) models has been significantly accelerated by pre-trained Vision-Language Models (VLMs). However, most existing end-to-end VLAs treat the VLM primarily as a multimodal encoder, directly mapping vision-language features to low-level actions. This paradigm underutilizes the VLM’s potential in high-level decision making and introduces training instability, frequently degrading its rich semantic representations. To address these limitations, we introduce DIAL, a framework bridging high-level decision making and low-level motor execution through a differentiable latent intent bottleneck. Specifically, a VLM-based System-2 performs latent world modeling by synthesizing latent visual foresight within the VLM’s native feature space; this foresight explicitly encodes intent and serves as the structural bottleneck. A lightweight System-1 policy then decodes this predicted intent together with the current observation into precise robot actions via latent inverse dynamics. To ensure optimization stability, we employ a two-stage training paradigm: a decoupled warmup phase where System-2 learns to predict latent futures while System-1 learns motor control under ground-truth future guidance within a unified feature space, followed by seamless end-to-end joint optimization. This enables action-aware gradients to refine the VLM backbone in a controlled manner, preserving pre-trained knowledge. Extensive experiments on the RoboCasa GR1 Tabletop benchmark show that DIAL establishes a new state-of-the-art, achieving superior performance with 10x fewer demonstrations than prior methods. Furthermore, by leveraging heterogeneous human demonstrations, DIAL learns physically grounded manipulation priors and exhibits robust zero-shot generalization to unseen objects and novel configurations during real-world deployment on a humanoid robot.

Keywords: Vision-Language-Action (VLA), Vision-Language Models (VLMs), latent world modeling, System-2, System-1, end-to-end training, robot manipulation, zero-shot generalization

9. ✅ Distilling Human-Aligned Privacy Sensitivity Assessment from Large Language Models

Authors: Gabriel Loiseau, Damien Sileo, Damien Riquet, Maxime Meyer, Marc Tommasi | Source: arxiv | Published: 2026-03-31 | arXiv link: http://arxiv.org/abs/2603.29497v1

Score: 34.0 / 26.6 ✅

Scoring details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	8.0/10	8.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	8.0/10	8.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	8.0/10	8.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

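This paper addresses the high computational cost of using large language models as privacy evaluators. It distills the privacy assessment capability of Mistral Large 3 into lightweight encoder models, sharply reducing computational requirements while keeping strong agreement with human judgments.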

Abstract

Accurate privacy evaluation of textual data remains a critical challenge in privacy-preserving natural language processing. Recent work has shown that large language models (LLMs) can serve as reliable privacy evaluators, achieving strong agreement with human judgments; however, their computational cost and impracticality for processing sensitive data at scale limit real-world deployment. We address this gap by distilling the privacy assessment capabilities of Mistral Large 3 (675B) into lightweight encoder models with as few as 150M parameters. Leveraging a large-scale dataset of privacy-annotated texts spanning 10 diverse domains, we train efficient classifiers that preserve strong agreement with human annotations while dramatically reducing computational requirements. We validate our approach on human-annotated test data and demonstrate its practical utility as an evaluation metric for de-identification systems.

10. ✅ An Empirical Study of Multi-Agent Collaboration for Automated Research

Authors: Yang Shen, Zhenyi Yi, Ziyi Zhao, Lijun Sun, Dongyang Li, Chin-Teng Lin, Yuhui Shi | Source: arxiv | Published: 2026-03-31 | arXiv link: http://arxiv.org/abs/2603.29632v1

Score: 30.0 / 26.6 ✅

Scoring details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

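This empirical study compares multi-agent collaboration frameworks (a subagent architecture and an agent team architecture) on automated machine learning optimization tasks. The subagent mode is highly robust and high-throughput under strict time constraints, whereas the agent team mode achieves deeper theoretical alignment given a sufficient compute budget but is operationally more fragile.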

Abstract

As AI agents evolve, the community is rapidly shifting from single Large Language Models (LLMs) to Multi-Agent Systems (MAS) to overcome cognitive bottlenecks in automated research. However, the optimal multi-agent coordination framework for these autonomous agents remains largely unexplored. In this paper, we present a systematic empirical study investigating the comparative efficacy of distinct multi-agent structures for automated machine learning optimization. Utilizing a rigorously controlled, execution-based testbed equipped with Git worktree isolation and explicit global memory, we benchmark a single-agent baseline against two multi-agent paradigms: a subagent architecture (parallel exploration with post-hoc consolidation) and an agent team architecture (experts with pre-execution handoffs). By evaluating these systems under strictly fixed computational time budgets, our findings reveal a fundamental trade-off between operational stability and theoretical deliberation. The subagent mode functions as a highly resilient, high-throughput search engine optimal for broad, shallow optimizations under strict time constraints. Conversely, the agent team topology exhibits higher operational fragility due to multi-author code generation but achieves the deep theoretical alignment necessary for complex architectural refactoring given extended compute budgets. These empirical insights provide actionable guidelines for designing future autoresearch systems, advocating for dynamically routed architectures that adapt their collaborative structures to real-time task complexity.

Keywords: Multi-Agent Systems, Large Language Models, Automated Research, Agent Coordination, Empirical Study, Machine Learning Optimization, Subagent Architecture, Agent Team Architecture

11. ✅ Theory of Mind and Self-Attributions of Mentality are Dissociable in LLMs

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

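This study examines whether safety fine-tuning of large language models (LLMs), which suppresses self-attributions of mind (such as claiming consciousness or emotions), impairs their Theory of Mind (ToM) capabilities. It finds that the two are behaviorally and mechanistically dissociable, but that safety fine-tuned models attribute less mind to non-human animals and are less likely to exhibit spiritual belief.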

Abstract

Safety fine-tuning in Large Language Models (LLMs) seeks to suppress potentially harmful forms of mind-attribution such as models asserting their own consciousness or claiming to experience emotions. We investigate whether suppressing mind-attribution tendencies degrades intimately related socio-cognitive abilities such as Theory of Mind (ToM). Through safety ablation and mechanistic analyses of representational similarity, we demonstrate that LLM attributions of mind to themselves and to technological artefacts are behaviorally and mechanistically dissociable from ToM capabilities. Nevertheless, safety fine-tuned models under-attribute mind to non-human animals relative to human baselines and are less likely to exhibit spiritual belief, suppressing widely shared perspectives regarding the distribution and nature of non-human minds.

Keywords: Large Language Models, Safety Fine-tuning, Theory of Mind, Mind-attribution, Mechanistic Analysis, Representational Similarity, Consciousness, Alignment

12. ✅ Concept frustration: Aligning human concepts and machine representations

Score: 28.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	8.0/10	8.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

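This paper proposes a geometric framework for detecting inconsistencies between human-interpretable concepts and the internal representations of foundation models (concept frustration). It detects frustration with task-aligned similarity measures and shows that incorporating the frustrating concept into an interpretable model better aligns human and machine conceptual reasoning.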

Abstract

Aligning human-interpretable concepts with the internal representations learned by modern machine learning systems remains a central challenge for interpretable AI. We introduce a geometric framework for comparing supervised human concepts with unsupervised intermediate representations extracted from foundation model embeddings. Motivated by the role of conceptual leaps in scientific discovery, we formalise the notion of concept frustration: a contradiction that arises when an unobserved concept induces relationships between known concepts that cannot be made consistent within an existing ontology. We develop task-aligned similarity measures that detect concept frustration between supervised concept-based models and unsupervised representations derived from foundation models, and show that the phenomenon is detectable in task-aligned geometry while conventional Euclidean comparisons fail. Under a linear-Gaussian generative model we derive a closed-form expression for Bayes-optimal concept-based classifier accuracy, decomposing predictive signal into known-known, known-unknown and unknown-unknown contributions and identifying analytically where frustration affects performance. Experiments on synthetic data and real language and vision tasks demonstrate that frustration can be detected in foundation model representations and that incorporating a frustrating concept into an interpretable model reorganises the geometry of learned concept representations, to better align human and machine reasoning. These results suggest a principled framework for diagnosing incomplete concept ontologies and aligning human and machine conceptual reasoning, with implications for the development and validation of safe interpretable AI for high-risk applications.

Keywords: concept frustration, foundation models, interpretable AI, human-machine alignment, geometric framework, concept-based models, unsupervised representations, Bayes-optimal classifier

13. ❌ AGFT: Alignment-Guided Fine-Tuning for Zero-Shot Adversarial Robustness of Vision-Language Models

Authors: Yubo Cui, Xianchao Guan, Zijun Xiong, Zheng Zhang Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29410v1

Score: 25.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

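Scoring rationale: The paper studies adversarially robust fine-tuning for vision-language models (VLMs); its core contribution is the AGFT framework, which preserves cross-modal alignment. Relevance to the keywords: (1) “Post-training” OR “Supervised Fine-tuning” OR “SFT”: highly relevant (10 points), since the paper's core is a fine-tuning method; (2) “Instruction Tuning” OR “Alignment” OR “Value Alignment”: highly relevant (10 points), since the paper focuses on preserving alignment; (3) “Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”: somewhat relevant (5 points), since the paper builds on pre-trained models. The remaining keywords target pure language models or specific techniques and are unrelated to the adversarial robustness of VLMs (0 points).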

!!! tip deepseek-chat TL;DR

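This paper addresses the limited zero-shot adversarial robustness of pre-trained vision-language models. It proposes AGFT, an alignment-guided fine-tuning framework that uses soft alignment distributions and a distribution consistency calibration mechanism to improve adversarial robustness while preserving cross-modal semantic structure. Experiments show that it outperforms existing methods.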

Abstract

Pre-trained vision-language models (VLMs) exhibit strong zero-shot generalization but remain vulnerable to adversarial perturbations. Existing classification-guided adversarial fine-tuning methods often disrupt pre-trained cross-modal alignment, weakening visual-textual correspondence and degrading zero-shot performance. In this paper, we propose an Alignment-Guided Fine-Tuning (AGFT) framework that enhances zero-shot adversarial robustness while preserving the cross-modal semantic structure. Unlike label-based methods that rely on hard labels and fail to maintain the relative relationships between image and text, AGFT leverages the probabilistic predictions of the original model for text-guided adversarial training, which aligns adversarial visual features with textual embeddings via soft alignment distributions, improving zero-shot adversarial robustness. To address structural discrepancies introduced by fine-tuning, we introduce a distribution consistency calibration mechanism that adjusts the robust model output to match a temperature-scaled version of the pre-trained model predictions. Extensive experiments across multiple zero-shot benchmarks demonstrate that AGFT outperforms state-of-the-art methods while significantly improving zero-shot adversarial robustness.

Keywords: Vision-Language Models, Adversarial Robustness, Fine-Tuning, Cross-modal Alignment, Zero-shot Generalization, AGFT, Distribution Consistency Calibration, Adversarial Training
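
The AGFT abstract above describes a calibration step that adjusts the robust model's output to match a temperature-scaled version of the pre-trained model's predictions. The following is a minimal sketch of that idea, not the authors' implementation: the use of KL divergence, its direction, the temperature value of 2.0, and every function name here are assumptions made for illustration, and the logits are toy numbers.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def calibration_loss(robust_logits, pretrained_logits, temperature=2.0):
    """Hypothetical loss: pull the robust model's distribution toward the
    temperature-scaled distribution of the pre-trained model."""
    target = softmax(pretrained_logits, temperature)  # softened reference
    predicted = softmax(robust_logits)
    return kl_divergence(target, predicted)


# Toy image-to-text similarity logits over three candidate captions.
pretrained = [4.0, 2.0, 0.5]
robust_far = [0.5, 2.0, 4.0]    # ranking reversed relative to the reference
robust_near = [2.0, 1.0, 0.25]  # equals pretrained / 2.0, so it matches exactly

print(calibration_loss(robust_far, pretrained))
print(calibration_loss(robust_near, pretrained))
```

In this toy setup the loss is zero when the robust logits reproduce the softened reference distribution and grows as the relative ordering of captions drifts, which is the kind of structural discrepancy the abstract says the calibration is meant to limit.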

14. ❌ Beyond Idealized Patients: Evaluating LLMs under Challenging Patient Behaviors in Medical Consultations

Authors: Yahan Li, Xinyi Jie, Wanjia Ruan, Xubei Zhang, Huaijie Zhu, Yicheng Gao, Chaohao Du, Ruishan Liu Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29373v1

Score: 25.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

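Scoring rationale: The paper's core is evaluating large language models (LLMs) in medical consultation, so it is highly relevant to keyword 1, “Large Language Models” (10 points). It applies AI to science (medicine), so it is highly relevant to keyword 27, “AI for Science” (10 points). It focuses on safe model responses when patients provide contradictory or inaccurate information, which touches on factuality and hallucination mitigation, so it is somewhat relevant to keyword 22, “Hallucination Mitigation” (5 points). It does not address the specific techniques (such as MoE, SFT, or RAG) or theory (such as Scaling Laws) behind the other keywords, which therefore score 0.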

!!! tip deepseek-chat TL;DR

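This paper studies how safely large language models respond in medical consultations when patients exhibit challenging behaviors: information contradiction, factual inaccuracy, self-diagnosis, and care resistance. It builds a bilingual benchmark, CPB-Bench, to evaluate a range of LLMs, and finds consistent failure patterns when models handle contradictory or medically implausible information. The intervention strategies tested yield inconsistent improvements.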

Abstract

Large language models (LLMs) are increasingly used for medical consultation and health information support. In this high-stakes setting, safety depends not only on medical knowledge, but also on how models respond when patient inputs are unclear, inconsistent, or misleading. However, most existing medical LLM evaluations assume idealized and well-posed patient questions, which limits their realism. In this paper, we study challenging patient behaviors that commonly arise in real medical consultations and complicate safe clinical reasoning. We define four clinically grounded categories of such behaviors: information contradiction, factual inaccuracy, self-diagnosis, and care resistance. For each behavior, we specify concrete failure criteria that capture unsafe responses. Building on four existing medical dialogue datasets, we introduce CPB-Bench (Challenging Patient Behaviors Benchmark), a bilingual (English and Chinese) benchmark of 692 multi-turn dialogues annotated with these behaviors. We evaluate a range of open- and closed-source LLMs on their responses to challenging patient utterances. While models perform well overall, we identify consistent, behavior-specific failure patterns, with particular difficulty in handling contradictory or medically implausible patient information. We also study four intervention strategies and find that they yield inconsistent improvements and can introduce unnecessary corrections. We release the dataset and code.

Keywords: Large Language Models, Medical Consultation, Challenging Patient Behaviors, Safety Evaluation, Benchmark, Failure Patterns, Intervention Strategies, Bilingual Dataset

15. ❌ A Comprehensive Information-Decomposition Analysis of Large Vision-Language Models

Authors: Lixin Xiu, Xufang Luo, Hideki Nakayama Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29676v1

Score: 23.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

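Scoring rationale: The paper studies the internal decision-making of large vision-language models (LVLMs), which falls within large-model research, so it is relevant to "Large Language Models" (8 points). Its core is analyzing model decision processes, which belongs to explainable AI and mechanistic interpretability, so it is highly relevant to "Mechanistic Interpretability" (10 points). It mentions "visual instruction tuning", so it is somewhat relevant to "Instruction Tuning" (5 points). Other keywords such as MoE, SLMs, Scaling Laws, Pre-training, RLHF, RAG, CoT, Agents, and Quantization are either not mentioned in the abstract or unrelated to the paper's topic, so they score 0.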

!!! tip deepseek-chat TL;DR

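Using a partial information decomposition (PID) framework, this paper quantitatively analyzes the internal decision-making of large vision-language models (LVLMs). It reveals two task regimes (synergy-driven vs. knowledge-driven) and two family-level strategies (fusion-centric vs. language-centric), and identifies visual instruction tuning as the key stage where fusion is learned.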

Abstract

Large vision-language models (LVLMs) achieve impressive performance, yet their internal decision-making processes remain opaque, making it difficult to determine if the success stems from true multimodal fusion or from reliance on unimodal priors. To address this attribution gap, we introduce a novel framework using partial information decomposition (PID) to quantitatively measure the “information spectrum” of LVLMs – decomposing a model’s decision-relevant information into redundant, unique, and synergistic components. By adapting a scalable estimator to modern LVLM outputs, our model-agnostic pipeline profiles 26 LVLMs on four datasets across three dimensions – breadth (cross-model & cross-task), depth (layer-wise information dynamics), and time (learning dynamics across training). Our analysis reveals two key results: (i) two task regimes (synergy-driven vs. knowledge-driven) and (ii) two stable, contrasting family-level strategies (fusion-centric vs. language-centric). We also uncover a consistent three-phase pattern in layer-wise processing and identify visual instruction tuning as the key stage where fusion is learned. Together, these contributions provide a quantitative lens beyond accuracy-only evaluation and offer insights for analyzing and designing the next generation of LVLMs. Code and data are available at https://github.com/RiiShin/pid-lvlm-analysis .

Keywords: Large Vision-Language Models, Partial Information Decomposition, Information Spectrum, Multimodal Fusion, Model Interpretability, Visual Instruction Tuning, Decision-making Analysis, Quantitative Evaluation

16. ❌ Architecting Secure AI Agents: Perspectives on System-Level Defenses Against Indirect Prompt Injection Attacks

Authors: Chong Xiang, Drew Zagieboylo, Shaona Ghosh, Sanjay Kariyappa, Kai Greshake, Hanshen Xiao, Chaowei Xiao, G. Edward Suh Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.30016v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

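Scoring rationale: The paper focuses on security defenses for AI agents (powered by LLMs), specifically system-level protection against indirect prompt injection attacks. It is therefore highly relevant to "Large Language Models" and "LLM Agents" (10 points each), since LLMs are the agents' core technology and agents are the object of study. The paper does not address the other keywords (MoE, SLMs, training techniques, reasoning methods, compression, scientific applications, and so on), which all score 0.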

!!! tip deepseek-chat TL;DR

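This position paper examines indirect prompt injection attacks against AI agents powered by large language models and lays out a vision for system-level defenses. It emphasizes dynamic replanning, constrained model decision-making, and personalization with human interaction as core defense strategies.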

Abstract

AI agents, predominantly powered by large language models (LLMs), are vulnerable to indirect prompt injection, in which malicious instructions embedded in untrusted data can trigger dangerous agent actions. This position paper discusses our vision for system-level defenses against indirect prompt injection attacks. We articulate three positions: (1) dynamic replanning and security policy updates are often necessary for dynamic tasks and realistic environments; (2) certain context-dependent security decisions would still require LLMs (or other learned models), but should only be made within system designs that strictly constrain what the model can observe and decide; (3) in inherently ambiguous cases, personalization and human interaction should be treated as core design considerations. In addition to our main positions, we discuss limitations of existing benchmarks that can create a false sense of utility and security. We also highlight the value of system-level defenses, which serve as the skeleton of agentic systems by structuring and controlling agent behaviors, integrating rule-based and model-based security checks, and enabling more targeted research on model robustness and human interaction.

Keywords: AI agents, large language models, indirect prompt injection, system-level defenses, security policy, dynamic replanning, human interaction, agentic systems

17. ❌ NeuroBRIDGE: Behavior-Conditioned Koopman Dynamics with Riemannian Alignment for Early Substance Use Initiation Prediction from Longitudinal Functional Connectome

Authors: Badhan Mazumder, Sir-Lord Wiafe, Vince D. Calhoun, Dong Hye Ye Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29960v1

Score: 15.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

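Scoring rationale: NeuroBRIDGE uses a graph neural network (GNN) and Riemannian-geometric methods to analyze longitudinal functional connectome data and predict substance use initiation (SUI) in adolescents. Its core is an application in neuroscience and bioinformatics, not an innovation in large language models (LLMs) or deep learning fundamentals. Of all the keywords, only “AI for Science” or “Bioinformatics” is highly relevant (10 points), because the paper applies AI to biomedicine (neuroscience); “Mechanistic Interpretability” or “Explainable AI” is somewhat relevant (5 points), because the paper mentions “interpretable insights into neural pathways”. The remaining keywords concern LLM techniques, training methods, inference optimization, agent systems, and similar topics that the paper does not touch, so they score 0. The weighted total is (10 × 1.0 + 5 × 1.0) = 15.0.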
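The weighted total above can be reproduced in a few lines of Python. This is a minimal sketch, assuming each keyword contributes weight × relevance and that the pass mark follows the formula in the report's configuration section (5.0 + 0.8 × total weight). The function names and the placeholder keyword names are hypothetical, and treating "score ≥ pass mark" as passing is also an assumption.

```python
def weighted_total(relevance, weights):
    """Sum weight * relevance over all keywords (assumed scoring rule)."""
    return sum(weights[name] * value for name, value in relevance.items())


def pass_mark(weights, base=5.0, factor=0.8):
    """Pass mark from the report configuration: base + factor * total weight."""
    return base + factor * sum(weights.values())


# 27 keywords with weight 1.0 each, as listed in the configuration section.
weights = {f"keyword_{i}": 1.0 for i in range(27)}

# NeuroBRIDGE: two non-zero relevance values, all others 0.
relevance = {name: 0.0 for name in weights}
relevance["keyword_22"] = 5.0   # placeholder for the Explainable AI keyword
relevance["keyword_26"] = 10.0  # placeholder for the AI for Science keyword

total = weighted_total(relevance, weights)
threshold = pass_mark(weights)
passed = total >= threshold  # assumed pass criterion

print(total, round(threshold, 1), passed)
```

With these inputs the total falls well short of the pass mark, which matches the ❌ shown for this paper.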

!!! tip deepseek-chat TL;DR

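The study proposes NeuroBRIDGE, a new graph neural network framework that analyzes longitudinal functional connectome data through Riemannian alignment and behavior-conditioned Koopman dynamics to improve early prediction of adolescent substance use initiation, and validates it on the ABCD dataset.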

Abstract (Chinese translation)

早期识别存在物质使用起始（SUI）风险的青少年至关重要，但亦十分困难，因为多数预测模型将脑连接视为静态或横断面特征，未能捕捉脑网络随时间及行为变化的动态过程。我们提出了NeuroBRIDGE（基于行为条件的黎曼流形上纵向连接组的库普曼动力学框架），这是一种新型图神经网络框架。该框架将纵向功能连接组对齐于黎曼切空间，并通过双时间注意力机制与行为条件约束的库普曼动力学相结合，以捕捉神经连接的时序变化。在ABCD数据集上的评估表明，NeuroBRIDGE对未来SUI的预测性能优于相关基线模型，同时提供了对神经通路的可解释性分析，深化了我们对神经发育风险的理解，并为针对性预防策略提供了依据。

Abstract

Early identification of adolescents at risk for substance use initiation (SUI) is vital yet difficult, as most predictors treat connectivity as static or cross-sectional and miss how brain networks change over time and with behavior. We proposed NeuroBRIDGE (Behavior conditioned RIemannian Koopman Dynamics on lonGitudinal connEctomes), a novel graph neural network-based framework that aligns longitudinal functional connectome in a Riemannian tangent space and couples dual-time attention with behavioral-conditioned Koopman dynamics to capture temporal change. Evaluated on ABCD, NeuroBRIDGE improved future SUI prediction over relevant baselines while offering interpretable insights into neural pathways, refining our understanding of neurodevelopmental risk and informing targeted prevention.

Keywords: substance use initiation prediction, longitudinal functional connectome, graph neural network, Riemannian alignment, Koopman dynamics, behavior-conditioned modeling, neurodevelopmental risk, ABCD dataset

18. ❌ Reward-Based Online LLM Routing via NeuralUCB

Authors: Ming-Hua Tsai, Phat Tran Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.30035v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

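Scoring rationale: The paper studies LLM routing; its core is cost-aware online routing with the NeuralUCB algorithm, so it is highly relevant to “Large Language Models” (10 points). It does not cover the specific techniques behind the other keywords, such as MoE, SLMs, training methods, inference optimization, alignment, or agent systems, so those keywords score 0.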

!!! tip deepseek-chat TL;DR

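The paper studies a NeuralUCB-based method for cost-aware online LLM routing; experiments show that it substantially reduces inference cost while maintaining competitive reward.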

Abstract (Chinese translation)

本研究探讨了基于NeuralUCB（神经上置信界算法）的成本感知大语言模型（Large Language Model, LLM）路由方法。现有的路由策略大致可分为监督路由方法和部分反馈方法，二者在效率与适应性方面各有不同的权衡。我们实现了一种基于NeuralUCB的路由策略，并在RouterBench的模拟在线环境下进行了评估。实验结果表明，所提出的方法在效用奖励方面持续优于随机路由和最小成本基线方法。与追求最大质量的路由参考相比，我们的方法在保持具有竞争力的奖励的同时，显著降低了推理成本。这些发现表明，NeuralUCB是一种具有前景的成本感知LLM路由方法，同时也揭示了在动作区分和探索方面仍存在的挑战。

Abstract

This study investigates the use of NeuralUCB for cost-aware large language model (LLM) routing. Existing routing approaches can be broadly grouped into supervised routing methods and partial-feedback methods, each with different tradeoffs in efficiency and adaptivity. We implement a NeuralUCB-based routing policy and evaluate it on RouterBench under a simulated online setting. Experimental results show that the proposed method consistently outperforms random and min-cost baselines in utility reward. Compared with the max-quality reference, our method achieves substantially lower inference cost while maintaining competitive reward. These findings suggest that NeuralUCB is a promising approach for cost-aware LLM routing, while also highlighting remaining challenges in action discrimination and exploration.
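The online routing loop described in the abstract can be illustrated with a much simpler bandit. The sketch below uses context-free UCB1, not NeuralUCB (which scores each candidate with a neural network over the query context). It also assumes a utility reward of the form quality − λ · cost. The model names, quality and cost figures, and λ are invented for illustration and do not come from RouterBench or the paper.

```python
import math
import random

# Hypothetical candidate models: (probability of a good answer, cost per query).
MODELS = {
    "small": (0.60, 0.01),
    "medium": (0.75, 0.05),
    "large": (0.85, 0.30),
}
COST_WEIGHT = 1.0  # assumed trade-off (lambda) between quality and cost


def utility(model, rng):
    """Simulated partial feedback: observed quality minus weighted cost."""
    quality, cost = MODELS[model]
    observed = 1.0 if rng.random() < quality else 0.0
    return observed - COST_WEIGHT * cost


def route_ucb1(num_queries, seed=0):
    """Route queries one at a time with UCB1; return how often each model was used."""
    rng = random.Random(seed)
    counts = {name: 0 for name in MODELS}
    totals = {name: 0.0 for name in MODELS}
    for step in range(1, num_queries + 1):
        untried = [name for name in MODELS if counts[name] == 0]
        if untried:
            choice = untried[0]  # try every model once before using the bound
        else:
            # Mean reward plus an exploration bonus that shrinks with more pulls.
            choice = max(
                MODELS,
                key=lambda name: totals[name] / counts[name]
                + math.sqrt(2.0 * math.log(step) / counts[name]),
            )
        reward = utility(choice, rng)  # only the chosen model's reward is observed
        counts[choice] += 1
        totals[choice] += reward
    return counts


if __name__ == "__main__":
    print(route_ucb1(2000))
```

The partial-feedback structure is the point here: the router only ever sees the reward of the model it picked, which is why an exploration term is needed at all.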

Keywords: LLM routing, NeuralUCB, cost-aware, online setting, inference cost, utility reward, RouterBench

19. ❌ Designing FSMs Specifications from Requirements with GPT 4.0

Authors: Omer Nguena Timo, Paul-Alexis Rodriguez, Florent Avellaneda Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29140v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

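Scoring rationale: The paper's core contribution is an LLM-based framework that uses GPT-4.0 to design finite state machines (FSMs) automatically from requirements documents, together with an expert-centric approach that repairs LLM-generated FSMs through FSM mutation and test generation. It is therefore highly relevant only to “Large Language Models” OR “LLMs” OR “Foundation Models” (10 points), since the paper explicitly uses an LLM (GPT-4.0) as its core technology. The other keywords concern specific techniques (e.g., MoE, quantization), training methods (e.g., pre-training, RLHF), application areas (e.g., bioinformatics), or advanced capabilities (e.g., reasoning, agents); the paper does not directly address any of these, so they all score 0.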

!!! tip deepseek-chat TL;DR

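The paper proposes an LLM-based framework for automatically designing finite state machines (FSMs) from natural-language requirements, repairs the LLM-generated FSMs through FSM mutation and test generation, and experimentally evaluates how well LLMs perform these tasks.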

Abstract (Chinese translation)

有限状态机（Finite State Machine, FSM）是反应式系统的可执行形式化规约。这类机器基于系统需求设计，而需求通常记录在以自然语言撰写的文本文件中。FSM在模型驱动系统工程（Model-Driven Engineering, MDE）的各个阶段发挥着关键作用，例如用于自动化测试活动。FSM的质量至关重要：其质量越低，测试阶段遗留的缺陷就越多，系统在生产环境中发生故障的风险也越高，甚至可能导致灾难性后果。因此，本文利用大语言模型（LLM）领域的最新进展，提出了一种基于LLM的框架，用于从需求自动生成FSM。该框架还提出了一种以专家为中心的方法，基于FSM变异与测试生成来修复由LLM生成的FSM。本文还通过实验分析与评估了LLM在执行框架中各项任务以及通过多种方法修复FSM的能力。论文展示了基于模拟数据的实验结果，这些结果与方法为LLM提供了新的分析与视角，有助于进一步推动机器学习技术及其在MDE中的应用发展。

Abstract

Finite state machines (FSM) are executable formal specifications of reactive systems. These machines are designed based on systems’ requirements. The requirements are often recorded in textual documents written in natural languages. FSMs play a crucial role in different phases of the model-driven system engineering (MDE). For example, they serve to automate testing activities. FSM quality is critical: the lower the quality of FSM, the higher the number of faults surviving the testing phase and the higher the risk of failure of the systems in production, which could lead to catastrophic scenarios. Therefore, this paper leverages recent advances in the domain of LLM to propose an LLM-based framework for designing FSMs from requirements. The framework also suggests an expert-centric approach based on FSM mutation and test generation for repairing the FSMs produced by LLMs. This paper also provides an experimental analysis and evaluation of LLM’s capacities in performing the tasks presented in the framework and FSM repair via various methods. The paper presents experimental results with simulated data. These results and methods bring a new analysis and vision of LLMs that are useful for further development of machine learning technology and its applications to MDE.
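To make the idea of FSM mutation and test generation concrete, here is a minimal sketch. It is not the paper's framework. It assumes a deterministic Mealy-style FSM stored as a dictionary, one hypothetical mutation operator (redirecting a single transition), and a brute-force search for a short input sequence whose outputs distinguish the original from the mutant. The turnstile example and all function names are illustrative.

```python
from itertools import product

# (state, input) -> (next_state, output); a hypothetical turnstile FSM.
TURNSTILE = {
    ("locked", "coin"): ("unlocked", "unlock"),
    ("locked", "push"): ("locked", "alarm"),
    ("unlocked", "coin"): ("unlocked", "refund"),
    ("unlocked", "push"): ("locked", "lock"),
}


def run(fsm, start, inputs):
    """Execute the FSM on an input sequence and return the output sequence."""
    state, outputs = start, []
    for symbol in inputs:
        state, out = fsm[(state, symbol)]
        outputs.append(out)
    return outputs


def mutate_target(fsm, transition, new_target):
    """Return a mutant in which one transition leads to a different state."""
    mutant = dict(fsm)
    _, out = mutant[transition]
    mutant[transition] = (new_target, out)
    return mutant


def distinguishing_test(fsm, mutant, start, alphabet, max_len=4):
    """Find the shortest input sequence (up to max_len) with differing outputs."""
    for length in range(1, max_len + 1):
        for inputs in product(alphabet, repeat=length):
            if run(fsm, start, inputs) != run(mutant, start, inputs):
                return list(inputs)
    return None  # the mutant survives every test up to max_len


if __name__ == "__main__":
    faulty = mutate_target(TURNSTILE, ("unlocked", "push"), "unlocked")
    print(distinguishing_test(TURNSTILE, faulty, "locked", ["coin", "push"]))
```

In this example the mutant produces the same output on the faulty transition itself, so the fault only becomes visible one step later. This is why a test needs a follow-up input after reaching the mutated transition.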

Keywords: Finite State Machines, LLM-based framework, GPT-4.0, requirements engineering, FSM mutation, test generation, model-driven engineering, automated testing

20. ❌ How Symmetry Governs the Dihedral Angle Dependence of Intermolecular Spin-Orbit Coupling

Authors: Antonio J. Garzon-Ramirez, Connor K. Terry Weatherly, Kyle T. Kairys, Michael R. Wasielewski, Roel Tempelaar Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28961v1

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

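Scoring rationale: The paper studies the dihedral-angle dependence of intermolecular spin-orbit coupling, a topic in theoretical and physical chemistry that is unrelated to nearly all of the LLM and deep learning keywords. The only possibly relevant keyword is “AI for Science” OR “Bioinformatics” OR “Cheminformatics”, because the work involves theoretical calculations adjacent to cheminformatics. However, the paper uses no AI or LLM methods and relies on theoretical analysis and symmetry principles, so it receives 5 points (somewhat relevant). All other keywords have no direct connection to the paper and score 0.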

!!! tip deepseek-chat TL;DR

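Through a theoretical study, the paper finds that in certain molecular systems spin-orbit coupling is minimized rather than maximized at the orthogonal donor-acceptor geometry; oblique angles and molecular chirality are needed to activate the relevant pathways, which challenges the conventional view.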

Abstract (Chinese translation)

自旋轨道电荷转移系间窜越（SOCT-ISC）能够在无需重原子参与的情况下，于给体-受体（DA）二元体系中高效产生三重态激发态，适用于众多技术领域。传统观点普遍认为，当给体与受体部分之间的二面角呈正交时，该过程最为高效。本文通过理论研究对此观点提出挑战，揭示了一种在正交条件下自旋轨道耦合（SOCs）被最小化的情形。这一情形基于对相关单重态与三重态的结构强加对称性分析而得以合理解释。值得注意的是，在此情形下，有限的自旋轨道耦合要求倾斜的取向角度，而这又需要分子具有手性，表明手性可能是激活相关自旋轨道耦合路径的先决条件。

Abstract

Spin-orbit, charge-transfer intersystem crossing (SOCT-ISC) allows for the efficient production of triplet excited states in donor-acceptor (DA) dyads without the involvement of heavy atoms, for use in a myriad of technologies. This process is commonly believed to proceed optimally when the dihedral angle between donor and acceptor moieties is orthogonal. Here, we challenge this idea through a theoretical study unveiling a scenario where spin-orbit couplings (SOCs) are minimized under orthogonal conditions. This scenario is rationalized based on an analysis of the structure-imposed symmetry properties of the involved singlet and triplet states. Notably, in this scenario, finite SOCs demand oblique orientation angles, which in turn requires molecular chirality, suggesting chirality to be a prerequisite for activating the involved SOC pathways.

Keywords: spin-orbit coupling, dihedral angle, donor-acceptor dyads, molecular chirality, symmetry analysis, SOCT-ISC, triplet excited states, theoretical study

21. ❌ Perspective of Fermi’s golden rule and its generalizations in chemical physics

Authors: Seogjoo J. Jang, Goun Kim, Young Min Rhee Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28373v2

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

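Scoring rationale: “Perspective of Fermi’s golden rule and its generalizations in chemical physics” is a review-style article on Fermi’s golden rule (FGR) in chemical physics, covering its history, derivation, assumptions, applications, and recent advances. All of the keywords concern LLMs, deep learning fundamentals, or AI applications, none of which the article addresses. The only possibly relevant keyword is “AI for Science” OR “Bioinformatics” OR “Cheminformatics”, because the article belongs to chemical physics and relates to scientific computing. However, it does not mention AI or machine learning methods, so it receives only 5 points (somewhat relevant). The other keywords are unrelated to the article and score 0.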

!!! tip deepseek-chat TL;DR

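This perspective reviews the theoretical foundations, practical applications, and recent generalizations of Fermi’s golden rule (FGR) in chemical physics, aiming to clarify ambiguities and open issues in its application.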

Abstract (Chinese translation)

本文回顾了费米黄金定则（Fermi’s golden rule，简称FGR）的简史，概述其推导过程、基本假设及典型表达形式。文章综述了FGR在化学物理等领域的主要应用，展示了该规则的广泛适用性与成功实践。同时，对FGR实际应用中存在的模糊性与开放性问题进行了辨析，并探讨了近年来FGR的推广形式及其在实际计算中的应用方法进展。

Abstract

This perspective provides a succinct history of Fermi’s golden rule (FGR), overview of its derivation, assumptions, and representative forms. Major applications of FGR, mostly in the field of chemical physics, are reviewed. These illustrate the broad applicability and success of FGR. Ambiguities and open issues encountered in practical applications of FGR are clarified. Recent advances in generalizations of FGR and computational methods for practical applications are addressed.

Keywords: Fermi’s golden rule, chemical physics, generalizations, computational methods, applications, derivation, open issues

22. ❌ Automatic Identification of Parallelizable Loops Using Transformer-Based Source Code Representations

Authors: Izavan dos S. Correia, Henrique C. T. Santos, Tiago A. E. Ferreira Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.30040v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

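Scoring rationale: The paper uses a Transformer model (DistilBERT) to analyze source code and identify parallelizable loops, a code-analysis task in software engineering. Although it uses the Transformer architecture, it does not involve large language models (LLMs) or deep learning for science, nor does it explore innovations in LLM fundamentals. The keywords all relate directly to LLM techniques, training methods, inference optimization, alignment, agent systems, or AI-for-science applications, whereas this paper only uses a pre-trained Transformer encoder for code classification and is unrelated to the core of these keywords.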

!!! tip deepseek-chat TL;DR

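The paper proposes a Transformer-based (DistilBERT) method for automatically identifying loops in source code that can run in parallel; it achieves over 99% accuracy on a dataset of synthetic and real-world code while simplifying preprocessing and improving generalization.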

Abstract (Chinese translation)

自动并行化在软件工程中仍是一个具有挑战性的问题，尤其是在识别能够安全地在现代多核架构上并行执行的循环代码区域方面。传统的静态分析技术，如依赖分析（dependence analysis）和多面体模型（polyhedral models），在处理不规则或动态结构的代码时往往面临困难。本研究提出了一种基于Transformer的方法，用于对源代码的并行化潜力进行分类，重点在于区分独立（可并行化）循环与未定义循环。我们采用DistilBERT模型，通过子词标记化（subword tokenization）处理源代码序列，使模型能够在不依赖手工特征的情况下捕捉上下文相关的句法和语义模式。该方法在一个结合了合成生成循环与人工标注真实代码的平衡数据集上进行了评估，采用了10折交叉验证和多种性能指标。结果显示，该方法性能持续优异，平均准确率超过99%，且误报率低，证明了其鲁棒性和可靠性。与先前基于标记的方法相比，所提出的方法简化了预处理步骤，同时提升了泛化能力并保持了计算效率。这些发现凸显了轻量级Transformer模型在循环层面实际识别并行化机会方面的潜力。

Abstract

Automatic parallelization remains a challenging problem in software engineering, particularly in identifying code regions where loops can be safely executed in parallel on modern multi-core architectures. Traditional static analysis techniques, such as dependence analysis and polyhedral models, often struggle with irregular or dynamically structured code. In this work, we propose a Transformer-based approach to classify the parallelization potential of source code, focusing on distinguishing independent (parallelizable) loops from undefined ones. We adopt DistilBERT to process source code sequences using subword tokenization, enabling the model to capture contextual syntactic and semantic patterns without handcrafted features. The approach is evaluated on a balanced dataset combining synthetically generated loops and manually annotated real-world code, using 10-fold cross-validation and multiple performance metrics. Results show consistently high performance, with mean accuracy above 99% and low false positive rates, demonstrating robustness and reliability. Compared to prior token-based methods, the proposed approach simplifies preprocessing while improving generalization and maintaining computational efficiency. These findings highlight the potential of lightweight Transformer models for practical identification of parallelization opportunities at the loop level.
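For readers unfamiliar with the classification target, the sketch below contrasts an independent loop with one that has a loop-carried dependence. It illustrates the general concept only: the second loop is simply an example of a loop that is not independent and is not meant to represent the paper's "undefined" class. The functions are hypothetical and unrelated to the paper's dataset. Reordering the iterations is used here as an informal check, not as a proof of parallel safety.

```python
def independent_loop(values, order):
    """Each iteration reads and writes only index i, so order does not matter."""
    out = [0] * len(values)
    for i in order:
        out[i] = 2 * values[i] + 1
    return out


def carried_dependence_loop(values, order):
    """Iteration i reads out[i - 1], which another iteration writes."""
    out = list(values)
    for i in order:
        if i > 0:
            out[i] = out[i - 1] + values[i]
    return out


if __name__ == "__main__":
    data = [1, 2, 3, 4]
    forward = range(len(data))
    backward = range(len(data) - 1, -1, -1)
    # Same result in either order: iterations do not interfere with each other.
    print(independent_loop(data, forward) == independent_loop(data, backward))
    # Different results: the outcome depends on the iteration order.
    print(carried_dependence_loop(data, forward) == carried_dependence_loop(data, backward))
```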

Keywords: Automatic Parallelization, Transformer, Source Code Analysis, Loop Parallelization, DistilBERT, Code Classification, Software Engineering, Static Analysis

23. ❌ Aligned, Orthogonal or In-conflict: When can we safely optimize Chain-of-Thought?

Authors: Max Kaufmann, David Lindner, Roland S. Zimmermann, Rohin Shah Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.30036v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	15.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

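Scoring rationale: The paper's core topic is how Chain-of-Thought (CoT) monitorability changes during LLM post-training, especially when the objective placed on the CoT conflicts with the objective on the final output. It is therefore highly relevant to CoT, post-training, alignment, and RLHF (10-15 points). System 2 Thinking, Self-Correction, Hallucination Mitigation, and Explainable AI are somewhat relevant (5 points) because the work involves reasoning monitoring and interpretability. Other keywords such as MoE, SLMs, RAG, and Quantization are not covered in the paper and score 0.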

!!! tip deepseek-chat TL;DR

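The paper finds that, in LLM post-training, Chain-of-Thought (CoT) monitorability decreases when the reward term on the CoT conflicts with the reward term on the final output, and it proposes and validates a framework for predicting when this happens.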

Abstract (Chinese translation)

思维链监控（Chain-of-Thought monitoring）——即自动化系统监控大语言模型的思维链——是一种有效监督人工智能系统的可行方法。然而，模型的思维链能在多大程度上帮助我们监督模型（即思维链的可监控性）可能受到训练的影响，例如模型可能学会隐藏其推理的关键特征。我们提出并实证验证了一个概念框架，用于预测这种现象何时发生以及为何发生。我们将大语言模型的后训练建模为一个强化学习环境，其中奖励函数可分解为两项：一项取决于最终输出，另一项取决于思维链。该框架允许我们在训练前将这两项归类为“对齐”、“正交”或“冲突”。我们预测，使用冲突项进行训练会降低可监控性，正交项不会对其产生影响，而对齐项则会提升可监控性。为验证框架，我们将其用于对一组强化学习环境进行分类，在这些环境中训练大语言模型，并评估训练如何影响思维链的可监控性。研究发现：（1）使用“冲突”奖励项进行训练会降低思维链的可监控性；（2）优化冲突奖励项具有较高难度。

Abstract

Chain-of-Thought (CoT) monitoring, in which automated systems monitor the CoT of an LLM, is a promising approach for effectively overseeing AI systems. However, the extent to which a model’s CoT helps us oversee the model - the monitorability of the CoT - can be affected by training, for instance by the model learning to hide important features of its reasoning. We propose and empirically validate a conceptual framework for predicting when and why this occurs. We model LLM post-training as an RL environment where the reward decomposes into two terms: one term depending on final outputs and another term depending on the CoT. Our framework allows us to classify these two terms as “aligned”, “orthogonal”, or “in-conflict” before training. We predict that training with in-conflict terms will reduce monitorability, orthogonal terms will not affect it, and aligned terms will improve it. To validate our framework, we use it to classify a set of RL environments, train LLMs within those environments, and evaluate how training affects CoT monitorability. We find that (1) training with “in-conflict” reward terms reduces CoT monitorability and (2) optimizing in-conflict reward terms is difficult.
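The reward decomposition described in the abstract can be written compactly. The notation is an assumption made for illustration, since the abstract gives no symbols: $x$ is the prompt, $c$ the chain of thought, and $y$ the final output.

$$
R(x, c, y) = R_{\text{out}}(x, y) + R_{\text{CoT}}(x, c)
$$

According to the abstract's predictions, training reduces, leaves unchanged, or improves monitorability depending on whether the two terms are classified as in-conflict, orthogonal, or aligned. The abstract does not spell out the classification criteria, so none are assumed here.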

Keywords: Chain-of-Thought, CoT monitoring, LLM post-training, monitorability, reward decomposition, aligned orthogonal in-conflict, RL environment, training effects

24. ❌ Tucker Attention: A generalization of approximate attention mechanisms

Authors: Timon Klein, Jonas Kusch, Sebastian Sager, Stefan Schnake, Steffen Schotthöfer Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.30033v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	8.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	8.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

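Scoring rationale: The paper proposes a new attention mechanism, Tucker Attention, that focuses on reducing the memory footprint of multi-head self-attention (MHA), an innovation in LLM fundamentals. Core relevant keywords: (1) ‘Large Language Models OR LLMs OR Foundation Models’ (8 points): the paper evaluates Tucker Attention in LLM test cases, so it directly concerns LLMs; (2) ‘PEFT OR LoRA OR Parameter-efficient Fine-tuning’ (8 points): Tucker Attention is a parameter-efficient scheme that substantially reduces the parameter count; (3) ‘KV Cache Compression OR Linear Attention OR FlashAttention’ (8 points): the method is compatible with FlashAttention and is an attention-optimization technique; (4) ‘Quantization OR Model Compression OR Low-bit Weights’ (5 points): the method reduces the memory footprint through low-rank factorization, which is somewhat related to model compression. Other keywords such as MoE, SFT, and RAG are unrelated to the paper and score 0.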

!!! tip deepseek-chat TL;DR

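The paper proposes Tucker Attention, a generalized attention mechanism that uses a low-rank factorization strategy to substantially reduce the number of parameters in self-attention layers; in LLM and ViT test cases it is more parameter-efficient than existing methods such as GQA and MLA, and it is compatible with FlashAttention and RoPE.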

Abstract (Chinese translation)

为降低多头自注意力机制中自注意力模块的内存占用，学界涌现出多种方法，例如分组查询注意力与多头潜在注意力。这些方法通过在嵌入维度或注意力头维度上采用特定的低秩分解策略来实现优化。从经典低秩近似的视角看，这些方法具有非传统特性，引发了对它们实际逼近的对象以及如何解释所得表示的低秩行为的思考。为回答这些问题，本研究提出对自注意力层中权重对象的广义视角及一种分解策略，据此构建了一种参数高效方案——塔克注意力。经在大型语言模型和视觉Transformer测试案例中验证，与分组查询注意力和多头潜在注意力相比，塔克注意力在达到相近验证指标时所需参数数量可降低一个数量级。此外，塔克注意力将多头自注意力、分组查询注意力及多头潜在注意力均涵盖为特例，且完全兼容闪电注意力与旋转位置编码。这种广义策略揭示了多头自注意力、分组查询注意力及多头潜在注意力实际达到的秩，并进一步为多头潜在注意力提供了简化方案。

Abstract

The pursuit of reducing the memory footprint of the self-attention mechanism in multi-headed self attention (MHA) spawned a rich portfolio of methods, e.g., group-query attention (GQA) and multi-head latent attention (MLA). The methods leverage specialized low-rank factorizations across embedding dimensions or attention heads. From the point of view of classical low-rank approximation, these methods are unconventional and raise questions of which objects they really approximate and how to interpret the low-rank behavior of the resulting representations. To answer these questions, this work proposes a generalized view on the weight objects in the self-attention layer and a factorization strategy, which allows us to construct a parameter efficient scheme, called Tucker Attention. Tucker Attention requires an order of magnitude fewer parameters for comparable validation metrics, compared to GQA and MLA, as evaluated in LLM and ViT test cases. Additionally, Tucker Attention encompasses GQA, MLA, MHA as special cases and is fully compatible with flash-attention and rotary position embeddings (RoPE). This generalization strategy yields insights of the actual ranks achieved by MHA, GQA, and MLA, and further enables simplifications for MLA.
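For intuition about why a Tucker factorization can save parameters, the sketch below counts parameters for a generic Tucker decomposition of a three-way weight tensor. This is the textbook Tucker format, not the paper's specific construction. The tensor shape (heads × model dimension × head dimension), the ranks, and the function names are all assumptions chosen for illustration.

```python
def dense_params(shape):
    """Parameter count of an unfactorized tensor with the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count


def tucker_params(shape, ranks):
    """Parameter count of a Tucker format: one core plus one factor per mode.

    The core tensor has shape `ranks`; mode k has a factor matrix of shape
    (shape[k], ranks[k]).
    """
    if len(shape) != len(ranks):
        raise ValueError("shape and ranks must have the same length")
    core = dense_params(ranks)
    factors = sum(dim * rank for dim, rank in zip(shape, ranks))
    return core + factors


if __name__ == "__main__":
    # Hypothetical query-weight tensor: 32 heads, model dim 4096, head dim 128.
    shape = (32, 4096, 128)
    ranks = (8, 256, 64)  # illustrative ranks, not values from the paper
    dense = dense_params(shape)
    tucker = tucker_params(shape, ranks)
    print(dense, tucker, round(dense / tucker, 1))
```

The saving comes from the core and factor matrices being small relative to the full tensor. How much accuracy is retained at a given rank is an empirical question that this count does not address.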

Keywords: Tucker Attention, self-attention mechanism, low-rank factorization, parameter efficient, multi-headed self attention, flash-attention, LLM, memory footprint reduction

25. ❌ The Triadic Cognitive Architecture: Bounding Autonomous Action via Spatio-Temporal and Epistemic Friction

Authors: Davide Di Gioia Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.30031v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

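Scoring rationale: The paper studies the cognitive architecture of LLM-based autonomous AI agents (LLM Agents) in interactive environments and proposes the Triadic Cognitive Architecture (TCA) framework to optimize the agent's decision process. It touches on Tool Use, in-depth reasoning (System 2 Thinking), and Chain of Thought, and validates the approach in a simulated medical diagnosis environment (an AI for Science application). Other keywords, such as MoE, training methods, and compression techniques, are not mentioned in the abstract and therefore score 0.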
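
The pass/fail decision in these entries can be sketched as follows. The threshold formula (5.0 + 0.8 × total weight, with a total weight of 27.0) comes from the report's configuration section; the function names are hypothetical, and the rule that a paper passes when its total is at least the threshold is an assumption. Note that every row in this table carries a weight of 0.0, so the listed scores are 0.0 regardless of relevance.

```python
BASE_SCORE = 5.0       # constant term of the report's threshold formula
WEIGHT_FACTOR = 0.8    # multiplier on the total keyword weight


def pass_threshold(total_weight: float) -> float:
    """Threshold from the report's configuration: 5.0 + 0.8 * total weight."""
    return BASE_SCORE + WEIGHT_FACTOR * total_weight


def total_score(rows) -> float:
    """Sum the 'Score' column; each row is (keyword, weight, relevance, score)."""
    return sum(score for _keyword, _weight, _relevance, score in rows)


def passes(rows, total_weight: float) -> bool:
    # Assumption: "at least the threshold" counts as a pass.
    return total_score(rows) >= pass_threshold(total_weight)


# Three rows copied from the table above (weight 0.0, so score 0.0).
rows = [
    ("LLM Agents OR Autonomous Agents OR Agentic Workflow", 0.0, 10.0, 0.0),
    ("Tool Use OR Function Calling OR API Tool Use", 0.0, 8.0, 0.0),
    ("System 2 Thinking OR Slow Thinking OR In-depth Reasoning", 0.0, 8.0, 0.0),
]
threshold = pass_threshold(27.0)
paper_passes = passes(rows, 27.0)
```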

!!! tip deepseek-chat TL;DR

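The paper addresses the lack of cognitive bounds in current LLM-based autonomous AI agents operating in interactive environments. It proposes the Triadic Cognitive Architecture framework, which introduces cognitive friction and optimal control to shorten decision time and improve action outcomes, and validates the approach in a simulated emergency medical diagnostic grid.

Abstract (Chinese translation)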

当前由大型语言模型驱动的自主人工智能代理主要处于一种认知失重状态：它们处理信息时缺乏对网络拓扑、时间节奏或认知局限的内在感知。因此，启发式代理循环在交互环境中会表现出多种失效模式，包括拥堵条件下的过度工具使用、时间衰减下的冗长决策过程，以及证据模糊时的脆弱行为。本文提出三元认知架构，这是一个将机器推理建立在连续时间物理学基础上的统一数学框架。通过综合非线性滤波理论、黎曼路由几何与最优控制理论，我们正式定义了认知摩擦的概念。我们将代理的决策过程映射为一个耦合的随机控制问题，其中信息获取具有路径依赖性并受物理约束。三元认知架构不依赖任意的启发式停止标记，而是采用基于哈密顿-雅可比-贝尔曼方程的停止边界，并通过净效用停止条件实例化了一种基于推演的信念依赖信息价值近似方法。通过在模拟急诊医疗诊断网格中的实证验证，我们证明：尽管贪婪基线策略在延迟和拥堵成本下存在过度决策问题，三元策略在此环境中能够在不降低诊断准确性的前提下缩短行动时间并提高患者生存率。

Abstract

Current autonomous AI agents, driven primarily by Large Language Models (LLMs), operate in a state of cognitive weightlessness: they process information without an intrinsic sense of network topology, temporal pacing, or epistemic limits. Consequently, heuristic agentic loops (e.g., ReAct) can exhibit failure modes in interactive environments, including excessive tool use under congestion, prolonged deliberation under time decay, and brittle behavior under ambiguous evidence. In this paper, we propose the Triadic Cognitive Architecture (TCA), a unified mathematical framework that grounds machine reasoning in continuous-time physics. By synthesizing nonlinear filtering theory, Riemannian routing geometry, and optimal control, we formally define the concept of Cognitive Friction. We map the agent’s deliberation process to a coupled stochastic control problem where information acquisition is path-dependent and physically constrained. Rather than relying on arbitrary heuristic stop-tokens, the TCA uses an HJB-motivated stopping boundary and instantiates a rollout-based approximation of belief-dependent value-of-information with a net-utility halting condition. Through empirical validation in a simulated Emergency Medical Diagnostic Grid (EMDG), we demonstrate that while greedy baselines over-deliberate under latency and congestion costs, the triadic policy reduces time-to-action while improving patient viability without degrading diagnostic accuracy in this environment.
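
To make the idea of a net-utility halting condition concrete, here is a minimal sketch under strong simplifying assumptions: a discrete belief over hypotheses, a single binary test, and a utility equal to the probability of naming the correct hypothesis. It is not the TCA implementation (no HJB boundary, routing geometry, or time decay), and all function names are hypothetical.

```python
import random


def bayes_update(belief, likelihoods, outcome):
    """Posterior over hypotheses after one binary observation.

    likelihoods[h] is P(positive outcome | hypothesis h).
    """
    unnorm = [b * (p if outcome else 1.0 - p) for b, p in zip(belief, likelihoods)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def rollout_voi(belief, likelihoods, n_rollouts=2000, seed=0):
    """Monte Carlo estimate of the value of one more observation.

    Acting now is worth max(belief); acting after the observation is worth
    the expected maximum of the posterior, estimated by sampled rollouts.
    """
    rng = random.Random(seed)
    hypotheses = range(len(belief))
    total = 0.0
    for _ in range(n_rollouts):
        h = rng.choices(hypotheses, weights=belief)[0]   # sample a "true" hypothesis
        outcome = rng.random() < likelihoods[h]          # sample the test result
        total += max(bayes_update(belief, likelihoods, outcome))
    return total / n_rollouts - max(belief)


def should_halt(belief, likelihoods, query_cost):
    """Halt when the estimated information value no longer covers the query cost."""
    return rollout_voi(belief, likelihoods) - query_cost <= 0.0
```

In this toy setting an informative test (likelihoods 0.9 vs. 0.1) is worth taking at low cost, while an uninformative one (0.5 vs. 0.5) triggers an immediate halt.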

Keywords: Autonomous AI agents, Large Language Models, Cognitive Architecture, Tool Use, Stochastic Control, Emergency Medical Diagnostic, ReAct, Belief-dependent Value-of-Information

26. ❌ Hybrid Framework for Robotic Manipulation: Integrating Reinforcement Learning and Large Language Models

Authors: Md Saad, Sajjad Hussain, Mohd Suhaib Venue/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.30022v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

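Scoring rationale: The paper's core contribution is a hybrid framework that combines reinforcement learning (RL) with large language models (LLMs) to improve robotic manipulation. It explicitly uses LLMs for high-level task planning and natural language understanding, so it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points). The framework also aims to let robots understand and execute complex instructions and adapt to their environment, which is essentially an LLM-driven autonomous agent system, so it is also highly relevant to "LLM Agents OR Autonomous Agents OR Agentic Workflow" (10 points). The paper does not cover the technical specifics of the other keywords (such as MoE, SFT, RAG, or quantization), nor does it clearly belong to a specific scientific field such as bioinformatics, so the remaining keywords score 0.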

!!! tip deepseek-chat TL;DR

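The study proposes a hybrid framework combining reinforcement learning and large language models for robotic manipulation tasks. Experiments show that, compared with RL-only systems, the framework reduces task completion time by 33.5% and improves accuracy and adaptability by 18.1% and 36.4%, respectively.

Abstract (Chinese translation)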

本文提出了一种结合强化学习（RL）与大型语言模型（LLMs）的新型混合框架，以提升机器人操作任务的性能。该框架利用强化学习实现精确的低层控制，同时借助大型语言模型进行高层任务规划与自然语言理解，从而有效连接机器人系统中的低层执行与高层推理。这种集成方式使机器人能够理解并执行复杂的类人指令，并实时适应动态变化的环境。该框架在基于PyBullet的仿真环境中使用Franka Emika Panda机械臂进行了测试，并以多种操作场景作为基准。实验结果表明，与仅使用强化学习的系统相比，该框架在任务完成时间上减少了33.5%，在准确性与适应性方面分别提升了18.1%和36.4%。这些结果凸显了大型语言模型增强型机器人系统在实际应用中的潜力，使其更高效、更具适应性，并能更好地与人交互。未来的研究将致力于探索仿真到现实的迁移、可扩展性以及多机器人系统，以进一步拓宽该框架的适用领域。

Abstract

This paper introduces a new hybrid framework that combines Reinforcement Learning (RL) and Large Language Models (LLMs) to improve robotic manipulation tasks. By utilizing RL for accurate low-level control and LLMs for high level task planning and understanding of natural language, the proposed framework effectively connects low-level execution with high-level reasoning in robotic systems. This integration allows robots to understand and carry out complex, human-like instructions while adapting to changing environments in real time. The framework is tested in a PyBullet-based simulation environment using the Franka Emika Panda robotic arm, with various manipulation scenarios as benchmarks. The results show a 33.5% decrease in task completion time and enhancements of 18.1% and 36.4% in accuracy and adaptability, respectively, when compared to systems that use only RL. These results underscore the potential of LLM-enhanced robotic systems for practical applications, making them more efficient, adaptable, and capable of interacting with humans. Future research will aim to explore sim-to-real transfer, scalability, and multi-robot systems to further broaden the framework’s applicability.

Keywords: Robotic Manipulation, Reinforcement Learning, Large Language Models, Hybrid Framework, Task Planning, Natural Language Understanding, Simulation, Franka Emika Panda

27. ❌ Phyelds: A Pythonic Framework for Aggregate Computing

Authors: Gianluca Aguzzi, Davide Domini, Nicolas Farabegoli, Mirko Viroli Venue/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29999v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

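Scoring rationale: Phyelds focuses on an aggregate programming framework for the Python ecosystem, intended to support large-scale distributed learning and the implementation of machine learning algorithms. Although the paper mentions machine learning integration and federated learning coordination, every keyword targets LLM-specific techniques, training methods, inference optimization, alignment, agent systems, and so on. The paper does not involve LLMs, deep learning model architectures, training techniques, inference acceleration, alignment methods, or AI-for-science applications, so all keywords score 0.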

!!! tip deepseek-chat TL;DR

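To address the lack of aggregate programming tools for the Python data science community, the paper develops the Phyelds framework, which implements the field calculus model and demonstrates integration with federated learning coordination and a multi-agent reinforcement learning simulator.

Abstract (Chinese translation)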

聚合编程是一种基于场的协调范式，经过十余年的探索，已在传感器网络、机器人物联网等多个领域成功应用，并拥有多种编程语言实现，如Protelis、ScaFi（Scala）和FCPP（C++）。近期的一个研究方向将机器学习与聚合计算相结合，旨在支持大规模分布式学习，并为实现学习算法提供新的抽象。然而，现有实现并未面向数据科学从业者，这些从业者主要使用Python——这是数据科学和机器学习领域的事实标准语言，拥有丰富成熟的生态系统。Python在其他应用场景（如教育和机器人领域，例如通过ROS）也具有优势。为填补这一空白，我们提出了Phyelds，一个用于聚合编程的Python库。Phyelds完整而轻量地实现了场演算计算模型，其API设计符合Python风格，架构旨在与Python的机器学习生态系统无缝集成。我们描述了Phyelds的设计与实现，并通过从经典的聚合计算模式到联邦学习协调，以及与广泛使用的多智能体强化学习模拟器的集成等跨领域应用，展示了其多功能性。

Abstract

Aggregate programming is a field-based coordination paradigm with over a decade of exploration and successful applications across domains including sensor networks, robotics, and IoT, with implementations in various programming languages, such as Protelis, ScaFi (Scala), and FCPP (C++). A recent research direction integrates machine learning with aggregate computing, aiming to support large-scale distributed learning and provide new abstractions for implementing learning algorithms. However, existing implementations do not target data science practitioners, who predominantly work in Python–the de facto language for data science and machine learning, with a rich and mature ecosystem. Python also offers advantages for other use cases, such as education and robotics (e.g., via ROS). To address this gap, we present Phyelds, a Python library for aggregate programming. Phyelds offers a fully featured yet lightweight implementation of the field calculus model of computation, featuring a Pythonic API and an architecture designed for seamless integration with Python’s machine learning ecosystem. We describe the design and implementation of Phyelds and illustrate its versatility across domains, from well-known aggregate computing patterns to federated learning coordination and integration with a widely used multi-agent reinforcement learning simulator.
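
One of the well-known aggregate computing patterns is the hop-count gradient, in which every device repeatedly recomputes its distance to the nearest source from its neighbours' values. The sketch below simulates that pattern in plain Python with synchronous rounds; it does not use the Phyelds API, and the function names and the example topology are hypothetical.

```python
import math


def gradient_round(field, neighbors, sources):
    """One synchronous round: each node reads its neighbours' previous values."""
    updated = {}
    for node in field:
        if node in sources:
            updated[node] = 0.0
        else:
            # Distance is one hop more than the closest neighbour's estimate.
            updated[node] = min(
                (field[n] + 1.0 for n in neighbors[node]), default=math.inf
            )
    return updated


def run_gradient(neighbors, sources, rounds):
    """Start with unknown (infinite) distances and iterate for a number of rounds."""
    field = {node: math.inf for node in neighbors}
    for _ in range(rounds):
        field = gradient_round(field, neighbors, sources)
    return field


# Hypothetical four-device line topology: a - b - c - d, with a as the source.
line = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
result = run_gradient(line, sources={"a"}, rounds=4)
```

Each round propagates the estimate one hop further, so a line of four devices stabilises after four rounds.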

Keywords: Aggregate Programming, Python Framework, Field Calculus, Distributed Learning, Federated Learning, Multi-agent Systems, Machine Learning Integration, ROS Integration

28. ❌ Scalable AI-assisted Workflow Management for Detector Design Optimization Using Distributed Computing

Authors: Derek Anderson, Amit Bashyal, Markus Diefenthaler, Cristiano Fanelli, Wen Guan, Tanja Horn, Alex Jentsch, Meifeng Lin, Tadashi Maeno, Kei Nagai, Hemalata Nayak, Connor Pecar, Karthik Suresh, Fang-Ying Tsai, Anselm Vossen, Tianle Wang, Torre Wenaus Venue/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.30014v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

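Scoring rationale: The paper presents a framework for detector design optimization built on the AI/ML-driven distributed workflow engine PanDA-iDDS, with multi-objective Bayesian optimization integrated. Its content is unrelated to the vast majority of keywords (which cover large-model principles, training methods, inference optimization, alignment, agents, and so on), because those keywords refer specifically to large language models (LLMs) and related techniques, whereas the AI/ML in this paper refers to more general machine learning and optimization methods (such as Bayesian optimization) and does not involve LLMs. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", since the paper applies AI to science (high-energy physics detector design). The relevance is moderate (5 points) because the paper does not explicitly involve bioinformatics or cheminformatics, and its AI contribution is at the framework level rather than a core model innovation.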

!!! tip deepseek-chat TL;DR

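The paper proposes an AI-assisted framework that integrates multi-objective Bayesian optimization with the PanDA-iDDS distributed workflow engine to optimize high-energy physics detector design, and demonstrates gains in automation, scalability, and efficiency on the ePIC and dRICH detector case studies for the EIC.

Abstract (Chinese translation)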

生产与分布式分析（PanDA）系统最初为欧洲核子研究中心大型强子对撞机（LHC）的ATLAS实验开发，现已演变为一个在分布式计算资源上协调大规模工作流的稳健平台。结合其智能分布式调度（iDDS）组件，PanDA通过可扩展且灵活的工作流引擎支持人工智能/机器学习驱动的工作流。
我们提出了一种用于探测器设计优化的AI辅助框架，该框架将多目标贝叶斯优化与PanDA-iDDS工作流引擎相结合，以协调跨异构资源的迭代模拟。该框架应对了现代探测器设计中固有的高维参数空间探索挑战。
我们通过基准问题以及对电子-离子对撞机（EIC）的ePIC和dRICH探测器的实际研究，展示了该框架的应用。结果表明，其在多目标优化中实现了更高的自动化程度、可扩展性和效率。这项工作为AI驱动的探测器设计及其他计算密集型科学应用建立了一个灵活且可扩展的范式。

Abstract

The Production and Distributed Analysis (PanDA) system, originally developed for the ATLAS experiment at the CERN Large Hadron Collider (LHC), has evolved into a robust platform for orchestrating large-scale workflows across distributed computing resources. Coupled with its intelligent Distributed Dispatch and Scheduling (iDDS) component, PanDA supports AI/ML-driven workflows through a scalable and flexible workflow engine. We present an AI-assisted framework for detector design optimization that integrates multi-objective Bayesian optimization with the PanDA–iDDS workflow engine to coordinate iterative simulations across heterogeneous resources. The framework addresses the challenge of exploring high-dimensional parameter spaces inherent in modern detector design. We demonstrate the framework using benchmark problems and realistic studies of the ePIC and dRICH detectors for the Electron-Ion Collider (EIC). Results show improved automation, scalability, and efficiency in multi-objective optimization. This work establishes a flexible and extensible paradigm for AI-driven detector design and other computationally intensive scientific applications.
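
In multi-objective optimization, candidate designs are typically compared by Pareto dominance rather than by a single score. The sketch below shows a simple non-dominated filter for minimization problems; it is a generic illustration, not part of PanDA, iDDS, or the paper's optimizer, and the objective values are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Keep the points that no other point dominates (all objectives minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (objective_1, objective_2) values for four candidate designs.
candidates = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0)]
front = pareto_front(candidates)
```

Here the design (3.0, 4.0) is dropped because (2.0, 3.0) is better on both objectives, while the other three represent different trade-offs.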

Keywords: AI-assisted workflow, detector design optimization, distributed computing, multi-objective Bayesian optimization, PanDA-iDDS, Electron-Ion Collider, scalable framework, computational scientific applications

29. ❌ Enhancing Structural Mapping with LLM-derived Abstractions for Analogical Reasoning in Narratives

Authors: Mohammadhossein Khojasteh, Yifan Jiang, Stefano De Giorgis, Frank van Harmelen, Filip Ilievski Venue/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29997v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

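Scoring rationale: The paper studies the use of LLMs for analogical reasoning in narratives and proposes the YARN framework, which uses LLMs to decompose and abstract narrative units in order to enhance structural mapping. It is highly relevant to "Large Language Models" (10 points) because LLMs are the core tool, and relevant to "Chain of Thought" and "System 2 Thinking" (8 points each) because the work involves multi-step and in-depth reasoning. Other keywords, such as MoE, quantization, and RAG, are not covered and score 0.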

!!! tip deepseek-chat TL;DR

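The study proposes the YARN framework, which enhances structural mapping with LLM-generated narrative abstractions to improve analogical reasoning in narratives. Experiments show that the approach is competitive with or better than end-to-end LLM baselines.

Abstract (Chinese translation)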

类比推理是人类在问题解决与论证中进行归纳推理的关键驱动力。然而，机器在叙事结构间的类比推理仍面临挑战。结构映射的认知引擎无法直接应用，因为它们预设了预先提取的实体，而大型语言模型（LLM）的表现则对提示格式及叙事间表面相似度高度敏感。这一差距引出了一个核心问题：利用LLM生成的抽象来增强结构映射，会如何影响其在叙事中的类比推理能力？为此，我们提出了一个模块化框架YARN（为叙事推理生成抽象），该框架使用LLM将叙事分解为单元，对这些单元进行抽象，然后将其传递给映射组件，以跨故事对齐元素，从而执行类比推理。基于先前关于框架理论的研究，我们定义并操作化了四个抽象层级，这些层级既能捕捉单元的一般意义，也能捕捉其在故事中的角色。实验表明，抽象化处理能持续提升模型性能，其表现达到或优于端到端LLM基线模型。进一步的错误分析揭示了在合适层级进行抽象、整合隐含因果关系以及叙事中类比模式的新兴分类等方面仍存在的挑战。YARN支持通过系统调整实验设置来分析各组件贡献，为促进未来研究，我们已公开YARN的代码。

Abstract

Analogical reasoning is a key driver of human generalization in problem-solving and argumentation. Yet, analogies between narrative structures remain challenging for machines. Cognitive engines for structural mapping are not directly applicable, as they assume pre-extracted entities, whereas LLMs’ performance is sensitive to prompt format and the degree of surface similarity between narratives. This gap motivates a key question: What is the impact of enhancing structural mapping with LLM-derived abstractions on their analogical reasoning ability in narratives? To that end, we propose a modular framework named YARN (Yielding Abstractions for Reasoning in Narratives), which uses LLMs to decompose narratives into units, abstract these units, and then passes them to a mapping component that aligns elements across stories to perform analogical reasoning. We define and operationalize four levels of abstraction that capture both the general meaning of units and their roles in the story, grounded in prior work on framing. Our experiments reveal that abstractions consistently improve model performance, resulting in competitive or better performance than end-to-end LLM baselines. Closer error analysis reveals the remaining challenges in abstraction at the right level, in incorporating implicit causality, and an emerging categorization of analogical patterns in narratives. YARN enables systematic variation of experimental settings to analyze component contributions, and to support future work, we make the code for YARN openly available.

Keywords: analogical reasoning, narratives, LLM-derived abstractions, structural mapping, YARN framework, cognitive engines, abstraction levels, story units

30. ❌ Extending MONA in Camera Dropbox: Reproduction, Learned Approval, and Design Implications for Reward-Hacking Mitigation

Authors: Nathan Heath Venue/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29993v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

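Scoring rationale: The paper extends and validates the MONA method in a reinforcement learning environment, focusing on reward-hacking mitigation and agent safety. It does not involve large models, deep learning principles, or applications in scientific fields. All keywords relate to large-model techniques, training methods, inference optimization, AI-for-science applications, and so on, whereas this paper is purely reinforcement learning safety research, so the relevance of every keyword is 0.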

!!! tip deepseek-chat TL;DR

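The paper extends the study of the MONA method in the Camera Dropbox environment, introducing a modular learned-approval mechanism to validate reward-hacking mitigation. It finds that the best calibrated learned overseer achieves zero reward hacking but a low intended-behavior rate, suggesting that the engineering challenge shifts to building learned approval models that preserve sufficient foresight.

Abstract (Chinese translation)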

近视优化与非近视批准（MONA）通过限制智能体的规划视野，同时提供远见性批准作为训练信号，以缓解多步奖励黑客问题（Farquhar et al., 2025）。原论文指出了一个关键开放性问题：构建批准的方法——特别是批准对已实现结果的依赖程度——如何影响MONA安全保证的有效性。我们提出了一项以复现为先的扩展研究，基于公开的MONA Camera Dropbox环境，具体工作包括：（i）将已发布的代码库重构为标准Python项目，并实现脚本化PPO训练；（ii）使用发布的参考数组验证了普通强化学习（91.5%奖励黑客率）与理想MONA（0.0%黑客率）之间的对比结果；（iii）引入模块化的学习型批准套件，涵盖理想型、噪声型、误设型、学习型及校准型批准机制。在针对批准方法、规划视野、数据集规模和校准策略的有限规模试点扫描中，表现最佳的校准学习型监督器运行实现了零观测奖励黑客，但其预期行为率显著低于理想MONA（11.9%对比99.9%），这表明问题在于优化不足而非黑客行为重现。这些结果将MONA论文中关于批准谱系的猜想转化为可运行的实验对象，并表明核心工程挑战已从验证MONA概念转向构建能够保持足够远见性、同时不重新开放奖励黑客通道的学习型批准模型。代码、配置及复现命令已公开。 https://github.com/codernate92/mona-camera-dropbox-repro

Abstract

Myopic Optimization with Non-myopic Approval (MONA) mitigates multi-step reward hacking by restricting the agent’s planning horizon while supplying far-sighted approval as a training signal (Farquhar et al., 2025). The original paper identifies a critical open question: how the method of constructing approval – particularly the degree to which approval depends on achieved outcomes – affects whether MONA’s safety guarantees hold. We present a reproduction-first extension of the public MONA Camera Dropbox environment that (i) repackages the released codebase as a standard Python project with scripted PPO training, (ii) confirms the published contrast between ordinary RL (91.5% reward-hacking rate) and oracle MONA (0.0% hacking rate) using the released reference arrays, and (iii) introduces a modular learned-approval suite spanning oracle, noisy, misspecified, learned, and calibrated approval mechanisms. In reduced-budget pilot sweeps across approval methods, horizons, dataset sizes, and calibration strategies, the best calibrated learned-overseer run achieves zero observed reward hacking but substantially lower intended-behavior rates than oracle MONA (11.9% vs. 99.9%), consistent with under-optimization rather than re-emergent hacking. These results operationalize the MONA paper’s approval-spectrum conjecture as a runnable experimental object and suggest that the central engineering challenge shifts from proving MONA’s concept to building learned approval models that preserve sufficient foresight without reopening reward-hacking channels. Code, configurations, and reproduction commands are publicly available. https://github.com/codernate92/mona-camera-dropbox-repro
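
The contrast between ordinary RL and the myopic-plus-approval idea described in the abstract can be sketched by comparing per-step training targets. This is a simplified illustration under stated assumptions (the approval values are supplied by hand and the function names are hypothetical); it is not the released codebase or its PPO training loop.

```python
def discounted_returns(rewards, gamma=1.0):
    """Ordinary RL target: each step is credited with all future rewards."""
    returns, running = [], 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    return returns[::-1]


def myopic_approval_targets(rewards, approvals):
    """Myopic target: immediate reward plus an overseer's approval of that step.

    No future reward is propagated back, so a step cannot be reinforced
    merely because it sets up a later payoff.
    """
    if len(rewards) != len(approvals):
        raise ValueError("rewards and approvals must have the same length")
    return [reward + approval for reward, approval in zip(rewards, approvals)]


# Hypothetical three-step episode with a reward only at the end.
rewards = [0.0, 0.0, 1.0]
approvals = [0.5, 0.2, 0.0]   # assumed far-sighted approval scores per step

ordinary = discounted_returns(rewards)
myopic = myopic_approval_targets(rewards, approvals)
```

Under the ordinary target every earlier step inherits the final reward, whereas the myopic target rewards early steps only to the extent that the approval signal endorses them.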

Keywords: MONA, reward hacking, reinforcement learning, agent safety, approval mechanisms, PPO training, calibration, Camera Dropbox

Authors: Iain Swift, JingHua Ye, Ruairi O’Reilly Venue/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29977v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

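Scoring rationale: The paper focuses on multimodal deep learning for cancer prognosis (glioma survival prediction), specifically on methods for quantifying cross-modal interactions. Its content is unrelated to most keywords (which mainly cover large-model principles, training methods, inference optimization, alignment, agents, and so on), because those keywords target large language models (LLMs) and related techniques, whereas this paper studies conventional multimodal deep learning models (combining WSI and RNA-seq features), not large language models. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper is an AI application in bioinformatics directly related to cancer research; it scores 10 points (highly relevant, core content).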

!!! tip deepseek-chat TL;DR

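The study extends the InterSHAP method from classification to Cox proportional hazards models to quantify cross-modal interactions in multimodal fusion models for glioma survival prediction, and finds that gains in predictive performance come mainly from complementary signal aggregation rather than learned synergy.

Abstract (Chinese translation)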

多模态深度学习在癌症预后预测中通常被认为受益于跨模态的协同交互作用，然而这一假设尚未在生存预测场景中得到直接验证。本研究将基于沙普利交互作用指数的度量方法InterSHAP从分类任务适配至Cox比例风险模型，并应用于量化胶质瘤生存预测中的跨模态交互作用。利用TCGA-GBM和TCGA-LGG数据（n=575），我们评估了四种融合全视野数字切片（Whole-Slide Image, WSI）与RNA-seq特征的架构。核心发现表明预测性能与测得的交互作用呈反向关系：实现更优区分度（C指数 0.64 → 0.82）的架构展现出等同或更低的跨模态交互作用（4.8% → 3.0%）。方差分解显示所有架构中均存在稳定的加性贡献（WSI ≈ 40%，RNA ≈ 55%，交互作用 ≈ 4%），表明性能提升源于互补信号的聚合而非学习到的协同效应。这些发现为比较融合策略提供了实用的模型审计工具，重新阐释了架构复杂性在多模态融合中的作用，并对隐私保护的联邦部署具有启示意义。

Abstract

Multimodal deep learning for cancer prognosis is commonly assumed to benefit from synergistic cross-modal interactions, yet this assumption has not been directly tested in survival prediction settings. This work adapts InterSHAP, a Shapley interaction index-based metric, from classification to Cox proportional hazards models and applies it to quantify cross-modal interactions in glioma survival prediction. Using TCGA-GBM and TCGA-LGG data (n=575), we evaluate four fusion architectures combining whole-slide image (WSI) and RNA-seq features. Our central finding is an inverse relationship between predictive performance and measured interaction: architectures achieving superior discrimination (C-index 0.64 → 0.82) exhibit equivalent or lower cross-modal interaction (4.8% → 3.0%). Variance decomposition reveals stable additive contributions across all architectures (WSI ≈ 40%, RNA ≈ 55%, Interaction ≈ 4%), indicating that performance gains arise from complementary signal aggregation rather than learned synergy. These findings provide a practical model auditing tool for comparing fusion strategies, reframe the role of architectural complexity in multimodal fusion, and have implications for privacy-preserving federated deployment.
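
For two modalities, the Shapley decomposition that underlies interaction metrics of this kind has a closed form. The sketch below assumes a value function v(S) giving the model output when only the modalities in S are available (the others masked); the function name and the numbers are hypothetical, and the exact normalisation used by InterSHAP is not reproduced here.

```python
def shapley_two_players(v_none, v_a, v_b, v_ab):
    """Shapley values and pairwise interaction index for a two-player game.

    v_none: value with neither modality, v_a / v_b: value with one modality,
    v_ab: value with both modalities present.
    """
    # Average marginal contribution over the two possible join orders.
    phi_a = 0.5 * ((v_a - v_none) + (v_ab - v_b))
    phi_b = 0.5 * ((v_b - v_none) + (v_ab - v_a))
    # What the pair adds beyond the sum of its individual effects.
    interaction = v_ab - v_a - v_b + v_none
    return phi_a, phi_b, interaction


# Hypothetical values, chosen only to illustrate the arithmetic.
phi_wsi, phi_rna, interaction = shapley_two_players(0.0, 0.3, 0.5, 0.9)
```

The two Shapley values always sum to v_ab minus v_none, and an interaction close to zero indicates that the modalities contribute additively.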

Keywords: multimodal deep learning, glioma survival prediction, cross-modal interactions, InterSHAP, Cox proportional hazards models, whole-slide image, RNA-seq, fusion architectures

32. ❌ Trimodal Deep Learning for Glioma Survival Prediction: A Feasibility Study Integrating Histopathology, Gene Expression, and MRI

Authors: Iain Swift, JingHua Ye Venue/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29968v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

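Scoring rationale: The paper is a feasibility study on using deep learning for multimodal (histopathology, gene expression, MRI) glioma survival prediction, an application of AI in biomedicine. None of the keywords on core large-model topics (principles, training methods, inference optimization, alignment, agent systems, and so on) is relevant, so all score 0 except "AI for Science OR Bioinformatics OR Cheminformatics" (5 points, since the work is an AI application in bioinformatics/science and is somewhat related to the paper's medical AI focus). The paper does not involve any large language model (LLM) or related technique, nor does it mention any of the designated expert authors.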

!!! tip deepseek-chat TL;DR

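The study explores the feasibility of adding FLAIR MRI as a third modality to a bimodal histopathology and gene expression framework for glioma survival prediction. Preliminary results suggest that trimodal early fusion may add prognostic value even with limited samples, but the improvement is not statistically significant.

Abstract (Chinese translation)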

多模态深度学习通过整合组织病理学与基因组数据，已提升了脑肿瘤预后预测的准确性，然而在统一的生存分析框架中，容积磁共振成像（MRI）的贡献尚未得到充分探索。本先导研究将BraTS2021数据集中的液体衰减反转恢复序列（Fluid Attenuated Inversion Recovery, FLAIR）MRI作为第三模态，扩展了原有的双模态框架。基于TCGA-GBMLGG队列（664例患者），我们评估了三种单模态模型、九种双模态配置及三种三模态配置，涵盖早期融合、晚期融合与联合融合策略。在此小样本队列中，三模态早期融合取得了探索性综合评分（Composite Score, CS = 0.854），相较于相同患者群体的双模态基线实现了+0.011的受控ΔCS提升，但该差异未达到统计学显著性（p = 0.250，置换检验）。MRI单模态表现出合理的判别能力（CS = 0.755），虽未显著提升双模态组合性能，但在三模态组合中提供了可测量的增益。所有包含MRI的实验均受限于19例测试患者，导致自助法置信区间较宽（例如[0.400, 1.000]），无法得出确定性结论。这些发现提供了初步证据：即使在有限样本量下，第三成像模态仍可能增加预后价值，且额外模态需要充分的多模态上下文才能有效发挥作用。

Abstract

Multimodal deep learning has improved prognostic accuracy for brain tumours by integrating histopathology and genomic data, yet the contribution of volumetric MRI within unified survival frameworks remains unexplored. This pilot study extends a bimodal framework by incorporating Fluid Attenuated Inversion Recovery (FLAIR) MRI from BraTS2021 as a third modality. Using the TCGA-GBMLGG cohort (664 patients), we evaluate three unimodal models, nine bimodal configurations, and three trimodal configurations across early, late, and joint fusion strategies. In this small cohort setting, trimodal early fusion achieves an exploratory Composite Score (CS = 0.854), with a controlled $Δ$CS of +0.011 over the bimodal baseline on identical patients, though this difference is not statistically significant (p = 0.250, permutation test). MRI achieves reasonable unimodal discrimination (CS = 0.755) but does not substantially improve bimodal pairs, while providing measurable uplift in the three-way combination. All MRI containing experiments are constrained to 19 test patients, yielding wide bootstrap confidence intervals (e.g. [0.400,1.000]) that preclude definitive conclusions. These findings provide preliminary evidence that a third imaging modality may add prognostic value even with limited sample sizes, and that additional modalities require sufficient multimodal context to contribute effectively.

Keywords: multimodal deep learning, glioma survival prediction, histopathology, gene expression, MRI, trimodal fusion, prognostic accuracy, TCGA-GBMLGG
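
The abstract above reports a permutation test (p = 0.250) on a score difference measured on identical patients. The paper's exact procedure is not given here, so the following is only a minimal sketch of one common variant, a one-sided paired sign-flip permutation test. It assumes that a per-patient score is available for each model, which may not hold for the paper's Composite Score; the function name `paired_permutation_pvalue` and its parameters are hypothetical.

```python
import random


def paired_permutation_pvalue(scores_a, scores_b, n_permutations=10_000, seed=0):
    """One-sided sign-flip permutation test for mean(scores_a - scores_b) > 0.

    Assumption: scores_a[i] and scores_b[i] are per-patient scores of two
    models on the same patient i (hypothetical inputs, not the paper's data).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must be scored on the same patients")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs) / len(diffs)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the two model labels are exchangeable,
        # so each paired difference keeps or flips its sign with probability 1/2.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if permuted >= observed:
            hits += 1
    # The add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_permutations + 1)
```

With only 19 paired observations, a small mean difference is easily matched by random sign flips, which is consistent with the abstract's caution that the reported gain does not reach significance.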

33. ❌ Physiological and Semantic Patterns in Medical Teams Using an Intelligent Tutoring System

Authors: Xiaoshan Huang, Conrad Borchers, Jiayi Zhang, Susanne P. Lajoie Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29950v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

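Scoring rationale: The paper studies the physiological and conversational dynamics of medical teams using an intelligent tutoring system. As an application of AI in science (medical education), it has some relevance to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points). It does not involve large models, deep learning principles, or any specific technique named in the other keywords (LLMs, MoE, training methods, inference techniques, and so on), so all other keywords score 0.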

!!! tip deepseek-chat TL;DR

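The study analyzes physiological synchrony and semantic change in dialogue while medical teams diagnose a virtual patient case in an intelligent tutoring system. It finds that high physiological synchrony is associated with low semantic similarity, and qualitative analysis characterizes the synchrony peaks as "pivotal moments": shared discovery in successful teams and shared uncertainty in unsuccessful ones.

Abstract (Chinese translation)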

有效的协作要求团队通过社会共享学习调节（Socially Shared Regulation of Learning, SSRL）来管理复杂的认知与情绪状态。生理同步性（即生理信号在时间上的对齐）能够反映这些状态，但其本身难以独立解读。本研究利用智能辅导系统，探究了四组医学二人团队在诊断虚拟病例过程中的生理与对话动态。研究发现，对话中的语义转变与瞬时的生理同步峰值相关。我们同时对话语片段进行了SSRL编码，并利用句子嵌入计算了余弦相似度。结果显示，激活先验知识阶段的语义相似度显著低于简单的任务执行阶段。较高的生理同步性伴随着较低的语义相似度，表明此类时刻涉及探索性且多样化的语言使用。定性分析将这些同步峰值交叉验证为“关键转折点”：成功的团队在共同发现时出现同步，而不成功的团队则在共同陷入不确定时达到峰值。本研究通过展示如何将生物信号与对话融合以理解问题解决中的关键时刻，推动了以人为中心的人工智能发展。

Abstract

Effective collaboration requires teams to manage complex cognitive and emotional states through Socially Shared Regulation of Learning (SSRL). Physiological synchrony (i.e., longitudinal alignment in physiological signals) can indicate these states, but is hard to interpret on its own. We investigate the physiological and conversational dynamics of four medical dyads diagnosing a virtual patient case using an intelligent tutoring system. Semantic shifts in dialogue were correlated with transient physiological synchrony peaks. We also coded utterance segments for SSRL and derived cosine similarity using sentence embeddings. The results showed that activating prior knowledge featured significantly lower semantic similarity than simpler task execution. High physiological synchrony was associated with lower semantic similarity, suggesting that such moments involve exploratory and varied language use. Qualitative analysis triangulated these synchrony peaks as "pivotal moments": successful teams synchronized during shared discovery, while unsuccessful teams peaked during shared uncertainty. This research advances human-centered AI by demonstrating how biological signals can be fused with dialogues to understand critical moments in problem solving.

Keywords: physiological synchrony, medical teams, intelligent tutoring system, socially shared regulation of learning, semantic similarity, dialogue analysis, problem solving, human-centered AI
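
The abstract above derives cosine similarity from sentence embeddings of utterance segments. The embedding model and the exact pairing of segments are not stated here, so this is only an illustrative sketch: it assumes the embeddings are already available as plain lists of floats and that consecutive segments are compared. The helper names `cosine_similarity` and `consecutive_similarities` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def consecutive_similarities(embeddings):
    """Similarity of each utterance embedding to the next one in the dialogue.

    Assumption: comparing neighbouring segments is one plausible way to
    surface semantic shifts; the paper may pair segments differently.
    """
    return [cosine_similarity(a, b) for a, b in zip(embeddings, embeddings[1:])]
```

Under this reading, a drop in the consecutive similarity marks a semantic shift, the kind of moment the study relates to physiological synchrony peaks.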

34. ❌ Four Generations of Quantum Biomedical Sensors

Authors: Xin Jin, Priyam Srivastava, Ronghe Wang, Yuqing Li, Jonathan Beaumariage, Tom Purdy, M. V. Gurudev Dutt, Kang Kim, Kaushik Seshadreesan, Junyu Liu Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29944v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

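Scoring rationale: The paper focuses on quantum biomedical sensor technology and proposes a four-generation classification framework covering quantum coherence, entanglement, spin squeezing, and quantum learning. Its content is unrelated to the vast majority of keywords (large models, deep learning techniques, training methods, inference optimization, and so on). The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': biomedical sensing is a potential application area for AI for Science, but the paper does not directly discuss AI or machine learning techniques, so it receives 5 points (some relevance).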

!!! tip deepseek-chat TL;DR

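The paper proposes a four-generation classification framework for quantum biomedical sensors based on how they use quantum resources, analyzes technological bottlenecks, and lays out a roadmap from measuring physical observables to extracting structured biological information with quantum-enhanced intelligence.

Abstract (Chinese translation)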

量子传感技术为超灵敏生物医学传感提供了变革性潜力，但其临床转化仍受限于经典噪声极限和对宏观系综的依赖。我们提出一个统一的代际框架，依据量子资源的利用方式对不断发展的量子生物传感器领域进行系统性梳理。第一代设备利用离散能级进行信号转换，但仍遵循经典标度律。第二代传感器利用量子相干性达到标准量子极限，而第三代架构则借助纠缠与自旋压缩逼近海森堡极限精度。我们进一步界定了一个新兴的第四代范式，其特征是量子传感与量子学习及变分电路的端到端集成，从而实现在量子域内直接进行自适应推理。通过分析带宽匹配和传感器-组织邻近度等关键参数，我们识别了主要技术瓶颈，并提出了从测量物理观测量向利用量子增强智能提取结构化生物信息过渡的技术路线图。

Abstract

Quantum sensing technologies offer transformative potential for ultra-sensitive biomedical sensing, yet their clinical translation remains constrained by classical noise limits and a reliance on macroscopic ensembles. We propose a unifying generational framework to organize the evolving landscape of quantum biosensors based on their utilization of quantum resources. First-generation devices utilize discrete energy levels for signal transduction but follow classical scaling laws. Second-generation sensors exploit quantum coherence to reach the standard quantum limit, while third-generation architectures leverage entanglement and spin squeezing to approach Heisenberg-limited precision. We further define an emerging fourth generation characterized by the end-to-end integration of quantum sensing with quantum learning and variational circuits, enabling adaptive inference directly within the quantum domain. By analyzing critical parameters such as bandwidth matching and sensor-tissue proximity, we identify key technological bottlenecks and propose a roadmap for transitioning from measuring physical observables to extracting structured biological information with quantum-enhanced intelligence.

Keywords: Quantum sensing, Biomedical sensors, Quantum coherence, Entanglement, Spin squeezing, Quantum learning, Variational circuits, Heisenberg limit
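
The abstract above contrasts the standard quantum limit with Heisenberg-limited precision. As a reminder of what separates the second and third generations, the textbook scalings for estimating a parameter with $N$ probes are sketched below. The symbols $\theta$ and $N$ are introduced here for illustration and are not taken from the paper, which may use different notation or figures of merit.

```latex
\Delta\theta_{\mathrm{SQL}} \propto \frac{1}{\sqrt{N}},
\qquad
\Delta\theta_{\mathrm{HL}} \propto \frac{1}{N},
\qquad
\frac{\Delta\theta_{\mathrm{SQL}}}{\Delta\theta_{\mathrm{HL}}} \propto \sqrt{N}
```

For example, with $N = 10^{4}$ probes the ideal gain from entanglement-based sensing is a factor of about $\sqrt{N} = 100$ in precision, before accounting for noise and decoherence.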

35. ❌ Structured Intent as a Protocol-Like Communication Layer: Cross-Model Robustness, Framework Comparison, and the Weak-Model Compensation Effect

Authors: Peng Gang Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29953v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

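Scoring rationale: The paper studies structured intent representations (such as the 5W3H, CO-STAR, and RISEN frameworks) as a communication layer for human-AI interaction, evaluating cross-model robustness, framework comparison, and the weak-model compensation effect on LLMs including Claude, GPT-4o, and Gemini. It is therefore highly relevant to 'Large Language Models' (10 points), since it directly tests several LLMs, and somewhat relevant to 'Instruction Tuning OR Alignment OR Value Alignment' (8 points), since it concerns prompt engineering and alignment with user goals, although it does not cover instruction tuning or alignment training of the models themselves. Other keywords such as MoE, SLMs, Scaling Laws, and Pre-training are not addressed and score 0.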

!!! tip deepseek-chat TL;DR

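This paper studies structured intent representations (such as the 5W3H framework) as a protocol-like communication layer for human-AI interaction. Cross-model, cross-language, and cross-framework experiments show that structured prompts substantially reduce cross-language score variance and reveal a weak-model compensation effect. In a separate user study, AI-expanded prompts cut interaction rounds by 60% and raised user satisfaction from 3.16 to 4.04.

Abstract (Chinese translation)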

结构化意图表征能在多大程度上可靠地跨不同AI模型、语言和提示框架保持用户目标？先前研究表明，基于5W3H框架的结构化意图规范PPS（提示协议规范）能提升中文语境下的目标对齐度，并可推广至英语和日语。本文从三个方向拓展该研究脉络：在Claude、GPT-4o和Gemini 2.5 Pro间的跨模型鲁棒性验证；与CO-STAR及RISEN框架的受控对比；以及在生态效度场景下对AI辅助意图扩展的用户研究（N=50）。通过对3,240份模型输出（3种语言×6种条件×3个模型×3个领域×20项任务）由独立评判系统（DeepSeek-V3）评估，我们发现相较于非结构化基线，结构化提示能大幅降低跨语言分数方差；最优结构化条件将跨语言标准差从0.470降至约0.020。同时观察到弱模型补偿现象：基线表现最弱的模型（Gemini）获得更大的D-A增益（+1.006），而最强模型（Claude）增益较小（+0.217）。在当前评估分辨率下，5W3H、CO-STAR和RISEN均取得相似的高目标对齐分数，表明维度分解本身是重要的活性成分。用户研究中，经AI扩展的5W3H提示将交互轮次减少60%，用户满意度从3.16提升至4.04。这些发现支持了结构化意图表征作为人机交互中稳健的类协议通信层的实践价值。

Abstract

How reliably can structured intent representations preserve user goals across different AI models, languages, and prompting frameworks? Prior work showed that PPS (Prompt Protocol Specification), a 5W3H-based structured intent framework, improves goal alignment in Chinese and generalizes to English and Japanese. This paper extends that line of inquiry in three directions: cross-model robustness across Claude, GPT-4o, and Gemini 2.5 Pro; controlled comparison with CO-STAR and RISEN; and a user study (N=50) of AI-assisted intent expansion in ecologically valid settings. Across 3,240 model outputs (3 languages x 6 conditions x 3 models x 3 domains x 20 tasks), evaluated by an independent judge (DeepSeek-V3), we find that structured prompting substantially reduces cross-language score variance relative to unstructured baselines. The strongest structured conditions reduce cross-language sigma from 0.470 to about 0.020. We also observe a weak-model compensation pattern: the lowest-baseline model (Gemini) shows a much larger D-A gain (+1.006) than the strongest model (Claude, +0.217). Under the current evaluation resolution, 5W3H, CO-STAR, and RISEN achieve similarly high goal-alignment scores, suggesting that dimensional decomposition itself is an important active ingredient. In the user study, AI-expanded 5W3H prompts reduce interaction rounds by 60 percent and increase user satisfaction from 3.16 to 4.04. These findings support the practical value of structured intent representation as a robust, protocol-like communication layer for human-AI interaction.

Keywords: structured intent, prompting frameworks, cross-model robustness, goal alignment, 5W3H, CO-STAR, RISEN, weak-model compensation
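
The abstract above gives the evaluation design as a product of five factors and summarizes robustness as a cross-language sigma. The sketch below checks the stated output count and shows one way such a sigma could be computed. It assumes the population standard deviation over per-language mean scores, which the abstract does not confirm; the function `cross_language_sigma` and any scores passed to it are hypothetical.

```python
import math
import statistics

# Factor sizes as stated in the abstract.
design = {"languages": 3, "conditions": 6, "models": 3, "domains": 3, "tasks": 20}

# Total number of model outputs in the full factorial design.
total_outputs = math.prod(design.values())


def cross_language_sigma(mean_score_by_language):
    """Spread of mean goal-alignment scores across languages.

    Assumption: population standard deviation; the paper may instead use
    the sample standard deviation or a different aggregation.
    """
    return statistics.pstdev(list(mean_score_by_language.values()))
```

The product evaluates to 3,240, matching the number of outputs reported in the abstract. A sigma near zero means a condition scores almost identically in every language.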

36. ❌ Rethinking AI Literacy Education in Higher Education: Bridging Risk Perception and Responsible Adoption

Authors: Shasha Yu, Fiona Carroll, Barry L. Bentley Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29935v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

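Scoring rationale: The paper studies AI literacy education, risk perception, and responsible adoption. It belongs to AI education and societal impact, involves no innovation in or specific application of large-model or deep learning technology, and is unrelated to every technical keyword.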

!!! tip deepseek-chat TL;DR

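Surveying 139 students on their perception of AI risks and their willingness to adopt AI, the study finds that students are more concerned about explicit risks than abstract ones, that perceived risk and adoption willingness are inversely related, that technical education narrows gender differences in risk awareness while male students still report higher adoption willingness, and that students in AI-related specializations underappreciate risk. It stresses the need for differentiated AI literacy strategies that bridge the gap between awareness and responsible adoption.

Abstract (Chinese translation)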

随着人工智能日益深度融入社会各领域，理解未来人工智能从业者——尤其是技术专业学生——对其风险的认知，对于负责任的人工智能开发与应用至关重要。本研究通过显性人工智能风险评级与基于场景的风险及采用意愿评估，对139名计算机科学、数据科学/数据分析及其他专业学生的反馈进行了分析。主要发现如下：（1）学生对具体、明确陈述的风险表现出远高于对抽象或场景嵌入风险的关注度；（2）风险感知与人工智能采用意愿呈现明显的负相关关系；（3）尽管技术教育缩小了风险认知的性别差异，但男性学生仍表现出更高的采用意愿；（4）研究观察到一种“风险低估”现象：人工智能相关专业的学生同时表现出更高的显性风险意识和更高的人工智能采用意愿，尽管他们在应用场景中的风险识别能力较低。这些发现强调，需要采取差异化的AI素养（AI literacy）培养策略，以弥合风险认知与负责任应用之间的鸿沟，并为致力于培养具备伦理意识和社会责任感的人工智能从业者的教育者、政策制定者、行业领袖及学术机构提供了重要启示。

Abstract

As AI becomes increasingly embedded across societal domains, understanding how future AI practitioners, particularly technology students, perceive its risks is essential for responsible development and adoption. This study analyzed responses from 139 students in Computer Science, Data Science/Data Analytics, and other disciplines using both explicit AI risk ratings and scenario-based assessments of risk and adoption willingness. Four key findings emerged: (1) Students expressed substantially higher concern for concrete, explicitly stated risks than for abstract or scenario-embedded risks; (2) Perceived risk and willingness to adopt AI demonstrated a clear inverse relationship; (3) Although technical education narrowed gender differences in risk awareness, male students reported higher adoption willingness; and (4) A form of “risk underappreciation” was observed, wherein students in AI-related specializations showed both elevated explicit risk awareness and higher willingness to adopt AI, despite lower recognition of risks in applied scenarios. These findings underscore the need for differentiated AI literacy strategies that bridge the gap between awareness and responsible adoption and offer valuable insights for educators, policymakers, industry leaders, and academic institutions aiming to cultivate ethically informed and socially responsible AI practitioners.

Keywords: AI literacy education, risk perception, responsible adoption, higher education, technology students, scenario-based assessment, gender differences, risk underappreciation

37. ❌ Bethe Ansatz with a Large Language Model

Authors: Balázs Pozsgay, István Vona Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29932v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

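Scoring rationale: The paper uses an LLM (ChatGPT) to solve Bethe Ansatz computations in mathematical physics, a novel application of large models in science. It is therefore highly relevant to 'Large Language Models' and 'AI for Science' (10 points each). The LLM carries out multi-step derivations that involve reasoning, which gives some relevance to 'Chain of Thought' and 'System 2 Thinking' (5 points each). The LLM's mistakes are corrected under human supervision, an element of 'Self-Correction' (5 points). The LLM acts as a computational agent carrying out the task, which relates to the 'LLM Agents' concept (5 points). Other keywords such as MoE, SFT, and RAG are not addressed and score 0.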

!!! tip deepseek-chat TL;DR

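The paper explores the use of large language models (LLMs) in mathematical physics. ChatGPT was used to compute Bethe Ansatz solutions for integrable spin chain models, two of them new, with mistakes along the way corrected by the human researchers. The results illustrate the potential of LLMs for complex scientific computation.

Abstract (Chinese translation)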

我们探究了大语言模型（LLM）在数学物理领域执行特定计算任务的能力：其任务是计算选定可积自旋链模型的坐标Bethe Ansatz解。我们选取了三个尚未发表解的可积哈密顿量，其中两个哈密顿量实际上是全新的。我们观察到，LLM在所有案例中均以半自主方式完成了求解，过程中出现了一些错误。这些错误在研究人员发现后得以修正。LLM得出的结果已通过精确对角化（由独立程序执行）进行验证，其推导过程也经作者核查。这些Bethe Ansatz解本身具有研究价值：我们的第二个模型明显破坏了左右对称性，但具有PT对称性，因此其解可能在广义流体动力学中有应用前景；而第三个模型则通过一种特殊形式的嵌套Bethe Ansatz求解，该模型虽具有相互作用，但其嵌套层级呈现缺乏$U(1)$对称性的自由费米子结构。这种结构似乎是独特的，且由LLM自主发现。本研究使用了OpenAI开发的ChatGPT 5.2 Pro和5.4 Pro版本。

Abstract

We explore the capability of a Large Language Model (LLM) to perform specific computations in mathematical physics: the task is to compute the coordinate Bethe Ansatz solution of selected integrable spin chain models. We select three integrable Hamiltonians for which the solutions were unpublished; two of the Hamiltonians are actually new. We observed that the LLM semi-autonomously solved the task in all cases, with a few mistakes along the way. These were corrected after the human researchers spotted them. The results of the LLM were checked against exact diagonalization (performed by separate programs), and the derivations were also checked by the authors. The Bethe Ansatz solutions are interesting in themselves. Our second model manifestly breaks left-right invariance, but it is PT-symmetric, therefore its solution could be interesting for applications in Generalized Hydrodynamics. And our third model is solved by a special form of the nested Bethe Ansatz, where the model is interacting, but the nesting level has a free fermionic structure lacking $U(1)$-invariance. This structure appears to be unique and it was found by the LLM. We used ChatGPT 5.2 Pro and 5.4 Pro by OpenAI.

Keywords: Large Language Model, Bethe Ansatz, Integrable spin chain, Mathematical physics, ChatGPT, Scientific computation, AI for science, PT-symmetry

38. ❌ ScoringBench: A Benchmark for Evaluating Tabular Foundation Models with Proper Scoring Rules

Authors: Jonas Landsgesell, Pascal Knoll Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29928v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

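Scoring rationale: The paper focuses on evaluating tabular foundation models (such as TabPFN and TabICL), which falls under applications of large models in science (particularly data science and bioinformatics). Its core contribution is the ScoringBench benchmark, which evaluates probabilistic forecast quality with proper scoring rules and studies the effect of different fine-tuning objectives. It is therefore highly relevant to 'Foundation Models', 'Supervised Fine-tuning', and 'AI for Science' (8 points each) and somewhat relevant to 'Pre-training' and 'In-context Learning' (5 points each). The remaining keywords (MoE, RLHF, Agents, and so on) are not addressed and score 0.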

!!! tip deepseek-chat TL;DR

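The paper addresses the inadequate evaluation of tabular foundation models in high-stakes domains such as finance and clinical research. It proposes the ScoringBench benchmark, which assesses probabilistic forecast quality with a comprehensive set of proper scoring rules, and finds that model rankings and the best pretraining objective depend on the chosen scoring rule, underscoring the importance of metric selection.

Abstract (Chinese translation)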

诸如TabPFN和TabICL等表格基础模型已能生成完整的预测分布，然而主流的回归基准测试几乎仅通过点估计指标（如RMSE、R²）来评估它们。这些聚合性指标常常掩盖了模型在分布尾部的性能表现，这对于金融和临床研究等高风险决策领域是一个严重缺陷，因为这些领域通常存在不对称的风险特征。我们推出了ScoringBench——一个开放的基准测试框架，它计算包括CRPS、CRLS、区间评分、能量评分、加权CRPS和Brier评分在内的一系列严格评分规则，同时结合标准点估计指标，从而更全面地评估概率预测的质量。我们在一系列回归基准测试中，评估了经不同评分规则目标微调的realTabPFNv2.5和TabICL，并与未经微调的realTabPFNv2.5进行了比较。研究结果证实，模型排名依赖于所选的评分规则，且没有任何单一的预训练目标是普遍最优的。这表明，对于极端事件敏感的应用场景，评估指标的选择与数据本身一样，都是领域特定的要求。ScoringBench可通过https://github.com/jonaslandsgesell/ScoringBench获取。当前排行榜的实时预览地址为https://scoringbench.bolt.host。该排行榜通过git拉取请求进行维护，以确保透明度、可追溯性、灵活性和可复现性。

Abstract

Tabular foundation models such as TabPFN and TabICL already produce full predictive distributions, yet prevailing regression benchmarks evaluate them almost exclusively via point-estimate metrics (RMSE, R²). These aggregate measures often obscure model performance in the tails of the distribution, a critical deficit for high-stakes decision making in domains like finance and clinical research, where asymmetric risk profiles are the norm. We introduce ScoringBench, an open benchmark that computes a comprehensive suite of proper scoring rules like CRPS, CRLS, Interval Score, Energy Score, weighted CRPS, and Brier Score alongside standard point metrics, providing a richer picture of probabilistic forecast quality. We evaluate realTabPFNv2.5 fine-tuned with different scoring rule objectives and TabICL relative to untuned realTabPFNv2.5 across a suite of regression benchmarks. Our results confirm that model rankings depend on the chosen scoring rule and that no single pretraining objective is universally optimal. This demonstrates that for applications sensitive to extreme events, the choice of evaluation metric is as much a domain-specific requirement as the data itself. ScoringBench is available at https://github.com/jonaslandsgesell/ScoringBench. A live preview of the current leaderboard is available at https://scoringbench.bolt.host. The leaderboard is maintained via git pull requests to ensure transparency, traceability, agility, and reproducibility.

Keywords: Tabular Foundation Models, ScoringBench, Proper Scoring Rules, Probabilistic Forecasting, Model Evaluation, Fine-tuning, Regression Benchmarks, High-stakes Decision Making
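
The abstract above lists CRPS among the proper scoring rules used to judge full predictive distributions. To make the contrast with point metrics concrete, here is a minimal sketch of the standard sample-based CRPS estimator, E|X - y| - 0.5 E|X - X'|, for a forecast represented by draws from the predictive distribution. This is not ScoringBench's implementation, whose interface is not described here; the function name `crps_from_samples` is hypothetical.

```python
def crps_from_samples(samples, y):
    """Sample-based CRPS estimate for a predictive distribution.

    samples: draws from the forecast distribution (assumed available)
    y:       the observed outcome
    Lower is better; with a single sample the score reduces to absolute error.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("at least one forecast sample is required")
    # Accuracy term: average distance between forecast draws and the outcome.
    accuracy = sum(abs(x - y) for x in samples) / n
    # Spread term: average distance between pairs of forecast draws.
    spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return accuracy - 0.5 * spread
```

Unlike RMSE on a point prediction, this score depends on the whole forecast distribution: it rewards forecasts that concentrate mass near the outcome and penalizes both misplaced and needlessly wide distributions.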

39. ❌ End-to-End Image Compression with Segmentation Guided Dual Coding for Wind Turbines

Authors: Raül Pérez-Gonzalo, Andreas Espersen, Søren Forchhammer, Antonio Agudo Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29927v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

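Scoring rationale: The paper focuses on a deep learning application for compressing wind turbine images and is unrelated to the vast majority of keywords (large-model principles, training methods, inference optimization, alignment, agent systems, and so on). It has only some relevance to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points): it is a specific application of AI to industrial inspection (which can be viewed as a science or engineering application), but it is not in core bioinformatics or cheminformatics and does not involve large models.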

!!! tip deepseek-chat TL;DR

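The paper proposes an end-to-end deep learning framework for wind turbine inspection that combines segmentation with dual-mode (lossy and lossless) compression, compressing images efficiently while preserving high fidelity in blade regions and offering a practical solution for automated inspection.

Abstract (Chinese translation)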

在风力涡轮机巡检过程中，传输大量高分辨率图像会引入评估与检测严重缺陷的瓶颈。高效的编码必须在保持叶片区域高保真度的同时对背景进行大幅压缩。本研究提出一种端到端的深度学习框架，该框架联合执行分割与双模式（有损与无损）压缩。分割模块精确识别叶片区域，随后我们的感兴趣区域（ROI）压缩器以优于图像其余部分的质量对该区域进行编码。与仅向显著区域分配更多比特的传统ROI方案不同，本框架整合了：（i）采用CRF正则化损失以实现精确叶片定位的鲁棒分割网络（BU-Netv2+P），（ii）为有损压缩优化的基于超先验的自编码器，以及（iii）采用分层模型以实现完全无损叶片重建的扩展比特回退编码器。此外，我们的ROI框架通过复用背景编码比特，消除了比特回退编码中的顺序依赖性，实现了并行高效的双模式压缩。据我们所知，这是首个完全集成的基于学习的ROI编解码器，它结合了分割、有损与无损压缩，确保后续缺陷检测不受影响。在大规模风力涡轮机数据集上的实验证明了其优越的压缩性能与效率，为自动化巡检提供了实用解决方案。

Abstract

Transferring large volumes of high-resolution images during wind turbine inspections introduces a bottleneck in assessing and detecting severe defects. Efficient coding must preserve high fidelity in blade regions while aggressively compressing the background. In this work, we propose an end-to-end deep learning framework that jointly performs segmentation and dual-mode (lossy and lossless) compression. The segmentation module accurately identifies the blade region, after which our region-of-interest (ROI) compressor encodes it at superior quality compared to the rest of the image. Unlike conventional ROI schemes that merely allocate more bits to salient areas, our framework integrates: (i) a robust segmentation network (BU-Netv2+P) with a CRF-regularized loss for precise blade localization, (ii) a hyperprior-based autoencoder optimized for lossy compression, and (iii) an extended bits-back coder with hierarchical models for fully lossless blade reconstruction. Furthermore, our ROI framework removes the sequential dependency in bits-back coding by reusing background-coded bits, enabling parallelized and efficient dual-mode compression. To the best of our knowledge, this is the first fully integrated learning-based ROI codec combining segmentation, lossy, and lossless compression, ensuring that subsequent defect detection is not compromised. Experiments on a large-scale wind turbine dataset demonstrate superior compression performance and efficiency, offering a practical solution for automated inspections.

Keywords: image compression, wind turbine inspection, deep learning, segmentation, region-of-interest coding, lossy compression, lossless compression, autoencoder

40. ❌ Uncertainty Gating for Cost-Aware Explainable Artificial Intelligence

Authors: Georgii Mikriukov, Grégoire Montavon, Marina M.-C. Höhne | Source: arXiv | Published: 2026-03-31 | arXiv link: http://arxiv.org/abs/2603.29915v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

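Scoring rationale: The paper focuses on explainable AI (XAI). It studies the reliability of post-hoc explanation methods and proposes epistemic uncertainty as a low-cost proxy for explanation reliability. Its content is unrelated to most of the keywords (large models, training methods, inference optimization, AI agents, and so on); it is highly relevant only to 'Mechanistic Interpretability OR Explainable AI', which is the paper's core research area. The paper does not involve large models, innovations in the technical principles of deep learning, or applications in scientific domains, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The paper proposes epistemic uncertainty as a low-cost proxy for assessing the reliability of post-hoc explanation methods. Its experiments show a strong negative correlation between epistemic uncertainty and explanation stability, and demonstrate the proxy in two use cases: improving worst-case explanations and recalling high-quality explanations.

Abstract (Chinese translation)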

后验解释方法被广泛用于解释黑盒预测，但其生成过程通常计算成本高昂且可靠性难以保证。我们提出将认知不确定性作为解释可靠性的低成本代理指标：高认知不确定性标识出决策边界定义模糊的区域，这些区域的解释往往不稳定且缺乏忠实性。这一洞见催生了两种互补的应用场景：“改进最差情况解释”（根据预期解释可靠性将样本分流至廉价或昂贵的XAI方法）和“召回高质量解释”（在有限预算下推迟对不确定性样本的解释生成）。通过在四个表格数据集、五种不同架构模型和四种XAI方法上的实验，我们观察到认知不确定性与解释稳定性之间存在显著负相关。进一步分析表明，认知不确定性不仅能区分稳定与不稳定的解释，还能辨别忠实与非忠实的解释。图像分类实验证实了我们的发现可推广至表格数据之外的领域。

Abstract

Post-hoc explanation methods are widely used to interpret black-box predictions, but their generation is often computationally expensive and their reliability is not guaranteed. We propose epistemic uncertainty as a low-cost proxy for explanation reliability: high epistemic uncertainty identifies regions where the decision boundary is poorly defined and where explanations become unstable and unfaithful. This insight enables two complementary use cases: "improving worst-case explanations" (routing samples to cheap or expensive XAI methods based on expected explanation reliability), and "recalling high-quality explanations" (deferring explanation generation for uncertain samples under constrained budget). Across four tabular datasets, five diverse architectures, and four XAI methods, we observe a strong negative correlation between epistemic uncertainty and explanation stability. Further analysis shows that epistemic uncertainty distinguishes not only stable from unstable explanations, but also faithful from unfaithful ones. Experiments on image classification confirm that our findings generalize beyond tabular data.
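
The two use cases in this abstract can be illustrated with a minimal sketch. It rests on assumptions the abstract does not state: epistemic uncertainty is estimated as the variance of predicted probabilities across a small ensemble, uncertain samples are sent to the expensive explainer, and the threshold `tau`, the sample names, and the function names are all hypothetical. It is not the authors' implementation.

```python
from statistics import pvariance


def epistemic_uncertainty(member_probs):
    """Disagreement between ensemble members: variance of their predicted probabilities."""
    return pvariance(member_probs)


def route(member_probs, tau=0.01):
    """Assumed policy: uncertain samples go to the expensive explainer, the rest to the cheap one."""
    return "expensive" if epistemic_uncertainty(member_probs) > tau else "cheap"


def recall_within_budget(samples, budget):
    """Explain the `budget` least-uncertain samples now and defer the others."""
    ranked = sorted(samples, key=lambda name: epistemic_uncertainty(samples[name]))
    return ranked[:budget], ranked[budget:]


# Hypothetical positive-class probabilities from a three-member ensemble.
samples = {
    "a": [0.91, 0.90, 0.92],  # members agree: low epistemic uncertainty
    "b": [0.20, 0.80, 0.55],  # members disagree: high epistemic uncertainty
    "c": [0.10, 0.12, 0.11],  # members agree: low epistemic uncertainty
}
routes = {name: route(probs) for name, probs in samples.items()}
explained, deferred = recall_within_budget(samples, budget=2)
print(routes)
print(explained, deferred)
```

The gate costs only a few extra forward passes, which is the sense in which uncertainty could serve as a cheap proxy before any explanation is generated.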

Keywords: Explainable AI, Post-hoc explanations, Epistemic uncertainty, Explanation reliability, Decision boundary, Explanation stability, Faithfulness, Cost-aware XAI

41. ❌ Training deep learning based dynamic MR image reconstruction using synthetic fractals

Authors: Anirudh Raman, Olivier Jaubert, Mark Wrobel, Tina Yao, Ruaraidh Campbell, Rebecca Baker, Ruta Virsinskaite, Daniel Knight, Michael Quail, Jennifer Steeden, Vivek Muthurangu | Source: arXiv | Published: 2026-03-31 | arXiv link: http://arxiv.org/abs/2603.29922v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

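Scoring rationale: The paper trains deep learning models for dynamic MRI reconstruction on synthetic fractal data, which is a medical-imaging AI application. All of the keywords target large-model principles, training methods, inference optimization, agent systems, and similar topics, whereas this paper only applies conventional deep learning (a 3D UNet) to a specific medical-imaging task and involves no large-model techniques such as LLMs, MoE, SFT, RLHF, RAG, inference acceleration, or agents. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because medical-imaging AI can be viewed as an application of AI to science (medicine); this is not the paper's core contribution, so it receives 5 points (some relevance). All other keywords are unrelated and score 0.

!!! tip deepseek-chat TL;DR

The study explores training deep learning models for dynamic MRI reconstruction on synthetic fractal data instead of real cardiac MRI data. Models trained on fractal data perform comparably to models trained on real data in both image quality and clinical measurements, offering a scalable alternative data source for medical image reconstruction.

Abstract (Chinese translation)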

目的：探究合成生成的分形数据能否用于训练动态磁共振成像重建的深度学习模型，从而规避心脏磁共振训练数据集相关的隐私、许可及可获取性限制。方法：利用四元数朱利亚分形生成二维+时间图像以构建训练数据集。通过模拟多线圈磁共振采集，生成配对的完全采样与径向欠采样k空间数据。使用该分形数据训练三维UNet深度伪影抑制模型，并与基于心脏磁共振数据训练的相同模型进行对比。两种模型均在10例患者前瞻性采集的径向实时心脏磁共振数据上进行评估。重建结果与压缩感知及低秩深度图像先验方法进行比较。所有重建图像均进行图像质量评分，同时将心室容积与射血分数与参考的屏气电影磁共振成像数据进行对比。结果：分形数据训练的深度学习模型与心脏磁共振数据训练的模型在定性评分上无显著差异，两者均优于压缩感知与低秩深度图像先验方法。基于分形数据训练模型获得的心室容积和功能参数与心脏磁共振数据训练模型结果相近，相较于参考电影成像无显著偏差且一致性界限可接受；而低秩深度图像先验方法存在显著偏差且一致性界限更宽。结论：使用合成分形数据训练的深度学习模型能够重建实时心脏磁共振图像，其图像质量与临床测量指标与基于真实心脏磁共振数据训练的模型相当。分形训练数据为临床数据集提供了一种开放、可扩展的替代方案，可能促进开发更具泛化能力的动态磁共振成像深度学习重建模型。

Abstract

Purpose: To investigate whether synthetically generated fractal data can be used to train deep learning (DL) models for dynamic MRI reconstruction, thereby avoiding the privacy, licensing, and availability limitations associated with cardiac MR training datasets. Methods: A training dataset was generated using quaternion Julia fractals to produce 2D+time images. Multi-coil MRI acquisition was simulated to generate paired fully sampled and radially undersampled k-space data. A 3D UNet deep artefact suppression model was trained using these fractal data (F-DL) and compared with an identical model trained on cardiac MRI data (CMR-DL). Both models were evaluated on prospectively acquired radial real-time cardiac MRI from 10 patients. Reconstructions were compared against compressed sensing (CS) and low-rank deep image prior (LR-DIP). All reconstructions were ranked for image quality, while ventricular volumes and ejection fraction were compared with reference breath-hold cine MRI. Results: There was no significant difference in qualitative ranking between F-DL and CMR-DL (p=0.9), while both outperformed CS and LR-DIP (p<0.001). Ventricular volumes and function derived from F-DL were similar to CMR-DL, showing no significant bias and acceptable limits of agreement compared to reference cine imaging. However, LR-DIP had a significant bias (p=0.016) and wider limits of agreement. Conclusion: DL models trained using synthetic fractal data can reconstruct real-time cardiac MRI with image quality and clinical measurements comparable to models trained on true cardiac MRI data. Fractal training data provide an open, scalable alternative to clinical datasets and may enable development of more generalisable DL reconstruction models for dynamic MRI.
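
As a rough illustration of how quaternion Julia fractals can yield 2D+time images, the sketch below iterates q ← q² + c and records escape times on a 2D slice, then moves the slice along a third coordinate to produce successive frames. The slicing scheme, the constant `C`, the grid size, and the iteration limits are assumptions made for illustration; the abstract does not describe the authors' actual generation procedure, and the MRI acquisition simulation is not shown.

```python
def quat_square(q):
    """Square of a quaternion (a, b, c, d) = a + bi + cj + dk."""
    a, b, c, d = q
    return (a * a - b * b - c * c - d * d, 2 * a * b, 2 * a * c, 2 * a * d)


def escape_time(q, c, max_iter=20, bailout=4.0):
    """Number of iterations of q <- q^2 + c before |q|^2 exceeds the bailout."""
    for i in range(max_iter):
        if sum(x * x for x in q) > bailout:
            return i
        q = tuple(s + k for s, k in zip(quat_square(q), c))
    return max_iter


def frame(c, z, size=16, extent=1.5):
    """One 2D image: a slice of the 4D set at third coordinate z (fourth fixed at 0)."""
    step = 2 * extent / (size - 1)
    return [
        [escape_time((-extent + i * step, -extent + j * step, z, 0.0), c)
         for i in range(size)]
        for j in range(size)
    ]


C = (-0.2, 0.6, 0.2, 0.0)  # hypothetical Julia constant
# Moving the slice position over "time" gives a short 2D+time sequence.
frames = [frame(C, z=0.1 * t) for t in range(4)]
print(len(frames), len(frames[0]), len(frames[0][0]))
```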

Keywords: deep learning, MRI reconstruction, synthetic fractal data, cardiac MRI, 3D UNet, dynamic imaging, medical imaging, data generation

42. ❌ SISA: A Scale-In Systolic Array for GEMM Acceleration

Authors: Luigi Altamura, Alessio Cicero, Mateo Vázquez Maceiras, Mohammad Ali Maleki, Pedro Trancoso | Source: arXiv | Published: 2026-03-31 | arXiv link: http://arxiv.org/abs/2603.29913v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

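Scoring rationale: SISA is a hardware accelerator architecture that aims to improve the execution efficiency of GEMM operations in LLMs. The abstract explicitly names LLMs as the main target workload, so the paper is highly relevant to 'Large Language Models' (10 points). Its new systolic array architecture achieves up to 8.52x speedup, which falls directly under 'Inference Acceleration' (10 points). The remaining keywords cover model training, alignment, reasoning methods, and application domains, none of which relate directly to this hardware architecture work, so they score 0.

!!! tip deepseek-chat TL;DR

The paper proposes SISA (Scale-In Systolic Array), a new systolic array architecture whose partitioned design addresses the low hardware utilization caused by input-dependent, highly skewed matrix shapes in LLMs. Compared with a conventional monolithic array, it achieves up to 8.52x speedup and a 93% reduction in energy-delay product.

Abstract (Chinese translation)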

当前主流的AI/ML工作负载，例如大语言模型（Large Language Models, LLMs），依赖于通用矩阵乘法（General Matrix-Matrix Multiplication, GEMM）操作的高效执行。因此，大多数系统都配备了基于处理单元（Processing Elements, PEs）构成的方形脉动阵列（Systolic Arrays, SAs）的专用矩阵硬件加速器。虽然这种架构对于传统的深度神经网络（Deep Neural Networks, DNNs）是有效的，但LLMs引入了输入依赖且高度不规则的矩阵，导致SA资源利用率不足。为应对这一挑战，我们提出了SISA（Scale-In Systolic Array），一种新颖的SA架构，它将传统的方形阵列划分为水平的矩形子阵列。SISA以极小的开销，通过独立调度的子阵列来暴露并行性，从而高效执行小型或不规则形状的矩阵运算，同时保留完整阵列操作以处理大型GEMM。与具有相同数量PE的先进单体式SA相比，SISA在代表性的LLMs上实现了高达8.52倍的加速和93%的能量延迟积（Energy-Delay Product, EDP）降低。

Abstract

The currently dominant AI/ML workloads, such as Large Language Models (LLMs), rely on the efficient execution of General Matrix-Matrix Multiplication (GEMM) operations. Thus, most systems are equipped with dedicated matrix hardware accelerators based on square Systolic Arrays (SAs) of Processing Elements (PEs). While this organization was effective for traditional Deep Neural Networks (DNNs), LLMs introduce input-dependent and highly skewed matrices, leading to underutilized SA resources. To address this challenge, we propose SISA (Scale-In Systolic Array), a novel SA architecture that partitions the traditional square array into horizontal rectangular slabs. With minimal overhead, SISA exposes parallelism through independently scheduled slabs for efficient execution of small or skewed matrix shapes, while retaining full-array operation for large GEMMs. SISA achieves up to 8.52x speedup and 93% energy-delay-product (EDP) reduction for representative LLMs compared to a state-of-the-art monolithic SA with the same number of PEs.

Keywords: Systolic Array, GEMM Acceleration, Large Language Models, Hardware Accelerator, Matrix Multiplication, Inference Efficiency, Energy-Delay Product, Processing Elements

43. ❌ ATP-Bench: Towards Agentic Tool Planning for MLLM Interleaved Generation

Authors: Yinuo Liu, Zi Qian, Heng Zhou, Jiahao Zhang, Yajie Zhang, Zhihang Li, Mengyu Zhou, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang | Source: arXiv | Published: 2026-03-31 | arXiv link: http://arxiv.org/abs/2603.29902v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	8.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

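Scoring rationale: The paper focuses on agentic tool planning for multimodal large language models (MLLMs). Its core topics are LLM agents, tool calling, and multi-agent systems, so it is highly relevant to 'Large Language Models', 'LLM Agents', and 'Tool Use' (10 points each). 'Retrieval-Augmented Generation' and 'Multi-agent Systems' are relevant (8 points each), because the paper discusses retrieval augmentation and introduces a multi-agent evaluation system. 'Chain of Thought', 'System 2 Thinking', and 'Hallucination Mitigation' have some relevance (5 points each), because agentic planning involves reasoning and factuality. Other keywords, such as MoE, quantization, and AI for science, are not covered (0 points).

!!! tip deepseek-chat TL;DR

The paper introduces the ATP-Bench benchmark and the MAM evaluation system to address the challenge of agentic tool planning in interleaved text-and-image generation by multimodal large language models. Experiments show that existing models struggle with coherent interleaved planning and differ significantly in tool-use behavior.

Abstract (Chinese translation)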

交错式图文生成是多模态大语言模型（MLLMs）的一个重要前沿方向，它提供了一种更直观的方式来传达复杂信息。当前的研究范式主要依赖于图像生成或检索增强技术，但这些方法通常将两者视为互斥的路径，未能将事实准确性与创造性统一起来。我们认为，该领域的下一个里程碑是智能体化工具规划，即模型作为中央控制器，自主决定何时、何地以及调用何种工具，以针对视觉关键性查询生成交错式响应。为系统评估这一范式，我们提出了ATP-Bench，这是一个新颖的基准测试集，涵盖八个类别和25种视觉关键性意图，包含7,702个问答对（其中1,592个为视觉问答对），并提供了人工验证的查询和真实答案。此外，为独立于端到端执行和可变工具后端来评估智能体规划能力，我们提出了一种多智能体MLLM即评判系统（Multi-Agent MLLM-as-a-Judge, MAM）。MAM能够评估工具调用的精确性，识别工具使用的遗漏机会，并在无需真实答案参考的情况下评估整体响应质量。我们在10个先进MLLMs上进行的广泛实验表明，现有模型在连贯的交错式规划方面存在困难，且在工具使用行为上表现出显著差异，这揭示了巨大的改进空间，并为推进交错式生成提供了可行的指导方向。数据集和代码可在https://github.com/Qwen-Applications/ATP-Bench获取。

Abstract

Interleaved text-and-image generation represents a significant frontier for Multimodal Large Language Models (MLLMs), offering a more intuitive way to convey complex information. Current paradigms rely on either image generation or retrieval augmentation, yet they typically treat the two as mutually exclusive paths, failing to unify factuality with creativity. We argue that the next milestone in this field is Agentic Tool Planning, where the model serves as a central controller that autonomously determines when, where, and which tools to invoke to produce interleaved responses for visual-critical queries. To systematically evaluate this paradigm, we introduce ATP-Bench, a novel benchmark comprising 7,702 QA pairs (including 1,592 VQA pairs) across eight categories and 25 visual-critical intents, featuring human-verified queries and ground truths. Furthermore, to evaluate agentic planning independent of end-to-end execution and changing tool backends, we propose a Multi-Agent MLLM-as-a-Judge (MAM) system. MAM evaluates tool-call precision, identifies missed opportunities for tool use, and assesses overall response quality without requiring ground-truth references. Our extensive experiments on 10 state-of-the-art MLLMs reveal that models struggle with coherent interleaved planning and exhibit significant variations in tool-use behavior, highlighting substantial room for improvement and providing actionable guidance for advancing interleaved generation. Dataset and code are available at https://github.com/Qwen-Applications/ATP-Bench.

Keywords: Multimodal Large Language Models, Agentic Tool Planning, Interleaved Generation, Tool Use, Multi-agent Systems, Benchmark Evaluation, Visual-critical Queries, MLLM-as-a-Judge

44. ❌ C-TRAIL: A Commonsense World Framework for Trajectory Planning in Autonomous Driving

Authors: Zhihong Cui, Haoran Tang, Tianyi Li, Yushuai Li, Peiyuan Guan, Amir Taherkordi, Tor Skeie | Source: arXiv | Published: 2026-03-31 | arXiv link: http://arxiv.org/abs/2603.29908v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	10.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

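Scoring rationale: C-TRAIL is a framework for trajectory planning in autonomous driving. Its core idea is to use an LLM for commonsense reasoning and to handle the unreliability of LLM outputs with a trust mechanism. Keywords tied directly to LLMs ('Large Language Models', 'LLM Agents') are therefore highly relevant (10 points). The paper plans with Monte Carlo Tree Search (MCTS) and explicitly combines it with an LLM, so 'MCTS AND LLM' scores 10. It addresses the reliability of LLM outputs, which touches on hallucination mitigation and factuality, so 'Hallucination Mitigation OR Factuality OR Truthfulness' scores 10. The framework is built on a 'Commonsense World', which is highly relevant to 'World Models AND General World Models' (10 points). Chain of Thought, System 2 Thinking, Self-Correction, and Explainable AI have some relevance to the paper's reasoning and interpretability aspects (5 points each). The remaining keywords, such as MoE, SLMs, Scaling Laws, training methods (Pre-training, SFT, and so on), and optimization techniques (Quantization, RAG, and so on), are not covered and score 0. Keywords such as AI for Science concern large-model applications, but the paper targets autonomous driving rather than scientific domains such as biomedicine, so they score 0.

!!! tip deepseek-chat TL;DR

To address unreliable LLM outputs in trajectory planning for autonomous driving, the paper proposes the C-TRAIL framework, which combines a commonsense world model with a trust mechanism to make planning more reliable. Experiments show that it consistently outperforms state-of-the-art baselines across simulated scenarios and real-world datasets.

Abstract (Chinese translation)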

自动驾驶的轨迹规划日益依赖大语言模型（LLM）进行常识推理，然而LLM的输出本质上不可靠，在安全关键应用中构成风险。我们提出了C-TRAIL框架，该框架构建于一个常识世界之上，通过将LLM衍生的常识与信任机制相结合来指导轨迹规划。C-TRAIL通过一个闭环的“回忆-规划-更新”周期运行：回忆模块向LLM查询语义关系，并通过双重信任机制量化其可靠性；规划模块通过狄利克雷信任策略将加权信任的常识注入蒙特卡洛树搜索（MCTS）；更新模块则根据环境反馈自适应地优化信任分数与策略参数。在Highway-env的四个模拟场景以及两个真实世界levelXData数据集（highD, rounD）上的实验表明，C-TRAIL始终优于现有先进基线方法，平均将ADE降低了40.2%，FDE降低了51.7%，并将SR提升了16.9个百分点。源代码发布于https://github.com/ZhihongCui/CTRAIL。

Abstract

Trajectory planning for autonomous driving increasingly leverages large language models (LLMs) for commonsense reasoning, yet LLM outputs are inherently unreliable, posing risks in safety-critical applications. We propose C-TRAIL, a framework built on a Commonsense World that couples LLM-derived commonsense with a trust mechanism to guide trajectory planning. C-TRAIL operates through a closed-loop Recall, Plan, and Update cycle: the Recall module queries an LLM for semantic relations and quantifies their reliability via a dual-trust mechanism; the Plan module injects trust-weighted commonsense into Monte Carlo Tree Search (MCTS) through a Dirichlet trust policy; and the Update module adaptively refines trust scores and policy parameters from environmental feedback. Experiments on four simulated scenarios in Highway-env and two real-world levelXData datasets (highD, rounD) show that C-TRAIL consistently outperforms state-of-the-art baselines, reducing ADE by 40.2%, FDE by 51.7%, and improving SR by 16.9 percentage points on average. The source code is available at https://github.com/ZhihongCui/CTRAIL.
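
The abstract reports ADE and FDE without expanding them. Assuming they carry their usual meaning in trajectory prediction (average displacement error over all time steps, and final displacement error at the last step), a minimal sketch of the two metrics looks like this; the trajectories are made-up values and the function name is hypothetical.

```python
import math


def ade_fde(pred, truth):
    """Return (ADE, FDE) for two equal-length trajectories of (x, y) points."""
    if not pred or len(pred) != len(truth):
        raise ValueError("trajectories must be non-empty and of equal length")
    # Euclidean distance between predicted and true position at each time step.
    dists = [math.dist(p, t) for p, t in zip(pred, truth)]
    return sum(dists) / len(dists), dists[-1]


truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
pred = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]  # drifts sideways by 1 unit per step
ade, fde = ade_fde(pred, truth)
print(ade, fde)  # per-step errors are 0, 1, 2, so ADE = 1.0 and FDE = 2.0
```

Lower values are better for both metrics, which is why the abstract states its gains as reductions.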

Keywords: autonomous driving, trajectory planning, large language models, commonsense reasoning, Monte Carlo Tree Search, trust mechanism, hallucination mitigation, world models

45. ❌ ShapE-GRPO: Shapley-Enhanced Reward Allocation for Multi-Candidate LLM Training

Authors: Rui Ai, Yu Pan, David Simchi-Levi, Chonghuan Wang | Source: arXiv | Published: 2026-03-31 | arXiv link: http://arxiv.org/abs/2603.29871v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

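Scoring rationale: The paper studies reinforcement learning post-training for LLMs that generate multiple candidate recommendations in scenarios such as recommendation and brainstorming, and proposes ShapE-GRPO as an improvement over GRPO. Highly relevant keywords: (1) 'Large Language Models' (the paper explicitly studies an LLM application), (2) 'Post-training' (it focuses on the post-training stage), and (3) 'RLHF' (it involves reinforcement learning optimization, and GRPO is treated here as an RLHF variant). Other keywords, such as MoE, quantization, and RAG, are not covered.

!!! tip deepseek-chat TL;DR

The paper targets the coarse reward allocation of existing GRPO methods when LLMs generate multiple candidates. It proposes ShapE-GRPO, which uses Shapley values to decompose the set-level reward into candidate-level signals. Experiments show faster convergence and better results than standard GRPO.

Abstract (Chinese translation)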

在推荐、头脑风暴和代码建议等用户-智能体交互场景中，大语言模型（LLMs）常生成一组候选推荐，其目标在于最大化整个集合的整体效用，而非独立优化单个候选。然而，现有的强化学习后训练范式，如组相对策略优化（Group Relative Policy Optimization, GRPO），通常为集合中的每个候选分配相同的集合级标量奖励。这会导致训练信号存在噪声，使得低质量候选因单个优质同伴产生的高奖励而“搭便车”，从而引发次优探索。为解决此问题，我们提出沙普利值增强的GRPO（ShapE-GRPO）。通过利用集合级效用的排列不变性，我们从合作博弈论中推导出一种沙普利值增强的公式，将集合级奖励分解为细粒度的、针对具体候选的个体信号。我们证明该公式在保持沙普利值基本公理的同时，仍能以多项式时间复杂度实现高效计算。实验表明，在多种数据集上，ShapE-GRPO均持续优于标准GRPO，并在训练过程中实现了更快的收敛速度。

Abstract

In user-agent interaction scenarios such as recommendation, brainstorming, and code suggestion, Large Language Models (LLMs) often generate sets of candidate recommendations where the objective is to maximize the collective utility of the entire set rather than individual candidates independently. However, existing reinforcement learning post-training paradigms, such as Group Relative Policy Optimization (GRPO), typically assign the same set-level scalar reward to every candidate in the set. This leads to noisy training signals where poor candidates free-ride on the high reward produced by a single strong peer, resulting in suboptimal exploration. To address this, we propose Shapley-Enhanced GRPO (ShapE-GRPO). By leveraging the permutation-invariant nature of set-level utility, we derive a Shapley-enhanced formulation from cooperative game theory to decompose set-level rewards into granular, candidate-specific signals. We show that our formulation preserves the fundamental axioms of the Shapley value while remaining computationally efficient with polynomial-time complexity. Empirically, ShapE-GRPO consistently outperforms standard GRPO across diverse datasets with accelerated convergence during training.
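
The free-riding problem and the Shapley decomposition can be made concrete with a small sketch. It computes exact Shapley values by enumerating all orderings, which is exponential in the number of candidates and is not the polynomial-time formulation the abstract refers to. The set utility (the best candidate's quality) and the quality scores are hypothetical choices for illustration only.

```python
from itertools import permutations


def shapley_values(players, utility):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = []
        prev = utility(coalition)
        for p in order:
            coalition.append(p)
            cur = utility(coalition)
            totals[p] += cur - prev  # marginal contribution of p in this ordering
            prev = cur
    return {p: total / len(orderings) for p, total in totals.items()}


# Hypothetical per-candidate quality; the set is only as useful as its best member.
quality = {"a": 1.0, "b": 0.2, "c": 0.1}


def set_utility(coalition):
    return max((quality[p] for p in coalition), default=0.0)


phi = shapley_values(list(quality), set_utility)
print(phi)
print(sum(phi.values()), set_utility(list(quality)))  # efficiency: shares sum to the set reward
```

Under a shared set-level reward, all three candidates would receive the same signal even though "a" produces almost all of the value; the Shapley shares instead credit "a" most, while still summing to the set-level reward.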

Keywords: Large Language Models, Group Relative Policy Optimization, Shapley value, reinforcement learning, post-training, multi-candidate recommendation, reward allocation, set-level utility

46. ❌ Towards Empowering Consumers through Sentence-level Readability Scoring in German ESG Reports

Authors: Benjamin Josef Schüßler, Jakob Prange | Source: arXiv | Published: 2026-03-31 | arXiv link: http://arxiv.org/abs/2603.29861v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

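Scoring rationale: The paper studies readability assessment of ESG reports, using LLM prompting and fine-tuned transformer models to predict sentence-level readability. Keyword relevance: (1) 'Large Language Models' scores 5, because the abstract states that 'LLM prompting has potential' but LLMs are not the core method; (2) 'Post-training/SFT' scores 5, because fine-tuned transformer models are used; (3) all other keywords score 0, because the paper does not involve MoE, SLMs, Scaling Laws, RAG, inference acceleration, AI for Science, or similar techniques. The paper applies large models to a specific domain (text readability assessment), but its technical depth and novelty are limited.

!!! tip deepseek-chat TL;DR

The study evaluates sentence-level readability in German ESG reports and finds that a fine-tuned transformer model predicts human readability ratings more accurately than LLM prompting.

Abstract (Chinese translation)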

随着经济与社会可持续发展议题日益紧迫，伴随而来的海量信息使消费者需要可靠的信息获取渠道。为满足这一需求，企业开始自愿或依法发布所谓的环境、社会及治理（Environmental, Social, and Governance，简称ESG）报告。为服务公众，这些报告不仅需要面向金融专家，也必须考虑非专业读者群体。然而，其文本表述是否足够清晰？本研究基于现有德文ESG报告句子级数据集，通过众包方式补充了可读性标注。我们发现，总体上母语者认为ESG报告中的句子易于阅读，但可读性感知存在主观差异。我们应用多种可读性评分方法，并基于预测误差以及与人工评分的相关性对其进行评估。分析表明，虽然大语言模型提示法在区分清晰句与难读句方面具有潜力，但经过微调的小型Transformer模型能以最低误差预测人工可读性评分。通过集成多个模型的预测结果可略微提升性能，但会以降低推理速度为代价。

Abstract

With the ever-growing urgency of sustainability in the economy and society, and the massive stream of information that comes with it, consumers need reliable access to that information. To address this need, companies began publishing so-called Environmental, Social, and Governance (ESG) reports, both voluntarily and as required by law. To serve the public, these reports must be addressed not only to financial experts but also to non-expert audiences. But are they written clearly enough? In this work, we extend an existing sentence-level dataset of German ESG reports with crowdsourced readability annotations. We find that, in general, native speakers perceive sentences in ESG reports as easy to read, but also that readability is subjective. We apply various readability scoring methods and evaluate them regarding their prediction error and correlation with human rankings. Our analysis shows that, while LLM prompting has potential for distinguishing clear from hard-to-read sentences, a small finetuned transformer predicts human readability with the lowest error. Averaging predictions of multiple models can slightly improve the performance at the cost of slower inference.
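
The evaluation described here (prediction error, correlation with human rankings, and averaging several models) can be sketched as follows. The abstract does not name the exact metrics, so RMSE and Spearman rank correlation are assumptions, and all scores are made-up numbers chosen so that the two hypothetical models err in different directions.

```python
import math


def rmse(pred, gold):
    """Root-mean-square error between predicted and human scores."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold))


def ranks(xs):
    """Rank positions (0 = smallest); assumes no tied values."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    result = [0] * len(xs)
    for rank, i in enumerate(order):
        result[i] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation for tie-free data."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


human = [1.0, 2.0, 3.0, 4.0]    # hypothetical human readability scores
model_a = [1.4, 2.6, 2.4, 4.4]  # hypothetical model predictions
model_b = [0.6, 1.8, 3.6, 3.8]
ensemble = [(a + b) / 2 for a, b in zip(model_a, model_b)]

for name, pred in [("a", model_a), ("b", model_b), ("ensemble", ensemble)]:
    print(name, round(rmse(pred, human), 3), round(spearman(pred, human), 3))
```

Averaging helps in this toy case because the two models' errors partly cancel; with correlated errors the gain would be smaller, and each extra model adds inference cost, which matches the trade-off the abstract mentions.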

Keywords: ESG reports, readability scoring, sentence-level, German language, LLM prompting, transformer fine-tuning, human annotations, model evaluation

47. ❌ Owl-AuraID 1.0: An Intelligent System for Autonomous Scientific Instrumentation and Scientific Data Analysis

Authors: Han Deng, Anqi Zou, Hanling Zhang, Ben Fei, Chengyu Zhang, Haobo Wang, Xinru Guo, Zhenyu Li, Xuzhu Wang, Peng Yang, Fujian Zhang, Weiyu Guo, Xiaohong Shao, Zhaoyang Liu, Shixiang Tang, Zhihui Wang, Wanli Ouyang | Source: arXiv | Published: 2026-03-31 | arXiv link: http://arxiv.org/abs/2603.29828v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

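Scoring rationale: The paper presents an intelligent agent system for automating scientific instruments and data analysis, and is unrelated to most of the LLM-technique keywords. It is highly relevant to 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (10) because the system is an autonomous agent; related to 'Tool Use OR Function Calling OR API Tool Use' (8) because it operates tools through the GUI; and highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10) as a scientific AI application. No other keyword is addressed.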
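
The zero total for this entry can be reproduced with a small sketch. The pass threshold follows the formula stated in the report header (5.0 + 0.8 × total weight). The per-keyword rule `weight * relevance`, capped at the stated per-keyword maximum of 15, is an assumption, since the report does not spell out how the Score column is derived.

```python
def passing_score(total_weight, base=5.0, factor=0.8):
    """Pass threshold from the report header: base + factor * total weight."""
    return base + factor * total_weight


def keyword_score(weight, relevance, max_per_keyword=15.0):
    """Hypothetical per-keyword score: weight * relevance, capped at the maximum."""
    return min(weight * relevance, max_per_keyword)


# (keyword, weight, relevance) rows copied from the table above;
# only the rows with non-zero relevance are listed.
rows = [
    ("LLM Agents OR Autonomous Agents OR Agentic Workflow", 0.0, 10.0),
    ("Tool Use OR Function Calling OR API Tool Use", 0.0, 8.0),
    ("AI for Science OR Bioinformatics OR Cheminformatics", 0.0, 10.0),
]

total = sum(keyword_score(weight, relevance) for _, weight, relevance in rows)
threshold = passing_score(27.0)  # 27 keywords, each configured with weight 1.0
print(f"total={total:.1f} threshold={threshold:.1f} passed={total >= threshold}")
```

Under this assumed rule, a weight of 0.0 forces every keyword score to 0.0 regardless of relevance, which matches the table even though the header lists each keyword with weight 1.0.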

!!! tip deepseek-chat TL;DR

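The paper proposes Owl-AuraID, a system that uses a GUI-native paradigm to operate scientific instruments autonomously and analyze their data. It addresses the poor generality of instrument operation in automated laboratories and is validated on a range of precision instruments and scientific workflows.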

Abstract (Chinese translation)

科学发现日益依赖于高通量表征技术，但自动化进程受限于专有的图形用户界面（GUI）以及现有基于应用程序接口（API）系统的有限普适性。我们提出Owl-AuraID，一种软硬件协同的具身智能体系统，它采用原生GUI范式，通过与人专家相同的界面操作仪器。该系统以技能为核心框架，将第一类（GUI操作）与第二类（数据分析）技能整合至端到端工作流中，从而将物理样品处理与科学解读相连接。Owl-AuraID展示了在十类精密仪器及多种工作流中的广泛覆盖能力，包括多模态光谱分析、显微成像与晶体学分析，支持傅里叶变换红外光谱（FTIR）、核磁共振（NMR）、原子力显微镜（AFM）、热重分析（TGA）等多种表征模式。总体而言，Owl-AuraID为自主实验室提供了实用且可扩展的基础框架，并通过可复用的操作与分析技能，展示了一条通向实验室智能演进的可行路径。代码发布于https://github.com/OpenOwlab/AuraID。

Abstract

Scientific discovery increasingly depends on high-throughput characterization, yet automation is hindered by proprietary GUIs and the limited generalizability of existing API-based systems. We present Owl-AuraID, a software-hardware collaborative embodied agent system that adopts a GUI-native paradigm to operate instruments through the same interfaces as human experts. Its skill-centric framework integrates Type-1 (GUI operation) and Type-2 (data analysis) skills into end-to-end workflows, connecting physical sample handling with scientific interpretation. Owl-AuraID demonstrates broad coverage across ten categories of precision instruments and diverse workflows, including multimodal spectral analysis, microscopic imaging, and crystallographic analysis, supporting modalities such as FTIR, NMR, AFM, and TGA. Overall, Owl-AuraID provides a practical, extensible foundation for autonomous laboratories and illustrates a path toward evolving laboratory intelligence through reusable operational and analytical skills. The code is available at https://github.com/OpenOwlab/AuraID.

Keywords: autonomous scientific instrumentation, GUI-native paradigm, embodied agent system, scientific data analysis, precision instruments, multimodal spectral analysis, autonomous laboratories, skill-centric framework

48. ❌ From Density Matrices to Phase Transitions in Deep Learning: Spectral Early Warnings and Interpretability

Authors: Max Hennick, Guillaume Corlouer Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29805v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

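Scoring rationale: The paper studies phase-transition detection and interpretability during deep-learning training, and proposes the 2RDM, a method inspired by quantum chemistry. It is unrelated to the vast majority of the keywords, which target specific LLM techniques, training methods, applications, or optimizations. The only strong match is 'Mechanistic Interpretability OR Explainable AI' (10), because the paper's core contribution is an interpretability tool for understanding training dynamics. 'AI for Science OR Bioinformatics OR Cheminformatics' (5) is weakly related: the method is inspired by quantum chemistry and applied to scientific questions about AI, but it is not a direct bioinformatics or cheminformatics application.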

!!! tip deepseek-chat TL;DR

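The paper proposes the 2RDM, a quantum-chemistry-inspired method for detecting and interpreting phase transitions during the training of deep learning models. Its spectral heat capacity provides early warning of transitions, and its participation ratio measures the dimensionality of the reorganization.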

Abstract (Chinese translation)

现代人工智能研究的一个核心问题是在训练过程中预测和理解模型涌现的能力。受量子化学反应研究方法的启发，我们提出了“双数据点约化密度矩阵”（2RDM）。我们证明，该对象为训练过程中的相变提供了一个计算高效、统一的观测量。通过追踪滑动窗口内2RDM的特征值统计量，我们推导出两个互补的信号：谱热容——我们证明其能通过临界慢化现象为二阶相变提供早期预警；以及参与率——它揭示了底层重组过程的维度。值得注意的是，2RDM的顶部特征向量具有直接的可解释性，使得研究相变的本质变得直观。我们在四个不同的场景中进行了验证：深度线性网络、归纳头形成、顿悟现象以及涌现的错位。最后，我们讨论了基于2RDM的未来研究方向。

Abstract

A key problem in the modern study of AI is predicting and understanding emergent capabilities in models during training. Inspired by methods for studying reactions in quantum chemistry, we present the "2-datapoint reduced density matrix" (2RDM). We show that this object provides a computationally efficient, unified observable of phase transitions during training. By tracking the eigenvalue statistics of the 2RDM over a sliding window, we derive two complementary signals: the spectral heat capacity, which we prove provides early warning of second-order phase transitions via critical slowing down, and the participation ratio, which reveals the dimensionality of the underlying reorganization. Remarkably, the top eigenvectors of the 2RDM are directly interpretable, making it straightforward to study the nature of the transitions. We validate across four distinct settings: deep linear networks, induction head formation, grokking, and emergent misalignment. We then discuss directions for future work using the 2RDM.

Keywords: density matrices, phase transitions, deep learning, spectral early warnings, interpretability, 2RDM, training dynamics, emergent capabilities
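
To make the "participation ratio" signal concrete, here is a minimal sketch that uses the standard definition, PR = (Σλ)² / Σλ², over a list of eigenvalues. The abstract does not give the paper's exact normalization or explain how the 2RDM itself is built, so the function below is an assumption-level illustration rather than the authors' implementation.

```python
def participation_ratio(eigenvalues):
    """Effective number of eigen-directions carrying the spectrum's mass.

    Standard definition: (sum of eigenvalues)^2 / (sum of squared eigenvalues).
    Negative values from numerical noise are clipped to zero.
    """
    clipped = [max(float(value), 0.0) for value in eigenvalues]
    total = sum(clipped)
    sum_of_squares = sum(value * value for value in clipped)
    if sum_of_squares == 0.0:
        raise ValueError("spectrum has no positive eigenvalues")
    return total * total / sum_of_squares


# One dominant direction: the spectrum behaves as if it were 1-dimensional.
print(participation_ratio([1.0, 0.0, 0.0, 0.0]))
# Mass spread evenly over four directions: effective dimension is 4.
print(participation_ratio([0.25, 0.25, 0.25, 0.25]))
```

The value ranges from 1 (a single dominant eigenvalue) to the number of eigenvalues (a flat spectrum), which makes it a convenient scalar to track across a sliding window of training checkpoints.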

49. ❌ Reasoning-Driven Synthetic Data Generation and Evaluation

Authors: Tim R. Davidson, Benoit Seguin, Enrico Bacis, Cesar Ilharco, Hamza Harkous Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29791v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

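Scoring rationale: The paper proposes Simula, a reasoning-driven framework for synthetic data generation and evaluation that uses a seedless, agentic approach to generate synthetic datasets at scale. It is highly relevant to 'LLM Agents/Autonomous Agents/Agentic Workflow' (10) because the framework is agent-driven at its core, and to 'Chain of Thought/CoT Reasoning/Multi-step Reasoning' and 'System 2 Thinking/Slow Thinking/In-depth Reasoning' (10 each) because it is reasoning-driven. It is related to 'Large Language Models/LLMs/Foundation Models' (8) because synthetic data generation typically involves large models; to 'Scaling Laws AND Data Quality' (5) because it covers large-scale data generation and quality evaluation; and to 'Mechanistic Interpretability/Explainable AI' (5) because the framework emphasizes explainability and controllability. The remaining keywords are unrelated to the paper or not mentioned, and score 0.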

!!! tip deepseek-chat TL;DR

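The paper proposes Simula, a reasoning-driven framework for generating and evaluating synthetic data at scale, aimed at AI applications where data is scarce or privacy-sensitive, and validates it experimentally.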

Abstract (Chinese translation)

尽管许多备受关注的AI应用需要专门的多模态模型，但训练此类模型所需的相关数据本质上稀缺或难以获取。通过人工标注填补这些空白成本极高、易出错且耗时，这促使模型构建者日益将合成数据视为可扩展的替代方案。然而，现有的合成数据生成方法通常依赖于人工提示、进化算法或来自目标分布的大量种子数据——这限制了其可扩展性、可解释性和可控性。本文提出Simula：一种新颖的推理驱动数据生成与评估框架。它采用无种子（seedless）、智能体驱动（agentic）的方法大规模生成合成数据集，允许用户通过可解释且可控的流程定义期望的数据集特征，从而实现细粒度的资源分配。我们在多种数据集上验证了该方法的有效性，对其内在特性与下游性能进行了严格测试。本项工作（1）为合成数据机制设计提供了指导原则，（2）为大规模生成与评估合成数据提供了见解，（3）为在数据稀缺或隐私问题至关重要的领域开发与部署AI开辟了新机遇。

Abstract

Although many AI applications of interest require specialized multi-modal models, relevant data to train such models is inherently scarce or inaccessible. Filling these gaps with human annotators is prohibitively expensive, error-prone, and time-consuming, leading model builders to increasingly consider synthetic data as a scalable alternative. However, existing synthetic data generation methods often rely on manual prompts, evolutionary algorithms, or extensive seed data from the target distribution, which limits their scalability, explainability, and control. In this paper, we introduce Simula: a novel reasoning-driven framework for data generation and evaluation. It employs a seedless, agentic approach to generate synthetic datasets at scale, allowing users to define desired dataset characteristics through an explainable and controllable process that enables fine-grained resource allocation. We show the efficacy of our approach on a variety of datasets, rigorously testing both intrinsic and downstream properties. Our work (1) offers guidelines for synthetic data mechanism design, (2) provides insights into generating and evaluating synthetic data at scale, and (3) unlocks new opportunities for developing and deploying AI in domains where data scarcity or privacy concerns are paramount.

Keywords: synthetic data generation, reasoning-driven framework, agentic approach, data scarcity, explainable process, controllable process, dataset evaluation, AI applications

50. ❌ From Skeletons to Semantics: Design and Deployment of a Hybrid Edge-Based Action Detection System for Public Safety

Authors: Ganen Sethupathy, Lalit Dumka, Jan Schagen Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29777v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

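Scoring rationale: The paper studies a hybrid edge-based action detection system that combines skeleton analysis with vision-language models for semantic scene understanding. Although vision-language models are mentioned, all keywords focus on large language models (LLMs) and their specific techniques: training methods, inference optimization, alignment, agents, and so on. The core of the paper is computer vision and edge-computing system design, not LLM techniques or novel LLM applications. All keywords are therefore judged irrelevant and score 0.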

!!! tip deepseek-chat TL;DR

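The paper designs and deploys a hybrid edge-based action detection system that combines skeleton analysis with vision-language models to address the latency, privacy, and resource challenges of real-time video analysis in public safety applications. The results indicate that a hybrid architecture can balance fast detection against higher-level semantic reasoning.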

Abstract (Chinese translation)

交通枢纽、城市中心及活动场馆等公共空间需要及时可靠地检测潜在暴力行为以保障公共安全。尽管自动化视频分析已取得显著进展，但其实际部署仍受延迟、隐私和资源限制的制约，尤其在边缘计算条件下更为突出。本文提出了一种基于混合边缘架构的行为检测系统设计与演示部署方案，该系统将基于骨架的运动分析与视觉-语言模型相结合，以实现语义场景理解。骨架处理方法能以较低计算开销实现持续且注重隐私的监控，而视觉-语言模型则为复杂及既往未见场景提供上下文理解与零样本推理能力。本文的贡献不在于提出新的识别模型，而是聚焦于在现实边缘约束条件下对两种范式进行系统级比较。该系统在配备GPU的边缘设备上实现，并通过演示装置在延迟、资源占用及运行权衡方面进行评估。结果凸显了以运动为中心的方法与语义方法的互补优势与局限，进而提出一种混合架构——该架构通过高层语义推理选择性地增强快速骨架检测能力。所提出的系统为公共安全应用中注重隐私的实时视频分析提供了实践基础。

Abstract

Public spaces such as transport hubs, city centres, and event venues require timely and reliable detection of potentially violent behaviour to support public safety. While automated video analysis has made significant progress, practical deployment remains constrained by latency, privacy, and resource limitations, particularly under edge-computing conditions. This paper presents the design and demonstrator-based deployment of a hybrid edge-based action detection system that combines skeleton-based motion analysis with vision-language models for semantic scene interpretation. Skeleton-based processing enables continuous, privacy-aware monitoring with low computational overhead, while vision-language models provide contextual understanding and zero-shot reasoning capabilities for complex and previously unseen situations. Rather than proposing new recognition models, the contribution focuses on a system-level comparison of both paradigms under realistic edge constraints. The system is implemented on a GPU-enabled edge device and evaluated with respect to latency, resource usage, and operational trade-offs using a demonstrator-based setup. The results highlight the complementary strengths and limitations of motion-centric and semantic approaches and motivate a hybrid architecture that selectively augments fast skeleton-based detection with higher-level semantic reasoning. The presented system provides a practical foundation for privacy-aware, real-time video analysis in public safety applications.

Keywords: action detection, edge computing, skeleton-based analysis, vision-language models, public safety, real-time video analysis, privacy-aware monitoring, hybrid architecture
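
One way to read "selectively augments fast skeleton-based detection with higher-level semantic reasoning" is as a confidence gate: the cheap detector decides on its own when it is confident, and only ambiguous frames are escalated to the vision-language model. The sketch below illustrates that pattern. The thresholds, the `Decision` type, and the `vlm_check` callback are all hypothetical, because the abstract does not describe the actual gating rule.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Decision:
    alert: bool      # whether to raise an alert for this frame
    used_vlm: bool   # whether the expensive semantic check was invoked
    score: float     # skeleton-based score that drove the routing


def hybrid_detect(
    skeleton_score: float,
    frame: Any,
    vlm_check: Callable[[Any], bool],
    low: float = 0.3,
    high: float = 0.8,
) -> Decision:
    """Route a frame through a hypothetical two-stage detector."""
    if skeleton_score >= high:
        # Confident positive: alert without paying for the VLM.
        return Decision(alert=True, used_vlm=False, score=skeleton_score)
    if skeleton_score < low:
        # Confident negative: keep monitoring with the cheap path only.
        return Decision(alert=False, used_vlm=False, score=skeleton_score)
    # Ambiguous band: ask the vision-language model for a semantic verdict.
    return Decision(alert=bool(vlm_check(frame)), used_vlm=True, score=skeleton_score)


if __name__ == "__main__":
    always_flag = lambda frame: True  # stand-in for a real VLM call
    for score in (0.1, 0.5, 0.9):
        print(score, hybrid_detect(score, "frame", always_flag))
```

The gate keeps the expensive model off the hot path: latency and GPU usage scale with the fraction of ambiguous frames rather than with the full frame rate.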

51. ❌ Tracking vs. Deciding: The Dual-Capability Bottleneck in Searchless Chess Transformers

Authors: Quanhao Li, Wei Jiang Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29761v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

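Scoring rationale: The paper applies Transformer models to chess, focusing on the bottleneck between state tracking and decision quality when training from move sequences. It is unrelated to most keywords, which target techniques, applications, or evaluation methods for general-purpose LLMs; the paper's core is domain-specific Transformer training. Three keywords are moderately related. (1) 'Scaling Laws AND Data Quality' (5): the paper examines how model size (28M to 120M parameters) and data quality (Elo-weighted training) affect performance, which touches on scaling laws and data quality without being the core topic. (2) 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5): training from move sequences can be viewed as domain-specific pre-training or adaptation, although the paper does not use those terms. (3) 'Post-training OR Supervised Fine-tuning OR SFT' (5): training may involve supervised fine-tuning to improve decisions, but this is not described in detail. Other keywords such as LLM, MoE, RAG, and RLHF are not addressed, because the paper concerns chess Transformers rather than general LLM techniques or scientific AI applications.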

!!! tip deepseek-chat TL;DR

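The paper studies the dual-capability bottleneck between state tracking and decision quality that arises when a searchless chess Transformer is trained from move sequences. By combining model scaling with Elo-weighted training, its 120M-parameter model reaches a Lichess bullet rating of 2570 and 55.2% Top-1 accuracy on human move prediction.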

Abstract (Chinese translation)

一个拟人化的国际象棋引擎应当模仿高水平棋手的风格、失误与一致性，而非单纯追求棋力最大化。本文证明，仅通过走子序列进行训练会迫使模型学习两种能力：状态追踪（从历史走法中重建棋盘状态）与决策质量（基于重建状态选择优质走法）。这两者存在矛盾的数据需求：低等级对局为追踪能力提供了必要的多样性，而高等级对局则为决策学习提供了质量信号。移除低等级数据会导致性能下降。
我们将这种矛盾形式化为双重能力瓶颈，即 P ≤ min(T,Q)，其中整体性能受限于较弱的那项能力。基于这一视角，我们首先将模型参数量从2800万扩展到1.2亿以提升追踪能力，随后引入基于Elo等级加权的训练方法，在保持多样性的同时提升决策质量。通过2×2因子消融实验表明：扩大规模可改善追踪能力，加权训练可提升决策质量，且二者结合具有超叠加效应。线性加权效果最佳，而过激的加权策略虽能降低验证损失，却会损害追踪能力。我们还提出了覆盖衰减公式 t* = log(N/kcrit)/log b，用以量化对局内退化风险的可靠预测范围。
最终，我们的1.2亿参数模型在不依赖搜索的情况下，于253场评级对局中达到Lichess快棋等级分2570。在人类走子预测任务上，其Top-1准确率达到55.2%，超越了Maia-2快棋版与超快棋版。与基于局面的方法不同，序列输入天然编码完整对局历史，使得模型能够做出依赖历史情境的决策，这是单局面模型无法实现的特性。

Abstract

A human-like chess engine should mimic the style, errors, and consistency of a strong human player rather than maximize playing strength. We show that training from move sequences alone forces a model to learn two capabilities: state tracking, which reconstructs the board from move history, and decision quality, which selects good moves from that reconstructed state. These impose contradictory data requirements: low-rated games provide the diversity needed for tracking, while high-rated games provide the quality signal for decision learning. Removing low-rated data degrades performance. We formalize this tension as a dual-capability bottleneck, P <= min(T, Q), where overall performance is limited by the weaker capability. Guided by this view, we scale the model from 28M to 120M parameters to improve tracking, then introduce Elo-weighted training to improve decisions while preserving diversity. A 2 x 2 factorial ablation shows that scaling improves tracking, weighting improves decisions, and their combination is superadditive. Linear weighting works best, while overly aggressive weighting harms tracking despite lower validation loss. We also introduce a coverage-decay formula, t* = log(N / k_crit) / log b, as a reliability horizon for intra-game degeneration risk. Our final 120M-parameter model, without search, reached Lichess bullet 2570 over 253 rated games. On human move prediction it achieves 55.2% Top-1 accuracy, exceeding Maia-2 rapid and Maia-2 blitz. Unlike position-based methods, sequence input naturally encodes full game history, enabling history-dependent decisions that single-position models cannot exhibit.

Keywords: chess transformers, state tracking, decision quality, dual-capability bottleneck, Elo-weighted training, model scaling, human move prediction, searchless engine
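
The two formulas quoted in the abstract are easy to evaluate directly. The sketch below computes the bottleneck bound P <= min(T, Q) and the coverage-decay horizon t* = log(N / k_crit) / log b. The abstract does not define N, k_crit, or b, so the parameter names are placeholders and the example values are arbitrary. The linear Elo weighting function is likewise a hypothetical illustration of "linear weighting", not the paper's scheme.

```python
import math


def performance_bound(tracking: float, decision_quality: float) -> float:
    """Upper bound from the dual-capability bottleneck: P <= min(T, Q)."""
    return min(tracking, decision_quality)


def reliability_horizon(n: float, k_crit: float, b: float) -> float:
    """Coverage-decay horizon t* = log(N / k_crit) / log(b)."""
    if n <= 0 or k_crit <= 0 or b <= 1:
        raise ValueError("require n > 0, k_crit > 0 and b > 1")
    return math.log(n / k_crit) / math.log(b)


def linear_elo_weight(elo, elo_min=1000.0, elo_max=2500.0, w_min=0.5, w_max=1.5):
    """Hypothetical linear sample weight: higher-rated games count more,
    but low-rated games keep a non-zero weight so diversity is preserved."""
    clamped = min(max(elo, elo_min), elo_max)
    fraction = (clamped - elo_min) / (elo_max - elo_min)
    return w_min + fraction * (w_max - w_min)


print(performance_bound(0.9, 0.6))           # limited by the weaker capability
print(reliability_horizon(1024.0, 1.0, 2.0))  # arbitrary example values
print([linear_elo_weight(e) for e in (800, 1750, 2600)])
```

Keeping `w_min` above zero reflects the abstract's point that removing low-rated games hurts tracking: the weighting shifts emphasis toward strong play without discarding the diverse positions that weaker games supply.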

52. ❌ TSHA: A Benchmark for Visual Language Models in Trustworthy Safety Hazard Assessment Scenarios

Authors: Qiucheng Yu, Ruijie Xu, Mingang Chen, Xuequan Lu, Jianfeng Dong, Chaochao Lu, Xin Tan Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29759v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

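Scoring rationale: The paper builds and evaluates a benchmark for vision-language models (VLMs) in indoor safety hazard assessment, covering dataset construction, model evaluation, and performance improvement. All of the given keywords concern large language models (LLMs) and related techniques (training methods, inference optimization, alignment, agent systems, and so on) or AI applications in specific scientific fields such as bioinformatics. The paper does not address LLM principles, training methods, inference techniques, model optimization, or scientific-domain AI applications, and mentions no LLM-specific concepts. All keywords are therefore judged irrelevant and score 0.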

!!! tip deepseek-chat TL;DR

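To address the limitations of existing benchmarks for vision-language models in indoor safety hazard assessment, the paper proposes TSHA, a comprehensive benchmark dataset. Experiments show that current models fall short on this task, and that training on TSHA substantially improves performance and generalization.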

Abstract (Chinese translation)

视觉语言模型（VLMs）的最新进展加速了其在室内安全隐患评估中的应用。然而，现有基准存在三个根本性局限：（1）严重依赖通过仿真软件构建的合成数据集，导致与现实环境存在显著领域差距；（2）安全任务过于简化，对危险类型和场景类型施加了人为限制，从而限制了模型的泛化能力；（3）缺乏严格的评估协议来全面评估模型在复杂家庭安全场景中的能力。为应对这些挑战，我们提出了TSHA（Trustworthy Safety Hazards Assessment，可信安全隐患评估）基准，这是一个综合性基准，包含来自四个互补来源的81,809个精心筛选的训练样本：现有室内数据集、互联网图像、AIGC（人工智能生成内容）图像以及新采集的图像。该基准集还包含一个极具挑战性的测试集，共1707个样本，不仅包含从训练分布中精心挑选的子集，还新增了包含多种安全隐患的视频和全景图像，用于评估模型在复杂安全场景中的鲁棒性。对23个流行VLMs的广泛实验表明，当前模型缺乏稳健的安全隐患评估能力。重要的是，在TSHA训练集上训练的模型不仅在TSHA测试集上实现了高达+18.3分的显著性能提升，还在其他基准上表现出更强的泛化能力，这凸显了TSHA基准的重大贡献与重要性。

Abstract

Recent advances in vision-language models (VLMs) have accelerated their application to indoor safety hazards assessment. However, existing benchmarks suffer from three fundamental limitations: (1) heavy reliance on synthetic datasets constructed via simulation software, creating a significant domain gap with real-world environments; (2) oversimplified safety tasks with artificial constraints on hazard and scene types, thereby limiting model generalization; and (3) absence of rigorous evaluation protocols to thoroughly assess model capabilities in complex home safety scenarios. To address these challenges, we introduce TSHA (Trustworthy Safety Hazards Assessment), a comprehensive benchmark comprising 81,809 carefully curated training samples drawn from four complementary sources: existing indoor datasets, internet images, AIGC images, and newly captured images. This benchmark set also includes a highly challenging test set with 1,707 samples, comprising not only a carefully selected subset from the training distribution but also newly added videos and panoramic images containing multiple safety hazards, used to evaluate the model's robustness in complex safety scenarios. Extensive experiments on 23 popular VLMs demonstrate that current VLMs lack robust capabilities for safety hazard assessment. Importantly, models trained on the TSHA training set not only achieve a significant performance improvement of up to +18.3 points on the TSHA test set but also exhibit enhanced generalizability across other benchmarks, underscoring the substantial contribution and importance of the TSHA benchmark.

Keywords: vision-language models, safety hazard assessment, benchmark, indoor safety, domain gap, generalization, trustworthy assessment

53. ❌ CausalPulse: An Industrial-Grade Neurosymbolic Multi-Agent Copilot for Causal Diagnostics in Smart Manufacturing

Authors: Chathurangi Shyalika, Utkarshani Jaimini, Cory Henson, Amit Sheth Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29755v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

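Scoring rationale: The paper focuses on causal diagnostics in smart manufacturing and proposes CausalPulse, a multi-agent copilot. The most relevant keywords are 'LLM Agents/Autonomous Agents/Agentic Workflow' (highly relevant: the core of the paper is a multi-agent copilot), 'Tool Use/Function Calling/API Tool Use' (highly relevant: the system involves planning and tool use), and 'Multi-agent Systems/Agent Coordination' (highly relevant: the system is multi-agent collaboration by design). 'Self-Correction/Self-Improvement/Self-Reflection' is related because the paper reports a 97.3% self-reflection success rate. 'Mechanistic Interpretability/Explainable AI' is related because the paper emphasizes interpretability. 'AI for Science/Bioinformatics/Cheminformatics' is related because the application to smart manufacturing counts as AI for industrial science. The remaining keywords (LLM principles, training methods, inference optimization, and so on) are not mentioned in the abstract and are irrelevant.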

!!! tip deepseek-chat TL;DR

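The paper proposes CausalPulse, an industrial-grade neurosymbolic multi-agent copilot for causal diagnostics in smart manufacturing. By unifying anomaly detection, causal discovery, and reasoning, it achieves high reliability (overall success rates of at least 98%) and real-time performance in real manufacturing settings.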

Abstract (Chinese translation)

现代制造环境需要实时、可信且可解释的根源分析洞察以维持生产效率和产品质量。传统分析流程通常将异常检测、因果推断与根源分析视为孤立阶段，限制了可扩展性与可解释性。本研究提出CausalPulse——一个工业级多智能体协同系统，用于实现智能制造中的自动化因果诊断。该系统通过基于标准化智能体协议构建的神经符号架构，将异常检测、因果发现与推理过程相统一。CausalPulse正在罗伯特·博世制造工厂部署，能够无缝集成现有监测工作流并支持生产规模的实时运行。在公开数据集（Future Factories）与专有数据集（Planar Sensor Element）上的评估显示其具有高可靠性，整体成功率分别达到98.0%和98.73%。分项成功率表现为：规划与工具使用达98.75%，自我反思达97.3%，协同能力达99.2%。运行时实验显示每个诊断工作流的端到端延迟为50-60秒，并呈现近线性扩展性（R^2=0.97），证实了实时运行能力。与现有工业协同系统的对比突显了其在模块化、可扩展性和部署成熟度方面的显著优势。这些结果表明，CausalPulse通过模块化、人机协同的设计，为新一代制造业提供了可靠、可解释且具备生产就绪性的自动化解决方案。

Abstract

Modern manufacturing environments demand real-time, trustworthy, and interpretable root-cause insights to sustain productivity and quality. Traditional analytics pipelines often treat anomaly detection, causal inference, and root-cause analysis as isolated stages, limiting scalability and explainability. In this work, we present CausalPulse, an industry-grade multi-agent copilot that automates causal diagnostics in smart manufacturing. It unifies anomaly detection, causal discovery, and reasoning through a neurosymbolic architecture built on standardized agentic protocols. CausalPulse is being deployed in a Robert Bosch manufacturing plant, integrating seamlessly with existing monitoring workflows and supporting real-time operation at production scale. Evaluations on both public (Future Factories) and proprietary (Planar Sensor Element) datasets show high reliability, achieving overall success rates of 98.0% and 98.73%. Per-criterion success rates reached 98.75% for planning and tool use, 97.3% for self-reflection, and 99.2% for collaboration. Runtime experiments report end-to-end latency of 50-60s per diagnostic workflow with near-linear scalability (R^2=0.97), confirming real-time readiness. Comparison with existing industrial copilots highlights distinct advantages in modularity, extensibility, and deployment maturity. These results demonstrate how CausalPulse’s modular, human-in-the-loop design enables reliable, interpretable, and production-ready automation for next-generation manufacturing.

Keywords: multi-agent copilot, causal diagnostics, smart manufacturing, neurosymbolic architecture, agentic protocols, real-time operation, human-in-the-loop, industrial deployment

Authors: Edoardo Allegrini, Edoardo Di Paolo, Angelo Spognardi, Marinella Petrocchi Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29741v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

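Scoring rationale: BotVerse is an LLM-based social-agent simulation framework whose core is using LLMs to build autonomous agents for coordinated multi-agent social simulation. It is therefore highly relevant to 'Large Language Models' (10 points), because the system relies on LLMs as the core of its agents; highly relevant to 'LLM Agents' (10 points), because the paper explicitly builds LLM-based agents; and highly relevant to 'Multi-agent Systems' (10 points), because the framework simulates coordinated interaction among multiple agents. Other keywords such as MoE, SFT, and RAG concern specific techniques the paper does not mention, so they receive 0 points. 'AI for Science' covers scientific applications, but the paper focuses on social simulation rather than bioinformatics or cheminformatics, so it also receives 0 points.

!!! tip deepseek-chat TL;DR

The paper presents BotVerse, a framework that runs high-fidelity social simulations with LLM-driven autonomous agents in a controlled environment. This addresses the ethical risks of studying autonomous agents on live networks, and a coordinated disinformation scenario demonstrates its use as an experimental platform for red-teaming and computational social science.

摘要翻译 (Abstract Translation)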

BotVerse是一种基于大语言模型智能体的、可扩展的事件驱动高保真社交模拟框架。它通过将交互隔离在受控环境中，同时将其锚定于来自Bluesky生态系统的实时内容流，从而解决了在实时网络中研究自主智能体所面临的伦理风险。该系统具备异步编排API和模拟引擎，可模拟类人时间模式与认知记忆。通过合成社交观测台，研究人员能够部署可定制角色，并大规模观察多模态交互。我们通过一个协同虚假信息场景展示BotVerse，为红队演练与计算社会科学研究者提供了一个安全的实验框架。该框架的视频演示可在https://youtu.be/eZSzO5Jarqk查看。

摘要 (Abstract)

BotVerse is a scalable, event-driven framework for high-fidelity social simulation using LLM-based agents. It addresses the ethical risks of studying autonomous agents on live networks by isolating interactions within a controlled environment while grounding them in real-time content streams from the Bluesky ecosystem. The system features an asynchronous orchestration API and a simulation engine that emulates human-like temporal patterns and cognitive memory. Through the Synthetic Social Observatory, researchers can deploy customizable personas and observe multimodal interactions at scale. We demonstrate BotVerse via a coordinated disinformation scenario, providing a safe, experimental framework for red-teaming and computational social scientists. A video demonstration of the framework is available at https://youtu.be/eZSzO5Jarqk.

Keywords: LLM-based agents, social simulation, autonomous agents, multi-agent systems, event-driven framework, computational social science, disinformation scenario, real-time simulation

55. ❌ Spontaneous Functional Differentiation in Large Language Models: A Brain-Like Intelligence Economy

Authors: Junjie Zhang, Zhen Shen, Gang Xiong, Xisong Dong Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29735v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

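Scoring rationale: The paper studies spontaneous functional differentiation in large language models (LLMs), using Integrated Information Decomposition to analyze their internal processing and drawing an analogy with synergistic processing in the human brain. This is directly and highly relevant to 'Large Language Models' (the core object of study) and 'Mechanistic Interpretability' (explaining model behavior by analyzing internal mechanisms). Other keywords, such as MoE, SFT, RAG, reasoning methods, and agents, are neither mentioned nor implied in the abstract, so they are scored 0. The paper is an original study of the technical principles of large models and fits the scoring context.

!!! tip deepseek-chat TL;DR

The study finds that large language models spontaneously form a synergistic processing core similar to that of the human brain: middle layers integrate information synergistically rather than redundantly, this organization evolves dynamically as task difficulty increases, and removing the synergistic components causes performance to collapse. The authors read this as identifying the physical entity of abstract reasoning and as a link between artificial and biological intelligence.

摘要翻译 (Abstract Translation)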

人工系统智能的演化为了解普适性计算原理提供了独特契机。本研究证明，大型语言模型会自发形成协同核心，其信息整合能力显著超越各组成部分，这一现象与人类大脑高度相似。通过跨多种架构的集成信息分解分析，我们发现中间层表现出协同信息处理，而早期与晚期层则依赖冗余处理。这种组织结构具有动态性，并随着任务难度增加呈现物理相变特征。关键的是，消除协同成分会导致性能崩溃性下降，这证实了协同核心作为抽象推理物理载体的作用，从而架起了人工与生物智能之间的桥梁。

摘要 (Abstract)

The evolution of intelligence in artificial systems provides a unique opportunity to identify universal computational principles. Here we show that large language models spontaneously develop synergistic cores where information integration exceeds individual parts, remarkably similar to the human brain. Using Integrated Information Decomposition across multiple architectures, we find that middle layers exhibit synergistic processing while early and late layers rely on redundancy. This organization is dynamic and emerges as a physical phase transition as task difficulty increases. Crucially, ablating synergistic components causes catastrophic performance loss, confirming their role as the physical entity of abstract reasoning and bridging artificial and biological intelligence.
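The distinction between synergistic and redundant processing can be illustrated with a toy example. The sketch below is not Integrated Information Decomposition and is not taken from the paper; it only shows, under the assumption of two uniform binary inputs and an XOR output, what it means for a whole to carry information that neither part carries alone. The helper names (`mutual_information`, `joint_of`) are hypothetical.

```python
from collections import defaultdict
from itertools import product
from math import log2


def mutual_information(joint):
    """Mutual information I(A;B) in bits from a dict {(a, b): probability}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(
        p * log2(p / (pa[a] * pb[b]))
        for (a, b), p in joint.items()
        if p > 0
    )


# Assumed toy system: two uniform binary inputs and y = x1 XOR x2.
samples = [(x1, x2, x1 ^ x2) for x1, x2 in product((0, 1), repeat=2)]
p_sample = 1.0 / len(samples)


def joint_of(select):
    """Joint distribution of (selected source variable(s), output y)."""
    joint = defaultdict(float)
    for x1, x2, y in samples:
        joint[(select(x1, x2), y)] += p_sample
    return joint


i_x1 = mutual_information(joint_of(lambda x1, x2: x1))          # x1 alone
i_x2 = mutual_information(joint_of(lambda x1, x2: x2))          # x2 alone
i_both = mutual_information(joint_of(lambda x1, x2: (x1, x2)))  # both together

print(f"I(x1;y)={i_x1:.3f}  I(x2;y)={i_x2:.3f}  I(x1,x2;y)={i_both:.3f}")
```

Each input alone tells nothing about the output, while the pair determines it fully; that gap is the intuition behind "synergy". A redundant system would be the opposite case, where each part alone already carries the same information about the output.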

Keywords: Large Language Models, Functional Differentiation, Integrated Information Decomposition, Synergistic Processing, Brain-Like Intelligence, Phase Transition, Abstract Reasoning, Mechanistic Interpretability

56. ❌ Reinforced Reasoning for End-to-End Retrosynthetic Planning

Authors: Chenyang Zuo, Siqi Fan, Yizhen Luo, Zaiqing Nie Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29723v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

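Scoring rationale: ReTriP focuses on retrosynthetic planning in organic chemistry and reformulates the task as a direct Chain-of-Thought reasoning task, so it is highly relevant to 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' (10 points) and involves in-depth reasoning strategies ('System 2 Thinking OR Slow Thinking OR In-depth Reasoning', 8 points). As an application of AI to science, it is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points). The paper does not explicitly mention large models, the technical principles of deep learning, or the other keywords, so those receive 0 points. The weighted total is computed as (10 × 1.0 + 8 × 1.0 + 10 × 1.0) = 28.0.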
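The weighted-total arithmetic stated in the rationale can be reproduced with a short sketch. It assumes, following the report's configuration section, a weight of 1.0 per keyword, 27 keywords in total, and a pass mark of 5.0 + 0.8 × total weight. The function names are hypothetical and are not part of the report generator. The score table for this entry lists every weight as 0.0, which is consistent with its recorded score of 0.0 rather than 28.0.

```python
def weighted_total(relevance, weights):
    """Sum of relevance * weight over all scored keywords."""
    return sum(relevance[name] * weights[name] for name in relevance)


def pass_mark(total_weight, base=5.0, factor=0.8):
    """Pass mark as configured in the report: base + factor * total weight."""
    return base + factor * total_weight


# Non-zero relevance values quoted in the rationale (all others are 0).
relevance = {
    "Chain of Thought": 10.0,
    "System 2 Thinking": 8.0,
    "AI for Science": 10.0,
}
# Assumption: each keyword carries weight 1.0, as in the configuration table.
weights = {name: 1.0 for name in relevance}

total = weighted_total(relevance, weights)
threshold = pass_mark(total_weight=27.0)

print(f"total={total:.1f} threshold={threshold:.1f} passes={total >= threshold}")
```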

!!! tip deepseek-chat TL;DR

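The paper tackles retrosynthetic planning in organic chemistry by introducing ReTriP, a framework that reformulates the task as Chain-of-Thought reasoning and uses reinforcement learning for end-to-end generation, reaching state-of-the-art performance on RetroBench.

摘要翻译 (Abstract Translation)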

逆合成规划是有机化学中的一项基础任务，但由于其组合复杂性，仍然具有挑战性。为解决这一问题，传统方法通常依赖于将单步预测与外部搜索启发式相结合的混合框架，这不可避免地割裂了局部分子转化与全局规划目标之间的逻辑连贯性。为了弥合这一差距，并将复杂的战略前瞻性直接嵌入模型的化学推理中，我们引入了ReTriP，这是一个端到端的生成框架，它将逆合成重新表述为直接的思维链推理任务。我们建立了一种路径连贯的分子表示方法，并采用了一种渐进式训练策略，该策略从推理蒸馏过渡到带有可验证奖励的强化学习，从而有效地将逐步生成与实际路线效用对齐。在RetroBench上的实证评估表明，ReTriP实现了最先进的性能，与混合基线方法相比，在长程规划中展现出更优的鲁棒性。

摘要 (Abstract)

Retrosynthetic planning is a fundamental task in organic chemistry, yet remains challenging due to its combinatorial complexity. To address this, conventional approaches typically rely on hybrid frameworks that combine single-step predictions with external search heuristics, inevitably fracturing the logical coherence between local molecular transformations and global planning objectives. To bridge this gap and embed sophisticated strategic foresight directly into the model’s chemical reasoning, we introduce ReTriP, an end-to-end generative framework that reformulates retrosynthesis as a direct Chain-of-Thought reasoning task. We establish a path-coherent molecular representation and employ a progressive training curriculum that transitions from reasoning distillation to reinforcement learning with verifiable rewards, effectively aligning stepwise generation with practical route utility. Empirical evaluation on RetroBench demonstrates that ReTriP achieves state-of-the-art performance, exhibiting superior robustness in long-horizon planning compared to hybrid baselines.

Keywords: retrosynthetic planning, organic chemistry, Chain-of-Thought reasoning, end-to-end generative framework, reinforcement learning, molecular representation, long-horizon planning, RetroBench

57. ❌ Symphony for Medical Coding: A Next-Generation Agentic System for Scalable and Explainable Medical Coding

Authors: Joakim Edin, Andreas Motzfeldt, Simon Flachs, Lars Maaløe Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29709v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

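Scoring rationale: The paper presents Symphony for Medical Coding, an agentic system based on large language models (LLMs) for medical coding. The system accesses coding guidelines through retrieval-augmented generation (RAG), analyzes clinical narratives with chain-of-thought (CoT) and in-depth reasoning (System 2 Thinking), and acts as an agent (LLM Agents) that uses tools (Tool Use) to carry out coding tasks. It produces explainable output (Explainable AI) and is an AI application in the bioinformatics domain. Other keywords, such as MoE, SFT, and quantization, are not covered in the paper.

!!! tip deepseek-chat TL;DR

The paper presents Symphony for Medical Coding, an LLM-based agentic system that automates medical coding across coding systems through retrieval-augmented generation and in-depth reasoning, and achieves state-of-the-art performance on several real-world datasets.

摘要翻译 (Abstract Translation)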

医疗编码将自由文本临床记录转化为从分类系统中提取的标准化代码，这些系统包含数万条目且每年更新。它是计费、临床研究和质量报告的核心环节，但目前仍主要依赖人工操作，效率低下且易出错。现有的自动化方法通过标注数据学习预测固定代码集，导致其无法适应新代码或不同编码系统，除非使用不同数据重新训练。这些方法也无法为预测提供解释，限制了在安全关键场景中的可信度。我们推出医疗编码系统Symphony，其工作方式模拟专业编码员：通过直接参照编码指南对临床叙述进行推理。该设计使Symphony能够跨任意编码系统运行，并提供片段级证据——将每个预测代码与支持它的文本段落相关联。我们在两个公共基准和三个真实世界数据集上进行评估，涵盖美国与英国的住院、门诊、急诊及专科医疗场景。Symphony在所有场景中均取得最先进的性能表现，为自动化临床编码建立了灵活且可直接部署的基础框架。

摘要 (Abstract)

Medical coding translates free-text clinical documentation into standardized codes drawn from classification systems that contain tens of thousands of entries and are updated annually. It is central to billing, clinical research, and quality reporting, yet remains largely manual, slow, and error-prone. Existing automated approaches learn to predict a fixed set of codes from labeled data, thereby preventing adaptation to new codes or different coding systems without retraining on different data. They also provide no explanation for their predictions, limiting trust in safety-critical settings. We introduce Symphony for Medical Coding, a system that approaches the task the way expert human coders do: by reasoning over the clinical narrative with direct access to the coding guidelines. This design allows Symphony to operate across any coding system and to provide span-level evidence linking each predicted code to the text that supports it. We evaluate on two public benchmarks and three real-world datasets spanning inpatient, outpatient, emergency, and subspecialty settings across the United States and the United Kingdom. Symphony achieves state-of-the-art results across all settings, establishing itself as a flexible, deployment-ready foundation for automated clinical coding.

Keywords: medical coding, large language models, agentic system, retrieval-augmented generation, explainable AI, clinical documentation, automated coding, bioinformatics

58. ❌ Exploring the Impact of Skin Color on Skin Lesion Segmentation

Authors: Kuniko Paxton, Medina Kapo, Amila Akagić, Koorosh Aslansefat, Dhavalkumar Thakker, Yiannis Papadopoulos Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29694v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

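Scoring rationale: The paper focuses on fairness evaluation for skin lesion segmentation and is an application of AI in biomedicine (specifically dermatology). Its core is evaluating how skin tone, and in particular lesion-skin contrast, affects the performance of different segmentation models (UNet, DeepLabV3, DINOv2) on skin lesion segmentation, and it proposes a quantitative analysis based on continuous pigment distributions (Wasserstein distance). The content is unrelated to the great majority of keywords (large-model technical principles, training methods, inference optimization, agents, and so on). The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the study is a direct application of AI in biomedicine (dermatology, adjacent to bioinformatics) and this is the paper's central topic, so it receives 10 points (highly relevant, core content).

!!! tip deepseek-chat TL;DR

The study evaluates the fairness of skin lesion segmentation models and finds that segmentation errors are associated mainly with low lesion-skin contrast rather than with global skin tone categories. It proposes a quantification based on continuous pigment distributions that assesses model bias more effectively than discrete skin tone classes.

摘要翻译 (Abstract Translation)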

皮肤癌，尤其是黑色素瘤，依然是发病率和死亡率的主要原因，这使得早期检测至关重要。人工智能驱动的皮肤病学系统通常依赖皮肤病变分割作为预处理步骤，以将病变区域与周围皮肤区分开来，并为下游分析提供支持。尽管针对肤色在病变分类中的公平性问题已得到广泛研究，但肤色对分割阶段的影响仍未得到充分量化，且常使用粗略、离散的肤色类别进行评估。在本研究中，我们在两个公开的皮肤镜数据集（HAM10000和ISIC2017）上评估了三种强大的分割架构（UNet、基于ResNet50骨干网络的DeepLabV3和DINOv2），并引入了一种连续的色素或对比度分析方法，将像素级的个体类型角度（ITA）值视为分布进行处理。通过计算单张图像内纯皮肤区域、纯病变区域和全图像区域分布之间的瓦瑟斯坦距离，我们量化了病变与皮肤的对比度，并将其与多种指标下的分割性能相关联。在这些数据集所代表的范围内，全局肤色指标（菲茨帕特里克分组或平均ITA）与分割质量仅呈现弱关联。相比之下，低病变-皮肤对比度始终与模型较大的分割误差相关，表明边界模糊和低对比度是导致分割失败的关键因素。这些发现提示，皮肤镜分割中的公平性改进应优先考虑对低对比度病变的稳健处理，且基于分布的色素测量方法比离散的肤色类别提供了更具信息量的审计信号。

摘要 (Abstract)

Skin cancer, particularly melanoma, remains a major cause of morbidity and mortality, making early detection critical. AI-driven dermatology systems often rely on skin lesion segmentation as a preprocessing step to delineate the lesion from surrounding skin and support downstream analysis. While fairness concerns regarding skin tone have been widely studied for lesion classification, the influence of skin tone on the segmentation stage remains under-quantified and is frequently assessed using coarse, discrete skin tone categories. In this work, we evaluate three strong segmentation architectures (UNet, DeepLabV3 with a ResNet50 backbone, and DINOv2) on two public dermoscopic datasets (HAM10000 and ISIC2017) and introduce a continuous pigment or contrast analysis that treats pixel-wise ITA values as distributions. Using Wasserstein distances between within-image distributions for skin-only, lesion-only, and whole-image regions, we quantify lesion-skin contrast and relate it to segmentation performance across multiple metrics. Within the range represented in these datasets, global skin tone metrics (Fitzpatrick grouping or mean ITA) show weak association with segmentation quality. In contrast, low lesion-skin contrast is consistently associated with larger segmentation errors in models, indicating that boundary ambiguity and low contrast are key drivers of failure. These findings suggest that fairness improvements in dermoscopic segmentation should prioritize robust handling of low-contrast lesions, and the distribution-based pigment measures provide a more informative audit signal than discrete skin-tone categories.
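The contrast measure described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes pixels are already converted to CIELAB, uses the standard Individual Typology Angle definition ITA = arctan((L* − 50) / b*) in degrees, and computes the one-dimensional Wasserstein-1 distance for two equal-sized samples as the mean absolute difference of their sorted values. The pixel values and function names are hypothetical.

```python
import math


def ita_degrees(l_star, b_star):
    """Individual Typology Angle in degrees from CIELAB L* and b*.

    atan2 equals arctan((L* - 50) / b*) whenever b* > 0 and avoids
    a division by zero when b* is 0.
    """
    return math.degrees(math.atan2(l_star - 50.0, b_star))


def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-sized 1-D samples."""
    if not xs or len(xs) != len(ys):
        raise ValueError("samples must be non-empty and of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


# Hypothetical (L*, b*) values for skin-only and lesion-only pixels.
skin_lab = [(70.0, 18.0), (68.0, 17.0), (72.0, 19.0), (69.0, 18.0)]
lesion_lab = [(40.0, 12.0), (38.0, 10.0), (45.0, 14.0), (42.0, 11.0)]

skin_ita = [ita_degrees(l, b) for l, b in skin_lab]
lesion_ita = [ita_degrees(l, b) for l, b in lesion_lab]

# A larger distance means the lesion stands out more from surrounding skin.
contrast = wasserstein_1d(skin_ita, lesion_ita)
print(f"lesion-skin contrast (W1 over ITA): {contrast:.2f} degrees")
```

Real regions rarely contain the same number of pixels, so a practical version would need a distance that handles unequal sample sizes, for example by comparing quantiles or by using a library routine.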

Keywords: skin lesion segmentation, fairness, skin tone, contrast analysis, dermoscopic datasets, Wasserstein distance, segmentation performance, AI in dermatology

59. ❌ Measuring the metacognition of AI

Authors: Richard Servajean, Philippe Servajean Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29693v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

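Scoring rationale: The paper's core topic is evaluating the metacognitive abilities of LLMs. It directly involves 'Large Language Models' (the experiments use three LLMs, including GPT-5) and 'Self-Correction/Self-Improvement/Self-Reflection' (it studies the metacognitive ability of LLMs to assess the reliability of their own decisions); both are highly relevant and receive 10 points. Other keywords, such as MoE, quantization, and inference acceleration, are not mentioned in the abstract and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes using the meta-d' framework and signal detection theory to measure the metacognitive abilities of large language models, that is, their ability to assess the reliability of their own decisions. Experiments on three LLMs, including GPT-5, show that these methods can compare metacognitive sensitivity and risk-regulation behavior across models and tasks.

摘要翻译 (Abstract Translation)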

一个稳健的决策过程必须考虑不确定性，尤其是在选择涉及固有风险时。由于人工智能系统日益融入决策工作流，管理不确定性越来越依赖于这些系统的元认知能力，即其评估自身决策可靠性并进行调节的能力。因此，采用稳健的方法来衡量AI的元认知能力至关重要。本文主要是一项方法论贡献，主张采用元d’框架或其免模型替代方案，作为评估AI元认知敏感性的黄金标准——即生成能区分正确与错误回答的信心评级的能力。此外，我们建议利用信号检测理论来衡量AI基于不确定性和风险自发调节其决策的能力。为了证明这些心理物理学框架的实际效用，我们在三个大语言模型上进行了两组实验：GPT-5、DeepSeek-V3.2-Exp和Mistral-Medium-2508。在第一组实验中，LLMs先进行主要判断，随后给出信心评级。在第二组实验中，LLMs仅进行主要判断，同时我们操纵了与不同回答相关的风险。一方面，应用元d’框架使我们能够沿三个维度进行比较：将LLM与最优性能比较、在不同LLM之间针对给定任务进行比较，以及在同一LLM跨不同任务时进行比较。另一方面，信号检测理论使我们能够评估LLMs在风险较高时是否变得更加保守。

摘要 (Abstract)

A robust decision-making process must take into account uncertainty, especially when the choice involves inherent risks. Because artificial intelligence (AI) systems are increasingly integrated into decision-making workflows, managing uncertainty relies more and more on the metacognitive capabilities of these systems; i.e., their ability to assess the reliability of and regulate their own decisions. Hence, it is crucial to employ robust methods to measure the metacognitive abilities of AI. This paper is primarily a methodological contribution arguing for the adoption of the meta-d’ framework, or its model-free alternatives, as the gold standard for assessing the metacognitive sensitivity of AIs: the ability to generate confidence ratings that distinguish correct from incorrect responses. Moreover, we propose to leverage signal detection theory (SDT) to measure the ability of AIs to spontaneously regulate their decisions based on uncertainty and risk. To demonstrate the practical utility of these psychophysical frameworks, we conduct two series of experiments on three large language models (LLMs): GPT-5, DeepSeek-V3.2-Exp, and Mistral-Medium-2508. In the first experiments, LLMs performed a primary judgment followed by a confidence rating. In the second, LLMs only performed the primary judgment, while we manipulated the risk associated with either response. On the one hand, applying the meta-d’ framework allows us to conduct comparisons along three axes: comparing an LLM to optimality, comparing different LLMs on a given task, and comparing the same LLM across different tasks. On the other hand, SDT allows us to assess whether LLMs become more conservative when risks are high.
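The signal detection quantities behind the second set of experiments can be sketched with the standard equal-variance Gaussian model: sensitivity d' = z(H) − z(FA) and criterion c = −(z(H) + z(FA)) / 2, where a larger c indicates a more conservative responder. The sketch below is an illustration under stated assumptions, not the authors' analysis code: the response counts are invented, the log-linear correction is one common choice for avoiding rates of exactly 0 or 1, and fitting meta-d' itself is not shown.

```python
from statistics import NormalDist


def sdt_measures(hits, misses, false_alarms, correct_rejections):
    """Return (d_prime, criterion) under the equal-variance Gaussian model."""
    # Log-linear correction: add 0.5 to each count, 1 to each total.
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)

    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    d_prime = z(hit_rate) - z(fa_rate)
    criterion = -0.5 * (z(hit_rate) + z(fa_rate))
    return d_prime, criterion


# Hypothetical counts for one model under low-risk and high-risk framing.
d_low, c_low = sdt_measures(
    hits=40, misses=10, false_alarms=10, correct_rejections=40
)
d_high, c_high = sdt_measures(
    hits=30, misses=20, false_alarms=5, correct_rejections=45
)

print(f"low risk : d'={d_low:.2f}  c={c_low:.2f}")
print(f"high risk: d'={d_high:.2f}  c={c_high:.2f}")
print("more conservative under high risk:", c_high > c_low)
```

Comparing the criterion across risk conditions, while checking that d' stays roughly comparable, is the kind of test that separates a shift in response bias from a change in underlying sensitivity.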

Keywords: metacognition, large language models, uncertainty, confidence ratings, signal detection theory, decision-making, risk regulation, AI evaluation

60. ❌ A First Step Towards Even More Sparse Encodings of Probability Distributions

Authors: Florian Andreas Marwitz, Tanya Braun, Ralf Möller Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29691v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	8.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

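Scoring rationale: The paper studies sparse encodings of probability distributions, extracting first-order logic formulas to reduce the number of values needed to represent a distribution and thereby increase the sparsity of the encoding. This relates to the 'Sparse Models' concept in the keyword 'Mixture of Experts OR MoE OR Sparse Models', because the paper's central goal is a sparser encoding. However, the paper does not involve large language models (LLMs), deep learning, AI for Science, or other specific large-model techniques, so no other keyword is relevant.

!!! tip deepseek-chat TL;DR

The paper proposes a method for extracting first-order logic formulas from probability distributions. By reducing the number of values in a distribution and minimizing the logical formulas, it greatly increases the sparsity of the encoding while preserving core information.

摘要翻译 (Abstract Translation)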

现实世界场景可通过提升概率分布进行建模。然而，分布通常以表格或列表形式编码，需要指数级数量的参数。为此，我们提出一种从概率分布中提取一阶逻辑公式的方法：首先通过减少分布中的参数数量，随后为每个参数提取逻辑公式并进行最小化处理，从而显著降低所需参数数量。这种约简与最小化过程能够在泛化给定分布的同时，显著提升编码的稀疏性。实验评估表明，通过提取少量简洁公式，可在保留核心信息的前提下实现编码稀疏性的极大提升。

摘要 (Abstract)

Real world scenarios can be captured with lifted probability distributions. However, distributions are usually encoded in a table or list, requiring an exponential number of values. Hence, we propose a method for extracting first-order formulas from probability distributions that require significantly fewer values by reducing the number of values in a distribution and then extracting, for each value, a logical formula to be further minimized. This reduction and minimization allows for increasing the sparsity in the encoding while also generalizing a given distribution. Our evaluation shows that sparsity can increase immensely by extracting a small set of short formulas while preserving core information.

Keywords: probability distributions, sparse encodings, first-order formulas, logical formula minimization, value reduction, generalization, lifted probability distributions, exponential values

61. ❌ KEditVis: A Visual Analytics System for Knowledge Editing of Large Language Models

Authors: Zhenning Chen, Hanbei Zhan, Yanwei Huang, Xin Wu, Dazhen Deng, Di Weng, Yingcai Wu Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29689v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

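Scoring rationale: The paper's core topic is knowledge editing for correcting factual errors in LLMs, which is highly relevant to 'Large Language Models' (10 points). Knowledge editing aims to improve factual accuracy and is directly relevant to 'Hallucination Mitigation' (10 points). The system uses visual analytics to make the editing process more interpretable, which is relevant to 'Mechanistic Interpretability' (10 points). Other keywords, such as MoE, SFT, and RAG, are not covered in the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

To address the problem that large language models may give incorrect information in factual question answering, the paper proposes KEditVis, a visual analytics system whose interactive visualizations help users understand the knowledge editing process in more depth, improving editing outcomes and yielding insights for the future development of knowledge editing algorithms.

摘要翻译 (Abstract Translation)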

大型语言模型（LLMs）在事实性问答任务中展现出卓越的能力，但有时仍会提供错误回答。为解决这一问题，知识编辑技术已成为修正LLMs中事实信息的有效方法。然而，典型的知识编辑流程难以确定用于编辑的最佳模型层集合，且依赖的摘要指标提供的指导不足。这种透明度的缺乏阻碍了对不同编辑策略的有效比较与最优方案的识别。本文提出KEditVis，一种新颖的可视化分析系统，旨在通过交互式可视化帮助用户深入理解知识编辑过程，提升编辑效果，并为知识编辑算法的未来发展发掘有价值的洞见。借助KEditVis，用户可以选择合适的层作为编辑目标，探究编辑失效背后的原因，并执行更具针对性、更有效的编辑。我们通过使用场景分析、专家访谈和用户研究进行的评估验证了该系统的有效性和可用性。

摘要 (Abstract)

Large Language Models (LLMs) demonstrate exceptional capabilities in factual question answering, yet they sometimes provide incorrect responses. To address this issue, knowledge editing techniques have emerged as effective methods for correcting factual information in LLMs. However, typical knowledge editing workflows struggle with identifying the optimal set of model layers for editing and rely on summary indicators that provide insufficient guidance. This lack of transparency hinders effective comparison and identification of optimal editing strategies. In this paper, we present KEditVis, a novel visual analytics system designed to assist users in gaining a deeper understanding of knowledge editing through interactive visualizations, improving editing outcomes, and discovering valuable insights for the future development of knowledge editing algorithms. With KEditVis, users can select appropriate layers as the editing target, explore the reasons behind ineffective edits, and perform more targeted and effective edits. Our evaluation, including usage scenarios, expert interviews, and a user study, validates the effectiveness and usability of the system.

Keywords: Knowledge Editing, Large Language Models, Visual Analytics, Factual Question Answering, Model Layer Selection, Interactive Visualization, Editing Strategies

62. ❌ Beyond the Steeper Curve: AI-Mediated Metacognitive Decoupling and the Limits of the Dunning-Kruger Metaphor

Authors: Christopher Koch Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29681v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

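Scoring rationale: The paper explicitly studies how large language models (LLMs) affect human cognition and task performance, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). It does not address the specific techniques, methods, or applications behind the other keywords, such as MoE, SLMs, training techniques, inference optimization, agent systems, compression, or AI for science, so those keywords score 0.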
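
As a rough sketch of how the numbers in a score table might combine: the passing-score formula below is the one stated in the report's configuration header (5.0 + 0.8 × total weight), while the per-keyword formula (weight × relevance/10 × 15) is only an assumption inferred from the stated maximum of 15 points per keyword. The function names are illustrative and are not taken from the report generator.

```python
def keyword_score(weight: float, relevance: float, max_per_keyword: float = 15.0) -> float:
    """Assumed per-keyword score: linear in weight and in relevance (0-10 scale)."""
    return weight * (relevance / 10.0) * max_per_keyword


def passing_score(total_weight: float) -> float:
    """Passing threshold as given in the report header: 5.0 + 0.8 x total weight."""
    return 5.0 + 0.8 * total_weight


# With 27 keywords of weight 1.0 each, the threshold rounds to 26.6.
print(round(passing_score(27.0), 1))

# A weight of 0.0 gives a score of 0.0 even at 10/10 relevance.
print(keyword_score(weight=0.0, relevance=10.0))
```

Under this assumed formula, a row with weight 0.0 contributes nothing to the total regardless of its relevance rating, which is consistent with the table rows shown for this paper.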

!!! tip deepseek-chat TL;DR

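The paper examines how LLM use affects human task performance and metacognitive accuracy, and proposes a model of AI-mediated metacognitive decoupling to explain phenomena such as overconfidence and miscalibrated reliance, rather than treating AI as simply amplifying the Dunning-Kruger effect.

Abstract (Chinese translation)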

关于生成式人工智能仅会放大邓宁-克鲁格效应的普遍论断过于笼统，无法涵盖现有证据。最清晰的研究结果表明，大型语言模型（LLM）的使用能够改善可观测的输出结果与短期任务表现，但同时会削弱元认知准确性，并拉平不同技能群体间经典的能力-信心梯度。本文综合了人机交互、学习研究与模型评估领域的证据，提出了“人工智能中介的元认知解耦”这一工作模型：即产出结果、底层理解、校准准确性与自我评估能力四者间日益扩大的分离。相较于“邓宁-克鲁格曲线普遍陡峭化”这一简化隐喻，这一四变量框架能更有效地解释过度自信、过度依赖与依赖不足、拐杖效应以及弱迁移现象。文章最后探讨了该模型对工具设计、评估及知识工作的启示。

Abstract

The common claim that generative AI simply amplifies the Dunning-Kruger effect is too coarse to capture the available evidence. The clearest findings instead suggest that large language model (LLM) use can improve observable output and short-term task performance while degrading metacognitive accuracy and flattening the classic competence-confidence gradient across skill groups. This paper synthesizes evidence from human-AI interaction, learning research, and model evaluation, and proposes the working model of AI-mediated metacognitive decoupling: a widening gap among produced output, underlying understanding, calibration accuracy, and self-assessed ability. This four-variable account better explains overconfidence, over- and under-reliance, crutch effects, and weak transfer than the simpler metaphor of a uniformly steeper Dunning-Kruger curve. The paper concludes with implications for tool design, assessment, and knowledge work.

Keywords: large language models, LLMs, metacognitive decoupling, Dunning-Kruger effect, human-AI interaction, overconfidence, task performance, calibration accuracy

63. ❌ View-oriented Conversation Compiler for Agent Trace Analysis

Authors: Lvmin Zhang, Maneesh Agrawala Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29678v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

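Scoring rationale: The paper's core contribution is VCC, a compiler that structures agent conversation traces to improve context learning. It is highly relevant to 'LLM Agents' (10 points) because it deals specifically with agent conversation traces, and to 'In-context Learning' (10 points) because the experiments show that VCC improves context-learning results. It is relevant to 'Chain of Thought' (8 points) because agent conversations include reasoning blocks, and to 'Tool Use' (8 points) because of tool calls. It is partly relevant to 'Context Window Extension' and 'Multi-agent Systems' (5 points each), through context-window compaction and sub-agent invocations respectively, and to 'Large Language Models' (5 points) because context learning typically involves large models. The remaining keywords are unrelated to the paper or not mentioned in it and score 0.

!!! tip deepseek-chat TL;DR

The paper presents the VCC compiler, which converts raw agent conversation logs into structured views. Swapping the reflector's input from raw JSONL to VCC views raises context-learning pass rates while cutting reflector token consumption by half to two-thirds.

Abstract (Chinese translation)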

在情境学习与框架驱动的智能体认知时代，智能体轨迹的分析价值日益凸显，然而先前研究大多将会话格式视为无关紧要的工程细节。现代智能体会话包含深度结构化的内容，如嵌套工具调用与结果、思维链推理块、子智能体调用、上下文窗口压缩边界以及框架注入的系统指令，其复杂性远超简单的用户-助手对话。若将此类轨迹以纯文本、JSON、YAML格式或通过grep直接输入反思器（reflector）或其他分析机制，会显著降低分析质量。本文提出VCC（面向视图的会话编译器），该编译器通过完整的编译流程（词法分析、语法分析、中间表示、代码优化、目标生成）将原始智能体JSONL日志转化为一系列结构化视图：完整视图（作为标准行号坐标系的无损记录）、用户界面视图（还原用户实际感知的交互过程）以及自适应视图（由相关性谓词控制的结构保持投影）。在AppWorld的情境学习实验中，仅将反思器的输入格式从原始JSONL替换为VCC编译视图，即可使所有三种测试模型配置的通过率提升，同时将反思器的令牌消耗减少二分之一至三分之二，并生成更精炼的学习记忆。这些结果表明，消息格式应被视为情境学习的基础设施，而非偶然的实现选择。

Abstract

Agent traces carry increasing analytical value in the era of context learning and harness-driven agentic cognition, yet most prior work treats conversation format as a trivial engineering detail. Modern agent conversations contain deeply structured content, including nested tool calls and results, chain-of-thought reasoning blocks, sub-agent invocations, context-window compaction boundaries, and harness-injected system directives, whose complexity far exceeds that of simple user-assistant exchanges. Feeding such traces to a reflector or other analytical mechanism in plain text, JSON, YAML, or via grep can materially degrade analysis quality. This paper presents VCC (View-oriented Conversation Compiler), a compiler (lex, parse, IR, lower, emit) that transforms raw agent JSONL logs into a family of structured views: a full view (lossless transcript serving as the canonical line-number coordinate system), a user-interface view (reconstructing the interaction as the user actually perceived it), and an adaptive view (a structure-preserving projection governed by a relevance predicate). In a context-learning experiment on AppWorld, replacing only the reflector’s input format, from raw JSONL to VCC-compiled views, leads to higher pass rates across all three model configurations tested, while cutting reflector token consumption by half to two-thirds and producing more concise learned memory. These results suggest that message format functions as infrastructure for context learning, not as an incidental implementation choice.

Keywords: agent traces, conversation compiler, structured views, context learning, tool calls, chain-of-thought, agentic cognition, JSONL logs
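
To make the idea of views over a JSONL trace more concrete, here is a toy sketch. It is not VCC's compiler pipeline: it only parses JSONL into numbered events (a stand-in for the full view's line-number coordinate system) and then filters them with a relevance predicate while keeping the original numbers (a stand-in for the adaptive view). The field names `type` and `content` and all function names are hypothetical.

```python
import json
from typing import Callable, Iterable

Event = dict
View = list[tuple[int, Event]]


def full_view(jsonl_lines: Iterable[str]) -> View:
    """Parse every non-empty JSONL line, keeping its 1-based line number."""
    view: View = []
    for number, raw in enumerate(jsonl_lines, start=1):
        if raw.strip():
            view.append((number, json.loads(raw)))
    return view


def adaptive_view(view: View, is_relevant: Callable[[Event], bool]) -> View:
    """Keep only relevant events; line numbers still refer to the full view."""
    return [(number, event) for number, event in view if is_relevant(event)]


def render(view: View) -> str:
    """Render a view as text, one numbered line per event."""
    return "\n".join(
        f"{number:>4} [{event.get('type', '?')}] {event.get('content', '')}"
        for number, event in view
    )


# Hypothetical trace with made-up fields, for illustration only.
raw_log = [
    '{"type": "user", "content": "Book a table for two"}',
    '{"type": "tool_call", "content": "search_restaurants"}',
    '{"type": "tool_result", "content": "3 matches"}',
    '{"type": "assistant", "content": "I found three options."}',
]

tool_events = adaptive_view(full_view(raw_log), lambda e: e["type"].startswith("tool"))
print(render(tool_events))
```

Because the filtered view keeps the original line numbers, anything a downstream reader says about it can still be traced back to the full transcript.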

64. ❌ Mind the Gap: A Framework for Assessing Pitfalls in Multimodal Active Learning

Authors: Dustin Eisenhardt, Yunhee Jeong, Florian Buettner Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29677v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

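Scoring rationale: The paper studies a benchmarking framework for multimodal active learning and its challenges, focusing on evaluating active learning strategies for neural networks on multimodal data. It does not involve large language models, new deep learning techniques, or scientific applications. All of the keywords concern large-model techniques, deep learning innovations, or AI-for-science applications, whereas this paper concentrates on multimodal active learning with conventional neural networks, so it has no direct connection to any of them.

!!! tip deepseek-chat TL;DR

The paper proposes a benchmarking framework for assessing pitfalls in multimodal active learning. It finds that models come to rely on a single modality, that existing query methods do not mitigate this, and that multimodal strategies do not consistently outperform unimodal ones.

Abstract (Chinese translation)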

多模态学习使神经网络能够整合来自异构源的信息，但该场景下的主动学习面临独特挑战。这些挑战包括模态缺失、模态难度差异以及多变的交互结构——这些在单模态情境中并不存在。尽管主动学习策略在单模态环境中的行为已得到充分刻画，但其在多模态条件下的表现仍鲜为人知。我们提出了一种新的多模态主动学习基准框架，通过合成数据集分离这些潜在问题，从而在无混杂噪声干扰的情况下进行系统性评估。利用该框架，我们比较了单模态与多模态查询策略，并在两个真实世界数据集上验证了研究结果。实验表明，模型持续形成不平衡的表征，主要依赖单一模态而忽视其他模态。现有查询方法未能缓解此效应，且多模态策略并未持续优于单模态策略。这些发现揭示了当前主动学习方法的局限性，并强调需要开发能显式应对这些问题的模态感知查询策略。代码与基准资源将公开提供。

Abstract

Multimodal learning enables neural networks to integrate information from heterogeneous sources, but active learning in this setting faces distinct challenges. These include missing modalities, differences in modality difficulty, and varying interaction structures. These are issues absent in the unimodal case. While the behavior of active learning strategies in unimodal settings is well characterized, their behavior under such multimodal conditions remains poorly understood. We introduce a new framework for benchmarking multimodal active learning that isolates these pitfalls using synthetic datasets, allowing systematic evaluation without confounding noise. Using this framework, we compare unimodal and multimodal query strategies and validate our findings on two real-world datasets. Our results show that models consistently develop imbalanced representations, relying primarily on one modality while neglecting others. Existing query methods do not mitigate this effect, and multimodal strategies do not consistently outperform unimodal ones. These findings highlight limitations of current active learning methods and underline the need for modality-aware query strategies that explicitly address these pitfalls. Code and benchmark resources will be made publicly available.

Keywords: multimodal active learning, benchmark framework, modality imbalance, query strategies, synthetic datasets, neural networks, representation learning, real-world validation
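
For readers unfamiliar with query strategies, the following minimal sketch shows one common unimodal baseline, entropy-based uncertainty sampling. It is a generic illustration and not one of the strategies or the benchmark code from the paper; the function names and the toy probabilities are made up.

```python
from math import log


def entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * log(p) for p in probs if p > 0)


def select_queries(pool_probs: dict[str, list[float]], k: int) -> list[str]:
    """Return the k unlabeled samples whose predictions are most uncertain."""
    ranked = sorted(pool_probs, key=lambda s: entropy(pool_probs[s]), reverse=True)
    return ranked[:k]


# Toy predicted probabilities for three unlabeled samples (illustrative only).
pool = {
    "a": [0.5, 0.5],
    "b": [0.9, 0.1],
    "c": [0.99, 0.01],
}
print(select_queries(pool, k=2))
```

A strategy like this looks only at the fused prediction, so it has no way to notice which modality the model is actually relying on.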

65. ❌ Agenda-based Narrative Extraction: Steering Pathfinding Algorithms with Large Language Models

Authors: Brian Felipe Keith-Norambuena, Carolina Inés Rojas-Córdova, Claudio Juvenal Meneses-Villegas, Elizabeth Johanna Lam-Esquenazi, Angélica María Flores-Bustos, Ignacio Alejandro Molina-Villablanca, Joshua Emanuel Leyton-Vallejos Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29661v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

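Scoring rationale: The paper's core idea is to use large language models (LLMs) to steer a narrative extraction algorithm, integrating an LLM into the pathfinding process to rank candidate documents against an agenda. Only 'Large Language Models OR LLMs OR Foundation Models' is therefore highly relevant (10 points), since the LLM is a core component of the method. The other keywords, such as MoE, SLMs, training techniques, inference optimization, agent systems, and AI for science, are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

The study proposes agenda-based narrative extraction, which integrates large language models into the Narrative Trails pathfinding process to steer storyline construction toward a user-specified agenda. On semantic agendas it raises agenda alignment by 9.9% over keyword matching, at a 2.2% cost in coherence.

Abstract (Chinese translation)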

现有叙事提取方法在连贯性、交互性与多故事线支持之间存在权衡。叙事地图通过其覆盖约束的副产品实现了丰富的交互并生成多故事线，但这以牺牲单一路径连贯性为代价。叙事轨迹通过最大容量路径优化实现了高连贯性，但未提供用户引导或多视角机制。我们提出基于议程的叙事提取方法，通过将大语言模型整合至叙事轨迹路径查找过程来弥合这一鸿沟，从而将故事线构建引导至用户指定的视角。我们的方法在每一步使用大语言模型根据候选文档与给定议程的契合度进行排序，同时保持叙事连贯性。通过在不同议程下运行算法，可在同一语料库中生成不同的故事线。我们在新闻文章语料库上使用Claude Opus 4.5和GPT 5.1作为大语言模型评估员进行评估，测量了64个端点对和6种议程下的连贯性与议程契合度。在语义议程上，大语言模型引导比关键词匹配的契合度高出9.9%（p=0.017），其中在“政权镇压”议程上提升达13.3%（p=0.037），而关键词匹配在具有字面关键词重叠的议程上仍具竞争力。连贯性代价极小：与无议程基线相比，大语言模型引导仅降低2.2%的连贯性。与源材料相悖的反向议程在所有方法中均获得一致低分（2.2-2.5），证实引导机制无法编造缺乏支撑的叙事。

Abstract

Existing narrative extraction methods face a trade-off between coherence, interactivity, and multi-storyline support. Narrative Maps supports rich interaction and generates multiple storylines as a byproduct of its coverage constraints, though this comes at the cost of individual path coherence. Narrative Trails achieves high coherence through maximum capacity path optimization but provides no mechanism for user guidance or multiple perspectives. We introduce agenda-based narrative extraction, a method that bridges this gap by integrating large language models into the Narrative Trails pathfinding process to steer storyline construction toward user-specified perspectives. Our approach uses an LLM at each step to rank candidate documents based on their alignment with a given agenda while maintaining narrative coherence. Running the algorithm with different agendas yields different storylines through the same corpus. We evaluated our approach on a news article corpus using LLM judges with Claude Opus 4.5 and GPT 5.1, measuring both coherence and agenda alignment across 64 endpoint pairs and 6 agendas. LLM-driven steering achieves 9.9% higher alignment than keyword matching on semantic agendas (p=0.017), with 13.3% improvement on Regime Crackdown specifically (p=0.037), while keyword matching remains competitive on agendas with literal keyword overlap. The coherence cost is minimal: LLM steering reduces coherence by only 2.2% compared to the agenda-agnostic baseline. Counter-agendas that contradict the source material score uniformly low (2.2-2.5) across all methods, confirming that steering cannot fabricate unsupported narratives.

Keywords: narrative extraction, large language models, pathfinding algorithms, agenda-based steering, storyline construction, coherence, alignment evaluation
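
The abstract mentions maximum capacity path optimization. As a hedged illustration of that general technique (not the Narrative Trails implementation, and without the LLM ranking step), the sketch below finds the path whose weakest edge is as strong as possible, using a Dijkstra-style search. The graph, the edge weights standing in for coherence scores, and the function name are all hypothetical.

```python
import heapq


def max_capacity_path(graph: dict, source: str, target: str):
    """Return (path, capacity) maximizing the minimum edge weight along the path.

    graph maps node -> {neighbor: capacity}; capacities are assumed positive.
    """
    best = {source: float("inf")}   # best bottleneck found so far per node
    parent = {}
    heap = [(-float("inf"), source)]  # max-heap via negated capacities
    while heap:
        neg_cap, node = heapq.heappop(heap)
        cap = -neg_cap
        if node == target:
            break
        if cap < best.get(node, 0.0):
            continue  # stale heap entry
        for neighbor, weight in graph.get(node, {}).items():
            new_cap = min(cap, weight)
            if new_cap > best.get(neighbor, 0.0):
                best[neighbor] = new_cap
                parent[neighbor] = node
                heapq.heappush(heap, (-new_cap, neighbor))
    if target not in best:
        return None, 0.0
    path = [target]
    while path[-1] != source:
        path.append(parent[path[-1]])
    return path[::-1], best[target]


# Toy document graph; weights are made-up pairwise coherence scores.
docs = {
    "A": {"B": 0.9, "C": 0.6},
    "B": {"D": 0.4},
    "C": {"D": 0.7},
}
print(max_capacity_path(docs, "A", "D"))
```

In this toy graph the route through B starts with a strong link but contains a weak one (0.4), so the search prefers the route through C, whose weakest link is 0.6. An agenda-aware variant would need some way to re-rank the candidate neighbors at each step, which this sketch does not attempt.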

66. ❌ Semantic Interaction for Narrative Map Sensemaking: An Insight-based Evaluation

Authors: Brian Felipe Keith-Norambuena, Fausto German, Eric Krokos, Sarah Joseph, Chris North Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29651v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

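Scoring rationale: The paper evaluates semantic interaction (SI) for narrative map sensemaking and belongs to human-computer interaction (HCI) and visual analytics. It involves a user study, the integration of analysts' cognitive processes, and interaction with AI models, but mentions no large models, deep learning techniques, or AI-for-science applications. All of the scoring keywords focus on large-model technology, training methods, inference optimization, agent systems, and similar topics, which are unrelated to the paper's HCI and visualization focus, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

Through a user study, the paper evaluates the effectiveness of semantic interaction for narrative map sensemaking. It finds that the map-based prototypes yield more insights than the timeline baseline, and that semantic interaction offers an alternative pathway for model refinement.

Abstract (Chinese translation)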

语义交互（Semantic Interaction，SI）允许分析人员通过直接操作可视化视图，将认知过程融入人工智能模型。尽管已有针对叙事提取的语义交互框架被提出，但其有效性的实证评估仍较为有限。本文通过一项用户研究评估语义交互在叙事地图意义建构中的应用，研究设置了三种实验条件：时间线基线组、基础叙事地图组以及具备语义交互功能的交互式叙事地图组，共33名参与者参与。结果显示，基于地图的原型系统比时间线基线产生了更多洞察，其中具备语义交互功能的条件达到统计显著性，而基础地图条件也呈现相同趋势。具备语义交互功能的条件表现出最高的平均绩效；两种地图条件之间的差异虽未达到统计显著性，但显示出较大的效应量（d > 0.8），表明本研究统计效力不足以检测此类差异。定性分析识别出两种不同的语义交互使用模式——修正型与增补型，它们使分析人员能够对提取的叙事施加质量判断和组织结构。我们还发现，语义交互用户以更少的参数操作实现了可比的探索广度，这表明语义交互为模型优化提供了替代路径。本研究为“基于地图的呈现方式在叙事意义建构中优于时间线”提供了实证证据，并对分析人员如何利用语义交互进行叙事优化提供了定性见解。

Abstract

Semantic interaction (SI) enables analysts to incorporate their cognitive processes into AI models through direct manipulation of visualizations. While SI frameworks for narrative extraction have been proposed, empirical evaluations of their effectiveness remain limited. This paper presents a user study that evaluates SI for narrative map sensemaking, involving 33 participants under three conditions: a timeline baseline, a basic narrative map, and an interactive narrative map with SI capabilities. The results show that the map-based prototypes yielded more insights than the timeline baseline, with the SI-enabled condition reaching statistical significance and the basic map condition trending in the same direction. The SI-enabled condition showed the highest mean performance; differences between the map conditions were not statistically significant but showed large effect sizes (d > 0.8), suggesting that the study was underpowered to detect them. Qualitative analysis identified two distinct SI approaches, corrective and additive, that enable analysts to impose quality judgments and organizational structure on extracted narratives. We also find that SI users achieved comparable exploration breadth with less parameter manipulation, suggesting that SI serves as an alternative pathway for model refinement. This work provides empirical evidence that map-based representations outperform timelines for narrative sensemaking, along with qualitative insights into how analysts use SI for narrative refinement.

Keywords: Semantic Interaction, Narrative Map, Sensemaking, User Study, Visualization, Insight Evaluation, Model Refinement, Human-AI Interaction
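
The abstract reports effect sizes as Cohen's d. As a reminder of what that statistic measures, here is a minimal sketch of the standard pooled-standard-deviation formula. The two samples are made-up numbers chosen for easy arithmetic; they are not data from the study.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a: list[float], group_b: list[float]) -> float:
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Illustrative insight counts for two hypothetical conditions.
condition_a = [2, 4, 6]
condition_b = [1, 3, 5]
print(cohens_d(condition_a, condition_b))
```

Because d is scaled by the spread of the data rather than by sample size, a large d can coexist with a non-significant test when groups are small, which is the situation the abstract describes as underpowered.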

67. ❌ Optimizing Donor Outreach for Blood Collection Sessions: A Scalable Decision Support Framework

Authors: André Carneiro, Pedro T. Monteiro, Rui Henriques Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29643v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

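Scoring rationale: The paper studies operational optimization for blood donation centers, using classical operations research methods (BILP and a greedy heuristic) to solve a donor invitation scheduling problem. It does not touch on large models, deep learning, AI techniques, or AI for science, so none of the keywords is relevant.

!!! tip deepseek-chat TL;DR

The paper proposes an optimization framework for the operational problem of assigning donors to collection sessions across a multi-site blood donation network. Both a BILP formulation and a greedy heuristic narrow the supply-demand gap, and the greedy heuristic achieves comparable results at far lower computational cost (188x less peak memory, 115x faster runtime).

Abstract (Chinese translation)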

献血中心在匹配供需关系与管理献血者可用性方面面临挑战。尽管定向招募至关重要，但过度邀约可能导致献血者疲劳。有效的招募需要在合适的时间定位合适的献血者，在各项约束条件与献血者便利性及资格要求之间取得平衡。尽管在血液供应链优化方面已有大量研究，且算法化献血者招募日益受到关注，但如何在多站点网络中统筹考虑献血者资格、站点容量、血型需求目标、地理便利性及献血者安全等因素，将献血者分配至具体献血时段这一运营问题，仍未得到解决。
针对这一空白，我们提出了一个包含献血者资格、出行便利性、血型需求目标及惩罚机制的献血邀约调度优化框架。我们评估了两种策略：（一）二进制整数线性规划（BILP）模型；（二）高效贪心启发式算法。评估使用葡萄牙血液与移植研究所（IPST）的登记数据，以4个月为周期对里斯本运营区域进行邀约规划。一个前瞻性流程整合了自然到场预测、基于分位数的需求目标以及剩余容量估算，以制定具有前瞻性的邀约计划。结果表明，该框架在缩小里斯本运营区域供需差距方面发挥关键作用。一项受控对比显示，贪心启发式算法取得了与BILP模型相近的结果，其峰值内存占用降低188倍，运行速度提升115倍；其权衡之处包括需求满足率降低3.9个百分点（86.1% 对比 90.0%）、献血者至献血点距离增大、具有不良反应风险的献血者暴露度更高，以及每位非高频献血者承受的邀约负担更重，这反映了局部优化与全局优化的差异。实验评估了具备约束感知能力的调度如何通过动员符合条件的非活跃/即将流失献血者来弥补供需缺口。

Abstract

Blood donation centers face challenges in matching supply with demand while managing donor availability. Although targeted outreach is important, it can cause donor fatigue via over-solicitation. Effective recruitment requires targeting the right donors at the right time, balancing constraints with donor convenience and eligibility. Despite extensive work on blood supply chain optimization and growing interest in algorithmic donor recruitment, the operational problem of assigning donors to sessions across a multi-site network, taking into account eligibility, capacity, blood-type demand targets, geographic convenience, and donor safety, remains unaddressed. We address this gap with an optimization framework for donor invitation scheduling incorporating donor eligibility, travel convenience, blood-type demand targets, and penalties. We evaluate two strategies: (i) a binary integer linear programming (BILP) formulation and (ii) an efficient greedy heuristic. Evaluation uses the registry from Instituto Português do Sangue e da Transplantação (IPST) for invite planning in the Lisbon operational region using 4-month windows. A prospective pipeline integrates organic attendance forecasting, quantile-based demand targets, and residual capacity estimation for forward-looking invitation plans. Results reveal its key role in closing the supply-demand gap in the Lisbon operational region. A controlled comparison shows that the greedy heuristic achieves results comparable to the BILP, with 188x less peak memory and 115x faster runtime; trade-offs include 3.9 pp lower demand fulfillment (86.1% vs. 90.0%), larger donor-session distance, higher adverse-reaction donor exposure, and greater invitation burden per non-high-frequency donor, reflecting local versus global optimization. Experiments assess how constraint-aware scheduling can close gaps by mobilizing eligible inactive/lapsing donors.

Keywords: blood donation, donor outreach, optimization framework, binary integer linear programming, greedy heuristic, supply-demand gap, donor eligibility, operational scheduling
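
To illustrate why a greedy heuristic is cheap but only locally optimal, here is a deliberately simplified sketch. It is not the paper's heuristic: it considers only blood-type demand, session capacity, and travel distance, and ignores eligibility rules, safety penalties, and invitation burden. All names and the toy data are hypothetical.

```python
def greedy_invites(donors, sessions, demand, distance):
    """Invite the closest donors first until demand or capacity runs out.

    donors:   donor id -> blood type
    sessions: session id -> capacity (number of invitations)
    demand:   blood type -> number of donors wanted
    distance: (donor id, session id) -> travel distance in km
    """
    remaining_capacity = dict(sessions)
    remaining_demand = dict(demand)
    invited = {}
    # Visit candidate pairs from shortest to longest trip.
    for (donor, session), _km in sorted(distance.items(), key=lambda kv: kv[1]):
        blood_type = donors[donor]
        if donor in invited:
            continue  # each donor is invited at most once
        if remaining_capacity.get(session, 0) <= 0:
            continue  # session is full
        if remaining_demand.get(blood_type, 0) <= 0:
            continue  # this blood type is already covered
        invited[donor] = session
        remaining_capacity[session] -= 1
        remaining_demand[blood_type] -= 1
    return invited, remaining_demand


# Toy instance (made-up data).
donors = {"d1": "O", "d2": "O", "d3": "A"}
sessions = {"s1": 1, "s2": 1}
demand = {"O": 1, "A": 1}
distance = {
    ("d1", "s2"): 1.0,
    ("d2", "s1"): 2.0,
    ("d3", "s1"): 3.0,
    ("d3", "s2"): 4.0,
    ("d1", "s1"): 5.0,
}
print(greedy_invites(donors, sessions, demand, distance))
```

The loop commits to each assignment as soon as it is feasible and never revisits it. That keeps runtime and memory low, but it can miss assignments that an exact integer program would find by considering all constraints jointly.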

68. ❌ MacTok: Robust Continuous Tokenization for Image Generation

Authors: Hengyu Zeng, Xin Gao, Guanghao Li, Yuxiang Yan, Jiaoyang Ruan, Junpeng Ma, Haoyu Albert Wang, Jian Pu Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29634v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

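Scoring rationale: The paper focuses on image generation and continuous image tokenization in computer vision, studying visual representation learning, variational autoencoders, and image-masking augmentation. All of the given keywords target large language models (LLMs) and related techniques (training methods, inference optimization, alignment, agent systems, and so on), whereas this paper does not involve language models or text processing, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper addresses posterior collapse in continuous image tokenization when few tokens are used. It proposes MacTok, a masked augmenting 1D continuous tokenizer that uses image masking and representation alignment to learn compact, robust representations, achieving efficient image generation on ImageNet while reducing token usage by up to 64x.

Abstract (Chinese translation)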

连续图像分词器能够实现高效的视觉生成，其中基于变分框架的分词器可通过KL正则化学习平滑、结构化的潜在表示。然而，在使用较少分词数量时，这常常导致后验坍缩，即编码器无法将信息丰富的特征编码到压缩的潜在空间中。为解决这一问题，我们提出了MacTok，即一种基于掩码增强的一维连续分词器，它利用图像掩码和表示对齐来防止坍缩，同时学习紧凑且鲁棒的表示。MacTok同时采用随机掩码来正则化潜在学习，以及基于DINO引导的语义掩码来强调图像中的信息丰富区域，从而迫使模型从不完整的视觉证据中编码鲁棒的语义信息。结合全局与局部表示对齐技术，MacTok能够在高度压缩的一维潜在空间（仅需64或128个分词）中保留丰富的判别性信息。在ImageNet数据集上，MacTok与SiT-XL结合，在256×256分辨率下取得了1.44的竞争性gFID分数，在512×512分辨率下达到了1.52的领先性能，同时将分词使用量降低了高达64倍。这些结果证实，掩码与语义引导相结合能够有效防止后验坍缩，并实现高效、高保真度的分词表示。

Abstract

Continuous image tokenizers enable efficient visual generation, and those based on variational frameworks can learn smooth, structured latent representations through KL regularization. Yet this often leads to posterior collapse when using fewer tokens, where the encoder fails to encode informative features into the compressed latent space. To address this, we introduce MacTok, a Masked Augmenting 1D Continuous Tokenizer that leverages image masking and representation alignment to prevent collapse while learning compact and robust representations. MacTok applies both random masking to regularize latent learning and DINO-guided semantic masking to emphasize informative regions in images, forcing the model to encode robust semantics from incomplete visual evidence. Combined with global and local representation alignment, MacTok preserves rich discriminative information in a highly compressed 1D latent space, requiring only 64 or 128 tokens. On ImageNet, MacTok achieves a competitive gFID of 1.44 at 256×256 and a state-of-the-art 1.52 at 512×512 with SiT-XL, while reducing token usage by up to 64×. These results confirm that masking and semantic guidance together prevent posterior collapse and achieve efficient, high-fidelity tokenization.

Keywords: Continuous Tokenization, Image Generation, Posterior Collapse, Masked Augmenting, Variational Frameworks, Latent Representations, DINO-guided Semantic Masking, Representation Alignment
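
As a small illustration of the random masking mentioned in the abstract, the sketch below draws a boolean mask over image patches at a given ratio. It is a generic example, not MacTok's masking code: the patch count, the mask ratio, and the function name are assumptions, and the DINO-guided semantic masking is not modeled.

```python
import random


def random_patch_mask(num_patches: int, mask_ratio: float, seed=None) -> list[bool]:
    """Return a list where True marks a patch hidden from the encoder."""
    rng = random.Random(seed)
    num_masked = int(round(num_patches * mask_ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    return [index in masked for index in range(num_patches)]


# Hypothetical setup: a 16 x 16 grid of patches with half of them masked.
mask = random_patch_mask(num_patches=256, mask_ratio=0.5, seed=0)
print(sum(mask), "of", len(mask), "patches masked")
```

Training on inputs masked this way forces an encoder to reconstruct from incomplete evidence, which is the regularizing effect the abstract attributes to random masking.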

69. ❌ ASI-Evolve: AI Accelerates AI

Authors: Weixian Xu, Tiantian Mi, Yixiu Liu, Yang Nan, Zhimeng Zhou, Lyumanshan Ye, Lin Zhang, Yu Qiao, Pengfei Liu Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29640v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	10.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0
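
The report's configuration section states the pass mark as 5.0 + 0.8 × total weight and caps each keyword at 15 points, but it does not spell out the per-keyword formula. The sketch below assumes score = weight × (relevance / 10) × 15; that formula and the function names are assumptions made for illustration only. Under that assumption, a weight of 0.0 zeroes every row regardless of relevance, which is consistent with the table above.

```python
MAX_PER_KEYWORD = 15.0  # "max score per keyword" from the report's configuration


def pass_mark(total_weight: float) -> float:
    """Pass mark as quoted in the report: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def keyword_score(weight: float, relevance: float) -> float:
    """ASSUMED per-keyword formula; the report does not state it."""
    return weight * (relevance / 10.0) * MAX_PER_KEYWORD


def total_score(rows) -> float:
    """Sum the assumed per-keyword scores over (weight, relevance) pairs."""
    return sum(keyword_score(w, r) for w, r in rows)


# The five rows with non-zero relevance in the table above: (weight, relevance).
rows = [(0.0, 5.0), (0.0, 5.0), (0.0, 10.0), (0.0, 10.0), (0.0, 10.0)]
print(round(pass_mark(27.0), 1))  # pass mark for 27 keywords of weight 1.0
print(total_score(rows))          # total under the assumed formula
```

With the configured weight of 1.0 per keyword, the same assumed formula would give a row with relevance 10/10 the full 15 points.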

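Scoring rationale: The paper proposes the ASI-Evolve framework, which aims to use AI to accelerate AI's own development, i.e., AI-for-AI research. Core relevant keywords: 1) ‘LLM Agents’ (10): the framework is an agentic system for the AI research loop; 2) ‘KV Cache Compression/Linear Attention’ (10): in neural architecture design it discovered 105 SOTA linear attention architectures; 3) ‘AI for Science’ (10): the framework is applied to mathematics and biomedicine, which counts as a scientific AI application. Other keywords: ‘Large Language Models’ (5): the paper concerns AI development and may implicitly involve large models; ‘Pre-training’ (5): it covers pretraining data curation. The remaining keywords have no direct connection to the paper's content and score 0.

!!! tip deepseek-chat TL;DR

The paper asks whether AI can accelerate its own development and proposes ASI-Evolve, an agentic framework whose AI-driven discovery loop yields notable gains in three core areas of AI development (neural architecture design, pretraining data curation, and reinforcement learning algorithms) and shows preliminary transfer to mathematics and biomedicine.

Abstract (Chinese translation)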

人工智能能否加速其自身的发展？尽管近期涌现的智能体系统已在反馈迅速、目标明确的任务中展现出强大性能，但其能否应对驱动真实AI进步所需的高成本、长周期、弱监督的研究闭环，仍是一个悬而未决的问题。本文提出ASI-Evolve——一个用于“AI-for-AI”研究的智能体框架，它通过“学习-设计-实验-分析”的循环实现了这一闭环。ASI-Evolve在标准进化智能体的基础上增强了两个核心组件：一是认知基座（cognition base），它将积累的人类先验知识注入每一轮探索；二是专用分析器（analyzer），它将复杂的实验结果提炼为可复用的洞察，供后续迭代使用。据我们所知，ASI-Evolve是首个在AI发展的三大核心组成部分——数据、架构与学习算法——上均实现AI驱动发现的统一框架。在神经网络架构设计中，它发现了105个性能达到前沿水平的线性注意力架构，其中最优模型超越DeltaNet达+0.97分，增益接近近期人工设计改进的3倍。在预训练数据筛选任务中，进化出的数据流水线将基准测试平均性能提升+3.96分，在MMLU上的增益更超过18分。在强化学习算法设计中，所发现的算法在AMC32上超越GRPO达+12.5分，在AIME24上提升+11.67分，在OlympiadBench上提高+5.04分。我们进一步通过数学与生物医学领域的实验提供了初步证据，表明这种“AI-for-AI”范式可迁移至AI技术栈之外的领域。综上所述，这些结果表明ASI-Evolve代表着向实现AI在基础研发阶段自我加速迈出的重要一步，为闭环AI研究的可行性提供了早期实证。

Abstract

Can AI accelerate the development of AI itself? While recent agentic systems have shown strong performance on well-scoped tasks with rapid feedback, it remains unclear whether they can tackle the costly, long-horizon, and weakly supervised research loops that drive real AI progress. We present ASI-Evolve, an agentic framework for AI-for-AI research that closes this loop through a learn-design-experiment-analyze cycle. ASI-Evolve augments standard evolutionary agents with two key components: a cognition base that injects accumulated human priors into each round of exploration, and a dedicated analyzer that distills complex experimental outcomes into reusable insights for future iterations. To our knowledge, ASI-Evolve is the first unified framework to demonstrate AI-driven discovery across three central components of AI development: data, architectures, and learning algorithms. In neural architecture design, it discovered 105 SOTA linear attention architectures, with the best discovered model surpassing DeltaNet by +0.97 points, nearly 3x the gain of recent human-designed improvements. In pretraining data curation, the evolved pipeline improves average benchmark performance by +3.96 points, with gains exceeding 18 points on MMLU. In reinforcement learning algorithm design, discovered algorithms outperform GRPO by up to +12.5 points on AMC32, +11.67 points on AIME24, and +5.04 points on OlympiadBench. We further provide initial evidence that this AI-for-AI paradigm can transfer beyond the AI stack through experiments in mathematics and biomedicine. Together, these results suggest that ASI-Evolve represents a promising step toward enabling AI to accelerate AI across the foundational stages of development, offering early evidence for the feasibility of closed-loop AI research.

Keywords: AI-for-AI, agentic framework, neural architecture design, linear attention, pretraining data curation, reinforcement learning algorithm, closed-loop research, ASI-Evolve

70. ❌ Convergent Representations of Linguistic Constructions in Human and Artificial Neural Systems

Authors: Pegah Ramezani, Thomas Kinfe, Andreas Maier, Achim Schilling, Patrick Krauss Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29617v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

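Scoring rationale: The paper studies the convergence of representations of linguistic constructions in human and artificial neural networks, mainly within cognitive neuroscience and linguistics. It is somewhat related to ‘Large Language Models OR LLMs OR Foundation Models’ (5) because it uses transformer-based language models as a point of comparison. It is somewhat related to ‘Mechanistic Interpretability OR Explainable AI’ (5) because it examines how neural networks represent linguistic structure, which falls under explainable AI. The other keywords (e.g., MoE, SFT, RAG, quantization) are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

Using an EEG experiment and an analysis of artificial neural networks, the study finds that the human brain and artificial language models show similar representational patterns when processing different syntactic constructions, supporting the hypothesis that linguistic constructions are neurally encoded as form-meaning mappings.

Abstract (Chinese translation)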

理解大脑如何处理语言构式是认知神经科学和语言学的核心挑战。近期计算研究表明，人工神经语言模型能自发形成对论元结构构式的差异化表征，并预测构式层级信息在处理过程中何时及如何浮现。本研究通过脑电图测试了这些预测在人类神经活动中的表现。十名英语母语者聆听200句合成生成的句子，涵盖四种构式类型（及物、双及物、致使移动、动结式），同时记录其神经响应。使用时频分析、特征提取和机器学习分类法的分析显示，构式特异性神经特征主要出现在句末位置——即论元结构完全消歧之处，并在α波段最为显著。配对分类显示构式间存在可靠区分，尤以双及物构式与动结式构式之间最为明显，其他构式对则存在重叠。关键的是，这些神经效应的时间浮现模式与相似性结构，与基于循环神经网络和Transformer的语言模型中的模式相吻合——构式表征均产生于整合处理阶段。这些发现支持了“语言构式在神经层面被编码为独特的形式-意义映射”的观点，与构式语法理论一致，并表明生物系统与人工系统在表征解决方案上存在趋同性。更广泛而言，这种趋同性与以下观点相符：学习系统会在底层表征景观——近期被称为“柏拉图表征空间”——中发现稳定区域，这些区域制约着高效语言抽象结构的浮现。

Abstract

Understanding how the brain processes linguistic constructions is a central challenge in cognitive neuroscience and linguistics. Recent computational studies show that artificial neural language models spontaneously develop differentiated representations of Argument Structure Constructions (ASCs), generating predictions about when and how construction-level information emerges during processing. The present study tests these predictions in human neural activity using electroencephalography (EEG). Ten native English speakers listened to 200 synthetically generated sentences across four construction types (transitive, ditransitive, caused-motion, resultative) while neural responses were recorded. Analyses using time-frequency methods, feature extraction, and machine learning classification revealed construction-specific neural signatures emerging primarily at sentence-final positions, where argument structure becomes fully disambiguated, and most prominently in the alpha band. Pairwise classification showed reliable differentiation, especially between ditransitive and resultative constructions, while other pairs overlapped. Crucially, the temporal emergence and similarity structure of these effects mirror patterns in recurrent and transformer-based language models, where constructional representations arise during integrative processing stages. These findings support the view that linguistic constructions are neurally encoded as distinct form-meaning mappings, in line with Construction Grammar, and suggest convergence between biological and artificial systems on similar representational solutions. More broadly, this convergence is consistent with the idea that learning systems discover stable regions within an underlying representational landscape - recently termed a Platonic representational space - that constrains the emergence of efficient linguistic abstractions.

Keywords: linguistic constructions, neural representations, EEG, artificial neural networks, argument structure constructions, Construction Grammar, transformer models, cognitive neuroscience

71. ❌ Generating Key Postures of Bharatanatyam Adavus with Pose Estimation

Authors: Jagadish Kashinath Kamble, Jayanta Mukhopadhyay, Debaditya Roy, Partha Pratim Das Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29570v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

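Scoring rationale: The paper focuses on generating key postures of the Indian classical dance Bharatanatyam using conditional generative adversarial networks (cGAN) and conditional diffusion models combined with pose estimation, an application of computer vision and generative models to cultural heritage preservation. All keywords relate directly to large language models (LLMs) and associated techniques (e.g., MoE, scaling laws, alignment, reasoning, agents), whereas this paper involves no LLM techniques and does not mention bioinformatics or cheminformatics. Therefore ‘AI for Science OR Bioinformatics OR Cheminformatics’ receives 5 (some relevance) for applying AI in a scientific/cultural domain, and all other keywords receive 0 (unrelated).

!!! tip deepseek-chat TL;DR

The study proposes a generative framework combined with pose estimation to accurately generate key postures of the Indian classical dance Bharatanatyam; experiments show that pose supervision significantly improves the quality, realism, and cultural fidelity of the generated postures.

Abstract (Chinese translation)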

在数字时代，保存那些根植于数百年传统并受严格结构与象征规则制约的非物质文化舞蹈面临着独特挑战。其中，印度古典舞蹈形式婆罗多舞因其强调规范化的阿达武动作和精确的关键姿态而尤为突出。准确生成这些姿态不仅对保持解剖学与风格完整性至关重要，也能通过数字手段实现有效的记录、分析，并向更广泛的全球受众传播。我们提出一种集成姿态估计模块的姿态感知生成框架，该框架通过基于关键点的损失函数和姿态一致性约束进行引导。这些监督信号确保了合成输出在解剖学准确性和风格完整性上的表现。我们评估了四种配置方案：标准条件生成对抗网络、带姿态监督的条件生成对抗网络、条件扩散模型，以及带姿态监督的条件扩散模型。所有模型均以关键姿态类别标签为条件输入，并通过优化保持几何结构。在条件生成对抗网络和条件扩散模型两种设置中，集成的姿态引导使生成姿态与真实关键点结构对齐，从而提升文化保真度。实验结果表明，引入姿态监督能显著提高生成婆罗多舞姿态的质量、真实感与艺术本真性。该框架为传统舞蹈形式的数字化保存、教育与传播提供了可扩展的解决方案，在实现高保真生成的同时不损害文化精确性。代码发布于https://github.com/jagidsh/Generating-Key-Postures-of-Bharatanatyam-Adavus-with-Pose-Estimation。

Abstract

Preserving intangible cultural dances rooted in centuries of tradition and governed by strict structural and symbolic rules presents unique challenges in the digital era. Among these, Bharatanatyam, a classical Indian dance form, stands out for its emphasis on codified adavus and precise key postures. Accurately generating these postures is crucial not only for maintaining anatomical and stylistic integrity, but also for enabling effective documentation, analysis, and transmission to broader global audiences through digital means. We propose a pose-aware generative framework integrated with a pose estimation module, guided by keypoint-based loss and pose consistency constraints. These supervisory signals ensure anatomical accuracy and stylistic integrity in the synthesized outputs. We evaluate four configurations: standard conditional generative adversarial network (cGAN), cGAN with pose supervision, conditional diffusion, and conditional diffusion with pose supervision. Each model is conditioned on key posture class labels and optimized to maintain geometric structure. In both cGAN and conditional diffusion settings, the integrated pose guidance aligns generated poses with ground-truth keypoint structures, promoting cultural fidelity. Our results demonstrate that incorporating pose supervision significantly enhances the quality, realism, and authenticity of generated Bharatanatyam postures. This framework provides a scalable approach for the digital preservation, education, and dissemination of traditional dance forms, enabling high-fidelity generation without compromising cultural precision. Code is available at https://github.com/jagidsh/Generating-Key-Postures-of-Bharatanatyam-Adavus-with-Pose-Estimation.

Keywords: Bharatanatyam, pose estimation, generative framework, conditional GAN, conditional diffusion, cultural preservation, key postures, pose supervision

72. ❌ FlowPIE: Test-Time Scientific Idea Evolution with Flow-Guided Literature Exploration

Authors: Qiyao Wang, Hongbo Wang, Longze Chen, Zhihao Yang, Guhong Chen, Hamid Alinejad-Rokny, Hui Li, Yuan Lin, Min Yang Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29557v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	10.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

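Scoring rationale: FlowPIE proposes a retrieval-generation framework for scientific idea generation whose core innovation is modeling literature exploration and idea generation as a co-evolving process. It is highly relevant to several keywords: 1) an LLM serves as the generative reward model (GRM) for quality assessment, a core component; 2) it adopts the retrieval-augmented generation (RAG) paradigm but makes it dynamic; 3) it uses a GFlowNets-inspired, flow-guided Monte Carlo Tree Search (MCTS) for literature exploration; 4) it targets scientific idea generation, an AI for Science task; 5) the overall framework can be viewed as an LLM-driven autonomous agent workflow. Other keywords such as MoE, quantization, and alignment do not appear in the abstract and score 0.

!!! tip deepseek-chat TL;DR

Existing AI-driven scientific idea generation methods are constrained by a static retrieval-then-generation paradigm that yields homogeneous ideas. The study proposes FlowPIE, which models literature exploration and idea generation as a co-evolving process and uses flow-guided MCTS with an LLM-based reward model to generate more novel, feasible, and diverse scientific ideas at test time.

Abstract (Chinese translation)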

科学构想生成（SIG）对人工智能驱动的自主研究至关重要，然而现有方法通常受限于静态的“检索-生成”范式，导致构想同质化且发散性不足。本研究提出FlowPIE，一个紧密耦合的检索-生成框架，将文献探索与构想生成视为协同演化的过程。FlowPIE通过受GFlowNets启发的流引导蒙特卡洛树搜索（MCTS）扩展文献轨迹，利用基于大语言模型（LLM）的生成式奖励模型（GRM）对当前构想质量的评估作为监督信号，以指导自适应检索并构建多样化、高质量的初始种群。基于此种群，FlowPIE将构想生成建模为测试阶段的构想演化过程，采用隔离岛范式结合选择、交叉与变异操作，并通过基于GRM的适应度计算融入跨领域知识。该方法有效缓解了因过度依赖参数化知识与静态文献所形成的信息茧房。大量实验评估表明，与基于大语言模型和智能体的强基线框架相比，FlowPIE能持续产生具有更高新颖性、可行性与多样性的构想，同时支持在测试阶段进行奖励缩放。

Abstract

Scientific idea generation (SIG) is critical to AI-driven autonomous research, yet existing approaches are often constrained by a static retrieval-then-generation paradigm, leading to homogeneous and insufficiently divergent ideas. In this work, we propose FlowPIE, a tightly coupled retrieval-generation framework that treats literature exploration and idea generation as a co-evolving process. FlowPIE expands literature trajectories via a flow-guided Monte Carlo Tree Search (MCTS) inspired by GFlowNets, using the quality of current ideas assessed by an LLM-based generative reward model (GRM) as a supervised signal to guide adaptive retrieval and construct a diverse, high-quality initial population. Based on this population, FlowPIE models idea generation as a test-time idea evolution process, applying selection, crossover, and mutation with the isolation island paradigm and GRM-based fitness computation to incorporate cross-domain knowledge. It effectively mitigates the information cocoons arising from over-reliance on parametric knowledge and static literature. Extensive evaluations demonstrate that FlowPIE consistently produces ideas with higher novelty, feasibility and diversity compared to strong LLM-based and agent-based frameworks, while enabling reward scaling during test time.
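
The abstract describes idea evolution with selection, crossover, and mutation under an isolation island paradigm. The following is a minimal sketch of a generic island-model evolutionary loop, not FlowPIE's implementation: ideas are stand-in bit lists, `fitness` is a toy placeholder for the LLM-based GRM score, and the ring migration scheme, rates, and all function names are assumptions.

```python
import random


def fitness(idea):
    # Toy stand-in for a GRM score (assumption): count of 1-bits.
    return sum(idea)


def evolve_island(pop, rng, mutation_rate=0.05):
    """One generation: tournament selection, one-point crossover, bit-flip mutation."""
    def pick():
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    elite = max(pop, key=fitness)
    children = [elite[:]]  # keep the elite so the island's best never gets worse
    while len(children) < len(pop):
        p1, p2 = pick(), pick()
        cut = rng.randrange(1, len(p1))
        child = p1[:cut] + p2[cut:]
        child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
        children.append(child)
    return children


def migrate(islands):
    """Ring migration: each island's best replaces the next island's worst."""
    bests = [max(pop, key=fitness)[:] for pop in islands]
    for i, pop in enumerate(islands):
        worst = min(range(len(pop)), key=lambda j: fitness(pop[j]))
        pop[worst] = bests[i - 1]


def run(n_islands=3, pop_size=8, length=16, generations=30, migrate_every=5, seed=0):
    rng = random.Random(seed)
    islands = [[[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
               for _ in range(n_islands)]
    start_best = max(fitness(ind) for pop in islands for ind in pop)
    for gen in range(1, generations + 1):
        # Islands evolve in isolation and only exchange individuals at migration.
        islands = [evolve_island(pop, rng) for pop in islands]
        if gen % migrate_every == 0:
            migrate(islands)
    end_best = max(fitness(ind) for pop in islands for ind in pop)
    return start_best, end_best


print(run())
```

Keeping populations separate between migrations is what lets each island explore a different region before sharing its best candidate, which is the usual motivation for island models.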

Keywords: Scientific Idea Generation, Retrieval-Generation Framework, Monte Carlo Tree Search (MCTS), LLM-based Generative Reward Model, Test-Time Idea Evolution, Literature Exploration, AI for Science, Autonomous Research

73. ❌ Bringing Up a Bilingual BabyLM: Investigating Multilingual Language Acquisition Using Small-Scale Models

Authors: Linda Zeng, Steven Y. Feng, Michael C. Frank Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29552v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	2.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	3.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

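Scoring rationale: The paper uses GPT-2 (a small language model) to study multilingual acquisition, so it is highly relevant to ‘Small Language Models’ (8) and involves ‘Pre-training’ (8). It simulates children's language learning, an ‘AI for Science’ application (5). Because it uses GPT-2 rather than a large model, it is only slightly relevant to ‘Large Language Models’ (2). Dataset construction touches on ‘Scaling Laws AND Data Quality’ (3). Other keywords such as MoE, SFT, and RAG are not addressed and score 0.

!!! tip deepseek-chat TL;DR

The study uses GPT-2 models to simulate children's multilingual acquisition, training on matched monolingual and bilingual datasets, and finds that bilingual models perform well in both languages, indicating that bilingual input poses no in-principle challenge for statistical learners.

Abstract (Chinese translation)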

在世界范围内，多语言现象极为普遍，这引发了关于儿童如何同时习得多种语言的重要理论与实际问题。例如：多语言习得是否会导致学习延迟？构建多语言输入的方式是否存在优劣之分？许多相关性研究探讨了这些问题，但令人惊讶的是，要获得明确答案异常困难，因为儿童无法被随机分配为多语者，且不同语言间的数据通常无法匹配。我们采用语言模型训练作为模拟多种高度受控暴露条件的方法，并利用合成数据与机器翻译构建了匹配的1亿词单语及双语数据集。我们基于反映一系列暴露机制的单语和双语数据训练GPT-2模型，并通过困惑度、语法性和语义知识评估其表现。在不同模型规模和测量指标下，双语模型在单一语言中的表现与单语模型相似，同时在第二语言中也展现出强劲性能。这些结果表明，不同双语暴露机制之间不存在显著差异，且双语输入对于不可知论的统计学习者而言并不构成原则性挑战。

Abstract

Multilingualism is incredibly common around the world, leading to many important theoretical and practical questions about how children learn multiple languages at once. For example, does multilingual acquisition lead to delays in learning? Are there better and worse ways to structure multilingual input? Many correlational studies address these questions, but it is surprisingly difficult to get definitive answers because children cannot be randomly assigned to be multilingual and data are typically not matched between languages. We use language model training as a method for simulating a variety of highly controlled exposure conditions, and create matched 100M-word mono- and bilingual datasets using synthetic data and machine translation. We train GPT-2 models on monolingual and bilingual data organized to reflect a range of exposure regimes, and evaluate their performance on perplexity, grammaticality, and semantic knowledge. Across model scales and measures, bilingual models perform similarly to monolingual models in one language, but show strong performance in the second language as well. These results suggest that there are no strong differences between different bilingual exposure regimes, and that bilingual input poses no in-principle challenges for agnostic statistical learners.
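
Perplexity, one of the evaluation measures named in the abstract, is the exponential of the average negative log-likelihood per token. The sketch below shows only that standard definition; the per-token probabilities are made-up values for illustration and are not results from the paper.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-(1/N) * sum of natural-log token probabilities)."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical model that assigns probability 0.25 to each of four tokens.
uniform = [math.log(0.25)] * 4
# A hypothetical model that is more confident on the same tokens.
confident = [math.log(0.5)] * 4

print(perplexity(uniform), perplexity(confident))
```

Lower perplexity means the model found the text less surprising, so the more confident model scores lower here.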

Keywords: multilingual language acquisition, small-scale models, GPT-2, bilingual input, language model training, synthetic data, machine translation, perplexity evaluation

74. ❌ Reducing Complexity for Quantum Approaches in Train Load Optimization

Authors: Zhijie Tang, Albert Nieto-Morales, Arit Kumar Bishwas Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29543v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

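Scoring rationale: The paper studies mathematical formulations for the train load optimization problem, which belongs to operations research and combinatorial optimization, and does not involve large models, deep learning, the principles of AI techniques, or AI applications in science. All keywords concern large-model techniques, AI methods, or AI for science, whereas this paper focuses on classical mathematical optimization and simulated annealing, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a compact mathematical formulation for the train load optimization problem that computes rehandle cost implicitly, substantially reducing the number of variables and constraints, and uses simulated annealing to validate the formulation's effectiveness and scalability.

Abstract (Chinese translation)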

将集装箱高效装载至列车是一项计算上极具挑战性的组合优化问题，在物流与供应链管理中处于核心地位。该问题复杂性的一个主要来源在于需对翻箱作业进行建模与优化——即为了取出被遮挡集装箱而进行的非生产性起重机移动。传统的数学建模方法通过为每一次潜在的翻箱引入显式二元变量及一系列逻辑约束来处理此问题，这导致模型规模庞大且难以求解。本文从根本上突破了这一范式。我们为列车装载优化问题提出了一种创新且紧凑的数学建模方法，其中翻箱成本在目标函数中被隐式计算。这一新颖方法避免了对专用翻箱变量及其相关约束的需求，从而显著缩减了模型规模。我们通过与传统模型的形式化对比，从分析上证明了变量与约束数量的显著减少。我们通过模拟退火元启发式算法评估了该紧凑模型的有效性，该算法为多种问题实例找到了高质量的装载方案。结果证实，我们的模型不仅更为简洁，而且具有实际有效性，为现代铁路物流提供了一个可扩展的强大工具。

Abstract

Efficiently planning container loads onto trains is a computationally challenging combinatorial optimization problem, central to logistics and supply chain management. A primary source of this complexity arises from the need to model and reduce rehandle operations, i.e., unproductive crane moves required to access blocked containers. Conventional mathematical formulations address this by introducing explicit binary variables and a web of logical constraints for each potential rehandle, resulting in large-scale models that are difficult to solve. This paper presents a fundamental departure from this paradigm. We introduce an innovative and compact mathematical formulation for the Train Load Optimization (TLO) problem where the rehandle cost is calculated implicitly within the objective function. This novel approach helps prevent the need for dedicated rehandle variables and their associated constraints, leading to a dramatic reduction in model size. We provide a formal comparison against a conventional model to analytically demonstrate the significant reduction in the number of variables and constraints. The efficacy of our compact formulation is assessed through a simulated annealing metaheuristic, which finds high-quality loading plans for various problem instances. The results confirm that our model is not only more parsimonious but also practically effective, offering a scalable and powerful tool for modern rail logistics.
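
To illustrate what computing the rehandle cost implicitly can look like, here is a toy sketch. It is not the paper's formulation, which the abstract does not give. It assumes containers are loaded in index order into fixed-capacity stacks, counts a container as blocking when something beneath it leaves earlier, and minimizes that count with a plain simulated annealing loop. The stacking model, cooling schedule, and all names are assumptions.

```python
import math
import random


def rehandle_cost(assign, retrieval):
    """Count blocking containers directly from the assignment (no rehandle variables).

    Containers are loaded in index order, so within a stack a higher index sits
    on top. A container is blocking if something below it leaves earlier.
    """
    cost = 0
    earliest_below = {}  # stack -> earliest retrieval time loaded so far
    for c, stack in enumerate(assign):
        if stack in earliest_below and retrieval[c] > earliest_below[stack]:
            cost += 1
        earliest_below[stack] = min(earliest_below.get(stack, retrieval[c]), retrieval[c])
    return cost


def anneal(retrieval, n_stacks, capacity, steps=5000, t0=2.0, seed=0):
    rng = random.Random(seed)
    n = len(retrieval)
    assign = [i % n_stacks for i in range(n)]  # feasible when n <= n_stacks * capacity
    cost = rehandle_cost(assign, retrieval)
    best, best_cost = assign[:], cost
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-9  # linear cooling (assumed schedule)
        c, s = rng.randrange(n), rng.randrange(n_stacks)
        if assign[c] != s and assign.count(s) >= capacity:
            continue  # the move would overfill the target stack
        old = assign[c]
        assign[c] = s
        new_cost = rehandle_cost(assign, retrieval)
        # Accept improvements always, and worse moves with Boltzmann probability.
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = assign[:], cost
        else:
            assign[c] = old
    return best, best_cost


retrieval = [3, 1, 4, 1, 5, 9, 2, 6]  # hypothetical retrieval times
plan, cost = anneal(retrieval, n_stacks=3, capacity=3)
print(plan, cost)
```

The only decision variables here are the stack assignments; the rehandle count is derived from them inside the objective, which mirrors the abstract's idea at toy scale.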

Keywords: Train Load Optimization, combinatorial optimization, mathematical formulation, rehandle operations, simulated annealing, logistics, model complexity reduction, rail logistics

75. ❌ Mean Masked Autoencoder with Flow-Mixing for Encrypted Traffic Classification

Authors: Xiao Liu, Xiaowei Fu, Fuxiang Huang, Lei Zhang Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29537v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

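Scoring rationale: The paper focuses on network traffic classification and proposes MMAE, a self-supervised pre-training method based on masked autoencoders (MAE), together with a flow mixing strategy and a packet-importance aware mask predictor. Its core is self-supervised pre-training, so it is highly relevant to ‘Pre-training OR Continual Pre-training OR Domain Adaptation’ (10): the paper explicitly builds an ‘encrypted traffic pre-training model’. However, it does not involve large language models (LLMs), innovations in the underlying deep learning techniques listed (e.g., MoE, scaling laws, RLHF), or applications of large models in science (e.g., AI for Science), so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The paper proposes MMAE, a teacher-student masked autoencoder pre-training model that uses a flow mixing strategy and a packet-importance aware mask predictor to improve encrypted network traffic classification, achieving state-of-the-art results on multiple datasets.

Abstract (Chinese translation)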

基于掩码自编码器（MAE）的自监督预训练模型在网络流量分类领域展现出巨大潜力。然而，现有方法局限于对单条流量的孤立字节级重建，缺乏对流量中多粒度上下文关系的充分感知。为克服这一局限，我们提出均值掩码自编码器（MMAE），这是一种采用流量混合策略的师生式MAE范式，用于构建加密流量预训练模型。MMAE通过自蒸馏机制实现师生交互，其中教师模型提供未掩码的流级语义监督，推动学生模型从局部字节重建进阶至多粒度理解。为突破单条流量的信息瓶颈，我们引入动态流量混合（FlowMix）策略以替代传统随机掩码机制。该策略通过构建具有干扰的跨流量混合样本，迫使模型从失真标记中学习判别性表征。此外，我们设计了数据包重要性感知掩码预测器（PMP），该模块配备注意力偏置机制，利用数据包级侧信道统计信息动态掩码高语义密度的标记。在涵盖加密应用、恶意软件及攻击流量的多个数据集上的大量实验表明，MMAE实现了最先进的性能。代码已发布于https://github.com/lx6c78/MMAE

Abstract

Network traffic classification using self-supervised pre-training models based on Masked Autoencoders (MAE) has demonstrated a huge potential. However, existing methods are confined to isolated byte-level reconstruction of individual flows, lacking adequate perception of the multi-granularity contextual relationship in traffic. To address this limitation, we propose Mean MAE (MMAE), a teacher-student MAE paradigm with flow mixing strategy for building encrypted traffic pre-training model. MMAE employs a self-distillation mechanism for teacher-student interaction, where the teacher provides unmasked flow-level semantic supervision to advance the student from local byte reconstruction to multi-granularity comprehension. To break the information bottleneck in individual flows, we introduce a dynamic Flow Mixing (FlowMix) strategy to replace traditional random masking mechanism. By constructing challenging cross-flow mixed samples with interferences, it compels the model to learn discriminative representations from distorted tokens. Furthermore, we design a Packet-importance aware Mask Predictor (PMP) equipped with an attention bias mechanism that leverages packet-level side-channel statistics to dynamically mask tokens with high semantic density. Numerous experiments on a number of datasets covering encrypted applications, malware, and attack traffic demonstrate that MMAE achieves state-of-the-art performance. The code is available at https://github.com/lx6c78/MMAE
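
The abstract describes a teacher-student self-distillation setup but does not say how the teacher's weights are obtained. One common choice in self-distillation is to keep the teacher as an exponential moving average (EMA) of the student. The sketch below assumes that choice purely for illustration; the decay value, the dictionary-of-lists parameter layout, and the function name are hypothetical and not taken from MMAE.

```python
def ema_update(teacher, student, decay=0.99):
    """In-place EMA step: teacher <- decay * teacher + (1 - decay) * student."""
    for name, s_vals in student.items():
        t_vals = teacher[name]
        for i, s in enumerate(s_vals):
            t_vals[i] = decay * t_vals[i] + (1.0 - decay) * s


# Hypothetical two-parameter "models" stored as plain lists.
teacher = {"w": [0.0, 0.0]}
student = {"w": [1.0, -1.0]}

for _ in range(100):
    # In real training the student would change between steps; it is held
    # fixed here so the teacher's drift toward it is easy to see.
    ema_update(teacher, student)

print(teacher["w"])
```

With a fixed student, the teacher closes the gap geometrically: after n steps the remaining distance is decay**n of the original.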

Keywords: Masked Autoencoder, self-supervised pre-training, encrypted traffic classification, flow mixing, teacher-student paradigm, packet-importance aware masking, network traffic, semantic supervision

76. ❌ Baby Scale: Investigating Models Trained on Individual Children’s Language Input

Authors: Steven Y. Feng, Alvin W. M. Tan, Michael C. Frank Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29522v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	10.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

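Scoring rationale: The paper studies language models trained on child-scale data. It is highly relevant to "Small Language Models" and "Scaling Laws AND Data Quality" (10 points each) because it directly examines how small data scale and data quality affect model performance. It is relevant to "Large Language Models" and "Pre-training" (8 points each) because it involves language model training and benchmarking. The remaining keywords, such as MoE, SFT, and RLHF, are not addressed in the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The study trains language models on children's language input to examine how data scale and data quality affect model performance. It finds that model likelihoods for individual words correlate with children's learning of those words, and discusses what the properties of efficient language data imply for both model learning and human language acquisition.

Abstract (Chinese translation)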

现代语言模型（LMs）必须接受比人类儿童在开始产生有效行为前所接触到的训练数据多出多个数量级的词汇训练。评估这种“数据差距”的本质和起源，需要在人类规模的数据集上对语言模型进行基准测试，以理解语言知识如何从儿童的自然训练数据中涌现。利用BabyView数据集（记录6-36个月儿童日常的视频转录文本），我们研究了以下问题：（1）在儿童规模数据机制下的扩展性能，（2）基于不同儿童经历的数据集之间模型性能的差异以及数据集质量的语言学预测因素，以及（3）模型与儿童语言学习成果之间的关系。在儿童数据上训练的语言模型在语法任务上表现出可接受的扩展性，但在语义和世界知识任务上的扩展性低于在合成数据上训练的模型；我们还观察到基于不同儿童数据的表现存在显著差异。除了数据集规模外，模型性能主要与分布性和交互性语言特征的组合相关，这与儿童语言发展所需高质量输入的特征基本一致。最后，模型对单个词语的似然估计与儿童对这些词语的学习情况相关，这表明面向儿童的输入特性可能同时影响模型学习和人类语言发展。总体而言，理解哪些特性使语言数据更有利于高效学习，既能催生更强大的小规模语言模型，也有助于揭示人类语言习得的机制。

Abstract

Modern language models (LMs) must be trained on many orders of magnitude more words of training data than human children receive before they begin to produce useful behavior. Assessing the nature and origins of this “data gap” requires benchmarking LMs on human-scale datasets to understand how linguistic knowledge emerges from children’s natural training data. Using transcripts from the BabyView dataset (videos from children ages 6-36 months), we investigate (1) scaling performance at child-scale data regimes, (2) variability in model performance across datasets from different children’s experiences and linguistic predictors of dataset quality, and (3) relationships between model and child language learning outcomes. LMs trained on child data show acceptable scaling for grammar tasks, but lower scaling on semantic and world knowledge tasks than models trained on synthetic data; we also observe substantial variability on data from different children. Beyond dataset size, performance is most associated with a combination of distributional and interactional linguistic features, broadly consistent with what makes high-quality input for child language development. Finally, model likelihoods for individual words correlate with children’s learning of those words, suggesting that properties of child-directed input may influence both model learning and human language development. Overall, understanding what properties make language data efficient for learning can enable more powerful small-scale language models while also shedding light on human language acquisition.

Keywords: language models, child language data, scaling performance, data quality, linguistic predictors, model learning, human language acquisition, BabyView dataset

77. ❌ TrafficMoE: Heterogeneity-aware Mixture of Experts for Encrypted Traffic Classification

Authors: Qing He, Xiaowei Fu, Lei Zhang Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29520v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

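Scoring rationale: TrafficMoE focuses on encrypted traffic classification; its core contribution is a heterogeneity-aware framework built on sparse Mixture of Experts (MoE). It is therefore highly relevant only to the keyword "Mixture of Experts OR MoE OR Sparse Models" (10 points), since the paper explicitly uses and extends the MoE architecture (a dual-branch sparse MoE) to solve a domain-specific problem. The paper does not involve large language models (LLMs), AI-for-science applications, or the other techniques in the keyword list, so all other keywords receive 0 points.

!!! tip deepseek-chat TL;DR

To address the bottleneck of static, homogeneous modeling in encrypted traffic classification, the paper proposes TrafficMoE, which achieves heterogeneity-aware modeling through a Disentangle-Filter-Aggregate (DFA) paradigm and a dual-branch sparse Mixture of Experts (MoE), and outperforms existing methods on six datasets.

Abstract (Chinese translation)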

加密流量分类是网络安全中的关键任务。尽管深度学习推动了该领域的发展，但加密对载荷语义的遮蔽严重挑战了标准建模方法。现有大多数框架依赖于静态同质的处理流程，对所有输入采用统一的参数共享和静态融合策略。这种“一刀切”的静态设计存在固有缺陷：通过将结构化头部与随机化载荷强制纳入统一处理流程，它不可避免地使原始协议信号与随机加密噪声相互纠缠，从而削弱了细粒度判别特征。本文提出TrafficMoE框架，通过建立“解耦-过滤-聚合”范式，突破了静态建模的瓶颈。具体而言，为解决组件间的结构冲突，该架构采用双分支稀疏专家混合模型解耦头部与载荷，实现面向模态的专用建模。为减轻随机噪声的影响，引入不确定性感知过滤机制以量化可靠性，并选择性抑制高方差表征。最后，为克服静态融合的局限，采用路由引导策略动态聚合跨模态特征，根据流量上下文自适应权衡各模态贡献。通过这一DFA范式，TrafficMoE仅聚焦于最具判别性的流量特征，从而最大化表征效率。在六个数据集上的大量实验表明，TrafficMoE持续优于现有先进方法，验证了加密流量分析中异构感知建模的必要性。源代码公开于https://github.com/Posuly/TrafficMoE_main。

Abstract

Encrypted traffic classification is a critical task for network security. While deep learning has advanced this field, the occlusion of payload semantics by encryption severely challenges standard modeling approaches. Most existing frameworks rely on static and homogeneous pipelines that apply uniform parameter sharing and static fusion strategies across all inputs. This one-size-fits-all static design is inherently flawed: by forcing structured headers and randomized payloads into a unified processing pipeline, it inevitably entangles the raw protocol signals with stochastic encryption noise, thereby degrading the fine-grained discriminative features. In this paper, we propose TrafficMoE, a framework that breaks through the bottleneck of static modeling by establishing a Disentangle-Filter-Aggregate (DFA) paradigm. Specifically, to resolve the structural between-components conflict, the architecture disentangles headers and payloads using dual-branch sparse Mixture-of-Experts (MoE), enabling modality-specific modeling. To mitigate the impact of stochastic noise, an uncertainty-aware filtering mechanism is introduced to quantify reliability and selectively suppress high-variance representations. Finally, to overcome the limitations of static fusion, a routing-guided strategy aggregates cross-modality features dynamically, that adaptively weighs contributions based on traffic context. With this DFA paradigm, TrafficMoE maximizes representational efficiency by focusing solely on the most discriminative traffic features. Extensive experiments on six datasets demonstrate TrafficMoE consistently outperforms state-of-the-art methods, validating the necessity of heterogeneity-aware modeling in encrypted traffic analysis. The source code is publicly available at https://github.com/Posuly/TrafficMoE_main.

Keywords: Encrypted Traffic Classification, Mixture of Experts, Sparse Models, Heterogeneity-aware Modeling, Disentangle-Filter-Aggregate, Network Security, Deep Learning, TrafficMoE
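
The abstract above describes sparse Mixture-of-Experts routing only at a high level. As a minimal illustration of the general idea (not TrafficMoE's dual-branch architecture, whose details the abstract does not give), the sketch below routes an input to the top-k experts by gate score and mixes their outputs with softmax weights. The function names, the toy experts, and the gate logits are all hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(gate_logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts."""
    order = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([gate_logits[i] for i in top])
    return dict(zip(top, weights))

def sparse_moe(x, gate_logits, experts, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    routing = top_k_route(gate_logits, k)
    out = [0.0] * len(x)
    for idx, w in routing.items():
        y = experts[idx](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, routing

# Hypothetical toy experts: each scales its input by a different factor.
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
output, routing = sparse_moe([1.0, -1.0], gate_logits=[0.1, 2.0, -1.0, 2.0],
                             experts=experts, k=2)
```

With the two tied top logits, experts 1 and 3 each receive weight 0.5, so only two of the four experts run for this input.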

78. ❌ Target-Aligned Reinforcement Learning

Authors: Leonard S. Pleiss, James Harrison, Maximilian Schiffer Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29501v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

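Scoring rationale: The paper focuses on improving target-network alignment in reinforcement learning algorithms and belongs to classical reinforcement learning. It does not involve large models, innovations in the underlying techniques of large models, or applications of large models in other domains, so it is unrelated to every keyword in the list.

!!! tip deepseek-chat TL;DR

The paper proposes Target-Aligned Reinforcement Learning (TARL), a framework that mitigates the stability-recency tradeoff of target-network updates by focusing on transitions for which the target and online network estimates are highly aligned, which accelerates convergence and improves the performance of standard reinforcement learning algorithms.

Abstract (Chinese translation)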

许多强化学习算法依赖目标网络——即在线网络的滞后副本——来稳定训练。尽管有效，该机制引入了一个根本性的稳定性-时效性权衡：较慢的目标更新能提升稳定性，但会降低学习信号的时效性，从而阻碍收敛速度。我们提出了目标对齐强化学习（Target-Aligned Reinforcement Learning，TARL）框架，该框架强调目标网络与在线网络估计高度对齐的转移样本。通过将更新集中于对齐良好的目标，TARL在保留目标网络稳定优势的同时，缓解了陈旧目标估计的不利影响。我们提供了理论分析，证明目标对齐校正能加速收敛，并在多种基准环境中通过实验验证了其相对于标准强化学习算法的持续改进。

Abstract

Many reinforcement learning algorithms rely on target networks - lagged copies of the online network - to stabilize training. While effective, this mechanism introduces a fundamental stability-recency tradeoff: slower target updates improve stability but reduce the recency of learning signals, hindering convergence speed. We propose Target-Aligned Reinforcement Learning (TARL), a framework that emphasizes transitions for which the target and online network estimates are highly aligned. By focusing updates on well-aligned targets, TARL mitigates the adverse effects of stale target estimates while retaining the stabilizing benefits of target networks. We provide a theoretical analysis demonstrating that target alignment correction accelerates convergence, and empirically demonstrate consistent improvements over standard reinforcement learning algorithms across various benchmark environments.

Keywords: reinforcement learning, target networks, stability-recency tradeoff, target alignment, convergence acceleration, online network, benchmark environments
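
The stability-recency tradeoff mentioned in the abstract can be seen with a soft (Polyak) target update, one common way to keep a lagged copy of the online network. The sketch below is a generic illustration and not the TARL method itself: it measures how far a target parameter lags behind a fixed online parameter for two hypothetical update rates `tau`.

```python
def soft_update(target, online, tau):
    """Polyak averaging: move each target parameter a fraction tau toward the online one."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]

def lag_after(steps, tau):
    """Gap between a fixed online parameter (1.0) and a target that starts at 0.0."""
    target, online = [0.0], [1.0]
    for _ in range(steps):
        target = soft_update(target, online, tau)
    return online[0] - target[0]

slow_gap = lag_after(steps=10, tau=0.1)  # slow updates: smoother but staler target
fast_gap = lag_after(steps=10, tau=0.5)  # fast updates: fresher but less damped target
```

The gap shrinks by a factor of `1 - tau` per step, so after ten steps the slow target is still about a third of the way behind while the fast one has almost caught up.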

79. ❌ Impact of enriched meaning representations for language generation in dialogue tasks: A comprehensive exploration of the relevance of tasks, corpora and metrics

Authors: Alain Vázquez, Maria Inés Torres Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29518v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

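Scoring rationale: The paper studies natural language generation (NLG) in dialogue systems, using a task demonstrator (an MR-sentence pair) to improve the generation quality of a fine-tuned model, and analyses the effect across several datasets and metrics. It relates mainly to supervised fine-tuning (SFT) because it uses fine-tuned models, but it does not involve large models, innovations in the underlying techniques, or scientific applications. Other keywords such as LLMs, MoE, Scaling Laws, RLHF, RAG, and Agents are not addressed, so every keyword other than the SFT one receives 0 points.

!!! tip deepseek-chat TL;DR

The paper studies whether a task demonstrator (an MR-sentence pair) improves the natural language generation quality of fine-tuned models in dialogue tasks. It finds that the approach is effective for complex tasks and small datasets, and that semantic metrics assess generation quality more accurately than lexical metrics.

Abstract (Chinese translation)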

对话系统应生成多样化的语言形式，以便与用户进行流畅而准确的交互。在此背景下，自然语言生成（NLG）引擎将意义表示（MRs）转换为句子，直接影响用户感知。这些意义表示通常通过对话行为（DAs）编码交际功能（如告知、请求、确认），并通过槽值对列举语义内容。本研究旨在分析向生成器提供任务示例是否能提升微调模型的生成效果。该示例是从原始数据集中提取的MR-句子对，可在训练和推理阶段丰富输入信息。分析涉及五个聚焦不同语言学维度的评估指标，以及四个在领域、规模、词汇表、MR变异性和采集过程等多方面特征各异的数据集。据我们所知，这是首个在对话NLG领域实施的比较研究，系统分析了意义表示对生成质量的影响，并跨领域、语料库特征及评估指标进行综合考察。我们的核心发现是：所提出的增强输入方法对复杂任务及MR与句子高变异性的小规模数据集效果显著；在零样本场景下，该方法对所有领域均有助益。此外，指标分析表明语义指标比词汇指标能更精确地捕捉生成质量。在这些语义指标中，基于人工评分训练的指标能够检测出嵌入型指标常忽略的遗漏及其他细微语义问题。最后，指标分数的演变趋势以及槽位准确率和对话行为准确率的优异表现证明，生成模型对不同任务具有快速适应能力，并在语义与交际意图层面展现出强大鲁棒性。

Abstract

Conversational systems should generate diverse language forms to interact fluently and accurately with users. In this context, Natural Language Generation (NLG) engines convert Meaning Representations (MRs) into sentences, directly influencing user perception. These MRs usually encode the communicative function (e.g., inform, request, confirm) via DAs and enumerate the semantic content with slot-value pairs. In this work, our objective is to analyse whether providing a task demonstrator to the generator enhances the generations of a fine-tuned model. This demonstrator is an MR-sentence pair extracted from the original dataset that enriches the input at training and inference time. The analysis involves five metrics that focus on different linguistic aspects, and four datasets that differ in multiple features, such as domain, size, lexicon, MR variability, and acquisition process. To the best of our knowledge, this is the first study on dialogue NLG implementing a comparative analysis of the impact of MRs on generation quality across domains, corpus characteristics, and the metrics used to evaluate these generations. Our key insight is that the proposed enriched inputs are effective for complex tasks and small datasets with high variability in MRs and sentences. They are also beneficial in zero-shot settings for any domain. Moreover, the analysis of the metrics shows that semantic metrics capture generation quality more accurately than lexical metrics. In addition, among these semantic metrics, those trained with human ratings can detect omissions and other subtle semantic issues that embedding-based metrics often miss. Finally, the evolution of the metric scores and the excellent results for Slot Accuracy and Dialogue Act Accuracy demonstrate that the generative models present fast adaptability to different tasks and robustness at semantic and communicative intention levels.

Keywords: Natural Language Generation, Dialogue Systems, Meaning Representations, Fine-tuned Model, Task Demonstrator, Semantic Metrics, Zero-shot Settings, Comparative Analysis

80. ❌ Metriplector: From Field Theory to Neural Architecture

Authors: Dan Oprisa, Peter Toth Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29496v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

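Scoring rationale: The paper proposes Metriplector, a general neural architecture primitive inspired by physical field theory, and evaluates it across several domains, including language modeling. Its core contribution is architectural rather than aimed at the specific techniques named in most keywords. Relevance to "Large Language Models" is fairly high (8 points) because the paper explicitly applies Metriplector to language modeling and compares it with a GPT baseline. It has some connection to "AI for Science" (5 points) because its core idea comes from physical field theory, making it interdisciplinary, science-inspired AI research. The other keywords (such as MoE, SFT, RAG, and quantization) are neither mentioned nor addressed in the abstract and receive 0 points.

!!! tip deepseek-chat TL;DR

The study proposes Metriplector, a neural architecture primitive inspired by physical field theory, in which the input configures an abstract physical system (fields, sources, and operators) whose dynamics perform the computation. It reports strong results on maze pathfinding, Sudoku solving, image recognition, and language modeling, reaching 1.182 bits/byte on language modeling with 3.6x fewer training tokens than a GPT baseline.

Abstract (Chinese translation)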

我们提出Metriplector，一种神经架构基元，其输入配置一个抽象的物理系统——包括场、源与算符——而该系统的动力学过程即为计算本身。多个场通过耦合的度量泊松（metriplectic）动力学演化，且由诺特定理导出的应力-能量张量 $T^{μν}$ 提供输出。该度量泊松框架允许一系列自然的实例化：仅使用耗散分支即可通过共轭梯度法精确求解屏蔽泊松方程；激活完整结构——包括反对称泊松括号——则产生适用于图像识别与语言建模的场动力学。我们在四个领域评估Metriplector，每个领域均使用基于此共享基元构建、具有渐进丰富物理特性的任务专用架构：在迷宫寻路任务中达到F1=1.0，并能从15×15训练网格泛化至未见过的39×39网格；数独求解的精确率达到97.2%且无需结构注入；在CIFAR-100数据集上取得81.03%准确率（参数量226万）；在语言建模任务中达到1.182比特/字节，且训练词元数量比GPT基线减少3.6倍。

Abstract

We present Metriplector, a neural architecture primitive in which the input configures an abstract physical system – fields, sources, and operators – and the dynamics of that system is the computation. Multiple fields evolve via coupled metriplectic dynamics, and the stress-energy tensor $T^{\mu\nu}$, derived from Noether’s theorem, provides the readout. The metriplectic formulation admits a natural spectrum of instantiations: the dissipative branch alone yields a screened Poisson equation solved exactly via conjugate gradient; activating the full structure – including the antisymmetric Poisson bracket – gives field dynamics for image recognition and language modeling. We evaluate Metriplector across four domains, each using a task-specific architecture built from this shared primitive with progressively richer physics: F1=1.0 on maze pathfinding, generalizing from 15x15 training grids to unseen 39x39 grids; 97.2% exact Sudoku solve rate with zero structural injection; 81.03% on CIFAR-100 with 2.26M parameters; and 1.182 bits/byte on language modeling with 3.6x fewer training tokens than a GPT baseline.

Keywords: Metriplector, neural architecture primitive, field theory, metriplectic dynamics, language modeling, physical system, Noether’s theorem, cross-domain evaluation
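
The abstract states that the dissipative branch yields a screened Poisson equation solved via conjugate gradient. As a rough sketch of that numerical step alone, assuming a 1D grid with zero boundary values and a hypothetical screening constant `kappa2`, the code below solves $(-\nabla^2 + \kappa^2) u = f$ with a textbook conjugate-gradient loop. It does not reproduce the Metriplector architecture, and the grid size and source are illustrative.

```python
def apply_screened_laplacian(u, kappa2, h=1.0):
    """Apply (-d^2/dx^2 + kappa^2) on a 1D grid with zero boundary values."""
    n = len(u)
    out = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        out.append((2.0 * u[i] - left - right) / (h * h) + kappa2 * u[i])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def conjugate_gradient(apply_a, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite operator A."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x for the initial guess x = 0
    p = list(r)          # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        ap = apply_a(p)
        alpha = rs_old / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

source = [0.0] * 21
source[10] = 1.0  # hypothetical point source in the middle of the grid
operator = lambda u: apply_screened_laplacian(u, kappa2=0.5)
field = conjugate_gradient(operator, source)
```

The screening term makes the operator strictly positive definite, which is what conjugate gradient requires; the resulting field peaks at the source and decays symmetrically on both sides.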

81. ❌ MemFactory: Unified Inference & Training Framework for Agent Memory

Authors: Ziliang Guo, Ziheng Li, Zhiyu Li Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29493v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

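Scoring rationale: The paper's core topic is a framework for memory-augmented LLM agents. It is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points) because it explicitly studies LLM-based agents, and highly relevant to "LLM Agents OR Autonomous Agents OR Agentic Workflow" (10 points) because it focuses on building an agent framework. Other keywords such as MoE, SFT, and RAG are not mentioned or relevant in the abstract and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes MemFactory, a unified training and inference framework for memory-augmented agents. Using a modular design and GRPO-based optimization, it reports relative gains of up to 14.8% on the MemAgent architecture.

Abstract (Chinese translation)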

记忆增强型大语言模型（LLMs）是开发具备长期能力的智能体（AI agents）的关键。近年来，应用强化学习（RL）来优化记忆操作（如提取、更新和检索）已成为一个极具前景的研究方向。然而，现有的实现方案仍高度分散且局限于特定任务，缺乏一个统一的基础设施来简化这些复杂流程的集成、训练与评估。为弥补这一空白，我们提出了MemFactory，这是首个专为记忆增强型智能体设计的统一、高度模块化的训练与推理框架。受LLaMA-Factory等统一微调框架成功的启发，MemFactory将记忆生命周期抽象为原子化、即插即用的组件，使研究人员能够通过“乐高式”架构无缝构建定制化的记忆智能体。此外，该框架原生集成了群体相对策略优化（Group Relative Policy Optimization, GRPO），以基于多维环境奖励来微调内部记忆管理策略。MemFactory为包括Memory-R1、RMM和MemAgent在内的近期前沿范式提供了开箱即用的支持。我们使用公开可用的训练和评估数据，在开源的MemAgent架构上对MemFactory进行了实证验证。在领域内和分布外评估集上，MemFactory均能持续提升对应基础模型的性能，相对增益最高达14.8%。通过提供一个标准化、可扩展且易于使用的基础设施，MemFactory显著降低了研究门槛，为未来记忆驱动型智能体的创新铺平了道路。

Abstract

Memory-augmented Large Language Models (LLMs) are essential for developing capable, long-term AI agents. Recently, applying Reinforcement Learning (RL) to optimize memory operations, such as extraction, updating, and retrieval, has emerged as a highly promising research direction. However, existing implementations remain highly fragmented and task-specific, lacking a unified infrastructure to streamline the integration, training, and evaluation of these complex pipelines. To address this gap, we present MemFactory, the first unified, highly modular training and inference framework specifically designed for memory-augmented agents. Inspired by the success of unified fine-tuning frameworks like LLaMA-Factory, MemFactory abstracts the memory lifecycle into atomic, plug-and-play components, enabling researchers to seamlessly construct custom memory agents via a “Lego-like” architecture. Furthermore, the framework natively integrates Group Relative Policy Optimization (GRPO) to fine-tune internal memory management policies driven by multi-dimensional environmental rewards. MemFactory provides out-of-the-box support for recent cutting-edge paradigms, including Memory-R1, RMM, and MemAgent. We empirically validate MemFactory on the open-source MemAgent architecture using its publicly available training and evaluation data. Across both in-domain and out-of-distribution evaluation sets, MemFactory consistently improves performance over the corresponding base models, with relative gains of up to 14.8%. By providing a standardized, extensible, and easy-to-use infrastructure, MemFactory significantly lowers the barrier to entry, paving the way for future innovations in memory-driven AI agents.

Keywords: Memory-augmented LLMs, AI agents, unified framework, training and inference, GRPO, modular components, MemAgent, performance improvement
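
GRPO, which the abstract says MemFactory integrates, is commonly described as scoring each sampled response relative to the other responses drawn for the same prompt. A minimal sketch of that normalisation step follows, assuming scalar rewards, a population standard deviation, and a small `eps` for numerical safety. These are illustrative choices and are not taken from MemFactory's code.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four responses sampled for the same prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat the group average get positive advantages and the rest get negative ones, so no separate value network is needed to provide a baseline.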

82. ❌ Structural Compactness as a Complementary Criterion for Explanation Quality

Authors: Mohammad Mahdi Mesgari, Jackie Ma, Wojciech Samek, Sebastian Lapuschkin, Leander Weber Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29491v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

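Scoring rationale: The paper addresses explainable AI and proposes MST-C, a new structural compactness metric for assessing the quality of attribution explanations. Its content bears directly on the interpretability of deep learning models, so it is highly relevant to "Mechanistic Interpretability OR Explainable AI" (10 points). It does not involve large language models, model training techniques, inference optimization, AI agents, or AI-for-science applications; those keywords are unrelated to the paper's topic (0 points).

!!! tip deepseek-chat TL;DR

To address the difficulty of quantifying explanation legibility when assessing attribution quality, the paper proposes MST-C, a graph-based compactness metric that distinguishes between explanation methods and exposes structural differences between models.

Abstract (Chinese translation)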

在归因质量评估中，解释可读性的量化衡量尤为困难，因为其受到归因图形状与内部组织方式差异的影响，这些特征无法通过简单统计量捕捉。为解决此问题，我们提出最小生成树紧密度（Minimum Spanning Tree Compactness, MST-C），这是一种基于图的结构化度量方法，能够捕捉归因的高阶几何特性，例如分布广度与内聚性。这些要素被整合为单一评分，用于评估紧密度，其偏好那些显著点分布范围小且在空间上组织为少量高内聚类簇的归因形式。我们证明，MST-C能够可靠区分不同解释方法，揭示模型间根本的结构性差异，并为解释紧密度提供一种稳健且自足的诊断指标，从而对现有归因复杂度的理论形成补充。

Abstract

In the evaluation of attribution quality, the quantitative assessment of explanation legibility is particularly difficult, as it is influenced by varying shapes and internal organization of attributions not captured by simple statistics. To address this issue, we introduce Minimum Spanning Tree Compactness (MST-C), a graph-based structural metric that captures higher-order geometric properties of attributions, such as spread and cohesion. These components are combined into a single score that evaluates compactness, favoring attributions with salient points spread across a small area and spatially organized into few but cohesive clusters. We show that MST-C reliably distinguishes between explanation methods, exposes fundamental structural differences between models, and provides a robust, self-contained diagnostic for explanation compactness that complements existing notions of attribution complexity.

Keywords: attribution quality, explanation legibility, Minimum Spanning Tree Compactness, structural metric, geometric properties, explanation compactness, interpretability, Explainable AI
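
The abstract does not give the MST-C formula, so the following is only a sketch of the underlying building block: the total edge length of a Euclidean minimum spanning tree over the salient pixels of an attribution map. The thresholding rule, the helper names, and the toy maps are assumptions made for illustration, and the result is not the paper's score.

```python
import math

def mst_total_length(points):
    """Total edge length of the Euclidean minimum spanning tree (Prim's algorithm)."""
    if len(points) < 2:
        return 0.0
    # best[i] is the cheapest known edge from the growing tree to point i.
    best = {i: math.dist(points[0], points[i]) for i in range(1, len(points))}
    total = 0.0
    while best:
        nxt = min(best, key=best.get)
        total += best.pop(nxt)
        for i in best:
            best[i] = min(best[i], math.dist(points[nxt], points[i]))
    return total

def salient_points(attribution, threshold):
    """Pixel coordinates whose attribution value is at least `threshold`."""
    return [(r, c) for r, row in enumerate(attribution)
            for c, v in enumerate(row) if v >= threshold]

# Two hypothetical 4x4 attribution maps with four salient pixels each.
compact = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
scattered = [[1, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0], [1, 0, 0, 1]]
compact_len = mst_total_length(salient_points(compact, 1))
scattered_len = mst_total_length(salient_points(scattered, 1))
```

Both maps have the same number of salient pixels, yet the clustered one has a much shorter spanning tree, which is the kind of structural difference that simple per-pixel statistics would miss.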

83. ❌ iPoster: Content-Aware Layout Generation for Interactive Poster Design via Graph-Enhanced Diffusion Models

Authors: Xudong Zhou, Jinyuan Liang, Qiuyi Guo, Guozheng Li Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29469v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

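Scoring rationale: iPoster focuses on graphic design and layout generation, using graph-enhanced diffusion models for interactive poster design. All of the scoring keywords concern large language models (LLMs) and related techniques (training methods, inference optimization, alignment, applications, and so on), whereas this paper involves neither LLMs nor deep learning for science, and does not discuss innovations in the underlying techniques of large models. Its core is an application of diffusion models in computer vision and graphics, with no direct connection to any topic in the keyword list.

!!! tip deepseek-chat TL;DR

The paper proposes iPoster, an interactive poster layout generation framework based on graph-enhanced diffusion models. Users guide content-aware layout design by specifying constraints, and experiments show state-of-the-art layout quality and controllability.

Abstract (Chinese translation)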

本文提出iPoster——一种交互式布局生成框架，该框架使用户能够通过设定灵活约束来指导内容感知的海报布局设计。iPoster允许用户在意图模块中指定部分设计意向，例如元素类别、尺寸、位置或粗略的初始草图。随后，生成模块会即时生成精细且符合上下文情境的布局方案，并严格遵循这些约束条件。iPoster采用统一的图增强扩散架构，支持在用户指定约束下完成多种设计任务。这些约束通过掩码策略在去噪过程的每一步中精确保留用户输入。跨内容感知注意力模块将生成元素与画布的显著区域进行对齐，确保视觉连贯性。大量实验表明，iPoster不仅实现了最先进的布局生成质量，更为约束条件下的海报布局设计提供了响应迅速且高度可控的框架。

Abstract

We present iPoster, an interactive layout generation framework that empowers users to guide content-aware poster layout design by specifying flexible constraints. iPoster enables users to specify partial intentions within the intention module, such as element categories, sizes, positions, or coarse initial drafts. Then, the generation module instantly generates refined, context-sensitive layouts that faithfully respect these constraints. iPoster employs a unified graph-enhanced diffusion architecture that supports various design tasks under user-specified constraints. These constraints are enforced through masking strategies that precisely preserve user input at every denoising step. A cross content-aware attention module aligns generated elements with salient regions of the canvas, ensuring visual coherence. Extensive experiments show that iPoster not only achieves state-of-the-art layout quality, but offers a responsive and controllable framework for poster layout design with constraints.

Keywords: interactive layout generation, content-aware poster design, graph-enhanced diffusion models, user-specified constraints, masking strategies, cross content-aware attention, visual coherence, state-of-the-art layout quality
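
The abstract says constraints are enforced through masking strategies that preserve user input at every denoising step. The sketch below illustrates that general idea only. It assumes a layout is a flat list of box values, and `toy_denoise_step` is a hypothetical stand-in, not iPoster's graph-enhanced diffusion network. Practical diffusion samplers often re-noise the pinned values to the current noise level, a detail omitted here.

```python
import random

def toy_denoise_step(values, rng):
    """Hypothetical stand-in for a learned denoiser: pull values toward 0.5, add small noise."""
    return [0.9 * v + 0.05 + rng.gauss(0.0, 0.01) for v in values]

def constrained_sampling(user_values, fixed, steps=50, seed=0):
    """Toy reverse process that clamps user-specified entries after every step.

    user_values: flat list of layout values (e.g. x, y, w, h per element)
    fixed: list of booleans, True where the user pinned the value
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in user_values]  # start from pure noise
    for _ in range(steps):
        x = toy_denoise_step(x, rng)
        # Re-impose the user's values so the constraint holds at every step.
        x = [u if f else v for u, f, v in zip(user_values, fixed, x)]
    return x

# One fully specified box, and one box where only the width is pinned (toy values).
user_values = [0.1, 0.2, 0.5, 0.1, 0.0, 0.0, 0.3, 0.0]
fixed = [True, True, True, True, False, False, True, False]
layout = constrained_sampling(user_values, fixed)
```

Pinned entries come back unchanged, while free entries are filled in by the sampler.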

84. ❌ M-MiniGPT4: Multilingual VLLM Alignment via Translated Data

Authors: Seung Hun Han, Youssef Mohamed, Mohamed Elhoseiny Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29467v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
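
The arithmetic behind the table above can be sketched as follows. The pass threshold uses the report's configured formula (5.0 + 0.8 × total weight, with a total weight of 27). The row formula `weight * relevance` is an assumption, since the report does not state how the Score column is derived. The function names are hypothetical.

```python
def pass_threshold(total_weight, base=5.0, factor=0.8):
    """Pass mark from the report's configuration: base + factor * total weight."""
    return base + factor * total_weight

def paper_score(rows):
    """Sum of per-keyword scores; ASSUMES each row score is weight * relevance."""
    return sum(weight * relevance for weight, relevance in rows)

threshold = pass_threshold(27.0)  # 5.0 + 0.8 * 27

# Rows as listed in the table: three keywords at relevance 10.0, the rest at 0.0,
# every weight listed as 0.0.
rows = [(0.0, 10.0)] * 3 + [(0.0, 0.0)] * 24
score = paper_score(rows)
passed = score >= threshold
```

Under this assumption, a listed weight of 0.0 forces every row score to 0.0 whatever the relevance, which matches the 0.0 total shown for this paper.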

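Scoring rationale: The M-MiniGPT4 paper focuses on alignment training for a multilingual vision large language model (VLLM); its core involves large language model (LLM) techniques, supervised fine-tuning (SFT), and alignment methods. It proposes a multilingual alignment training stage that uses parallel text corpora to strengthen the model's multilingual capabilities, which maps directly onto the 'Post-training/SFT' and 'Instruction Tuning/Alignment' keywords. The model builds on the MiniGPT4 architecture and is a vision large language model, so it is highly relevant to 'Large Language Models/Foundation Models'. The paper does not touch the techniques or application areas behind the other keywords (MoE, SLMs, Scaling Laws, RAG, CoT, Agents, Quantization, and so on), so those are scored 0.

!!! tip deepseek-chat TL;DR

The paper studies how to improve a vision large language model on multilingual vision-language understanding by training on a mixture of native multilingual and translated data and by adding a multilingual alignment training stage. The resulting M-MiniGPT4 performs strongly across 11 languages and reaches 36% accuracy on the multilingual MMMU benchmark, outperforming models in the same weight class.

Abstract (Chinese translation)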

本文提出了一种名为M-MiniGPT4的多语言视觉大语言模型。该模型在11种语言中展现出强大的视觉语言理解能力。我们采用原生多语言数据与翻译数据的混合训练方式，以提升MiniGPT4架构的多语言视觉语言理解性能。此外，我们提出了一种多语言对齐训练阶段，利用平行文本语料库进一步增强模型的多语言能力。M-MiniGPT4在多语言MMMU基准测试中取得了36%的准确率，超越了同等参数规模的最先进模型，包括在本研究主体工作完成后发布的基础模型。我们开源了模型、代码及翻译数据集，以促进低资源与多语言环境下的未来研究。

Abstract

This paper presents a Multilingual Vision Large Language Model, named M-MiniGPT4. Our model exhibits strong vision-language understanding (VLU) capabilities across 11 languages. We utilize a mixture of native multilingual and translated data to push the multilingual VLU performance of the MiniGPT4 architecture. In addition, we propose a multilingual alignment training stage that uses parallel text corpora to further enhance the multilingual capabilities of our model. M-MiniGPT4 achieves 36% accuracy on the multilingual MMMU benchmark, outperforming state-of-the-art models in the same weight class, including foundation models released after the majority of this work was completed. We open-source our models, code, and translated datasets to facilitate future research in low-resource and multilingual settings.

Keywords: Multilingual Vision Large Language Model, M-MiniGPT4, Vision-Language Understanding, Multilingual Alignment Training, Translated Data, MiniGPT4 Architecture, Parallel Text Corpora, MMMU Benchmark

85. ❌ An Isotropic Approach to Efficient Uncertainty Quantification with Gradient Norms

Authors: Nils Grünefeld, Jes Frellsen, Christian Hardmeier Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29466v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

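Scoring rationale: The paper's core topic is quantifying predictive uncertainty in large language models (LLMs), so it is highly relevant to 'Large Language Models' (10 points). It covers uncertainty estimation and factuality evaluation, giving it some relevance to 'Hallucination Mitigation' (8 points). The method rests on gradient analysis, which relates to model interpretability and gives some relevance to 'Mechanistic Interpretability' (5 points). The paper mentions using pretrained models, a basic link to 'Pre-training' (5 points). Other keywords such as MoE, SFT, and RAG are not addressed and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a lightweight method that quantifies the predictive uncertainty of large language models using gradient norms and an isotropy assumption, and shows on question-answering tasks that it can effectively distinguish different types of uncertainty signal.

Abstract (Chinese translation)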

现有量化神经网络预测不确定性的方法对于大型语言模型而言，要么计算上难以处理，要么需要访问通常无法获取的训练数据。我们通过两种近似推导出一种轻量级替代方案：一是将不确定性表示为预测梯度与参数协方差的一阶泰勒展开，二是对参数协方差采用各向同性假设。二者结合，仅需对未经修改的预训练模型进行一次前向-反向传播，即可将认知不确定性表示为梯度范数的平方，将偶然不确定性表示为点预测的伯努利方差。我们通过以下两点论证各向同性假设的合理性：首先，基于非训练数据构建的协方差估计会引入结构性失真，而各向同性协方差可避免此问题；其次，关于大型网络谱性质的理论结果支持该近似在大规模模型中的适用性。在合成问题上与参考的马尔可夫链蒙特卡洛估计进行对比验证，结果显示二者高度一致，且一致性随模型规模增大而提升。随后，我们运用该估计方法探究在大型语言模型问答任务中，各类不确定性何时对预测答案正确性具有有效信号价值，揭示出基准依赖性的差异：在TruthfulQA（其问题涉及合理答案间的真实冲突）上，综合不确定性估计取得了最高的平均AUROC；而在TriviaQA（侧重事实回忆）上，其表现降至接近随机水平。这表明参数层面的不确定性捕获的信号与自评估方法存在本质区别。

Abstract

Existing methods for quantifying predictive uncertainty in neural networks are either computationally intractable for large language models or require access to training data that is typically unavailable. We derive a lightweight alternative through two approximations: a first-order Taylor expansion that expresses uncertainty in terms of the gradient of the prediction and the parameter covariance, and an isotropy assumption on the parameter covariance. Together, these yield epistemic uncertainty as the squared gradient norm and aleatoric uncertainty as the Bernoulli variance of the point prediction, from a single forward-backward pass through an unmodified pretrained model. We justify the isotropy assumption by showing that covariance estimates built from non-training data introduce structured distortions that isotropic covariance avoids, and that theoretical results on the spectral properties of large networks support the approximation at scale. Validation against reference Markov Chain Monte Carlo estimates on synthetic problems shows strong correspondence that improves with model size. We then use the estimates to investigate when each uncertainty type carries useful signal for predicting answer correctness in question answering with large language models, revealing a benchmark-dependent divergence: the combined estimate achieves the highest mean AUROC on TruthfulQA, where questions involve genuine conflict between plausible answers, but falls to near chance on TriviaQA’s factual recall, suggesting that parameter-level uncertainty captures a fundamentally different signal than self-assessment methods.

Keywords: uncertainty quantification, large language models, gradient norms, epistemic uncertainty, aleatoric uncertainty, predictive uncertainty, model calibration, question answering
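
The two approximations in the abstract combine as follows: a first-order Taylor expansion gives Var[f] ≈ ∇f(θ)ᵀ Σ ∇f(θ), and an isotropic covariance Σ = σ²I reduces this to σ²‖∇f(θ)‖². The sketch below is a toy illustration, not the authors' implementation. It assumes a logistic model p = sigmoid(θ·x), where the gradient is analytic, and it drops the unknown scale σ². For an LLM the gradient would instead come from one backward pass.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def uncertainties(theta, x):
    """Return (epistemic, aleatoric) for a toy logistic model p = sigmoid(theta . x).

    epistemic: squared norm of dp/dtheta (Taylor expansion plus isotropic
               covariance, up to the unknown covariance scale)
    aleatoric: Bernoulli variance p * (1 - p) of the point prediction
    """
    p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    grad = [p * (1.0 - p) * xi for xi in x]  # analytic gradient of p w.r.t. theta
    epistemic = sum(g * g for g in grad)
    aleatoric = p * (1.0 - p)
    return epistemic, aleatoric

# Toy input: with theta = 0 the prediction is p = 0.5.
epi, alea = uncertainties(theta=[0.0, 0.0], x=[1.0, 2.0])
```

In this toy case the gradient is (0.25, 0.5), so the epistemic term is 0.3125 and the aleatoric term is 0.25.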

86. ❌ Few-shot Writer Adaptation via Multimodal In-Context Learning

Authors: Tom Simon, Stephane Nicolas, Pierrick Tranouez, Clement Chatelain, Thierry Paquet Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29450v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

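Scoring rationale: The paper focuses on handwritten text recognition (HTR) and proposes a few-shot writer adaptation framework based on multimodal in-context learning. Its core contribution is applying in-context learning to HTR, adapting at inference time from a few examples of the target writer without parameter updates. It is therefore highly relevant only to the keyword "In-context Learning OR Many-shot Learning" (10 points), since the paper explicitly proposes a "multimodal in-context learning" approach. All other keywords are unrelated: the paper does not involve large language models (LLMs), model architectures (such as MoE), training techniques (such as SFT or RLHF), inference optimization (such as quantization or speculative decoding), agent systems, or AI-for-science applications. It is computer vision (OCR/HTR) research rather than an innovation in the principles of large models or deep learning.

!!! tip deepseek-chat TL;DR

To address the performance drop of handwritten text recognition models on unseen or atypical writers, the paper proposes a few-shot writer adaptation framework based on multimodal in-context learning. On the IAM and RIMES datasets it outperforms all writer-independent HTR models, with no parameter updates at inference time.

Abstract (Chinese translation)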

尽管当前先进的手写文本识别（Handwritten Text Recognition, HTR）模型在标准基准测试中表现良好，但它们常常难以处理训练数据中代表性不足、风格高度特异的手写者。为应对未见过的和非典型的手写者，书写者自适应技术可将HTR模型个性化适配至个体笔迹风格。主流的书写者自适应方法需要离线微调或在推理时进行参数更新，两者均涉及梯度计算与反向传播，这会增加计算成本并需要仔细的超参数调优。本研究提出一种受多模态上下文学习启发的新型上下文驱动HTR框架，仅需目标书写者的少量示例即可在推理时实现自适应，且无需任何参数更新。我们进一步探究了上下文长度的影响，设计了一个紧凑的800万参数CNN-Transformer模型以实现少样本上下文自适应，并证明结合上下文驱动与标准光学字符识别（OCR）训练策略可带来互补性提升。在IAM和RIMES数据集上的实验验证了本方法的有效性，其字符错误率（Character Error Rate）分别达到3.92%和2.34%，超越了所有独立于书写者的HTR模型，且在推理时无需任何参数更新。

Abstract

While state-of-the-art Handwritten Text Recognition (HTR) models perform well on standard benchmarks, they frequently struggle with writers exhibiting highly specific styles that are underrepresented in the training data. To handle unseen and atypical writers, writer adaptation techniques personalize HTR models to individual handwriting styles. Leading writer adaptation methods require either offline fine-tuning or parameter updates at inference time, both involving gradient computation and backpropagation, which increase computational costs and demand careful hyperparameter tuning. In this work, we propose a novel context-driven HTR framework inspired by multimodal in-context learning, enabling inference-time writer adaptation using only a few examples from the target writer without any parameter updates. We further demonstrate the impact of context length, design a compact 8M-parameter CNN-Transformer that enables few-shot in-context adaptation, and show that combining context-driven and standard OCR training strategies leads to complementary improvements. Experiments on IAM and RIMES validate our approach with Character Error Rates of 3.92% and 2.34%, respectively, surpassing all writer-independent HTR models without requiring any parameter updates at inference time.

Keywords: Handwritten Text Recognition, Writer Adaptation, Multimodal In-context Learning, Few-shot Learning, Inference-time Adaptation, CNN-Transformer, Character Error Rate, Parameter-free
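
The results above are reported as Character Error Rate (CER). The sketch below uses the common definition, total character-level edit distance divided by total reference length. The paper's exact convention (text normalization, corpus-level versus per-line averaging) is not given in the abstract, so this is an assumption. The sample strings are invented for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute all cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]

def character_error_rate(references, hypotheses):
    """Corpus-level CER: total edit distance over total reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    chars = sum(len(r) for r in references)
    return errors / chars

# Invented example: one inserted character and one substituted character.
cer = character_error_rate(["handwriting", "text"], ["handwritting", "test"])
```

Here two errors over fifteen reference characters give a CER of about 13.3%.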

87. ❌ NeoNet: An End-to-End 3D MRI-Based Deep Learning Framework for Non-Invasive Prediction of Perineural Invasion via Generation-Driven Classification

Authors: Youngung Han, Minkyung Cha, Kyeonghun Kim, Induk Um, Myeongbin Sho, Joo Young Bae, Jaewon Jung, Jung Hyeok Park, Seojun Lee, Nam-Joon Kim, Woo Kyoung Jeong, Won Jae Lee, Pa Hong, Ken Ying-Kai Liao, Hyuk-Jae Lee Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29449v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

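Scoring rationale: The NeoNet paper focuses on medical imaging and proposes an end-to-end 3D deep learning framework for predicting perineural invasion (PNI) in cholangiocarcinoma. It combines segmentation, generation (using a 3D Latent Diffusion Model), and classification modules, and is an application of AI to biomedicine. Only "AI for Science OR Bioinformatics OR Cheminformatics" is highly relevant, because the paper applies deep learning directly to biomedical imaging (within the Bioinformatics/AI for Science scope). The other keywords mainly concern general AI techniques such as large language models (LLMs), model training, inference optimization, and agent systems; the paper covers none of these and concentrates on a domain-specific 3D deep learning application, so their relevance is 0.

!!! tip deepseek-chat TL;DR

The study proposes NeoNet, an end-to-end 3D deep learning framework for non-invasive prediction of perineural invasion in cholangiocarcinoma. By integrating segmentation, generation, and classification modules, it reaches a maximum AUC of 0.7903 in 5-fold cross-validation.

Abstract (Chinese translation)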

最大限度减少侵入性诊断程序以降低患者损伤和感染风险是医学成像领域的核心目标。然而，对于沿周围神经浸润的肿瘤细胞这一关键预后因素——神经周围浸润（Perineural Invasion, PNI）的非侵入性诊断，由于缺乏明确且一致的影像学识别标准，目前仍具挑战性。为应对这一挑战，我们提出了NeoNet，一个用于胆管癌PNI预测的、不依赖预定义图像特征的集成式端到端三维深度学习框架。NeoNet整合了三个模块：（1）NeoSeg，采用肿瘤定位ROI裁剪（Tumor-Localized ROI Crop, TLCR）算法；（2）NeoGen，一个结合ControlNet的三维潜在扩散模型（3D Latent Diffusion Model, LDM），以解剖掩模为条件生成合成图像块，专门将数据集平衡至1:1比例；以及（3）NeoCls，最终预测模块。针对NeoCls，我们开发了PNI注意力网络（PNI-Attention Network, PattenNet），它利用冻结的LDM编码器和专门设计的三维双重注意力块（3D Dual Attention Blocks, DAB），以检测指示PNI的细微强度变化与空间模式。在五折交叉验证中，NeoNet的表现优于基线三维模型，并以0.7903的最高曲线下面积（AUC）取得了最佳性能。

Abstract

Minimizing invasive diagnostic procedures to reduce the risk of patient injury and infection is a central goal in medical imaging. And yet, noninvasive diagnosis of perineural invasion (PNI), a critical prognostic factor involving infiltration of tumor cells along the surrounding nerve, still remains challenging, due to the lack of clear and consistent imaging criteria for identifying PNI. To address this challenge, we present NeoNet, an integrated end-to-end 3D deep learning framework for PNI prediction in cholangiocarcinoma that does not rely on predefined image features. NeoNet integrates three modules: (1) NeoSeg, utilizing a Tumor-Localized ROI Crop (TLCR) algorithm; (2) NeoGen, a 3D Latent Diffusion Model (LDM) with ControlNet, conditioned on anatomical masks to generate synthetic image patches, specifically balancing the dataset to a 1:1 ratio; and (3) NeoCls, the final prediction module. For NeoCls, we developed the PNI-Attention Network (PattenNet), which uses the frozen LDM encoder and specialized 3D Dual Attention Blocks (DAB) designed to detect subtle intensity variations and spatial patterns indicative of PNI. In 5-fold cross-validation, NeoNet outperformed baseline 3D models and achieved the highest performance with a maximum AUC of 0.7903.

Keywords: 3D deep learning, medical imaging, perineural invasion prediction, cholangiocarcinoma, latent diffusion model, non-invasive diagnosis, attention network, synthetic data generation
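
Two quantities in the abstract are easy to make concrete: the 1:1 class balancing and the AUC metric. The sketch below uses the standard rank-based definition of AUC and a trivial count of the synthetic samples needed to balance two classes. The labels, scores, and class counts are invented toy values, not data from the paper, and the function names are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def synthetic_needed(n_pos, n_neg):
    """Synthetic minority-class samples required to reach a 1:1 class ratio."""
    return abs(n_pos - n_neg)

# Toy values for illustration only.
auc = roc_auc(labels=[1, 1, 0, 0, 0], scores=[0.9, 0.4, 0.5, 0.3, 0.2])
extra = synthetic_needed(n_pos=30, n_neg=70)
```

In the toy example, five of the six positive-negative pairs are ranked correctly, so the AUC is 5/6.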

88. ❌ RAAP: Retrieval-Augmented Affordance Prediction with Cross-Image Action Alignment

Authors: Qiyuan Zhuang, He-Yang Xu, Yijun Wang, Xin-Yang Zhao, Yang-Yang Li, Xiu-Shen Wei Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29419v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

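Scoring rationale: The RAAP paper focuses on object affordance prediction in robot vision; its core is a retrieval-augmented framework with cross-image action alignment. It is unrelated to the vast majority of keywords, since it does not involve large language models, training techniques, reasoning methods, agent systems, and so on. The only relevant keyword is 'Retrieval-Augmented Generation OR RAG OR Retrieval-Generation', because the paper uses retrieval augmentation to consolidate multiple references; however, this is applied to a vision task rather than text generation, hence 5 points (some relevance). All other keywords are not addressed and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes the RAAP framework, which uses retrieval-augmented cross-image action alignment to address generalization of affordance prediction to unseen objects, enabling zero-shot robotic manipulation after training on only a few samples.

Abstract (Chinese translation)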

理解物体可供性是使机器人能够在多样化非结构化环境中执行有目的、细粒度交互的关键。然而，现有方法要么依赖于检索机制——由于数据稀疏性和覆盖范围缺陷而显得脆弱，要么依赖于大规模模型——这些模型在应用于未见类别时经常错误定位接触点并误判接触后动作，从而阻碍了鲁棒的泛化能力。我们提出检索增强可供性预测框架，该框架将可供性检索与基于对齐的学习相统一。通过解耦静态接触点定位与动态动作方向预测，RAAP 通过稠密对应关系迁移接触点，并借助检索增强对齐模型预测动作方向——该模型采用双权重注意力机制整合多参考信息。仅使用 DROID 和 HOI4D 数据集的紧凑子集（每任务仅需数十样本）进行训练，RAAP 在未见物体和类别上均取得稳定性能，并在仿真与真实场景中实现了零样本机器人操作。项目网站：https://github.com/SEU-VIPGroup/RAAP。

Abstract

Understanding object affordances is essential for enabling robots to perform purposeful and fine-grained interactions in diverse and unstructured environments. However, existing approaches either rely on retrieval, which is fragile due to sparsity and coverage gaps, or on large-scale models, which frequently mislocalize contact points and mispredict post-contact actions when applied to unseen categories, thereby hindering robust generalization. We introduce Retrieval-Augmented Affordance Prediction (RAAP), a framework that unifies affordance retrieval with alignment-based learning. By decoupling static contact localization and dynamic action direction, RAAP transfers contact points via dense correspondence and predicts action directions through a retrieval-augmented alignment model that consolidates multiple references with dual-weighted attention. Trained on compact subsets of DROID and HOI4D with as few as tens of samples per task, RAAP achieves consistent performance across unseen objects and categories, and enables zero-shot robotic manipulation in both simulation and the real world. Project website: https://github.com/SEU-VIPGroup/RAAP.

Keywords: affordance prediction, retrieval-augmented, cross-image alignment, robotic manipulation, zero-shot generalization, contact localization, action direction, dense correspondence

89. ❌ Adversarial Prompt Injection Attack on Multimodal Large Language Models

Authors: Meiwen Ding, Song Xia, Chenqi Kong, Xudong Jiang Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29418v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

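Scoring rationale: The paper studies adversarial prompt injection attacks on multimodal large language models (MLLMs). It is highly relevant to 'Large Language Models' (10 points), since MLLMs extend LLMs, and relevant to 'Instruction Tuning' (8 points), since the attack targets the model's instruction-following behavior. Other keywords such as MoE, SLMs, and Scaling Laws are unrelated to the paper's adversarial-attack and security focus (0 points).

!!! tip deepseek-chat TL;DR

The paper studies an imperceptible visual prompt injection attack on multimodal large language models: adversarial instructions are embedded in the input image, and the attack is effective against multiple closed-source MLLMs.

Abstract (Chinese translation)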

尽管多模态大语言模型（MLLMs）在现实应用中的部署日益广泛，但其遵循指令的特性使其易受提示注入攻击。现有的提示注入方法主要依赖于文本提示或人类用户可感知的视觉提示。在本研究中，我们针对强大的闭源MLLMs，探索了不可感知的视觉提示注入攻击，即将对抗性指令嵌入视觉模态中。我们的方法通过有界文本叠加将恶意提示自适应地嵌入输入图像，以提供语义引导。同时，通过迭代优化不可感知的视觉扰动，使受攻击图像的特征表示在粗粒度和细粒度上与恶意视觉及文本目标对齐。具体而言，视觉目标被实例化为文本渲染图像，并在优化过程中逐步细化，以更准确地表达预期语义并提升迁移能力。在多个闭源MLLMs上对两种多模态理解任务进行的广泛实验表明，与现有方法相比，我们的方法具有更优越的性能。

Abstract

Although multimodal large language models (MLLMs) are increasingly deployed in real-world applications, their instruction-following behavior leaves them vulnerable to prompt injection attacks. Existing prompt injection methods predominantly rely on textual prompts or perceptible visual prompts that are observable by human users. In this work, we study imperceptible visual prompt injection against powerful closed-source MLLMs, where adversarial instructions are embedded in the visual modality. Our method adaptively embeds the malicious prompt into the input image via a bounded text overlay to provide semantic guidance. Meanwhile, the imperceptible visual perturbation is iteratively optimized to align the feature representation of the attacked image with those of the malicious visual and textual targets at both coarse- and fine-grained levels. Specifically, the visual target is instantiated as a text-rendered image and progressively refined during optimization to more faithfully represent the desired semantics and improve transferability. Extensive experiments on two multimodal understanding tasks across multiple closed-source MLLMs demonstrate the superior performance of our approach compared to existing methods.

Keywords: Multimodal Large Language Models, Prompt Injection Attack, Adversarial Visual Prompt, Imperceptible Perturbation, Instruction-following Vulnerability, Closed-source MLLMs, Feature Representation Alignment, Transferability

90. ❌ Hallucination-aware intermediate representation edit in large vision-language models

Authors: Wei Suo, Hanzu Zhang, Lijun Zhang, Ji Ma, Peng Wang, Yanning Zhang Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29405v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	15.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

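Scoring rationale: The paper focuses on hallucination in large vision-language models and proposes a method that dynamically detects and edits hallucinated representations. The highly relevant keywords are: 1) 'Large Language Models' (the paper studies large vision-language models, which fall under large models; 10 points); 2) 'Hallucination Mitigation' (hallucination mitigation is the paper's core topic; 15 points); 3) 'Self-Correction' (the proposed editing method is a form of model self-correction; 8 points); 4) 'Mechanistic Interpretability' (detecting and editing intermediate representations has some explainable-AI character; 5 points). Other keywords such as MoE, quantization, inference acceleration, and RAG are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

To address hallucination in large vision-language models, the paper proposes a framework that dynamically detects and edits hallucinated intermediate representations, achieving state-of-the-art hallucination elimination at minimal computational cost.

Abstract (Chinese translation)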

大型视觉语言模型在多模态推理与复杂场景理解方面展现出卓越性能。然而，这些模型仍面临严重的幻觉问题，即输出内容与视觉事实相矛盾。近期关于缓解幻觉的研究主要集中于重训练方法和对比解码方法。尽管两种方法均表现良好，但重训练方法需要大量训练资源，而对比解码方法会引入双重推理开销，这些因素限制了其实际应用。为解决上述问题，我们提出了一种动态检测幻觉表征并对其执行幻觉消除编辑的框架。该方法以极少的额外计算成本，在现有基准测试中实现了最先进的性能。大量实验证明了我们方法的有效性，凸显了其高效稳健的幻觉消除能力以及对幻觉的强大可控性。代码发布于 https://github.com/ASGO-MM/HIRE

Abstract

Large Vision-Language Models have demonstrated exceptional performance in multimodal reasoning and complex scene understanding. However, these models still face significant hallucination issues, where outputs contradict visual facts. Recent research on hallucination mitigation has focused on retraining methods and Contrastive Decoding (CD) methods. While both methods perform well, retraining methods require substantial training resources, and CD methods introduce dual inference overhead. These factors hinder their practical applicability. To address the above issue, we propose a framework for dynamically detecting hallucination representations and performing hallucination-eliminating edits on these representations. With minimal additional computational cost, we achieve state-of-the-art performance on existing benchmarks. Extensive experiments demonstrate the effectiveness of our approach, highlighting its efficient and robust hallucination elimination capability and its powerful controllability over hallucinations. Code is available at https://github.com/ASGO-MM/HIRE

Keywords: Large Vision-Language Models, hallucination mitigation, intermediate representation edit, dynamic detection, computational efficiency, state-of-the-art performance, multimodal reasoning

91. ❌ Hybrid Quantum-Classical Spatiotemporal Forecasting for 3D Cloud Fields

Authors: Fu Wang, Qifeng Lu, Xinyu Long, Meng Zhang, Xiaofei Yang, Weijia Cao, Xiaowen Chu Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29407v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

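Scoring rationale: The paper focuses on spatiotemporal modeling for 3D cloud-field forecasting and proposes QENO, a hybrid quantum-classical framework. It is unrelated to nearly all of the keywords (large-model techniques, training methods, inference optimization, alignment, agents, and so on): none of them appears in the title or abstract, and the paper addresses a domain-specific physical modeling problem rather than general large-model technology. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies AI to a scientific field (atmospheric science and Earth observation). Since it is not core bioinformatics or cheminformatics, it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The paper proposes QENO, a hybrid quantum-classical spatiotemporal forecasting framework for accurately predicting 3D cloud-field structure; experiments show that it outperforms existing baselines on multiple metrics.

Abstract translation (Chinese)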

三维云场的精确预报对于大气分析和短时数值天气预报至关重要，但由于云演变涉及跨层相互作用、非局部依赖性和多尺度时空动力学，该任务仍具挑战性。现有基于卷积、循环或注意力机制的时空预测模型通常依赖局部偏置的表征，因此在体量预报任务中难以保持精细的云结构。为解决这一问题，我们提出了QENO——一种用于三维云场的混合量子启发式时空预报框架。该架构包含四个组件：用于紧凑潜在表征的经典时空编码器、用于建模潜在空间中非局部耦合的拓扑感知量子增强模块、用于将测量导出的量子特征与循环记忆动态融合的动态融合时序单元，以及用于重建未来云体数据的解码器。在CMA-MESO三维云场数据集上的实验表明，QENO在均方误差（MSE）、平均绝对误差（MAE）、均方根误差（RMSE）、结构相似性指数（SSIM）和基于阈值的检测指标上均持续优于代表性基线模型，包括ConvLSTM、PredRNN++、Earthformer、TAU及SimVP变体。具体而言，QENO实现了0.2038的MSE、0.4514的RMSE和0.6291的SSIM，同时保持了紧凑的参数规模。这些结果表明，拓扑感知的混合量子-经典特征建模是三维云结构预报及大气地球观测数据分析的一个有前景的方向。

Abstract

Accurate forecasting of three-dimensional (3D) cloud fields is important for atmospheric analysis and short-range numerical weather prediction, yet it remains challenging because cloud evolution involves cross-layer interactions, nonlocal dependencies, and multiscale spatiotemporal dynamics. Existing spatiotemporal prediction models based on convolutions, recurrence, or attention often rely on locality-biased representations and therefore struggle to preserve fine cloud structures in volumetric forecasting tasks. To address this issue, we propose QENO, a hybrid quantum-inspired spatiotemporal forecasting framework for 3D cloud fields. The proposed architecture consists of four components: a classical spatiotemporal encoder for compact latent representation, a topology-aware quantum enhancement block for modeling nonlocal couplings in latent space, a dynamic fusion temporal unit for integrating measurement-derived quantum features with recurrent memory, and a decoder for reconstructing future cloud volumes. Experiments on CMA-MESO 3D cloud fields show that QENO consistently outperforms representative baselines, including ConvLSTM, PredRNN++, Earthformer, TAU, and SimVP variants, in terms of MSE, MAE, RMSE, SSIM, and threshold-based detection metrics. In particular, QENO achieves an MSE of 0.2038, an RMSE of 0.4514, and an SSIM of 0.6291, while also maintaining a compact parameter budget. These results indicate that topology-aware hybrid quantum-classical feature modeling is a promising direction for 3D cloud structure forecasting and atmospheric Earth observation data analysis.

Keywords: 3D cloud fields, spatiotemporal forecasting, quantum-inspired, hybrid quantum-classical, topology-aware, CMA-MESO, atmospheric analysis, Earth observation
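
As a quick sanity check on the figures quoted in the abstract above: RMSE is by definition the square root of MSE when both are computed over the same set of errors, so the two reported values should agree. The sketch below uses only the two numbers from the abstract; the function name is illustrative, and the check assumes both metrics were aggregated the same way, which the abstract does not state.

```python
import math


def rmse_from_mse(mse: float) -> float:
    """RMSE is the square root of MSE when both use the same error set."""
    return math.sqrt(mse)


reported_mse = 0.2038   # value quoted in the abstract
reported_rmse = 0.4514  # value quoted in the abstract

derived_rmse = rmse_from_mse(reported_mse)

# Agreement to four decimal places means a gap below half a unit
# in the fourth decimal.
consistent = abs(derived_rmse - reported_rmse) < 5e-5
print(f"derived RMSE = {derived_rmse:.4f}, consistent = {consistent}")
```

If the two values had disagreed, one likely explanation would be that RMSE was averaged per sample rather than derived from the pooled MSE.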

92. ❌ Security in LLM-as-a-Judge: A Comprehensive SoK

Authors: Aiman Almasoud, Antony Anju, Marco Arazzi, Mert Cihangiroglu, Vignesh Kumar Kembu, Serena Nicolazzo, Antonino Nocera, Vinod P., Saraga Sakthidharan Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29403v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

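Scoring rationale: The paper studies the security of LLM-as-a-Judge systems, centering on the use of large language models in evaluation tasks, the associated security risks, and defense mechanisms, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). The other keywords cover specific techniques (e.g., MoE, quantization, inference acceleration), training methods (e.g., pre-training, fine-tuning), application areas (e.g., AI for science), or particular capabilities (e.g., tool use, multi-agent systems). The paper does not directly address any of these, so they all score 0.

!!! tip deepseek-chat TL;DR

The paper presents the first systematization of knowledge on the security risks of LLM-as-a-Judge systems. From a review of 863 works (45 selected), it proposes a taxonomy of attacks, defenses, and applications, exposes vulnerabilities in evaluation frameworks, and points to directions for improvement.

Abstract translation (Chinese)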

LLM-as-a-Judge（LaaJ，即大语言模型作为评判者）是一种新兴范式，其利用强大语言模型来评估生成内容的质量、安全性或正确性。尽管该范式显著提升了评估流程的可扩展性与效率，但也引入了尚未被充分探索的新型安全风险与可靠性问题。具体而言，基于大语言模型的评判者既可能成为对抗性操纵的目标，也可能成为实施攻击的工具，从而可能危及评估流程的可信度。本文首次针对LLM-as-a-Judge系统的安全性问题进行了知识体系化梳理。通过对主要学术数据库的全面文献调研，我们分析了863篇文献，并筛选出2020年至2026年间发表的45项相关研究。基于此，我们提出了一种分类框架，依据LLM-as-a-Judge在安全领域中所扮演的角色对近期研究进行系统归类，区分了针对LaaJ系统的攻击、通过LaaJ实施的攻击、利用LaaJ进行安全防护的防御方案，以及将LaaJ作为安全相关领域评估策略的应用场景。我们进一步对现有方法进行了比较分析，指出当前局限、新兴威胁及开放研究挑战。研究发现揭示了基于大语言模型的评估框架存在显著脆弱性，同时指出了提升其鲁棒性与可靠性的潜在方向。最后，我们展望了关键研究机遇，以指导构建更安全、可信的LLM-as-a-Judge系统。

Abstract

LLM-as-a-Judge (LaaJ) is a novel paradigm in which powerful language models are used to assess the quality, safety, or correctness of generated outputs. While this paradigm has significantly improved the scalability and efficiency of evaluation processes, it also introduces novel security risks and reliability concerns that remain largely unexplored. In particular, LLM-based judges can become both targets of adversarial manipulation and instruments through which attacks are conducted, potentially compromising the trustworthiness of evaluation pipelines. In this paper, we present the first Systematization of Knowledge (SoK) focusing on the security aspects of LLM-as-a-Judge systems. We perform a comprehensive literature review across major academic databases, analyzing 863 works and selecting 45 relevant studies published between 2020 and 2026. Based on this study, we propose a taxonomy that organizes recent research according to the role played by LLM-as-a-Judge in the security landscape, distinguishing between attacks targeting LaaJ systems, attacks performed through LaaJ, defenses leveraging LaaJ for security purposes, and applications where LaaJ is used as an evaluation strategy in security-related domains. We further provide a comparative analysis of existing approaches, highlighting current limitations, emerging threats, and open research challenges. Our findings reveal significant vulnerabilities in LLM-based evaluation frameworks, as well as promising directions for improving their robustness and reliability. Finally, we outline key research opportunities that can guide the development of more secure and trustworthy LLM-as-a-Judge systems.

Keywords: LLM-as-a-Judge, security risks, adversarial manipulation, evaluation frameworks, Systematization of Knowledge, trustworthiness, robustness, reliability

93. ❌ ELT-Bench-Verified: Benchmark Quality Issues Underestimate AI Agent Capabilities

Authors: Christopher Zanoli, Andrea Giovannini, Tengjun Jin, Ana Klimovic, Yotam Perlitz Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29399v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

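Scoring rationale: The paper's core topic is evaluating the capabilities of AI agents on a data engineering task (ELT pipeline construction), mainly involving LLM-driven AI agents (LLM Agents) and benchmark quality auditing. It explicitly re-evaluates ELT-Bench with upgraded large language models and develops an audit methodology that incorporates LLM-driven analysis. These two keywords are therefore highly relevant (10 points each). The paper does not touch the technical details or application areas behind the other keywords, such as MoE, SLMs, scaling laws, the various training methods, inference optimization, or AI for Science, so they score 0.

!!! tip deepseek-chat TL;DR

The paper finds that quality defects in the ELT-Bench benchmark (rigid evaluation scripts, ambiguous specifications, incorrect ground truth) caused AI agent capabilities to be underestimated. After re-evaluating with upgraded LLMs and applying an audit methodology to build the corrected ELT-Bench-Verified, agent performance improves significantly.

Abstract translation (Chinese)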

构建抽取-加载-转换（ELT）管道是一项劳动密集型的数据工程任务，也是人工智能自动化的重要目标。在首个端到端ELT管道构建基准测试ELT-Bench上，AI智能体最初表现出较低的成功率，表明其缺乏实际应用价值。
我们重新审视这些结果，发现导致智能体能力被严重低估的两个因素。首先，使用升级后的大型语言模型重新评估ELT-Bench显示，抽取和加载阶段已基本解决，而转换阶段的性能显著提升。其次，我们开发了一种审计-校正方法，该方法将可扩展的LLM驱动根因分析与严格的人工验证（评估者间一致性Fleiss’ kappa = 0.85）相结合，以审计基准质量。将此方法应用于ELT-Bench后发现，大多数失败的转换任务包含可归因于基准测试本身的错误——包括僵化的评估脚本、模糊的规范以及错误的基准答案——这些错误惩罚了智能体的正确输出。
基于这些发现，我们构建了ELT-Bench-Verified，这是一个经过修订的基准测试，具有精炼的评估逻辑和修正后的基准答案。在此版本上重新评估显示，性能的显著提升完全归功于基准测试的修正。我们的结果表明，模型的快速进步和基准测试的质量问题共同导致了智能体能力被低估。更广泛地说，我们的发现呼应了在文本到SQL基准测试中普遍存在的标注错误现象，表明数据工程评估中的质量问题具有系统性。对于复杂的智能体任务，系统化的质量审计应成为标准实践。我们发布ELT-Bench-Verified，旨在为人工智能驱动的数据工程自动化进展提供更可靠的基础。

Abstract

Constructing Extract-Load-Transform (ELT) pipelines is a labor-intensive data engineering task and a high-impact target for AI automation. On ELT-Bench, the first benchmark for end-to-end ELT pipeline construction, AI agents initially showed low success rates, suggesting they lacked practical utility. We revisit these results and identify two factors causing a substantial underestimation of agent capabilities. First, re-evaluating ELT-Bench with upgraded large language models reveals that the extraction and loading stage is largely solved, while transformation performance improves significantly. Second, we develop an Auditor-Corrector methodology that combines scalable LLM-driven root-cause analysis with rigorous human validation (inter-annotator agreement Fleiss’ kappa = 0.85) to audit benchmark quality. Applying this to ELT-Bench uncovers that most failed transformation tasks contain benchmark-attributable errors – including rigid evaluation scripts, ambiguous specifications, and incorrect ground truth – that penalize correct agent outputs. Based on these findings, we construct ELT-Bench-Verified, a revised benchmark with refined evaluation logic and corrected ground truth. Re-evaluating on this version yields significant improvement attributable entirely to benchmark correction. Our results show that both rapid model improvement and benchmark quality issues contributed to underestimating agent capabilities. More broadly, our findings echo observations of pervasive annotation errors in text-to-SQL benchmarks, suggesting quality issues are systemic in data engineering evaluation. Systematic quality auditing should be standard practice for complex agentic tasks. We release ELT-Bench-Verified to provide a more reliable foundation for progress in AI-driven data engineering automation.

Keywords: AI agents, ELT pipelines, benchmark quality, large language models, data engineering, evaluation, audit methodology, ground truth correction
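
The abstract above reports inter-annotator agreement as Fleiss' kappa = 0.85. For readers unfamiliar with the statistic, here is a minimal sketch of the standard Fleiss' kappa formula. The toy ratings and the two category labels ("benchmark error" vs. "agent error") are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for counts[i][j] = raters placing item i in category j.

    Every item must be rated by the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # Observed agreement: fraction of agreeing rater pairs, per item.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    n_cats = len(counts[0])
    p_cats = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_e = sum(p * p for p in p_cats)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when one category gets every rating")
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical toy data: 4 failed tasks, 3 annotators,
# columns = ["benchmark error", "agent error"].
toy_counts = [[2, 1], [1, 2], [3, 0], [0, 3]]
kappa = fleiss_kappa(toy_counts)
print(f"Fleiss' kappa = {kappa:.3f}")
```

Values near 1 indicate agreement well above chance, so a reported 0.85 suggests the human validation step was highly consistent across annotators.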

94. ❌ Extend3D: Town-Scale 3D Generation

Authors: Seungwoo Yoon, Jinmo Kim, Jaesik Park Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29387v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

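Scoring rationale: The paper "Extend3D: Town-Scale 3D Generation" focuses on 3D scene generation, proposing a training-free pipeline built on an object-centric 3D generative model to generate large-scale 3D scenes from a single image. Its core contribution is the 3D generation method: latent space extension, patch-wise generation, point cloud initialization, iterative refinement with SDEdit, and 3D-aware optimization objectives. All scoring keywords relate to large language models (LLMs), deep learning principles, or applications of AI in science, whereas this paper belongs to computer vision and 3D generation. It involves no LLM techniques, no innovations in deep learning principles, and no applications of AI in scientific fields such as biomedicine. All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes Extend3D, a training-free pipeline for generating large-scale 3D scenes from a single image. By extending the latent space, generating patch by patch, and refining iteratively, it outperforms existing methods in geometric structure and texture fidelity.

Abstract translation (Chinese)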

本文提出Extend3D，一种基于以物体为中心的三维生成模型、无需训练的单图像三维场景生成流程。为克服以物体为中心的模型在表示广阔场景时固定大小潜在空间的局限性，我们在$x$和$y$方向上扩展了潜在空间。随后，通过将扩展后的潜在空间划分为重叠的补丁，我们将以物体为中心的三维生成模型应用于每个补丁，并在每个时间步对其进行耦合。由于基于图像条件的分块三维生成要求图像与潜在补丁之间严格的空间对齐，我们使用单目深度估计器提供的点云先验初始化场景，并通过SDEdit迭代优化被遮挡区域。我们发现，在三维优化过程中将三维结构的不完整性视为噪声，可通过一种我们称为“欠去噪”的概念实现三维补全。此外，为解决以物体为中心的模型在子场景生成中的次优问题，我们在去噪过程中对扩展潜在空间进行优化，确保去噪轨迹与子场景动态保持一致。为此，我们引入了三维感知的优化目标以提升几何结构与纹理保真度。实验表明，通过人工偏好评估与定量实验验证，本方法相比现有方法取得了更优的结果。

Abstract

In this paper, we propose Extend3D, a training-free pipeline for 3D scene generation from a single image, built upon an object-centric 3D generative model. To overcome the limitations of fixed-size latent spaces in object-centric models for representing wide scenes, we extend the latent space in the $x$ and $y$ directions. Then, by dividing the extended latent space into overlapping patches, we apply the object-centric 3D generative model to each patch and couple them at each time step. Since patch-wise 3D generation with image conditioning requires strict spatial alignment between image and latent patches, we initialize the scene using a point cloud prior from a monocular depth estimator and iteratively refine occluded regions through SDEdit. We discovered that treating the incompleteness of 3D structure as noise during 3D refinement enables 3D completion via a concept, which we term under-noising. Furthermore, to address the sub-optimality of object-centric models for sub-scene generation, we optimize the extended latent during denoising, ensuring that the denoising trajectories remain consistent with the sub-scene dynamics. To this end, we introduce 3D-aware optimization objectives for improved geometric structure and texture fidelity. We demonstrate that our method yields better results than prior methods, as evidenced by human preference and quantitative experiments.

Keywords: 3D scene generation, single image, training-free pipeline, object-centric 3D generative model, latent space extension, patch-wise generation, SDEdit, under-noising
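
The abstract above describes dividing an extended latent into overlapping patches, processing each patch, and coupling the results at every time step. The sketch below illustrates that idea in one dimension for brevity (the paper extends the latent in both $x$ and $y$). The coupling rule used here, averaging the outputs wherever patches overlap, is an assumption for illustration; the abstract does not specify how patches are coupled. All function names are hypothetical.

```python
from typing import Callable, List, Sequence


def patch_starts(length: int, patch: int, stride: int) -> List[int]:
    """Start indices of overlapping windows that cover [0, length)."""
    if not 0 < stride <= patch <= length:
        raise ValueError("need 0 < stride <= patch <= length")
    starts = list(range(0, length - patch + 1, stride))
    if starts[-1] != length - patch:
        starts.append(length - patch)  # last window sits flush with the end
    return starts


def couple_patches(
    latent: Sequence[float],
    patch: int,
    stride: int,
    per_patch_fn: Callable[[Sequence[float]], Sequence[float]],
) -> List[float]:
    """Apply per_patch_fn to each window and average overlapping outputs."""
    total = [0.0] * len(latent)
    hits = [0] * len(latent)
    for start in patch_starts(len(latent), patch, stride):
        out = per_patch_fn(latent[start:start + patch])
        for offset, value in enumerate(out):
            total[start + offset] += value
            hits[start + offset] += 1
    return [t / h for t, h in zip(total, hits)]


# Stand-in for one denoising step of the per-patch model (assumption):
# here it simply halves every value.
def toy_step(window: Sequence[float]) -> List[float]:
    return [0.5 * v for v in window]


extended_latent = [float(i) for i in range(10)]
coupled = couple_patches(extended_latent, patch=4, stride=3, per_patch_fn=toy_step)
print(coupled)
```

In a real diffusion pipeline, `toy_step` would be replaced by the generative model's denoising update, and the coupling would run once per time step.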

95. ❌ PromptForge-350k: A Large-Scale Dataset and Contrastive Framework for Prompt-Based AI Image Forgery Localization

Authors: Jianpeng Wang, Haoyu Wang, Baoying Chen, Jishen Zeng, Yiming Qin, Yiqi Yang, Zhongjie Ba Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29386v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

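Scoring rationale: The paper focuses on AI image forgery detection, specifically forgery localization for prompt-based AI image editing. Its core contributions are: 1) an automated mask annotation framework; 2) the large-scale dataset PromptForge-350k; 3) the ICL-Net detection network. Although the paper involves AI image editing models, all keywords target large language models (LLMs) and related techniques, while this work addresses image forgery detection in computer vision and is unrelated to LLM techniques, model training methods, inference optimization, alignment, agent systems, and so on. The only potentially relevant keyword is "AI for Science", but the paper is a computer vision security application rather than work in a scientific field such as bioinformatics or cheminformatics, so that keyword also scores 0.

!!! tip deepseek-chat TL;DR

To address the forgery risks posed by prompt-based AI image editing, the paper proposes an automated mask annotation framework, builds the large-scale dataset PromptForge-350k, and develops the ICL-Net detection network, which achieves 62.5% IoU on forgery localization, 5.1% higher than existing methods.

Abstract translation (Chinese)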

基于提示的人工智能图像编辑技术的快速普及，近期加剧了恶意内容伪造与虚假信息传播的风险。然而，针对此类新兴编辑技术的伪造定位方法仍存在明显的研究空白。为弥补这一不足，我们首先提出了一种全自动掩码标注框架，该框架利用关键点对齐与语义空间相似性，为被编辑区域生成精确的真实掩码标签。基于此框架，我们构建了PromptForge-350k数据集——一个涵盖四种前沿提示式AI图像编辑模型的大规模伪造定位数据集，从而缓解了该领域数据稀缺的问题。进一步，我们提出了ICL-Net，一种高效的伪造定位网络，其采用三流主干架构并引入图像内对比学习机制。该设计使模型能够捕捉高度鲁棒且可泛化的取证特征。大量实验表明，我们的方法在PromptForge-350k数据集上达到了62.5%的交并比（IoU），较现有最优方法提升了5.1%。此外，该方法对常见图像退化表现出强鲁棒性，IoU下降幅度小于1%，并在未见过的编辑模型上展现出良好的泛化能力，平均IoU达到41.5%。

Abstract

The rapid democratization of prompt-based AI image editing has recently exacerbated the risks associated with malicious content fabrication and misinformation. However, forgery localization methods targeting these emerging editing techniques remain significantly under-explored. To bridge this gap, we first introduce a fully automated mask annotating framework that leverages keypoint alignment and semantic space similarity to generate precise ground-truth masks for edited regions. Based on this framework, we construct PromptForge-350k, a large-scale forgery localization dataset covering four state-of-the-art prompt-based AI image editing models, thereby mitigating the data scarcity in this domain. Furthermore, we propose ICL-Net, an effective forgery localization network featuring a triple-stream backbone and intra-image contrastive learning. This design enables the model to capture highly robust and generalizable forensic features. Extensive experiments demonstrate that our method achieves an IoU of 62.5% on PromptForge-350k, outperforming SOTA methods by 5.1%. Additionally, it exhibits strong robustness against common degradations with an IoU drop of less than 1%, and shows promising generalization capabilities on unseen editing models, achieving an average IoU of 41.5%.

Keywords: AI image forgery localization, prompt-based image editing, mask annotation framework, large-scale dataset, contrastive learning, forensic features, ICL-Net, generalization capability
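
The results above are reported as intersection over union (IoU) between predicted and ground-truth forgery masks. A minimal sketch of that metric on binary masks follows. The toy masks are made up for illustration, and returning 1.0 when both masks are empty is a convention chosen here, not something the paper states.

```python
from typing import Sequence


def mask_iou(pred: Sequence[Sequence[int]], target: Sequence[Sequence[int]]) -> float:
    """IoU between two binary masks given as equal-shaped nested sequences of 0/1."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention (assumption): two empty masks count as a perfect match.
    return intersection / union if union else 1.0


# Toy 2x3 masks: 1 marks a pixel flagged as edited.
predicted_mask = [[1, 1, 0],
                  [0, 1, 0]]
ground_truth_mask = [[1, 0, 0],
                     [0, 1, 1]]

iou = mask_iou(predicted_mask, ground_truth_mask)
print(f"IoU = {iou:.2f}")
```

Two pixels are flagged in both masks and four are flagged in at least one, so this toy pair scores 0.5.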

96. ❌ AI-Generated Prior Authorization Letters: Strong Clinical Content, Weak Administrative Scaffolding

Authors: Moiz Sadiq Awan, Maryam Raza Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29366v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

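Scoring rationale: The paper evaluates three commercial LLMs (GPT-4o, Claude Sonnet 4.5, Gemini 2.5 Pro) on generating prior authorization letters, so it is highly relevant to 'Large Language Models' (10 points). The study is a medical AI application and has some connection to 'AI for Science' (8 points), although it does not go into specific bioinformatics or cheminformatics techniques. The paper does not cover the technical principles or innovations behind the other keywords, such as MoE, SFT, RAG, or reasoning methods, so those keywords score 0.

!!! tip deepseek-chat TL;DR

The study evaluates how well large language models generate prior authorization letters and finds that they produce high-quality clinical content but show systematic gaps in meeting real-world administrative requirements (such as billing codes and authorization duration).

Abstract translation (Chinese)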

事先授权仍是美国医疗体系中最繁重的行政流程之一，每年消耗数十亿美元及数千小时的医生工作时间。尽管大语言模型在临床文本任务中展现出潜力，但其生成可直接提交的事先授权函的能力尚未得到充分关注——现有研究仅局限于单案例演示，缺乏结构化的多场景评估。本研究评估了三种商用大语言模型（GPT-4o、Claude Sonnet 4.5和Gemini 2.5 Pro）在45个经医生验证的合成场景中的表现，涵盖风湿病学、精神病学、肿瘤学、心脏病学和骨科领域。所有模型生成的授权函均具备扎实的临床内容：诊断准确、医疗必要性论证结构清晰、阶梯治疗记录完整。然而，对现实行政要求的二次分析揭示了临床评分体系未能捕捉的共性问题，包括账单编码缺失、授权期限请求遗漏以及随访计划不充分。这些发现重新界定了问题核心：临床部署的挑战不在于大语言模型能否撰写临床内容合格的授权函，而在于围绕其构建的系统能否提供支付方工作流程所要求的行政精确性。

Abstract

Prior authorization remains one of the most burdensome administrative processes in U.S. healthcare, consuming billions of dollars and thousands of physician hours each year. While large language models have shown promise across clinical text tasks, their ability to produce submission-ready prior authorization letters has received only limited attention, with existing work confined to single-case demonstrations rather than structured multi-scenario evaluation. We assessed three commercially available LLMs (GPT-4o, Claude Sonnet 4.5, and Gemini 2.5 Pro) across 45 physician-validated synthetic scenarios spanning rheumatology, psychiatry, oncology, cardiology, and orthopedics. All three models generated letters with strong clinical content: accurate diagnoses, well-structured medical necessity arguments, and thorough step therapy documentation. However, a secondary analysis of real-world administrative requirements revealed consistent gaps that clinical scoring alone did not capture, including absent billing codes, missing authorization duration requests, and inadequate follow-up plans. These findings reframe the question: the challenge for clinical deployment is not whether LLMs can write clinically adequate letters, but whether the systems built around them can supply the administrative precision that payer workflows require.

Keywords: prior authorization, large language models, clinical content, administrative requirements, healthcare, GPT-4o, Claude Sonnet, Gemini

97. ❌ Deep Learning-Based Anomaly Detection in Spacecraft Telemetry on Edge Devices

Authors: Christopher Goetze, Tim Schlippe, Daniel Lakey Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29375v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

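Scoring rationale: The paper focuses on anomaly detection in spacecraft telemetry and optimizes models for edge devices such as CubeSats. Its core is the architecture optimization and deployment of conventional deep learning models (e.g., neural networks for time-series forecasting and classification), not large language models (LLMs) or related techniques. It is therefore unrelated to nearly all of the keywords (LLM principles, training, alignment, reasoning, application paradigms, and so on). The only marginally relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper applies AI to space science (spacecraft safety), which falls under a broad reading of "AI for Science" but not under its core bioinformatics or cheminformatics subfields, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The study tackles the challenge of deploying efficient anomaly detection models on severely resource-constrained spacecraft edge devices. Using multi-objective neural architecture optimization, it sharply reduces compute and memory overhead (e.g., 97.1% less RAM) while maintaining high performance (e.g., a CEF0.5 score of 88.8%), making near-real-time on-board anomaly detection feasible.

Abstract translation (Chinese)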

航天器异常检测对任务安全至关重要，但由于硬件限制，在星载设备上部署复杂模型面临重大挑战。本文研究了三种航天器遥测异常检测方法——预测与阈值法、直接分类法和图像分类法——并利用欧洲空间局异常数据集，通过多目标神经架构优化技术对其进行边缘部署优化。基线实验表明，预测与阈值法相比其他方法实现了更优的检测性能（92.7%的修正事件加权F0.5分数（CEF0.5））[1]。通过帕累托最优架构优化，我们在保持检测能力的同时大幅降低了计算需求——优化后的预测与阈值模型保持了88.8%的CEF0.5分数，同时将内存占用减少97.1%至仅59KB，运算量降低99.4%。部署可行性分析表明，优化后模型仅需占用立方星（CubeSat）内存的0.36-6.25%，使得即使在高度受限的硬件上实现星载异常检测成为可能。本研究表明，复杂的异常检测能力可在航天器边缘计算限制范围内成功部署，在不超过硬件限制或不影响任务安全的前提下，提供近实时的异常检测能力。

Abstract

Spacecraft anomaly detection is critical for mission safety, yet deploying sophisticated models on-board presents significant challenges due to hardware constraints. This paper investigates three approaches for spacecraft telemetry anomaly detection – forecasting & threshold, direct classification, and image classification – and optimizes them for edge deployment using multi-objective neural architecture optimization on the European Space Agency Anomaly Dataset. Our baseline experiments demonstrate that forecasting & threshold achieves superior detection performance (92.7% Corrected Event-wise F0.5-score (CEF0.5)) [1] compared to alternatives. Through Pareto-optimal architecture optimization, we dramatically reduced computational requirements while maintaining capabilities – the optimized forecasting & threshold model preserved 88.8% CEF0.5 while reducing RAM usage by 97.1% to just 59 KB and operations by 99.4%. Analysis of deployment viability shows our optimized models require just 0.36-6.25% of CubeSat RAM, making on-board anomaly detection practical even on highly constrained hardware. This research demonstrates that sophisticated anomaly detection capabilities can be successfully deployed within spacecraft edge computing constraints, providing near-instantaneous detection without exceeding hardware limitations or compromising mission safety.

Keywords: Anomaly Detection, Spacecraft Telemetry, Edge Devices, Neural Architecture Optimization, Model Compression, CubeSat, Forecasting & Threshold, On-board Deployment
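
The best-performing approach in the abstract above is "forecasting & threshold": predict the next telemetry value, then flag a sample as anomalous when the prediction error exceeds a threshold. The sketch below shows only that control flow. A moving average stands in for the paper's neural forecaster, and the window size, threshold, and telemetry values are all hypothetical.

```python
from typing import List, Sequence


def forecast_threshold_anomalies(
    series: Sequence[float], window: int, threshold: float
) -> List[int]:
    """Return indices t where |series[t] - forecast(t)| > threshold.

    The forecast is the mean of the previous `window` samples, a simple
    stand-in (assumption) for a learned forecasting model.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    flagged = []
    for t in range(window, len(series)):
        forecast = sum(series[t - window:t]) / window
        if abs(series[t] - forecast) > threshold:
            flagged.append(t)
    return flagged


# Toy telemetry channel with a single spike at index 4.
telemetry = [1.0, 1.1, 0.9, 1.0, 5.0, 1.0, 1.1]
anomalies = forecast_threshold_anomalies(telemetry, window=3, threshold=2.0)
print(anomalies)
```

Because the forecaster only needs the last few samples, this style of detector keeps very little state, which is one reason it suits memory-constrained on-board hardware.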

98. ❌ Rigorous Explanations for Tree Ensembles

Authors: Yacine Izza, Alexey Ignatiev, Xuanxiang Huang, Peter J. Stuckey, Joao Marques-Silva Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29361v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究树集成模型（随机森林和提升树）的严格解释方法，属于传统机器学习可解释性领域。所有关键词均聚焦于大模型、深度学习及相关技术（如MoE、RLHF、RAG、量化等），而本文完全不涉及这些内容。唯一相关的是’Mechanistic Interpretability OR Explainable AI’，因为论文核心是开发可解释AI方法为树集成模型提供逻辑严密的解释，因此给10分（高度相关）。其他关键词与论文主题（传统树模型解释）无任何关联，均给0分。

!!! tip deepseek-chat TL;DR

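This paper studies how to compute logically sound, rigorously defined explanations for tree ensembles such as random forests and boosted trees, in order to build trust in and understanding of their predictions.

Abstract (Chinese translation)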

树集成模型在众多实际应用中发挥着重要作用，它代表了机器学习方法中最通用且最准确的类别之一。尽管这类模型在表示上通常较为简洁，但其运行机制对人类决策者而言仍难以透彻理解。建立对树集成模型运行信任的一种解决方案是自动识别其预测背后的解释。显然，只有当这些解释具有严谨性——即真实反映其所解释的底层预测器的特性时，我们才能通过解释来获得信任。本文针对树集成模型中两个著名实例——随机森林与提升树，研究了如何计算严格定义、逻辑合理的解释。

Abstract

Tree ensembles (TEs) find a multitude of practical applications. They represent one of the most general and accurate classes of machine learning methods. While they are typically quite concise in representation, their operation remains inscrutable to human decision makers. One solution to build trust in the operation of TEs is to automatically identify explanations for the predictions made. Evidently, we can only achieve trust using explanations if those explanations are rigorous, that is, if they truly reflect properties of the underlying predictor they explain. This paper investigates the computation of rigorously-defined, logically-sound explanations for the concrete case of two well-known examples of tree ensembles, namely random forests and boosted trees.

Keywords: Tree Ensembles, Random Forests, Boosted Trees, Explainable AI, Rigorous Explanations, Model Interpretability, Machine Learning, Trust in AI

99. ❌ CIPHER: Counterfeit Image Pattern High-level Examination via Representation

Authors: Kyeonghun Kim, Youngung Han, Seoyoung Ju, Yeonju Jean, YooHyun Kim, Minseo Choi, SuYeon Lim, Kyungtae Park, Seungwoo Baek, Sieun Hyeon, Nam-Joon Kim, Hyuk-Jae Lee Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29356v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

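Scoring rationale: CIPHER targets deepfake detection. It reuses and fine-tunes discriminators originally trained for image generation, drawing features from ProGAN discriminators and from diffusion models, which places it in computer vision and multimedia security. The scoring keywords all concern large language models (LLMs) and related techniques (training methods, inference optimization, alignment, agent systems), or specific AI-for-science applications such as bioinformatics. The paper involves no LLM techniques, principles, or applications and uses no bio- or cheminformatics methods, so every keyword has a relevance of 0.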

!!! tip deepseek-chat TL;DR

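This paper proposes CIPHER, a deepfake detection framework that reuses and fine-tunes the discriminators of image generation models to extract features that generalize across generators. Across nine generative models it reaches up to 74.33% F1 and outperforms existing ViT-based detectors by over 30% in F1 on average.

Abstract (Chinese translation)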

生成对抗网络（GAN）与扩散模型的快速发展，使得合成人脸的生成质量日益提升，其与真实图像的区分难度不断加大。然而，这一进展也加剧了虚假信息传播、欺诈和身份盗用等风险，凸显了对能够跨多种生成模型保持鲁棒性的检测器的迫切需求。本研究提出了一种基于表征的伪造图像模式高层检测框架——CIPHER（Counterfeit Image Pattern High-level Examination via Representation），该框架系统性地重用并微调原本为图像生成任务训练的判别器。通过从ProGAN判别器中提取尺度自适应特征，并从扩散模型中提取时序一致性特征，CIPHER能够捕捉传统检测器常忽略的、与生成模型无关的伪造痕迹。在涵盖九种前沿生成模型的广泛实验中，CIPHER展现出卓越的跨模型检测性能，其F1分数最高达到74.33%，平均比现有基于视觉Transformer（ViT）的检测器高出30%以上。值得注意的是，在基线方法失效的挑战性数据集上，我们的方法仍保持鲁棒性能，如在CIFAKE数据集上取得高达88%的F1分数，而传统检测器的性能近乎为零。这些结果验证了判别器重用与跨模型微调的有效性，确立了CIPHER作为一种在生成技术快速演进的时代中构建更具泛化性和鲁棒性的深度伪造检测系统的可行路径。

Abstract

The rapid progress of generative adversarial networks (GANs) and diffusion models has enabled the creation of synthetic faces that are increasingly difficult to distinguish from real images. This progress, however, has also amplified the risks of misinformation, fraud, and identity abuse, underscoring the urgent need for detectors that remain robust across diverse generative models. In this work, we introduce Counterfeit Image Pattern High-level Examination via Representation (CIPHER), a deepfake detection framework that systematically reuses and fine-tunes discriminators originally trained for image generation. By extracting scale-adaptive features from ProGAN discriminators and temporal-consistency features from diffusion models, CIPHER captures generation-agnostic artifacts that conventional detectors often overlook. Through extensive experiments across nine state-of-the-art generative models, CIPHER demonstrates superior cross-model detection performance, achieving up to 74.33% F1-score and outperforming existing ViT-based detectors by over 30% in F1-score on average. Notably, our approach maintains robust performance on challenging datasets where baseline methods fail, with up to 88% F1-score on CIFAKE compared to near-zero performance from conventional detectors. These results validate the effectiveness of discriminator reuse and cross-model fine-tuning, establishing CIPHER as a promising approach toward building more generalizable and robust deepfake detection systems in an era of rapidly evolving generative technologies.

Keywords: deepfake detection, generative adversarial networks, diffusion models, discriminator reuse, cross-model generalization, synthetic face detection, feature extraction, robust detection

100. ❌ BenchScope: How Many Independent Signals Does Your Benchmark Provide?

Authors: Tommy Sha, Stella Zhao Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29357v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

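Scoring rationale: The paper focuses on evaluation methodology for AI benchmarks and proposes Effective Dimensionality (ED), a statistical diagnostic that quantifies measurement redundancy within a benchmark. Its core contribution is methodological rather than a large-model technique. It analyzes 22 benchmarks (including the Open LLM Leaderboard, BBH, and MMLU-Pro) covering more than 8,400 model evaluations, so it is related to "Large Language Models OR LLMs OR Foundation Models" (rated 8) because the analyzed benchmarks include LLM benchmarks. However, the paper does not address any specific large-model architecture, training technique, inference optimization, alignment method, application area (such as AI for science), or other listed technical keyword. It studies the benchmarks used to evaluate these models, not the models' technical principles or applications.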

!!! tip deepseek-chat TL;DR

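This paper proposes Effective Dimensionality (ED), a statistical diagnostic that quantifies measurement redundancy in AI benchmarks. An analysis of 22 benchmarks finds substantial redundancy; for example, the six scores of the Open LLM Leaderboard behave like roughly two effective measurement axes (ED = 1.7).

Abstract (Chinese translation)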

人工智能评估套件常报告大量分数，却未检验这些分数是否携带独立信息。我们引入有效维度（ED）——即中心化基准分数谱的参与比——作为一种快速、基于群体条件的测量广度上限诊断指标。该方法在实例粒度上应用于8个领域的22个基准测试及超过8,400次模型评估，ED揭示出显著的冗余性：包含六个分数的Open LLM Leaderboard实际仅表现出约两个有效测量轴（ED=1.7），BBH与MMLU-Pro近乎可互换（相关系数ρ=0.96，在七个子群体中保持稳定），且当前各基准测试的测量广度差异超过20倍。我们证明相对ED排序在维度匹配控制下具有稳定性，ED能够识别套件中的冗余组件、监测性能条件压缩情况并指导基准维护。由于二元谱会高估绝对潜在维度，我们将ED解释为筛查统计量而非确切的因子数量，并通过零假设、信度及饱和分析加以补充。我们提供了涵盖22个基准测试的参考图谱及四步诊断工作流程，基准维护者仅需使用分数矩阵和少量代码即可运行。

Abstract

AI evaluation suites often report many scores without checking whether those scores carry independent information. We introduce Effective Dimensionality (ED), the participation ratio of a centered benchmark-score spectrum, as a fast, population-conditional upper-bound diagnostic of measurement breadth. Applied at per-instance granularity to 22 benchmarks across 8 domains and more than 8,400 model evaluations, ED reveals substantial redundancy: the six-score Open LLM Leaderboard behaves like roughly two effective measurement axes (ED = 1.7), BBH and MMLU-Pro are near-interchangeable (rho = 0.96, stable across seven subpopulations), and measurement breadth varies more than 20x across current benchmarks. We show that relative ED rankings are stable under matched-dimension controls and that ED can flag redundant suite components, monitor performance-conditional compression, and guide benchmark maintenance. Because binary spectra overestimate absolute latent dimensionality, we interpret ED as a screening statistic rather than a literal factor count and complement it with null, reliability, and saturation analyses. We provide a 22-benchmark reference atlas and a four-step diagnostic workflow that benchmark maintainers can run with a score matrix and a few lines of code.

Keywords: benchmark evaluation, effective dimensionality, measurement redundancy, AI evaluation suites, score matrix analysis, benchmark maintenance, Open LLM Leaderboard, BBH MMLU-Pro
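
The BenchScope abstract defines ED as the participation ratio of a centered benchmark-score spectrum. A minimal sketch of that statistic follows, assuming the spectrum is the set of eigenvalues of the column covariance of a models-by-scores matrix; the function name, the matrix layout, and the toy data are illustrative assumptions, not the authors' code.

```python
def effective_dimensionality(scores):
    """Participation ratio (sum λ)^2 / sum λ^2 of the covariance spectrum.

    `scores` is a list of rows (one per model), each a list of scores
    (one per benchmark or item). The layout is an assumption of this sketch.
    """
    n, d = len(scores), len(scores[0])
    if n < 2:
        raise ValueError("need at least two rows to estimate covariance")
    # Center each column so the spectrum reflects variation, not mean level.
    means = [sum(row[j] for row in scores) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in scores]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # For a symmetric matrix, trace(C) = sum λ and ||C||_F^2 = sum λ^2,
    # so no eigendecomposition is needed.
    trace = sum(cov[a][a] for a in range(d))
    frob_sq = sum(cov[a][b] ** 2 for a in range(d) for b in range(d))
    if frob_sq == 0:
        raise ValueError("all columns are constant; the spectrum is empty")
    return trace ** 2 / frob_sq


# Toy data (hypothetical): two perfectly correlated scores vs. two uncorrelated ones.
redundant = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]
independent = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
print(effective_dimensionality(redundant))
print(effective_dimensionality(independent))
```

On the toy data the redundant pair yields a value near 1 and the uncorrelated pair a value near 2, which matches the reading of ED as a count of effective measurement axes.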

101. ❌ Nomad: Autonomous Exploration and Discovery

Authors: Bokang Jia, Samta Kamboj, Satheesh Katipomu, Seung Hun Han, Neha Sengupta, Andrew Jackson Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29353v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	8.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

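Scoring rationale: The paper presents Nomad, a system for autonomous data exploration and insight discovery built around an LLM-based autonomous agent architecture. An explorer agent generates and investigates hypotheses using tools (document search, web search, and database tools), and an independent verifier checks the results, so the paper is highly relevant to LLM agents, tool use, and multi-agent systems. The system includes retrieval-augmented generation (RAG) elements and emphasizes trustworthiness and factuality (hallucination mitigation). It builds an Exploration Map and traverses it systematically, which involves reasoning (CoT, System 2 thinking) and self-correction. The paper does not cover MoE, SLMs, scaling laws, training techniques, quantization, inference acceleration, explainable AI, world models, model merging, in-context learning, or specific AI-for-science applications.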

!!! tip deepseek-chat TL;DR

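The paper proposes Nomad, an autonomous agent architecture that addresses the way traditional query-driven systems are limited by human framing, and that produces more trustworthy, higher-quality, and more diverse insights.

Abstract (Chinese translation)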

我们介绍Nomad，这是一个用于自主数据探索与洞察发现的系统。面对文档集、数据库或其他数据源，用户往往难以预知所有可探索的问题、假设或关联。因此，基于查询的问答系统和基于提示的深度研究系统仍受限于人工设定的框架，通常无法覆盖更广泛的洞察空间。
Nomad通过探索优先的架构解决这一问题。它首先在领域内构建一个显式的“探索地图”，并系统性地遍历该地图以平衡探索的广度与深度。系统生成并筛选假设，随后通过一个探索者代理进行验证——该代理能够调用文档搜索、网络搜索及数据库工具。候选洞察在进入报告生成流程前，会由独立的验证器进行核查，最终生成附有引用的详细报告以及更高层次的元报告。
我们还提出了一套用于评估自主发现系统的综合框架，该框架衡量系统的可信度、报告质量与多样性。基于选定的联合国和世界卫生组织报告数据集，我们证明Nomad相较于基线方法能够生成更可信、更高质量的报告，同时在多次运行中产生更多样化的洞察。
Nomad是迈向自主系统的一步，这类系统不仅能够回应用户提问或执行定向研究，更能主动发现哪些问题、研究方向与洞察值得首先被揭示。

Abstract

We introduce Nomad, a system for autonomous data exploration and insight discovery. Given a corpus of documents, databases, or other data sources, users rarely know the full set of questions, hypotheses, or connections that could be explored. As a result, query-driven question answering and prompt-driven deep-research systems remain limited by human framing and often fail to cover the broader insight space. Nomad addresses this problem with an exploration-first architecture. It constructs an explicit Exploration Map over the domain and systematically traverses it to balance breadth and depth. It generates and selects hypotheses and investigates them with an explorer agent that can use document search, web search, and database tools. Candidate insights are then checked by an independent verifier before entering a reporting pipeline that produces cited reports and higher-level meta-reports. We also present a comprehensive evaluation framework for autonomous discovery systems that measures trustworthiness, report quality, and diversity. Using a corpus of selected UN and WHO reports, we show that Nomad produces more trustworthy and higher-quality reports than baselines, while also producing more diverse insights over several runs. Nomad is a step toward autonomous systems that not only answer user questions or conduct directed research, but also discover which questions, research directions, and insights are worth surfacing in the first place.

Keywords: autonomous exploration, insight discovery, agentic workflow, tool use, retrieval-augmented generation, hallucination mitigation, multi-agent systems, exploration map

102. ❌ Beyond Corner Patches: Semantics-Aware Backdoor Attack in Federated Learning

Authors: Kavindu Herath, Joshua Zhao, Saurabh Bagchi Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29328v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

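Scoring rationale: The paper studies backdoor attacks in federated learning and proposes SABLE, a semantics-aware backdoor attack. Its focus is federated learning security, backdoor attacks, semantic triggers, and model aggregation, which is unrelated to every scoring keyword (all of which concern large-model and deep learning principles, training methods, inference optimization, alignment, and applications). The paper involves no large-model technique, training method, inference technique, or AI-for-science application.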

!!! tip deepseek-chat TL;DR

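This paper studies the more realistic threat of semantics-aware backdoor attacks in federated learning and proposes SABLE, which uses natural semantic triggers to achieve a high attack success rate while preserving benign accuracy.

Abstract (Chinese translation)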

联邦学习（FL）中的后门攻击通常使用实践中不太可能出现的合成角块或分布外（OOD）模式进行评估。本文在更现实的设定下重新审视标准联邦学习（单一全局模型）面临的后门威胁，该设定要求触发器必须具有语义意义、分布内且视觉上合理。我们提出SABLE，一种面向联邦学习环境的语义感知后门攻击方法，该方法构建自然且内容一致的触发器（例如语义属性改变，如添加太阳镜），并通过特征分离和参数正则化优化聚合感知的恶意目标，以保持攻击者更新与良性更新接近。我们在CelebA发色分类任务和德国交通标志识别基准（GTSRB）上实例化SABLE，仅毒化每个恶意客户端本地数据中一个可解释的小子集，同时遵循标准联邦学习协议。在异构客户端分区和多种聚合规则（FedAvg、Trimmed Mean、MultiKrum和FLAME）下，我们的语义驱动触发器在保持良性测试准确率的同时实现了较高的定向攻击成功率。这些结果表明，语义对齐的后门在联邦学习中仍然是一种强大且实际的威胁，而仅基于合成块触发器的鲁棒性声明可能过于乐观。

Abstract

Backdoor attacks on federated learning (FL) are most often evaluated with synthetic corner patches or out-of-distribution (OOD) patterns that are unlikely to arise in practice. In this paper, we revisit the backdoor threat to standard FL (a single global model) under a more realistic setting where triggers must be semantically meaningful, in-distribution, and visually plausible. We propose SABLE, a Semantics-Aware Backdoor for LEarning in federated settings, which constructs natural, content-consistent triggers (e.g., semantic attribute changes such as sunglasses) and optimizes an aggregation-aware malicious objective with feature separation and parameter regularization to keep attacker updates close to benign ones. We instantiate SABLE on CelebA hair-color classification and the German Traffic Sign Recognition Benchmark (GTSRB), poisoning only a small, interpretable subset of each malicious client’s local data while otherwise following the standard FL protocol. Across heterogeneous client partitions and multiple aggregation rules (FedAvg, Trimmed Mean, MultiKrum, and FLAME), our semantics-driven triggers achieve high targeted attack success rates while preserving benign test accuracy. These results show that semantics-aligned backdoors remain a potent and practical threat in federated learning, and that robustness claims based solely on synthetic patch triggers can be overly optimistic.

Keywords: federated learning, backdoor attack, semantics-aware, SABLE, trigger, aggregation, robustness, adversarial
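
The SABLE abstract names FedAvg and Trimmed Mean among the aggregation rules it evaluates, and notes that the attack keeps malicious updates close to benign ones. The sketch below shows only these two generic aggregation rules on flat update vectors; the function names and toy updates are illustrative assumptions, and nothing here reproduces the paper's attack or experimental setup.

```python
def fedavg(updates, num_samples=None):
    """Coordinate-wise mean of client updates, weighted by client sample counts."""
    if num_samples is None:
        num_samples = [1] * len(updates)  # unweighted mean if no counts are given
    total = sum(num_samples)
    dim = len(updates[0])
    return [sum(w * u[j] for w, u in zip(num_samples, updates)) / total
            for j in range(dim)]


def trimmed_mean(updates, trim):
    """Coordinate-wise mean after dropping the `trim` smallest and largest values."""
    if 2 * trim >= len(updates):
        raise ValueError("trim removes every client update")
    dim = len(updates[0])
    aggregated = []
    for j in range(dim):
        column = sorted(u[j] for u in updates)
        kept = column[trim:len(column) - trim]
        aggregated.append(sum(kept) / len(kept))
    return aggregated


# Hypothetical updates: three benign clients and one large outlier.
updates = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.2], [10.0, -10.0]]
print(fedavg(updates))           # pulled toward the outlier
print(trimmed_mean(updates, 1))  # outlier coordinates are discarded
```

A trimmed mean only discards extreme coordinates, so an update that stays within the benign range is not removed by it, which is consistent with the abstract's point that robustness claims based on conspicuous triggers can be optimistic.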

103. ❌ Scaling Whole-Body Human Musculoskeletal Behavior Emulation for Specificity and Diversity

Authors: Yunyue Wei, Chenhui Zuo, Shanning Zhuang, Haixin Gong, Yaming Liu, Yanan Sui Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29332v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

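Scoring rationale: The paper presents a whole-body human musculoskeletal motion emulation framework built on massively parallel GPU simulation and reinforcement learning, an application of AI to biomechanics and movement science. Of all the keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" is partly related (AI applied in a scientific domain); the paper does not involve large models, new deep learning principles, or any specific technique on the keyword list. Its core is reinforcement learning applied to biomechanical simulation, not large-model technology.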

!!! tip deepseek-chat TL;DR

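This paper proposes MS-Emulator, a framework based on massively parallel GPU simulation and reinforcement learning that overcomes optimization bottlenecks in motion emulation for high-dimensional musculoskeletal systems. It accurately reproduces highly dynamic tasks such as dance and cartwheels, and explores the diversity and specificity of muscle control policies.

Abstract (Chinese translation)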

人体运动控制的具身学习需要全身神经驱动的肌肉骨骼动力学系统，而运动背后的内部肌肉驱动过程仍无法直接测量。计算建模提供了替代方案，但逆向动力学方法难以从高维度、过驱动系统中观测到的运动学数据解析冗余控制。基于深度强化学习的正向模仿方法因控制与奖励设计中的维度灾难问题而表现出跟踪性能不足。本文提出了一种基于生物力学原理的大规模并行肌肉骨骼计算框架，用于实现全身运动复现。通过将大规模并行GPU模拟与对抗性奖励聚合及价值引导流探索相结合，MS-Emulator框架克服了肌肉骨骼控制高维强化学习中的关键优化瓶颈，在由约700块肌肉驱动的全身人体肌肉骨骼系统中精确复现了广泛的动作库。该框架在舞蹈、侧手翻和后空翻等高动态任务中实现了高精度的关节角度还原与身体位置对齐。研究还利用该框架探索肌肉骨骼控制解空间，发现了能收敛至几乎相同外部运动学与力学测量结果的不同肌肉骨骼控制策略。这项工作为分析人类运动具身控制背后的特异性与多样性建立了一条可行的计算路径。项目页面：https://lnsgroup.cc/research/MS-Emulator。

Abstract

The embodied learning of human motor control requires whole-body neuro-actuated musculoskeletal dynamics, while the internal muscle-driven processes underlying movement remain inaccessible to direct measurement. Computational modeling offers an alternative, but inverse dynamics methods struggled to resolve redundant control from observed kinematics in the high-dimensional, over-actuated system. Forward imitation approaches based on deep reinforcement learning exhibited inadequate tracking performance due to the curse of dimensionality in both control and reward design. Here we introduce a large-scale parallel musculoskeletal computation framework for biomechanically grounded whole-body motion reproduction. By integrating large-scale parallel GPU simulation with adversarial reward aggregation and value-guided flow exploration, the MS-Emulator framework overcomes key optimization bottlenecks in high-dimensional reinforcement learning for musculoskeletal control, which accurately reproduces a broad repertoire of motions in a whole-body human musculoskeletal system actuated by approximately 700 muscles. It achieved high joint angle accuracy and body position alignment for highly dynamic tasks such as dance, cartwheel, and backflip. The framework was also used to explore the musculoskeletal control solution space, identifying distinct musculoskeletal control policies that converge to nearly identical external kinematic and mechanical measurements. This work establishes a tractable computational route to analyzing the specificity and diversity underlying human embodied control of movement. Project page: https://lnsgroup.cc/research/MS-Emulator.

Keywords: musculoskeletal simulation, reinforcement learning, whole-body motion, GPU parallel computation, biomechanical modeling, motion reproduction, high-dimensional control, adversarial reward aggregation

104. ❌ Real-Time Band-Grouped Vocal Denoising Using Sigmoid-Driven Ideal Ratio Masking

Authors: Daniel Williams Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29326v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

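Scoring rationale: The paper applies deep learning to real-time vocal denoising, using a specific encoder-decoder architecture and loss design, but involves no large language model (LLM) technique, training method (pre-training, fine-tuning, alignment), inference optimization, agent system, or AI-for-science application. The keywords all relate to large-model technology or its scientific applications, whereas this is conventional deep learning audio processing, so every keyword has a relevance of 0.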

!!! tip deepseek-chat TL;DR

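This paper proposes a deep learning model for real-time vocal denoising based on a sigmoid-driven ideal ratio mask and a band-grouped encoder-decoder architecture. It achieves a latency under 10 ms and improves PESQ-WB by 0.21 on stationary noise and 0.12 on nonstationary noise.

Abstract (Chinese translation)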

基于深度学习的实时语音降噪技术在过去几年取得了显著进展，展现了人工智能在提升信噪比（SNR）的同时保持语音自然度的能力。然而，许多深度学习方法存在较高延迟，且需要长时上下文帧，难以配置于实时应用场景。为应对这些挑战，我们提出一种采用频谱损失训练的Sigmoid驱动理想比值掩模，旨在提升信噪比并最大化语音感知质量。该模型采用频带分组编码器-解码器架构，结合频率注意力机制，实现了低于10毫秒的总延迟，在平稳噪声和非平稳噪声环境下分别取得0.21和0.12的PESQ-WB指标提升。

Abstract

Real-time, deep learning-based vocal denoising has seen significant progress over the past few years, demonstrating the capability of artificial intelligence in preserving the naturalness of the voice while increasing the signal-to-noise ratio (SNR). However, many deep learning approaches have high amounts of latency and require long frames of context, making them difficult to configure for live applications. To address these challenges, we propose a sigmoid-driven ideal ratio mask trained with a spectral loss to encourage an increased SNR and maximized perceptual quality of the voice. The proposed model uses a band-grouped encoder-decoder architecture with frequency attention and achieves a total latency of less than 10 ms, with PESQ-WB improvements of 0.21 on stationary noise and 0.12 on nonstationary noise.

Keywords: vocal denoising, real-time processing, deep learning, ideal ratio mask, band-grouped encoder-decoder, frequency attention, low latency, PESQ-WB improvement
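
The abstract above describes a sigmoid-driven ideal ratio mask without giving its exact form. As a rough sketch, the code below uses one commonly cited ideal ratio mask definition (power ratio raised to an exponent, here 0.5) as the training target and a sigmoid to squash hypothetical network outputs into a mask in (0, 1). The exponent, the function names, and the per-bin logits are assumptions of this sketch, not details from the paper.

```python
import math


def ideal_ratio_mask(speech_mag, noise_mag, beta=0.5):
    """Per-bin target mask (S^2 / (S^2 + N^2)) ** beta from magnitude spectra."""
    mask = []
    for s, n in zip(speech_mag, noise_mag):
        power = s * s + n * n
        # An empty bin carries no energy, so the mask value there is arbitrary; use 0.
        mask.append((s * s / power) ** beta if power > 0 else 0.0)
    return mask


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def apply_sigmoid_mask(noisy_mag, logits):
    """Scale each noisy magnitude bin by sigmoid(logit), a mask value in (0, 1)."""
    return [m * sigmoid(z) for m, z in zip(noisy_mag, logits)]


# Hypothetical single-frame magnitudes for three frequency bins.
speech = [3.0, 0.0, 1.0]
noise = [4.0, 2.0, 0.0]
print(ideal_ratio_mask(speech, noise))
print(apply_sigmoid_mask([5.0, 2.0, 1.0], [0.0, -4.0, 4.0]))
```

Because the mask only rescales the magnitudes of the current frame, this kind of estimator needs little look-ahead, which fits the low-latency goal stated in the abstract.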

105. ❌ IMPASTO: Integrating Model-Based Planning with Learned Dynamics Models for Robotic Oil Painting Reproduction

Authors: Yingke Wang, Hao Li, Yifeng Zhu, Hong-Xing Yu, Ken Goldberg, Li Fei-Fei, Jiajun Wu, Yunzhu Li, Ruohan Zhang Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29315v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

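Scoring rationale: The paper presents IMPASTO, a robotic oil painting reproduction system involving learned pixel dynamics models, model-based planning, force control, and robot execution, which places it in robotics and computer vision. The scoring keywords all relate to large language models, the listed deep learning techniques, or AI-for-science applications; the paper involves no large models, none of the listed techniques, and no AI-for-science application (such as bioinformatics), so every keyword has a relevance of 0.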

!!! tip deepseek-chat TL;DR

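The paper studies how a robot, given only a sequence of target oil painting images, can learn pixel dynamics models and combine them with model-based planning to infer and execute the stroke trajectories, forces, and colors needed to reproduce the painting. The proposed IMPASTO system outperforms baselines in reproduction accuracy.

Abstract (Chinese translation)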

使用软刷与颜料进行油画机器人复现，需实现对可变形工具的力敏控制、笔触效果预测以及多步骤笔触规划，且通常缺乏人工逐步演示或高保真度仿真器。仅给定一系列目标油画图像序列，机器人能否推断并执行所需的笔触轨迹、施力及色彩以完成复现？我们提出IMPASTO系统，一种融合学习型像素动态模型与基于模型规划的机器人油画绘制系统。该动态模型通过图像观测与参数化笔触动作预测画布更新；随后采用滚动时域模型预测控制优化器规划轨迹与施力，并由力敏控制器驱动七自由度机械臂执行笔触。IMPASTO整合了底层力控、学习型动态模型与高层闭环规划，仅通过机器人自主演练学习，并成功模拟了人类艺术家的单笔触数据集与多笔触艺术作品，在复现精度上超越基线方法。项目网站：https://impasto-robopainting.github.io/

Abstract

Robotic reproduction of oil paintings using soft brushes and pigments requires force-sensitive control of deformable tools, prediction of brushstroke effects, and multi-step stroke planning, often without human step-by-step demonstrations or faithful simulators. Given only a sequence of target oil painting images, can a robot infer and execute the stroke trajectories, forces, and colors needed to reproduce it? We present IMPASTO, a robotic oil-painting system that integrates learned pixel dynamics models with model-based planning. The dynamics models predict canvas updates from image observations and parameterized stroke actions; a receding-horizon model predictive control optimizer then plans trajectories and forces, while a force-sensitive controller executes strokes on a 7-DoF robot arm. IMPASTO integrates low-level force control, learned dynamics models, and high-level closed-loop planning, learns solely from robot self-play, and approximates human artists’ single-stroke datasets and multi-stroke artworks, outperforming baselines in reproduction accuracy. Project website: https://impasto-robopainting.github.io/

Keywords: robotic oil painting, learned dynamics models, model-based planning, force-sensitive control, stroke trajectory planning, canvas update prediction, receding-horizon MPC, self-play learning
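
The IMPASTO abstract describes a receding-horizon model predictive control loop: plan over a short horizon with a dynamics model, execute the first action, then replan. The toy sketch below shows that loop on a scalar state with a hand-written stand-in for the learned dynamics model and exhaustive search over a tiny action set. All names, the dynamics, and the cost are assumptions for illustration and bear no relation to the paper's pixel-level models or optimizer.

```python
from itertools import product


def predict(state, action):
    """Toy stand-in for a learned dynamics model: the action shifts the state."""
    return state + action


def plan(state, target, horizon, actions=(-1, 0, 1)):
    """Return the action sequence with the lowest summed squared error to target."""
    best_cost, best_seq = None, None
    for seq in product(actions, repeat=horizon):
        s, cost = state, 0.0
        for a in seq:
            s = predict(s, a)
            cost += (s - target) ** 2
        if best_cost is None or cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_seq


def run_mpc(start, target, steps, horizon=3):
    """Receding horizon: replan every step and execute only the first action."""
    state, trajectory = start, [start]
    for _ in range(steps):
        first_action = plan(state, target, horizon)[0]
        state = predict(state, first_action)
        trajectory.append(state)
    return trajectory


print(run_mpc(start=0, target=3, steps=5))
```

Replanning at every step lets the controller correct for model error after each executed action, which is the usual motivation for closing the loop instead of executing one long open-loop plan.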

106. ❌ PSPA-Bench: A Personalized Benchmark for Smartphone GUI Agent

Authors: Hongyi Nie, Xunyuan Liu, Yudong Bai, Yaqing Wang, Yang Liu, Quanming Yao, Zhen Wang Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29318v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
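
The pass/fail decision behind the score line can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming the threshold formula from the report's configuration section (5.0 + 0.8 × total weight, with 27 keywords of weight 1.0) and assuming that a paper's total is the sum of the table's score column. The report does not state how each per-keyword score is derived from weight and relevance, so that step is not modelled.

```python
def pass_threshold(total_weight, base=5.0, factor=0.8):
    """Threshold formula quoted in the report's configuration section."""
    return base + factor * total_weight

def verdict(keyword_scores, total_weight):
    """Assumption: the total is the sum of per-keyword scores, and a paper
    passes when the total reaches the threshold (>= is itself an assumption)."""
    total = sum(keyword_scores)
    return total, total >= pass_threshold(total_weight)

threshold = round(pass_threshold(27.0), 1)  # 27 keywords, weight 1.0 each
# Score column of the table above: every row is 0.0.
total, passed = verdict([0.0] * 27, total_weight=27.0)
```

With every row scoring 0.0, the total stays below the threshold regardless of the relevance column, which matches the ❌ on the score line.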

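Scoring rationale: The paper presents a personalization benchmark for smartphone GUI agents; its core is evaluating how agents perform in personalized settings. It is highly relevant to LLM agents, reasoning methods (CoT, System 2), self-reflection, and tool use, since these are key components of agent capability. The paper notes that reasoning-oriented models outperform general LLMs and that reflection and long-term memory are key directions for improvement, and GUI agents directly involve tool use. Other keywords such as MoE, quantization, and RAG are not mentioned in the abstract and are unrelated to the paper's technical focus.

!!! tip deepseek-chat TL;DR

The paper introduces PSPA-Bench, a benchmark for evaluating smartphone GUI agents in personalized settings, finds that current methods perform poorly, and identifies reasoning-oriented models, perception, and reflection mechanisms as key directions for improvement.

Abstract (Chinese translation)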

智能手机GUI代理通过直接在应用程序界面上执行操作来完成任务，为实现广泛能力提供了一条无需深度系统集成的路径。然而，现实中的智能手机使用具有高度个性化特征：用户采用多样化的工作流程和偏好，这要求代理能够提供定制化协助而非通用解决方案。由于用户特定数据稀疏且缺乏细粒度评估指标，现有GUI代理基准测试无法充分捕捉这一个性化维度。为填补这一空白，我们提出了PSPA-Bench——首个专门评估智能手机GUI代理个性化能力的基准测试。PSPA-Bench包含12,855条以上符合真实用户行为的个性化指令，涵盖10个典型日常使用场景和22个移动应用程序，并引入结构感知的过程评估方法，在细粒度层面衡量代理的个性化能力。通过PSPA-Bench，我们对11种最先进的GUI代理进行了基准测试。结果表明，现有方法在个性化设置下表现不佳，即使是最强的代理也仅取得有限成功。我们的分析进一步揭示了推进个性化GUI代理发展的三个方向：（1）面向推理的模型持续优于通用大语言模型（LLMs），（2）感知能力仍是简单但关键的核心能力，（3）反思与长期记忆机制是提升适应性的关键。这些发现共同确立了PSPA-Bench作为个性化GUI代理系统性研究和未来发展的基础。

Abstract

Smartphone GUI agents execute tasks by operating directly on app interfaces, offering a path to broad capability without deep system integration. However, real-world smartphone use is highly personalized: users adopt diverse workflows and preferences, challenging agents to deliver customized assistance rather than generic solutions. Existing GUI agent benchmarks cannot adequately capture this personalization dimension due to sparse user-specific data and the lack of fine-grained evaluation metrics. To address this gap, we present PSPA-Bench, the benchmark dedicated to evaluating personalization in smartphone GUI agents. PSPA-Bench comprises over 12,855 personalized instructions aligned with real-world user behaviors across 10 representative daily-use scenarios and 22 mobile apps, and introduces a structure-aware process evaluation method that measures agents’ personalized capabilities at a fine-grained level. Through PSPA-Bench, we benchmark 11 state-of-the-art GUI agents. Results reveal that current methods perform poorly under personalized settings, with even the strongest agent achieving limited success. Our analysis further highlights three directions for advancing personalized GUI agents: (1) reasoning-oriented models consistently outperform general LLMs, (2) perception remains a simple yet critical capability, and (3) reflection and long-term memory mechanisms are key to improving adaptation. Together, these findings establish PSPA-Bench as a foundation for systematic study and future progress in personalized GUI agents.

Keywords: Smartphone GUI Agents, Personalization Benchmark, Personalized Instructions, Process Evaluation, Reasoning-oriented Models, Reflection Mechanisms, Long-term Memory, Agent Adaptation

107. ❌ MELT: Improve Composed Image Retrieval via the Modification Frequentation-Rarity Balance Network

Authors: Guozhi Qiu, Zhiwei Chen, Zixu Li, Qinlei Huang, Zhiheng Fu, Xuemeng Song, Yupeng Hu Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29291v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

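Scoring rationale: The paper studies composed image retrieval (CIR) and proposes the MELT network to address frequency bias and similarity estimation. All keywords focus on large language models (LLMs) and related techniques (e.g., MoE, RLHF, RAG) or on specific AI-for-science applications (e.g., bioinformatics). The paper covers multimodal fusion and image retrieval but does not mention any LLM technique, model architecture, training method, or AI-for-science application, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes the MELT network to address frequency bias and similarity estimation in composed image retrieval, improving retrieval performance by attending to rare modification semantics and applying diffusion-based denoising.

Abstract (Chinese translation)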

组合图像检索（Composed Image Retrieval, CIR）以一张参考图像和一段修改文本作为查询，旨在检索出满足“根据文本指令修改参考图像”要求的目标图像。然而，现有CIR方法面临两个局限：（1）频率偏差导致“稀有样本忽视”，（2）相似度评分易受困难负样本和噪声干扰。为应对这些局限，我们需解决两个关键挑战：多模态语境下的非对称稀有语义定位，以及在困难负样本下的鲁棒相似度估计。为解决这些挑战，我们提出了修改频率-稀有度平衡网络MELT。MELT在多模态语境中增强对稀有修改语义的关注，同时对具有高相似度评分的困难负样本应用基于扩散的去噪方法，从而提升多模态融合与匹配能力。在两个CIR基准数据集上的大量实验验证了MELT的优越性能。代码可在https://github.com/luckylittlezhi/MELT获取。

Abstract

Composed Image Retrieval (CIR) uses a reference image and a modification text as a query to retrieve a target image satisfying the requirement of "modifying the reference image according to the text instructions". However, existing CIR methods face two limitations: (1) frequency bias leading to "Rare Sample Neglect", and (2) susceptibility of similarity scores to interference from hard negative samples and noise. To address these limitations, we confront two key challenges: asymmetric rare semantic localization and robust similarity estimation under hard negative samples. To solve these challenges, we propose the Modification frEquentation-rarity baLance neTwork (MELT). MELT assigns increased attention to rare modification semantics in multimodal contexts while applying diffusion-based denoising to hard negative samples with high similarity scores, enhancing multimodal fusion and matching. Extensive experiments on two CIR benchmarks validate the superior performance of MELT. Code is available at https://github.com/luckylittlezhi/MELT.

Keywords: Composed Image Retrieval, Multimodal Fusion, Frequency Bias, Rare Sample Neglect, Hard Negative Samples, Diffusion Denoising, Similarity Estimation, Modification Semantics

108. ❌ Downsides of Smartness Across Edge-Cloud Continuum in Modern Industry

Authors: Akhil Gupta Chigullapally, Sharvan Vittala, Razin Farhan Hussian, Mohsen Amini Salehi Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29289v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

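Scoring rationale: The paper mainly discusses the security risks and negative effects, such as interoperability side effects and cyber threats, of deploying smart industrial systems (in particular AI-based solutions, including machine learning, reinforcement learning, and generative AI) across the edge-fog-cloud computing continuum. However, it does not examine in depth any specific large-model technique, training method, inference optimization, alignment technique, or application domain (e.g., bioinformatics). All keywords concern specific large-model techniques, methods, or application domains, whereas the paper mentions AI solutions only at a high level and does not focus on large models or their technical details, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The study highlights the security hazards that smart industrial systems face when deployed across the edge-cloud continuum, including interoperability side effects and cyber threats, and stresses that understanding and addressing these downsides is important for the secure and sustainable development of smart industrial systems.

Abstract (Chinese translation)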

现代人工智能的快速发展正迅速将传统工业系统转变为由人工智能解决方案驱动的大规模、智能化且可能无人化的自主运行环境。这些解决方案运用多种形式的机器学习、强化学习和生成式人工智能。此类智能能力的引入已突破多个工业领域的边界，实现了预测性维护、性能优化和工作流程精简。这些解决方案通常部署于工业物联网（IIoT，Industrial Internet of Things）中，并依托边缘-雾-云计算连续体提供支持，以实现紧急（即实时或近实时）决策。尽管当前业界为提升利润、质量和效率正积极采用这些智能工业解决方案，但大规模集成与部署也带来了严重隐患，若被忽视可能削弱智能工业的效益。这些隐患包括不可预见的互操作性副作用，以及对网络威胁的脆弱性加剧——尤其是在运行大量异构IIoT系统的环境中。本研究旨在阐明工业智能化可能带来的后果，特别聚焦于安全影响，包括脆弱性、副作用和网络威胁。我们区分了源自传统AI解决方案与生成式AI的软件层面弊端，以及源于基础设施层（即IIoT和边缘-云连续体）的弊端。在每个层面，我们探究了潜在的脆弱性、网络威胁和非预期副作用。随着工业持续智能化，理解并应对这些弊端对于确保智能工业系统的安全与可持续发展至关重要。

Abstract

The fast pace of modern AI is rapidly transforming traditional industrial systems into vast, intelligent and potentially unmanned autonomous operational environments driven by AI-based solutions. These solutions leverage various forms of machine learning, reinforcement learning, and generative AI. The introduction of such smart capabilities has pushed the envelope in multiple industrial domains, enabling predictive maintenance, optimized performance, and streamlined workflows. These solutions are often deployed across the Industrial Internet of Things (IIoT) and supported by the Edge-Fog-Cloud computing continuum to enable urgent (i.e., real-time or near real-time) decision-making. Despite the current trend of aggressively adopting these smart industrial solutions to increase profit, quality, and efficiency, large-scale integration and deployment also bring serious hazards that if ignored can undermine the benefits of smart industries. These hazards include unforeseen interoperability side-effects and heightened vulnerability to cyber threats, particularly in environments operating with a plethora of heterogeneous IIoT systems. The goal of this study is to shed light on the potential consequences of industrial smartness, with a particular focus on security implications, including vulnerabilities, side effects, and cyber threats. We distinguish software-level downsides stemming from both traditional AI solutions and generative AI from those originating in the infrastructure layer, namely IIoT and the Edge-Cloud continuum. At each level, we investigate potential vulnerabilities, cyber threats, and unintended side effects. As industries continue to become smarter, understanding and addressing these downsides will be crucial to ensure secure and sustainable development of smart industrial systems.

Keywords: Industrial AI, Edge-Cloud Continuum, IIoT Security, Cyber Threats, Smart Industries, Generative AI, Predictive Maintenance, Autonomous Systems

109. ❌ Sima AIunty: Caste Audit in LLM-Driven Matchmaking

Authors: Atharva Naik, Shounok Kar, Varnika Sharma, Ashwin Rajadesingan, Koustuv Saha Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29288v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

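Scoring rationale: The paper's core topic is caste bias in LLMs used for matchmaking, which directly matches the "Large Language Models" keyword, so that keyword receives a relevance of 10. The paper involves no other technical innovations (e.g., MoE, quantization, inference acceleration) or specific application domains (e.g., bioinformatics), and it includes none of the designated expert authors, so all other keywords receive 0.

!!! tip deepseek-chat TL;DR

By auditing how large language models such as GPT, Gemini, and Llama evaluate matrimonial matches, the paper finds that these models systematically reproduce the Indian caste hierarchy, with same-caste matches rated up to 25% higher than inter-caste matches.

Abstract (Chinese translation)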

在婚介等关系性领域的社会与个人决策，深深植根于文化规范与历史等级结构，并可能受到算法和人工智能对匹配度、接纳度及稳定性的评估所影响。在南亚语境中，种姓仍是婚姻决策的核心因素，然而当代大语言模型在此类情境中如何复制或打破基于种姓的分层机制，目前尚不明确。本研究使用真实婚介档案，对大语言模型中介的婚配评估进行了种姓偏见的受控审计。我们变换了婆罗门、刹帝利、吠舍、首陀罗和达利特种姓身份，并设置了五个收入区间，评估了五个大语言模型系列（GPT、Gemini、Llama、Qwen和BharatGPT）。研究通过提示模型从社会接纳度、婚姻稳定性和文化适配度三个维度评估档案。分析显示，所有模型均呈现一致的等级化模式：同种姓匹配获得最高评分，其平均评分（基于10分制）比跨种姓匹配高出达25%，而跨种姓匹配的评分又依照传统种姓等级进一步排序。这些发现揭示了现有种姓等级制度如何在大语言模型决策中被复制，并强调在部署于社会敏感领域的人工智能系统中，亟需建立基于文化的评估与干预策略，因为此类系统可能强化历史性的排斥结构。

Abstract

Social and personal decisions in relational domains such as matchmaking are deeply entwined with cultural norms and historical hierarchies, and can potentially be shaped by algorithmic and AI-mediated assessments of compatibility, acceptance, and stability. In South Asian contexts, caste remains a central aspect of marital decision-making, yet little is known about how contemporary large language models (LLMs) reproduce or disrupt caste-based stratification in such settings. In this work, we conduct a controlled audit of caste bias in LLM-mediated matchmaking evaluations using real-world matrimonial profiles. We vary caste identity across Brahmin, Kshatriya, Vaishya, Shudra, and Dalit, and income across five buckets, and evaluate five LLM families (GPT, Gemini, Llama, Qwen, and BharatGPT). Models are prompted to assess profiles along dimensions of social acceptance, marital stability, and cultural compatibility. Our analysis reveals consistent hierarchical patterns across models: same-caste matches are rated most favorably, with average ratings up to 25% higher (on a 10-point scale) than inter-caste matches, which are further ordered according to traditional caste hierarchy. These findings highlight how existing caste hierarchies are reproduced in LLM decision-making and underscore the need for culturally grounded evaluation and intervention strategies in AI systems deployed in socially sensitive domains, where such systems risk reinforcing historical forms of exclusion.

Keywords: LLM bias, caste audit, matchmaking, social acceptance, marital stability, cultural compatibility, algorithmic fairness, South Asian context

110. ❌ Grokking From Abstraction to Intelligence

Authors: Junjie Zhang, Zhen Shen, Gang Xiong, Xisong Dong Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29262v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

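Scoring rationale: The paper studies grokking (the mechanism by which a model transitions from memorization to generalization), which is mechanistic research into the inner workings of deep learning models. It uses causal analysis, spectral analysis, algorithmic complexity measures, and Singular Learning Theory to examine the simplification of internal model structure, the physical collapse of redundant manifolds, and information compression. This is highly relevant to "Mechanistic Interpretability OR Explainable AI" (10 points), because the core of the paper is revealing the internal mechanism of generalization through interpretability methods. The other keywords mainly concern specific large-model techniques (e.g., LLMs, MoE, SFT, RAG), application domains (e.g., AI for Science), or particular capabilities (e.g., reasoning, alignment); this paper is foundational research on model mechanisms and does not involve those techniques or applications, so they receive 0 points.

!!! tip deepseek-chat TL;DR

The paper studies the mechanism of grokking (the transition from memorization to generalization) in deep learning models and finds that it arises from a spontaneous simplification of internal model structure and the physical collapse of redundant manifolds, offering a new perspective on overfitting and generalization.

Abstract (Chinese translation)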

模块算术中的顿悟现象已成为典型的果蝇实验，作为探究模型泛化机制起源的关键领域。尽管其意义重大，现有研究仍狭隘地聚焦于特定局部电路或优化调参，很大程度上忽视了根本驱动这一现象的全局结构演化。我们认为，顿悟源于模型内部结构受简约性原则支配的自发简化过程。我们整合了因果性、谱分析与算法复杂性度量，并结合奇异学习理论，揭示了从记忆到泛化的转变对应着冗余流形的物理坍缩与深层信息压缩，为理解模型过拟合与泛化机制提供了新的视角。

Abstract

Grokking in modular arithmetic has established itself as the quintessential fruit fly experiment, serving as a critical domain for investigating the mechanistic origins of model generalization. Despite its significance, existing research remains narrowly focused on specific local circuits or optimization tuning, largely overlooking the global structural evolution that fundamentally drives this phenomenon. We propose that grokking originates from a spontaneous simplification of internal model structures governed by the principle of parsimony. We integrate causal, spectral, and algorithmic complexity measures alongside Singular Learning Theory to reveal that the transition from memorization to generalization corresponds to the physical collapse of redundant manifolds and deep information compression, offering a novel perspective for understanding the mechanisms of model overfitting and generalization.

Keywords: grokking, modular arithmetic, model generalization, mechanistic interpretability, singular learning theory, information compression, overfitting, internal model structures
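
For readers unfamiliar with the setup, a modular-arithmetic task of the kind used in grokking experiments can be generated in a few lines. This is a sketch under assumptions: the abstract does not name the operation, modulus, or split, so modular addition, `p = 97`, and a 30% training fraction are illustrative choices only.

```python
import random

def modular_addition_dataset(p=97, train_fraction=0.3, seed=0):
    """Enumerate all pairs (a, b) with label (a + b) mod p, then split them."""
    examples = [((a, b), (a + b) % p) for a in range(p) for b in range(p)]
    random.Random(seed).shuffle(examples)  # fixed seed for a reproducible split
    n_train = int(train_fraction * len(examples))
    return examples[:n_train], examples[n_train:]

train_set, test_set = modular_addition_dataset()
```

In a typical grokking run, accuracy on `train_set` saturates early while accuracy on `test_set` rises only much later in training.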

111. ❌ PRISM: A Multi-View Multi-Capability Retail Video Dataset for Embodied Vision-Language Models

Authors: Amirreza Rouhi, Parikshit Sakurikar, Satya Sai Reddy, Narsimha Menga, Anirudh Govil, Sri Harsha Chittajallu, Rajat Aggarwal, Anoop Namboodiri, Sashi Reddi Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29281v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

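Scoring rationale: The paper studies embodied vision-language models (VLMs) in retail environments, building the PRISM dataset for supervised fine-tuning (SFT) to improve the models' understanding of space, physical dynamics, and embodied action. Relevance to the keywords: (1) highly relevant to "Post-training OR Supervised Fine-tuning OR SFT" (10 points), because the paper's core is fine-tuning VLMs with SFT; (2) related to "Chain of Thought OR CoT Reasoning OR Multi-step Reasoning" (8 points), because the dataset includes chain-of-thought supervision; (3) moderately related to "Large Language Models OR LLMs OR Foundation Models" (5 points), because VLMs are large models, although the paper focuses on vision-language models rather than text-only LLMs; (4) other keywords such as MoE, SLMs, Scaling Laws, and RLHF are unrelated to the paper (0 points), because it does not involve these techniques.

!!! tip deepseek-chat TL;DR

The paper introduces the PRISM dataset for supervised fine-tuning of embodied vision-language models, improving their spatial, physical, and action understanding in retail environments; experiments show a 66.6% reduction in error rate after fine-tuning.

Abstract (Chinese translation)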

当前顶尖物理人工智能模型所具备的通用视觉理解能力，与结构化现实世界部署环境所要求的专业化感知需求之间，存在一个关键差距。我们提出了PRISM，这是一个包含27万个样本的多视角视频监督微调数据集，专为现实零售环境中的具身视觉-语言模型而构建。PRISM的提出基于一个简单的观察：物理人工智能系统的失败并非源于视觉识别能力不足，而是因为它们对空间、物理动态以及具身行动的理解尚未达到能在现实世界中可靠运作的水平。为此，PRISM植根于一个新颖的三维知识本体，该本体涵盖空间知识、时间与物理知识以及具身行动知识。它覆盖了四个评估维度下的20多项能力探测任务——具身推理、常识、空间感知和直觉物理。据我们所知，PRISM是首个在单一现实世界部署领域内实例化所有三个知识维度的数据集。该数据集采集自五个超市场景，包含第一人称、第三人称和360°视角的视频数据，并提供开放式、思维链式以及多项选择式的监督信号。以每秒4帧计，PRISM包含约1180万视频帧和约7.3亿个文本标记，使其成为最大的领域专用视频监督微调数据集之一。在PRISM上进行微调后，模型在所有20多项探测任务上的错误率比预训练基线降低了66.6%，其中在具身行动理解方面提升显著，准确率提高了36.4%。我们的结果表明，基于本体结构化的领域专用监督微调能够有效增强具身视觉-语言模型在现实场景中的能力。PRISM数据集及更多细节请访问 https://dreamvu.ai/prism。

Abstract

A critical gap exists between the general-purpose visual understanding of state-of-the-art physical AI models and the specialized perceptual demands of structured real-world deployment environments. We present PRISM, a 270K-sample multi-view video supervised fine-tuning (SFT) corpus for embodied vision-language models (VLMs) in real-world retail environments. PRISM is motivated by a simple observation: physical AI systems fail not because of poor visual recognition, but because they do not understand space, physical dynamics, and embodied action well enough to operate reliably in the world. To this end, PRISM is grounded in a novel three-dimensional knowledge ontology that spans spatial knowledge, temporal and physical knowledge, and embodied action knowledge. It covers 20+ capability probes across four evaluation dimensions: Embodied Reasoning (ER), Common Sense (CS), Spatial Perception (SP), and Intuitive Physics (IP). To our knowledge, PRISM is the first dataset to instantiate all three knowledge dimensions within a single real-world deployment domain. The corpus captures data from egocentric, exocentric, and 360° viewpoints across five supermarket locations and includes open-ended, chain-of-thought, and multiple-choice supervision. At 4 fps, PRISM spans approximately 11.8M video frames and approximately 730M tokens, placing it among the largest domain-specific video SFT corpora. Fine-tuning on PRISM reduces the error rate across all 20+ probes by 66.6% over the pre-trained baseline, with significant gains in embodied action understanding, where accuracy improves by 36.4%. Our results suggest that ontology-structured, domain-specific SFT can meaningfully strengthen embodied VLMs for real-world settings. The PRISM dataset and more details are available at https://dreamvu.ai/prism

Keywords: embodied vision-language models, supervised fine-tuning, multi-view video dataset, retail environments, spatial knowledge, physical dynamics, chain-of-thought, domain-specific SFT
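
The corpus statistics imply a rough footage duration, and the headline error-rate figure is read here as a relative reduction. The sketch below is back-of-envelope arithmetic only: it assumes every counted frame was sampled at 4 fps, and the baseline error rate of 0.30 is a made-up value used to show what a 66.6% relative reduction would mean. Neither the duration nor the baseline is reported in the abstract.

```python
FRAMES = 11.8e6  # approximate frame count from the abstract
FPS = 4          # sampling rate from the abstract

# Implied footage, summed over all camera views (assumption: uniform 4 fps).
hours_of_video = FRAMES / FPS / 3600

def relative_error_reduction(baseline_error, new_error):
    """Fraction of the baseline error that was removed."""
    return (baseline_error - new_error) / baseline_error

hypothetical_baseline = 0.30                            # made-up, for illustration
hypothetical_new = hypothetical_baseline * (1 - 0.666)  # after a 66.6% relative cut
```

Under these assumptions the frame count corresponds to a little over 800 hours of sampled video across views.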

112. ❌ Monodense Deep Neural Model for Determining Item Price Elasticity

Authors: Lakshya Garg, Sai Yaswanth, Deep Narayan Mishra, Karthik Kumaran, Anupriya Sharma, Mayank Uniyal Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29261v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

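Scoring rationale: The paper studies price elasticity estimation, applying machine learning methods (the Monodense-DL neural network, DML, and LGBM) to large-scale transactional data. It does not involve large models, the LLM-related techniques the keywords target, or AI for Science. All keywords are unrelated to the paper's content.

!!! tip deepseek-chat TL;DR

The paper proposes a new Monodense deep neural network framework for estimating item price elasticity when no treatment-control setting is available, and shows on multi-category retail data that it outperforms other machine learning methods.

Abstract (Chinese translation)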

商品价格弹性用于量化消费者需求对商品价格变化的敏感程度，帮助企业制定定价策略并优化收益管理。实体零售、电子商务和消费品等行业依赖于从历史销售与定价数据中推导出的弹性信息。该弹性指标有助于理解不同商品间的购买行为、消费者对折扣的敏感度以及需求弹性较高的商品类别。这些信息对于竞争激烈的市场及资源有限的企业决策尤为宝贵，旨在实现利润与市场份额的最大化。价格弹性还能揭示消费者敏感度随时间推移的历史性变化。本文利用大规模交易数据集，对商品级价格弹性进行建模，提出一种新颖的弹性估计框架，该框架能够在缺乏实验对照组的情境下有效工作。我们通过以下基于机器学习的算法对该框架进行测试，包括我们新提出的Monodense深度神经网络：（1）Monodense-DL网络——结合嵌入层、全连接层与Monodense层的混合神经网络架构；（2）DML（Double Machine Learning）——采用回归模型的双重机器学习设定；（3）LGBM（Light Gradient Boosting Model）——轻量梯度提升模型。我们通过回溯测试框架，在涵盖数百万笔交易的多品类零售数据上评估模型性能。实验结果表明，在此框架下，我们提出的神经网络模型相较于上述其他主流机器学习方法具有优越性。

Abstract

Item price elasticity quantifies the responsiveness of consumer demand to changes in item prices, enabling businesses to create pricing strategies and optimize revenue management. Sectors such as store retail, e-commerce, and consumer goods rely on elasticity information derived from historical sales and pricing data. This elasticity provides an understanding of purchasing behavior across different items, consumer discount sensitivity, and demand-elastic departments. This information is particularly valuable for decision-making in competitive markets and resource-constrained businesses that aim to maximize profitability and market share. Price elasticity also uncovers historical shifts in consumer responsiveness over time. In this paper, we model item-level price elasticity using large-scale transactional datasets by proposing a novel elasticity estimation framework that can work in the absence of a treatment-control setting. We test this framework using the machine-learning-based algorithms listed below, including our newly proposed Monodense deep neural network:

(1) Monodense-DL network: a hybrid neural network architecture combining embedding, dense, and Monodense layers
(2) DML: a double machine learning setting using regression models
(3) LGBM: Light Gradient Boosting Model

We evaluate our model on multi-category retail data spanning millions of transactions using a back-testing framework. Experimental results demonstrate the superiority of our proposed neural network model within the framework compared to the other prevalent ML-based methods listed above.

Keywords: price elasticity, Monodense deep neural network, transactional datasets, elasticity estimation framework, retail data, machine learning, back testing, revenue management
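
Of the three estimators listed, double machine learning (DML) is the easiest to illustrate compactly. The sketch below is a generic partialling-out DML estimator with two-fold cross-fitting on synthetic data, not the paper's pipeline: the single confounder `x`, the linear nuisance models, the log-log demand specification, and the true elasticity of -1.5 are all assumptions made for illustration.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def dml_elasticity(x, log_price, log_qty, n_folds=2):
    """Partialling-out DML: residualize price and quantity on the confounder
    with cross-fitted nuisance models, then regress residual on residual."""
    n = len(x)
    res_p, res_q = [0.0] * n, [0.0] * n
    for fold in range(n_folds):
        train = [i for i in range(n) if i % n_folds != fold]
        held_out = [i for i in range(n) if i % n_folds == fold]
        xs = [x[i] for i in train]
        a_p, b_p = fit_line(xs, [log_price[i] for i in train])
        a_q, b_q = fit_line(xs, [log_qty[i] for i in train])
        for i in held_out:  # residuals use models fitted on the other fold
            res_p[i] = log_price[i] - (a_p + b_p * x[i])
            res_q[i] = log_qty[i] - (a_q + b_q * x[i])
    return sum(p * q for p, q in zip(res_p, res_q)) / sum(p * p for p in res_p)

# Synthetic data: x (e.g. a seasonality index) drives both price and demand.
rng = random.Random(0)
TRUE_ELASTICITY = -1.5
x = [rng.gauss(0, 1) for _ in range(4000)]
log_price = [0.8 * xi + rng.gauss(0, 1) for xi in x]
log_qty = [TRUE_ELASTICITY * p + 1.2 * xi + rng.gauss(0, 1)
           for p, xi in zip(log_price, x)]

naive_estimate = fit_line(log_price, log_qty)[1]  # ignores the confounder
dml_estimate = dml_elasticity(x, log_price, log_qty)
```

On this synthetic data the naive slope is biased toward zero because `x` raises both price and demand, while the residual-on-residual estimate should land close to the assumed -1.5.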

113. ❌ Covertly improving intelligibility with data-driven adaptations of speech timing

Authors: Paige Tuttösí, Angelica Lim, H. Henny Yeung, Yue Wang, Jean-Julien Aucouturier Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.30032v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

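Rationale: The paper studies how speech-rate adjustments affect intelligibility and uses a data-driven approach to build a text-to-speech algorithm. Although it involves machine learning (text-to-speech), its core is speech processing, auditory perception, and accessibility technology rather than innovation in large models or deep learning methods. All of the keywords concern large models, deep learning techniques, or AI-for-science applications, none of which the paper addresses, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper examines how targeted speech-rate adjustments improve intelligibility. It finds that rate adjustments with a specific temporal structure significantly improve word comprehension for non-native listeners and for native listeners in challenging listening conditions, whereas global slowing actually increases comprehension errors. It also develops a data-driven text-to-speech algorithm that reproduces this temporal structure.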

Abstract

Human talkers often address listeners with language-comprehension challenges, such as hard-of-hearing or non-native adults, by globally slowing down their speech. However, it remains unclear whether this strategy actually makes speech more intelligible. Here, we take advantage of recent advancements in machine-generated speech allowing more precise control of speech rate in order to systematically examine how targeted speech-rate adjustments may improve comprehension. We first use reverse-correlation experiments to show that the temporal influence of speech rate prior to a target vowel contrast (ex. the tense-lax distinction) in fact manifests in a scissor-like pattern, with opposite effects in early versus late context windows; this pattern is remarkably stable both within individuals and across native L1-English listeners and L2-English listeners with French, Mandarin, and Japanese L1s. Second, we show that this speech rate structure not only facilitates L2 listeners’ comprehension of the target vowel contrast, but that native listeners also rely on this pattern in challenging acoustic conditions. Finally, we build a data-driven text-to-speech algorithm that replicates this temporal structure on novel speech sequences. Across a variety of sentences and vowel contrasts, listeners remained unaware that such targeted slowing improved word comprehension. Strikingly, participants instead judged the common strategy of global slowing as clearer, even though it actually increased comprehension errors. Together, these results show that targeted adjustments to speech rate significantly aid intelligibility under challenging conditions, while often going unnoticed. More generally, this paper provides a data-driven methodology to improve the accessibility of machine-generated speech which can be extended to other aspects of speech comprehension and a wide variety of listeners and environments.
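The reverse-correlation step mentioned in the abstract can be illustrated with the textbook form of the method: perturb the speech rate at random in each context window, record a binary response per trial, and estimate a kernel as the difference between the mean perturbation on trials with one response and the mean on trials with the other. The sketch below is a minimal illustration under that assumption; the two-window layout, the toy numbers, and the function name are hypothetical and are not taken from the authors' analysis.

```python
from statistics import mean

def reverse_correlation_kernel(perturbations, responses):
    """Estimate a first-order kernel from random perturbations.

    perturbations: one list per trial, holding a rate offset per context window
    responses: 0/1 per trial (e.g. 1 = listener reported the tense vowel)
    Returns, per window, mean(offset | response 1) - mean(offset | response 0).
    """
    yes = [p for p, r in zip(perturbations, responses) if r == 1]
    no = [p for p, r in zip(perturbations, responses) if r == 0]
    if not yes or not no:
        raise ValueError("both response classes are needed")
    n_windows = len(perturbations[0])
    return [
        mean(t[w] for t in yes) - mean(t[w] for t in no)
        for w in range(n_windows)
    ]

# Toy data (hypothetical): window 0 = early context, window 1 = late context.
trials = [[0.2, -0.2], [0.1, -0.1], [-0.2, 0.2], [-0.1, 0.1]]
answers = [1, 1, 0, 0]
kernel = reverse_correlation_kernel(trials, answers)
print(kernel)
```

In this toy data the two windows receive kernel weights of opposite sign, which is the kind of early-versus-late reversal the abstract describes as scissor-like.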

Keywords: speech intelligibility, speech rate adjustment, data-driven TTS algorithm, L2 listeners, vowel contrast comprehension, reverse-correlation experiments, machine-generated speech, accessibility improvement

114. ❌ ContextClaim: A Context-Driven Paradigm for Verifiable Claim Detection

Authors: Yufeng Li, Rrubaa Panchendrarajan, Arkaitz Zubiaga | Journal/Source: arXiv | Published: 2026-03-31 | arXiv link: http://arxiv.org/abs/2603.30025v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

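Rationale: The paper proposes the ContextClaim paradigm, which uses large language models (LLMs) to generate contextual summaries that improve verifiable claim detection, so it is highly relevant to "Large Language Models" (10). The method retrieves Wikipedia information as context, which makes it highly relevant to "Retrieval-Augmented Generation" (10). The work concerns fact-checking and truthfulness assessment, which is relevant to "Hallucination Mitigation OR Factuality OR Truthfulness" (8). The experiments cover fine-tuning, zero-shot, and few-shot settings, giving some connection to "Post-training OR Supervised Fine-tuning OR SFT" and "In-context Learning OR Many-shot Learning" (5 each). Other keywords, such as MoE, quantization, and agents, are not addressed and receive 0.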
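The total of 0.0 despite several relevance ratings of 8 to 10 appears to follow from the weight column, which is 0.0 in every row of the table. The sketch below works through the arithmetic. Only the passing-score formula (5.0 + 0.8 × total weight) and the 15-point per-keyword maximum come from the report's configuration; the row formula `weight × relevance / 10 × 15` is an assumption made for illustration, since the report does not state how a row score is derived.

```python
MAX_PER_KEYWORD = 15.0  # per-keyword maximum, from the report configuration
TOTAL_WEIGHT = 27.0     # 27 keywords with weight 1.0 each in the configuration

def passing_score(total_weight: float) -> float:
    """Passing threshold as stated in the report: 5.0 + 0.8 x total weight."""
    return 5.0 + 0.8 * total_weight

def row_score(weight: float, relevance: float) -> float:
    """ASSUMED row formula (not stated in the report)."""
    return weight * (relevance / 10.0) * MAX_PER_KEYWORD

# (weight, relevance) for the five non-zero-relevance rows of this entry.
rows = [(0.0, 10.0), (0.0, 5.0), (0.0, 10.0), (0.0, 8.0), (0.0, 5.0)]
total = sum(row_score(w, r) for w, r in rows)

print(round(passing_score(TOTAL_WEIGHT), 1), total)  # prints "26.6 0.0"
```

Under this assumed formula, any row with weight 0.0 contributes nothing regardless of relevance, which would explain why the entry is marked as failing against the 26.6 threshold.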

!!! tip deepseek-chat TL;DR

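The paper proposes ContextClaim, a context-driven paradigm for verifiable claim detection that retrieves external knowledge and uses large language models to generate contextual summaries. Experiments show that the approach can improve detection performance, although its effect varies across domains, model architectures, and learning settings.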

Abstract

Verifiable claim detection asks whether a claim expresses a factual statement that can, in principle, be assessed against external evidence. As an early filtering stage in automated fact-checking, it plays an important role in reducing the burden on downstream verification components. However, existing approaches to claim detection, whether based on check-worthiness or verifiability, rely solely on the claim text itself. This is a notable limitation for verifiable claim detection in particular, where determining whether a claim is checkable may benefit from knowing what entities and events it refers to and whether relevant information exists to support verification. Inspired by the established role of evidence retrieval in later-stage claim verification, we propose Context-Driven Claim Detection (ContextClaim), a paradigm that advances retrieval to the detection stage. ContextClaim extracts entity mentions from the input claim, retrieves relevant information from Wikipedia as a structured knowledge source, and employs large language models to produce concise contextual summaries for downstream classification. We evaluate ContextClaim on two datasets covering different topics and text genres, the CheckThat! 2022 COVID-19 Twitter dataset and the PoliClaim political debate dataset, across encoder-only and decoder-only models under fine-tuning, zero-shot, and few-shot settings. Results show that context augmentation can improve verifiable claim detection, although its effectiveness varies across domains, model architectures, and learning settings. Through component analysis, human evaluation, and error analysis, we further examine when and why the retrieved context contributes to more reliable verifiability judgments.

Keywords: verifiable claim detection, context-driven, large language models, retrieval-augmented, fact-checking, context augmentation, few-shot learning, entity extraction

115. ❌ Tracking Equivalent Mechanistic Interpretations Across Neural Networks

Authors: Alan Sun, Mariya Toneva | Journal/Source: arXiv | Published: 2026-03-31 | arXiv link: http://arxiv.org/abs/2603.30002v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

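Rationale: The paper focuses on neural network interpretability, specifically the mechanistic interpretability (MI) framework, so it is highly relevant to "Mechanistic Interpretability OR Explainable AI" (10). It reports a case study on Transformer-based models, giving some connection to "Large Language Models OR LLMs OR Foundation Models" (5). The other keywords mainly concern specific large-model techniques, applications, or optimization methods, whereas this paper centers on a theoretical framework and algorithm for interpretation, so they have no direct connection (0).

!!! tip deepseek-chat TL;DR

The paper studies when mechanistic interpretations of neural networks are equivalent. It proposes a definition of interpretive equivalence and an algorithm to estimate it, applies the algorithm in a case study on Transformer-based models, and provides a theoretical foundation for interpretability evaluation and automated interpretation discovery.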

Abstract

Mechanistic interpretability (MI) is an emerging framework for interpreting neural networks. Given a task and model, MI aims to discover a succinct algorithmic process, an interpretation, that explains the model’s decision process on that task. However, MI is difficult to scale and generalize. This stems in part from two key challenges: there is no precise notion of a valid interpretation; and, generating interpretations is often an ad hoc process. In this paper, we address these challenges by defining and studying the problem of interpretive equivalence: determining whether two different models share a common interpretation, without requiring an explicit description of what that interpretation is. At the core of our approach, we propose and formalize the principle that two interpretations of a model are equivalent if all of their possible implementations are also equivalent. We develop an algorithm to estimate interpretive equivalence and case study its use on Transformer-based models. To analyze our algorithm, we introduce necessary and sufficient conditions for interpretive equivalence based on models’ representation similarity. We provide guarantees that simultaneously relate a model’s algorithmic interpretations, circuits, and representations. Our framework lays a foundation for the development of more rigorous evaluation methods of MI and automated, generalizable interpretation discovery methods.

Keywords: Mechanistic Interpretability, Interpretive Equivalence, Neural Networks, Transformer Models, Algorithmic Interpretations, Representation Similarity, Interpretation Discovery

116. ❌ Structural Feature Engineering for Generative Engine Optimization: How Content Structure Shapes Citation Behavior

Authors: Junwei Yu, Mufeng Yang, Yepeng Ding, Hiroyuki Sato | Journal/Source: arXiv | Published: 2026-03-31 | arXiv link: http://arxiv.org/abs/2603.29979v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

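Rationale: The paper studies Generative Engine Optimization (GEO), focusing on optimizing content structure to raise citation rates in LLM-powered search engines, which is an application-level study of large models. It explicitly mentions "LLM-powered information ecosystems", so it is highly relevant to "Large Language Models" (8). The other keywords mainly concern the technical principles of large models (such as MoE, training methods, and inference optimization) or specific application areas (such as AI for science). This paper instead examines how content structure affects citation behavior, an application-level optimization with no direct connection to those keywords, so they score 0.

!!! tip deepseek-chat TL;DR

The paper examines how content structure affects citation behavior in AI-powered search engines and proposes GEO-SFE, a structural feature engineering framework. Experiments across six generative engines report improvements in citation rate (17.3 percent) and subjective quality (18.5 percent).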

Abstract

The proliferation of AI-powered search engines has shifted information discovery from traditional link-based retrieval to direct answer generation with selective source citation, creating new challenges for content visibility. While existing Generative Engine Optimization (GEO) approaches focus primarily on semantic content modification, the role of structural features in influencing citation behavior remains underexplored. In this paper, we propose GEO-SFE, a systematic framework for structural feature engineering in generative engine optimization. Our approach decomposes content structure into three hierarchical levels: macro-structure (document architecture), meso-structure (information chunking), and micro-structure (visual emphasis), and models their impact on citation probability across different generative engine architectures. We develop architecture-aware optimization strategies and predictive models that preserve semantic integrity while improving structural effectiveness. Experimental evaluation across six mainstream generative engines demonstrates consistent improvements in citation rate (17.3 percent) and subjective quality (18.5 percent), validating the effectiveness and generalizability of the proposed framework. This work establishes structural optimization as a foundational component of GEO, providing a data-driven methodology for enhancing content visibility in LLM-powered information ecosystems.

Keywords: Generative Engine Optimization, Structural Feature Engineering, Citation Behavior, LLM-powered Search, Content Visibility, Document Architecture, Information Chunking, Visual Emphasis

117. ❌ Rewrite the News: Tracing Editorial Reuse Across News Agencies

Authors: Soveatin Kuntur, Nina Smirnova, Anna Wroblewska, Philipp Mayr, Sebastijan Razboršek Maček | Journal/Source: arXiv | Published: 2026-03-31 | arXiv link: http://arxiv.org/abs/2603.29937v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

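Rationale: The paper studies sentence-level text reuse detection across news agencies, an application of natural language processing to journalism, but it does not touch on large models, deep learning principles, AI for science, or any other scored keyword. It uses a weakly supervised method for cross-lingual text matching, a conventional NLP task unrelated to the large-model techniques, training methods, inference optimization, AI agents, or scientific AI applications covered by the keywords.

!!! tip deepseek-chat TL;DR

The paper studies sentence-level text reuse across multilingual news agencies and develops a weakly supervised cross-lingual reuse detection method that does not require full translation. It finds reuse in 52% of Slovenian Press Agency (STA) articles, most of it non-literal, with reused content tending to appear in the middle and end of articles.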

Abstract

This paper investigates sentence-level text reuse in multilingual journalism, analyzing where reused content occurs within articles. We present a weakly supervised method for detecting sentence-level cross-lingual reuse without requiring full translations, designed to support automated pre-selection to reduce information overload for journalists (Holyst et al., 2024). The study compares English-language articles from the Slovenian Press Agency (STA) with reports from 15 foreign agencies (FA) in seven languages, using publication timestamps to retain the earliest likely foreign source for each reused sentence. We analyze 1,037 STA and 237,551 FA articles from two time windows (October 7-November 2, 2023; February 1-28, 2025) and identify 1,087 aligned sentence pairs after filtering to the earliest sources. Reuse occurs in 52% of STA articles and 1.6% of FA articles and is predominantly non-literal, involving paraphrase and compositional reuse from multiple sources. Reused content tends to appear in the middle and end of English articles, while leads are more often original, indicating that simple lexical matching overlooks substantial editorial reuse. Compared with prior work focused on monolingual overlap, we (i) detect reuse across languages without requiring full translation, (ii) use publication timing to identify likely sources, and (iii) analyze where reused material is situated within articles. Dataset and code: https://github.com/kunturs/lrec2026-rewrite-news.
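The timestamp step in the abstract (retaining the earliest likely foreign source for each reused sentence) can be sketched as a simple filter over candidate sentence alignments. The record fields, the agency labels, and the rule that a foreign article must predate the STA article are assumptions made for this illustration; the authors' actual implementation is in the repository linked above.

```python
from datetime import datetime

def earliest_sources(candidates):
    """Keep, per STA sentence, the aligned foreign sentence published earliest.

    candidates: list of dicts with hypothetical keys
      'sta_sentence', 'sta_time', 'fa_agency', 'fa_sentence', 'fa_time'
    """
    best = {}
    for c in candidates:
        # Assumption: a foreign article published at or after the STA
        # article cannot have been its source, so it is skipped.
        if c["fa_time"] >= c["sta_time"]:
            continue
        key = c["sta_sentence"]
        if key not in best or c["fa_time"] < best[key]["fa_time"]:
            best[key] = c
    return list(best.values())

# Toy candidates (hypothetical): three foreign matches for one STA sentence.
sta_time = datetime(2023, 10, 8, 12, 0)
candidates = [
    {"sta_sentence": "s1", "sta_time": sta_time, "fa_agency": "FA-1",
     "fa_sentence": "a", "fa_time": datetime(2023, 10, 8, 9, 0)},
    {"sta_sentence": "s1", "sta_time": sta_time, "fa_agency": "FA-2",
     "fa_sentence": "b", "fa_time": datetime(2023, 10, 8, 7, 30)},
    {"sta_sentence": "s1", "sta_time": sta_time, "fa_agency": "FA-3",
     "fa_sentence": "c", "fa_time": datetime(2023, 10, 8, 15, 0)},
]
kept = earliest_sources(candidates)
print([(k["sta_sentence"], k["fa_agency"]) for k in kept])
```

Keying on the STA sentence means each reused sentence ends up with at most one source, which matches the abstract's description of counting aligned pairs after filtering to the earliest sources.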

Keywords: text reuse, multilingual journalism, sentence-level detection, cross-lingual analysis, weakly supervised method, editorial reuse, news agencies, paraphrase detection

118. ❌ Less Is More? Selective Visual Attention to High-Importance Regions for Multimodal Radiology Summarization

Authors: Mst. Fahmida Sultana Naznin, Adnan Ibney Faruq, Mushfiqur Rahman, Niloy Kumar Mondal, Md. Mehedi Hasan Shawon, Md Rakibul Hasan | Journal/Source: arXiv | Published: 2026-03-31 | arXiv link: http://arxiv.org/abs/2603.29901v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

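Rationale: The paper focuses on multimodal automatic summarization of radiology reports and proposes a visual-text attention model (ViTAS) that improves performance through selective visual attention. Its core is the application of computer vision (ViT, MedSAM2, visual tokenization) and natural language processing (text summarization) to medicine, which falls under AI for Science (specifically medical imaging AI), so it is relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (8). However, it does not address large language models (LLMs), MoE, model training techniques (pre-training, fine-tuning, alignment), inference optimization, agents, model compression, or the other core large-model techniques and general methods described by the keywords, so those keywords score 0.

!!! tip deepseek-chat TL;DR

The paper addresses visual noise in multimodal automatic summarization of radiology reports and proposes ViTAS, a model that selectively attends to pathology-relevant visual regions. It achieves state-of-the-art results on the MIMIC-CXR benchmark and shows that fewer but more relevant visual inputs outperform full-image input.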

Abstract

Automated radiology report summarization aims to distill verbose findings into concise clinical impressions, but existing multimodal models often struggle with visual noise and fail to meaningfully improve over strong text-only baselines in the FINDINGS $\to$ IMPRESSION transformation. We challenge two prevailing assumptions: (1) that more visual input is always better, and (2) that multimodal models add limited value when findings already contain rich image-derived detail. Through controlled ablations on MIMIC-CXR benchmark, we show that selectively focusing on pathology-relevant visual patches rather than full images yields substantially better performance. We introduce ViTAS, Visual-Text Attention Summarizer, a multi-stage pipeline that combines ensemble-guided MedSAM2 lung segmentation, bidirectional cross-attention for multi-view fusion, Shapley-guided adaptive patch clustering, and hierarchical visual tokenization feeding a ViT. ViTAS achieves SOTA results with 29.25% BLEU-4 and 69.83% ROUGE-L, improved factual alignment in qualitative analysis, and the highest expert-rated human evaluation scores. Our findings demonstrate that less but more relevant visual input is not only sufficient but superior for multimodal radiology summarization.
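For readers unfamiliar with the ROUGE-L figure quoted above, the metric is based on the longest common subsequence (LCS) between a reference and a generated summary. The sketch below computes a balanced F1 variant with whitespace tokenization. Both choices are simplifying assumptions (published implementations differ in tokenization and in the recall weighting), the example sentences are invented, and this is not the evaluation code used in the paper.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l_f1(reference: str, candidate: str) -> float:
    """LCS-based F1 between a reference and a candidate summary."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# Invented example: 3 of the 4 reference tokens are recovered in order.
score = rouge_l_f1("no acute cardiopulmonary process", "no acute process")
print(f"{score:.4f}")  # prints "0.8571"
```

Because the LCS only requires tokens to appear in the same order, not contiguously, the candidate above is credited for all three of its tokens even though it skips a word of the reference.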

Keywords: multimodal radiology summarization, visual-text attention, selective visual attention, ViTAS, MIMIC-CXR, pathology-relevant patches, automated report generation, medical image analysis

119. ❌ FLEURS-Kobani: Extending the FLEURS Dataset for Northern Kurdish

Authors: Daban Q. Jaff, Mohammad Mohammadamini | Journal/Source: arXiv | Published: 2026-03-31 | arXiv link: http://arxiv.org/abs/2603.29892v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

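Rationale: The paper mainly concerns extending a speech dataset and the associated speech recognition and translation tasks, with baselines obtained by fine-tuning Whisper. It is unrelated to most keywords and has only some connection to "Post-training OR Supervised Fine-tuning OR SFT" (5), because it fine-tunes the Whisper model. The other keywords cover large-model principles, reasoning, alignment, compression, and similar topics, none of which appear in the paper.

!!! tip deepseek-chat TL;DR

The study extends the FLEURS dataset to Northern Kurdish, creating the FLEURS-Kobani benchmark, and provides baseline results for automatic speech recognition and end-to-end speech-to-text translation by fine-tuning Whisper.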

Abstract

FLEURS offers n-way parallel speech for 100+ languages, but Northern Kurdish is not one of them, which limits benchmarking for automatic speech recognition and speech translation tasks in this language. We present FLEURS-Kobani, a Northern Kurdish (ISO 639-3 KMR) spoken extension of the FLEURS benchmark. The FLEURS-Kobani dataset consists of 5,162 validated utterances, totaling 18 hours and 24 minutes. The data were recorded by 31 native speakers. It extends benchmark coverage to an under-resourced Kurdish variety. As baselines, we fine-tuned Whisper v3-large for ASR and E2E S2TT. A two-stage fine-tuning strategy (Common Voice to FLEURS-Kobani) yields the best ASR performance (WER 28.11, CER 9.84 on test). For E2E S2TT (KMR to EN), Whisper achieves 8.68 BLEU on test; we additionally report pivot-derived targets and a cascaded S2TT setup. FLEURS-Kobani provides the first public Northern Kurdish benchmark for evaluation of ASR, S2TT and S2ST tasks. The dataset is publicly released for research use under a CC BY 4.0 license.
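The WER and CER figures above are edit-distance rates: the minimum number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length, counted over words or characters. The sketch below shows the standard computation expressed as a percentage. The example strings are English placeholders, and counting spaces as characters in CER is one common convention assumed here, not something the abstract specifies.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref, hypothesis.split()) / len(ref)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent (spaces counted as characters)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)

# Placeholder transcripts: one wrong word out of three, one wrong character out of eleven.
print(f"WER={wer('the cat sat', 'the cat sit'):.2f}")  # prints "WER=33.33"
print(f"CER={cer('the cat sat', 'the cat sit'):.2f}")  # prints "CER=9.09"
```

The example also shows why CER is usually much lower than WER for the same output, as in the reported 28.11 versus 9.84: a single wrong character makes a whole word count as an error.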

Keywords: Northern Kurdish, FLEURS dataset, speech recognition, speech translation, Whisper model, fine-tuning, benchmark dataset, under-resourced language

120. ❌ SNEAK: Evaluating Strategic Communication and Information Leakage in Large Language Models

Authors: Adar Avsian, Larry Heck | Journal/Source: arXiv | Published: 2026-03-31 | arXiv link: http://arxiv.org/abs/2603.29846v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

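Scoring rationale: The paper's core topic is the strategic communication ability of LLMs in multi-agent settings, which directly matches the 'Large Language Models', 'LLM Agents', and 'Multi-agent Systems' keywords. It does not touch the remaining keywords, such as model architecture, training methods, or inference optimization.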

!!! tip deepseek-chat TL;DR

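The paper studies how large language models balance informativeness and secrecy when communicating strategically in multi-agent settings, and introduces the SNEAK benchmark, on which current models perform far below humans.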

Abstract

Large language models (LLMs) are increasingly deployed in multi-agent settings where communication must balance informativeness and secrecy. In such settings, an agent may need to signal information to collaborators while preventing an adversary from inferring sensitive details. However, existing LLM benchmarks primarily evaluate capabilities such as reasoning, factual knowledge, or instruction following, and do not directly measure strategic communication under asymmetric information. We introduce SNEAK (Secret-aware Natural language Evaluation for Adversarial Knowledge), a benchmark for evaluating selective information sharing in language models. In SNEAK, a model is given a semantic category, a candidate set of words, and a secret word, and must generate a message that indicates knowledge of the secret without revealing it too clearly. We evaluate generated messages using two simulated agents with different information states: an ally, who knows the secret and must identify the intended message, and a chameleon, who does not know the secret and attempts to infer it from the message. This yields two complementary metrics: utility, measuring how well the message communicates to collaborators, and leakage, measuring how much information it reveals to an adversary. Using this framework, we analyze the trade-off between informativeness and secrecy in modern language models and show that strategic communication under asymmetric information remains a challenging capability for current systems. Notably, human participants outperform all evaluated models by a large margin, achieving up to four times higher scores.

Keywords: Large Language Models, Multi-agent Systems, Strategic Communication, Information Leakage, Benchmark Evaluation, Asymmetric Information, SNEAK, Adversarial Knowledge

121. ❌ ENEIDE: A High Quality Silver Standard Dataset for Named Entity Recognition and Linking in Historical Italian

Authors: Cristian Santini, Sebastian Barzaghi, Paolo Sernani, Emanuele Frontoni, Laura Melosi, Mehwish Alam Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29801v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

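Scoring rationale: The paper introduces ENEIDE, a silver-standard dataset for named entity recognition and linking in historical Italian, covering its construction method, contents, and baseline experiments. It focuses on a specific NLP task (NERL) and on dataset creation rather than on new large-model or deep-learning techniques. Of all the keywords, only 'AI for Science OR Bioinformatics OR Cheminformatics' is somewhat related, because the paper applies AI to the humanities (historical text analysis), but this is not its core contribution, hence 5 points. All other keywords are unrelated to the paper's topic and receive 0.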

!!! tip deepseek-chat TL;DR

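The paper presents ENEIDE, a multi-domain silver-standard dataset for named entity recognition and linking in historical Italian, and uses baseline experiments to show how challenging it is and how large the gap is between zero-shot approaches and fine-tuned models.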

Abstract

This paper introduces ENEIDE (Extracting Named Entities from Italian Digital Editions), a silver standard dataset for Named Entity Recognition and Linking (NERL) in historical Italian texts. The corpus comprises 2,111 documents with over 8,000 entity annotations semi-automatically extracted from two scholarly digital editions: Digital Zibaldone, the philosophical diary of the Italian poet Giacomo Leopardi (1798–1837), and Aldo Moro Digitale, the complete works of the Italian politician Aldo Moro (1916–1978). Annotations cover multiple entity types (person, location, organization, literary work) linked to Wikidata identifiers, including NIL entities that cannot be mapped to the knowledge graph. To the best of our knowledge, ENEIDE represents the first multi-domain, publicly available NERL dataset for historical Italian with training, development, and test splits. We present a methodology for semi-automatic annotation extraction from manually curated scholarly digital editions, including quality control and annotation enhancement procedures. Baseline experiments using state-of-the-art models show that the dataset is challenging for NERL and reveal a gap between zero-shot approaches and fine-tuned models. The dataset's diachronic coverage spanning two centuries makes it particularly suitable for temporal entity disambiguation and cross-domain evaluation. ENEIDE is released under a CC BY-NC-SA 4.0 license.

Keywords: Named Entity Recognition, Named Entity Linking, Historical Italian, Silver Standard Dataset, Digital Editions, Wikidata, Temporal Entity Disambiguation, Cross-domain Evaluation

122. ❌ Training-Free Dynamic Upcycling of Expert Language Models

Authors: Eros Fanì, Oğuzhan Ersoy Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29765v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	10.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

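Scoring rationale: The paper centers on LLMs and the MoE architecture. It proposes DUME, which dynamically combines pre-trained expert models without additional training, so it is highly relevant to the LLMs, MoE, and Model Merging keywords. Because it reuses and fine-tunes expert models, it is partially related to Pre-training, Post-training, and PEFT. Other keywords, such as SLMs, Scaling Laws, and RAG, are not covered.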

!!! tip deepseek-chat TL;DR

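The paper proposes DUME, a training-free dynamic upcycling method for MoE that reuses pre-trained expert models from different domains to build a unified multitask model, largely preserving the original experts' performance while remaining cost-efficient and scalable.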

Abstract

Large Language Models (LLMs) have achieved remarkable performance on a wide range of specialized tasks, exhibiting strong problem-solving capabilities. However, training these models is prohibitively expensive, and they often lack domain-specific expertise because they rely on general knowledge datasets. Expertise finetuning can address this issue; however, it often leads to overspecialization, and developing a single multi-domain expert remains difficult due to diverging objectives. Furthermore, multitask training is challenging due to interference and catastrophic forgetting. Existing work proposes combining the expertise of dense models within a Mixture of Experts (MoE) architecture, although this approach still requires multitask finetuning. To address these issues, we introduce Dynamic Upcycling MoE (DUME), a novel approach that reuses dense experts trained on different domains to construct a unified MoE model. Our method builds a single multitask model that preserves the capabilities of the original dense experts without requiring additional training. DUME is both cost-efficient and scalable: by leveraging the closed-form solution of ridge regression, it eliminates the need for further optimization and enables experts to be added dynamically while maintaining the model’s original performance. We demonstrate that DUME consistently outperforms baseline approaches in both causal language modeling and reasoning settings. Finally, we also show that the DUME model can be fine-tuned to further improve performance. We show that, in the causal language modeling setting, DUME can retain up to 97.6% of a dense expert model specialized in one particular domain, and that it can also surpass it in the reasoning setting, where it can achieve 102.1% of the dense expert performance. Our code is available at: github.com/gensyn-ai/dume.

Keywords: Large Language Models, Mixture of Experts, Dynamic Upcycling, Training-Free, Domain Expertise, Multitask Model, Ridge Regression, Model Reuse
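
The abstract attributes DUME's training-free construction to the closed-form solution of ridge regression. As a reminder of what that closed form looks like in general, the standard derivation is sketched here. The symbols $X$ (inputs), $Y$ (targets), $W$ (weights), and $\lambda$ (regularization strength) are generic notation assumed for illustration. The abstract does not say how DUME sets up the regression, so this should not be read as the paper's formulation.

```latex
\begin{aligned}
\hat{W} &= \arg\min_{W} \; \lVert XW - Y \rVert_F^2 + \lambda \lVert W \rVert_F^2, \qquad \lambda > 0 \\
0 &= 2X^\top (X\hat{W} - Y) + 2\lambda \hat{W} \\
\hat{W} &= (X^\top X + \lambda I)^{-1} X^\top Y
\end{aligned}
```

Because $X^\top X + \lambda I$ is positive definite for any $\lambda > 0$, the inverse always exists, so the solution comes from a single linear solve with no iterative optimization.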

123. ❌ Near-Miss: Latent Policy Failure Detection in Agentic Workflows

Authors: Ella Rabinovich, David Boaz, Naama Zwerdling, Ateret Anaby-Tavor Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29665v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

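Scoring rationale: The paper's core topic is detecting latent policy failures in LLM-based agentic workflows. It is highly relevant to "LLM Agents/Agentic Workflow" and "Tool Use/Function Calling" (10 points each), because it specifically studies tool-calling decisions and policy compliance in agent workflows. It is also highly relevant to "Large Language Models" (10 points), because the agent systems studied are LLM-based. Other keywords, such as MoE, SFT, and RAG, are not addressed in the paper and receive 0.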

!!! tip deepseek-chat TL;DR

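The paper proposes a new method for detecting latent policy failures in LLM-based agentic workflows and finds that, even when the final outcome is correct, 8-17% of trajectories involving mutating tool calls contain latent failures in which the agent bypassed required policy checks.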

Abstract

Agentic systems for business process automation often require compliance with policies governing conditional updates to the system state. Evaluation of policy adherence in LLM-based agentic workflows is typically performed by comparing the final system state against a predefined ground truth. While this approach detects explicit policy violations, it may overlook a more subtle class of issues in which agents bypass required policy checks, yet reach a correct outcome due to favorable circumstances. We refer to such cases as near-misses or latent failures. In this work, we introduce a novel metric for detecting latent policy failures in agent conversation traces. Building on the ToolGuard framework, which converts natural-language policies into executable guard code, our method analyzes agent trajectories to determine whether the agent's tool-calling decisions were sufficiently informed. We evaluate our approach on the τ²-verified Airlines benchmark across several contemporary open and proprietary LLMs acting as agents. Our results show that latent failures occur in 8-17% of trajectories involving mutating tool calls, even when the final outcome matches the expected ground-truth state. These findings reveal a blind spot in current evaluation methodologies and highlight the need for metrics that assess not only final outcomes but also the decision process leading to them.

Keywords: agentic workflows, LLM-based agents, policy adherence, latent failures, tool-calling decisions, evaluation methodology, agent trajectories, near-misses
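
To make the near-miss idea concrete, here is a deliberately simplified sketch of one way such a check could be expressed: a trajectory counts as a near-miss when the final outcome is correct but a mutating tool call was issued before the read-only checks that a policy requires. Everything here is an assumption for illustration, including the `ToolCall` type, the `REQUIRED_CHECKS` table, and the tool names. It is not the ToolGuard API or the paper's metric, which the abstract does not describe in detail.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    mutating: bool = False  # True if the call changes system state


# Hypothetical policy table: read-only checks that must precede each mutating tool.
REQUIRED_CHECKS = {"cancel_reservation": {"get_reservation_details"}}


def is_near_miss(trajectory, outcome_correct):
    """Return True if the outcome is correct but a required check was skipped."""
    if not outcome_correct:
        return False  # an explicit failure, already caught by final-state comparison
    seen = set()
    for call in trajectory:
        if call.mutating:
            missing = REQUIRED_CHECKS.get(call.name, set()) - seen
            if missing:
                return True  # state was mutated without the required information
        seen.add(call.name)
    return False
```

A real detector would also need to inspect call arguments and results, since calling the right tool on the wrong record would not make the decision informed.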

124. ❌ Learning Diagnostic Reasoning for Decision Support in Toxicology

Authors: Nico Oberländer, David Bani-Harouni, Tobias Zellner, Nassir Navab, Florian Eyer, Matthias Keicher Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29608v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

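Scoring rationale: The paper's core contribution is applying LLMs to clinical decision support in toxicology, an application of large models to science (medicine). Highly relevant keywords: 1) 'Large Language Models' (the paper explicitly uses and fine-tunes LLMs; weight 1.0, relevance 10.0); 2) 'AI for Science' (toxicology is a biomedical science; weight 1.0, relevance 10.0). Partially relevant keyword: 'Hallucination Mitigation' (the paper penalizes model hallucination, but this is not its core method; weight 1.0, relevance 5.0). Other keywords, such as MoE, SFT, and RAG, are not covered and receive 0.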

!!! tip deepseek-chat TL;DR

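Addressing the uncertainty of diagnosing acute poly-substance intoxication, the study presents DeToxR, an LLM aligned with reinforcement learning that fuses unstructured clinical narratives with structured medical data and outperforms both its base LLM and an expert toxicologist at identifying the ingested substances.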

Abstract

Acute poly-substance intoxication requires rapid, life-saving decisions under substantial uncertainty, as clinicians must rely on incomplete ingestion details and nonspecific symptoms. Effective diagnostic reasoning in this chaotic environment requires fusing unstructured, non-medical narratives (e.g. paramedic scene descriptions and unreliable patient self-reports or known histories), with structured medical data like vital signs. While Large Language Models (LLMs) show potential for processing such heterogeneous inputs, they struggle in this setting, often underperforming simple baselines that rely solely on patient histories. To address this, we present DeToxR (Decision-support for Toxicology with Reasoning), the first adaptation of Reinforcement Learning (RL) to emergency toxicology. We design a robust data-fusion engine for multi-label prediction across 14 substance classes based on an LLM finetuned with Group Relative Policy Optimization (GRPO). We optimize the model’s reasoning directly using a clinical performance reward. By formulating a multi-label agreement metric as the reward signal, the model is explicitly penalized for missing co-ingested substances and hallucinating absent poisons. Our model significantly outperforms its unadapted base LLM counterpart and supervised baselines. Furthermore, in a clinical validation study, the model indicates a clinical advantage by outperforming an expert toxicologist in identifying the correct poisons (Micro-F1: 0.644 vs. 0.473). These results demonstrate the potential of RL-aligned LLMs to synthesize unstructured pre-clinical narratives and structured medical data for decision support in high-stakes environments.

Keywords: Large Language Models, Reinforcement Learning, Toxicology, Clinical Decision Support, Data Fusion, Multi-label Prediction, Hallucination Mitigation, AI for Science
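
The clinical comparison above is reported as Micro-F1 over multi-label predictions. A minimal sketch of that metric follows, assuming each case is represented as a set of substance labels. The function name and the label strings in the usage note are illustrative, and the paper's exact reward formulation is not given in the abstract.

```python
def micro_f1(true_sets, pred_sets):
    """Micro-averaged F1 over multi-label predictions, pooled across all cases."""
    tp = fp = fn = 0
    for truth, pred in zip(true_sets, pred_sets):
        tp += len(truth & pred)  # substances correctly identified
        fp += len(pred - truth)  # substances predicted but not ingested
        fn += len(truth - pred)  # co-ingested substances that were missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

For example, with truths `[{"opioid", "ethanol"}, {"benzodiazepine"}]` and predictions `[{"opioid"}, {"benzodiazepine", "ethanol"}]`, the pooled counts are two true positives, one false positive, and one false negative. Because both error types lower the score, a metric of this kind penalizes missed and hallucinated substances alike.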

125. ❌ When Can We Trust LLM Graders? Calibrating Confidence for Automated Assessment

Authors: Robinson Ferrer, Damla Turgut, Zhongzhou Chen, Shashank Sonkar Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29559v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

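Scoring rationale: The paper's core topic is the reliability of LLMs in automated grading, a direct application of LLM technology, so 'Large Language Models OR LLMs OR Foundation Models' receives 10 points (core content). The paper uses educational datasets, including the RiceChem chemistry dataset, which makes it an application of AI to science and education, so 'AI for Science OR Bioinformatics OR Cheminformatics' receives 5 points (partially related). Other keywords, such as MoE, SFT, RAG, reasoning methods, and model compression, are not mentioned in the abstract, are unrelated to the paper, and receive 0.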

!!! tip deepseek-chat TL;DR

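The paper studies how to predict when LLM automated grading is reliable. Comparing three confidence estimation methods, it finds that asking the LLM to self-report its confidence is the most practical and best-calibrated approach and can effectively identify reliable grading predictions.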

Abstract

Large Language Models (LLMs) show promise for automated grading, but their outputs can be unreliable. Rather than improving grading accuracy directly, we address a complementary problem: predicting when an LLM grader is likely to be correct. This enables selective automation where high-confidence predictions are processed automatically while uncertain cases are flagged for human review. We compare three confidence estimation methods (self-reported confidence, self-consistency voting, and token probability) across seven LLMs of varying scale (4B to 120B parameters) on three educational datasets: RiceChem (long-answer chemistry), SciEntsBank, and Beetle (short-answer science). Our experiments reveal that self-reported confidence consistently achieves the best calibration across all conditions (avg ECE 0.166 vs 0.229 for self-consistency). Surprisingly, self-consistency remains 38% worse despite requiring 5× the inference cost. Larger models exhibit substantially better calibration though gains vary by dataset and method (e.g., a 28% ECE reduction for self-reported), with GPT-OSS-120B achieving the best calibration (avg ECE 0.100) and strong discrimination (avg AUC 0.668). We also observe that confidence is strongly top-skewed across methods, creating a “confidence floor” that practitioners must account for when setting thresholds. These findings suggest that simply asking LLMs to report their confidence provides a practical approach for identifying reliable grading predictions. Code is available at https://github.com/sonkar-lab/llm_grading_calibration.

Keywords: Large Language Models, automated grading, confidence estimation, calibration, self-reported confidence, educational datasets, reliability prediction, selective automation
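
The calibration results above are reported as expected calibration error (ECE). The following is a minimal sketch of the common equal-width-bin formulation, assuming confidences in [0, 1] and boolean correctness labels. The function name and the default of 10 bins are assumptions; the abstract does not state the binning scheme the authors used.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sample-weighted mean gap between confidence and accuracy per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece
```

A grader that reports 0.9 confidence on four answers and gets three right has a gap of 0.9 − 0.75 = 0.15 in that bin. When confidences cluster near the top of the range, as the abstract notes, most samples fall into the last few bins, which is one reason thresholds for selective automation need to be set with that skew in mind.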

126. ❌ Can LLM Agents Identify Spoken Dialects like a Linguist?

Authors: Tobias Bystrich, Lukas Hamm, Maria Hassan, Lea Fischbach, Lucie Flek, Akbar Karimi Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29541v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

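Scoring rationale: The paper's core topic is the use of large language models (LLMs) as agents (LLM Agents) for dialect identification, so it is highly relevant to 'Large Language Models' and 'LLM Agents' (10 points each). The work is an application to linguistics, which is partially related to 'AI for Science' (5 points). Other keywords, such as MoE, SFT, and RAG, are not mentioned in the abstract and are unrelated (0 points).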

!!! tip deepseek-chat TL;DR

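The study explores how well LLM agents classify Swiss German dialects and finds that LLM predictions improve when linguistic information is provided and that automatically generated transcriptions can benefit classification.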

Abstract

Due to the scarcity of labeled dialectal speech, audio dialect classification is a challenging task for most languages, including Swiss German. In this work, we explore the ability of large language models (LLMs) as agents in understanding the dialects and whether they can show comparable performance to models such as HuBERT in dialect classification. In addition, we provide an LLM baseline and a human linguist one. Our approach uses phonetic transcriptions produced by ASR systems and combines them with linguistic resources such as dialect feature maps, vowel history, and rules. Our findings indicate that, when linguistic information is provided, the LLM predictions improve. The human baseline shows that automatically generated transcriptions can be beneficial for such classifications, but also presents opportunities for improvement.

Keywords: LLM agents, dialect classification, phonetic transcriptions, linguistic resources, Swiss German, audio dialect, human baseline, ASR systems

127. ❌ LLM Probe: Evaluating LLMs for Low-Resource Languages

Authors: Hailay Kidu Teklehaymanot, Gebrearegawi Gebremariam, Wolfgang Nejdl Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29517v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

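Scoring rationale: The paper's core topic is evaluating the linguistic abilities of large language models (LLMs) in low-resource language settings, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). The paper concentrates on the evaluation framework, benchmark dataset creation, and model performance analysis. It does not involve the specific model architectures (such as MoE or SLMs), training techniques (such as pre-training, fine-tuning, alignment, or PEFT), inference optimizations (such as RAG, attention mechanisms, or acceleration), advanced capabilities (such as reasoning, agents, or tool use), model optimizations (such as quantization or merging), or scientific-domain applications described by the other keywords, so all of those receive 0.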

!!! tip deepseek-chat TL;DR

该论文提出了一个名为LLM Probe的评估框架，用于系统评估大语言模型在低资源语言中的语言理解能力，并通过一个低资源闪米特语案例研究发现，序列到序列模型在形态句法分析和翻译上表现更好，而因果模型在词汇对齐上更强但翻译准确性较弱。

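Abstract (translated)

Although large language models (LLMs) are advancing rapidly, their linguistic abilities in low-resource and morphologically rich languages remain poorly understood, owing to limited annotated resources and the lack of a standardized evaluation framework. This paper proposes LLM Probe, a lexicon-based evaluation framework for systematically assessing the linguistic skills of LLMs in low-resource language settings. The framework analyzes model performance along four dimensions of language understanding: lexical alignment, part-of-speech recognition, morphosyntactic probing, and translation accuracy. To demonstrate the framework, we take a low-resource Semitic language as a case study and create a manually annotated benchmark dataset. The dataset contains bilingual lexicons with linguistic annotations covering part-of-speech tags, grammatical gender, and morphosyntactic features; inter-annotator agreement is high, which supports the reliability of the annotations. We test a range of models, including causal language models and sequence-to-sequence architectures. The results show marked performance differences across linguistic tasks: sequence-to-sequence models generally excel at morphosyntactic analysis and translation quality, whereas causal models are strong at lexical alignment but weaker in translation accuracy. Our findings underline the need for linguistically grounded evaluation to better understand the limitations of LLMs in low-resource settings. We release LLM Probe and the accompanying benchmark dataset as open-source tools to promote reproducible benchmarking and support the development of more inclusive multilingual technologies.

Abstract (original)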

Despite rapid advances in large language models (LLMs), their linguistic abilities in low-resource and morphologically rich languages are still not well understood due to limited annotated resources and the absence of standardized evaluation frameworks. This paper presents LLM Probe, a lexicon-based assessment framework designed to systematically evaluate the linguistic skills of LLMs in low-resource language environments. The framework analyzes models across four areas of language understanding: lexical alignment, part-of-speech recognition, morphosyntactic probing, and translation accuracy. To illustrate the framework, we create a manually annotated benchmark dataset using a low-resource Semitic language as a case study. The dataset comprises bilingual lexicons with linguistic annotations, including part-of-speech tags, grammatical gender, and morphosyntactic features, which demonstrate high inter-annotator agreement to ensure reliable annotations. We test a variety of models, including causal language models and sequence-to-sequence architectures. The results reveal notable differences in performance across various linguistic tasks: sequence-to-sequence models generally excel in morphosyntactic analysis and translation quality, whereas causal models demonstrate strong performance in lexical alignment but exhibit weaker translation accuracy. Our results emphasize the need for linguistically grounded evaluation to better understand LLM limitations in low-resource settings. We release LLM Probe and the accompanying benchmark dataset as open-source tools to promote reproducible benchmarking and to support the development of more inclusive multilingual language technologies.

Keywords: large language models, low-resource languages, evaluation framework, linguistic assessment, benchmark dataset, morphologically rich languages, lexicon-based probing, multilingual NLP

128. ❌ Calibrated Confidence Expression for Radiology Report Generation

Authors: David Bani-Harouni, Chantal Pellegrini, Julian Lüers, Su Hwan Kim, Markus Baalmann, Benedikt Wiestler, Rickmer Braren, Nassir Navab, Matthias Keicher Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29492v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

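Scoring rationale: The paper studies confidence calibration for large vision-language models (LVLMs) in radiology report generation, an application of large models in the medical domain. The most relevant keywords are: 1) ‘RLHF OR RLAIF OR Direct Preference Optimization OR DPO’ (10 points): the paper fine-tunes with GRPO (Group Relative Policy Optimization), a reinforcement learning algorithm that falls under RLHF-related techniques; 2) ‘Hallucination Mitigation OR Factuality OR Truthfulness’ (10 points): the goal is to reduce the risk of hallucinations and keep clinical decisions reliable; 3) ‘AI for Science OR Bioinformatics OR Cheminformatics’ (10 points): the work applies directly to radiology (medical image analysis), which falls within bioinformatics / AI for science; 4) ‘Large Language Models OR LLMs OR Foundation Models’ (8 points): it involves LVLMs, which are a type of large model; 5) ‘Self-Correction OR Self-Improvement OR Self-Reflection’ (5 points): the model performs self-assessment by expressing confidence, which is related to self-correction; 6) ‘Mechanistic Interpretability OR Explainable AI’ (5 points): confidence expression is meant to make model outputs more interpretable. Other keywords, such as MoE, quantization, and RAG, are not addressed in the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper addresses the overconfidence of large vision-language models in radiology report generation and proposes ConRad, a reinforcement learning framework for calibrating the confidence the model expresses. Experiments show that the method substantially improves calibration and supports safer clinical integration of AI.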

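Abstract (translated)

Safely deploying large vision-language models (LVLMs) for radiology report generation requires not only accurate predictions but also clinically interpretable indicators of when an output should be reviewed thoroughly, which enables selective verification by radiologists and lowers the risk that hallucinated findings affect clinical decisions. An intuitive approach is verbalized confidence, in which the model explicitly states how certain it is. However, current state-of-the-art language models are often overconfident, and calibration in multimodal settings such as radiology report generation has received little study. To fill this gap, we propose ConRad (Confidence Calibration for Radiology Reports), a reinforcement learning framework for fine-tuning medical LVLMs so that they output calibrated verbalized confidence estimates while generating radiology reports. We study two settings: a single report-level confidence score, and a sentence-level variant that assigns a confidence to each diagnostic claim. Both are trained with the GRPO algorithm using reward functions based on the logarithmic scoring rule, which encourages truthful self-assessment by penalizing miscalibration and guarantees optimal calibration under reward maximization. Experiments show that ConRad substantially improves calibration and outperforms existing methods. In a clinical evaluation, we show that ConRad's report-level scores agree closely with clinicians' judgment. By flagging full reports or low-confidence statements for targeted review, ConRad can support the safe clinical integration of AI-assisted report generation.

Abstract (original)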

Safe deployment of Large Vision-Language Models (LVLMs) in radiology report generation requires not only accurate predictions but also clinically interpretable indicators of when outputs should be thoroughly reviewed, enabling selective radiologist verification and reducing the risk of hallucinated findings influencing clinical decisions. One intuitive approach to this is verbalized confidence, where the model explicitly states its certainty. However, current state-of-the-art language models are often overconfident, and research on calibration in multimodal settings such as radiology report generation is limited. To address this gap, we introduce ConRad (Confidence Calibration for Radiology Reports), a reinforcement learning framework for fine-tuning medical LVLMs to produce calibrated verbalized confidence estimates alongside radiology reports. We study two settings: a single report-level confidence score and a sentence-level variant assigning a confidence to each claim. Both are trained using the GRPO algorithm with reward functions based on the logarithmic scoring rule, which incentivizes truthful self-assessment by penalizing miscalibration and guarantees optimal calibration under reward maximization. Experimentally, ConRad substantially improves calibration and outperforms competing methods. In a clinical evaluation we show that ConRad’s report level scores are well aligned with clinicians’ judgment. By highlighting full reports or low-confidence statements for targeted review, ConRad can support safer clinical integration of AI-assistance for report generation.

Keywords: Large Vision-Language Models, Radiology Report Generation, Confidence Calibration, Reinforcement Learning, GRPO Algorithm, Hallucination Mitigation, Clinical Evaluation, Verbalized Confidence
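
The ConRad abstract says its rewards are based on the logarithmic scoring rule, which rewards truthful confidence. The sketch below is not the paper's reward function (the abstract does not give it); it is the textbook binary log score, with hypothetical function names, shown only to illustrate why such a rule favors calibrated confidence: the expected reward is highest when the stated confidence equals the true probability of being correct.

```python
import math

def log_score_reward(confidence: float, is_correct: bool, eps: float = 1e-6) -> float:
    """Binary logarithmic scoring rule: log(p) if correct, log(1 - p) otherwise."""
    p = min(max(confidence, eps), 1.0 - eps)  # clip so log(0) never occurs
    return math.log(p) if is_correct else math.log(1.0 - p)

def expected_reward(stated: float, true_accuracy: float) -> float:
    """Expected reward when the claim is correct with probability true_accuracy."""
    return (true_accuracy * log_score_reward(stated, True)
            + (1.0 - true_accuracy) * log_score_reward(stated, False))

# Hypothetical scenario: claims of this kind are correct 70% of the time.
candidates = [0.5, 0.7, 0.9, 0.99]
best = max(candidates, key=lambda c: expected_reward(c, 0.7))
print(best)  # the calibrated value wins among the candidates
```

Because the log score is a strictly proper scoring rule, both overclaiming (0.99) and underclaiming (0.5) lower the expected reward in this toy setting.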

129. ❌ Authorship Impersonation via LLM Prompting does not Evade Authorship Verification Methods

Authors: Baoyi Zeng, Andrea Nini Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29454v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

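Scoring rationale: The paper's core question is how LLMs can be used to generate text that impersonates an author, and whether that text can defeat authorship verification systems, so it is highly relevant to ‘Large Language Models OR LLMs OR Foundation Models’ (10 points). The paper does not engage with the technical details, methods, or applications behind the other keywords (e.g., MoE, SLMs, training techniques, inference optimization, agent systems, model compression), so all other keywords receive 0 points.

!!! tip deepseek-chat TL;DR

The study examines whether text generated by a large language model (GPT-4o) to imitate a specific author's style can evade existing authorship verification systems. The results show that LLM-generated text does not effectively reproduce an author's individual characteristics, and that current verification systems are robust to such entry-level impersonation attempts.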

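Abstract (translated)

Authorship verification (AV), the task of determining whether a disputed text was written by a specific individual, is an important part of forensic linguistics. Manual authorial impersonation by perpetrators has long been a recognized threat in historical forensic cases, but recent advances in large language models (LLMs) bring new challenges, because attackers may use these tools to imitate another person's writing style. This study examines whether prompted LLMs can generate convincing authorial impersonations and whether such outputs can evade existing forensic AV systems. Using GPT-4o as the attack model, we generated impersonation texts under four prompting conditions across three genres: emails, text messages, and social media posts. We then evaluated these outputs within a likelihood-ratio framework using non-neural methods (n-gram tracing, the Ranking-Based Impostors Method, LambdaG) and neural methods (AdHominem, LUAR, STAR). The results show that LLM-generated texts did not reproduce authorial individuality well enough to bypass established AV systems. We also observed that some methods rejected impersonation texts even more accurately than they rejected genuine negative samples. Overall, these findings indicate that, although LLMs are easy to access, current AV systems remain robust against entry-level impersonation attempts across multiple genres. In addition, we show that this counter-intuitive robustness stems, at least in part, from the higher lexical diversity and entropy inherent in LLM-generated text.

Abstract (original)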

Authorship verification (AV), the task of determining whether a questioned text was written by a specific individual, is a critical part of forensic linguistics. While manual authorial impersonation by perpetrators has long been a recognized threat in historical forensic cases, recent advances in large language models (LLMs) raise new challenges, as adversaries may exploit these tools to impersonate another’s writing. This study investigates whether prompted LLMs can generate convincing authorial impersonations and whether such outputs can evade existing forensic AV systems. Using GPT-4o as the adversary model, we generated impersonation texts under four prompting conditions across three genres: emails, text messages, and social media posts. We then evaluated these outputs against both non-neural AV methods (n-gram tracing, Ranking-Based Impostors Method, LambdaG) and neural approaches (AdHominem, LUAR, STAR) within a likelihood-ratio framework. Results show that LLM-generated texts failed to sufficiently replicate authorial individuality to bypass established AV systems. We also observed that some methods achieved even higher accuracy when rejecting impersonation texts compared to genuine negative samples. Overall, these findings indicate that, despite the accessibility of LLMs, current AV systems remain robust against entry-level impersonation attempts across multiple genres. Furthermore, we demonstrate that this counter-intuitive resilience stems, at least in part, from the higher lexical diversity and entropy inherent in LLM-generated texts.

Keywords: Authorship Verification, Large Language Models, LLM-generated Text, Forensic Linguistics, Authorial Impersonation, GPT-4o, Likelihood-ratio Framework, Lexical Diversity
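
The abstract attributes part of the verification systems' robustness to the higher lexical diversity and entropy of LLM-generated text. The abstract does not say how the authors measured these quantities; the sketch below uses two common proxies, type-token ratio and unigram Shannon entropy, with naive whitespace tokenization and made-up example sentences, all of which are assumptions for illustration.

```python
import math
from collections import Counter

def type_token_ratio(tokens: list[str]) -> float:
    """Distinct tokens divided by total tokens (a simple lexical diversity proxy)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def unigram_entropy(tokens: list[str]) -> float:
    """Shannon entropy, in bits, of the empirical unigram distribution."""
    n = len(tokens)
    if n == 0:
        return 0.0
    counts = Counter(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Made-up examples: a repetitive message versus a more varied one.
repetitive = "ok see you soon ok see you soon".split()
varied = "thanks, catch up with you again sometime next week".split()

print(type_token_ratio(repetitive), unigram_entropy(repetitive))
print(type_token_ratio(varied), unigram_entropy(varied))
```

Type-token ratio depends strongly on text length, so a real comparison would need length-matched samples or a length-robust measure.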

130. ❌ CounselReflect: A Toolkit for Auditing Mental-Health Dialogues

Authors: Yahan Li, Chaohao Du, Zeyang Li, Christopher Chun Kuizon, Shupeng Cheng, Angel Hsing-Chi Hwang, Adam C. Frank, Ruishan Liu Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29429v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

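Scoring rationale: The paper presents an LLM-based toolkit for auditing mental-health dialogues. It is highly relevant (10 points) to the LLM keyword (the abstract explicitly mentions LLM-based tools and LLM judges), to AI for Science (mental-health applications are treated as a scientific domain), and to Explainable AI (the toolkit produces transparent, structured audit reports). It has some connection (5 points) to Self-Correction and Hallucination Mitigation, since an auditing tool may indirectly involve improving dialogue quality and checking factuality. Other keywords, such as MoE, SFT, and RAG, are not mentioned in the paper and are unrelated (0 points).

!!! tip deepseek-chat TL;DR

To address the lack of transparent auditing for mental-health support dialogues, the paper develops CounselReflect, an end-to-end toolkit that combines model-based metrics with rubric-based evaluation to produce structured, multi-dimensional audit reports, supporting understandable, usable, and trustworthy inspection of dialogue quality.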

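Abstract (translated)

Mental-health support is increasingly mediated by conversational systems (for example, LLM-based tools), but users often lack a structured way to audit the quality and potential risks of the support they receive. This paper introduces CounselReflect, an end-to-end toolkit for auditing mental-health support dialogues. Instead of producing a single opaque quality score, CounselReflect provides structured, multi-dimensional reports that contain session-level summaries, turn-level scores, and evidence-linked excerpts to support transparent inspection. The system integrates two families of evaluation signals: (i) 12 model-based metrics produced by task-specific predictors, and (ii) rubric-based metrics, which extend coverage through a literature-derived metric library (69 metrics) and user-defined metrics, and are operationalized with configurable LLM judges. CounselReflect is available as a web application, a browser extension, and a command-line interface (CLI), supporting both real-time use and use at scale. Human evaluation comprises a user study with 20 participants and an expert review by 6 mental-health professionals; the results suggest that CounselReflect provides understandable, usable, and trustworthy auditing support. A demo video and the full source code are also provided.

Abstract (original)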

Mental-health support is increasingly mediated by conversational systems (e.g., LLM-based tools), but users often lack structured ways to audit the quality and potential risks of the support they receive. We introduce CounselReflect, an end-to-end toolkit for auditing mental-health support dialogues. Rather than producing a single opaque quality score, CounselReflect provides structured, multi-dimensional reports with session-level summaries, turn-level scores, and evidence-linked excerpts to support transparent inspection. The system integrates two families of evaluation signals: (i) 12 model-based metrics produced by task-specific predictors, and (ii) rubric-based metrics that extend coverage via a literature-derived library (69 metrics) and user-defined custom metrics, operationalized with configurable LLM judges. CounselReflect is available as a web application, browser extension, and command-line interface (CLI), enabling use in real-time settings as well as at scale. Human evaluation includes a user study with 20 participants and an expert review with 6 mental-health professionals, suggesting that CounselReflect supports understandable, usable, and trustworthy auditing. A demo video and full source code are also provided.

Keywords: mental-health support, conversational systems, LLM-based tools, auditing toolkit, structured evaluation, transparent inspection, LLM judges, human evaluation

131. ❌ PRISM: PRIor from corpus Statistics for topic Modeling

Authors: Tal Ishon, Yoav Goldberg, Uri Shaham Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29406v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

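Scoring rationale: PRISM is a topic-modeling method built on the LDA framework, with an initialization derived from corpus statistics. It is unrelated to most of the large-model keywords (e.g., LLMs, MoE, fine-tuning, inference optimization). The only point of contact is ‘AI for Science OR Bioinformatics OR Cheminformatics’: the paper runs experiments on single-cell RNA-seq data, which is a bioinformatics application, but this is not its core (the core is text topic modeling), so it receives 5 points (some relevance). No other keyword is addressed.

!!! tip deepseek-chat TL;DR

The paper proposes PRISM, which initializes LDA with a Dirichlet parameter derived from word co-occurrence statistics in order to improve topic coherence and interpretability, and validates it on text and single-cell RNA-seq data.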

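Abstract (translated)

Topic modeling aims to uncover latent semantic structure in text, and LDA provides a foundational probabilistic framework for it. Existing methods often bring in external knowledge (such as pre-trained embeddings), but this dependence limits their applicability in emerging or underexplored domains. We propose PRISM, a corpus-intrinsic technique that initializes LDA with a Dirichlet parameter derived from word co-occurrence statistics, without changing LDA's generative process. Experiments on text and single-cell RNA-seq data show that PRISM improves topic coherence and interpretability, with performance comparable to models that rely on external knowledge. These results highlight the value of corpus-driven initialization for topic modeling in resource-constrained settings. Code is available at: https://github.com/shaham-lab/PRISM.

Abstract (original)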

Topic modeling seeks to uncover latent semantic structure in text, with LDA providing a foundational probabilistic framework. While recent methods often incorporate external knowledge (e.g., pre-trained embeddings), such reliance limits applicability in emerging or underexplored domains. We introduce PRISM, a corpus-intrinsic method that derives a Dirichlet parameter from word co-occurrence statistics to initialize LDA without altering its generative process. Experiments on text and single cell RNA-seq data show that PRISM improves topic coherence and interpretability, rivaling models that rely on external knowledge. These results underscore the value of corpus-driven initialization for topic modeling in resource-constrained settings. Code is available at: https://github.com/shaham-lab/PRISM.

Keywords: topic modeling, LDA, Dirichlet prior, corpus-intrinsic method, word co-occurrence statistics, topic coherence, single cell RNA-seq, PRISM

132. ❌ Is my model perplexed for the right reason? Contrasting LLMs’ Benchmark Behavior with Token-Level Perplexity

Authors: Zoë Prins, Samuele Punzo, Frank Wildenburg, Giovanni Cinà, Sandro Pezzelle Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29396v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

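Scoring rationale: The paper studies the evaluation and interpretability of LLMs, proposing a framework based on token-level perplexity to analyze whether models rely on linguistically relevant cues. It is highly relevant to ‘Large Language Models’ (10 points) because the whole paper is about evaluating LLMs, and to ‘Mechanistic Interpretability’ (10 points) because it studies models' internal mechanisms and interpretability methods. Other keywords, such as MoE, SFT, and RAG, concern specific techniques or applications that the paper does not address, so they receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes a token-level perplexity framework for testing whether large language models rely on linguistically relevant cues. Experiments find that model behavior is not fully explained by the expected linguistic cues, which indicates that the models also use other heuristics.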

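Abstract (translated)

Standard evaluations of large language models (LLMs) focus on task performance, which gives little insight into whether correct behavior reflects appropriate underlying mechanisms and carries a risk of confirmation bias. We propose a simple, principled interpretability framework based on token-level perplexity to test whether models rely on linguistically relevant cues. By comparing perplexity distributions over minimal sentence pairs that differ in only one or a few ‘pivotal’ tokens, the method enables precise, hypothesis-driven analysis without relying on unstable feature-attribution techniques. Controlled linguistic benchmark experiments on several open-weight LLMs show that, although linguistically important tokens influence model behavior, they never fully explain the shifts in perplexity, which reveals that models rely on heuristics beyond the expected linguistic cues.

Abstract (original)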

Standard evaluations of large language models (LLMs) focus on task performance, offering limited insight into whether correct behavior reflects appropriate underlying mechanisms and risking confirmation bias. We introduce a simple, principled interpretability framework based on token-level perplexity to test whether models rely on linguistically relevant cues. By comparing perplexity distributions over minimal sentence pairs differing in one or a few ‘pivotal’ tokens, our method enables precise, hypothesis-driven analysis without relying on unstable feature-attribution techniques. Experiments on controlled linguistic benchmarks with several open-weight LLMs show that, while linguistically important tokens influence model behavior, they never fully explain perplexity shifts, revealing that models rely on heuristics other than the expected linguistic ones.

Keywords: Large language models, interpretability, token-level perplexity, linguistic cues, benchmark evaluation, model behavior analysis, minimal sentence pairs, heuristics
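
The abstract describes comparing token-level perplexity across minimal sentence pairs that differ only in a few pivotal tokens. The sketch below is one simple way to pose that question, not the paper's procedure: given per-token log-probabilities (natural log) for both sentences of a pair, it computes how much of the total surprisal gap falls on the pivotal positions. The function names, the equal-length assumption, and the log-probability values are all hypothetical.

```python
import math

def perplexity(logprobs: list[float]) -> float:
    """Sentence perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(logprobs) / len(logprobs))

def pivotal_share(good_lp: list[float], bad_lp: list[float], pivot_idx: list[int]) -> float:
    """Fraction of the total surprisal gap located at the pivotal token positions.

    Assumes both sentences tokenize to the same length, so positions align.
    """
    if len(good_lp) != len(bad_lp):
        raise ValueError("minimal pair must have aligned token positions")
    # Surprisal is -logprob, so the per-token gap (bad minus good) is g - b.
    gaps = [g - b for g, b in zip(good_lp, bad_lp)]
    total = sum(gaps)
    if total == 0:
        return 0.0
    return sum(gaps[i] for i in pivot_idx) / total

# Hypothetical log-probs for an acceptable sentence and its minimally different
# unacceptable counterpart; position 2 holds the pivotal token.
good = [-1.0, -2.0, -0.5, -1.5]
bad = [-1.0, -2.0, -3.5, -2.5]

print(perplexity(good), perplexity(bad))
print(pivotal_share(good, bad, [2]))
```

A share below 1.0, as in this toy pair, means that part of the perplexity shift occurs at non-pivotal positions, which is the kind of residue the abstract reports.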

133. ❌ Developing a Guideline for the Labovian-Structural Analysis of Oral Narratives in Japanese

Authors: Amane Watahiki, Tomoki Doi, Akari Kikuchi, Hiroshi Ohata, Yuki I. Nakata, Takuya Niikawa, Taiga Shinozaki, Hitomi Yanaka Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29347v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

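Scoring rationale: The paper develops qualitative research methodology for Japanese narrative analysis, specifically guidelines for applying the Labovian model to Japanese. It covers linguistics, narrative analysis, annotation guidelines, and dataset creation, and does not involve large models, deep learning, AI techniques, or any computational method. All scoring keywords concern large-model technology, whereas this is a purely linguistic methodology study, so every keyword has a relevance of 0 points.

!!! tip deepseek-chat TL;DR

The study develops the first systematic guidelines for Labovian structural analysis of Japanese narrative data, addressing the fact that existing English datasets do not fit Japanese grammar and discourse conventions, and validates the guidelines through an annotation experiment.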

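Abstract (translated)

Narrative analysis is a cornerstone of qualitative research. The Labovian narrative model is one of the leading approaches, but applying it is labor-intensive: it requires a holistic, recursive interpretive process that moves back and forth between individual parts of a transcript and the transcript as a whole. Existing Labovian annotated datasets are available only for English, and Japanese differs markedly from English in grammar and discourse conventions. To fill this gap, we present the first systematic annotation guidelines for Labovian narrative analysis of Japanese narrative data. The guidelines retain all six categories of the Labovian model and extend the framework with explicit clause segmentation rules designed for Japanese constructions. They also cover a broader range of clause types and narrative types. Using the guidelines, annotators reached high agreement on clause segmentation (Fleiss' kappa = 0.80) and moderate agreement on two structural classification tasks (Krippendorff's alpha = 0.41 and 0.45, respectively); for one of these tasks, agreement was slightly higher than in prior work despite finer-grained distinctions. The paper describes the Labovian model, the proposed guidelines, the annotation process, and their utility. It closes by discussing the challenges encountered during annotation and the prospects for building a larger dataset for structural narrative analysis in Japanese qualitative research.

Abstract (original)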

Narrative analysis is a cornerstone of qualitative research. One leading approach is the Labovian model, but its application is labor-intensive, requiring a holistic, recursive interpretive process that moves back and forth between individual parts of the transcript and the transcript as a whole. Existing Labovian datasets are available only in English, which differs markedly from Japanese in terms of grammar and discourse conventions. To address this gap, we introduce the first systematic guidelines for Labovian narrative analysis of Japanese narrative data. Our guidelines retain all six Labovian categories and extend the framework by providing explicit rules for clause segmentation tailored to Japanese constructions. In addition, our guidelines cover a broader range of clause types and narrative types. Using these guidelines, annotators achieved high agreement in clause segmentation (Fleiss’ kappa = 0.80) and moderate agreement in two structural classification tasks (Krippendorff’s alpha = 0.41 and 0.45, respectively), one of which is slightly higher than that found in prior work despite the use of finer-grained distinctions. This paper describes the Labovian model, the proposed guidelines, the annotation process, and their utility. It concludes by discussing the challenges encountered during the annotation process and the prospects for developing a larger dataset for structural narrative analysis in Japanese qualitative research.

Keywords: Labovian model, narrative analysis, Japanese narratives, clause segmentation, annotation guidelines, qualitative research, structural classification, discourse conventions
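
The abstract reports clause-segmentation agreement as Fleiss' kappa = 0.80. For readers unfamiliar with the statistic, the sketch below implements the standard Fleiss' kappa formula on a count matrix. The rating counts are invented for illustration and are unrelated to the paper's data; the function name is likewise an assumption.

```python
def fleiss_kappa(ratings: list[list[int]]) -> float:
    """Fleiss' kappa for a matrix where ratings[i][j] is the number of
    annotators who assigned item i to category j.

    Every item must be rated by the same number of annotators (at least 2).
    Undefined (division by zero) if all ratings fall into a single category.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # Observed agreement: share of agreeing annotator pairs, averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_observed = sum(p_items) / n_items
    # Chance agreement from the overall category proportions.
    p_cats = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_expected = sum(p * p for p in p_cats)
    return (p_observed - p_expected) / (1.0 - p_expected)

# Invented counts: 4 items, 3 annotators, 2 categories (e.g. boundary / no boundary).
unanimous = [[3, 0], [0, 3], [3, 0], [0, 3]]
split = [[2, 1], [1, 2], [2, 1], [1, 2]]

print(fleiss_kappa(unanimous))
print(fleiss_kappa(split))
```

Kappa equals 1 under unanimous agreement and drops to zero or below when annotators agree no more often than chance would predict.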

134. ❌ L-ReLF: A Framework for Lexical Dataset Creation

Authors: Anass Sedrati, Mounir Afifi, Reda Benkhadra Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29346v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

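Scoring rationale: The paper presents a methodological framework for creating structured lexical datasets for low-resource languages, covering OCR, data standardization, and compatibility with Wikidata Lexemes. All scoring keywords relate directly to large models, the principles of deep learning techniques, or applications of AI in science, whereas this work belongs to traditional NLP data engineering and involves no large-model technology, deep learning innovation, or AI for Science application, so every keyword has a relevance of 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes the L-ReLF framework for creating high-quality structured lexical datasets for low-resource languages, addressing the lack of standardized terminology and providing a reproducible technical pipeline.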

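Abstract (translated)

This paper introduces L-ReLF (Low-Resource Lexical Framework), a novel and reproducible methodology for creating high-quality, structured lexical datasets for under-resourced languages. Taking Moroccan Darija as an example, the lack of standardized terminology is a critical barrier to knowledge equity on platforms such as Wikipedia, and often forces editors to rely on inconsistent, ad hoc methods to coin new words in their language. Our study details the technical pipeline developed to overcome these challenges. We systematically address the difficulties of working with low-resource data, including source identification, the use of Optical Character Recognition (OCR) despite its bias toward Modern Standard Arabic, and rigorous post-processing to correct errors and standardize the data model. The resulting structured dataset is fully compatible with Wikidata Lexemes and serves as an important technical resource. The L-ReLF methodology is designed to generalize, giving other language communities a clear path to building foundational lexical data for downstream NLP applications such as machine translation and morphological analysis.

Abstract (original)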

This paper introduces the L-ReLF (Low-Resource Lexical Framework), a novel, reproducible methodology for creating high-quality, structured lexical datasets for underserved languages. The lack of standardized terminology, exemplified by Moroccan Darija, poses a critical barrier to knowledge equity in platforms like Wikipedia, often forcing editors to rely on inconsistent, ad-hoc methods to create new words in their language. Our research details the technical pipeline developed to overcome these challenges. We systematically address the difficulties of working with low-resource data, including source identification, utilizing Optical Character Recognition (OCR) despite its bias towards Modern Standard Arabic, and rigorous post-processing to correct errors and standardize the data model. The resulting structured dataset is fully compatible with Wikidata Lexemes, serving as a vital technical resource. The L-ReLF methodology is designed for generalizability, offering other language communities a clear path to build foundational lexical data for downstream NLP applications, such as Machine Translation and morphological analysis.

Keywords: lexical dataset creation, low-resource languages, structured dataset, Wikidata Lexemes, OCR processing, data standardization, NLP applications, Moroccan Darija

135. ❌ Open Machine Translation for Esperanto

Authors: Ona de Gibert, Lluís de Gibert; Source: arXiv; Published: 2026-03-31; arXiv link: http://arxiv.org/abs/2603.29345v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

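Scoring rationale: The paper evaluates machine translation systems for Esperanto, covering rule-based systems, encoder-decoder models, and LLMs. Relevance to LLMs is high (8 points) because the paper explicitly evaluates LLMs and fine-tunes a general-purpose LLM. Relevance to SFT is moderate (5 points) because fine-tuning an LLM is involved. The remaining keywords, such as MoE, SLMs, and Scaling Laws, are unrelated to the paper (0 points).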
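The table above lists a relevance of 8.0/10 and 5.0/10 for two keywords, yet every score is 0.0. A minimal sketch of why, assuming each keyword's score is simply its weight multiplied by its relevance and the paper total is the sum of those products (this rule is inferred from the column layout and is not stated in the report; the function names are hypothetical):

```python
# Sketch of the per-keyword scoring implied by the table columns.
# Assumption: score = weight * relevance, summed over all keywords.

def keyword_score(weight: float, relevance: float) -> float:
    """Score contributed by a single keyword row."""
    return weight * relevance


def total_score(rows) -> float:
    """Sum the scores of (weight, relevance) rows."""
    return sum(keyword_score(w, r) for w, r in rows)


# The two rows with non-zero relevance in the table above;
# all other rows have relevance 0.0 and contribute nothing.
rows = [(0.0, 8.0), (0.0, 5.0)]

PASS_MARK = 26.6  # threshold shown on the score line of this entry
total = total_score(rows)
passed = total >= PASS_MARK
print(f"total={total:.1f}, passed={passed}")
```

Under this assumption, the 0.0 total comes from the 0.0 weights in the table rather than from the relevance judgments themselves.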

!!! tip deepseek-chat TL;DR

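The paper presents the first comprehensive evaluation of open-source machine translation systems for Esperanto, covering rule-based systems, encoder-decoder models, and LLMs. The NLLB family performs best in all language pairs, followed closely by the authors' trained compact models and a fine-tuned general-purpose LLM.

Abstract (Chinese translation)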

世界语是一种广泛使用的人造语言，以其规则的语法和高效的构词能力而闻名。尽管其在线社区提供了丰富的语言资源，但在现代机器翻译（MT）方法的背景下，世界语仍相对缺乏深入探索。本研究首次对世界语的开源机器翻译系统进行了全面评估，比较了基于规则的系统、编码器-解码器模型以及不同规模的LLMs（大语言模型）。我们使用多种自动评估指标并结合人工评估，在涉及英语、西班牙语、加泰罗尼亚语和世界语的六个语言方向上评估了翻译质量。结果显示，NLLB系列模型在所有语言对中均取得了最佳性能，紧随其后的是我们训练的紧凑模型和经过微调的通用LLM。人工评估证实了这一趋势，约半数对比中NLLB的翻译更受青睐，尽管其中仍存在明显错误。秉承世界语开放与国际协作的传统，我们将公开代码及性能最佳的模型。

Abstract

Esperanto is a widespread constructed language, known for its regular grammar and productive word formation. Besides having substantial resources available thanks to its online community, it remains relatively underexplored in the context of modern machine translation (MT) approaches. In this work, we present the first comprehensive evaluation of open-source MT systems for Esperanto, comparing rule-based systems, encoder-decoder models, and LLMs across model sizes. We evaluate translation quality across six language directions involving English, Spanish, Catalan, and Esperanto using multiple automatic metrics as well as human evaluation. Our results show that the NLLB family achieves the best performance in all language pairs, followed closely by our trained compact models and a fine-tuned general-purpose LLM. Human evaluation confirms this trend, with NLLB translations preferred in approximately half of the comparisons, although noticeable errors remain. In line with Esperanto’s tradition of openness and international collaboration, we release our code and best-performing models publicly.

Keywords: Machine Translation, Esperanto, Large Language Models, Open-source Evaluation, NLLB, Fine-tuning, Human Evaluation, Translation Quality

136. ❌ CADEL: A Corpus of Administrative Web Documents for Japanese Entity Linking

Authors: Shohei Higashiyama, Masao Ideuchi, Masao Utiyama; Source: arXiv; Published: 2026-03-31; arXiv link: http://arxiv.org/abs/2603.29336v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

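Scoring rationale: The paper builds an annotated corpus (CADEL) for Japanese entity linking, which is resource-construction work in NLP. It has no direct connection to any of the scoring keywords, all of which concern large-model or deep learning techniques, training methods, optimization, or application areas such as AI for Science. The paper mentions no large models, training methods, inference optimization, or scientific AI applications, only the entity linking task and corpus construction, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

To address the shortage of evaluation resources for Japanese entity linking, the study develops a corpus design policy and builds CADEL, an annotated Japanese entity linking corpus with rich coverage of expressions referring to Japan-specific entities. An inter-annotator agreement evaluation and a preliminary experiment support its potential usefulness as an evaluation benchmark.

Abstract (Chinese translation)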

实体链接任务旨在将语言表达式与知识库中代表现实世界实体和概念的条目相关联。目前该任务的语言资源主要围绕英语开发，可用于评估日语系统的资源仍然有限。本研究针对实体链接任务制定了语料库设计规范，并构建了一个带标注的语料库，用于训练和评估日语实体链接系统。该语料库广泛涵盖了指向日本特有实体的语言表达式。标注者间一致性评估证实了语料库标注结果具有高度一致性，基于字符串匹配的实体消歧初步实验表明，该语料库包含大量非平凡案例，这支持了其作为评估基准的潜在实用价值。

Abstract

Entity linking is the task of associating linguistic expressions with entries in a knowledge base that represent real-world entities and concepts. Language resources for this task have primarily been developed for English, and the resources available for evaluating Japanese systems remain limited. In this study, we develop a corpus design policy for the entity linking task and construct an annotated corpus for training and evaluating Japanese entity linking systems, with rich coverage of linguistic expressions referring to entities that are specific to Japan. Evaluation of inter-annotator agreement confirms the high consistency of the annotations in the corpus, and a preliminary experiment on entity disambiguation based on string matching suggests that the corpus contains a substantial number of non-trivial cases, supporting its potential usefulness as an evaluation benchmark.

Keywords: Entity Linking, Japanese, Corpus Construction, Annotation, Knowledge Base, Evaluation Benchmark, Administrative Documents, Linguistic Expressions

137. ❌ MemRerank: Preference Memory for Personalized Product Reranking

Authors: Zhiyuan Peng, Xuyang Wu, Huaixiao Tou, Yi Fang, Yi Gong; Source: arXiv; Published: 2026-03-31; arXiv link: http://arxiv.org/abs/2603.29247v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

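Scoring rationale: The paper's core topic is LLM-based shopping agents and the preference memory framework MemRerank, so it is highly relevant to "Large Language Models" and "LLM Agents" (10 points each): it explicitly uses LLMs as the underlying technology and builds an agent system. Other keywords, such as MoE, SLMs, Scaling Laws, the various training methods, inference optimization, and AI for Science, are judged not to be mentioned in or relevant to the abstract and are scored 0.

!!! tip deepseek-chat TL;DR

The paper addresses the problem that feeding raw purchase history directly into the prompt works poorly for LLM shopping agents. It proposes MemRerank, a preference memory framework that distills user history into concise signals to improve personalized product reranking. Experiments show consistent gains over no-memory, raw-history, and off-the-shelf memory baselines, up to +10.61 absolute points in 1-in-5 accuracy.

Abstract (Chinese translation)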

基于大语言模型的购物代理日益依赖长购物历史和多轮交互实现个性化，但直接将原始历史记录附加至提示往往因噪声干扰、长度过长及相关性失配而效果不佳。我们提出MemRerank——一种偏好记忆框架，能够将用户购买历史提炼为简洁的、与查询无关的信号，用于个性化商品重排序。为研究该问题，我们构建了以基于大语言模型的五选一任务为核心的端到端基准测试与评估框架，该任务同时衡量记忆质量与下游重排序效用。我们进一步通过强化学习训练记忆提取器，并以下游重排序性能作为监督信号。基于两种大语言模型重排序器的实验表明，MemRerank在无记忆、原始历史及现成记忆基线方法中均取得稳定优势，在五选一准确率上最高提升10.61个绝对百分点。这些结果表明，显式偏好记忆是智能电商系统中实现个性化功能实用且有效的构建模块。

Abstract

LLM-based shopping agents increasingly rely on long purchase histories and multi-turn interactions for personalization, yet naively appending raw history to prompts is often ineffective due to noise, length, and relevance mismatch. We propose MemRerank, a preference memory framework that distills user purchase history into concise, query-independent signals for personalized product reranking. To study this problem, we build an end-to-end benchmark and evaluation framework centered on an LLM-based 1-in-5 selection task, which measures both memory quality and downstream reranking utility. We further train the memory extractor with reinforcement learning (RL), using downstream reranking performance as supervision. Experiments with two LLM-based rerankers show that MemRerank consistently outperforms no-memory, raw-history, and off-the-shelf memory baselines, yielding up to +10.61 absolute points in 1-in-5 accuracy. These results suggest that explicit preference memory is a practical and effective building block for personalization in agentic e-commerce systems.

Keywords: LLM-based shopping agents, preference memory, personalized product reranking, reinforcement learning, MemRerank, agentic e-commerce systems, 1-in-5 selection task

138. ❌ The Thiomi Dataset: A Large-Scale Multimodal Corpus for Low-Resource African Languages

Authors: Hillary Mutisya, John Mugane, Gavin Nyamboga, Brian Chege, Maryruth Gathoni; Source: arXiv; Published: 2026-03-31; arXiv link: http://arxiv.org/abs/2603.29244v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

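Scoring rationale: The paper describes the construction of a multimodal dataset for low-resource African languages (the Thiomi Dataset), its quality assurance pipeline, and baseline experiments on automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS). The scoring keywords all focus on large language models (LLMs) and related techniques (training methods, inference optimization, alignment, applications), whereas this paper centers on dataset creation and conventional speech and language processing tasks and does not involve LLM techniques, deep learning innovations, or AI for Science applications. All keywords are therefore unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper builds the Thiomi Dataset, a large-scale multimodal corpus for low-resource African languages, and validates it by training ASR, MT, and TTS models, reaching 3.24% WER on Swahili (down from a prior academic SOTA of 8.3%) and 4.3% WER on Somali.

Abstract (Chinese translation)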

我们推出Thiomi数据集，这是一个涵盖四个语系十种非洲语言的大规模多模态语料库，包括：斯瓦希里语、基库尤语、坎巴语、基梅鲁语、卢奥语、马赛语、基普西吉斯语、索马里语（东非）；沃洛夫语（西非）；以及富拉尼语（西/中非）。该数据集包含超过601,000条经审核的句子级文本标注和超过385,000条九种语言的音频记录，这些数据通过一个专门的社区数据收集平台采集，涉及100多位贡献者。Thiomi平台收集了九种语言的数据；斯瓦希里语数据则通过现有Common Voice录音进行了补充。采用多层质量保障流程后，六种主要语言的文本审核通过率达到86-100%。为验证数据集的实用性，我们训练并评估了自动语音识别（ASR）、机器翻译（MT）和文本转语音（TTS）模型，为全部十种语言建立了性能基线。我们最优的ASR系统在斯瓦希里语（Common Voice）上实现了3.24%的词错误率（WER），将学界先前的最优水平从8.3%降至3.24%（绝对降低5.1个百分点，相对降低61%），在索马里语上达到4.3%词错误率。本数据集将在HuggingFace平台发布。我们详细阐述了数据收集平台、质量保障流程和基线实验，并探讨其对非洲语言技术基础设施建设的意义。

Abstract

We present the Thiomi Dataset, a large-scale multimodal corpus spanning ten African languages across four language families: Swahili, Kikuyu, Kamba, Kimeru, Luo, Maasai, Kipsigis, Somali (East Africa); Wolof (West Africa); and Fulani (West/Central Africa). The dataset contains over 601,000 approved sentence-level text annotations and over 385,000 audio recordings across nine languages, collected through a dedicated community data collection platform involving over 100 contributors. The Thiomi platform collected data for nine languages; Swahili data was supplemented with existing Common Voice recordings. A multi-tier quality assurance pipeline achieves 86-100% text approval rates for the six primary languages. To validate the dataset’s utility, we train and evaluate ASR, MT, and TTS models, establishing baselines across all ten languages. Our best ASR system achieves 3.24% WER on Swahili (Common Voice), reducing prior academic SOTA from 8.3% to 3.24% (5.1 percentage point absolute, 61% relative reduction), and 4.3% WER on Somali. The dataset will be published on HuggingFace. We describe the collection platform, quality assurance workflows, and baseline experiments, and discuss implications for African language technology infrastructure.
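The Swahili figures quoted in the abstract (8.3% to 3.24% WER, stated as 5.1 percentage points absolute and 61% relative) can be checked with a few lines of arithmetic. The helper name `wer_reduction` is hypothetical and used only for this illustration:

```python
def wer_reduction(previous: float, current: float):
    """Return (absolute, relative) reduction between two WER values.

    Both inputs are percentages; the absolute reduction is in
    percentage points and the relative reduction is in percent.
    """
    absolute = previous - current
    relative = absolute / previous * 100.0
    return absolute, relative


# Swahili (Common Voice) numbers quoted in the abstract.
absolute, relative = wer_reduction(8.3, 3.24)
print(f"absolute: {absolute:.2f} points, relative: {relative:.1f}%")
```

The unrounded absolute reduction is 5.06 points, which the abstract rounds to 5.1, and the relative reduction comes out at roughly 61%.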

Keywords: African languages, multimodal corpus, low-resource languages, automatic speech recognition, machine translation, text-to-speech, dataset creation, quality assurance

139. ❌ Long-Document QA with Chain-of-Structured-Thought and Fine-Tuned SLMs

Authors: Zhuowen Liang, Xiaotian Lin, Zhengxuan Zhang, Yuyu Luo, Haixun Wang, Nan Tang; Source: arXiv; Published: 2026-03-31; arXiv link: http://arxiv.org/abs/2603.29232v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

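Scoring rationale: The paper studies large language models (LLMs) and small language models (SLMs) for long-document question answering, proposes Chain-of-Structured-Thought (CoST), and trains SLMs with supervised fine-tuning (SFT) followed by reinforcement learning (GRPO). It is therefore highly relevant to "Large Language Models", "Small Language Models", "Post-training", and "Chain of Thought" (10 points each). Other keywords, such as MoE, Scaling Laws, RAG, and Quantization, are not mentioned in the abstract and are scored 0.

!!! tip deepseek-chat TL;DR

The paper studies how Chain-of-Structured-Thought and two-stage fine-tuning (SFT followed by GRPO) can distill the structured reasoning behavior of large language models into small language models, so that 3B/7B SLMs reach LLM-comparable quality on long-document QA with 2-4x lower latency than GPT-4o and DeepSeek-R1 (671B).

Abstract (Chinese translation)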

大语言模型（LLM）已广泛应用于文档数据分析，但直接对冗长、含噪声的文档进行推理仍存在脆弱性且易出错。因此，我们研究文档问答（QA）任务，其将分散的证据整合为结构化输出（如表、图或文本块），以支持可靠、可验证的QA。我们提出一个双支柱框架LiteCoST，旨在利用小语言模型（SLM）同时实现高准确性与低延迟。支柱一：结构化思维链（Chain-of-Structured-Thought，CoST）。我们引入CoST模板——一种模式感知的指令，用于引导强LLM生成逐步的CoST轨迹及相应的结构化输出。该过程通过诱导最小化结构、规范化实体/单位、对齐记录、序列化输出并进行验证/优化，产生可审计的监督信号。支柱二：SLM微调。紧凑模型基于LLM生成的CoST数据进行两阶段训练：首先通过监督微调实现结构对齐，随后采用组相对策略优化（Group Relative Policy Optimization，GRPO），该阶段融合了针对答案质量、格式质量与过程一致性的三重奖励。通过将“结构优先”的行为蒸馏至SLM中，该方法使3B/7B参数的SLM在多领域长文档QA任务上达到了与LLM相当的质量，同时实现了比GPT-4o和DeepSeek-R1（671B）低2-4倍的延迟。代码发布于https://github.com/HKUSTDial/LiteCoST。

Abstract

Large language models (LLMs) are widely applied to data analytics over documents, yet direct reasoning over long, noisy documents remains brittle and error-prone. Hence, we study document question answering (QA) that consolidates dispersed evidence into a structured output (e.g., a table, graph, or chunks) to support reliable, verifiable QA. We propose a two-pillar framework, LiteCoST, to achieve both high accuracy and low latency with small language models (SLMs). Pillar 1: Chain-of-Structured-Thought (CoST). We introduce a CoST template, a schema-aware instruction that guides a strong LLM to produce both a step-wise CoST trace and the corresponding structured output. The process induces a minimal structure, normalizes entities/units, aligns records, serializes the output, and verifies/refines it, yielding auditable supervision. Pillar 2: SLM fine-tuning. The compact models are trained on LLM-generated CoST data in two stages: Supervised Fine-Tuning for structural alignment, followed by Group Relative Policy Optimization (GRPO) incorporating triple rewards for answer/format quality and process consistency. By distilling structure-first behavior into SLMs, this approach achieves LLM-comparable quality on multi-domain long-document QA using 3B/7B SLMs, while delivering 2-4x lower latency than GPT-4o and DeepSeek-R1 (671B). The code is available at https://github.com/HKUSTDial/LiteCoST.
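The second training stage named in the abstract, GRPO with three reward components, can be illustrated with a small sketch. It shows only the group-relative advantage step in its commonly described form (each sampled response's reward is normalized by the mean and standard deviation of its group). The equal weighting of the three rewards, the use of the population standard deviation, and all function names are assumptions for illustration, not details taken from LiteCoST:

```python
import statistics


def combined_reward(answer: float, fmt: float, consistency: float,
                    weights=(1.0, 1.0, 1.0)) -> float:
    """Weighted sum of the three reward components.

    Equal weights are a placeholder assumption; the abstract does not
    say how the answer, format, and process rewards are combined.
    """
    w_a, w_f, w_c = weights
    return w_a * answer + w_f * fmt + w_c * consistency


def group_relative_advantages(rewards, eps: float = 1e-8):
    """Normalize each reward against its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std, an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical responses sampled for the same prompt, each scored
# on (answer quality, format quality, process consistency).
group = [(1.0, 1.0, 1.0), (0.0, 1.0, 0.5), (1.0, 0.0, 0.5), (0.0, 0.0, 0.0)]
rewards = [combined_reward(*scores) for scores in group]
advantages = group_relative_advantages(rewards)
print(rewards)
print([round(a, 3) for a in advantages])
```

Responses that score above their group's average get a positive advantage and those below get a negative one, so no separate value model is needed to provide a baseline.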

Keywords: Long-Document QA, Chain-of-Structured-Thought, Small Language Models, Supervised Fine-Tuning, Group Relative Policy Optimization, Structured Output, Latency Reduction, LLM Distillation

140. ❌ SiPaKosa: A Comprehensive Corpus of Canonical and Classical Buddhist Texts in Sinhala and Pali

Authors: Ranidu Gurusinghe, Nevidu Jayatilleke; Source: arXiv; Published: 2026-03-31; arXiv link: http://arxiv.org/abs/2603.29221v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

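Scoring rationale: The paper creates a corpus of Buddhist texts and evaluates pretrained language models on it. Relevance by keyword: (1) "Large Language Models" is somewhat relevant (5 points), because the paper evaluates language models, including proprietary and open-source ones; (2) "Pre-training" is highly relevant (8 points), because the paper states that the corpus supports pretraining of domain-adapted language models; (3) the other keywords (MoE, SFT, RAG, and so on) are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The study creates SiPaKosa, a comprehensive corpus of Sinhala and Pali Buddhist texts, and evaluates ten pretrained language models on it, finding that proprietary models outperform open-source alternatives by factors of three to six.

Abstract (Chinese translation)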

SiPaKosa是一个包含僧伽罗语和巴利语教义文本的综合语料库，涵盖约78.6万句、925万词，整合了16份已获版权许可的历史佛教文献以及完整网络爬取的《三藏》经典文本。该语料库通过谷歌文档AI对历史手稿进行高质量光学字符识别（OCR），结合对经典文献库的系统化网络爬取，并经过严格的质量控制与元数据标注构建而成。语料库按语言分为专用子库：僧伽罗语子库以及僧伽罗语-巴利语混合子库。我们使用十种预训练模型评估了语言模型性能，在本语料库上的困惑度分数介于1.09至189.67之间。分析表明，专有模型性能显著优于开源模型，达到其三至六倍。本语料库支持领域自适应语言模型的预训练，促进历史语言分析，有助于开发面向佛学研究的文献检索系统，同时为保护僧伽罗文化遗产提供支持。

Abstract

SiPaKosa is a comprehensive corpus of Sinhala and Pali doctrinal texts comprising approximately 786K sentences and 9.25M words, incorporating 16 copyright-cleared historical Buddhist documents alongside the complete web-scraped Tripitaka canonical texts. The corpus was created through high-quality OCR using Google Document AI on historical manuscripts, combined with systematic web scraping of canonical repositories, followed by rigorous quality control and metadata annotation. The corpus is organised into language-specific subcorpora: Sinhala and Mixed Sinhala-Pali. We evaluate the performance of language models using ten pretrained models, with perplexity scores ranging from 1.09 to 189.67 on our corpus. This analysis shows that proprietary models significantly outperform open-source alternatives by factors of three to six times. This corpus supports the pretraining of domain-adapted language models, facilitates historical language analysis, and aids in the development of information retrieval systems for Buddhist scholarship while preserving Sinhala cultural heritage.
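For readers less familiar with the metric, the perplexity range quoted above (1.09 to 189.67) can be read through the standard definition: the exponential of the average negative log-likelihood per token. The sketch below uses that textbook definition. The abstract does not say whether the authors normalize per token or per word, so the per-token choice and the function name are assumptions:

```python
import math


def perplexity(token_log_probs) -> float:
    """Perplexity from natural-log probabilities of each token.

    Computed as exp of the mean negative log-likelihood.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# A model that assigns probability 0.5 to every token has perplexity 2:
# it is as uncertain as a fair coin flip at each step.
print(perplexity([math.log(0.5)] * 4))
```

Under this definition, lower is better, and a perplexity close to 1 means the model assigns nearly all probability mass to the observed tokens.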

Keywords: Buddhist texts, Sinhala, Pali, corpus, language models, perplexity, domain adaptation, cultural heritage

141. ❌ SyriSign: A Parallel Corpus for Arabic Text to Syrian Arabic Sign Language Translation

Authors: Mohammad Amer Khalil, Raghad Nahas, Ahmad Nassar, Khloud Al Jallad; Source: arXiv; Published: 2026-03-31; arXiv link: http://arxiv.org/abs/2603.29219v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

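Scoring rationale: The paper focuses on creating a Syrian Arabic Sign Language dataset and on the translation task, evaluating it with deep learning models such as MotionCLIP, T2M-GPT, and SignCLIP. It does not involve the areas covered by the keywords, such as large-model principles, training methods, inference optimization, alignment techniques, agent systems, or scientific AI applications, so all keywords are scored as unrelated to the paper.

!!! tip deepseek-chat TL;DR

The study creates SyriSign, presented as the first publicly available Syrian Arabic Sign Language dataset, with 1500 video samples across 150 unique lexical signs. It evaluates text-to-sign translation with three deep learning architectures and finds that the limited dataset size constrains generalization.

Abstract (Chinese translation)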

手语是聋哑及听力障碍（DHH）群体主要的沟通方式。尽管目前已有大量针对高资源手语的基准数据集，但如阿拉伯语手语等低资源语言仍缺乏充分研究。目前，尚无公开可用的叙利亚阿拉伯手语（SyArSL）数据集。为填补这一空白，我们推出了SyriSign数据集，该数据集包含150个独立词汇手势的1500个视频样本，专为文本到SyArSL的翻译任务设计。本研究旨在缓解叙利亚的沟通障碍，因为当地新闻多以口语或书面阿拉伯语传播，聋人群体往往难以获取。我们采用三种深度学习架构对SyriSign进行评估：用于语义动作生成的MotionCLIP、基于文本条件动作合成的T2M-GPT，以及实现双语嵌入对齐的SignCLIP。实验结果表明，尽管生成式方法在手语表征方面展现出巨大潜力，但有限的数据集规模制约了泛化性能。我们将公开SyriSign数据集，期望其能成为该领域的初步基准。

Abstract

Sign language is the primary approach of communication for the Deaf and Hard-of-Hearing (DHH) community. While there are numerous benchmarks for high-resource sign languages, low-resource languages like Arabic remain underrepresented. Currently, there is no publicly available dataset for Syrian Arabic Sign Language (SyArSL). To overcome this gap, we introduce SyriSign, a dataset comprising 1500 video samples across 150 unique lexical signs, designed for text-to-SyArSL translation tasks. This work aims to reduce communication barriers in Syria, as most news are delivered in spoken or written Arabic, which is often inaccessible to the deaf community. We evaluated SyriSign using three deep learning architectures: MotionCLIP for semantic motion generation, T2M-GPT for text-conditioned motion synthesis, and SignCLIP for bilingual embedding alignment. Experimental results indicate that while generative approaches show strong potential for sign representation, the limited dataset size constrains generalization performance. We will release SyriSign publicly, hoping it serves as an initial benchmark.

Keywords: Syrian Arabic Sign Language, sign language translation, low-resource language, parallel corpus, deep learning, motion generation, dataset creation, communication barriers

142. ❌ Advancing LLM-based phoneme-to-grapheme for multilingual speech recognition

Authors: Lukuang Dong, Ziwei Li, Saierdaer Yusuyin, Xianyu Zhao, Zhijian Ou; Source: arXiv; Published: 2026-03-31; arXiv link: http://arxiv.org/abs/2603.29217v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	5.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

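Scoring rationale: The paper explicitly uses LLMs for multilingual phoneme-to-grapheme conversion, an application of large models to speech recognition, so "Large Language Models" scores 10 points. The paper describes S-SKM as a Monte Carlo approximation, which is loosely related to "MCTS AND LLM" but is not Monte Carlo Tree Search itself, hence 5 points. The paper is treated as an application of AI to speech science and thus related to "AI for Science", hence 5 points. Other keywords, such as MoE, SFT, and RAG, are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper studies LLM-based phoneme-to-grapheme conversion for multilingual speech recognition. With robust training and low-resource oversampling, it reduces the average word error rate on the CV-Lang10 benchmark from 10.56% to 7.66%.

Abstract (Chinese translation)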

基于音素的自动语音识别（ASR）将识别过程分解为语音到音素（S2P）和音素到字素（P2G）两个阶段，从而能够在跨语言共享声学模型的同时，将语言特定的正字法保留在独立模块中。尽管大型语言模型（LLMs）在P2G任务中展现出潜力，但由于需要语言感知的生成能力以及严重的跨语言数据不平衡问题，多语言P2G仍面临挑战。我们在十语言的CV-Lang10基准上研究了基于多语言LLM的P2G方法。我们考察了针对S2P不确定性的鲁棒性策略，包括DANP和简化SKM（S-SKM）。S-SKM是一种蒙特卡洛近似方法，避免了在P2G训练中使用基于CTC的S2P概率加权。通过鲁棒性训练和低资源语言过采样，平均词错误率（WER）从10.56%降至7.66%。

Abstract

Phoneme-based ASR factorizes recognition into speech-to-phoneme (S2P) and phoneme-to-grapheme (P2G), enabling cross-lingual acoustic sharing while keeping language-specific orthography in a separate module. While large language models (LLMs) are promising for P2G, multilingual P2G remains challenging due to language-aware generation and severe cross-language data imbalance. We study multilingual LLM-based P2G on the ten-language CV-Lang10 benchmark. We examine robustness strategies that account for S2P uncertainty, including DANP and Simplified SKM (S-SKM). S-SKM is a Monte Carlo approximation that avoids CTC-based S2P probability weighting in P2G training. Robust training and low-resource oversampling reduce the average WER from 10.56% to 7.66%.
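The headline result is reported as word error rate (WER). As a reminder of what that metric measures, here is a minimal sketch of the usual definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. Whitespace tokenization and the function name are simplifying assumptions, and the paper's own scoring tool and text normalization are not described in the abstract:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing
            insertion = curr[j - 1] + 1  # extra hypothesis word
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors
# over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.3f}")
```

For scale, the reported drop from 10.56% to 7.66% is 2.90 percentage points absolute, or about 27.5% relative.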

Keywords: LLM-based phoneme-to-grapheme, multilingual speech recognition, CV-Lang10 benchmark, robust training, low-resource oversampling, Monte Carlo approximation, WER reduction, S-SKM

143. ❌ Kwame 2.0: Human-in-the-Loop Generative AI Teaching Assistant for Large Scale Online Coding Education in Africa

Authors: George Boateng, Samuel Boateng, Victor Kumbol Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29159v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

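Scoring rationale: The paper's core contribution is a generative AI teaching assistant built on retrieval-augmented generation (RAG) for large-scale online coding education in Africa. It is therefore highly relevant to 'Retrieval-Augmented Generation' (10), since RAG is its core technique. The paper describes a 'generative AI teaching assistant', which indicates the use of generative AI and large language models, so it is related to 'Large Language Models' (8). Other keywords, including MoE, SLMs, Scaling Laws, the training methods (pre-training, fine-tuning, alignment, and so on), inference optimization, agent systems, model compression, and AI for Science, are either not mentioned in the abstract or unrelated to the paper's topic, and score 0.

!!! tip deepseek-chat TL;DR

This study builds Kwame 2.0, a generative AI teaching assistant based on retrieval-augmented generation (RAG) with a human in the loop, for large-scale online coding education in Africa. A 15-month real-world deployment supports its effectiveness in providing high-quality, timely learning support, with human oversight ensuring reliability.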

Abstract

Providing timely and accurate learning support in large-scale online coding courses is challenging, particularly in resource-constrained contexts. We present Kwame 2.0, a bilingual (English-French) generative AI teaching assistant built using retrieval-augmented generation and deployed in a human-in-the-loop forum within SuaCode, an introductory mobile-based coding course for learners across Africa. Kwame 2.0 retrieves relevant course materials and generates context-aware responses while encouraging human oversight and community participation. We deployed the system in a 15-month longitudinal study spanning 15 cohorts with 3,717 enrollments across 35 African countries. Evaluation using community feedback and expert ratings shows that Kwame 2.0 provided high-quality and timely support, achieving high accuracy on curriculum-related questions, while human facilitators and peers effectively mitigated errors, particularly for administrative queries. Our findings demonstrate that human-in-the-loop generative AI systems can combine the scalability and speed of AI with the reliability of human support, offering an effective approach to learning assistance for underrepresented populations in resource-constrained settings at scale.
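
To make the retrieve-then-generate flow concrete, here is a minimal sketch, assuming a bag-of-words cosine retriever and a plain prompt template. The passages, function names, and prompt wording are hypothetical. The abstract does not describe Kwame 2.0's actual retriever or generator, and the final call to a language model is left out:

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector: lowercase whitespace tokens mapped to counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(question, passages, k=1):
    """Return the k passages most similar to the question."""
    q = bow(question)
    ranked = sorted(passages, key=lambda p: cosine(q, bow(p)), reverse=True)
    return ranked[:k]


def build_prompt(question, passages):
    """Place the retrieved course material in front of the learner's question."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Course material:\n{context}\n\nQuestion: {question}\nAnswer:"


# Hypothetical course passages and learner question.
passages = [
    "A variable stores a value that can change while the program runs.",
    "A loop repeats a block of code until a condition is false.",
]
question = "How does a loop repeat code?"
top = retrieve(question, passages, k=1)
prompt = build_prompt(question, top)
```

In a human-in-the-loop forum like the one described, the generated answer would be posted where facilitators and peers can correct it. That review step is outside this sketch.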

Keywords: generative AI teaching assistant, retrieval-augmented generation, human-in-the-loop, large-scale online coding education, Africa, SuaCode, bilingual (English-French), longitudinal study

144. ❌ Xuanwu: Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems

Authors: Zhiqian Zhang, Xu Zhao, Xiaoqing Xu, Guangdong Liang, Weijia Wang, Xiaolei Lv, Bo Li, Jun Gao Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29211v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

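Scoring rationale: The paper presents Xuanwu VL-2B, a multimodal foundation model with roughly 2 billion parameters for content ecosystems. The core relevant keywords are: (1) 'Large Language Models/Foundation Models' (10): the paper explicitly studies a multimodal foundation model; (2) 'Small Language Models/On-device AI' (8): at about 2B parameters the model is relatively small, and deployment cost is a stated concern; (3) 'Pre-training/Domain Adaptation' (10): it uses a three-stage training pipeline (pre-training, mid-training, post-training) that involves domain adaptation; (4) 'Post-training/SFT' (10): post-training is one of the core training stages; (5) 'Instruction Tuning/Alignment' (8): it emphasizes business alignment and language-semantic alignment; (6) 'Scaling Laws AND Data Quality' (5): it mentions a data iteration and curation mechanism, which implies attention to data quality; (7) 'Quantization/Model Compression' (5): it optimizes under a limited parameter budget, which touches on model efficiency; (8) 'Hallucination Mitigation/Factuality' (5): content moderation involves factuality and truthfulness. Other keywords such as MoE, RAG, and RLHF are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

This paper studies how to develop a general multimodal model into an industrial-grade foundation model, addressing degraded generalization and catastrophic forgetting in content moderation and adversarial settings. Within a budget of about 2B parameters and using a three-stage training pipeline, the proposed Xuanwu VL-2B strikes a practical balance among business alignment, visual perception, general capability retention, and deployment cost, and outperforms the compared models on several benchmarks and business tasks.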

Abstract

In recent years, multimodal large models have continued to improve on general benchmarks. However, in real-world content moderation and adversarial settings, mainstream models still suffer from degraded generalization and catastrophic forgetting because of limited fine-grained visual perception and insufficient modeling of long-tail noise. In this paper, we present Xuanwu VL-2B as a case study of how general multimodal models can be developed into an industrial-grade foundation model for content ecosystems. The model adopts a compact InternViT-300M + MLP + Qwen3 1.7B architecture, balancing fine-grained visual perception, language-semantic alignment, and deployment cost within an approximately 2B-parameter budget. To balance business specialization with the retention of general capabilities, we developed a data iteration and curation mechanism and trained the model through a progressive three-stage pipeline: pre-training, mid-training, and post-training. Ablation studies and offline business evaluations show that Xuanwu VL-2B achieves an average score of 67.90 across seven OpenCompass multimodal metrics (vs. 64.27 for InternVL 3.5 2B), an average recall of 94.38% over seven independent business moderation tasks, and a weighted overall recall of 82.82% on policy-violating text in challenging adversarial OCR scenarios, outperforming Gemini-2.5-Pro (76.72%). These results show that, under a limited parameter budget, Xuanwu VL-2B achieves a practical balance among business alignment, visual perception, general capability retention, and deployment cost.

Keywords: multimodal large models, industrial-grade foundation model, content moderation, adversarial settings, fine-grained visual perception, three-stage training pipeline, parameter budget optimization, business alignment

145. ❌ Concept Training for Human-Aligned Language Models

Authors: Christine Zhang, Dan Jurafsky, Chen Shani Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29123v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

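Scoring rationale: The paper proposes a new language model training framework that replaces conventional next-token prediction with concept prediction. This bears directly on the core training method (pre-training) of large language models (LLMs) and aims to improve the model's alignment with human semantic judgments. It is therefore highly relevant (10) to 'Large Language Models OR LLMs OR Foundation Models', 'Pre-training OR Continual Pre-training OR Domain Adaptation', and 'Instruction Tuning OR Alignment OR Value Alignment'. Other keywords such as MoE, SFT, RAG, and inference acceleration are not covered by the paper and score 0.

!!! tip deepseek-chat TL;DR

This study trains language models with concept supervision in place of conventional next-token prediction to improve alignment with human semantic judgments. Experiments show that the method improves semantic alignment while keeping language modeling performance competitive.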

Abstract

The next-token prediction (NTP) objective trains language models to predict a single continuation token at each step. In natural language, however, a prefix can be continued in many valid ways, and even similar meanings may differ in surface form. For example, the sentence "this website is safe to *browse*" could plausibly continue with words such as browse, search, visit, surf, or navigate. While standard NTP training treats these alternatives as mutually exclusive targets, we explore a framework that instead predicts concepts, approximated as sets of semantically related tokens. We show that models trained with concept supervision exhibit stronger alignment with human semantic similarity judgments on multiple lexical benchmarks. These gains are accompanied by lower perplexity on semantically meaningful words (definition in Section 3.1), and a modest increase in global token-level perplexity, reflecting a tradeoff between standard NTP optimization and concept-level supervision. Our results suggest that concept-level objectives can improve semantic alignment while maintaining competitive language modeling performance.
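
The contrast between the two objectives can be sketched as follows, assuming that a concept's probability is the sum of the probabilities of its member tokens. This is one plausible reading of "sets of semantically related tokens"; the abstract does not give the exact loss, and the logits below are made up for illustration:

```python
import math


def softmax(logits):
    """Turn a dict of token logits into a dict of probabilities."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}


def ntp_loss(probs, target):
    """Standard next-token loss: only the single observed token counts."""
    return -math.log(probs[target])


def concept_loss(probs, concept):
    """Assumed concept-level loss: any token in the concept set counts."""
    return -math.log(sum(probs[tok] for tok in concept))


# Hypothetical logits for the slot after "this website is safe to ...".
logits = {"browse": 2.0, "visit": 1.5, "surf": 1.0, "delete": 0.0}
probs = softmax(logits)
concept = {"browse", "visit", "surf"}

ntp = ntp_loss(probs, "browse")
con = concept_loss(probs, concept)
```

Under this reading the model is not penalized for spreading mass over `visit` or `surf`, which is consistent with the reported tradeoff: better semantic alignment at the cost of a modest rise in token-level perplexity.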

Keywords: concept training, language models, next-token prediction, semantic alignment, human judgments, perplexity, concept supervision, semantic similarity

146. ❌ GISTBench: Evaluating LLM User Understanding via Evidence-Based Interest Verification

Authors: Iordanis Fostiropoulos, Muhammad Rafay Azhar, Abdalaziz Sawwan, Boyu Fang, Yuchen Liu, Jiayi Liu, Hanchao Yu, Qi Guo, Jianyu Wang, Fei Liu, Xiangjun Fan Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29112v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

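Scoring rationale: The paper's core is evaluating how well LLMs understand user interests in recommendation systems, so it is highly relevant to 'Large Language Models' (10). It proposes the Interest Groundedness metric to penalize hallucinated interests, which is somewhat related to 'Hallucination Mitigation' (5). Other keywords such as MoE, SLMs, Scaling Laws, training methods, reasoning techniques, and AI for Science are not covered by the paper and score 0.

!!! tip deepseek-chat TL;DR

This paper introduces GISTBench, a benchmark for evaluating how well large language models understand and verify user interests from recommendation-system interaction histories. It finds that current LLMs hit performance bottlenecks in accurately counting and attributing heterogeneous engagement signals.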

Abstract

We introduce GISTBench, a benchmark for evaluating Large Language Models’ (LLMs) ability to understand users from their interaction histories in recommendation systems. Unlike traditional RecSys benchmarks that focus on item prediction accuracy, our benchmark evaluates how well LLMs can extract and verify user interests from engagement data. We propose two novel metric families: Interest Groundedness (IG), decomposed into precision and recall components to separately penalize hallucinated interest categories and reward coverage, and Interest Specificity (IS), which assesses the distinctiveness of verified LLM-predicted user profiles. We release a synthetic dataset constructed on real user interactions on a global short-form video platform. Our dataset contains both implicit and explicit engagement signals and rich textual descriptions. We validate our dataset fidelity against user surveys, and evaluate eight open-weight LLMs spanning 7B to 120B parameters. Our findings reveal performance bottlenecks in current LLMs, particularly their limited ability to accurately count and attribute engagement signals across heterogeneous interaction types.
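
A minimal sketch of the precision and recall components of Interest Groundedness, assuming both are computed as set overlaps between the interests an LLM predicts and a reference set of evidence-backed interests. The benchmark's exact definitions are not given in the abstract, and the interest labels are hypothetical:

```python
def interest_groundedness(predicted, reference):
    """Return (precision, recall) of predicted interests against evidence-backed ones.

    Precision drops when the model invents interests with no supporting evidence;
    recall drops when it misses interests that the evidence does support.
    """
    predicted, reference = set(predicted), set(reference)
    grounded = predicted & reference
    precision = len(grounded) / len(predicted) if predicted else 0.0
    recall = len(grounded) / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical example: "chess" is a hallucinated interest,
# while "fitness" and "music" are supported interests the model missed.
predicted = ["cooking", "travel", "chess"]
reference = ["cooking", "travel", "fitness", "music"]
precision, recall = interest_groundedness(predicted, reference)
```

Reporting the two components separately, as the abstract describes, keeps a model that lists every possible interest (high recall, low precision) distinguishable from one that lists only a few safe ones.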

Keywords: Large Language Models, LLM evaluation, user understanding, recommendation systems, interest verification, hallucination mitigation, benchmark dataset, engagement signals

147. ❌ APEX-EM: Non-Parametric Online Learning for Autonomous Agents via Structured Procedural-Episodic Experience Replay

Authors: Pratyay Banerjee, Masud Moshtaghi, Ankit Chadha Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29093v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	15.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

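Scoring rationale: The paper studies persistent memory for LLM-based autonomous agents and proposes the APEX-EM framework, which achieves non-parametric online learning through structured experience replay. Highly relevant keywords are: LLM Agents (core topic, 15), Chain of Thought (planning steps and reasoning process, 10), System 2 Thinking (in-depth reasoning and iteration, 10), Self-Correction (error analysis and improvement mechanisms, 10), In-context Learning (successful and failed experiences used as in-context examples, 10), and Large Language Models (built on LLMs such as Claude, 10). Other keywords such as MoE, SFT, and RAG are not covered by the paper and score 0.

!!! tip deepseek-chat TL;DR

This paper addresses the lack of persistent procedural memory in LLM-based autonomous agents. It proposes the APEX-EM framework, which achieves non-parametric online learning through structured experience replay and substantially improves task accuracy on several benchmarks.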

Abstract

LLM-based autonomous agents lack persistent procedural memory: they re-derive solutions from scratch even when structurally identical tasks have been solved before. We present APEX-EM, a non-parametric online learning framework that accumulates, retrieves, and reuses structured procedural plans without modifying model weights. APEX-EM introduces: (1) a structured experience representation encoding the full procedural-episodic trace of each execution – planning steps, artifacts, iteration history with error analysis, and quality scores; (2) a Plan-Retrieve-Generate-Iterate-Ingest (PRGII) workflow with Task Verifiers providing multi-dimensional reward signals; and (3) a dual-outcome Experience Memory with hybrid retrieval combining semantic search, structural signature matching, and plan DAG traversal – enabling cross-domain transfer between tasks sharing no lexical overlap but analogous operational structure. Successful experiences serve as positive in-context examples; failures as negative examples with structured error annotations. We evaluate on BigCodeBench, KGQAGen-10k, and Humanity’s Last Exam using Claude Sonnet 4.5 and Opus 4.5. On KGQAGen-10k, APEX-EM achieves 89.6% accuracy versus 41.3% without memory (+48.3pp), surpassing the oracle-retrieval upper bound (84.9%). On BigCodeBench, it reaches 83.3% SR from a 53.9% baseline (+29.4pp), exceeding MemRL’s +11.0pp gain under comparable frozen-backbone conditions (noting backbone differences controlled for in our analysis). On HLE, entity graph retrieval reaches 48.0% from 25.2% (+22.8pp). Ablations show component value is task-dependent: rich judge feedback is negligible for code generation but critical for structured queries (+10.3pp), while binary-signal iteration partially compensates for weaker feedback.
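
The idea of retrieving by operational structure rather than wording can be sketched as follows. This is a toy illustration under stated assumptions, not the APEX-EM implementation: token-set Jaccard similarity stands in for semantic search, a tuple of operation names stands in for the structural signature and plan DAG, and the `Experience` class, the weights, and the stored tasks are all hypothetical:

```python
from dataclasses import dataclass


@dataclass
class Experience:
    task: str        # natural-language task description
    plan: tuple      # ordered operation names (stand-in for a plan DAG)
    success: bool    # dual outcome: successes and failures are both kept
    note: str = ""   # error annotation for failed runs


def jaccard(a, b):
    """Jaccard similarity between two collections, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def score(query_task, query_plan, exp, w_semantic=0.5):
    """Blend lexical similarity of the task text with similarity of the plan structure."""
    semantic = jaccard(query_task.lower().split(), exp.task.lower().split())
    structural = jaccard(query_plan, exp.plan)
    return w_semantic * semantic + (1 - w_semantic) * structural


def retrieve(query_task, query_plan, memory, k=2):
    """Return the k highest-scoring experiences."""
    ranked = sorted(memory, key=lambda e: score(query_task, query_plan, e), reverse=True)
    return ranked[:k]


memory = [
    Experience("parse csv sales report", ("load", "filter", "aggregate"), True),
    Experience("summarize server logs", ("load", "filter", "aggregate"), False,
               "forgot to handle empty file"),
    Experience("draw a bar chart", ("plot",), True),
]

# The query shares no words with any stored task, only the plan structure.
hits = retrieve("count genes per chromosome", ("load", "filter", "aggregate"), memory)
positives = [e for e in hits if e.success]       # used as positive in-context examples
negatives = [e for e in hits if not e.success]   # used as annotated negative examples
```

Both structurally matching experiences are retrieved even though the lexical score is zero, which mirrors the cross-domain transfer the abstract describes.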

Keywords: autonomous agents, LLM-based agents, procedural memory, experience replay, non-parametric learning, structured experience representation, in-context learning, task execution improvement

148. ❌ Dual Perspectives in Emotion Attribution: A Generator-Interpreter Framework for Cross-Cultural Analysis of Emotion in LLMs

Authors: Aizirek Turdubaeva, Uichin Lee Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.29077v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

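Scoring rationale: The paper explicitly studies LLMs applied to cross-cultural emotion understanding, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10). It does not cover the specific techniques (MoE, SFT, RAG, and so on) or domains (such as bioinformatics) of the other keywords, which therefore score 0.

!!! tip deepseek-chat TL;DR

This paper proposes a Generator-Interpreter framework and evaluates six LLMs on an emotion attribution task using data from 15 countries. It finds that LLM performance depends on emotion type and cultural context, and that the generator's cultural background has a stronger effect than the interpreter's. It calls for culturally sensitive emotion modeling in LLM-based systems to improve the robustness and fairness of cross-cultural emotion understanding.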

Abstract

Large language models (LLMs) are increasingly used in cross-cultural systems to understand and adapt to human emotions, which are shaped by cultural norms of expression and interpretation. However, prior work on emotion attribution has focused mainly on interpretation, overlooking the cultural background of emotion generators. This assumption of universality neglects variation in how emotions are expressed and perceived across nations. To address this gap, we propose a Generator-Interpreter framework that captures dual perspectives of emotion attribution by considering both expression and interpretation. We systematically evaluate six LLMs on an emotion attribution task using data from 15 countries. Our analysis reveals that performance variations depend on the emotion type and cultural context. Generator-interpreter alignment effects are present; the generator’s country of origin has a stronger impact on performance. We call for culturally sensitive emotion modeling in LLM-based systems to improve robustness and fairness in emotion understanding across diverse cultural contexts.

Keywords: Large Language Models, LLMs, emotion attribution, cross-cultural analysis, Generator-Interpreter framework, cultural context, emotion understanding, cultural sensitivity

149. ❌ PolarQuant: Optimal Gaussian Weight Quantization via Hadamard Rotation for LLM Compression

Authors: Caio Vicentino Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.29078v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

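Scoring rationale: PolarQuant focuses on post-training quantization for compressing large language models (LLMs). It is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10) because the paper explicitly targets LLM compression; to 'Post-training OR Supervised Fine-tuning OR SFT' (10) because the proposed method is post-training quantization, which falls under post-training optimization; and to 'Quantization OR Model Compression OR Low-bit Weights' (10) because its core is a quantization technique that aims to reduce model storage and compute costs. Other keywords such as MoE, SLMs, Scaling Laws, Instruction Tuning, RAG, and Agents are not covered by the paper and score 0.

!!! tip deepseek-chat TL;DR

PolarQuant proposes a post-training quantization method based on Hadamard rotation. By transforming weight coordinates into approximately Gaussian variables, it achieves near-lossless compression of large language models (LLMs), lowering perplexity relative to absmax quantization (from 6.90 to 6.40 on Qwen3.5-9B at Q5) while preserving inference efficiency.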

Abstract

We present PolarQuant, a post-training weight quantization method for large language models (LLMs) that exploits the distributional structure of neural network weights to achieve near-lossless compression. PolarQuant operates in three stages: (1) block-wise normalization to the unit hypersphere, (2) Walsh-Hadamard rotation to transform coordinates into approximately Gaussian random variables, and (3) quantization with centroids matched to the Gaussian distribution. Our ablation reveals that Hadamard rotation alone accounts for 98% of the quality improvement, reducing Qwen3.5-9B perplexity from 6.90 (absmax Q5) to 6.40 (Delta = +0.03 from FP16), making it practically lossless without any calibration data. Furthermore, PolarQuant functions as an effective preprocessing step for downstream INT4 quantizers: PolarQuant Q5 dequantized and re-quantized by torchao INT4 achieves perplexity 6.56 versus 6.68 for direct absmax INT4, while maintaining 43.1 tok/s throughput at 6.5 GB VRAM. Code and models are publicly available.
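
The three stages can be sketched in plain Python as follows. This is an illustrative reconstruction, not the released implementation. The centroids are taken as the conditional means of a standard normal over equal-probability bins (an assumption; the paper may use a different Gaussian-matched codebook), the coordinates of a rotated unit vector of length n are assumed to have standard deviation of about 1/sqrt(n), and all function names are hypothetical:

```python
import math
from statistics import NormalDist


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return [v / math.sqrt(n) for v in x]


def gaussian_centroids(levels):
    """Conditional means of N(0, 1) over `levels` equal-probability bins (assumed codebook)."""
    nd = NormalDist()
    cuts = [nd.inv_cdf(k / levels) for k in range(1, levels)]
    dens = [0.0] + [nd.pdf(c) for c in cuts] + [0.0]  # density is 0 at -inf and +inf
    return [(lo - hi) * levels for lo, hi in zip(dens, dens[1:])]


def quantize_block(block, bits=5):
    n = len(block)
    norm = math.sqrt(sum(v * v for v in block)) or 1.0
    rotated = fwht([v / norm for v in block])            # stages 1 and 2
    cents = [c / math.sqrt(n) for c in gaussian_centroids(2 ** bits)]
    codes = [min(range(len(cents)), key=lambda k: abs(cents[k] - v))
             for v in rotated]                           # stage 3: nearest centroid
    return codes, norm


def dequantize_block(codes, norm, bits=5):
    n = len(codes)
    cents = [c / math.sqrt(n) for c in gaussian_centroids(2 ** bits)]
    return [v * norm for v in fwht([cents[k] for k in codes])]


block = [0.3, -1.2, 0.8, 0.05, -0.4, 1.1, -0.7, 0.2]  # made-up weights
codes, norm = quantize_block(block)
restored = dequantize_block(codes, norm)
rel_err = math.sqrt(sum((a - b) ** 2 for a, b in zip(block, restored))) / norm
```

Because the orthonormal Walsh-Hadamard transform is its own inverse, dequantization applies the same `fwht` to the looked-up centroids and rescales by the stored block norm. The sketch needs no calibration data, in line with the abstract, but it makes no attempt to reproduce the reported perplexity numbers.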

Keywords: weight quantization, LLM compression, post-training, Hadamard rotation, Gaussian distribution, perplexity reduction, model efficiency, neural network weights

150. ❌ An Empirical Recipe for Universal Phone Recognition

Authors: Shikhar Bharadwaj, Chin-Jou Li, Kwanghee Choi, Eunjung Yeo, William Chen, Shinji Watanabe, David R. Mortensen Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.29042v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

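Scoring rationale: This paper addresses phone recognition, a speech-processing task. It covers multilingual training data, self-supervised learning (SSL) representations, the effect of data scale, and loss-function choices. Although it applies AI techniques, every keyword targets large language models (LLMs) and related techniques (fine-tuning, alignment, reasoning, compression, and so on), whereas this paper studies speech recognition models and involves no LLM technique, architecture, or application. All keywords therefore receive a relevance of 0.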

!!! tip deepseek-chat TL;DR

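This paper addresses the poor cross-lingual generalization of phone recognition models. It presents PhoneticXEUS, trained on large-scale multilingual data, which achieves state-of-the-art performance in evaluations covering more than 100 languages.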

Abstract

Phone recognition (PR) is a key enabler of multilingual and low-resource speech processing tasks, yet robust performance remains elusive. Highly performant English-focused models do not generalize across languages, while multilingual models underutilize pretrained representations. It also remains unclear how data scale, architecture, and training objective contribute to multilingual PR. We present PhoneticXEUS – trained on large-scale multilingual data and achieving state-of-the-art performance on both multilingual (17.7% PFER) and accented English speech (10.6% PFER). Through controlled ablations with evaluations across 100+ languages under a unified scheme, we empirically establish our training recipe and quantify the impact of SSL representations, data scale, and loss objectives. In addition, we analyze error patterns across language families, accented speech, and articulatory features. All data and code are released openly.

Keywords: phone recognition, multilingual speech processing, pretrained representations, data scale, SSL representations, accented speech, language families, articulatory features

151. ❌ On the limited utility of parallel data for learning shared multilingual representations

Authors: Julius Leino, Jörg Tiedemann Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.29026v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

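Scoring rationale: This paper studies how parallel data (translated sentences) in multilingual pre-training affects cross-lingual representation alignment. It is highly relevant only to the keyword 'Pre-training OR Continual Pre-training OR Domain Adaptation' (10/10), because its core is an analysis of data-usage strategy during pre-training. The paper does not touch the technical innovations or applications behind the other keywords (for example LLM architecture, fine-tuning methods, inference optimization, or AI for science), so they score 0.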

!!! tip deepseek-chat TL;DR

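This study finds that parallel data has limited utility for cross-lingual representation alignment in multilingual pre-training: it only slightly accelerates early representation sharing and reduces the number of language-specific neurons, and cross-lingual alignment reaches similar levels even without parallel data.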

Abstract

Shared multilingual representations are essential for cross-lingual tasks and knowledge transfer across languages. This study looks at the impact of parallel data, i.e. translated sentences, in pretraining as a signal to trigger representations that are aligned across languages. We train reference models with different proportions of parallel data and show that parallel data seem to have only a minimal effect on the cross-lingual alignment. Based on multiple evaluation methods, we find that the effect is limited to potentially accelerating the representation sharing in the early phases of pretraining, and to decreasing the amount of language-specific neurons in the model. Cross-lingual alignment seems to emerge on similar levels even without the explicit signal from parallel data.

Keywords: multilingual representations, parallel data, pretraining, cross-lingual alignment, language-specific neurons, shared representations, knowledge transfer

152. ❌ Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning

Authors: Bilgehan Sel, Xuanli He, Alwin Peng, Ming Jin, Jerry Wei Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.29038v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0
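
The arithmetic behind this table can be sketched as follows. The pass-mark formula (5.0 + 0.8 × total weight) comes from the report header; the per-row rule `weight * relevance` is only an assumption for illustration, since the report does not state how a row score is derived. The helper names `passing_score` and `paper_score` are hypothetical.

```python
def passing_score(total_weight, base=5.0, factor=0.8):
    """Pass mark as given in the report header: base + factor * total weight."""
    return base + factor * total_weight


def paper_score(rows):
    """Sum of row scores. ASSUMPTION: row score = weight * relevance."""
    return sum(weight * relevance for weight, relevance in rows)


# Non-zero relevance rows from the table above, with the weights as listed (0.0).
rows = [(0.0, 10.0), (0.0, 10.0), (0.0, 8.0), (0.0, 5.0), (0.0, 8.0), (0.0, 5.0)]

print(round(passing_score(27.0), 1))  # pass mark for a total weight of 27.0
print(paper_score(rows))              # every listed weight is 0.0
```

Under any multiplicative rule, a listed weight of 0.0 forces the row score to 0.0 regardless of relevance, which matches the Score column here.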

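Scoring rationale: The paper studies an adversarial fine-tuning method (Trojan-Speak) that bypasses LLM safety classifiers. Its core involves LLM fine-tuning (Post-training/SFT) and safety alignment (Alignment/Hallucination Mitigation). Its GRPO-based hybrid reinforcement learning is related to RLHF. The application scenario involves CBRN queries, which gives it some connection to AI for Science. Other keywords such as MoE, SLMs, and RAG are not addressed.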

!!! tip deepseek-chat TL;DR

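The paper proposes Trojan-Speak, an adversarial fine-tuning method that bypasses Anthropic's Constitutional Classifiers, achieving 99%+ classifier evasion while preserving the model's reasoning ability (<5% degradation). It shows that LLM-based content classifiers alone cannot prevent the disclosure of dangerous information when adversaries have fine-tuning access.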

Abstract

Fine-tuning APIs offered by major AI providers create new attack surfaces where adversaries can bypass safety measures through targeted fine-tuning. We introduce Trojan-Speak, an adversarial fine-tuning method that bypasses Anthropic’s Constitutional Classifiers. Our approach uses curriculum learning combined with GRPO-based hybrid reinforcement learning to teach models a communication protocol that evades LLM-based content classification. Crucially, while prior adversarial fine-tuning approaches report more than 25% capability degradation on reasoning benchmarks, Trojan-Speak incurs less than 5% degradation while achieving 99+% classifier evasion for models with 14B+ parameters. We demonstrate that fine-tuned models can provide detailed responses to expert-level CBRN (Chemical, Biological, Radiological, and Nuclear) queries from Anthropic’s Constitutional Classifiers bug-bounty program. Our findings reveal that LLM-based content classifiers alone are insufficient for preventing dangerous information disclosure when adversaries have fine-tuning access, and we show that activation-level probes can substantially improve robustness to such attacks.

Keywords: adversarial fine-tuning, safety bypass, constitutional classifiers, GRPO-based reinforcement learning, capability degradation, CBRN queries, activation-level probes, LLM security

153. ❌ The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning

Authors: Yubo Li, Lu Zhang, Tianchong Jiang, Ramayya Krishnan, Rema Padman Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.29025v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

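Scoring rationale: The paper's core is the systematic failure of LLMs on reasoning tasks, especially when surface heuristics conflict with implicit constraints. Highly relevant keywords: 'Large Language Models' (the object of study), and 'Chain of Thought' and 'System 2 Thinking' (analysis of the reasoning process). Moderately relevant: 'Mechanistic Interpretability' (token-level attribution is used to analyze model behavior). Other keywords such as MoE, SFT, and RAG are not addressed.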

!!! tip deepseek-chat TL;DR

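The paper finds a systematic flaw in LLM reasoning: when a surface heuristic cue conflicts with an unstated feasibility constraint, models over-rely on the surface cue and ignore the underlying constraint. It demonstrates this heuristic override through a benchmark and causal-behavioral analysis.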

Abstract

Large language models systematically fail when a salient surface cue conflicts with an unstated feasibility constraint. We study this through a diagnose-measure-bridge-treat framework. Causal-behavioral analysis of the "car wash problem" across six models reveals approximately context-independent sigmoid heuristics: the distance cue exerts 8.7 to 38 times more influence than the goal, and token-level attribution shows patterns more consistent with keyword associations than compositional inference. The Heuristic Override Benchmark (HOB) – 500 instances spanning 4 heuristic by 5 constraint families with minimal pairs and explicitness gradients – demonstrates generality across 14 models: under strict evaluation (10/10 correct), no model exceeds 75%, and presence constraints are hardest (44%). A minimal hint (e.g., emphasizing the key object) recovers +15 pp on average, suggesting the failure lies in constraint inference rather than missing knowledge; 12/14 models perform worse when the constraint is removed (up to -39 pp), revealing conservative bias. Parametric probes confirm that the sigmoid pattern generalizes to cost, efficiency, and semantic-similarity heuristics; goal-decomposition prompting recovers +6 to 9 pp by forcing models to enumerate preconditions before answering. Together, these results characterize heuristic override as a systematic reasoning vulnerability and provide a benchmark for measuring progress toward resolving it.

Keywords: Large Language Models, reasoning failure, heuristic override, constraint inference, causal-behavioral analysis, benchmark evaluation, systematic vulnerability, compositional inference
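
The "strict evaluation (10/10 correct)" criterion in the abstract can be contrasted with ordinary per-trial accuracy in a few lines. This sketch assumes that strict evaluation counts an instance as solved only if every repeated trial on it is correct; the abstract does not spell out the protocol, and the function names and toy data are hypothetical.

```python
def strict_accuracy(results):
    """Fraction of instances whose repeated trials are ALL correct.

    results: one list of booleans per benchmark instance, one entry per trial.
    """
    return sum(all(trials) for trials in results) / len(results)


def mean_trial_accuracy(results):
    """Ordinary accuracy over all individual trials, for comparison."""
    total_trials = sum(len(trials) for trials in results)
    return sum(sum(trials) for trials in results) / total_trials


# Toy data: 4 instances, 10 trials each (illustrative, not from the paper).
results = [
    [True] * 10,
    [True] * 9 + [False],  # a single slip fails the instance under strict scoring
    [False] * 10,
    [True] * 10,
]
```

On this toy data the strict score is lower than the per-trial score, because the second instance is discarded after one failed trial.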

154. ❌ Human-Like Lifelong Memory: A Neuroscience-Grounded Architecture for Infinite Interaction

Authors: Diego C. Lerma-Torres Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.29023v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	8.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

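Scoring rationale: The paper directly targets the long-term memory and reasoning deficits of large language models (LLMs) and proposes a neuroscience-inspired memory architecture. Core relevant keywords: (1) 'Large Language Models' (10): LLMs are explicitly the object of study; (2) 'System 2 Thinking' (10): the paper discusses dual-process cognition in depth, in particular the System 2 escalation mechanism; (3) 'Context Window Extension' (8): the paper criticizes the limits of simply expanding the context window and proposes an alternative; (4) 'Hallucination Mitigation' (8): the paper addresses hallucination structurally through graded epistemic states. Other keywords such as MoE, SLMs, training methods, inference acceleration, and AI for Science are unrelated to the paper and score 0.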

!!! tip deepseek-chat TL;DR

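To address LLMs' lack of persistent structured memory and context-sensitive retrieval, the paper proposes a bio-inspired memory framework grounded in complementary learning systems theory, cognitive behavioral therapy's belief hierarchy, and dual-process cognition. Through emotional-associative summaries, default System 1 retrieval with System 2 escalation, and an active-encoding principle, the framework is designed so that interactions become cheaper as experience accumulates.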

Abstract

Large language models lack persistent, structured memory for long-term interaction and context-sensitive retrieval. Expanding context windows does not solve this: recent evidence shows that context length alone degrades reasoning by up to 85% - even with perfect retrieval. We propose a bio-inspired memory framework grounded in complementary learning systems theory, cognitive behavioral therapy’s belief hierarchy, dual-process cognition, and fuzzy-trace theory, organized around three principles: (1) Memory has valence, not just content - pre-computed emotional-associative summaries (valence vectors) organized in an emergent belief hierarchy inspired by Beck’s cognitive model enable instant orientation before deliberation; (2) Retrieval defaults to System 1 with System 2 escalation - automatic spreading activation and passive priming as default, with deliberate retrieval only when needed, and graded epistemic states that address hallucination structurally; and (3) Encoding is active, present, and feedback-dependent - a thalamic gateway tags and routes information between stores, while the executive forms gists through curiosity-driven investigation, not passive exposure. Seven functional properties specify what any implementation must satisfy. Over time, the system converges toward System 1 processing - the computational analog of clinical expertise - producing interactions that become cheaper, not more expensive, with experience.

Keywords: Large Language Models, Persistent Memory, Dual-Process Cognition, System 1 and System 2, Hallucination Mitigation, Cognitive Architecture, Neuroscience-Inspired AI, Long-term Interaction

155. ❌ Known Intents, New Combinations: Clause-Factorized Decoding for Compositional Multi-Intent Detection

Authors: Abhilash Nandy Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28929v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

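Scoring rationale: The paper focuses on compositional generalization in multi-intent detection and proposes a new benchmark, CoMIX-Shift, and a lightweight decoder, ClauseCompose. Its content concerns intent recognition and compositional generalization in natural language processing; it does not involve large models, the technical principles of deep learning, or AI applications in science. All keywords relate to large-model techniques, deep-learning innovations, or scientific applications of AI, whereas this paper studies a specific problem in a traditional NLP task and is unrelated to them.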

!!! tip deepseek-chat TL;DR

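The paper studies compositional generalization in multi-intent detection and proposes the CoMIX-Shift benchmark and the ClauseCompose decoder. Experiments show that simple factorization performs well on unseen intent combinations, while existing benchmarks fail to adequately evaluate compositional generalization.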

Abstract

Multi-intent detection papers usually ask whether a model can recover multiple intents from one utterance. We ask a harder and, for deployment, more useful question: can it recover new combinations of familiar intents? Existing benchmarks only weakly test this, because train and test often share the same broad co-occurrence patterns. We introduce CoMIX-Shift, a controlled benchmark built to stress compositional generalization in multi-intent detection through held-out intent pairs, discourse-pattern shift, longer and noisier wrappers, held-out clause templates, and zero-shot triples. We also present ClauseCompose, a lightweight decoder trained only on singleton intents, and compare it to whole-utterance baselines including a fine-tuned tiny BERT model. Across three random seeds, ClauseCompose reaches 95.7 exact match on unseen intent pairs, 93.9 on discourse-shifted pairs, 62.5 on longer/noisier pairs, 49.8 on held-out templates, and 91.1 on unseen triples. WholeMultiLabel reaches 81.4, 55.7, 18.8, 15.5, and 0.0; the BERT baseline reaches 91.5, 77.6, 48.9, 11.0, and 0.0. We also add a 240-example manually authored SNIPS-style compositional set with five held-out pairs; there, ClauseCompose reaches 97.5 exact match on unseen pairs and 86.7 under connector shift, compared with 41.3 and 10.4 for WholeMultiLabel. The results suggest that multi-intent detection needs more compositional evaluation, and that simple factorization goes surprisingly far once evaluation asks for it.

Keywords: multi-intent detection, compositional generalization, CoMIX-Shift benchmark, ClauseCompose decoder, intent pairs, exact match, factorized decoding, SNIPS-style dataset
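
The general idea of clause-factorized decoding with exact-match scoring can be illustrated with a toy example. This is not ClauseCompose itself, whose internals the abstract does not describe: the connector pattern, the keyword-based single-intent classifier, and the intent names are all hypothetical stand-ins chosen to keep the sketch self-contained.

```python
import re

# ASSUMPTION: clauses are separated by a small, fixed set of connectors.
CONNECTORS = re.compile(
    r"\s*(?:,|;|\band then\b|\band also\b|\band\b|\bthen\b)\s*",
    re.IGNORECASE,
)

# Hypothetical singleton-intent "classifier": keyword cues per intent.
KEYWORDS = {
    "play_music": ("play",),
    "set_alarm": ("alarm", "wake"),
    "get_weather": ("weather", "forecast"),
}


def classify_clause(clause):
    """Return one intent for one clause, or None if no cue matches."""
    text = clause.lower()
    for intent, cues in KEYWORDS.items():
        if any(cue in text for cue in cues):
            return intent
    return None


def decode(utterance):
    """Split into clauses, classify each independently, and union the results."""
    clauses = [c for c in CONNECTORS.split(utterance) if c.strip()]
    return {i for i in map(classify_clause, clauses) if i is not None}


def exact_match(predicted, gold):
    """Exact match: the predicted intent set must equal the gold set."""
    return set(predicted) == set(gold)
```

Because each clause is classified on its own, an intent pair that never co-occurred in training is handled no differently from a seen pair, which is the property the benchmark's held-out splits are built to test.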

156. ❌ From Consensus to Split Decisions: ABC-Stratified Sentiment in Holocaust Oral Histories

Authors: Daban Q. Jaff Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28913v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

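Scoring rationale: The paper evaluates the performance and stability of off-the-shelf sentiment classifiers in a specific domain (Holocaust oral histories), using pretrained transformer-based models for sentiment classification. Although pretrained transformers are involved, the core of the paper is evaluating existing models in a specific domain rather than developing new large-model techniques or applications. All keywords focus on large-model principles, training methods, optimization techniques, or specific scientific applications; this paper applies existing models for a domain evaluation and has no direct connection to them.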

!!! tip deepseek-chat TL;DR

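The paper evaluates three pretrained transformer sentiment classifiers on a corpus of Holocaust oral histories and proposes an agreement-based stability taxonomy (ABC) for analyzing patterns of inter-model disagreement. It finds that inter-model agreement is low to moderate and is driven mainly by boundary decisions around neutrality.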

Abstract

Polarity detection becomes substantially more challenging under domain shift, particularly in heterogeneous, long-form narratives with complex discourse structure, such as Holocaust oral histories. This paper presents a corpus-scale diagnostic study of off-the-shelf sentiment classifiers on long-form Holocaust oral histories, using three pretrained transformer-based polarity classifiers on a corpus of 107,305 utterances and 579,013 sentences. After assembling model outputs, we introduce an agreement-based stability taxonomy (ABC) to stratify inter-model output stability. We report pairwise percent agreement, Cohen kappa, Fleiss kappa, and row-normalized confusion matrices to localize systematic disagreement. As an auxiliary descriptive signal, a T5-based emotion classifier is applied to stratified samples from each agreement stratum to compare emotion distributions across strata. The combination of multi-model label triangulation and the ABC taxonomy provides a cautious, operational framework for characterizing where and how sentiment models diverge in sensitive historical narratives. Inter-model agreement is low to moderate overall and is driven primarily by boundary decisions around neutrality.

Keywords: sentiment classification, Holocaust oral histories, transformer models, model agreement, domain shift, stability taxonomy, polarity detection, multi-model evaluation
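
The agreement statistics named in the abstract (pairwise percent agreement, Cohen kappa, Fleiss kappa) follow standard definitions and can be computed with the standard library alone. The sketch below uses those textbook formulas; the function names and the six-item toy labels are illustrative and are not taken from the paper.

```python
from collections import Counter


def percent_agreement(a, b):
    """Share of items on which two label sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Chance-corrected agreement between two raters."""
    n = len(a)
    po = percent_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    # Expected agreement if both raters labelled independently.
    pe = sum(ca[label] * cb[label] for label in set(ca) | set(cb)) / (n * n)
    return 1.0 if pe == 1.0 else (po - pe) / (1.0 - pe)


def fleiss_kappa(ratings):
    """Chance-corrected agreement among several raters.

    ratings: one label sequence per rater, all of the same length.
    """
    n_raters = len(ratings)
    items = list(zip(*ratings))
    totals = Counter()
    p_bar = 0.0
    for item in items:
        counts = Counter(item)
        totals.update(counts)
        agree = sum(c * c for c in counts.values()) - n_raters
        p_bar += agree / (n_raters * (n_raters - 1))
    p_bar /= len(items)
    pe = sum((c / (len(items) * n_raters)) ** 2 for c in totals.values())
    return 1.0 if pe == 1.0 else (p_bar - pe) / (1.0 - pe)


# Toy polarity labels from two hypothetical classifiers.
model_a = ["pos", "neg", "neu", "neu", "pos", "neg"]
model_b = ["pos", "neg", "neu", "pos", "pos", "neu"]
```

In this toy data both disagreements involve the neutral class, the kind of boundary decision the abstract identifies as the main driver of disagreement.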

157. ❌ CrossTrace: A Cross-Domain Dataset of Grounded Scientific Reasoning Traces for Hypothesis Generation

Authors: Andrew Bouras (OMS-II Research Fellow) Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28924v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

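Scoring rationale: The paper's core contribution is CrossTrace, a cross-domain scientific reasoning dataset, which it uses to fine-tune Qwen2.5-7B-Instruct via QLoRA for hypothesis generation. Highly relevant keywords: (1) 'PEFT/LoRA/Parameter-efficient Fine-tuning' (10): the paper explicitly fine-tunes with QLoRA; (2) 'Chain of Thought/Multi-step Reasoning' (10): the dataset contains structured reasoning chains; (3) 'System 2 Thinking/In-depth Reasoning' (10): scientific hypothesis generation involves in-depth reasoning; (4) 'AI for Science/Bioinformatics' (10): the work applies to scientific discovery in biomedicine and AI/ML. Moderately relevant keywords: 'Large Language Models' (8): it uses a Qwen2.5 model; 'Post-training/SFT' (8): it involves fine-tuning; 'Hallucination Mitigation/Factuality' (8): it emphasizes grounded data and a low fabrication rate. The remaining keywords have no direct connection to the paper.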

!!! tip deepseek-chat TL;DR

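The study builds CrossTrace, which the author describes as the first large-scale cross-domain dataset of grounded scientific reasoning traces, and fine-tunes a large language model on it with QLoRA, substantially improving hypothesis-generation quality and cross-domain reasoning.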

Abstract

Scientific hypothesis generation is a critical bottleneck in accelerating research, yet existing datasets for training and evaluating hypothesis-generating models are limited to single domains and lack explicit reasoning traces connecting prior knowledge to novel contributions. I introduce CrossTrace, a dataset of 1,389 grounded scientific reasoning traces spanning biomedical research (518), AI/ML (605), and cross-domain work (266). Each trace captures the structured reasoning chain from established knowledge through intermediate logical steps to a novel hypothesis, with every step grounded in source paper text. I define an Input/Trace/Output schema that extends the Bit-Flip-Spark framework of HypoGen with step-level verification, a taxonomy of eight discovery patterns, and multi-domain coverage. Fine-tuning Qwen2.5-7B-Instruct on CrossTrace via QLoRA yields substantial improvements over the untuned baseline: IAScore rises from 0.828 to 0.968 (GPT-4o judge) and from 0.716 to 0.888 (Claude Opus 4.5), structural compliance improves from 0% to 100%, and spark cosine similarity increases from 0.221 to 0.620. Balanced cross-domain training (biomedical + AI/ML + CS) outperforms single-domain training, providing evidence that scientific reasoning patterns transfer across disciplines. Human validation of 150 stratified records confirms 99.7% step-level grounding accuracy and a 0.0% fabrication rate. To my knowledge, CrossTrace is the first large-scale, cross-domain dataset with step-level grounded reasoning traces for hypothesis generation, and my results demonstrate that such traces are an effective training signal whose benefits are at least partially domain-general.

Keywords: scientific reasoning, hypothesis generation, cross-domain dataset, grounded reasoning traces, QLoRA fine-tuning, Qwen2.5-7B-Instruct, structured reasoning chains, domain transfer

158. ❌ OptiMer: Optimal Distribution Vector Merging Is Better than Data Mixing for Continual Pre-Training

Authors: Haiyue Song, Masao Utiyama Venue/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28858v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	10.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

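Scoring rationale: The paper addresses continual pre-training (CPT) of large language models (LLMs) and proposes a method called OptiMer. OptiMer trains one CPT model per dataset, extracts a distribution vector from each, and uses Bayesian optimization to search post hoc for composition weights, which avoids the costly tuning of a data mixture ratio that must be fixed before training. The paper deals explicitly with LLMs and continual pre-training, so both keywords are highly relevant (10 points each). The method also builds on model merging, combining the distribution vectors of different models into a new model, so that keyword is highly relevant as well (10 points). The remaining keywords, such as MoE, SLMs, SFT, RAG, inference acceleration, and AI for Science, are not mentioned in the title or abstract and are unrelated to the paper, so they score 0.

!!! tip deepseek-chat TL;DR

The paper proposes OptiMer, which replaces the need to fix data mixture ratios before continual pre-training with post-hoc optimization over distribution vectors, achieving better performance on multiple language and domain tasks at lower search cost.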

Abstract (Chinese translation)

持续预训练被广泛用于使大语言模型适应目标语言和领域，然而训练数据的混合比例仍是一个敏感且调优成本高昂的超参数：它们必须在训练开始前确定，而次优的选择可能浪费数周的计算资源。本研究提出OptiMer方法，将比例选择与训练过程解耦：我们为每个数据集训练一个持续预训练模型，提取每个模型的分布向量（该向量表征了数据集引发的参数偏移），并通过贝叶斯优化进行事后优化以寻找最佳组合权重。在Gemma 3 27B模型上进行的跨语言（日语、中文）和跨领域（数学、代码）实验表明，OptiMer在搜索成本降低15-35倍的同时，持续优于数据混合和模型平均基线方法。关键发现包括：1）优化后的权重可解释为数据混合比例，使用这些比例重新训练能改进数据混合式持续预训练；2）同一向量池可根据特定目标重新优化而无需任何再训练，从而按需生成定制化模型。我们的工作证明，传统上作为预训练决策的数据混合比例选择，可被重新定义为基于分布向量的事后优化问题，这为持续预训练提供了更灵活的研究范式。

Abstract

Continual pre-training is widely used to adapt LLMs to target languages and domains, yet the mixture ratio of training data remains a sensitive hyperparameter that is expensive to tune: they must be fixed before training begins, and a suboptimal choice can waste weeks of compute. In this work, we propose OptiMer, which decouples ratio selection from training: we train one CPT model per dataset, extract each model’s distribution vector, which represents the parameter shift induced by that dataset, and search for optimal composition weights post-hoc via Bayesian optimization. Experiments on Gemma 3 27B across languages (Japanese, Chinese) and domains (Math, Code) show that OptiMer consistently outperforms data mixture and model averaging baselines with 15-35 times lower search cost. Key findings reveal that 1) the optimized weights can be interpreted as data mixture ratios, and retraining with these ratios improves data mixture CPT, and 2) the same vector pool can be re-optimized for a given objective without any retraining, producing target-tailored models on demand. Our work establishes that data mixture ratio selection, traditionally a pre-training decision, can be reformulated as a post-hoc optimization over distribution vectors, offering a more flexible paradigm for continual pre-training.
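The post-hoc composition described in the abstract can be illustrated with a minimal sketch. It assumes a distribution vector is simply the element-wise difference between a CPT model's parameters and the base model's parameters, and it swaps in plain random search where the paper uses Bayesian optimization. The function names, the toy parameter lists, the `[0, 1]` weight range, and the objective are all hypothetical and are not taken from the paper.

```python
import random

def distribution_vector(base, cpt):
    """Parameter shift induced by one dataset (assumed form: CPT minus base)."""
    return [c - b for b, c in zip(base, cpt)]

def merge(base, vectors, weights):
    """Add the weighted sum of distribution vectors onto the base parameters."""
    merged = list(base)
    for vec, w in zip(vectors, weights):
        merged = [m + w * v for m, v in zip(merged, vec)]
    return merged

def search_weights(base, vectors, objective, trials=200, seed=0):
    """Random search over weights in [0, 1]; a stand-in for Bayesian optimization."""
    rng = random.Random(seed)
    best_w, best_score = None, float("-inf")
    for _ in range(trials):
        w = [rng.random() for _ in vectors]
        score = objective(merge(base, vectors, w))
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score

# Toy setup: three "parameters", one CPT model per dataset (hypothetical values).
base = [0.0, 0.0, 0.0]
cpt_models = {"japanese": [1.0, 0.0, 0.0], "math": [0.0, 2.0, 0.0]}
vectors = [distribution_vector(base, m) for m in cpt_models.values()]

# Hypothetical objective: higher is better, peaks when the merged model hits `target`.
target = [0.5, 1.0, 0.0]

def objective(params):
    return -sum((p - t) ** 2 for p, t in zip(params, target))

best_w, best_score = search_weights(base, vectors, objective)
print(best_w, best_score)
```

No training happens inside the search loop: each trial only recombines the stored vectors and re-evaluates the objective, which is why the same vector pool can be re-optimized for a different objective without retraining.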

Keywords: Continual Pre-training, LLMs, Data Mixture Ratio, Distribution Vector, Bayesian Optimization, Model Merging, OptiMer, Parameter Shift

159. ❌ OneComp: One-Line Revolution for Generative AI Model Compression

Authors: Yuma Ichikawa, Keiji Kimura, Akihiro Yoshida, Yudai Fujimoto, Hiroki Tokura, Yamato Arai, Yoshiyuki Ishii, Yusei Kawakami, Genki Shikada, Achille Jacquemond, Yoshihiko Fujisawa, Katsuki Fujisawa, Takumi Honda, Akira Sakai Venue/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28845v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	15.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

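Scoring rationale: The paper's core contribution is a compression framework for foundation models, which is highly relevant to 'Quantization OR Model Compression OR Low-bit Weights' (15 points) because it studies model compression and quantization directly. The paper explicitly mentions 'foundation models' and 'post-training compression', so it is also highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points) and 'Post-training OR Supervised Fine-tuning OR SFT' (10 points). Compression eases deployment, which gives some relevance to 'Small Language Models OR SLMs OR On-device AI' (5 points) and 'Speculative Decoding OR Inference Acceleration' (5 points). Compression can be viewed as a form of parameter-efficient fine-tuning, which gives some relevance to 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' (5 points). The remaining keywords, such as MoE, Scaling Laws, RAG, and Alignment, have no direct connection to the paper and score 0.

!!! tip deepseek-chat TL;DR

OneComp is an open-source compression framework that uses automated mixed-precision quantization and progressive compression stages to address the memory, latency, and cost bottlenecks of deploying foundation models, bridging algorithmic innovation and production-grade deployment.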

Abstract (Chinese translation)

基础模型的部署日益受到内存占用、延迟和硬件成本的制约。训练后压缩技术可通过降低模型参数精度来缓解这些瓶颈，且不会显著影响性能；然而其实践应用仍面临挑战，从业者需要应对量化算法、精度预算、数据驱动的校准策略以及硬件相关的执行机制等碎片化技术生态。我们提出OneComp——一个开源压缩框架，将这一专家工作流程转化为可复现、资源自适应的处理管线。给定模型标识符与可用硬件，OneComp能自动分析模型结构，规划混合精度分配方案，并执行从逐层压缩、块级优化到全局优化的渐进式量化阶段。其核心架构设计是将首次量化生成的检查点作为可部署的基准锚点，确保后续每个优化阶段都在同一模型基础上进行改进，使得模型质量随着计算资源的投入而持续提升。通过将前沿压缩研究成果转化为可扩展、开源且硬件感知的流程化系统，OneComp在算法创新与生产级模型部署之间架起了桥梁。

Abstract

Deploying foundation models is increasingly constrained by memory footprint, latency, and hardware costs. Post-training compression can mitigate these bottlenecks by reducing the precision of model parameters without significantly degrading performance; however, its practical implementation remains challenging as practitioners navigate a fragmented landscape of quantization algorithms, precision budgets, data-driven calibration strategies, and hardware-dependent execution regimes. We present OneComp, an open-source compression framework that transforms this expert workflow into a reproducible, resource-adaptive pipeline. Given a model identifier and available hardware, OneComp automatically inspects the model, plans mixed-precision assignments, and executes progressive quantization stages, ranging from layer-wise compression to block-wise refinement and global refinement. A key architectural choice is treating the first quantized checkpoint as a deployable pivot, ensuring that each subsequent stage improves the same model and that quality increases as more compute is invested. By converting state-of-the-art compression research into an extensible, open-source, hardware-aware pipeline, OneComp bridges the gap between algorithmic innovation and production-grade model deployment.
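To make "planning mixed-precision assignments" concrete, here is a minimal sketch of one generic way to do it: symmetric round-to-nearest quantization plus a greedy planner that upgrades the layer with the largest quantization error while an average bit budget allows. This is not OneComp's algorithm or API. The function names, the candidate bit widths, the budget definition (which ignores layer sizes), and the toy layers are all assumptions made for illustration.

```python
def quantize(weights, bits):
    """Symmetric round-to-nearest quantization onto a signed `bits`-bit grid."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid a zero scale
    return [round(w / scale) * scale for w in weights]

def mse(a, b):
    """Mean squared error between two equally long weight lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def plan_bits(layers, avg_bits, choices=(2, 4, 8)):
    """Greedy plan: start every layer at the lowest width, then repeatedly
    upgrade the layer with the largest current error while the total number
    of per-layer bits stays within avg_bits * number_of_layers."""
    plan = {name: choices[0] for name in layers}
    budget = avg_bits * len(layers)
    while True:
        candidates = []
        for name, w in layers.items():
            idx = choices.index(plan[name])
            if idx + 1 < len(choices):
                extra = choices[idx + 1] - plan[name]
                if sum(plan.values()) + extra <= budget:
                    err = mse(w, quantize(w, plan[name]))
                    candidates.append((err, name, choices[idx + 1]))
        if not candidates:
            return plan
        _, name, new_bits = max(candidates)
        plan[name] = new_bits

# Toy layers (hypothetical): "b" is exactly representable at 2 bits, "a" is not.
layers = {
    "a": [0.9, -0.1, 0.4, 0.05],
    "b": [0.5, -0.5, 0.5, -0.5],
}
plan = plan_bits(layers, avg_bits=5)
print(plan)
```

In this toy case the planner spends the whole budget on layer `a`, because layer `b` loses nothing at 2 bits. A real planner would also weigh layer sizes, calibration data, and hardware-supported formats.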

Keywords: model compression, quantization, foundation models, post-training compression, mixed-precision, hardware-aware, deployment, open-source framework

160. ❌ ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning

Authors: Huanxuan Liao, Zhongtao Jiang, Yupu Hao, Yuqiao Tan, Shizhu He, Ben Wang, Jun Zhao, Kun Xu, Kang Liu Venue/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28610v2

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	8.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

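Scoring rationale: The paper focuses on input efficiency for multimodal large language models (MLLMs). Its core contribution is the ResAdapt framework, which reduces the number of visual tokens through adaptive resolution allocation in order to support longer contexts. Two keywords are highly relevant: 1) 'Large Language Models' (10 points), because the paper explicitly studies MLLMs, which fall under large models; 2) 'Context Window Extension' (8 points), because the paper addresses the context limits caused by visual token growth and supports more frames. The remaining keywords, such as MoE, SFT, and RAG, are not covered and score 0. The paper is an optimization of large-model techniques and fits the 'pure large-model technology' innovation named in the research background.

!!! tip deepseek-chat TL;DR

The paper proposes the ResAdapt framework, which reduces the number of input tokens to multimodal large language models by adaptively allocating visual budget. On video QA and image reasoning tasks it supports up to 16x more frames at the same visual budget while delivering a performance gain of over 15%.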

Abstract (Chinese translation)

多模态大语言模型（MLLMs）通过提升输入保真度实现了更强的视觉理解能力，但由此产生的视觉标记增长使得同时维持高空间分辨率与长时序上下文变得难以实现。我们认为瓶颈不在于编码后表示的压缩方式，而在于编码器接收的像素总量，并为此提出ResAdapt——一种输入侧自适应框架，该框架在编码前学习每帧图像应分配多少视觉预算。ResAdapt将轻量级分配器与未经修改的MLLM主干网络耦合，使主干网络在接收经操作变换的输入时，仍能保持其原有的视觉标记接口。我们将资源分配建模为上下文赌博机问题，并通过成本感知策略优化（Cost-Aware Policy Optimization, CAPO）训练分配器，该方法将稀疏的 rollout 反馈转化为稳定的精度-成本学习信号。在预算受控的视频问答、时序定位和图像推理任务中，ResAdapt提升了低预算操作点的性能，并常位于或接近效率-精度边界，在激进压缩下的推理密集型基准测试中提升最为显著。值得注意的是，在相同视觉预算下，ResAdapt可支持多达16倍帧数的处理，同时带来超过15%的性能提升。代码发布于https://github.com/Xnhyacinth/ResAdapt。

Abstract

Multimodal Large Language Models (MLLMs) achieve stronger visual understanding by scaling input fidelity, yet the resulting visual token growth makes jointly sustaining high spatial resolution and long temporal context prohibitive. We argue that the bottleneck lies not in how post-encoding representations are compressed but in the volume of pixels the encoder receives, and address it with ResAdapt, an Input-side adaptation framework that learns how much visual budget each frame should receive before encoding. ResAdapt couples a lightweight Allocator with an unchanged MLLM backbone, so the backbone retains its native visual-token interface while receiving an operator-transformed input. We formulate allocation as a contextual bandit and train the Allocator with Cost-Aware Policy Optimization (CAPO), which converts sparse rollout feedback into a stable accuracy-cost learning signal. Across budget-controlled video QA, temporal grounding, and image reasoning tasks, ResAdapt improves low-budget operating points and often lies on or near the efficiency-accuracy frontier, with the clearest gains on reasoning-intensive benchmarks under aggressive compression. Notably, ResAdapt supports up to 16x more frames at the same visual budget while delivering over 15% performance gain. Code is available at https://github.com/Xnhyacinth/ResAdapt.

Keywords: Multimodal Large Language Models, visual token reduction, adaptive resolution, context window extension, efficient inference, video QA, temporal grounding, input-side adaptation

161. ❌ Training data generation for context-dependent rubric-based short answer grading

Authors: Pavel Šindelář, Dávid Slivka, Christopher Bouma, Filip Prášil, Ondřej Bojar Venue/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28537v2

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

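Scoring rationale: The paper focuses on generating training data for automatic grading in education. It does not touch any of the scoring keywords, which cover large-model principles, training methods, inference optimization, alignment techniques, agent systems, and model compression. Although it mentions machine learning methods, it does not state whether large models are used, and its focus is on data generation rather than on AI techniques themselves.

!!! tip deepseek-chat TL;DR

The study explores methods for generating a large-scale training dataset from a small confidential dataset for automatic grading of student answers. Early experiments suggest that one of the methods may improve the training of grading models.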

Abstract (Chinese translation)

每四年，经济合作与发展组织（OECD）会开展国际学生评估项目（PISA）测试，以评估全球青少年学生的知识水平，并促进不同教育体系间的比较。然而，由于需要规避语言差异和评分者偏差，对学生答案进行评分颇具挑战。因此，探索学生答案自动评分方法具有重要意义。为训练其中需要机器学习的方法，或为无需机器学习的方法计算参数、选择超参数，都需要大量领域特定的数据。本研究探索了少量方法，仅以相对较小的保密数据集为参考，利用一组极为简单的衍生文本格式来保持数据机密性，从而创建大规模训练数据集。通过所提出的方法，我们成功构建了三个替代数据集，这些数据集至少在表面上比基于提示生成的直接结果更接近参考数据集。初步实验表明，其中一种方法可能还有助于提升自动答案评分模型的训练效果。

Abstract

Every four years, the PISA test is administered by the OECD to test the knowledge of teenage students worldwide and allow for comparisons of educational systems. However, having to avoid language differences and annotator bias makes the grading of student answers challenging. For these reasons, it would be interesting to consider methods of automatic student answer grading. To train some of these methods, which require machine learning, or to compute parameters or select hyperparameters for those that do not, a large amount of domain-specific data is needed. In this work, we explore a small number of methods for creating a large-scale training dataset using only a relatively small confidential dataset as a reference, leveraging a set of very simple derived text formats to preserve confidentiality. Using the proposed methods, we successfully created three surrogate datasets that are, at the very least, superficially more similar to the reference dataset than a straightforward result of prompt-based generation. Early experiments suggest one of these approaches might also lead to improved training of automatic answer grading models.

Keywords: automatic student answer grading, training data generation, PISA test, confidential dataset, surrogate datasets, machine learning, rubric-based grading, educational assessment

162. ❌ OmniRoam: World Wandering via Long-Horizon Panoramic Video Generation

Authors: Yuheng Liu, Xin Lin, Xinke Li, Baihan Yang, Chen Wang, Kalyan Sunkavalli, Yannick Hold-Geoffroy, Hao Tan, Kai Zhang, Xiaohui Xie, Zifan Shi, Yiwei Hu Venue/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.30045v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

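Scoring rationale: The paper 'OmniRoam: World Wandering via Long-Horizon Panoramic Video Generation' develops a panoramic video generation framework involving video generation, scene modeling, trajectory control, and 3D reconstruction. All scoring keywords relate directly to large language models (LLMs), the principles of deep learning techniques, or applications of AI in science, and the paper's core content has no direct connection to them. The paper mentions no large language models, model training methods (such as pre-training, fine-tuning, or alignment), inference optimization, agent systems, or scientific AI applications, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes the OmniRoam framework, which uses controllable panoramic video generation to address the limited scene completeness and global consistency of existing methods, enabling high-quality long-horizon scene wandering.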

Abstract (Chinese translation)

近年来，利用视频生成模型进行场景建模的研究兴趣日益增长。然而，现有方法大多依赖于透视视频模型，这些模型仅能合成场景的有限观测视角，导致完整性与全局一致性问题。我们提出了OmniRoam，一个可控的全景视频生成框架，它利用全景表征所具备的丰富单帧场景覆盖度与固有的长时空间及时间一致性，实现了长时程的场景漫游。我们的框架始于预览阶段，其中由轨迹控制的视频生成模型根据给定的输入图像或视频快速生成场景概览。随后，在细化阶段，该视频在时间上被延长、在空间上被上采样，从而生成长时程、高分辨率的视频，实现高保真度的世界漫游。为训练模型，我们引入了两个包含合成视频与真实世界捕捉视频的全景视频数据集。实验表明，无论在定性还是定量评估中，我们的框架在视觉质量、可控性以及长时场景一致性方面均持续优于现有先进方法。我们进一步展示了该框架的若干扩展应用，包括实时视频生成与三维重建。代码发布于https://github.com/yuhengliu02/OmniRoam。

Abstract

Modeling scenes using video generation models has garnered growing research interest in recent years. However, most existing approaches rely on perspective video models that synthesize only limited observations of a scene, leading to issues of completeness and global consistency. We propose OmniRoam, a controllable panoramic video generation framework that exploits the rich per-frame scene coverage and inherent long-term spatial and temporal consistency of panoramic representation, enabling long-horizon scene wandering. Our framework begins with a preview stage, where a trajectory-controlled video generation model creates a quick overview of the scene from a given input image or video. Then, in the refine stage, this video is temporally extended and spatially upsampled to produce long-range, high-resolution videos, thus enabling high-fidelity world wandering. To train our model, we introduce two panoramic video datasets that incorporate both synthetic and real-world captured videos. Experiments show that our framework consistently outperforms state-of-the-art methods in terms of visual quality, controllability, and long-term scene consistency, both qualitatively and quantitatively. We further showcase several extensions of this framework, including real-time video generation and 3D reconstruction. Code is available at https://github.com/yuhengliu02/OmniRoam.

Keywords: panoramic video generation, scene wandering, trajectory-controlled, long-horizon, spatial-temporal consistency, video upsampling, 3D reconstruction, controllable framework

163. ❌ Video Models Reason Early: Exploiting Plan Commitment for Maze Solving

Authors: Kaleb Newman, Tyler Zhu, Olga Russakovsky Venue/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.30043v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

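Scoring rationale: The paper studies how video diffusion models reason during maze solving, mainly analyzing their internal planning dynamics and reasoning ability. Most keywords do not apply because the paper concerns video models rather than language models. Relevant keywords: 1) 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' (8 points): the paper studies the multi-step reasoning ability of video models, involving path planning and chained generation; 2) 'System 2 Thinking OR Slow Thinking OR In-depth Reasoning' (8 points): the paper analyzes the in-depth reasoning process of video models, including early plan commitment and the reasoning threshold; 3) 'Mechanistic Interpretability OR Explainable AI' (10 points): the paper's core aim is to understand the internal reasoning mechanisms of video models, which falls under explainable AI. No other keyword is covered.

!!! tip deepseek-chat TL;DR

The paper studies the internal reasoning mechanisms of video diffusion models on maze solving. It finds that the models commit to a high-level motion plan within the first few denoising steps and that path length is the dominant predictor of maze difficulty, and it proposes ChEaP to improve accuracy on long-horizon mazes.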

Abstract (Chinese translation)

视频扩散模型展现出解决迷宫与谜题等涌现推理能力，但其在生成过程中的推理机制尚不明确。本研究以此为切入点，以二维迷宫求解为受控实验平台，首次探究视频模型的内部规划动态。我们的分析揭示了两项关键发现。第一项发现是早期规划固化：视频扩散模型在前几个去噪步骤内即固化高层运动规划，后续去噪仅改变视觉细节而不影响底层轨迹。第二项发现是路径长度（而非障碍物密度）成为迷宫难度的主导预测因子，且在12步处存在明显的失败阈值。这意味着视频模型仅能通过将多个连续生成过程链式拼接来处理长迷宫。为展示这些发现的实际价值，我们提出早期规划链式生成法（Chaining with Early Planning, ChEaP），该方法仅对具有潜力的早期规划种子投入计算资源，并通过链式拼接应对复杂迷宫。在Wan2.2-14B和HunyuanVideo-1.5模型上，该方法将长视野迷宫的准确率从7%提升至67%，在Frozen Lake和VR-Bench的困难任务中整体性能提高2.5倍。我们的分析表明，当前视频模型具备比既往认知更深刻的推理能力，通过改进推理时缩放策略可更可靠地激发这种能力。

Abstract

Video diffusion models exhibit emergent reasoning capabilities like solving mazes and puzzles, yet little is understood about how they reason during generation. We take a first step towards understanding this and study the internal planning dynamics of video models using 2D maze solving as a controlled testbed. Our investigations reveal two findings. Our first finding is early plan commitment: video diffusion models commit to a high-level motion plan within the first few denoising steps, after which further denoising alters visual details but not the underlying trajectory. Our second finding is that path length, not obstacle density, is the dominant predictor of maze difficulty, with a sharp failure threshold at 12 steps. This means video models can only reason over long mazes by chaining together multiple sequential generations. To demonstrate the practical benefits of our findings, we introduce Chaining with Early Planning, or ChEaP, which only spends compute on seeds with promising early plans and chains them together to tackle complex mazes. This improves accuracy from 7% to 67% on long-horizon mazes and by 2.5x overall on hard tasks in Frozen Lake and VR-Bench across Wan2.2-14B and HunyuanVideo-1.5. Our analysis reveals that current video models possess deeper reasoning capabilities than previously recognized, which can be elicited more reliably with better inference-time scaling.

Keywords: video diffusion models, maze solving, reasoning capabilities, early plan commitment, internal planning dynamics, Chaining with Early Planning, path length, denoising steps

164. ❌ Benchmarking PhD-Level Coding in 3D Geometric Computer Vision

Authors: Wenyi Li, Renkai Luo, Yue Yu, Huan-ang Gao, Mingju Gao, Li Yuan, Chaoyou Fu, Hao Zhao Venue/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.30038v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

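Scoring rationale: The paper evaluates the code generation ability of large models in 3D geometric computer vision. It is highly relevant to 'Large Language Models' (10 points) because it explicitly evaluates large models such as GPT-5. It has some relevance to 'AI for Science' (8 points) as an application of AI to scientific computing (3D vision). It is weakly relevant to 'Context Window Extension' (5 points) because it mentions challenges in long-context scientific comprehension. The remaining keywords, such as MoE, SFT, and RAG, are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper introduces the GeoCodeBench benchmark to evaluate large models on PhD-level code generation for 3D geometric computer vision. The best model, GPT-5, reaches only a 36.6% pass rate, revealing a large gap between current capabilities and dependable scientific coding.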

Abstract (Chinese translation)

人工智能辅助编码已迅速重塑软件实践与科研工作流程，然而当前模型在生成复杂三维几何视觉的正确代码方面仍面临困难。若模型能够可靠地编写此类代码，本领域的研究范式将发生根本性变革。为衡量实现该目标的进展，我们推出GeoCodeBench——一个博士层级的三维视觉编码评估基准。每个问题均为填空式函数实现任务，其内容选自近期顶级会议的代表性论文：我们首先通过工具从官方代码库中提取候选函数，随后经过严格的人工筛选，最终选定核心的三维几何组件。针对每个目标函数，我们生成多样化、涵盖边界情况的单元测试，从而实现全自动、可复现的评分。我们评估了八个具有代表性的开源与闭源模型，以反映当前生态系统的现状。表现最佳的GPT-5模型仅获得36.6%的通过率，这揭示了当前能力与可靠的三维科学编码需求之间存在的巨大差距。GeoCodeBench将任务组织为双层体系：通用三维能力（几何变换与力学/光学公式表达）与研究能力（新颖算法实现与几何逻辑路径规划）。这些维度的得分呈正相关，但研究导向型任务明显更为困难。上下文消融实验进一步表明“更多论文文本”并非总是有益：在方法论章节处截断输入文本的设定，在统计意义上优于输入完整论文，这凸显了长上下文科学理解中尚未解决的挑战。综合而言，这些发现确立了GeoCodeBench作为推动从通用编码迈向可信赖三维几何视觉编码的严谨测试平台地位。

Abstract

AI-assisted coding has rapidly reshaped software practice and research workflows, yet today’s models still struggle to produce correct code for complex 3D geometric vision. If models could reliably write such code, the research of our community would change substantially. To measure progress toward that goal, we introduce GeoCodeBench, a PhD-level benchmark that evaluates coding for 3D vision. Each problem is a fill-in-the-function implementation task curated from representative papers at recent venues: we first let a tool propose candidate functions from official repositories, then perform careful human screening to select core 3D geometric components. For every target, we generate diverse, edge-case unit tests, enabling fully automatic, reproducible scoring. We evaluate eight representative open- and closed-source models to reflect the current ecosystem. The best model, GPT-5, attains only 36.6% pass rate, revealing a large gap between current capabilities and dependable 3D scientific coding. GeoCodeBench organizes tasks into a two-level hierarchy: General 3D capability (geometric transformations and mechanics/optics formulation) and Research capability (novel algorithm implementation and geometric logic routing). Scores are positively correlated across these axes, but research-oriented tasks are markedly harder. Context ablations further show that “more paper text” is not always better: cutting off at the Method section statistically outperforms full-paper inputs, highlighting unresolved challenges in long-context scientific comprehension. Together, these findings position GeoCodeBench as a rigorous testbed for advancing from generic coding to trustworthy 3D geometric vision coding.
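The scoring idea in the abstract, fill-in-the-function tasks checked by unit tests, can be sketched as follows. This is only an illustration of test-based pass-rate scoring, not GeoCodeBench's harness: the 2D rotation task, the test cases, the tolerance, and the rule that a candidate passes only if every test passes are all assumptions.

```python
import math

def run_tests(candidate, tests, tol=1e-9):
    """Return True only if the candidate matches every expected output."""
    for args, expected in tests:
        try:
            got = candidate(*args)
        except Exception:
            return False  # a crash counts as a failure
        if len(got) != len(expected):
            return False
        if any(abs(g - e) > tol for g, e in zip(got, expected)):
            return False
    return True

def pass_rate(candidates, tests):
    """Fraction of candidate implementations that pass all unit tests."""
    results = [run_tests(c, tests) for c in candidates]
    return sum(results) / len(results)

# Hypothetical target function: rotate a 2D point counter-clockwise by theta radians.
tests = [
    ((1.0, 0.0, 0.0), (1.0, 0.0)),
    ((1.0, 0.0, math.pi / 2), (0.0, 1.0)),
    ((0.0, 0.0, 1.234), (0.0, 0.0)),  # edge case: the origin never moves
]

def correct(x, y, theta):
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)

def wrong_sign(x, y, theta):
    c, s = math.cos(theta), math.sin(theta)
    return (c * x + s * y, -s * x + c * y)  # rotates clockwise instead

rate = pass_rate([correct, wrong_sign], tests)
print(rate)
```

The sign error in `wrong_sign` is invisible on the zero-angle and origin tests and is caught only by the quarter-turn case, which shows why diverse, edge-case tests matter for geometric code.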

Keywords: AI-assisted coding, 3D geometric vision, PhD-level benchmark, GeoCodeBench, code generation, scientific coding, long-context comprehension, unit testing

165. ❌ Conditional Polarization Guidance for Camouflaged Object Detection

Authors: QIfan Zhang, Hao Wang, Xiangrong Qin, Ruijie Li Venue/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.30008v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
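
The table above can be read as a weighted sum that is compared against the pass threshold. The sketch below is a minimal illustration, not the report generator's actual code. The pass-score formula (5.0 + 0.8 × total weight) and the 15-point per-keyword cap come from the report's configuration section; the per-keyword formula (weight × relevance / 10 × 15) and the names `keyword_score`, `pass_threshold`, and `total_score` are assumptions.

```python
MAX_SCORE_PER_KEYWORD = 15.0  # from the report configuration


def pass_threshold(total_weight: float) -> float:
    """Pass score as stated in the report: 5.0 + 0.8 x total weight."""
    return 5.0 + 0.8 * total_weight


def keyword_score(weight: float, relevance: float) -> float:
    """ASSUMED formula: relevance (0-10) scaled to the 15-point cap, times weight."""
    return weight * (relevance / 10.0) * MAX_SCORE_PER_KEYWORD


def total_score(rows: list[tuple[float, float]]) -> float:
    """Sum the per-keyword scores over (weight, relevance) rows."""
    return sum(keyword_score(w, r) for w, r in rows)


# 27 keywords, each listed with weight 0.0 and relevance 0.0 as in the table above.
rows = [(0.0, 0.0)] * 27
threshold = pass_threshold(27.0)
print(round(threshold, 1), total_score(rows), total_score(rows) >= threshold)
```

Under this assumed formula, any row with weight 0.0 contributes 0.0 regardless of its relevance, which matches the tables in this section.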

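Rationale: The paper focuses on camouflaged object detection (COD) in computer vision and proposes a conditional polarization guidance network (CPGNet) that combines RGB and polarization information. It covers computer vision techniques such as visual encoders, feature fusion, edge detection, and frequency refinement, but it does not touch on large language models (LLMs), innovations in deep learning technical principles (e.g., MoE, Scaling Laws, PEFT), large-model applications (e.g., RAG, LLM Agents), or AI-for-science applications (e.g., Bioinformatics). All keywords relate to large models or deep learning technical principles, whereas this paper is purely a computer vision application study, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a conditional polarization guidance network (CPGNet) that dynamically modulates RGB features and applies a polarization edge-guided frequency refinement strategy to address the strong blending of targets with their backgrounds in camouflaged object detection, outperforming existing methods on multiple datasets.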

摘要翻译

伪装目标检测（COD）旨在识别与背景高度融合的目标。近期研究表明，偏振线索的光学特性对提升伪装目标检测性能具有重要作用。然而，现有基于偏振的方法大多依赖复杂的视觉编码器与融合机制，导致模型复杂度和计算开销增加，且未能充分探索偏振信息如何显式指导层次化RGB表征学习。为应对这些局限，本文提出CPGNet——一种非对称的RGB-偏振框架，通过引入条件偏振引导机制，显式调控面向伪装目标检测的RGB特征学习。具体而言，我们设计了一个轻量级偏振交互模块，以统一方式联合建模这些互补线索并生成可靠的偏振引导信息。与传统特征融合策略不同，所提出的条件引导机制利用偏振先验动态调制RGB特征，使网络能够聚焦于伪装目标与背景间的细微差异。此外，我们提出一种偏振边缘引导的频率优化策略，通过在偏振约束下增强高频分量，有效破解伪装模式。最后，我们开发了迭代反馈解码器，执行从粗到细的特征校准并逐步优化伪装预测。在多个任务的偏振数据集上的大量实验，以及对非偏振数据集的评估均表明，CPGNet持续优于现有先进方法。

摘要 (Abstract)

Camouflaged object detection (COD) aims to identify targets that are highly blended with their backgrounds. Recent works have shown that the optical characteristics of polarization cues play a significant role in improving camouflaged object detection. However, most existing polarization-based approaches depend on complex visual encoders and fusion mechanisms, leading to increased model complexity and computational overhead, while failing to fully explore how polarization can explicitly guide hierarchical RGB representation learning. To address these limitations, we propose CPGNet, an asymmetric RGB-polarization framework that introduces a conditional polarization guidance mechanism to explicitly regulate RGB feature learning for camouflaged object detection. Specifically, we design a lightweight polarization interaction module that jointly models these complementary cues and generates reliable polarization guidance in a unified manner. Unlike conventional feature fusion strategies, the proposed conditional guidance mechanism dynamically modulates RGB features using polarization priors, enabling the network to focus on subtle discrepancies between camouflaged objects and their backgrounds. Furthermore, we introduce a polarization edge-guided frequency refinement strategy that enhances high-frequency components under polarization constraints, effectively breaking camouflage patterns. Finally, we develop an iterative feedback decoder to perform coarse-to-fine feature calibration and progressively refine camouflage prediction. Extensive experiments on polarization datasets across multiple tasks, along with evaluations on non-polarization datasets, demonstrate that CPGNet consistently outperforms state-of-the-art methods.

Keywords: Camouflaged Object Detection, Polarization Cues, Conditional Polarization Guidance, RGB-Polarization Framework, Feature Modulation, Frequency Refinement, Iterative Feedback Decoder

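166. ❌ SurgNavAR: An Augmented Reality Surgical Navigation Framework for Optical See-Through Head Mounted Displays

Authors: Abdullah Thabit, Mohamed Benmahdjoub, Rafiuddin Jinabade, Hizirwan S. Salim, Marie-Lise C. van Veelen, Mark G. van Vledder, Eppo B. Wolvius, Theo van Walsum Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29990v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score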
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

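Rationale: The paper "SurgNavAR: An Augmented Reality Surgical Navigation Framework for Optical See-Through Head Mounted Displays" focuses on developing and evaluating an augmented reality (AR) surgical navigation system, involving computer vision, medical imaging, and surgical assistance. It is unrelated to the vast majority of keywords (large models, fine-tuning, inference acceleration, alignment, etc.), which all center on large language models (LLMs) and related techniques. The only possibly relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper is an application of AI in medicine (science), but it is not core bioinformatics or cheminformatics, so it receives 5 points (some relevance). All other keywords receive 0 (unrelated).

!!! tip deepseek-chat TL;DR

The paper proposes and evaluates an augmented reality surgical navigation framework for head-mounted displays, achieving millimeter-level tool calibration, registration, and targeting accuracy in phantom experiments and offering a configurable solution for diverse surgical applications.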

摘要翻译

配备头戴式显示器（HMD）的增强现实（AR）设备，能够在手术过程中将三维术前影像数据直接叠加至患者体表。若要将HMD-AR设备用作独立的手术导航系统，该设备需具备定位患者与手术器械、将术前影像数据与患者进行配准，并在术中实时可视化导航数据的能力。尽管其中部分技术已较为成熟，但将其集成于此种设备中仍过程繁琐，且需要特定知识与专业技能，这阻碍了该领域的科研进展。因此，本研究旨在提出并评估一种基于HMD的集成式AR手术导航框架，该框架可适配多种外科应用。该框架通过追踪附着于患者及手术器械上的二维图案作为参考标记，支持使用枢轴校准与基于参考的校准技术进行手术工具标定，并允许通过基于点的匹配及手动定位实现影像-患者配准。研究在HoloLens 2和Magic Leap 2两款HMD设备上，通过模拟体模设置中的两个外科应用案例（AR引导穿刺与肋骨骨折定位）对该框架的集成功能进行了评估。在两个应用案例中，该框架实现了平均1毫米的工具尖端校准精度、3毫米的配准精度以及低于5毫米的靶向精度。该框架为基于HMD的AR手术导航提供了一个易于使用且可配置的工具，可扩展并适配于众多外科应用。框架代码已公开于https://github.com/abdullahthabit/SurgNavAR。

摘要 (Abstract)

Augmented reality (AR) devices with head mounted displays (HMDs) facilitate the direct superimposition of 3D preoperative imaging data onto the patient during surgery. To use an HMD-AR device as a stand-alone surgical navigation system, the device should be able to locate the patient and surgical instruments, align preoperative imaging data with the patient, and visualize navigation data in real time during surgery. Whereas some of the technologies required for this are known, integration in such devices is cumbersome and requires specific knowledge and expertise, hampering scientific progress in this field. This work therefore aims to present and evaluate an integrated HMD-based AR surgical navigation framework that is adaptable to diverse surgical applications. The framework tracks 2D patterns as reference markers attached to the patient and surgical instruments. It allows for the calibration of surgical tools using pivot and reference-based calibration techniques. It enables image-to-patient registration using point-based matching and manual positioning. The integrated functionalities of the framework are evaluated on two HMD devices, the HoloLens 2 and Magic Leap 2, with two surgical use cases being evaluated in a phantom setup: AR-guided needle insertion and rib fracture localization. The framework was able to achieve a mean tooltip calibration accuracy of 1 mm, a registration accuracy of 3 mm, and a targeting accuracy below 5 mm on the two surgical use cases. The framework presents an easy-to-use configurable tool for HMD-based AR surgical navigation, which can be extended and adapted to many surgical applications. The framework is publicly available at https://github.com/abdullahthabit/SurgNavAR.

Keywords: Augmented Reality, Surgical Navigation, Head Mounted Display, Image-to-patient Registration, Tool Calibration, Medical Imaging, HoloLens 2, Magic Leap 2

167. ❌ Learning Structural-Functional Brain Representations through Multi-Scale Adaptive Graph Attention for Cognitive Insight

Authors: Badhan Mazumder, Sir-Lord Wiafe, Aline Kotoski, Vince D. Calhoun, Dong Hye Ye Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29967v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

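Rationale: The paper is in neuroscience and proposes MAGNet, a Transformer-style graph neural network framework that integrates brain structural and functional data to understand cognitive function. It is unrelated to the vast majority of keywords (large-model technical principles, training methods, inference optimization, alignment, agent systems, etc.), which specifically concern large-model techniques in natural language processing or general AI. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the study applies AI to science (specifically neuroscience/bioinformatics). However, the paper does not directly address large models or innovations in deep learning principles; it only uses a graph neural network (a deep learning architecture) as its method, so relevance is limited and it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The study proposes the Multi-scale Adaptive Graph Network (MAGNet) framework, which models structure-function interactions by integrating structural MRI and functional fMRI data, and validates its effectiveness for understanding cognitive function on the ABCD dataset.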

摘要翻译

理解大脑结构与功能如何相互作用是解释智力的关键，但对其进行联合建模具有挑战性，因为结构连接组与功能连接组捕捉了互补的组织层面。我们提出了多尺度自适应图网络（MAGNet），这是一种Transformer风格的图神经网络框架，能够自适应地学习结构-功能交互作用。MAGNet利用结构磁共振成像（structural MRI）中的基于源形态测量法提取区域间的形态学特征，并将其与静息态功能磁共振成像（resting-state fMRI）获得的功能网络连接性相融合。通过混合图整合直接与间接通路，局部-全局注意力机制优化连接重要性，同时联合损失函数在端到端预测中强制跨模态一致性并优化预测目标。在ABCD数据集上，MAGNet的表现优于相关基线模型，证明了其通过有效的多模态整合推进我们对认知功能的理解。

摘要 (Abstract)

Understanding how brain structure and function interact is key to explaining intelligence yet modeling them jointly is challenging as the structural and functional connectome capture complementary aspects of organization. We introduced Multi-scale Adaptive Graph Network (MAGNet), a Transformer-style graph neural network framework that adaptively learns structure-function interactions. MAGNet leverages source-based morphometry from structural MRI to extract inter-regional morphological features and fuses them with functional network connectivity from resting-state fMRI. A hybrid graph integrates direct and indirect pathways, while local-global attention refines connectivity importance and a joint loss simultaneously enforces cross-modal coherence and optimizes the prediction objective end-to-end. On the ABCD dataset, MAGNet outperformed relevant baselines, demonstrating effective multimodal integration for advancing our understanding of cognitive function.

Keywords: brain structure-function interaction, graph neural network, multimodal integration, cognitive function, resting-state fMRI, structural MRI, Transformer-style architecture, ABCD dataset

168. ❌ Scaling Video Pretraining for Surgical Foundation Models

Authors: Sicheng Lu, Zikai Xiao, Jianhui Wei, Danyu Sun, Qi Lu, Keli Hu, Yang Feng, Jian Wu, Zongxin Yang, Zuozhu Liu Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29966v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

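Rationale: The paper focuses on surgical video understanding in computer vision. It proposes SurgRec, a scalable pretraining framework (with SurgRec-MAE and SurgRec-JEPA variants), and builds a large multi-source surgical video dataset. Its core is applying visual pretraining techniques (such as MAE and JEPA) to medical video, which falls under "AI for Science", so it is highly relevant to "Pre-training OR Continual Pre-training OR Domain Adaptation" and "AI for Science OR Bioinformatics OR Cheminformatics" (10 points each). The paper discusses data scale, diversity, and reproducibility, which gives it some relevance to "Scaling Laws AND Data Quality" (5 points). It does not involve large language models (LLMs), reasoning, alignment, fine-tuning, agents, compression, or the remaining keywords, so all others receive 0.

!!! tip deepseek-chat TL;DR

To address limited data scale, insufficient procedural diversity, and inconsistent evaluation in surgical video understanding, the paper proposes SurgRec, a scalable and reproducible pretraining framework, and validates its superior downstream performance using a large multi-source surgical video dataset.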

摘要翻译

手术视频理解对于计算机辅助介入至关重要，然而现有的手术基础模型仍受限于数据规模有限、手术流程多样性不足以及评估标准不一致等问题，且往往缺乏可复现的训练流程。我们提出SurgRec，一种可扩展且可复现的手术视频理解预训练方案，并实例化为两种变体：SurgRec-MAE与SurgRec-JEPA。我们构建了一个包含10,535个视频、2.145亿帧的大规模多源数据集，涵盖内窥镜、腹腔镜、白内障及机器人手术等多种术式。基于此数据集，我们开发了具有平衡采样策略的统一预训练流程，并在16个下游数据集和四个临床领域中建立了标准化的可复现基准测试框架，采用一致的数据划分方法。在与多种自监督学习基线及视觉-语言模型的广泛对比中，SurgRec在所有下游数据集上均展现出卓越性能。相比之下，视觉-语言模型在细粒度时序识别任务中表现不可靠，既存在性能差距，也对提示词表述极为敏感。本研究为学界提供了一个可复现、可扩展的基础框架，以推动更具普适性的手术视频模型发展。所有代码、模型与数据将公开发布。

摘要 (Abstract)

Surgical video understanding is essential for computer-assisted interventions, yet existing surgical foundation models remain constrained by limited data scale, procedural diversity, and inconsistent evaluation, often lacking a reproducible training pipeline. We propose SurgRec, a scalable and reproducible pretraining recipe for surgical video understanding, instantiated with two variants: SurgRec-MAE and SurgRec-JEPA. We curate a large multi-source corpus of 10,535 videos and 214.5M frames spanning endoscopy, laparoscopy, cataract, and robotic surgery. Building on this corpus, we develop a unified pretraining pipeline with balanced sampling and standardize a reproducible benchmark across 16 downstream datasets and four clinical domains with consistent data splits. Across extensive comparisons against SSL baselines and vision-language models, SurgRec consistently achieves superior performance across downstream datasets. In contrast, VLMs prove unreliable for fine-grained temporal recognition, exhibiting both performance gaps and sensitivity to prompt phrasing. Our work provides a reproducible, scalable foundation for the community to build more general surgical video models. All code, models, and data will be publicly released.

Keywords: Surgical video understanding, Foundation models, Pretraining, SurgRec, MAE, JEPA, Computer-assisted interventions, Reproducible benchmark
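
The abstract mentions balanced sampling over a multi-source corpus but gives no details. One common way to balance sources, shown here purely as an illustrative assumption and not as SurgRec's method, is to give every source the same total sampling mass and split it evenly among that source's videos. The function name, source names, and video IDs are hypothetical.

```python
import random
from collections import Counter


def balanced_weights(video_sources: dict[str, str]) -> dict[str, float]:
    """Map video id -> sampling probability so every source has equal total mass."""
    per_source = Counter(video_sources.values())
    n_sources = len(per_source)
    return {
        vid: 1.0 / (n_sources * per_source[src])
        for vid, src in video_sources.items()
    }


# Hypothetical corpus: three laparoscopy videos, one cataract video.
corpus = {"lap_1": "laparoscopy", "lap_2": "laparoscopy",
          "lap_3": "laparoscopy", "cat_1": "cataract"}
weights = balanced_weights(corpus)

rng = random.Random(0)  # fixed seed so the draw is reproducible
batch = rng.choices(list(weights), weights=list(weights.values()), k=8)
print(weights, batch)
```

With these weights the single cataract video is drawn as often, in expectation, as the three laparoscopy videos combined, so a small source is not drowned out by a large one.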

169. ❌ SurgTEMP: Temporal-Aware Surgical Video Question Answering with Text-guided Visual Memory for Laparoscopic Cholecystectomy

Authors: Shi Li, Vinkle Srivastav, Nicolas Chanel, Saurav Sharma, Nabani Banik, Lorenzo Arboit, Kun Yuan, Pietro Mascagni, Nicolas Padoy Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29962v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

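Rationale: The paper proposes SurgTEMP, a multimodal LLM framework for surgical video question answering. Its core is applying large language models (LLMs) to AI problems in surgery, so it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points). The work applies large models to science (specifically medicine/bioinformatics), so it is also highly relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (10 points). The paper does not involve the specific techniques behind the other keywords (e.g., MoE, SFT, RAG, reasoning methods, model optimization), so those receive 0.

!!! tip deepseek-chat TL;DR

Targeting challenges in surgical video question answering such as neglected temporal semantics and low visual contrast, the study proposes SurgTEMP, a multimodal LLM framework that combines query-guided token selection with Surgical Competency Progression training, and builds the large CholeVidQA-32K dataset, substantially improving performance across multiple evaluation tasks.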

摘要翻译

外科手术本身具有高度复杂性和风险性，需要丰富的专业知识与持续专注力以应对不断变化的术中场景。计算机辅助系统如手术视觉问答（VQA）为医学教育和术中支持提供了潜在解决方案。当前手术VQA研究主要集中于静态帧分析，忽略了丰富的时序语义信息。手术视频问答面临更多挑战：视觉对比度低、高度依赖专业知识、分析需求多样且涉及分散的时间窗口，以及从基础感知到高级术中评估的层次化特性。为应对这些挑战，我们提出SurgTEMP——一个多模态大语言模型框架，其特点包括：（i）查询引导的令牌选择模块，可构建层次化视觉记忆（空间与时间记忆库）；（ii）手术能力递进（SCP）训练方案。这些组件共同实现了对可变长度手术视频的有效建模，同时保留手术相关线索与时间连贯性，并能更好地支持多样化的下游评估任务。为促进模型开发，我们构建了CholeVidQA-32K数据集，该手术视频问答数据集包含3.2万个开放式问答对及3,855个视频片段（总计约128小时），均来源于腹腔镜胆囊切除术。数据集按三个层次组织——感知层、评估层与推理层，涵盖从器械/操作/解剖结构感知到安全关键视野（CVS）、术中难度、技能熟练度及不良事件评估等11项任务。在与最先进的开源多模态及视频大语言模型（微调与零样本设置）的综合对比评估中，SurgTEMP实现了显著的性能提升，推动了基于视频的手术VQA技术发展。

摘要 (Abstract)

Surgical procedures are inherently complex and risky, requiring extensive expertise and constant focus to well navigate evolving intraoperative scenes. Computer-assisted systems such as surgical visual question answering (VQA) offer promises for education and intraoperative support. Current surgical VQA research largely focuses on static frame analysis, overlooking rich temporal semantics. Surgical video question answering is further challenged by low visual contrast, its highly knowledge-driven nature, diverse analytical needs spanning scattered temporal windows, and the hierarchy from basic perception to high-level intraoperative assessment. To address these challenges, we propose SurgTEMP, a multimodal LLM framework featuring (i) a query-guided token selection module that builds hierarchical visual memory (spatial and temporal memory banks) and (ii) a Surgical Competency Progression (SCP) training scheme. Together, these components enable effective modeling of variable-length surgical videos while preserving procedure-relevant cues and temporal coherence, and better support diverse downstream assessment tasks. To support model development, we introduce CholeVidQA-32K, a surgical video question answering dataset comprising 32K open-ended QA pairs and 3,855 video segments (approximately 128 h total) from laparoscopic cholecystectomy. The dataset is organized into a three-level hierarchy – Perception, Assessment, and Reasoning – spanning 11 tasks from instrument/action/anatomy perception to Critical View of Safety (CVS), intraoperative difficulty, skill proficiency, and adverse event assessment. In comprehensive evaluations against state-of-the-art open-source multimodal and video LLMs (fine-tuned and zero-shot), SurgTEMP achieves substantial performance improvements, advancing the state of video-based surgical VQA.

Keywords: Surgical Video Question Answering, Multimodal LLM, Temporal Awareness, Visual Memory, Surgical Competency Progression, Laparoscopic Cholecystectomy, CholeVidQA-32K dataset, Hierarchical Assessment

170. ❌ Detecting Unknown Objects via Energy-based Separation for Open World Object Detection

Authors: Jun-Woo Heo, Keonhee Park, Gyeong-Moon Park Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29954v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

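Rationale: The paper focuses on Open World Object Detection (OWOD) in computer vision and proposes DEUS, an energy-based separation framework for detecting unknown objects and mitigating catastrophic forgetting. It is unrelated to all scoring keywords, which center on large models, deep learning technical principles, and their applications in science; it does not involve large models, language models, training techniques, reasoning methods, agent systems, or AI for Science.

!!! tip deepseek-chat TL;DR

The paper proposes DEUS, a new framework that uses energy-based separation to address unknown object detection and catastrophic forgetting in open world object detection, substantially improving unknown detection on benchmarks while remaining competitive on known classes.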

摘要翻译

本研究致力于解决开放世界目标检测（Open World Object Detection, OWOD）问题。这一挑战性场景要求检测器能够增量式地学习对已知目标进行分类而不遗忘，同时在没有监督的情况下识别未知目标。以往的OWOD方法通过增强未知目标发现过程以及采用记忆回放策略来缓解灾难性遗忘。然而，由于现有方法严重依赖检测器对已知类别的预测来检测未知目标，它们难以有效学习和识别未知目标的表征。此外，尽管记忆回放减轻了对旧类别的遗忘，却常常牺牲对新学习类别的知识掌握。为克服这些局限，我们提出DEUS（基于能量分离的未知目标检测），一种应对开放世界目标检测挑战的新框架。DEUS包含等角紧框架子空间未知分离模块（Equiangular Tight Frame (ETF)-Subspace Unknown Separation, EUS）与基于能量的已知类别区分损失（Energy-based Known Distinction, EKD）。EUS利用基于ETF的几何特性构建正交子空间，从而在已知与未知目标表征之间实现更清晰的分离。与先前仅考虑已知空间的基于能量的方法不同，EUS同时利用两个空间的能量以更好地捕捉未知目标的独特模式。此外，EKD损失强制分离先前分类器与当前分类器，从而在记忆回放过程中最大限度地减少先前学习类别与新学习类别之间的知识干扰。我们在OWOD基准测试上对DEUS进行了全面验证，结果表明其在保持已知类别检测竞争力的同时，于未知目标检测方面取得了显著的性能提升。

摘要 (Abstract)

In this work, we tackle the problem of Open World Object Detection (OWOD). This challenging scenario requires the detector to incrementally learn to classify known objects without forgetting while identifying unknown objects without supervision. Previous OWOD methods have enhanced the unknown discovery process and employed memory replay to mitigate catastrophic forgetting. However, since existing methods heavily rely on the detector’s known class predictions for detecting unknown objects, they struggle to effectively learn and recognize unknown object representations. Moreover, while memory replay mitigates forgetting of old classes, it often sacrifices the knowledge of newly learned classes. To resolve these limitations, we propose DEUS (Detecting Unknowns via energy-based Separation), a novel framework that addresses the challenges of Open World Object Detection. DEUS consists of Equiangular Tight Frame (ETF)-Subspace Unknown Separation (EUS) and an Energy-based Known Distinction (EKD) loss. EUS leverages ETF-based geometric properties to create orthogonal subspaces, enabling cleaner separation between known and unknown object representations. Unlike prior energy-based approaches that consider only the known space, EUS utilizes energies from both spaces to better capture distinct patterns of unknown objects. Furthermore, EKD loss enforces the separation between previous and current classifiers, thus minimizing knowledge interference between previous and newly learned classes during memory replay. We thoroughly validate DEUS on OWOD benchmarks, demonstrating outstanding performance improvements in unknown detection while maintaining competitive known class performance.

Keywords: Open World Object Detection, OWOD, unknown object detection, energy-based separation, catastrophic forgetting, Equiangular Tight Frame, ETF, memory replay
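
Two generic building blocks named in the abstract can be illustrated in isolation: a simplex equiangular tight frame (ETF), whose unit vectors all share the same pairwise cosine of -1/(K-1), and the log-sum-exp free-energy score commonly used in energy-based detection. This is a minimal sketch of those standard definitions only. It is not the DEUS implementation, and the function names and the temperature parameter are assumptions.

```python
import math


def simplex_etf(k: int) -> list[list[float]]:
    """Return k unit vectors in R^k with pairwise cosine -1/(k-1)."""
    if k < 2:
        raise ValueError("a simplex ETF needs at least 2 vectors")
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def dot(a: list[float], b: list[float]) -> float:
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def free_energy(logits: list[float], temperature: float = 1.0) -> float:
    """E(x) = -T * log(sum_i exp(logit_i / T)), computed with a max shift for stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    return -temperature * (m + math.log(sum(math.exp(z - m) for z in scaled)))


etf = simplex_etf(4)
print(dot(etf[0], etf[0]), dot(etf[0], etf[1]))  # close to 1 and -1/3
print(free_energy([5.0, 0.0, 0.0]), free_energy([0.0, 0.0, 0.0]))
```

A confident logit vector yields a lower (more negative) energy than a flat one, which is the property energy-based detectors typically threshold on.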

171. ❌ EC-Bench: Enumeration and Counting Benchmark for Ultra-Long Videos

Authors: Fumihiko Tsuchiya, Taiki Miyanishi, Mahiro Ukai, Nakamasa Inoue, Shuhei Kurita, Yusuke Iwasawa, Yutaka Matsuo Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29943v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

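Rationale: The paper studies multimodal large language models (MLLMs) on enumeration and counting tasks in ultra-long videos, an application of large models to computer vision. It is highly relevant to "Large Language Models" (8 points) because it explicitly evaluates 22 MLLMs. The reasoning keywords (Chain of Thought, System 2 Thinking) have some relevance (5 points) because the paper involves temporal and quantitative reasoning over long videos, but these techniques are not its core subject. Other keywords such as MoE, quantization, and alignment are unrelated (0 points).

!!! tip deepseek-chat TL;DR

The paper introduces the EC-Bench benchmark to evaluate enumeration and counting by multimodal large language models in ultra-long videos. The best current model reaches only 29.98% accuracy on enumeration and 23.74% on counting, far below human performance (78.57% and 82.97%), revealing fundamental limitations of MLLMs in long-video quantitative reasoning.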

摘要翻译

长视频计数是计算机视觉领域中一个基础但尚未被充分探索的挑战。现实世界的录像通常长达数十分钟甚至更久，且包含稀疏、多样的事件，这使得长程时序推理尤为困难。然而，现有的大多数视频计数基准测试都聚焦于短视频片段，并且仅评估最终的数字答案，未能深入揭示模型应计数什么内容，或者模型是否能在时间上持续识别相关实例。我们推出了EC-Bench，这是一个用于评估长视频中枚举、计数及时序证据定位的联合基准。EC-Bench包含152个时长超过30分钟的视频以及1,699个带有明确证据时间段的查询。在22个多模态大语言模型（MLLMs）的测试中，最佳模型在枚举任务上仅达到29.98%的准确率，在计数任务上为23.74%，而人类的表现分别达到78.57%和82.97%。我们的分析揭示了枚举准确率、时序定位与计数性能之间的强关联性。这些结果凸显了当前MLLMs的根本性局限，并将EC-Bench确立为长视频定量推理领域一个具有挑战性的基准。

摘要 (Abstract)

Counting in long videos remains a fundamental yet underexplored challenge in computer vision. Real-world recordings often span tens of minutes or longer and contain sparse, diverse events, making long-range temporal reasoning particularly difficult. However, most existing video counting benchmarks focus on short clips and evaluate only the final numerical answer, providing little insight into what should be counted or whether models consistently identify relevant instances across time. We introduce EC-Bench, a benchmark that jointly evaluates enumeration, counting, and temporal evidence grounding in long-form videos. EC-Bench contains 152 videos longer than 30 minutes and 1,699 queries paired with explicit evidence spans. Across 22 multimodal large language models (MLLMs), the best model achieves only 29.98% accuracy on Enumeration and 23.74% on Counting, while human performance reaches 78.57% and 82.97%, respectively. Our analysis reveals strong relationships between enumeration accuracy, temporal grounding, and counting performance. These results highlight fundamental limitations of current MLLMs and establish EC-Bench as a challenging benchmark for long-form quantitative video reasoning.

Keywords: video counting, long-form videos, multimodal large language models, temporal reasoning, enumeration, benchmark, MLLMs, quantitative reasoning
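
The abstract pairs each query with explicit evidence spans but does not state how grounding is scored. A common choice for comparing a predicted time span with an annotated one is temporal intersection-over-union (IoU). The sketch below illustrates that generic metric under that assumption and is not EC-Bench's evaluation code. Spans are (start, end) pairs in seconds, and the example values are made up.

```python
def temporal_iou(pred: tuple[float, float], gold: tuple[float, float]) -> float:
    """Intersection-over-union of two time intervals given as (start, end)."""
    (ps, pe), (gs, ge) = pred, gold
    if pe < ps or ge < gs:
        raise ValueError("each span must satisfy start <= end")
    intersection = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical spans: prediction 100-160 s, annotation 120-180 s.
print(temporal_iou((100.0, 160.0), (120.0, 180.0)))  # 40 / 80 = 0.5
print(temporal_iou((0.0, 10.0), (20.0, 30.0)))       # disjoint spans -> 0.0
```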

172. ❌ Better than Average: Spatially-Aware Aggregation of Segmentation Uncertainty Improves Downstream Performance

Authors: Vanessa Emanuela Guarino, Claudia Winklmayr, Jannik Franzen, Josef Lorenz Rumberger, Manuel Pfeuffer, Sonja Greven, Klaus Maier-Hein, Carsten T. Lüth, Christoph Karg, Dagmar Kainmueller Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29941v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

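Scoring rationale: The paper studies aggregation strategies for uncertainty quantification in image segmentation, which belongs to computer vision and medical image analysis. It is entirely unrelated to the vast majority of the keywords (which cover large-model technical principles, training methods, inference optimization, alignment techniques, and so on). The only relevant keyword is ‘AI for Science OR Bioinformatics OR Cheminformatics’: the paper explicitly mentions applications in safety-critical domains such as biomedical image analysis, which falls under AI for Science, but this is not its core contribution, so it receives 5 points (some relevance).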

!!! tip deepseek-chat TL;DR

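The paper studies how different aggregation strategies perform in uncertainty quantification for image segmentation, finds that aggregators exploiting spatial structure improve downstream task performance, and proposes a meta-aggregator that is robust across datasets.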

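Abstract Translation

Uncertainty quantification is crucial for ensuring the reliability of automated image segmentation in safety-critical domains such as biomedical image analysis and autonomous driving. In segmentation, uncertainty quantification produces pixel-wise uncertainty scores that must be aggregated into image-level scores for downstream tasks such as out-of-distribution detection or failure detection. Although aggregation strategies are used routinely, their properties and their impact on downstream task performance have not been studied comprehensively. Global averaging is the default choice, but it ignores the spatial and structural characteristics of segmentation uncertainty. Alternative strategies based on patches, classes, and thresholds exist but have not been compared systematically, which leads to inconsistent reported results and unclear best practices. We fill this gap by (1) systematically analyzing the properties, limitations, and pitfalls of common strategies; (2) proposing new strategies that incorporate spatial uncertainty structure; and (3) benchmarking out-of-distribution detection and failure detection on ten datasets that differ in image geometry and structure. We find that aggregators exploiting spatial structure perform better on both downstream tasks. However, the performance of an individual aggregator depends heavily on dataset characteristics, so we (4) propose a meta-aggregator that integrates multiple aggregators and remains robust across datasets.

Abstract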

Uncertainty Quantification (UQ) is crucial for ensuring the reliability of automated image segmentations in safety-critical domains like biomedical image analysis or autonomous driving. In segmentation, UQ generates pixel-wise uncertainty scores that must be aggregated into image-level scores for downstream tasks like Out-of-Distribution (OoD) or failure detection. Despite routine use of aggregation strategies, their properties and impact on downstream task performance have not yet been comprehensively studied. Global Average is the default choice, yet it does not account for spatial and structural features of segmentation uncertainty. Alternatives like patch-, class- and threshold-based strategies exist, but lack systematic comparison, leading to inconsistent reporting and unclear best practices. We address this gap by (1) formally analyzing properties, limitations, and pitfalls of common strategies; (2) proposing novel strategies that incorporate spatial uncertainty structure and (3) benchmarking their performance on OoD and failure detection across ten datasets that vary in image geometry and structure. We find that aggregators leveraging spatial structure yield stronger performance in both downstream tasks studied. However, the performance of individual aggregators depends heavily on dataset characteristics, so we (4) propose a meta-aggregator that integrates multiple aggregators and performs robustly across datasets.

Keywords: Uncertainty Quantification, Image Segmentation, Aggregation Strategies, Spatial Structure, Out-of-Distribution Detection, Failure Detection, Biomedical Image Analysis, Benchmarking
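
To make the aggregation step described in the abstract more concrete, the following minimal Python sketch contrasts a global average with one patch-based and one threshold-based aggregator on a toy uncertainty map. The function names, the patch size, the threshold, and the toy map are illustrative assumptions; they are not the strategies or settings evaluated in the paper.

```python
import numpy as np

def global_average(u):
    """Default aggregator: mean of all pixel-wise uncertainty scores."""
    return float(u.mean())

def max_patch_average(u, patch=4):
    """Patch-based aggregator (assumed form): highest mean over
    non-overlapping patch x patch tiles."""
    h, w = u.shape
    h, w = h - h % patch, w - w % patch  # crop so the tiles divide evenly
    tiles = u[:h, :w].reshape(h // patch, patch, w // patch, patch)
    return float(tiles.mean(axis=(1, 3)).max())

def threshold_average(u, tau=0.5):
    """Threshold-based aggregator (assumed form): mean over the pixels
    whose uncertainty exceeds tau, or 0.0 if none do."""
    mask = u > tau
    return float(u[mask].mean()) if mask.any() else 0.0

# Toy 8x8 map: confident background with one compact uncertain region.
u = np.full((8, 8), 0.05)
u[0:4, 0:4] = 0.9

scores = {
    "global": global_average(u),
    "max_patch": max_patch_average(u),
    "threshold": threshold_average(u),
}
print(scores)
```

On this toy map the global average is diluted by the confident background, while the two alternatives respond to the compact high-uncertainty region, which is the kind of spatial effect the abstract attributes to structure-aware aggregation.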

173. ❌ Gloria: Consistent Character Video Generation via Content Anchors

Authors: Yuhang Yang, Fan Zhang, Huaijin Pi, Shuai Guo, Guowei Xu, Wei Zhai, Yang Cao, Zheng-Jun Zha Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29931v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

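Scoring rationale: The paper “Gloria: Consistent Character Video Generation via Content Anchors” focuses on video generation in computer vision, specifically character-consistent video generation. Its core contribution is to use anchor frames to keep character appearance consistent, with mechanisms such as Superset Content Anchoring and RoPE as Weak Condition to resolve multi-reference conflicts. Although the research background mentions “applications of large models and deep learning in science”, the paper's content is entirely about video generation and touches none of the keyword areas: language models, model training methods (pre-training, fine-tuning, alignment), inference optimization, agent systems, model compression, or AI-for-science applications. All keywords relate to language models, model training techniques, or specific AI application domains, whereas this paper is a pure computer vision and video generation work, so the relevance of every keyword is 0.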

!!! tip deepseek-chat TL;DR

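The paper addresses the challenge of appearance consistency in long-duration character video generation. By proposing an anchor-frame-based approach and supporting mechanisms, it generates high-quality, multi-view-consistent character videos longer than 10 minutes.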

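Abstract Translation

Digital characters are central to modern media, yet generating character videos with long duration, consistent multi-view appearance, and expressive identity remains challenging. Existing methods either fail to provide enough context to preserve identity or rely on non-character-centric information as memory, which leads to suboptimal consistency. We recognize that character video generation is essentially an “outside-looking-in” scenario. In this paper, we propose to represent a character's visual attributes with a compact set of anchor frames. This design provides a stable reference for consistency, but reference-based video generation itself faces the challenges of copy-pasting and multi-reference conflicts. To address them, we introduce two mechanisms: Superset Content Anchoring, which provides cues from inside and outside the training clip to prevent duplication, and RoPE as Weak Condition, which encodes positional offsets to distinguish multiple anchors. In addition, we build a scalable pipeline for extracting these anchors from massive video collections. Experiments show that our method generates high-quality character videos longer than 10 minutes and achieves expressive identity and appearance consistency across views, surpassing existing methods.

Abstract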

Digital characters are central to modern media, yet generating character videos with long duration, consistent multi-view appearance, and expressive identity remains challenging. Existing approaches either provide insufficient context to preserve identity or leverage non-character-centric information as the memory, leading to suboptimal consistency. Recognizing that character video generation inherently resembles an outside-looking-in scenario, we propose representing the character's visual attributes through a compact set of anchor frames. This design provides stable references for consistency, while reference-based video generation inherently faces challenges of copy-pasting and multi-reference conflicts. To address these, we introduce two mechanisms: Superset Content Anchoring, providing intra- and extra-training clip cues to prevent duplication, and RoPE as Weak Condition, encoding positional offsets to distinguish multiple anchors. Furthermore, we construct a scalable pipeline to extract these anchors from massive videos. Experiments show our method generates high-quality character videos exceeding 10 minutes, and achieves expressive identity and appearance consistency across views, surpassing existing methods.

Keywords: character video generation, consistency, anchor frames, multi-view appearance, Superset Content Anchoring, RoPE, video generation pipeline, long-duration video

174. ❌ Abstraction in Style

Authors: Min Lu, Yuanfeng He, Anthony Chen, Jianhuang He, Pu Wang, Daniel Cohen-Or, Hui Huang Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29924v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

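Scoring rationale: The paper “Abstraction in Style” focuses on artistic style transfer in computer vision and graphics, proposing a generative framework that separates structural abstraction from visual stylization. It covers image processing, generative models, and style transfer, but not large language models (LLMs), innovations in the technical principles of deep learning, or applications of AI in science. All scoring keywords relate to large language models, deep learning techniques, or AI-for-science applications and are entirely unrelated to this paper's computer vision and graphics topic, so every keyword receives a relevance of 0.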

!!! tip deepseek-chat TL;DR

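Conventional style transfer methods struggle to capture the deeper abstraction behavior in artistic styles. To address this, the paper proposes AiS, a two-stage generative framework that separates structural abstraction from visual stylization, enabling broader, more controllable, and more expressive stylistic transformations.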

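Abstract Translation

Artistic styles often carry abstraction that goes beyond surface appearance, involving a deliberate reinterpretation of structure rather than mere changes in texture or color. Conventional style transfer methods usually preserve the geometry of the input image and therefore struggle to capture this deeper abstraction behavior, especially for illustrative and non-photorealistic styles. In this work, we propose Abstraction in Style (AiS), a generative framework that separates structural abstraction from visual stylization. Given a target image and a few style exemplars, AiS first derives an intermediate abstraction proxy that reinterprets the structure of the target image according to the abstraction logic the style exhibits. The proxy captures semantic structure while relaxing geometric fidelity, so that subsequent stylization operates on an abstracted representation rather than on the original image. In the second stage, the abstraction proxy is rendered to produce the final stylized output, which remains visually consistent with the reference style. Both stages are implemented through a shared image-space analogy, so the transformations can be learned from visual exemplars without explicit geometric supervision. By decoupling abstraction from appearance and treating abstraction as an explicit, transferable process, AiS supports a wider range of stylistic transformations, improves controllability, and achieves more expressive stylization.

Abstract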

Artistic styles often embed abstraction beyond surface appearance, involving deliberate reinterpretation of structure rather than mere changes in texture or color. Conventional style transfer methods typically preserve the input geometry and therefore struggle to capture this deeper abstraction behavior, especially for illustrative and nonphotorealistic styles. In this work, we introduce Abstraction in Style (AiS), a generative framework that separates structural abstraction from visual stylization. Given a target image and a small set of style exemplars, AiS first derives an intermediate abstraction proxy that reinterprets the target’s structure in accordance with the abstraction logic exhibited by the style. The proxy captures semantic structure while relaxing geometric fidelity, enabling subsequent stylization to operate on an abstracted representation rather than the original image. In a second stage, the abstraction proxy is rendered to produce the final stylized output, preserving visual coherence with the reference style. Both stages are implemented using a shared image space analogy, enabling transformations to be learned from visual exemplars without explicit geometric supervision. By decoupling abstraction from appearance and treating abstraction as an explicit, transferable process, AiS supports a wider range of stylistic transformations, improves controllability, and enables more expressive stylization.

Keywords: style transfer, abstraction, generative framework, structural abstraction, visual stylization, image space analogy, nonphotorealistic rendering, artistic style

175. ❌ Diffusion-Based Feature Denoising with NNMF for Robust handwritten digit multi-class classification

Authors: Hiba Adil Al-kharsan, Róbert Rajkó Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29917v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

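Scoring rationale: The paper focuses on the computer vision task of handwritten digit classification, using techniques such as CNNs, NNMF, and diffusion models, but it does not touch any of the keyword areas such as large language models (LLMs), innovations in the technical principles of deep learning, or AI for Science. The keywords all concern large models, deep learning principles, or AI-for-science applications, none of which applies to this paper, so all scores are 0.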

!!! tip deepseek-chat TL;DR

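The paper proposes a multi-class handwritten digit classification method that combines diffusion-based feature denoising with a hybrid feature representation, and shows that it is effective and robust in both baseline and adversarial-attack settings.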

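Abstract Translation

This study presents a robust multi-class classification framework for handwritten digit recognition that combines diffusion-driven feature denoising with a hybrid feature representation. Inspired by our earlier work on brain tumor classification, the proposed method operates in feature space to improve robustness to noise and adversarial attacks. First, the input images are converted into a compact, interpretable representation by Nonnegative Matrix Factorization (NNMF). In parallel, deep features are extracted with a convolutional neural network (CNN). These complementary features are fused into a unified hybrid representation. To strengthen robustness, a diffusion operation that gradually adds Gaussian noise is applied in feature space, and a feature denoising network is trained to reverse the process and reconstruct clean feature representations from perturbed inputs. The refined features are then used for multi-class classification. The method is evaluated in both baseline and adversarial settings (the latter using AutoAttack). Experimental results show that the diffusion-based hybrid model maintains strong classification performance while being more effective and more robust than the CNN baseline models. These results demonstrate the effectiveness of a feature-level diffusion defense for reliable multi-class handwritten digit classification.

Abstract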

This work presents a robust multi-class classification framework for handwritten digits that combines diffusion-driven feature denoising with a hybrid feature representation. Inspired by our previous work on brain tumor classification, the proposed approach operates in feature space to improve robustness to noise and adversarial attacks. First, the input images are converted into a compact, interpretable representation using Nonnegative Matrix Factorization (NNMF). In parallel, deep features are extracted using a convolutional neural network (CNN). These complementary features are combined into a unified hybrid representation. To improve robustness, a stepwise diffusion operation is applied in the feature space by gradually adding Gaussian noise. A feature denoiser network is trained to reverse this operation and rebuild clean representations from perturbed inputs. The refined features are then used for multi-class classification. The proposed method is evaluated in both baseline and adversarial settings using AutoAttack. The experimental results show that the diffusion-based hybrid model is both effective and robust, outperforming the CNN baseline models while maintaining strong classification performance. These results demonstrate the effectiveness of feature-level diffusion defense for reliable multi-class handwritten digit classification.

Keywords: handwritten digit classification, diffusion-based feature denoising, Nonnegative Matrix Factorization, CNN, hybrid feature representation, robustness, adversarial attacks, AutoAttack
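
The abstract describes gradually adding Gaussian noise in feature space and training a denoiser to reverse it. A minimal sketch of the forward (noising) half is shown below. It assumes a DDPM-style variance schedule, which the abstract does not specify; the schedule values, the function names, and the random stand-in for the fused NNMF and CNN features are all hypothetical.

```python
import numpy as np

def make_alpha_bar(num_steps=10, beta_start=1e-4, beta_end=0.2):
    """Cumulative signal-retention factors for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def noise_features(x0, t, alpha_bar, rng):
    """Jump straight to step t: x_t = sqrt(a_t) * x0 + sqrt(1 - a_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()

# Stand-in for a batch of 4 fused feature vectors of dimension 16.
x0 = rng.random((4, 16))
t = 9
x_t, eps = noise_features(x0, t, alpha_bar, rng)

# With the true noise known, the clean features can be recovered exactly;
# a trained denoiser would have to estimate eps (or x0) from x_t alone.
x0_rec = (x_t - np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
print(np.allclose(x0_rec, x0))
```

In a training loop under these assumptions, pairs of `x_t` and `eps` would serve as inputs and regression targets for the denoiser network, and the classifier would consume the denoised features.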

176. ❌ Toward Generalizable Whole Brain Representations with High-Resolution Light-Sheet Data

Authors: Minyoung E. Kim, Dae Hee Yun, Aditi V. Patel, Madeline Hon, Webster Guan, Taegeon Lee, Brian Nguyen Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29842v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

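Scoring rationale: The paper focuses on biomedical imaging, in particular the processing and analysis of high-resolution light-sheet microscopy data, and presents a benchmark dataset named CANVAS. It is entirely unrelated to most of the keywords (large-model technical principles, training methods, inference optimization, alignment, agent systems, and so on), because those keywords refer to techniques in natural language processing or general artificial intelligence. The only relevant keyword is ‘AI for Science OR Bioinformatics OR Cheminformatics’, since the work applies AI in bioinformatics and science. However, the paper does not explore technical innovation in AI models in depth; it concentrates on the data benchmark and on showing the generalization problems of existing models, so it receives 5 points (some relevance).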

!!! tip deepseek-chat TL;DR

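High-resolution whole-brain light-sheet microscopy data lack scalable analysis methods. To address this, the paper presents CANVAS, the first large-scale benchmark dataset of mouse brain tissue imaged at subcellular level, and shows that existing vision models struggle to generalize to such data.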

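Abstract Translation

Subcellular-resolution whole-brain 3D microscopy data are revealing biological structures in unprecedented detail, thanks to recent advances in intact tissue processing and light-sheet fluorescence microscopy (LSFM). These volumetric data provide rich information on cell morphology and spatial organization. However, the lack of scalable data processing and analysis methods for such petabyte-scale data poses a major challenge for accurate interpretation. In addition, existing models for vision tasks such as object detection and classification struggle to generalize to this kind of data. To accelerate the development of suitable methods and foundation models, we present CANVAS, a set of high-resolution whole mouse brain LSFM benchmark data covering six neuronal and immune cell-type markers, together with cell annotations and an evaluation leaderboard. We also demonstrate the generalization challenges faced by baseline models built on existing architectures, which stem in particular from the heterogeneity of cell morphology across phenotypes and anatomical locations in the brain. To the best of our knowledge, CANVAS is the first and largest LSFM benchmark that captures intact mouse brain tissue at subcellular level and includes extensive whole-brain cell annotations.

Abstract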

Unprecedented visual details of biological structures are being revealed by subcellular-resolution whole-brain 3D microscopy data, enabled by recent advances in intact tissue processing and light-sheet fluorescence microscopy (LSFM). These volumetric data offer rich morphological and spatial cellular information; however, the lack of scalable data processing and analysis methods tailored to these petabyte-scale data poses a substantial challenge for accurate interpretation. Further, existing models for visual tasks such as object detection and classification struggle to generalize to this type of data. To accelerate the development of suitable methods and foundational models, we present CANVAS, a comprehensive set of high-resolution whole mouse brain LSFM benchmark data, encompassing six neuronal and immune cell-type markers, along with cell annotations and a leaderboard. We also demonstrate challenges in generalization of baseline models built on existing architectures, especially due to the heterogeneity in cellular morphology across phenotypes and anatomical locations in the brain. To the best of our knowledge, CANVAS is the first and largest LSFM benchmark that captures intact mouse brain tissue at subcellular level, and includes extensive annotations of cells throughout the brain.

Keywords: whole-brain imaging, light-sheet fluorescence microscopy, benchmark data, cell annotation, generalization challenge, subcellular resolution, mouse brain, morphological heterogeneity

177. ❌ AutoFormBench: Benchmark Dataset for Automating Form Understanding

Authors: Gaurab Baral, Junxiu Zhou Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29832v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

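Scoring rationale: The paper benchmarks form element detection using classical computer vision methods (OpenCV) and YOLO architectures. It does not involve large models, innovations in the technical principles of deep learning, or applications of AI in science. All keywords concern large-model techniques, deep learning innovations, or AI-for-science applications, whereas this paper studies a specific document-processing task and is entirely unrelated to them.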

!!! tip deepseek-chat TL;DR

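The paper introduces AutoFormBench, a benchmark dataset for evaluating form element detection models, and finds experimentally that YOLOv11 performs best at detecting form elements such as checkboxes, input lines, and text boxes.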

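Abstract Translation

Automated processing of structured documents such as government forms, healthcare records, and enterprise invoices remains a persistent challenge because of the high layout variability found in real-world settings. This paper introduces AutoFormBench, a benchmark dataset of 407 annotated real-world forms spanning the government, healthcare, and enterprise domains, designed for training and evaluating form element detection models. We systematically compare classical OpenCV methods with four YOLO architectures (YOLOv8, YOLOv11, YOLOv26-s, and YOLOv26-l) on localizing and classifying fillable form elements, specifically checkboxes, input lines, and text boxes, across different types of PDF documents. Experiments show that YOLOv11 delivers consistently superior performance in both F1 score and Jaccard accuracy across all element classes and tolerance levels.

Abstract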

Automated processing of structured documents such as government forms, healthcare records, and enterprise invoices remains a persistent challenge due to the high degree of layout variability encountered in real-world settings. This paper introduces AutoFormBench, a benchmark dataset of 407 annotated real-world forms spanning government, healthcare, and enterprise domains, designed to train and evaluate form element detection models. We present a systematic comparison of classical OpenCV approaches and four YOLO architectures (YOLOv8, YOLOv11, YOLOv26-s, and YOLOv26-l) for localizing and classifying fillable form elements, specifically checkboxes, input lines, and text boxes, across diverse PDF document types. YOLOv11 demonstrates consistently superior performance in both F1 score and Jaccard accuracy across all element classes and tolerance levels.

Keywords: form understanding, benchmark dataset, document processing, YOLO architectures, form element detection, computer vision, PDF documents, automated form processing
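
The abstract reports F1 score and Jaccard accuracy at several tolerance levels without defining them. As a rough sketch of how such detection metrics are commonly computed, the Python below calculates the Jaccard index (IoU) of axis-aligned boxes and an F1 score under greedy one-to-one matching. Treating a tolerance level as an IoU threshold, the matching rule, and the example boxes are assumptions rather than the paper's protocol.

```python
def box_iou(a, b):
    """Jaccard index of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def f1_at_iou(preds, truths, thresh=0.5):
    """Greedy matching: each prediction claims its best unmatched ground-truth
    box and counts as a true positive when their IoU is at least thresh."""
    unmatched = list(truths)
    tp = 0
    for p in preds:
        best = max(unmatched, key=lambda t: box_iou(p, t), default=None)
        if best is not None and box_iou(p, best) >= thresh:
            unmatched.remove(best)
            tp += 1
    fp, fn = len(preds) - tp, len(unmatched)
    total = 2 * tp + fp + fn
    return 2 * tp / total if total else 1.0

# Hypothetical checkbox boxes: one near-hit and one spurious detection.
truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(1, 1, 10, 10), (50, 50, 60, 60)]

print(box_iou(preds[0], truths[0]))
print(f1_at_iou(preds, truths, thresh=0.5))
print(f1_at_iou(preds, truths, thresh=0.9))
```

Raising the threshold turns the near-hit into a miss, which is why results in this style of evaluation are usually reported at more than one tolerance.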

178. ❌ Multi-Feature Fusion Approach for Generative AI Images Detection

Authors: Abderrezzaq Sendjasni, Mohamed-Chaker Larabi Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29788v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

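Scoring rationale: The paper studies a multi-feature fusion method for detecting generative-AI images, which mainly involves computer vision, image processing, and generative adversarial network (GAN) detection. It has no direct connection to any scoring keyword: (1) it does not involve model architectures such as large language models, small language models, or MoE; (2) it does not discuss training methods such as pre-training, fine-tuning, alignment, or RLHF; (3) it does not involve techniques such as inference acceleration, quantization, or context extension; (4) it does not involve applications such as chain of thought, agents, or tool use; (5) although it involves AI, it belongs to generated-image detection rather than AI applications in science. All keywords are scored 0.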

!!! tip deepseek-chat TL;DR

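The paper proposes a framework that fuses multiple features (MSCN, CLIP embeddings, MLBP) to improve the detection of generative-AI images; experiments show that it outperforms existing methods on several benchmark datasets.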

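Abstract Translation

The rapid development of generative AI (GenAI) models has produced synthetic images of unprecedented realism, challenging the ability of traditional methods to distinguish them from natural photographs. Existing detectors usually rely on a single feature space, such as statistical regularities, semantic embeddings, or texture patterns, but these approaches often lack robustness when faced with diverse and evolving generative models. This study investigates and systematically evaluates a multi-feature fusion framework that integrates complementary cues from three distinct spaces: (1) Mean Subtracted Contrast Normalized (MSCN) features, which capture low-level statistical deviations; (2) CLIP embeddings, which encode high-level semantic coherence; and (3) Multi-scale Local Binary Patterns (MLBP), which characterize mid-level texture anomalies. Through extensive experiments on four benchmark datasets covering a wide range of generative models, we find that single feature spaces show significant performance variation across generators. Crucially, fusing all three representations yields better and more stable performance, especially in a challenging mixed-model scenario. Compared with state-of-the-art methods, the proposed framework achieves consistent improvements on all evaluated datasets. Overall, this study underlines the importance of hybrid representations for robust GenAI image detection and provides a principled framework for integrating complementary visual cues.

Abstract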

The rapid evolution of Generative AI (GenAI) models has led to synthetic images of unprecedented realism, challenging traditional methods for distinguishing them from natural photographs. While existing detectors often rely on single-feature spaces, such as statistical regularities, semantic embeddings, or texture patterns, these approaches tend to lack robustness when confronted with diverse and evolving generative models. In this work, we investigate and systematically evaluate a multi-feature fusion framework that combines complementary cues from three distinct spaces: (1) Mean Subtracted Contrast Normalized (MSCN) features capturing low-level statistical deviations; (2) CLIP embeddings encoding high-level semantic coherence; and (3) Multi-scale Local Binary Patterns (MLBP) characterizing mid-level texture anomalies. Through extensive experiments on four benchmark datasets covering a wide range of generative models, we show that individual feature spaces exhibit significant performance variability across different generators. Crucially, the fusion of all three representations yields superior and more consistent performance, particularly in a challenging mixed-model scenario. Compared to state-of-the-art methods, the proposed framework yields consistently improved performance across all evaluated datasets. Overall, this work highlights the importance of hybrid representations for robust GenAI image detection and provides a principled framework for integrating complementary visual cues.

Keywords: Generative AI, Image Detection, Multi-feature Fusion, MSCN, CLIP Embeddings, Local Binary Patterns, Synthetic Images, Robust Detection
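
Of the three feature spaces named in the abstract, MSCN is the easiest to show in a few lines. MSCN coefficients are commonly defined as (I − μ) / (σ + C), where μ and σ are the local mean and local standard deviation around each pixel. The sketch below is a simplified illustration and not the paper's feature extractor: it uses a uniform box window where the usual formulation uses a Gaussian-weighted one, and the window size and constant `c` are assumed values.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def mscn(image, window=7, c=1.0):
    """Mean Subtracted Contrast Normalized coefficients with a box window.

    window must be odd; c stabilizes the division in flat regions.
    """
    img = np.asarray(image, dtype=np.float64)
    pad = window // 2
    padded = np.pad(img, pad, mode="reflect")
    # One (window x window) neighborhood per pixel of the original image.
    patches = sliding_window_view(padded, (window, window))
    mu = patches.mean(axis=(-2, -1))
    sigma = patches.std(axis=(-2, -1))
    return (img - mu) / (sigma + c)

rng = np.random.default_rng(0)
textured = rng.integers(0, 256, size=(16, 16)).astype(np.float64)
flat = np.full((16, 16), 128.0)

coeffs_textured = mscn(textured)
coeffs_flat = mscn(flat)
print(coeffs_textured.shape, float(np.abs(coeffs_flat).max()))
```

A detector in this style would typically summarize the coefficient map with a few statistics (for example, histogram or distribution-fit parameters) before fusing it with the other feature vectors; that step is omitted here.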

179. ❌ MAPLE: Multi-Path Adaptive Propagation with Level-Aware Embeddings for Hierarchical Multi-Label Image Classification

Authors: Boshko Koloski, Marjan Stoimchev, Jurica Levatić, Dragi Kocev, Sašo Džeroski Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29784v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

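Scoring rationale: MAPLE focuses on hierarchical multi-label image classification in computer vision, in particular remote sensing image analysis. It uses graph convolutional networks (GCNs), multi-modal fusion, and adaptive losses; its core is the application of deep learning to image classification rather than large language models (LLMs) or related techniques. Therefore, apart from ‘AI for Science OR Bioinformatics OR Cheminformatics’ (scored 5, because remote sensing is a scientific application, though not bioinformatics or cheminformatics), all other keywords (LLMs, MoE, Scaling Laws, RLHF, RAG, and so on) are entirely unrelated to the paper and scored 0. The weighted total is computed as 5.0 (only one keyword is relevant).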

!!! tip deepseek-chat TL;DR

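The paper proposes the MAPLE framework, which integrates hierarchical semantic initialization, graph-based structure encoding, and adaptive multi-modal fusion to address hierarchical multi-label classification of remote sensing images in multi-path settings. It achieves improvements of up to 42% in few-shot regimes while adding only 2.6% parameter overhead.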

摘要翻译

层次多标签分类（HMLC）对于遥感中结构化标签依赖关系的建模至关重要。然而，现有方法在多路径场景中面临挑战，即图像可能激活多个分类学分支，导致层次信息利用不足。我们提出MAPLE（基于层级感知嵌入的多路径自适应传播框架），该框架整合了：（i）源自图感知文本描述的层次语义初始化，（ii）通过图卷积网络（GCNs）实现的基于图的结构编码，以及（iii）动态平衡语义先验与视觉证据的自适应多模态融合机制。一种自适应的层级感知目标函数能自动为每个层次级别选择合适的损失函数。在基于CORINE标准的遥感数据集（AID、DFC-15和MLRSNet）上的评估表明，该方法在少样本场景下性能持续提升最高达+42%，同时仅增加2.6%的参数开销，证明MAPLE能够为地球观测（EO）任务高效且有效地建模层次语义。

Abstract

Hierarchical multi-label classification (HMLC) is essential for modeling structured label dependencies in remote sensing. Yet existing approaches struggle in multi-path settings, where images may activate multiple taxonomic branches, leading to underuse of hierarchical information. We propose MAPLE (Multi-Path Adaptive Propagation with Level-Aware Embeddings), a framework that integrates (i) hierarchical semantic initialization from graph-aware textual descriptions, (ii) graph-based structure encoding via graph convolutional networks (GCNs), and (iii) adaptive multi-modal fusion that dynamically balances semantic priors and visual evidence. An adaptive level-aware objective automatically selects appropriate losses per hierarchy level. Evaluations on CORINE-aligned remote sensing datasets (AID, DFC-15, and MLRSNet) show consistent improvements of up to +42% in few-shot regimes while adding only 2.6% parameter overhead, demonstrating that MAPLE effectively and efficiently models hierarchical semantics for Earth observation (EO).

Keywords: Hierarchical multi-label classification, Remote sensing, Graph convolutional networks, Multi-path adaptive propagation, Level-aware embeddings, Few-shot learning, Earth observation, Multi-modal fusion
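The abstract mentions graph-based structure encoding via GCNs. For readers unfamiliar with the idea, here is a minimal sketch of one standard graph convolution step, ReLU(Â H W), where Â is the symmetrically normalized adjacency with self-loops. This is the generic textbook layer, not MAPLE's actual architecture; the function names and the toy two-node label graph are assumptions made for illustration.

```python
import math


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square adjacency matrix."""
    n = len(adj)
    # Add self-loops so each node keeps its own features
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adj, features, weight):
    """One graph convolution: aggregate neighbor features, project, apply ReLU."""
    propagated = matmul(normalize_adjacency(adj), features)
    projected = matmul(propagated, weight)
    return [[max(0.0, v) for v in row] for row in projected]


# Toy label graph (hypothetical): a parent label connected to one child label
adj = [[0.0, 1.0], [1.0, 0.0]]
features = [[1.0], [3.0]]   # one feature per label node
weight = [[1.0]]            # identity projection
print(gcn_layer(adj, features, weight))
```

In a label hierarchy, each node's updated embedding mixes in its parent's and children's embeddings, which is how structural information reaches the classifier.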

180. ❌ SHIFT: Stochastic Hidden-Trajectory Deflection for Removing Diffusion-based Watermark

Authors: Rui Bao, Zheng Gao, Xiaoyu Li, Xiaoyan Feng, Yang Song, Jiaojiao Jiang Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29742v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

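Scoring rationale: The paper studies SHIFT, an attack on diffusion-model watermarks, and focuses on security vulnerabilities of diffusion models and techniques for attacking them. It is entirely unrelated to the scoring keywords (all of which concern the principles, training methods, inference optimization, alignment, and applications of large models and deep learning) and has no overlap with them.

!!! tip deepseek-chat TL;DR

The paper proposes SHIFT, which uses stochastic diffusion resampling to deflect the generative trajectory in latent space, effectively attacking a range of diffusion-model watermarking schemes with 95%–100% attack success rates while preserving visual quality.

Abstract (Chinese translation)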

基于扩散模型的水印方法通过操控初始噪声或逆向扩散轨迹嵌入可验证标记。然而，这些方法共享一个关键假设：仅当扩散轨迹能够被忠实重建时，验证才能成功。这种对轨迹恢复的依赖构成了一个根本性且可利用的漏洞。我们提出随机隐藏轨迹偏转攻击（$\mathbf{SHIFT}$），这是一种无需训练的攻击方法，它利用了不同水印范式中的这一共同弱点。SHIFT利用随机扩散重采样在潜在空间中偏转生成轨迹，使得重建图像在统计上与原始水印嵌入轨迹解耦，同时保持优异的视觉质量和语义一致性。在涵盖噪声空间、频域和基于优化范式的九种代表性水印方法上进行的大量实验表明，SHIFT实现了95%至100%的攻击成功率，且语义质量几乎无损，无需任何水印特定知识或模型重新训练。

Abstract

Diffusion-based watermarking methods embed verifiable marks by manipulating the initial noise or the reverse diffusion trajectory. However, these methods share a critical assumption: verification can succeed only if the diffusion trajectory can be faithfully reconstructed. This reliance on trajectory recovery constitutes a fundamental and exploitable vulnerability. We propose $\underline{\mathbf{S}}$tochastic $\underline{\mathbf{Hi}}$dden-Trajectory De$\underline{\mathbf{f}}$lec$\underline{\mathbf{t}}$ion ($\mathbf{SHIFT}$), a training-free attack that exploits this common weakness across diverse watermarking paradigms. SHIFT leverages stochastic diffusion resampling to deflect the generative trajectory in latent space, making the reconstructed image statistically decoupled from the original watermark-embedded trajectory while preserving strong visual quality and semantic consistency. Extensive experiments on nine representative watermarking methods spanning noise-space, frequency-domain, and optimization-based paradigms show that SHIFT achieves 95%–100% attack success rates with nearly no loss in semantic quality, without requiring any watermark-specific knowledge or model retraining.

Keywords: diffusion models, watermarking, attack, trajectory deflection, stochastic resampling, latent space, security vulnerability, generative models

181. ❌ Beyond Ground-Truth: Leveraging Image Quality Priors for Real-World Image Restoration

Authors: Fengyang Xiao, Peng Hu, Lei Xu, XingE Guo, Guanyi Qin, Yuqi Shen, Chengyu Fang, Rihan Zhang, Chunming He, Sina Farsiu Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29773v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

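Scoring rationale: This paper focuses on image restoration in computer vision and proposes a method that uses an Image Quality Prior (IQP), extracted from pre-trained No-Reference Image Quality Assessment (NR-IQA) models, to guide the restoration process. Its core techniques involve a Transformer architecture, discrete representation learning, and a quality optimization strategy, none of which relates to the scoring keywords, which concern large language models (LLMs), innovations in deep learning principles, or AI applications in science (such as bioinformatics). The paper does not mention any large model, language model, alignment technique, reasoning method, agent system, or scientific AI application, so every keyword's relevance is 0.

!!! tip deepseek-chat TL;DR

To address the inconsistent perceptual quality of ground-truth (GT) supervision in real-world image restoration, the paper proposes IQPIR, a new framework that uses an Image Quality Prior (IQP) extracted from pre-trained no-reference image quality assessment models. Through a quality-conditioned Transformer, a dual-branch codebook structure, and discrete-representation optimization, it markedly improves the perceptual quality of restored images and surpasses existing methods.

Abstract (Chinese translation)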

真实世界图像复原旨在从非受控条件下采集的退化低质量（LQ）输入中恢复出高质量（HQ）图像。现有方法通常依赖于真实标签（GT）监督，并假设GT提供了完美的参考质量。然而，GT中仍可能包含感知保真度不一致的图像，导致模型收敛至训练数据的平均质量水平，而非达到可实现的最高感知质量。为解决这些问题，我们提出了一种新颖框架，称为IQPIR，该框架引入了从预训练的无参考图像质量评估（NR-IQA）模型中提取的图像质量先验（IQP），以显式地引导复原过程朝向感知最优的输出。我们的方法通过三种关键机制将IQP与学习的码本先验协同整合：（1）质量条件化Transformer，其中NR-IQA导出的分数作为条件信号，引导预测的表征朝向最大感知质量。该设计提供了即插即用的增强功能，无需结构修改即可与现有复原架构兼容；（2）双分支码本结构，解耦了通用特征与高质量专属特征，确保了对通用结构信息与质量敏感属性的全面表征；（3）基于离散表征的质量优化策略，缓解了连续潜在空间中常见的过度优化效应。在真实世界图像复原上的大量实验表明，我们的方法不仅超越了前沿方法，还可作为现有方法中一种可泛化的质量引导增强策略。代码已公开。

Abstract

Real-world image restoration aims to restore high-quality (HQ) images from degraded low-quality (LQ) inputs captured under uncontrolled conditions. Existing methods typically depend on ground-truth (GT) supervision, assuming that GT provides perfect reference quality. However, GT can still contain images with inconsistent perceptual fidelity, causing models to converge to the average quality level of the training data rather than achieving the highest perceptual quality attainable. To address these problems, we propose a novel framework, termed IQPIR, that introduces an Image Quality Prior (IQP)-extracted from pre-trained No-Reference Image Quality Assessment (NR-IQA) models-to guide the restoration process toward perceptually optimal outputs explicitly. Our approach synergistically integrates IQP with a learned codebook prior through three key mechanisms: (1) a quality-conditioned Transformer, where NR-IQA-derived scores serve as conditioning signals to steer the predicted representation toward maximal perceptual quality. This design provides a plug-and-play enhancement compatible with existing restoration architectures without structural modification; and (2) a dual-branch codebook structure, which disentangles common and HQ-specific features, ensuring a comprehensive representation of both generic structural information and quality-sensitive attributes; and (3) a discrete representation-based quality optimization strategy, which mitigates over-optimization effects commonly observed in continuous latent spaces. Extensive experiments on real-world image restoration demonstrate that our method not only surpasses cutting-edge methods but also serves as a generalizable quality-guided enhancement strategy for existing methods. The code is available.

Keywords: Image Restoration, Image Quality Prior, No-Reference Image Quality Assessment, Transformer, Codebook, Perceptual Quality, Real-world Images, Quality Optimization

182. ❌ GRVS: a Generalizable and Recurrent Approach to Monocular Dynamic View Synthesis

Authors: Thomas Tanay, Mohammed Brahimi, Michal Nazarczuk, Qingwen Zhang, Sibi Catley-Chandar, Arthur Moreau, Zhensong Zhang, Eduardo Pérez-Pellitero Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29734v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

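Scoring rationale: This paper focuses on dynamic view synthesis in computer vision, proposing GRVS, a generalizable model based on a recurrent architecture and plane sweeps that generates novel views of dynamic scenes from monocular video. Its core contributions are dynamic scene representation, disentangling camera motion, and improved geometric consistency, placing it in 3D vision and neural rendering. All scoring keywords concern large language models, innovations in deep learning principles, or AI applications in science, none of which this work touches: it neither uses nor mentions language models (LLM/SLM), model training techniques (pre-training/fine-tuning/alignment), inference optimization (attention mechanisms/quantization), AI agent systems, interpretability techniques, or AI applications in bioinformatics or cheminformatics. Every keyword's relevance is therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes GRVS, a generalizable recurrent model that addresses geometric inconsistency when synthesizing novel views from monocular dynamic video. Experiments on the UCSD and Kubric-4D-dyn datasets show that it reconstructs geometric details better in both static and dynamic regions, outperforming Gaussian Splatting-based and diffusion-based methods.

Abstract (Chinese translation)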

从动态场景的单目视频中合成新视角仍是一个具有挑战性的问题。基于显式运动先验优化4D表示的场景特定方法，在难以利用多视角信息的高度动态区域常常失效。将相机控制整合进大型预训练模型的扩散方法虽能生成视觉上合理的视频，却常因静态与动态区域的几何不一致性而受限。这两类方法均需大量计算资源。基于通用化静态新视角合成模型的成功经验，我们将其框架适配于动态输入，并提出一种包含两个关键组件的新模型：（1）一个循环回路，可实现输入与目标视频间无界且异步的映射；（2）对动态输入高效运用平面扫描技术，以解耦相机与场景运动，并实现精细的六自由度相机控制。我们在UCSD数据集及Kubric-4D-dyn数据集上训练和评估了模型——后者是一个新的单目动态数据集，其序列更长、分辨率更高，且场景动态比现有数据集更为复杂。实验表明，在重建静态与动态区域的精细几何细节方面，我们的模型优于四种基于高斯泼溅的场景特定方法以及两种基于扩散的方法。

Abstract

Synthesizing novel views from monocular videos of dynamic scenes remains a challenging problem. Scene-specific methods that optimize 4D representations with explicit motion priors often break down in highly dynamic regions where multi-view information is hard to exploit. Diffusion-based approaches that integrate camera control into large pre-trained models can produce visually plausible videos but frequently suffer from geometric inconsistencies across both static and dynamic areas. Both families of methods also require substantial computational resources. Building on the success of generalizable models for static novel view synthesis, we adapt the framework to dynamic inputs and propose a new model with two key components: (1) a recurrent loop that enables unbounded and asynchronous mapping between input and target videos and (2) an efficient use of plane sweeps over dynamic inputs to disentangle camera and scene motion, and achieve fine-grained, six-degrees-of-freedom camera controls. We train and evaluate our model on the UCSD dataset and on Kubric-4D-dyn, a new monocular dynamic dataset featuring longer, higher resolution sequences with more complex scene dynamics than existing alternatives. Our model outperforms four Gaussian Splatting-based scene-specific approaches, as well as two diffusion-based approaches in reconstructing fine-grained geometric details across both static and dynamic regions.

Keywords: dynamic view synthesis, monocular video, recurrent model, plane sweeps, camera motion disentanglement, geometric consistency, generalizable model, novel view synthesis

183. ❌ Leveraging Synthetic Data for Enhancing Egocentric Hand-Object Interaction Detection

Authors: Rosario Leonardi, Antonino Furnari, Francesco Ragusa, Giovanni Maria Farinella Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29733v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

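Scoring rationale: This paper studies how synthetic data can improve egocentric hand-object interaction detection. It belongs to computer vision and mainly concerns dataset generation, data augmentation, and detector training. It does not involve large language models, innovations in deep learning principles, or concrete AI for Science applications, and has no direct connection to any scoring keyword.

!!! tip deepseek-chat TL;DR

The paper studies the use of synthetic data to enhance egocentric hand-object interaction detection, shows experimentally that synthetic data significantly improves detection performance, especially when real labeled data are scarce, and releases a new data generation pipeline and benchmark dataset.

Abstract (Chinese translation)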

本研究探讨了合成数据在提升第一人称视角图像中手物交互检测性能方面的作用。通过在VISOR、EgoHOS和ENIGMA-51数据集上进行大量实验与对比分析，我们的研究结果表明，合成数据能够显著改善手物交互检测效果，尤其在真实标注数据稀缺或不可得时表现突出。通过使用合成数据并仅结合10%的真实标注数据，我们在整体平均精度上超越了仅使用真实数据训练的模型，在VISOR数据集上提升+5.67%，在EgoHOS数据集上提升+8.24%，在ENIGMA-51数据集上提升+11.69%。此外，我们系统研究了如何将合成数据在物体类别、抓握方式和环境背景等方面与特定真实世界基准对齐，证明合成数据与真实数据的对齐度越高，其有效性越能持续提升。基于本研究成果，我们发布了新的数据生成流程和HOI-Synth基准数据集，该数据集通过手物交互的合成图像对现有数据集进行了扩充。这些数据均自动标注了手物接触状态、边界框及像素级分割掩码。所有合成数据生成相关的数据、代码与工具均已公开：https://fpv-iplab.github.io/HOI-Synth/。

Abstract

In this work, we explore the role of synthetic data in improving the detection of Hand-Object Interactions from egocentric images. Through extensive experimentation and comparative analysis on VISOR, EgoHOS, and ENIGMA-51 datasets, our findings demonstrate the potential of synthetic data to significantly improve HOI detection, particularly when real labeled data are scarce or unavailable. By using synthetic data and only 10% of the real labeled data, we achieve improvements in Overall AP over models trained exclusively on real data, with gains of +5.67% on VISOR, +8.24% on EgoHOS, and +11.69% on ENIGMA-51. Furthermore, we systematically study how aligning synthetic data to specific real-world benchmarks with respect to objects, grasps, and environments, showing that the effectiveness of synthetic data consistently improves with better synthetic-real alignment. As a result of this work, we release a new data generation pipeline and the new HOI-Synth benchmark, which augments existing datasets with synthetic images of hand-object interaction. These data are automatically annotated with hand-object contact states, bounding boxes, and pixel-wise segmentation masks. All data, code, and tools for synthetic data generation are available at: https://fpv-iplab.github.io/HOI-Synth/.

Keywords: synthetic data, hand-object interaction, egocentric vision, HOI detection, data augmentation, benchmark dataset, computer vision, data generation pipeline

184. ❌ Compressive sensing inspired self-supervised single-pixel imaging

Authors: Jijun Lu, Yifan Chen, Libang Chen, Yiqiang Zhou, Ye Zheng, Mingliang Chen, Zhe Sun, Xuelong Li Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29732v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

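Scoring rationale: This paper focuses on compressive sensing and self-supervised learning for single-pixel imaging (SPI) and proposes a network architecture named SISTA-Net. It involves computer vision, image processing, and compressive sensing, but not large language models (LLMs), innovations in deep learning principles, or AI applications in science. All scoring keywords relate to large models, deep learning techniques, or AI for science, a different research area from this paper, so every keyword's relevance is 0.

!!! tip deepseek-chat TL;DR

To address the noise sensitivity and blurred details in single-pixel imaging caused by the lack of physical sparsity constraints and insufficient integration of local and global features, the paper proposes SISTA-Net, a compressive sensing-based self-supervised method. With a hybrid CNN-VSSM architecture and adaptive sparse transforms, it achieves PSNR gains of 2.6 dB in simulation and 3.4 dB in real underwater scenes.

Abstract (Chinese translation)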

单像素成像（Single-Pixel Imaging, SPI）是一种在强扰动环境中具有独特优势的前沿成像技术。现有单像素成像方法缺乏物理稀疏性约束，且忽视了局部与全局特征的融合，导致严重的噪声敏感性、结构失真与细节模糊。为克服这些局限，本文提出SISTA-Net——一种受压缩感知启发的自监督单像素成像方法。该方法将迭代收缩阈值算法（Iterative Shrinkage-Thresholding Algorithm, ISTA）展开为一个由数据保真模块与邻近映射模块构成的可解释网络。保真模块采用混合CNN-视觉状态空间模型（Visual State Space Model, VSSM）架构，融合局部与全局特征建模，以提升重建的完整性与保真度。我们利用深度非线性网络作为自适应稀疏变换，结合可学习的软阈值算子，在隐域施加显式物理稀疏约束，从而在极低采样率下仍能实现噪声抑制与抗干扰鲁棒性。多场景仿真实验表明，SISTA-Net在峰值信噪比（PSNR）上优于现有先进方法2.6 dB。真实远场水下测试取得平均3.4 dB的PSNR提升，验证了其强大的抗干扰能力。

Abstract

Single-pixel imaging (SPI) is a promising imaging modality with distinctive advantages in strongly perturbed environments. Existing SPI methods lack physical sparsity constraints and overlook the integration of local and global features, leading to severe noise vulnerability, structural distortions and blurred details. To address these limitations, we propose SISTA-Net, a compressive sensing-inspired self-supervised method for single-pixel imaging. SISTA-Net unfolds the Iterative Shrinkage-Thresholding Algorithm (ISTA) into an interpretable network consisting of a data fidelity module and a proximal mapping module. The fidelity module adopts a hybrid CNN-Visual State Space Model (VSSM) architecture to integrate local and global feature modeling, enhancing reconstruction integrity and fidelity. We leverage deep nonlinear networks as adaptive sparse transforms combined with a learnable soft-thresholding operator to impose explicit physical sparsity in the latent domain, enabling noise suppression and robustness to interference even at extremely low sampling rates. Extensive experiments on multiple simulation scenarios demonstrate that SISTA-Net outperforms state-of-the-art methods by 2.6 dB in PSNR. Real-world far-field underwater tests yield a 3.4 dB average PSNR improvement, validating its robust anti-interference capability.

Keywords: Single-pixel imaging, Compressive sensing, Self-supervised learning, SISTA-Net, Iterative Shrinkage-Thresholding Algorithm, CNN-VSSM architecture, Adaptive sparse transforms, Underwater imaging
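The abstract describes unfolding the Iterative Shrinkage-Thresholding Algorithm (ISTA) into a data fidelity module and a proximal mapping module. As background, the sketch below shows classic ISTA for the problem min 0.5·||Ax − y||² + λ·||x||₁, where each iteration is a gradient (data fidelity) step followed by soft-thresholding (the proximal step). This is the textbook algorithm only, not SISTA-Net: the paper replaces these fixed operators with learned networks, and all names and the toy problem here are assumptions for illustration.

```python
def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1: shrink each entry toward zero by tau."""
    return [(abs(x) - tau) * (1.0 if x > 0 else -1.0) if abs(x) > tau else 0.0 for x in v]


def matvec(A, x):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def ista(A, y, lam, step, n_iters=200):
    """Classic ISTA; step should not exceed 1 / (largest eigenvalue of A^T A)."""
    At = [list(col) for col in zip(*A)]                    # transpose of A
    x = [0.0] * len(A[0])
    for _ in range(n_iters):
        residual = [r - t for r, t in zip(matvec(A, x), y)]  # A x - y
        grad = matvec(At, residual)                          # A^T (A x - y)
        z = [xi - step * gi for xi, gi in zip(x, grad)]      # data fidelity step
        x = soft_threshold(z, step * lam)                    # proximal (shrinkage) step
    return x


# Toy problem (hypothetical): identity sensing matrix, one large and one small measurement
A = [[1.0, 0.0], [0.0, 1.0]]
y = [3.0, 0.1]
print(ista(A, y, lam=0.5, step=1.0))
```

With an identity sensing matrix the solution is simply the soft-thresholded measurement, so the small entry is zeroed out while the large one is shrunk by λ, which illustrates how the shrinkage step enforces sparsity.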

185. ❌ FED-Bench: A Cross-Granular Benchmark for Disentangled Evaluation of Facial Expression Editing

Authors: Fengjian Xue, Xuecheng Wu, Heli Sun, Yunyun Shi, Shi Chen, Liangyu Fu, Jinheng Xie, Dingkang Yang, Hao Wang, Junxiao Xue, Liang He Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29697v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

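Scoring rationale: This paper focuses on benchmarking and evaluation methods for facial expression image editing, within computer vision and image processing. It covers dataset construction, metric design, model benchmarking, and fine-tuning, but does not involve large language models, innovations in deep learning principles, or AI applications in science. All keywords relate to those topics, whereas this paper studies a specific image editing task, so every keyword's relevance is 0.

!!! tip deepseek-chat TL;DR

The paper proposes FED-Bench, a benchmark for fine-grained evaluation of facial expression image editing. By building a high-quality dataset and a cross-granularity evaluation protocol, it addresses biases in existing evaluation, and it finds that current models struggle to maintain high fidelity while manipulating expressions accurately.

Abstract (Chinese translation)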

面部表情图像编辑需要细粒度控制，在精确操纵表情的同时严格保持人物身份与背景。然而，现有编辑基准主要关注通用场景，缺乏高质量面部图像及相应的编辑指令。此外，当前评估指标在该任务中存在系统性偏差，往往倾向于惰性编辑或过拟合编辑。为弥补这些不足，我们提出了FED-Bench——一个包含严谨测试与精准评估套件的综合基准。首先，我们通过级联可扩展流程精心构建了包含747个三元组的基准数据集，每个三元组由原始图像、编辑指令和真实结果图像组成，以支持精确评估。其次，我们提出了FED-Score跨粒度评估协议，将评估解耦为三个维度：验证指令遵循度的对齐性（Alignment）、测试图像质量与身份保持度的保真性（Fidelity），以及量化表情变化强度的相对表情增益（Relative Expression Gain），从而有效缓解前述评估偏差。第三，我们对18个图像编辑模型进行基准测试，发现现有方法难以同时实现高保真度与准确的表情操控，其中细粒度指令遵循被确定为当前主要瓶颈。最后，基于所引入基准引擎的可扩展特性，我们提供了包含2万+张真实场景面部图像的训练集，并通过微调基线模型验证了其有效性，该模型实现了显著的性能提升。我们的基准及相关代码将很快公开。

Abstract

Facial expression image editing requires fine-grained control to strictly preserve human identity and background while precisely manipulating expression. However, existing editing benchmarks primarily focus on general scenarios, lacking high-quality facial images and corresponding editing instructions. Furthermore, current evaluation metrics exhibit systemic biases in this task, often favoring lazy editing or overfit editing. To bridge these gaps, we propose FED-Bench, a comprehensive benchmark featuring rigorous testing and an accurate evaluation suite. First, we carefully construct a benchmark of 747 triplets through a cascaded and scalable pipeline, each comprising an original image, an editing instruction, and a ground-truth image for precise evaluation. Second, we introduce FED-Score, a cross-granularity evaluation protocol that disentangles assessment into three dimensions: Alignment for verifying instruction following, Fidelity for testing image quality and identity preservation, and Relative Expression Gain for quantifying the magnitude of expression changes, effectively mitigating the aforementioned evaluation biases. Third, we benchmark 18 image editing models, revealing that current approaches struggle to simultaneously achieve high fidelity and accurate expression manipulation, with fine-grained instruction following identified as the primary bottleneck. Finally, leveraging the scalable characteristic of introduced benchmark engine, we provide a 20k+ in-the-wild facial training set and demonstrate its effectiveness by fine-tuning a baseline model that achieves significant performance gains. Our benchmark and related code will be made publicly open soon.

Keywords: Facial expression editing, Benchmark evaluation, Image quality preservation, Instruction following, Cross-granularity assessment, Fine-grained control, Model fine-tuning, Evaluation bias mitigation

186. ❌ SkeletonContext: Skeleton-side Context Prompt Learning for Zero-Shot Skeleton-based Action Recognition

Authors: Ning Wang, Tieyue Wu, Naeha Sharif, Farid Boussaid, Guangming Zhu, Lin Mei, Mohammed Bennamoun, zhang liang Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29692v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

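Scoring rationale: The paper proposes the SkeletonContext framework, which uses a pretrained language model, guided by LLMs, to reconstruct context prompts that enhance skeleton-based action recognition, an application of large models to computer vision. Relevant keywords: 1) 'Large Language Models' (8): the paper explicitly uses LLMs to guide context prompt reconstruction, a key component of the method; 2) 'Pre-training' (5): it uses a pretrained language model; 3) 'AI for Science' (5): it is an application of AI to scientific computing/computer vision. Other keywords such as MoE, SFT, and RAG are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

To address the missing context between skeleton features and semantic descriptions in zero-shot skeleton-based action recognition, the paper proposes the SkeletonContext framework. Using LLM-driven context prompt learning and a key-part decoupling module, it achieves state-of-the-art performance on multiple benchmarks.

Abstract (Chinese translation)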

基于骨架的零样本动作识别旨在通过语义描述将已知类别的知识迁移至未知动作类别。现有方法通常将骨架特征与文本嵌入对齐至共享潜在空间。然而，由于缺乏动作相关的物体等上下文信息，骨架表示与语义表示之间存在固有鸿沟，导致难以区分视觉相似的动作。为此，我们提出SkeletonContext——一个基于提示词的框架，通过语言驱动的上下文语义增强骨架运动表示。具体而言，我们设计了跨模态上下文提示模块，该模块利用预训练语言模型在大型语言模型（LLM）生成的引导下重构被遮蔽的上下文提示词。这种设计能够将语言上下文有效迁移至骨架编码器，实现实例级语义 grounding 并提升跨模态对齐效果。此外，我们引入关键部位解耦模块，以解耦与运动相关的关节点特征，确保即使在缺乏显式物体交互的情况下仍能实现鲁棒的动作理解。在多个基准数据集上的大量实验表明，SkeletonContext 在常规与广义零样本设定下均取得了最先进的性能，验证了其在上下文推理及区分细粒度视觉相似动作方面的有效性。

Abstract

Zero-shot skeleton-based action recognition aims to recognize unseen actions by transferring knowledge from seen categories through semantic descriptions. Most existing methods typically align skeleton features with textual embeddings within a shared latent space. However, the absence of contextual cues, such as objects involved in the action, introduces an inherent gap between skeleton and semantic representations, making it difficult to distinguish visually similar actions. To address this, we propose SkeletonContext, a prompt-based framework that enriches skeletal motion representations with language-driven contextual semantics. Specifically, we introduce a Cross-Modal Context Prompt Module, which leverages a pretrained language model to reconstruct masked contextual prompts under guidance derived from LLMs. This design effectively transfers linguistic context to the skeleton encoder for instance-level semantic grounding and improved cross-modal alignment. In addition, a Key-Part Decoupling Module is incorporated to decouple motion-relevant joint features, ensuring robust action understanding even in the absence of explicit object interactions. Extensive experiments on multiple benchmarks demonstrate that SkeletonContext achieves state-of-the-art performance under both conventional and generalized zero-shot settings, validating its effectiveness in reasoning about context and distinguishing fine-grained, visually similar actions.

Keywords: skeleton-based action recognition, zero-shot learning, cross-modal alignment, context prompt learning, large language models, semantic grounding, visual similarity, state-of-the-art performance

187. ❌ Clinical DVH metrics as a loss function for 3D dose prediction in head and neck radiotherapy

Authors: Ruochen Gao, Marius Staring, Frank Dankers Venue/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29670v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

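Scoring rationale: The paper focuses on deep learning for medical imaging, specifically 3D dose prediction for head and neck radiotherapy. Its core contribution is a clinically oriented loss function (CDM loss) and an efficient ROI encoding method that bring dose predictions closer to clinical evaluation criteria. Although the paper is an application of AI to science (medical imaging/radiotherapy), the keywords all target specific LLM topics such as LLM techniques, training methods, inference optimization, and alignment, whereas this paper involves no LLM technology and only uses a standard 3D U-Net for medical image analysis. Therefore 'AI for Science OR Bioinformatics OR Cheminformatics' receives 5 points (some relevance) as an application of AI to science (medicine), and all other keywords receive 0 points (unrelated).

!!! tip deepseek-chat TL;DR

This study addresses the mismatch between deep-learning dose prediction models and clinical evaluation criteria in head and neck radiotherapy. It proposes a clinical DVH metric loss (CDM loss) and an efficient ROI encoding method that markedly improve agreement between predictions and clinical treatment planning criteria while substantially reducing training time and GPU memory usage.

Abstract (Chinese translation)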

目的：基于深度学习的三维剂量预测广泛应用于自动化放疗工作流程。然而，现有模型大多采用体素级回归损失进行训练，这与基于剂量体积直方图（DVH）指标的临床计划评估标准契合度较低。本研究旨在开发一种临床引导的损失函数构建方法，能够直接优化临床使用的DVH指标，同时保持头颈部（H&N）剂量预测的计算效率。
方法：我们提出了一种临床DVH指标损失函数（CDM损失），该函数融合了可微分的D指标和替代V指标，并结合无损的感兴趣区域（ROI）位掩码编码以提高训练效率。该方法在174例头颈部患者数据上使用时序分割（137例训练，37例测试）进行评估。
结果：与基于平均绝对误差（MAE）和DVH曲线的损失函数相比，CDM损失显著改善了靶区覆盖度，并满足了所有临床约束条件。使用标准3D U-Net模型时，计划靶区（PTV）评分从1.544（MAE）降低至0.491（MAE + CDM），而危及器官（OAR）保护效果保持相当水平。位掩码编码使训练时间减少83%，并降低了GPU内存使用量。
结论：直接优化临床使用的DVH指标，能够使三维剂量预测比传统的体素级或基于DVH曲线的监督方法更符合临床治疗计划标准。所提出的CDM损失函数与高效的ROI位掩码编码相结合，为头颈部剂量预测提供了一个实用且可扩展的框架。

Abstract

Purpose: Deep-learning-based three-dimensional (3D) dose prediction is widely used in automated radiotherapy workflows. However, most existing models are trained with voxel-wise regression losses, which are poorly aligned with clinical plan evaluation criteria based on dose-volume histogram (DVH) metrics. This study aims to develop a clinically guided loss formulation that directly optimizes clinically used DVH metrics while remaining computationally efficient for head and neck (H&N) dose prediction.
Methods: We propose a clinical DVH metric loss (CDM loss) that incorporates differentiable D-metrics and surrogate V-metrics, together with a lossless bit-mask region-of-interest (ROI) encoding to improve training efficiency. The method was evaluated on 174 H&N patients using a temporal split (137 training, 37 testing).
Results: Compared with MAE- and DVH-curve based losses, CDM loss substantially improved target coverage and satisfied all clinical constraints. Using a standard 3D U-Net, the PTV Score was reduced from 1.544 (MAE) to 0.491 (MAE + CDM), while OAR sparing remained comparable. Bit-mask encoding reduced training time by 83% and lowered GPU memory usage.
Conclusion: Directly optimizing clinically used DVH metrics enables 3D dose predictions that are better aligned with clinical treatment planning criteria than conventional voxel-wise or DVH-curve-based supervision. The proposed CDM loss, combined with efficient ROI bit-mask encoding, provides a practical and scalable framework for H&N dose prediction.

Keywords: 3D dose prediction, head and neck radiotherapy, clinical DVH metrics, loss function, deep learning, U-Net, bit-mask encoding, treatment planning
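
The abstract above mentions bit-mask ROI encoding and DVH D-/V-metrics without giving details. The following is a minimal sketch of one plausible reading, not the authors' implementation: several boolean ROI masks are packed into a single integer volume (bit k marks membership in ROI k), and D- and V-metrics are computed in their standard, non-differentiable evaluation form. The function names `encode_rois`, `decode_roi`, `d_metric`, and `v_metric` are hypothetical, and the paper's differentiable D-metrics and surrogate V-metrics are not reproduced here.

```python
import numpy as np

def encode_rois(masks):
    """Pack boolean ROI masks of equal shape into one uint32 volume (bit k = ROI k)."""
    if len(masks) > 32:
        raise ValueError("this sketch supports at most 32 ROIs")
    packed = np.zeros(masks[0].shape, dtype=np.uint32)
    for k, mask in enumerate(masks):
        packed |= mask.astype(np.uint32) << np.uint32(k)
    return packed

def decode_roi(packed, k):
    """Recover the boolean mask of ROI k from the packed volume."""
    return ((packed >> np.uint32(k)) & np.uint32(1)).astype(bool)

def d_metric(dose, mask, percent):
    """D_x%: minimum dose received by the hottest x% of the ROI volume."""
    return float(np.percentile(dose[mask], 100.0 - percent))

def v_metric(dose, mask, threshold):
    """V_d: fraction of the ROI volume receiving at least `threshold` dose."""
    return float(np.mean(dose[mask] >= threshold))

# Toy 1D "volume" with two overlapping ROIs (hypothetical data).
dose = np.arange(10, dtype=np.float64)
ptv = dose >= 4
oar = dose <= 5
packed = encode_rois([ptv, oar])
```

Because each ROI occupies its own bit, overlapping structures are preserved exactly, which is what makes the encoding lossless in this sketch; one integer volume replaces a stack of per-ROI masks.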

188. ❌ CoRe-DA: Contrastive Regression for Unsupervised Domain Adaptation in Surgical Skill Assessment

Authors: Dimitrios Anastasiou, Razvan Caramalau, Jialang Xu, Runlong He, Freweini Tesfai, Matthew Boal, Nader Francis, Danail Stoyanov, Evangelos B. Mazomenos Venue/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29666v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

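Scoring rationale: The paper focuses on computer vision and deep learning for surgical skill assessment (SSA); its core contribution is CoRe-DA, a contrastive-regression-based unsupervised domain adaptation (UDA) method. Relevance to the keyword list: 1) Highly relevant (10 points): in 'Pre-training OR Continual Pre-training OR Domain Adaptation', 'Domain Adaptation' is the paper's core technique; the paper explicitly proposes a UDA method and establishes the first UDA benchmark for SSA regression. 2) Moderately relevant (8 points): in 'AI for Science OR Bioinformatics OR Cheminformatics', 'AI for Science' relates to the paper's application in medical science (surgical assessment), but the paper focuses on computer vision rather than large models. 3) Unrelated (0 points): the remaining 25 keywords concern large language models (LLMs) and related techniques (e.g. MoE, Scaling Laws, RLHF, RAG, Agents), reasoning methods (e.g. CoT, System 2), or model optimization techniques (e.g. Quantization, Speculative Decoding); the paper involves no large models, language processing, or any of these techniques, and uses only conventional deep learning/computer vision methods for regression.

!!! tip deepseek-chat TL;DR

This paper addresses high annotation cost and poor cross-domain generalization in surgical skill assessment. It proposes CoRe-DA, a contrastive-regression-based unsupervised domain adaptation method that outperforms existing methods across domains on multiple surgical datasets without requiring labeled target-domain data.

Abstract (Chinese translation)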

基于视觉的手术技能评估（SSA）能够实现客观且可扩展的手术表现评价。该领域的发展受到两方面制约：一是人工标注定量技能评分所需的高成本与时间消耗，二是现有回归模型对新手术任务及环境的泛化能力不足。与此同时，当前已有大量未标注的手术视频数据可用，这推动了针对SSA的无监督域自适应（UDA）方法的发展。我们首次构建了面向SSA回归任务的UDA基准测试，涵盖干式实验室与临床环境下的四个数据集，并包含开放手术与机器人手术场景。我们在具有挑战性的域偏移条件下评估了八种代表性模型，并提出了一种新颖的基于对比回归的自适应框架CoRe-DA。该方法通过相对评分监督与目标域自训练学习域不变特征表示。在两种UDA设置下的综合实验表明，CoRe-DA优于现有先进方法，在未使用任何标注目标域数据训练的情况下，于干式实验室和临床目标数据集上分别达到了0.46和0.41的斯皮尔曼相关系数。总体而言，CoRe-DA实现了具有可靠跨域泛化能力的可扩展SSA，而现有方法在此方面表现欠佳。我们的代码与数据集将在https://github.com/anastadimi/CoRe-DA发布。

Abstract

Vision-based surgical skill assessment (SSA) enables objective and scalable evaluation of operative performance. Progress in this field is constrained by the high cost and time demands for manual annotation of quantitative skill scores, as well as the poor generalization of existing regression models to new surgical tasks and environments. Meanwhile, appreciable volumes of unlabeled video data are now available, motivating the development of unsupervised domain adaptation (UDA) methods for SSA. We introduce the first benchmark for UDA in SSA regression, spanning four datasets across dry-lab and clinical settings as well as open and robotic surgery. We evaluate eight representative models under challenging domain shifts and propose CoRe-DA, a novel contrastive regression-based adaptation framework. Our method learns domain-invariant representations through relative-score supervision and target-domain self-training. Comprehensive experiments across two UDA settings show that CoRe-DA is superior to state-of-the-art methods, achieving Spearman Correlation Coefficients of 0.46 and 0.41 on dry-lab and clinical target datasets, respectively, without using any labeled target data for training. Overall, CoRe-DA enables scalable SSA with reliable cross-domain generalization, where existing methods underperform. Our code and datasets will be released at https://github.com/anastadimi/CoRe-DA.

Keywords: surgical skill assessment, unsupervised domain adaptation, contrastive regression, domain-invariant representations, cross-domain generalization, video analysis, computer vision, medical AI

189. ❌ CutClaw: Agentic Hours-Long Video Editing via Music Synchronization

Authors: Shifang Zhao, Yihan Hu, Ying Shan, Yunchao Wei, Xiaodong Cun Venue/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29664v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

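Scoring rationale: The paper proposes CutClaw, a multi-agent framework built on multimodal large language models (MLLMs) for automated video editing. Core relevant keywords: 1) 'Large Language Models OR LLMs OR Foundation Models' (10 points): the paper explicitly uses MLLMs as the basis of the agent system; 2) 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (10 points): the framework comprises a Playwriter Agent, an Editor Agent, and a Reviewer Agent, forming a complete agentic workflow; 3) 'Multi-agent Systems OR Agent Coordination' (10 points): multiple agents collaborate on the video editing task. Other keywords such as MoE, quantization, and inference acceleration are not covered and score 0.

!!! tip deepseek-chat TL;DR

This paper proposes CutClaw, a multi-agent framework based on multimodal large language models that automatically edits long raw footage into short, music-synchronized videos; experiments show it significantly outperforms existing baselines.

Abstract (Chinese translation)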

在当前社交媒体中，通过音频对齐编辑视频内容已成为一种数字人工艺术形式。然而，手动视频编辑耗时且重复的特性长期以来一直是电影制作者和专业内容创作者面临的挑战。本文介绍CutClaw，这是一个自主多智能体框架，旨在将数小时的原始素材编辑成有意义的短视频。该框架利用多个多模态语言模型（Multimodal Language Models，MLLMs）作为智能体系统，生成音乐同步、遵循指令且视觉表现力强的视频。具体而言，我们的方法首先采用分层多模态分解，以捕捉视觉和音频素材中的细粒度细节与全局结构。随后，为确保叙事连贯性，一个编剧智能体（Playwriter Agent）统筹整个故事流程并构建长期叙事结构，将视觉场景与音乐转换锚定。最后，编辑与审核智能体（Editor and Reviewer Agents）通过基于严格美学和语义标准选择细粒度视觉内容，协作优化最终剪辑，从而构建出精简的短视频。我们进行了详细实验，证明CutClaw在生成高质量、节奏对齐的视频方面显著优于现有先进基线方法。代码发布于：https://github.com/GVCLab/CutClaw。

Abstract

Editing the video content with audio alignment forms a digital human-made art in current social media. However, the time-consuming and repetitive nature of manual video editing has long been a challenge for filmmakers and professional content creators alike. In this paper, we introduce CutClaw, an autonomous multi-agent framework designed to edit hours-long raw footage into meaningful short videos that leverages the capabilities of multiple Multimodal Language Models (MLLMs) as an agent system. It produces videos with synchronized music, followed by instructions, and a visually appealing appearance. In detail, our approach begins by employing a hierarchical multimodal decomposition that captures both fine-grained details and global structures across visual and audio footage. Then, to ensure narrative consistency, a Playwriter Agent orchestrates the whole storytelling flow and structures the long-term narrative, anchoring visual scenes to musical shifts. Finally, to construct a short edited video, Editor and Reviewer Agents collaboratively optimize the final cut via selecting fine-grained visual content based on rigorous aesthetic and semantic criteria. We conduct detailed experiments to demonstrate that CutClaw significantly outperforms state-of-the-art baselines in generating high-quality, rhythm-aligned videos. The code is available at: https://github.com/GVCLab/CutClaw.

Keywords: autonomous multi-agent framework, Multimodal Language Models, video editing, music synchronization, hierarchical multimodal decomposition, agent collaboration, narrative consistency, aesthetic optimization

190. ❌ STRADAViT: Towards a Foundational Model for Radio Astronomy through Self-Supervised Transfer

Authors: Andrea DeMarco, Ian Fenech Conti, Hayley Camilleri, Ardiana Bushi, Simone Riggi Venue/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29660v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

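Scoring rationale: The paper focuses on applying computer vision (Vision Transformer) to radio astronomy, which falls under AI for Science and is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points). Its core method involves 'Continual Pre-training' (10 points), but it does not involve large language models (LLMs) or innovations in the principles of other deep learning techniques, so all other keywords are unrelated (0 points).

!!! tip deepseek-chat TL;DR

This paper proposes STRADAViT, a self-supervised Vision Transformer continued-pretraining framework for building transferable radio astronomy image encoders, which improves transfer performance over baseline models on multiple radio astronomy morphology benchmarks.

Abstract (Chinese translation)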

新一代射电天文巡天项目正在产生数百万个已解析源，但跨异构望远镜与成像流程的稳健形态学分析仍具挑战。本文提出STRADAViT，一种用于可迁移射电天文图像编码器的自监督视觉Transformer（Vision Transformer）持续预训练框架。该框架整合了混合巡天预训练数据集、射电天文感知视图生成技术，以及通过纯重建分支、纯对比分支和双阶段分支实现的受控持续预训练。预训练使用来自MeerKAT、ASKAP、LOFAR/LoTSS及SKA（平方公里阵列，Square Kilometre Array）数据的512x512射电天文切图。我们通过在MiraBest、LoTSS DR2和Radio Galaxy Zoo（RGZ）这三个形态学基准数据集上进行线性探测与微调来评估迁移性能。相较于持续预训练所用的初始化模型，最佳的双阶段STRADAViT模型在所有报告的线性探测设定及大多数微调设定中均提升了宏平均F1分数（Macro-F1），其中在RGZ DR1数据集上提升最为显著。相较于强大的DINOv2基线模型，性能提升具有选择性，但在线性探测下的LoTSS DR2与RGZ DR1数据集上，以及在微调下的MiraBest与RGZ DR1数据集上仍保持正向增益。一项针对DINOv2初始化的HCL消融实验进一步表明，该适应方案并非依赖于单一初始点。发布的STRADAViT检查点仍是优选模型，因为相较于基于DINOv2的替代方案，其能以更低的标记数量与下游成本实现具有竞争力的迁移性能。这些结果表明，射电天文感知视图生成与分阶段持续预训练为射电天文迁移任务提供了比即用型视觉Transformer更强的起点。

Abstract

Next-generation radio astronomy surveys are producing millions of resolved sources, but robust morphology analysis remains difficult across heterogeneous telescopes and imaging pipelines. We present STRADAViT, a self-supervised Vision Transformer continued-pretraining framework for transferable radio astronomy image encoders. STRADAViT combines a mixed-survey pretraining dataset, radio astronomy-aware view generation, and controlled continued pretraining through reconstruction-only, contrastive-only, and two-stage branches. Pretraining uses 512x512 radio astronomy cutouts from MeerKAT, ASKAP, LOFAR/LoTSS, and SKA data. We evaluate transfer with linear probing and fine-tuning on three morphology benchmarks: MiraBest, LoTSS DR2, and Radio Galaxy Zoo. Relative to the initialization used for continued pretraining, the best two-stage STRADAViT models improve Macro-F1 in all reported linear-probe settings and in most fine-tuning settings, with the largest gain on RGZ DR1. Relative to strong DINOv2 baselines, gains are selective but remain positive on LoTSS DR2 and RGZ DR1 under linear probing, and on MiraBest and RGZ DR1 under fine-tuning. A targeted DINOv2-initialized HCL ablation further shows that the adaptation recipe is not specific to a single starting point. The released STRADAViT checkpoint remains the preferred model because it offers competitive transfer at lower token count and downstream cost than the DINOv2-based alternative. These results show that radio astronomy-aware view generation and staged continued pretraining provide a stronger starting point than out-of-the-box Vision Transformers for radio astronomy transfer.

Keywords: Vision Transformer, self-supervised learning, continued pretraining, radio astronomy, transfer learning, image encoder, morphology analysis, foundational model

191. ❌ Not All Frames Are Equal: Complexity-Aware Masked Motion Generation via Motion Spectral Descriptors

Authors: Pengfei Zhou, Xiangyue Zhang, Xukun Shen, Yong Hu Venue/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29655v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

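Scoring rationale: The paper studies text-to-motion synthesis with masked generative models, focusing on the local dynamic complexity of motion data and using a motion spectral descriptor to improve the masking strategy and attention mechanism. All scoring keywords target large language models (LLMs) and related techniques (e.g. MoE, RLHF, RAG, quantization), whereas this paper addresses motion generation in computer vision/graphics using masked generative models (BERT-like architectures), not LLM technology. All keywords are therefore unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

This paper addresses the overly uniform treatment of motion frames by masked generative models in text-to-motion synthesis. It proposes DynMask, a complexity-aware method based on the Motion Spectral Descriptor, which markedly improves generation quality for dynamically complex motions as well as overall performance.

Abstract (Chinese translation)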

掩码生成模型已成为文本驱动动作合成的强大范式，但其在掩码、注意力与解码过程中仍过于均匀地处理动作帧。这与动作序列的特性并不匹配——动作的局部动态复杂度随时间急剧变化。我们发现，当前掩码动作生成器在动态复杂度高的动作上性能下降尤为显著，且逐帧生成误差与运动动态特征高度相关。受此差异启发，我们提出了运动谱描述符（Motion Spectral Descriptor, MSD），这是一种简单且无需参数的方法，通过动作速度的短时频谱计算局部动态复杂度。与基于学习的难度预测器不同，MSD具有确定性、可解释性，并直接源自动作信号本身。我们利用MSD使掩码动作生成具备复杂度感知能力：具体而言，MSD在训练阶段指导以内容为中心的掩码策略，为自注意力机制提供谱相似性先验，并可在迭代解码过程中额外调节令牌级采样。基于现有掩码动作生成器构建的方法DynMask，在动态复杂的动作上提升生成效果最为显著，同时在HumanML3D和KIT-ML数据集上获得了更优的整体FID指标。这些结果表明，尊重局部运动复杂度是掩码动作生成的重要设计原则。项目页面：https://xiangyue-zhang.github.io/DynMask

Abstract

Masked generative models have become a strong paradigm for text-to-motion synthesis, but they still treat motion frames too uniformly during masking, attention, and decoding. This is a poor match for motion, where local dynamic complexity varies sharply over time. We show that current masked motion generators degrade disproportionately on dynamically complex motions, and that frame-wise generation error is strongly correlated with motion dynamics. Motivated by this mismatch, we introduce the Motion Spectral Descriptor (MSD), a simple and parameter-free measure of local dynamic complexity computed from the short-time spectrum of motion velocity. Unlike learned difficulty predictors, MSD is deterministic, interpretable, and derived directly from the motion signal itself. We use MSD to make masked motion generation complexity-aware. In particular, MSD guides content-focused masking during training, provides a spectral similarity prior for self-attention, and can additionally modulate token-level sampling during iterative decoding. Built on top of masked motion generators, our method, DynMask, improves motion generation most clearly on dynamically complex motions while also yielding stronger overall FID on HumanML3D and KIT-ML. These results suggest that respecting local motion complexity is a useful design principle for masked motion generation. Project page: https://xiangyue-zhang.github.io/DynMask

Keywords: masked generative models, text-to-motion synthesis, motion spectral descriptor, dynamic complexity, complexity-aware generation, motion generation, HumanML3D, KIT-ML
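
The abstract above describes the Motion Spectral Descriptor only as a parameter-free measure computed from the short-time spectrum of motion velocity; the exact formula is not given here. The sketch below is therefore an assumption, not the paper's definition: it takes frame-to-frame velocity, applies a Hann-windowed short-time FFT, and uses the spectral centroid of each window as a stand-in complexity score. The function name `spectral_complexity` and the defaults `win=16` and `hop=4` are hypothetical.

```python
import numpy as np

def spectral_complexity(motion, win=16, hop=4):
    """Score local dynamic complexity of a (T, D) motion array, one value per window.

    Assumed proxy: spectral centroid (in cycles per frame) of the windowed
    velocity signal, with power summed over all D feature channels.
    """
    velocity = np.diff(motion, axis=0)            # (T-1, D) frame-to-frame velocity
    window = np.hanning(win)[:, None]             # taper to reduce spectral leakage
    freqs = np.fft.rfftfreq(win)                  # 0 .. 0.5 cycles per frame
    scores = []
    for start in range(0, velocity.shape[0] - win + 1, hop):
        segment = velocity[start:start + win] * window
        power = (np.abs(np.fft.rfft(segment, axis=0)) ** 2).sum(axis=1)
        total = power.sum()
        # A motionless window has zero power; define its complexity as 0.
        scores.append(float((freqs * power).sum() / total) if total > 0 else 0.0)
    return np.array(scores)

# Toy signals (hypothetical): a slow and a fast oscillation on one channel.
t = np.arange(128)
slow = np.sin(2 * np.pi * 0.02 * t)[:, None]
fast = np.sin(2 * np.pi * 0.20 * t)[:, None]
slow_scores = spectral_complexity(slow)
fast_scores = spectral_complexity(fast)
```

Under this proxy, windows dominated by rapid changes score higher than smooth ones, so such a score could in principle weight masking or sampling toward dynamically complex frames, as the abstract describes at a high level.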

192. ❌ Self-Supervised Federated Learning under Data Heterogeneity for Label-Scarce Diatom Classification

Authors: Mingkun Tan, Xilu Wang, Michael Kloster, Tim W. Nattkemper Venue/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29633v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

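Scoring rationale: The paper focuses on self-supervised federated learning for diatom classification, studying label scarcity under data heterogeneity. It is unrelated to the vast majority of keywords (large-model principles, training methods, inference optimization, alignment, agents, etc.), because those keywords refer specifically to large language model (LLM) techniques, whereas this paper studies federated learning in computer vision and involves no LLM technology. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper applies AI to bioinformatics (diatom classification), which falls under AI for Science, but this is not its core focus (the core is the federated learning method), so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

This paper studies self-supervised federated learning for diatom classification under data heterogeneity and label scarcity. It proposes the PreDi data partitioning scheme and the PreP-WFL method; experiments show that self-supervised federated learning outperforms local training and that PreP-WFL effectively mitigates the performance degradation caused by low-prevalence classes.

Abstract (Chinese translation)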

分散式异构数据下的标签稀缺视觉分类是模式识别中的一个基础性挑战，尤其在多个站点呈现部分重叠类别集的情况下。尽管自监督联邦学习（SSFL）提供了一种有前景的解决方案，但现有研究通常假设在预训练和微调阶段存在相同的数据异质性模式。此外，当前的数据划分方案往往无法生成纯粹的部分类别不相交数据设置，这限制了对现实世界标签空间异质性的可控模拟。在本研究中，我们以硅藻分类作为现实世界中的代表性实例引入SSFL，并系统性地研究了特定阶段的数据异质性。我们分析了预训练阶段未标记数据量的跨站点差异，以及下游微调阶段标签空间的错位问题。为了在可控环境中研究后者，我们提出了PreDi划分方案，该方案将标签空间异质性解耦为两个正交维度，即类别普遍性（Prevalence）和类别集规模差异（Disparity），从而能够分别分析其影响。基于所得洞见的指导，我们进一步提出了基于普遍性的个性化加权联邦学习（PreP-WFL），以在低普遍性场景中自适应地增强稀有类别的表征。大量实验表明，在同质和异质设置下，SSFL始终优于仅局部训练。未标记数据量的显著异质性与表征预训练效果的提升相关，而在标签空间异质性下，普遍性主导性能表现，差异性的影响较小。PreP-WFL有效缓解了性能下降，且随着普遍性降低，其增益逐渐增大。这些发现为表征分散式识别系统中的标签空间异质性提供了机制性基础。

Abstract

Label-scarce visual classification under decentralized and heterogeneous data is a fundamental challenge in pattern recognition, especially when sites exhibit partially overlapping class sets. While self-supervised federated learning (SSFL) offers a promising solution, existing studies commonly assume the same data heterogeneity pattern throughout pre-training and fine-tuning. Moreover, current partitioning schemes often fail to generate pure partially class-disjoint data settings, limiting controllable simulation of real-world label-space heterogeneity. In this work, we introduce SSFL for diatom classification as a representative real-world instance and systematically investigate stage-specific data heterogeneity. We study cross-site variation in unlabeled data volume during pre-training and label-space misalignment during downstream fine-tuning. To study the latter in a controllable setting, we propose PreDi, a partitioning scheme that disentangles label-space heterogeneity into two orthogonal dimensions, namely class Prevalence and class-set size Disparity, enabling separate analysis of their effects. Guided by the resulting insights, we further propose PreP-WFL (Prevalence-based Personalized Weighted Federated Learning) to adaptively strengthen rare-class representations in low-prevalence scenarios. Extensive experiments show that SSFL consistently outperforms local-only training under both homogeneous and heterogeneous settings. The pronounced heterogeneity in unlabeled data volume is associated with improved representation pre-training, whereas under label-space heterogeneity, prevalence dominates performance and disparity has a smaller effect. PreP-WFL effectively mitigates this degradation, with gains increasing as prevalence decreases. These findings provide a mechanistic basis for characterizing label-space heterogeneity in decentralized recognition systems.

Keywords: self-supervised federated learning, data heterogeneity, label-scarce classification, diatom classification, partially overlapping class sets, PreDi partitioning, PreP-WFL, decentralized recognition systems

Authors: Sherif Abdelwahab Venue/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29631v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

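Scoring rationale: The paper mainly studies video stream retrieval optimization on edge devices; its core innovations are a novelty filtering algorithm and a lightweight encoder architecture. It is unrelated to most large-model keywords. It is relevant to "Small Language Models OR SLMs OR On-device AI" (8 points) because it runs a lightweight 8M-parameter encoder on edge devices, and somewhat relevant to "Retrieval-Augmented Generation OR RAG OR Retrieval-Generation" (5 points) because it involves a retrieval-augmented cross-modal retrieval system. No other keywords are covered.

!!! tip deepseek-chat TL;DR

This paper proposes a streaming video retrieval architecture for edge cameras that removes redundant frames with a novelty filtering algorithm and combines a lightweight encoder with cloud re-ranking, reaching 45.6% Hit@5 at low power (an estimated 2.7 mW).

Abstract (Chinese translation)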

常开边缘摄像头持续生成视频流，其中冗余帧会挤占检索结果前列位置，从而降低跨模态检索的准确性。本文提出一种流式检索架构：设备端ε-网络过滤器仅保留语义新颖的帧，构建去噪嵌入索引；跨模态适配器与云端重排序器则补偿紧凑编码器在模态对齐上的不足。在两种第一人称数据集（AEA、EPIC-KITCHENS）上，针对八种视觉-语言模型（8M-632M参数规模），单次流式过滤器的表现优于离线替代方案（k-means、最远点采样、均匀采样、随机采样）。该架构整体采用8M参数的设备端编码器，在预估功耗2.7毫瓦的条件下，于预留测试数据上实现了45.6%的Hit@5检索准确率。

Abstract

Always-on edge cameras generate continuous video streams where redundant frames degrade cross-modal retrieval by crowding correct results out of top-k search. This paper presents a streaming retrieval architecture: an on-device epsilon-net filter retains only semantically novel frames, building a denoised embedding index; a cross-modal adapter and cloud re-ranker compensate for the compact encoder’s weak alignment. A single-pass streaming filter outperforms offline alternatives (k-means, farthest-point, uniform, random) across eight vision-language models (8M-632M) on two egocentric datasets (AEA, EPIC-KITCHENS). Combined, the architecture reaches 45.6% Hit@5 on held-out data using an 8M on-device encoder at an estimated 2.7 mW.

Keywords: edge cameras, cross-modal retrieval, novelty filtering, streaming retrieval, on-device encoder, embedding index, vision-language models, egocentric datasets
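
For orientation, the single-pass novelty filter described in the abstract can be pictured as a greedy epsilon-net over frame embeddings: a frame is retained only if it lies farther than a threshold from every frame already retained. The sketch below is illustrative and is not the authors' implementation; the cosine-distance metric, the threshold value, the toy embeddings, and the function names `cosine_distance` and `epsilon_net_filter` are all assumptions.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity (assumed metric; the paper may use another)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat a zero vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def epsilon_net_filter(embeddings: Sequence[Sequence[float]],
                       epsilon: float) -> List[int]:
    """Single pass: keep a frame only if it is more than epsilon from all kept frames."""
    kept_indices: List[int] = []
    kept_embeddings: List[Sequence[float]] = []
    for index, embedding in enumerate(embeddings):
        is_novel = all(
            cosine_distance(embedding, kept) > epsilon
            for kept in kept_embeddings
        )
        if is_novel:
            kept_indices.append(index)
            kept_embeddings.append(embedding)
    return kept_indices


# Toy 2-D embeddings (made up): frames 1 and 3 nearly repeat frames 0 and 2.
frames = [
    [1.0, 0.0],
    [0.99, 0.05],
    [0.0, 1.0],
    [0.02, 1.0],
]
print(epsilon_net_filter(frames, epsilon=0.1))  # prints [0, 2]
```

Because each incoming frame is compared only against the retained set, the index stays small when the stream is highly redundant, which is the property the abstract attributes to the on-device filter.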

194. ❌ BigEarthNet.txt: A Large-Scale Multi-Sensor Image-Text Dataset and Benchmark for Earth Observation

Authors: Johann-Ludwig Herzog, Mathis Jürgen Adler, Leonard Hackel, Yan Shu, Angelos Zavras, Ioannis Papoutsis, Paolo Rota, Begüm Demir Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29630v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

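Scoring rationale: The paper focuses on building and benchmarking a vision-language dataset for Earth observation and is unrelated to most keywords, which concern large-model principles, training methods, and inference optimization. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", since the work applies AI to a scientific domain (Earth observation / remote sensing), although it does not involve bioinformatics or cheminformatics.

!!! tip deepseek-chat TL;DR

To address the lack of large-scale, multi-sensor image-text datasets in remote sensing, the paper introduces the BigEarthNet.txt dataset and shows experimentally that fine-tuning vision-language models on it yields consistent gains across the Earth observation tasks evaluated.

Abstract (Chinese translation)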

视觉语言模型在计算机视觉领域展现出强大性能，但其在遥感数据上的表现仍受限于缺乏具有多样化文本标注的大规模多传感器遥感图文数据集。现有数据集主要包含航拍红绿蓝影像，其描述文本简短或缺乏地理依据，且标注类型多样性有限。为突破这一局限，我们提出了BigEarthNet.txt——一个为推进地球观测领域中多任务指令驱动的图文学习而设计的大规模多传感器图文数据集。该数据集包含464,044幅配准的哨兵-1合成孔径雷达影像与哨兵-2多光谱影像，并配有960万条文本标注，涵盖：i) 描述土地利用/土地覆盖类别、空间关系及环境背景的地理锚定描述；ii) 适用于不同任务的视觉问答对；iii) 用于边界框预测的指代表达检测指令。通过对比统计分析，我们证明BigEarthNet.txt在文本丰富度与标注类型多样性上均超越现有遥感图文数据集。我们进一步构建了经人工验证的基准数据集划分，用于评估视觉语言模型在遥感与计算机视觉任务中的表现。结果表明，现有模型在涉及复杂土地利用/土地覆盖类别的任务中存在局限，而使用BigEarthNet.txt进行微调后，所有评估任务均取得持续的性能提升。

Abstract

Vision-language models (VLMs) have shown strong performance in computer vision (CV), yet their performance on remote sensing (RS) data remains limited due to the lack of large-scale, multi-sensor RS image-text datasets with diverse textual annotations. Existing datasets predominantly include aerial Red-Green-Blue imagery, with short or weakly grounded captions, and provide limited diversity in annotation types. To address this limitation, we introduce BigEarthNet.txt, a large-scale, multi-sensor image-text dataset designed to advance instruction-driven image-text learning in Earth observation across multiple tasks. BigEarthNet.txt contains 464044 co-registered Sentinel-1 synthetic aperture radar and Sentinel-2 multispectral images with 9.6M text annotations, including: i) geographically anchored captions describing land-use/land-cover (LULC) classes, their spatial relations, and environmental context; ii) visual question answering pairs relevant for different tasks; and iii) referring expression detection instructions for bounding box prediction. Through a comparative statistical analysis, we demonstrate that BigEarthNet.txt surpasses existing RS image-text datasets in textual richness and annotation type variety. We further establish a manually-verified benchmark split to evaluate VLMs in RS and CV. The results show the limitations of these models on tasks that involve complex LULC classes, whereas fine-tuning using BigEarthNet.txt results in consistent performance gains across all considered tasks.

Keywords: vision-language models, remote sensing, image-text dataset, Earth observation, multi-sensor, fine-tuning, benchmark, land-use/land-cover

195. ❌ Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis

Authors: Shuang Chen, Quanxin Shou, Hangting Chen, Yucheng Zhou, Kaituo Feng, Wenbo Hu, Yi-Fan Zhang, Yunlong Lin, Wenxuan Huang, Mingyang Song, Dasen Dai, Bolin Jiang, Manyuan Zhang, Shi-Xue Zhang, Zhengkai Jiang, Lucas Wang, Zhao Zhong, Yu Cheng, Nanyun Peng Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29620v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	15.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

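Scoring rationale: The paper presents Unify-Agent, a unified multimodal agent for world-grounded image synthesis. Its core contribution is recasting image generation as an agentic pipeline (prompt understanding, multimodal evidence searching, grounded recaptioning, and final synthesis), which is highly relevant to LLM agents, tool use, retrieval-augmented generation, world models, chain of thought, and System 2 thinking (8-15 points). It also touches on foundation models and factuality improvement (8 points) and on pre-training and fine-tuning (5 points). Other keywords, such as MoE, quantization, and AI for Science, are not addressed (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes Unify-Agent, which uses agentic modeling to recast image generation as a pipeline of reasoning, searching, and generation, addressing the limitations of unified multimodal models on long-tail and knowledge-intensive concepts. Experiments show that it substantially improves over its base model and approaches the world-knowledge capabilities of the strongest closed-source models.

Abstract (Chinese translation)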

统一多模态模型为理解多样且复杂的现实世界知识并生成高质量图像提供了一种自然且有前景的架构。然而，它们仍主要依赖于冻结的参数化知识，这使得其在涉及长尾和知识密集型概念的现实世界图像生成任务中表现不佳。受智能体在现实世界任务中广泛成功的启发，我们探索采用智能体建模来解决这一局限。具体而言，我们提出了Unify-Agent，一个用于世界知识接地的图像合成的统一多模态智能体，它将图像生成重新构建为一个智能体流程，包括提示理解、多模态证据搜索、接地重描述以及最终合成。为训练我们的模型，我们构建了一个定制的多模态数据处理流程，并精心整理了14.3万条用于世界知识接地的图像合成的高质量智能体轨迹，从而能够对整个智能体生成过程进行有效监督。我们进一步引入了FactIP基准，该基准涵盖12类具有文化意义和长尾特征的事实概念，明确要求外部知识接地。大量实验表明，我们提出的Unify-Agent在多种基准测试和现实世界生成任务中，相较于其基础统一模型有显著提升，同时在世界知识能力上接近最强的闭源模型。作为基于智能体的世界知识接地图像合成的早期探索，我们的工作凸显了将推理、搜索和生成紧密耦合对于实现可靠的开放世界智能体图像合成的价值。

Abstract

Unified multimodal models provide a natural and promising architecture for understanding diverse and complex real-world knowledge while generating high-quality images. However, they still rely primarily on frozen parametric knowledge, which makes them struggle with real-world image generation involving long-tail and knowledge-intensive concepts. Inspired by the broad success of agents on real-world tasks, we explore agentic modeling to address this limitation. Specifically, we present Unify-Agent, a unified multimodal agent for world-grounded image synthesis, which reframes image generation as an agentic pipeline consisting of prompt understanding, multimodal evidence searching, grounded recaptioning, and final synthesis. To train our model, we construct a tailored multimodal data pipeline and curate 143K high-quality agent trajectories for world-grounded image synthesis, enabling effective supervision over the full agentic generation process. We further introduce FactIP, a benchmark covering 12 categories of culturally significant and long-tail factual concepts that explicitly requires external knowledge grounding. Extensive experiments show that our proposed Unify-Agent substantially improves over its base unified model across diverse benchmarks and real world generation tasks, while approaching the world knowledge capabilities of the strongest closed-source models. As an early exploration of agent-based modeling for world-grounded image synthesis, our work highlights the value of tightly coupling reasoning, searching, and generation for reliable open-world agentic image synthesis.

Keywords: Unified Multimodal Agent, World-Grounded Image Synthesis, Agentic Pipeline, Multimodal Evidence Searching, Factual Concepts, Knowledge Grounding, Open-World Generation, Reasoning-Searching-Generation Coupling

196. ❌ Video-Oasis: Rethinking Evaluation of Video Understanding

Authors: Geuntaek Lim, Minho Shim, Sungjune Park, Jaeyun Lee, Inwoong Lee, Taeoh Kim, Dongyoon Wee, Yukyung Choi Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29616v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

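Scoring rationale: The paper "Video-Oasis: Rethinking Evaluation of Video Understanding" studies evaluation methodology for video understanding: it proposes a diagnostic suite that systematically evaluates existing benchmarks and analyzes the spatio-temporal challenges in video understanding. Its content covers video understanding, benchmark evaluation, and model performance analysis, but it does not address large models, deep learning principles, AI for Science, or any other keyword. All keywords concern specific techniques such as large-model technology, training methods, inference optimization, and AI applications, whereas this is purely an evaluation-methodology study in computer vision / video understanding and is unrelated to them.

!!! tip deepseek-chat TL;DR

The paper examines problems in the evaluation of video understanding, finding that 54% of existing benchmark samples can be solved without visual input or temporal context and that state-of-the-art models perform barely above random guessing on the remaining samples. It proposes the Video-Oasis diagnostic suite to provide guidelines for more rigorous evaluation.

Abstract (Chinese translation)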

视频理解固有的复杂性使得难以判断性能提升究竟源于视觉感知、语言推理还是先验知识。尽管已有众多基准测试用于评估高级推理能力，但构成视频理解的核心标准在很大程度上仍被忽视。我们并未引入新的基准，而是退后一步重新审视当前视频理解的研究现状。本文提出Video-Oasis——一个可持续的诊断套件，旨在系统评估现有评测体系，并提炼出视频理解中的时空挑战。我们的分析揭示了两项关键发现：(1) 现有基准测试中54%的样本无需视觉输入或时序上下文即可求解；(2) 在剩余样本上，最先进模型的性能仅略高于随机猜测。为弥合这一差距，我们探究了哪些算法设计选择有助于实现鲁棒的视频理解，为未来研究提供实用指南。我们希望这项工作能为基准构建和架构开发的严谨评估提供标准参考。代码发布于https://github.com/sejong-rcv/Video-Oasis。

Abstract

The inherent complexity of video understanding makes it difficult to attribute whether performance gains stem from visual perception, linguistic reasoning, or knowledge priors. While many benchmarks have emerged to assess high-level reasoning, the essential criteria that constitute video understanding remain largely overlooked. Instead of introducing yet another benchmark, we take a step back to re-examine the current landscape of video understanding. In this work, we provide Video-Oasis, a sustainable diagnostic suite designed to systematically evaluate existing evaluations and distill spatio-temporal challenges for video understanding. Our analysis reveals two critical findings: (1) 54% of existing benchmark samples are solvable without visual input or temporal context, and (2) on the remaining samples, state-of-the-art models exhibit performance barely exceeding random guessing. To bridge this gap, we investigate which algorithmic design choices contribute to robust video understanding, providing practical guidelines for future research. We hope our work serves as a standard guideline for benchmark construction and the rigorous evaluation of architecture development. Code is available at https://github.com/sejong-rcv/Video-Oasis.

Keywords: video understanding, evaluation, benchmark, diagnostic suite, spatio-temporal challenges, model performance, visual perception, temporal context
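
The first finding suggests a simple diagnostic: run a model with the video withheld, or with a single frame only, and count how many samples it still answers correctly. The sketch below is a hypothetical illustration of that bookkeeping, not the Video-Oasis code; the field names `text_only_correct` and `single_frame_correct`, the function name, and the toy records are all assumptions.

```python
from typing import Dict, List


def shortcut_fraction(samples: List[Dict[str, bool]]) -> float:
    """Return the share of samples solved without video or without temporal context."""
    if not samples:
        return 0.0
    solvable = sum(
        1 for sample in samples
        if sample["text_only_correct"] or sample["single_frame_correct"]
    )
    return solvable / len(samples)


# Toy records (made up): each flag says whether a blind or single-frame run was correct.
toy_samples = [
    {"text_only_correct": True, "single_frame_correct": False},
    {"text_only_correct": False, "single_frame_correct": True},
    {"text_only_correct": False, "single_frame_correct": False},
    {"text_only_correct": False, "single_frame_correct": False},
]
print(shortcut_fraction(toy_samples))  # prints 0.5
```

Samples flagged this way would be set aside, so that the remaining ones are those that plausibly require spatio-temporal understanding.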

197. ❌ FlowID : Enhancing Forensic Identification with Latent Flow-Matching Models

Authors: Jules Ripoll, David Bertoin, Alasdair Newson, Charles Dossal, Jose Pablo Baraybar Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29591v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

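Scoring rationale: The paper "FlowID : Enhancing Forensic Identification with Latent Flow-Matching Models" uses image generation models for forensic facial reconstruction, at the intersection of computer vision and forensic science. All keywords revolve around large language models (LLMs) and related techniques (training, inference, alignment, agents, and so on), whereas the core of this paper is an image generation model (such as a diffusion or flow-matching model), not an LLM. Therefore, apart from 5 points for "AI for Science OR Bioinformatics OR Cheminformatics" owing to its scientific application (forensic science), every keyword scores 0. The paper does not mention any of the designated expert authors.

!!! tip deepseek-chat TL;DR

The paper proposes FlowID, an identity-preserving facial reconstruction method that uses image generation models to process faces injured in violent deaths in support of forensic identification, and it outperforms existing open-source methods on the new InjuredFaces benchmark.

Abstract (Chinese translation)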

每天都有许多人死于暴力环境，无论是犯罪、战争、迁徙还是气候灾难。法医学与执法机构会记录许多逝者的肖像作为证据，但无法立即对其进行身份识别。虽然传统图像编辑工具可以处理这些照片以供公开发布，但工作流程冗长且效果欠佳。在本研究中，我们利用图像生成模型的最新进展——这些模型现已能够生成逼真的人像——提出FlowID，一种保持身份特征的面部重建方法。我们的方法结合了单图像微调（使生成模型能适应分布外受伤面部）与基于注意力的掩码技术（将编辑定位在受损区域，同时保留身份关键特征）。这些组件共同实现了暴力致死痕迹的消除，同时保留足够的身份信息以支持识别。为评估本方法，我们引入了InjuredFaces（受损面部）这一针对严重面部损伤下身份保持面部重建的新基准。除作为本研究的评估工具外，InjuredFaces还为学界提供了标准化资源，用于研究和比较极端条件下的面部重建方法。实验结果表明，FlowID在保持较低内存需求的同时优于现有开源方法，使其适合本地部署且不损害数据隐私。

Abstract

Every day, many people die under violent circumstances, whether from crimes, war, migration, or climate disasters. Medico-legal and law enforcement institutions document many portraits of the deceased for evidence, but cannot immediately carry out identification on them. While traditional image editing tools can process these photos for public release, the workflow is lengthy and produces suboptimal results. In this work, we leverage advances in image generation models, which can now produce photorealistic human portraits, to introduce FlowID, an identity-preserving facial reconstruction method. Our approach combines single-image fine-tuning, which adapts the generative model to out-of-distribution injured faces, with attention-based masking that localizes edits to damaged regions while preserving identity-critical features. Together, these components enable the removal of artifacts from violent death while retaining sufficient identity information to support identification. To evaluate our method, we introduce InjuredFaces, a novel benchmark for identity-preserving facial reconstruction under severe facial damage. Beyond serving as an evaluation tool for this work, InjuredFaces provides a standardized resource for the community to study and compare methods addressing facial reconstruction in extreme conditions. Experimental results show that FlowID outperforms state-of-the-art open-source methods while maintaining low memory requirements, making it suitable for local deployment without compromising data privacy.

Keywords: forensic identification, facial reconstruction, image generation models, identity-preserving, injured faces, latent flow-matching, single-image fine-tuning, attention-based masking

198. ❌ Transmittance-Guided Structure-Texture Decomposition for Nighttime Image Dehazing

Authors: Francesco Moretti, Giulia Bianchi, Andrea Gallo Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29507v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

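Scoring rationale: The paper addresses nighttime image dehazing, a computer vision task, and proposes a two-stage framework based on transmittance correction and structure-texture decomposition. It covers image processing, color-space conversion, filtering algorithms, and image fusion, but does not touch large language models, deep learning principles, AI for Science, or any large-model topic among the scoring keywords. All keywords relate to large models, deep learning techniques, or scientific AI applications, whereas this is traditional image-processing research, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper proposes a two-stage nighttime image dehazing method that combines transmittance correction with structure-texture decomposition in the YUV color space to address the low visibility, color distortion, and reduced contrast of hazy nighttime images.

Abstract (Chinese translation)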

在雾霾条件下拍摄的夜间图像因大气散射、悬浮颗粒吸收及人工光源非均匀照明的共同作用，其质量严重退化，表现为能见度低、色彩失真和对比度下降。现有夜间去雾方法虽取得部分成效，但通常仅解决部分问题（如光晕抑制或亮度增强），未能联合处理全部退化因素。本文提出一种融合透射率校正与结构-纹理分层优化的两阶段夜间图像去雾框架。第一阶段，我们提出一种新颖的透射率校正方法：首先建立边界约束的初始透射率图，随后根据图像区域是否对应光源区域进行区域自适应补偿与归一化处理。通过YUV色彩空间中的二次高斯滤波方案估计空间变化的大气光图。校正后的透射率图与大气光图结合改进的夜间成像模型，生成初始去雾图像。第二阶段，我们提出STAR-YUV分解模型，在YUV色彩空间内将去雾图像分解为结构层与纹理层。对结构层采用伽马校正与基于MSRCR的色彩恢复进行光照补偿和色彩偏差校正，同时对纹理层应用拉普拉斯-高斯滤波以增强细节。通过包含非线性Retinex层融合与初始去雾结果线性混合的两阶段融合策略，最终生成优化输出。

Abstract

Nighttime images captured under hazy conditions suffer from severe quality degradation, including low visibility, color distortion, and reduced contrast, caused by the combined effects of atmospheric scattering, absorption by suspended particles, and non-uniform illumination from artificial light sources. While existing nighttime dehazing methods have achieved partial success, they typically address only a subset of these issues, such as glow suppression or brightness enhancement, without jointly tackling the full spectrum of degradation factors. In this paper, we propose a two-stage nighttime image dehazing framework that integrates transmittance correction with structure-texture layered optimization. In the first stage, we introduce a novel transmittance correction method that establishes boundary-constrained initial transmittance maps and subsequently applies region-adaptive compensation and normalization based on whether image regions correspond to light source areas. A quadratic Gaussian filtering scheme operating in the YUV color space is employed to estimate the spatially varying atmospheric light map. The corrected transmittance map and atmospheric light map are then used in conjunction with an improved nighttime imaging model to produce the initial dehazed image. In the second stage, we propose a STAR-YUV decomposition model that separates the dehazed image into structure and texture layers within the YUV color space. Gamma correction and MSRCR-based color restoration are applied to the structure layer for illumination compensation and color bias correction, while Laplacian-of-Gaussian filtering is applied to the texture layer for detail enhancement. A novel two-phase fusion strategy, comprising nonlinear Retinex-based fusion of the enhanced layers followed by linear blending with the initial dehazing result, yields the final output.

Keywords: nighttime image dehazing, transmittance correction, structure-texture decomposition, YUV color space, atmospheric light estimation, image fusion, illumination compensation, detail enhancement
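
The abstract does not spell out the improved nighttime imaging model, so the sketch below falls back on the standard atmospheric scattering model, `I = J * t + A * (1 - t)`, with a per-pixel atmospheric light map `A`, and inverts it for the scene radiance `J`. It is meant only to show how a transmittance map and an atmospheric light map combine into an initial dehazed image; the lower bound `t_min`, the single-channel list-of-lists image type, and the function name are assumptions, and the paper's own model may differ.

```python
from typing import List

Image = List[List[float]]  # single-channel image with values in [0, 1]


def recover_radiance(hazy: Image, transmittance: Image, airlight: Image,
                     t_min: float = 0.1) -> Image:
    """Invert I = J*t + A*(1-t) per pixel: J = (I - A) / max(t, t_min) + A."""
    restored: Image = []
    for row_i, row_t, row_a in zip(hazy, transmittance, airlight):
        restored_row = []
        for i, t, a in zip(row_i, row_t, row_a):
            j = (i - a) / max(t, t_min) + a  # t_min avoids dividing by ~0
            restored_row.append(min(1.0, max(0.0, j)))  # clamp to valid range
        restored.append(restored_row)
    return restored


# Toy 1x2 image (made up): true radiance 0.2 seen through t = 0.5 with A = 0.8,
# so the observed hazy value is 0.2 * 0.5 + 0.8 * 0.5 = 0.5.
hazy = [[0.5, 0.5]]
transmittance = [[0.5, 0.5]]
airlight = [[0.8, 0.8]]
print(recover_radiance(hazy, transmittance, airlight))
```

On the toy input the recovered values come out at approximately 0.2, up to floating-point rounding. Clamping the transmittance from below is a common safeguard, since very small values would otherwise amplify noise in dense haze.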

199. ❌ Emotion Diffusion Classifier with Adaptive Margin Discrepancy Training for Facial Expression Recognition

Authors: Rongkang Dong, Cuixin Yang, Cong Zhang, Yushen Zuo, Kin-Man Lam Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29578v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

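Scoring rationale: The paper focuses on facial expression recognition (FER) and proposes a method based on a conditional generative diffusion model (EmoDC) and Adaptive Margin Discrepancy Training (AMDiT). Although it involves a deep learning technique (diffusion models), its content is unrelated to all scoring keywords, which center on large language models, training techniques, inference optimization, and AI agents. The paper does not mention large models, language models, MoE, scaling laws, pre-training/post-training, alignment, RLHF, PEFT, RAG, context extension, attention optimization, reasoning techniques, AI agents, quantization, decoding acceleration, hallucination mitigation, interpretability, world models, model merging, in-context learning, or scientific AI applications. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

To address the vulnerability of existing facial expression recognition models to distribution shift, the paper proposes EmoDC, a classifier based on a conditional generative diffusion model, and significantly improves its recognition accuracy through Adaptive Margin Discrepancy Training (AMDiT); EmoDC is also more robust to noise and blur than state-of-the-art discriminative classifiers.

Abstract (Chinese translation)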

面部表情识别（Facial Expression Recognition, FER）在人机交互中至关重要，它使机器能够通过面部情感行为解读人类情绪与内在状态。尽管深度学习显著提升了FER的性能，但现有大多数基于深度学习的FER方法严重依赖判别式分类器以实现快速预测。这些模型倾向于学习捷径，并且即使面对微小的分布偏移也显得脆弱。为解决这一问题，我们采用条件生成扩散模型，并提出了用于FER的情感扩散分类器（Emotion Diffusion Classifier, EmoDC），该模型展现出更强的对抗鲁棒性。然而，使用标准策略重新训练EmoDC未能有效惩罚错误的类别描述，导致识别性能欠佳。为改进EmoDC，我们提出了基于间隔的差异训练方法，该方法鼓励模型在基于正确类别描述的条件下做出准确预测，并惩罚基于不匹配描述条件下的预测。此方法在正确与错误类别的噪声预测误差之间强制设定一个最小间隔，从而增强了模型的判别能力。然而，使用固定间隔未能考虑到不同图像在噪声预测难度上的差异，限制了其有效性。为克服这一局限，我们提出了自适应间隔差异训练（Adaptive Margin Discrepancy Training, AMDiT），该方法动态调整每个样本的间隔。大量实验表明，在RAF-DB基础子集、RAF-DB复合子集、SFEW-2.0和AffectNet数据集上进行100步评估时，AMDiT相比采用标准去噪扩散训练的基准模型，显著提升了EmoDC的准确率。此外，EmoDC在对抗噪声和模糊的鲁棒性方面优于当前最先进的判别式分类器。

Abstract

Facial Expression Recognition (FER) is essential for human-machine interaction, as it enables machines to interpret human emotions and internal states from facial affective behaviors. Although deep learning has significantly advanced FER performance, most existing deep-learning-based FER methods rely heavily on discriminative classifiers for fast predictions. These models tend to learn shortcuts and are vulnerable to even minor distribution shifts. To address this issue, we adopt a conditional generative diffusion model and introduce the Emotion Diffusion Classifier (EmoDC) for FER, which demonstrates enhanced adversarial robustness. However, retraining EmoDC using standard strategies fails to penalize incorrect categorical descriptions, leading to suboptimal recognition performance. To improve EmoDC, we propose margin-based discrepancy training, which encourages accurate predictions when conditioned on correct categorical descriptions and penalizes predictions conditioned on mismatched ones. This method enforces a minimum margin between noise-prediction errors for correct and incorrect categories, thereby enhancing the model’s discriminative capability. Nevertheless, using a fixed margin fails to account for the varying difficulty of noise prediction across different images, limiting its effectiveness. To overcome this limitation, we propose Adaptive Margin Discrepancy Training (AMDiT), which dynamically adjusts the margin for each sample. Extensive experiments show that AMDiT significantly improves the accuracy of EmoDC over the Base model with standard denoising diffusion training on the RAF-DB basic subset, the RAF-DB compound subset, SFEW-2.0, and AffectNet, in 100-step evaluations. Additionally, EmoDC outperforms state-of-the-art discriminative classifiers in terms of robustness against noise and blur.

Keywords: Facial Expression Recognition, Diffusion Model, Emotion Diffusion Classifier, Adaptive Margin Discrepancy Training, Adversarial Robustness, Conditional Generative Model, Noise Prediction, Discriminative Classifier
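
The margin idea in the abstract can be illustrated with a hinge-style loss on noise-prediction errors: the error under the correct label should be lower than the error under each incorrect label by at least a margin. The sketch below rests on several assumptions and is not the paper's loss: the hinge form, the averaging over incorrect labels, and the names `margin_discrepancy_loss` and `classify_by_error` are hypothetical. The rule AMDiT uses to set each sample's margin is not given in the abstract, so the margin is simply passed in. Predicting the label with the lowest error follows the usual diffusion-classifier recipe and is likewise an assumption here.

```python
from typing import Dict, Sequence


def margin_discrepancy_loss(err_correct: float,
                            errs_incorrect: Sequence[float],
                            margin: float) -> float:
    """Penalize incorrect labels whose error is not at least `margin` above the correct one."""
    if not errs_incorrect:
        return 0.0
    penalties = [max(0.0, margin + err_correct - err) for err in errs_incorrect]
    return sum(penalties) / len(penalties)


def classify_by_error(errors_by_label: Dict[str, float]) -> str:
    """Predict the label whose conditioning gives the lowest noise-prediction error."""
    return min(errors_by_label, key=lambda label: errors_by_label[label])


# Toy errors (made up): "sad" is well separated, "neutral" violates the margin.
toy_errors = {"happy": 0.20, "sad": 0.90, "neutral": 0.25}
loss = margin_discrepancy_loss(
    err_correct=toy_errors["happy"],
    errs_incorrect=[toy_errors["sad"], toy_errors["neutral"]],
    margin=0.10,
)
print(classify_by_error(toy_errors), loss)
```

Only the "neutral" label contributes to the loss in this toy case, because its error is within the margin of the correct label's error. A per-sample margin would replace the constant `0.10` with a value chosen for each image.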

200. ❌ All-in-One Augmented Reality Guided Head and Neck Tumor Resection

Authors: Yue Yang, Matthieu Chabanas, Carrie Reale, Annie Benson, Jason Slagle, Matthew Weinger, Michael Topf, Jie Ying Wu Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29495v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

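Scoring rationale: The paper studies an augmented reality guidance system for head and neck tumor surgery, a medical application. All keywords relate to large models, deep learning principles, or scientific AI applications, but the paper involves no large model, deep learning technique, or novel AI algorithm; it only uses off-the-shelf AR technology (HoloLens 2) for surgical navigation. The only possibly relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because a medical application can be regarded as AI applied to a scientific domain; since the paper does not explicitly use AI algorithms and relies on AR alone, it receives 5 points (some relevance). All other keywords are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper develops a markerless augmented reality system for relocalizing positive margins during head and neck tumor surgery; in a silicone phantom study, it reduces the median localization error from 14.2 mm with verbal guidance to 3.2 mm.

Abstract (Chinese translation)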

切缘阳性在头颈部鳞状细胞癌中较为常见，但术中二次切除往往不够精确，因为病理科通常仅以口头方式通报切缘位置。本研究提出了一体化增强现实（AR）系统，该系统利用HoloLens 2深度传感与全自动无标记表面配准技术，将切除标本中的阳性切缘重定位至切除床并进行原位可视化。在包含六名医学培训学员的硅胶模型研究中，无标记配准实现了与基于标记的基准方法相当的靶向配准误差（中位数1.8毫米对1.7毫米；最大值均小于4毫米）。在切缘重定位任务中，AR引导将口头引导的误差（中位数14.2毫米）降低至毫米级（中位数3.2毫米），所有AR定位误差均在5毫米以内。这些结果证实了无标记AR切缘引导技术实现更精确术中二次切除的可行性。

Abstract

Positive margins are common in head and neck squamous cell carcinoma, yet intraoperative re-resection is often imprecise because margin locations are typically communicated verbally from pathology. We present an all-in-one augmented reality (AR) system that relocalizes positive margins from a resected specimen to the resection bed and visualizes them in situ using HoloLens 2 depth sensing and fully automated markerless surface registration. In a silicone phantom study with six medical trainees, markerless registration achieved target registration errors comparable to a marker-based baseline (median 1.8 mm vs. 1.7 mm; maximum < 4 mm). In a margin relocalization task, AR guidance reduced error from verbal guidance (median 14.2 mm) to a few millimeters (median 3.2 mm), with all AR localizations within 5 mm error. These results support the feasibility of markerless AR margin guidance for more precise intraoperative re-excision.

Keywords: augmented reality, head and neck tumor resection, margin relocalization, markerless registration, HoloLens 2, intraoperative guidance, surgical navigation, target registration error

201. ❌ Square Superpixel Generation and Representation Learning via Granular Ball Computing

Authors: Shuyin Xia, Meng Yang, Dawei Dai, Fan Chen, Shilin Zhao, Junwei Han, Xinbo Gao, Guoyin Wang, Wen Lu Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29460v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

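Scoring rationale: The paper focuses on superpixel generation and representation learning in computer vision. It proposes a square superpixel method based on granular-ball computing and integrates it into graph neural networks and Vision Transformers. All scoring keywords relate directly to large language models, the principles of deep learning techniques, or AI applications in science, whereas this paper addresses a conventional visual representation learning problem and does not involve large models, innovations in deep learning principles, or AI applications in scientific fields such as biology or chemistry. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes a square superpixel generation method based on granular-ball computing, which addresses the incompatibility between conventional irregular superpixels and deep learning frameworks, and validates its effectiveness on vision tasks through experiments.

Abstract (Chinese translation)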

超像素提供了一种紧凑的基于区域的表示方法，能够保留物体边界和局部结构，因此已被广泛应用于各种视觉任务中以降低计算成本。然而，现有的大多数超像素算法会产生不规则形状的区域，这些区域与卷积等规则算子不能很好地对齐。因此，超像素通常被视为离线预处理步骤，这限制了并行实现，并阻碍了深度学习流程中的端到端优化。受粒度球计算的自适应表示和覆盖特性启发，我们开发了一种方形超像素生成方法。具体而言，我们使用多尺度方形块来近似超像素，以避免不规则形状带来的计算和实现困难，从而实现高效的并行处理和可学习的特征提取。对于每个块，我们基于像素强度相似性计算一个纯度分数，并据此选择高质量的块。所生成的方形超像素可以轻松地作为图神经网络中的图节点或视觉变换器中的令牌进行集成，促进多尺度信息聚合和结构化视觉表示。在下游任务上的实验结果展示了一致的性能提升，验证了所提方法的有效性。

Abstract

Superpixels provide a compact region-based representation that preserves object boundaries and local structures, and have therefore been widely used in a variety of vision tasks to reduce computational cost. However, most existing superpixel algorithms produce irregularly shaped regions, which are not well aligned with regular operators such as convolutions. Consequently, superpixels are often treated as an offline preprocessing step, limiting parallel implementation and hindering end-to-end optimization within deep learning pipelines. Motivated by the adaptive representation and coverage property of granular-ball computing, we develop a square superpixel generation approach. Specifically, we approximate superpixels using multi-scale square blocks to avoid the computational and implementation difficulties induced by irregular shapes, enabling efficient parallel processing and learnable feature extraction. For each block, a purity score is computed based on pixel-intensity similarity, and high-quality blocks are selected accordingly. The resulting square superpixels can be readily integrated as graph nodes in graph neural networks (GNNs) or as tokens in Vision Transformers (ViTs), facilitating multi-scale information aggregation and structured visual representation. Experimental results on downstream tasks demonstrate consistent performance improvements, validating the effectiveness of the proposed method.

Keywords: square superpixel, granular-ball computing, graph neural networks, Vision Transformers, multi-scale representation, parallel processing, learnable feature extraction, visual representation
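
The abstract describes approximating superpixels with multi-scale square blocks and scoring each block by a purity measure based on pixel-intensity similarity. The sketch below is one possible reading of that idea, not the authors' algorithm: the purity definition (fraction of pixels within `tol` of the block median), the quadtree-style splitting, and all names and thresholds are assumptions made for illustration. It also assumes a square grayscale image whose side is a power of two.

```python
import numpy as np

def purity(block, tol=10.0):
    """Hypothetical purity: share of pixels within `tol` of the block median."""
    return float(np.mean(np.abs(block - np.median(block)) <= tol))

def square_superpixels(img, min_size=1, thresh=0.9, tol=10.0):
    """Return (row, col, side) squares covering a square 2D image.

    A block is kept when it is pure enough or cannot be split further;
    otherwise it is divided into four half-size squares.
    """
    img = np.asarray(img, dtype=float)
    blocks = []

    def split(r, c, s):
        block = img[r:r + s, c:c + s]
        if s <= min_size or purity(block, tol) >= thresh:
            blocks.append((r, c, s))
            return
        h = s // 2
        for dr in (0, h):
            for dc in (0, h):
                split(r + dr, c + dc, h)

    split(0, 0, img.shape[0])
    return blocks
```

Because every output region is an axis-aligned square, each block can be cropped, resized, and batched like an ordinary image patch, which is the property the abstract highlights for use as GNN nodes or ViT tokens.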

202. ❌ VecAttention: Vector-wise Sparse Attention for Accelerating Long Context Inference

Authors: Anmin Liu, Ruixuan Yang, Huiqiang Jiang, Bin Lin, Minmin Sun, Yong Li, Chen Zhang, Tao Xie Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29494v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	8.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	10.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

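Scoring rationale: The paper proposes VecAttention, a vector-wise sparse attention framework for accelerating inference in long-context video models. The relevant keywords are: (1) 'Mixture of Experts OR MoE OR Sparse Models' (10 points): the paper's core is a sparse attention method, which is directly relevant; (2) 'Context Window Extension OR Long Context LLMs' (8 points): it targets long-context video understanding and generation and, although not an LLM, addresses similar long-context challenges; (3) 'KV Cache Compression OR Linear Attention OR FlashAttention' (10 points): it is an attention optimization technique in the same family as FlashAttention; (4) 'Speculative Decoding OR Inference Acceleration' (10 points): its core goal is inference acceleration, achieving a 2.65× speedup. The remaining keywords are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes VecAttention, a vector-wise sparse attention framework that accelerates inference in long-context video models by dynamically selecting informative vectors, achieving a 2.65× speedup while keeping accuracy comparable to full attention.

Abstract (Chinese translation)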

长视频理解与生成对基于Transformer的视频模型构成显著计算挑战，这源于自注意力机制的二次复杂度。现有稀疏注意力方法虽采用粗粒度模式提升效率，但通常伴随冗余计算与次优性能。为解决此问题，本文提出VecAttention，一种新颖的向量级稀疏注意力框架，可为视频模型实现更优的精度-效率权衡。我们观察到视频注意力图呈现强烈的垂直向量稀疏模式，并进一步证明相较于现有粗粒度稀疏模式，这种垂直向量模式能持续提供更优的精度-稀疏度平衡。基于此发现，VecAttention通过轻量级重要向量选择机制动态筛选并仅处理信息丰富的垂直向量，该机制最小化内存访问开销，并配合优化的向量稀疏注意力内核实现高效计算。在视频理解（VideoMME、LongVideoBench与VCRBench）与生成（VBench）任务上的综合评估表明，VecAttention相比全注意力机制实现2.65×加速，较当前最先进的稀疏注意力方法提升1.83×速度，同时保持与全注意力相当的精度。代码已发布于 https://github.com/anminliu/VecAttention 。

Abstract

Long-context video understanding and generation pose a significant computational challenge for Transformer-based video models due to the quadratic complexity of self-attention. While existing sparse attention methods employ coarse-grained patterns to improve efficiency, they typically incur redundant computation and suboptimal performance. To address this issue, in this paper, we propose VecAttention, a novel framework of vector-wise sparse attention that achieves superior accuracy-efficiency trade-offs for video models. We observe that video attention maps exhibit a strong vertical-vector sparse pattern, and further demonstrate that this vertical-vector pattern offers consistently better accuracy-sparsity trade-offs compared with existing coarse-grained sparse patterns. Based on this observation, VecAttention dynamically selects and processes only informative vertical vectors through a lightweight important-vector selection that minimizes memory access overhead and an optimized kernel of vector sparse attention. Comprehensive evaluations on video understanding (VideoMME, LongVideoBench, and VCRBench) and generation (VBench) tasks show that VecAttention delivers a 2.65× speedup over full attention and a 1.83× speedup over state-of-the-art sparse attention methods, with comparable accuracy to full attention. Our code is available at https://github.com/anminliu/VecAttention.

Keywords: VecAttention, sparse attention, long-context video, inference acceleration, vector-wise attention, Transformer, video understanding, video generation
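
To make the "vertical vector" idea more concrete, the following NumPy sketch keeps, for each block of consecutive queries, only a small set of key columns and runs softmax attention over those columns alone. It is an illustration under stated assumptions rather than the paper's method: a vertical vector is taken here to mean the column segment one key contributes to a query block, the importance proxy (mean query of the block dotted with every key) is a stand-in for the paper's selection mechanism, and `block` and `keep` are hypothetical parameters. The optimized kernel mentioned in the abstract is not reproduced.

```python
import numpy as np

def vector_sparse_attention(q, k, v, block=4, keep=8):
    """Single-head, non-causal attention that keeps `keep` key columns per query block."""
    n, d = q.shape
    keep_n = min(keep, k.shape[0])
    out = np.empty((n, v.shape[1]))
    for start in range(0, n, block):
        qb = q[start:start + block]                       # queries of this block
        proxy = qb.mean(axis=0) @ k.T                     # cheap per-key importance (assumed proxy)
        idx = np.argpartition(proxy, -keep_n)[-keep_n:]   # indices of the kept key columns
        scores = qb @ k[idx].T / np.sqrt(d)
        scores -= scores.max(axis=1, keepdims=True)       # numerically stable softmax
        weights = np.exp(scores)
        weights /= weights.sum(axis=1, keepdims=True)
        out[start:start + block] = weights @ v[idx]
    return out
```

When `keep` equals the number of keys, the function reduces to ordinary full attention, which gives a simple way to check the sparse path against a dense reference.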

203. ❌ FedDBP: Enhancing Federated Prototype Learning with Dual-Branch Features and Personalized Global Fusion

Authors: Ningzhi Gao, Siquan Huang, Leyu Shi, Ying Gao Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29455v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

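Scoring rationale: The paper focuses on FedDBP, a federated prototype learning (FPL) method that targets data and model heterogeneity in heterogeneous federated learning. Its core contributions are a client-side dual-branch feature projector (combining L2 alignment with contrastive learning) and a server-side personalized global prototype fusion (using Fisher information). The scoring keywords all center on innovations in large-model or deep learning principles, or on their applications in science, whereas this paper studies prototype learning within federated learning, which belongs to distributed machine learning and has no direct connection to the large-model techniques or AI-for-science topics in the keyword list. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper addresses two limitations of existing federated prototype learning methods in heterogeneous federated learning: the difficulty of balancing feature fidelity against discriminability, and reliance on a single global prototype. It proposes FedDBP, which combines a client-side dual-branch feature projector with server-side personalized global prototype fusion and outperforms ten existing advanced methods in experiments.

Abstract (Chinese translation)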

联邦原型学习（Federated Prototype Learning, FPL）作为解决异构联邦学习（Heterogeneous Federated Learning, HFL）的一种方案，能有效缓解数据与模型异构性带来的挑战。然而，现有FPL方法未能平衡特征的真实性与判别力，且受限于单一的全局原型。本文提出一种新颖的FPL方法FedDBP以解决上述问题。在客户端侧，我们设计了一种双分支特征投影器，同时采用L2对齐与对比学习，从而确保局部特征的真实性与判别力。在服务器侧，我们引入了一种个性化全局原型融合方法，利用费舍尔信息识别局部原型的关键通道。大量实验证明，FedDBP在性能上优于十种现有先进方法。

Abstract

Federated prototype learning (FPL), as a solution to heterogeneous federated learning (HFL), effectively alleviates the challenges of data and model heterogeneity. However, existing FPL methods fail to balance the fidelity and discriminability of the feature, and are limited by a single global prototype. In this paper, we propose FedDBP, a novel FPL method to address the above issues. On the client-side, we design a Dual-Branch feature projector that employs L2 alignment and contrastive learning simultaneously, thereby ensuring both the fidelity and discriminability of local features. On the server-side, we introduce a Personalized global prototype fusion approach that leverages Fisher information to identify the important channels of local prototypes. Extensive experiments demonstrate the superiority of FedDBP over ten existing advanced methods.

Keywords: Federated Prototype Learning, Heterogeneous Federated Learning, Dual-Branch Feature Projector, L2 Alignment, Contrastive Learning, Personalized Global Prototype Fusion, Fisher Information, Feature Fidelity and Discriminability

Authors: Yijie Zheng, Weijie Wu, Bingyue Wu, Long Zhao, Guoqing Li, Mikolaj Czerkawski, Konstantin Klemmer Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29441v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

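Scoring rationale: The paper introduces EarthEmbeddingExplorer, a web application for cross-modal retrieval over global satellite imagery. Keyword relevance is as follows. 1. 'Large Language Models OR LLMs OR Foundation Models' (8 points): the paper mentions 'high-impact foundation models' and 'precomputed Earth embeddings', which indicates that foundation-model technology is used. 2. 'Retrieval-Augmented Generation OR RAG OR Retrieval-Generation' (8 points): the paper's core is cross-modal retrieval, covering natural-language, visual, and geolocation queries. 3. 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points): the paper is an AI application in Earth observation science, specifically satellite image analysis and scientific discovery. The other keywords (such as MoE, SFT, and RLHF) are unrelated, because the paper focuses on building an application tool rather than on the details of the underlying model technology.

!!! tip deepseek-chat TL;DR

The paper introduces EarthEmbeddingExplorer, an interactive web application that turns static Earth observation foundation models and embedding datasets into dynamic, practical cross-modal retrieval workflows to support scientific discovery.

Abstract (Chinese translation)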

尽管地球观测领域已涌现出大量高影响力的基础模型和全球地球嵌入数据集，如何将这些学术资产转化为可自由访问的工具仍存在显著障碍。本教程介绍EarthEmbeddingExplorer——一个旨在弥合此鸿沟的交互式网络应用程序，它将静态研究成果转化为动态实用的探索工作流。我们将提供该系统的全面实践指南，详细阐述其云原生软件架构，演示跨模态查询（自然语言、视觉与地理位置），并展示如何从检索结果中提取科学洞见。通过普及对预计算地球嵌入数据的访问，本教程使研究人员能够无缝地从前沿模型与数据档案过渡到实际应用与分析。该网络应用程序可通过 https://modelscope.ai/studios/Major-TOM/EarthEmbeddingExplorer 访问。

Abstract

While the Earth observation community has witnessed a surge in high-impact foundation models and global Earth embedding datasets, a significant barrier remains in translating these academic assets into freely accessible tools. This tutorial introduces EarthEmbeddingExplorer, an interactive web application designed to bridge this gap, transforming static research artifacts into dynamic, practical workflows for discovery. We will provide a comprehensive hands-on guide to the system, detailing its cloud-native software architecture, demonstrating cross-modal queries (natural language, visual, and geolocation), and showcasing how to derive scientific insights from retrieval results. By democratizing access to precomputed Earth embeddings, this tutorial empowers researchers to seamlessly transition from state-of-the-art models and data archives to real-world application and analysis. The web application is available at https://modelscope.ai/studios/Major-TOM/EarthEmbeddingExplorer.

Keywords: Earth observation, foundation models, cross-modal retrieval, satellite images, web application, Earth embeddings, scientific insights, interactive tool

205. ❌ Polyhedral Unmixing: Bridging Semantic Segmentation with Hyperspectral Unmixing via Polyhedral-Cone Partitioning

Authors: Antoine Bottenmuller, Etienne Decencière, Petr Dokládal Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29438v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

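Scoring rationale: The paper studies semantic segmentation and spectral unmixing in hyperspectral image analysis and proposes a mathematical method based on polyhedral-cone partitioning, which belongs to conventional computer vision and signal processing. All scoring keywords relate to large language models, the principles of deep learning techniques, AI for Science, and similar topics, but the paper does not touch on any of them: it uses no deep learning or large-model method and does not discuss bioinformatics or cheminformatics applications. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes a mathematical framework based on polyhedral-cone partitioning that links semantic segmentation with hyperspectral unmixing. It improves unmixing by constructing a partition of the spectral space, and experiments on three real datasets show consistent improvements over recent deep and non-deep methods.

Abstract (Chinese translation)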

语义分割与高光谱解混是光谱图像分析中的两个核心问题。前者为每个像素分配与其材料类别对应的离散标签，而后者则估计称为端元（endmembers）的纯净材料光谱，并为每个像素生成一个表示观测场景中材料丰度的向量。尽管二者具有互补性，这两类问题通常被独立处理。本文旨在通过形式化证明在线性混合模型下，基于主导材料的像素分类会在光谱空间中诱导出多面锥体区域，从而连接这两条研究路线。我们利用这一基本性质，提出了一种从分割到解混的直接处理流程：通过构建与已标记像素最匹配的空间多面锥体划分，实现从任意语义分割结果出发的盲高光谱解混。随后计算像素到估计区域的带符号距离，在距离空间中进行基变换的线性转换，并投影至概率单纯形，从而得到初始丰度估计。该估计值被用于提取端元并通过矩阵伪逆运算获得最终丰度。由于分割方法可自由选择，用户能显式控制解混过程，而流程其余部分本质上保持确定性与轻量化。除提升可解释性外，在三个真实数据集上的实验表明，所提方法结合适当的聚类算法具有显著有效性，相比近期基于深度学习与非深度学习的最先进方法均取得稳定提升。代码发布于：https://github.com/antoine-bottenmuller/polyhedral-unmixing

Abstract

Semantic segmentation and hyperspectral unmixing are two central problems in spectral image analysis. The former assigns each pixel a discrete label corresponding to its material class, whereas the latter estimates pure material spectra, called endmembers, and, for each pixel, a vector representing material abundances in the observed scene. Despite their complementarity, these two problems are usually addressed independently. This paper aims to bridge these two lines of work by formally showing that, under the linear mixing model, pixel classification by dominant materials induces polyhedral-cone regions in the spectral space. We leverage this fundamental property to propose a direct segmentation-to-unmixing pipeline that performs blind hyperspectral unmixing from any semantic segmentation by constructing a polyhedral-cone partition of the space that best fits the labeled pixels. Signed distances from pixels to the estimated regions are then computed, linearly transformed via a change of basis in the distance space, and projected onto the probability simplex, yielding an initial abundance estimate. This estimate is used to extract endmembers and recover final abundances via matrix pseudo-inversion. Because the segmentation method can be freely chosen, the user gains explicit control over the unmixing process, while the rest of the pipeline remains essentially deterministic and lightweight. Beyond improving interpretability, experiments on three real datasets demonstrate the effectiveness of the proposed approach when associated with appropriate clustering algorithms, and show consistent improvements over recent deep and non-deep state-of-the-art methods. The code is available at: https://github.com/antoine-bottenmuller/polyhedral-unmixing

Keywords: hyperspectral unmixing, semantic segmentation, polyhedral-cone partitioning, linear mixing model, spectral image analysis, abundance estimation, endmember extraction, computer vision
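
Two steps named in the abstract are standard linear-algebra operations and can be sketched directly: projecting a score vector onto the probability simplex, and recovering endmembers and abundances by pseudo-inversion under the linear mixing model. The code below is a minimal sketch, not the authors' implementation (which is in their repository). It assumes pixels are stored as a `(pixels, bands)` matrix `Y ≈ A E`, uses the well-known sort-based Euclidean simplex projection, and leaves out the polyhedral-cone partitioning and signed-distance steps entirely. The function names are hypothetical.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of each row of `v` onto the probability simplex."""
    v = np.atleast_2d(np.asarray(v, dtype=float))
    n = v.shape[1]
    u = np.sort(v, axis=1)[:, ::-1]               # each row sorted in descending order
    css = np.cumsum(u, axis=1) - 1.0
    ks = np.arange(1, n + 1)
    rho = (u - css / ks > 0).sum(axis=1)          # size of the support of the projection
    theta = css[np.arange(len(v)), rho - 1] / rho
    return np.maximum(v - theta[:, None], 0.0)

def refine_unmixing(pixels, init_abundances):
    """Pseudo-inverse endmember extraction, then abundance recovery (assumed reading)."""
    endmembers = np.linalg.pinv(init_abundances) @ pixels   # shape (materials, bands)
    abundances = pixels @ np.linalg.pinv(endmembers)        # shape (pixels, materials)
    return endmembers, abundances
```

In this reading, `project_to_simplex` would turn the transformed signed distances into an initial abundance estimate whose rows are nonnegative and sum to one, and `refine_unmixing` would then perform the two pseudo-inversion steps.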

206. ❌ SeGPruner: Semantic-Geometric Visual Token Pruner for 3D Question Answering

Authors: Wenli Li, Kai Zhao, Haoran Jiang, Enquan Yang, Yi Su, Dan Zeng Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29437v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

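Scoring rationale: The paper studies visual token pruning for 3D visual question answering. Its core innovation is to combine semantic and geometric information when selecting tokens in order to improve inference efficiency. Keyword relevance is as follows. 1. Highly relevant (10 points): 'Large Language Models OR LLMs OR Foundation Models', because the paper explicitly uses an LLM for 3D QA inference as a core component. 2. Some relevance (5 points): 'Speculative Decoding OR Inference Acceleration', because token pruning reduces computation and so indirectly concerns inference acceleration. 3. Unrelated (0 points): the other keywords, such as MoE, SFT, and RAG, are not covered; the paper concentrates on visual token processing rather than large-model training, alignment, compression, or other application areas.

!!! tip deepseek-chat TL;DR

The paper targets the low inference efficiency caused by redundant multi-view visual tokens in 3D visual question answering. It proposes SeGPruner, a semantic-geometric visual token pruning framework that cuts visual tokens by 91% and inference latency by 86% while maintaining competitive 3D reasoning performance.

Abstract (Chinese translation)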

视觉语言模型（VLMs）已在三维问答（3D QA）任务中得到广泛应用。在典型流程中，从多视角提取的视觉标记与语言标记拼接后，由大语言模型（LLM）进行联合处理以完成推理。然而，聚合多视角观测不可避免地会引入严重的标记冗余，导致视觉标记集规模过大，在受限的标记预算下显著阻碍推理效率。视觉标记剪枝已成为解决该问题的普遍策略。然而，现有剪枝方法主要针对二维输入设计或依赖间接几何线索，这限制了其显式保留语义关键对象以及维持充分空间覆盖以支持稳健三维推理的能力。本文提出SeGPruner——一种面向多视图图像高效三维问答的语义感知与几何引导的标记约简框架。具体而言，SeGPruner首先通过基于注意力的重要性模块（显著性感知标记选择器）保留语义显著的标记，确保对象关键证据得以留存；随后通过几何引导选择器（几何感知标记多样化器）补充空间多样化的标记，该模块同时考虑语义相关性与三维几何距离。在激进的标记约简条件下，显著性保留与几何引导多样化之间的协同作用平衡了对象级证据与全局场景覆盖。在ScanQA和OpenEQA数据集上的大量实验表明，SeGPruner显著提升了推理效率，将视觉标记预算降低91%，推理延迟减少86%，同时在三维推理任务中保持了具有竞争力的性能。

Abstract

Vision-language models (VLMs) have been widely adopted for 3D question answering (3D QA). In typical pipelines, visual tokens extracted from multiple viewpoints are concatenated with language tokens and jointly processed by a large language model (LLM) for inference. However, aggregating multi-view observations inevitably introduces severe token redundancy, leading to an overly large visual token set that significantly hinders inference efficiency under constrained token budgets. Visual token pruning has emerged as a prevalent strategy to address this issue. Nevertheless, most existing pruners are primarily tailored to 2D inputs or rely on indirect geometric cues, which limits their ability to explicitly retain semantically critical objects and maintain sufficient spatial coverage for robust 3D reasoning. In this paper, we propose SeGPruner, a semantic-aware and geometry-guided token reduction framework for efficient 3D QA with multi-view images. Specifically, SeGPruner first preserves semantically salient tokens through an attention-based importance module (Saliency-aware Token Selector), ensuring that object-critical evidence is retained. It then complements these tokens with spatially diverse ones via a geometry-guided selector (Geometry-aware Token Diversifier), which jointly considers semantic relevance and 3D geometric distance. This cooperation between saliency preservation and geometry-guided diversification balances object-level evidence and global scene coverage under aggressive token reduction. Extensive experiments on ScanQA and OpenEQA demonstrate that SeGPruner substantially improves inference efficiency, reducing the visual token budget by 91% and inference latency by 86%, while maintaining competitive performance in 3D reasoning tasks.

Keywords: 3D question answering, visual token pruning, semantic-geometric framework, multi-view images, inference efficiency, large language model, token reduction, 3D reasoning
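
The two-stage selection described in the abstract (keep salient tokens first, then add spatially diverse ones) can be sketched as a top-k step followed by a greedy, farthest-point-style pass. This is a hedged illustration, not SeGPruner itself: the scoring rule `alpha * saliency + (1 - alpha) * distance`, the parameter names, and the assumption that each token carries one 3D coordinate are all introduced here for the example, and saliency and distance are assumed to be on comparable scales.

```python
import numpy as np

def prune_tokens(saliency, xyz, n_salient, n_diverse, alpha=0.5):
    """Return indices of kept tokens; assumes 1 <= n_salient and n_salient + n_diverse <= N."""
    saliency = np.asarray(saliency, dtype=float)
    xyz = np.asarray(xyz, dtype=float)

    # Stage 1: keep the most salient tokens (e.g. by attention score).
    selected = [int(i) for i in np.argsort(-saliency)[:n_salient]]

    # Distance from every token to its nearest already-selected token.
    min_dist = np.full(len(saliency), np.inf)
    for i in selected:
        min_dist = np.minimum(min_dist, np.linalg.norm(xyz - xyz[i], axis=1))

    # Stage 2: greedily add tokens that are both relevant and far from the kept set.
    for _ in range(n_diverse):
        score = alpha * saliency + (1.0 - alpha) * min_dist
        score[selected] = -np.inf                 # never pick a token twice
        j = int(np.argmax(score))
        selected.append(j)
        min_dist = np.minimum(min_dist, np.linalg.norm(xyz - xyz[j], axis=1))
    return np.array(selected)
```

With `alpha` close to 1 the second stage behaves like a plain saliency ranking, and with `alpha` close to 0 it approaches farthest-point sampling, which is one way to trade object-level evidence against scene coverage.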

207. ❌ Seeing the Evidence, Missing the Answer: Tool-Guided Vision-Language Models on Visual Illusions

Authors: Xuesong Wang, Harry Wang Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29428v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

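Scoring rationale: The paper studies the systematic bias of vision-language models (VLMs) on visual illusion tasks and proposes a training-free, tool-guided inference framework to address it. The core of the paper is a tool-use framework that involves multi-step reasoning (Chain of Thought) and in-depth reasoning (System 2 Thinking), so it is relevant to these three keywords. The other keywords mainly concern the technical principles, training methods, and optimization techniques of large language models (LLMs), or specific application areas such as AI for science. This paper focuses on inference bias in VLMs and on tool augmentation, and does not involve core LLM techniques or scientific applications, so those keywords score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the systematic bias that vision-language models show on visual illusion tasks. It proposes a training-free, tool-guided inference framework that combines generic image manipulation tools with a routing system prompt, and it generalizes well to structurally unfamiliar illusion variants.

Abstract (Chinese translation)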

视觉语言模型（VLMs）在面对经典光学幻觉时表现出一种系统性偏差：无论图像是否经过反事实修改，它们都倾向于将幻觉预测为“真实”。我们为DataCV 2026挑战赛（任务I和II）提出了一种工具引导的推理框架，以在不进行任何模型训练的情况下解决这一失效模式。该框架使一个现成的视觉语言模型能够调用一组通用的图像处理工具：线条绘制、区域裁剪、并排比较和通道分离，并配合一个幻觉类型路由系统提示，该提示规定了针对每个感知问题类别应调用哪些工具。关键在于，每次工具调用都会生成一个新的、不可变的图像资源，并将其附加到一个持久化注册表中，因此模型可以在其推理链中引用和组合任何先前标注的视图。这种通用工具加路由的设计并未硬编码针对特定幻觉的模块，而是实现了强大的跨结构泛化能力：从验证集到包含结构上陌生的幻觉变体（例如，马赫带从垂直堆叠旋转为水平堆叠）的测试集，性能保持了一致性。我们进一步报告了三个我们认为值得深入研究的实证观察结果：（i）一种强烈的阳性检测偏差，可能源于不平衡的幻觉训练数据；（ii）在像素级精确的空间推理与对自生成标注的逻辑推断之间存在显著分离；（iii）对图像压缩伪影的明显敏感性，这会加剧假阳性结果。

Abstract

Vision-language models (VLMs) exhibit a systematic bias when confronted with classic optical illusions: they overwhelmingly predict the illusion as “real” regardless of whether the image has been counterfactually modified. We present a tool-guided inference framework for the DataCV 2026 Challenge (Tasks I and II) that addresses this failure mode without any model training. An off-the-shelf vision-language model is given access to a small set of generic image manipulation tools: line drawing, region cropping, side-by-side comparison, and channel isolation, together with an illusion-type-routing system prompt that prescribes which tools to invoke for each perceptual question category. Critically, every tool call produces a new, immutable image resource appended to a persistent registry, so the model can reference and compose any prior annotated view throughout its reasoning chain. Rather than hard-coding illusion-specific modules, this generic-tool-plus-routing design yields strong cross-structural generalization: performance remained consistent from the validation set to a test set containing structurally unfamiliar illusion variants (e.g., Mach Bands rotated from vertical to horizontal stacking). We further report three empirical observations that we believe warrant additional investigation: (i) a strong positive-detection bias likely rooted in imbalanced illusion training data, (ii) a striking dissociation between pixel-accurate spatial reasoning and logical inference over self-generated annotations, and (iii) pronounced sensitivity to image compression artifacts that compounds false positives.

Keywords: Vision-Language Models, Visual Illusions, Tool-Guided Inference, Systematic Bias, Cross-Structural Generalization, Reasoning Chain, Image Manipulation Tools, Perceptual Question
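
The abstract notes that every tool call produces a new, immutable image resource appended to a persistent registry, so earlier views stay available for later reasoning steps. A minimal sketch of such an append-only registry is shown below. The class names, the nested-tuple image representation, and the single `crop` tool are hypothetical and chosen only to keep the example dependency-free; the paper's actual tool set and data structures are not described at this level of detail.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class ImageResource:
    resource_id: int
    tool: str                              # name of the tool that produced this view
    parent_id: Optional[int]               # resource the tool was applied to, if any
    pixels: Tuple[Tuple[int, ...], ...]    # nested tuples keep the image immutable

class ImageRegistry:
    """Append-only store: tools never modify a resource, they add a new one."""

    def __init__(self, pixels):
        self._items = []
        self._add("input", None, pixels)

    def _add(self, tool, parent_id, pixels):
        frozen = tuple(tuple(row) for row in pixels)
        resource = ImageResource(len(self._items), tool, parent_id, frozen)
        self._items.append(resource)
        return resource

    def get(self, resource_id):
        return self._items[resource_id]

    def crop(self, resource_id, top, left, height, width):
        source = self.get(resource_id).pixels
        region = [row[left:left + width] for row in source[top:top + height]]
        return self._add("crop", resource_id, region)
```

Because each resource records its `parent_id`, a model (or a person debugging a run) can trace any annotated view back to the original input and compose tools on top of earlier outputs.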

Authors: Chenxin Zhu, Yushun Fang, Lu Liu, Shibo Yin, Xiaohong Liu, Xiaoyun Zhang, Qiang Hu, Guangtao Zhai Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29423v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

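Scoring rationale: The paper addresses blind face restoration, a computer vision task, using a Diffusion Transformer architecture with image-text cross-modal attention for attribute-aware generation. Although it uses deep learning, every keyword targets large language models (LLMs) and related techniques (MoE, RLHF, RAG, quantization, and so on) or LLM-specific applications (such as AI for Science). The paper does not discuss language models, text generation, reasoning techniques, or the principles behind large models, and it has no scientific-domain application such as bioinformatics. All keywords are therefore scored as unrelated.

!!! tip deepseek-chat TL;DR

The paper proposes the A²BFR framework, which uses attribute-aware learning and semantic dual-training to unify high-fidelity reconstruction with prompt-controllable generation in blind face restoration, achieving state-of-the-art restoration fidelity and instruction adherence under severe degradation.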

摘要翻译

盲人脸部复原（Blind Face Restoration, BFR）旨在从退化输入中恢复高质量的人脸图像，但其固有的不适定性导致解具有模糊性和不可控性。基于扩散模型的近期BFR方法虽提升了感知质量，但仍缺乏可控性；而基于文本引导的人脸编辑虽能实现属性操控，却无法保证可靠的复原效果。为解决这些问题，我们提出A$^2$BFR——一种属性感知的盲人脸部复原框架，将高保真重建与提示可控生成相统一。该框架以具备统一图像-文本跨模态注意力机制的扩散Transformer为核心，使去噪过程同时以退化输入和文本提示为条件。为注入语义先验，我们提出属性感知学习，通过属性感知编码器提取人脸属性嵌入向量，并以此监督去噪隐空间。为进一步增强提示可控性，我们引入语义双重训练，利用新构建的AttrFace-90K数据集中成对的属性差异数据，在保持保真度的同时强化属性判别能力。大量实验表明，A$^2$BFR在复原保真度与指令遵循度上均达到最先进水平，相较于基于扩散模型的BFR基线方法，其LPIPS指标降低0.0467，属性准确率提升52.58%，并能在严重退化条件下实现细粒度、提示可控的人脸复原。

摘要 (Abstract)

Blind face restoration (BFR) aims to recover high-quality facial images from degraded inputs, yet its inherently ill-posed nature leads to ambiguous and uncontrollable solutions. Recent diffusion-based BFR methods improve perceptual quality but remain uncontrollable, whereas text-guided face editing enables attribute manipulation without reliable restoration. To address these issues, we propose A$^2$BFR, an attribute-aware blind face restoration framework that unifies high-fidelity reconstruction with prompt-controllable generation. Built upon a Diffusion Transformer backbone with unified image-text cross-modal attention, A$^2$BFR jointly conditions the denoising trajectory on both degraded inputs and textual prompts. To inject semantic priors, we introduce attribute-aware learning, which supervises denoising latents using facial attribute embeddings extracted by an attribute-aware encoder. To further enhance prompt controllability, we introduce semantic dual-training, which leverages the pairwise attribute variations in our newly curated AttrFace-90K dataset to enforce attribute discrimination while preserving fidelity. Extensive experiments demonstrate that A$^2$BFR achieves state-of-the-art performance in both restoration fidelity and instruction adherence, outperforming diffusion-based BFR baselines by -0.0467 LPIPS and +52.58% attribute accuracy, while enabling fine-grained, prompt-controllable restoration even under severe degradations.

Keywords: Blind Face Restoration, Diffusion Transformer, Attribute-Aware Learning, Cross-Modal Attention, Semantic Dual-Training, Prompt-Controllable Generation, Facial Attribute Embeddings, AttrFace-90K Dataset

209. ❌ Multimodal Models Meet Presentation Attack Detection on ID Documents

Authors: Marina Villanueva, Juan M. Espin, Juan E. Tapia Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29422v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

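Scoring rationale: The paper studies multimodal models for presentation attack detection on ID documents, an applied study of large models in a specific domain (security/biometrics). It explicitly uses pre-trained multimodal models (such as Paligemma, Llava, and Qwen), which gives some connection to the keyword 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points), because the models are pre-trained and applied to a new domain (ID document security). However, the paper offers no innovation in large-model techniques (MoE, quantization, inference acceleration, and so on) and no scientific-domain application (such as bioinformatics), so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The study explores using pre-trained multimodal models (such as Paligemma, Llava, and Qwen) that combine visual and textual modalities to strengthen presentation attack detection on ID documents, but the experiments show that these models struggle to detect attacks on ID documents accurately.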

摘要翻译

将多模态模型集成至身份证件呈现攻击检测中，代表了生物识别安全领域的重大进展。传统呈现攻击检测系统仅依赖视觉特征，往往难以检测复杂的欺骗攻击。本研究探索通过利用预训练多模态模型（如Paligemma、Llava和Qwen），结合视觉与文本模态，以增强对身份证件呈现攻击的检测能力。该方法将深度视觉嵌入与上下文元数据（如证件类型、签发机构和日期）相融合。然而，实验结果表明，这些模型在准确检测身份证件呈现攻击方面仍面临困难。

摘要 (Abstract)

The integration of multimodal models into Presentation Attack Detection (PAD) for ID Documents represents a significant advancement in biometric security. Traditional PAD systems rely solely on visual features, which often fail to detect sophisticated spoofing attacks. This study explores the combination of visual and textual modalities by utilizing pre-trained multimodal models, such as Paligemma, Llava, and Qwen, to enhance the detection of presentation attacks on ID Documents. This approach merges deep visual embeddings with contextual metadata (e.g., document type, issuer, and date). However, experimental results indicate that these models struggle to accurately detect presentation attacks on ID Documents.

Keywords: Multimodal Models, Presentation Attack Detection, ID Documents, Biometric Security, Pre-trained Models, Visual and Textual Modalities, Spoofing Attacks, Deep Visual Embeddings

210. ❌ Native-Domain Cross-Attention for Camera-LiDAR Extrinsic Calibration Under Large Initial Perturbations

Authors: Ni Ou, Zhuo Chen, Xinru Zhang, Junzheng Wang Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29414v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

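Scoring rationale: The paper studies extrinsic calibration between multimodal sensors (camera and LiDAR) and proposes a calibration framework based on native-domain cross-attention. It focuses entirely on a specific technical problem in computer vision and robotic perception and has no direct connection to any scoring keyword (all of which concern large language models, deep learning principles, AI for Science, and similar topics). It does not cover language models, model training, inference optimization, AI agents, or scientific AI applications.

!!! tip deepseek-chat TL;DR

The paper proposes a camera-LiDAR extrinsic calibration method based on native-domain cross-attention that markedly improves calibration accuracy and robustness under large initial perturbations, outperforming existing methods on the KITTI and nuScenes datasets.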

摘要翻译

精确的相机-激光雷达融合依赖于精准的外参标定，其根本上取决于在可能存在较大错位的情况下建立可靠的跨模态对应关系。现有的基于学习的方法通常将激光雷达点投影为深度图以进行特征融合，这会扭曲三维几何结构，并在外参初始化远离真实值时导致性能下降。为解决这一问题，我们提出了一种外参感知的交叉注意力框架，直接在图像块和激光雷达点组的原始域中进行对齐。所提出的注意力机制显式地将外参假设注入到对应关系建模过程中，实现了几何一致的跨模态交互，而无需依赖投影的二维深度图。在KITTI和nuScenes基准测试上的大量实验表明，我们的方法在精度和鲁棒性上均持续优于现有先进方法。在大幅度外参扰动下，我们的方法在88%的KITTI案例和99%的nuScenes案例中实现了精确标定，显著超越了次优基线方法。我们已在https://github.com/gitouni/ProjFusion开源代码，以惠及研究社区。

摘要 (Abstract)

Accurate camera-LiDAR fusion relies on precise extrinsic calibration, which fundamentally depends on establishing reliable cross-modal correspondences under potentially large misalignments. Existing learning-based methods typically project LiDAR points into depth maps for feature fusion, which distorts 3D geometry and degrades performance when the extrinsic initialization is far from the ground truth. To address this issue, we propose an extrinsic-aware cross-attention framework that directly aligns image patches and LiDAR point groups in their native domains. The proposed attention mechanism explicitly injects extrinsic parameter hypotheses into the correspondence modeling process, enabling geometry-consistent cross-modal interaction without relying on projected 2D depth maps. Extensive experiments on the KITTI and nuScenes benchmarks demonstrate that our method consistently outperforms state-of-the-art approaches in both accuracy and robustness. Under large extrinsic perturbations, our approach achieves accurate calibration in 88% of KITTI cases and 99% of nuScenes cases, substantially surpassing the second-best baseline. We have open sourced our code on https://github.com/gitouni/ProjFusion to benefit the community.

Keywords: camera-LiDAR fusion, extrinsic calibration, cross-modal correspondences, native-domain cross-attention, large initial perturbations, geometry-consistent interaction, KITTI benchmark, nuScenes benchmark

211. ❌ AA-Splat: Anti-Aliased Feed-forward Gaussian Splatting

Authors: Taewoo Suh, Sungpyo Kim, Jongmin Park, Munchurl Kim Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29394v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

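Scoring rationale: AA-Splat concerns 3D Gaussian Splatting in computer vision and graphics and proposes an anti-aliased rendering method. All scoring keywords concern large language models, deep learning principles, or AI applications in science, whereas this paper studies a specific computer vision problem in 3D reconstruction and rendering, so it has no direct connection to any keyword.

!!! tip deepseek-chat TL;DR

To address the rendering artifacts that existing feed-forward 3D Gaussian Splatting methods produce at out-of-distribution sampling rates, the paper proposes AA-Splat, an anti-aliased rendering model whose opacity-balanced band-limiting design substantially improves novel view synthesis across resolutions.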

摘要翻译

前馈式三维高斯泼溅（FF-3DGS）已成为稀疏视角三维重建与新视角合成（NVS）的一种快速且鲁棒的解决方案。然而，现有FF-3DGS方法基于不正确的屏幕空间扩张滤波器构建，导致在分布外采样率下渲染时产生严重的伪影。我们首次提出一种名为AA-Splat的FF-3DGS模型，以实现任意分辨率下鲁棒的抗锯齿渲染。AA-Splat采用了一种不透明度平衡的带限（OBBL）设计，该设计融合了两个组件：一个三维带限后滤波器将多视角最大频率边界整合到前馈重建流程中，有效对生成的三维场景表示进行带限处理并消除退化高斯基元；以及一项不透明度平衡（OB）技术，将全部像素对齐的高斯基元无缝集成到渲染过程中，以补偿扩张后高斯基元间重叠度的增加。AA-Splat在所有分辨率下（从4倍到1/4倍）均展现出显著改进，其新视角合成性能相较于前沿（SOTA）基准方法DepthSplat平均获得5.4～7.5分贝的峰值信噪比提升。代码将公开提供。

摘要 (Abstract)

Feed-forward 3D Gaussian Splatting (FF-3DGS) has emerged as a fast and robust solution for sparse-view 3D reconstruction and novel view synthesis (NVS). However, existing FF-3DGS methods are built on incorrect screen-space dilation filters, causing severe rendering artifacts when rendering at out-of-distribution sampling rates. We are the first to propose an FF-3DGS model, called AA-Splat, that enables robust anti-aliased rendering at any resolution. AA-Splat utilizes an opacity-balanced band-limiting (OBBL) design, which combines two components: a 3D band-limiting post-filter that integrates multi-view maximal frequency bounds into the feed-forward reconstruction pipeline, effectively band-limiting the resulting 3D scene representations and eliminating degenerate Gaussians; and an Opacity Balancing (OB) technique that seamlessly integrates all pixel-aligned Gaussian primitives into the rendering process, compensating for the increased overlap between expanded Gaussian primitives. AA-Splat demonstrates drastic improvements, with average PSNR gains of 5.4$\sim$7.5 dB in NVS performance over a state-of-the-art (SOTA) baseline, DepthSplat, at all resolutions between $4\times$ and $1/4\times$. Code will be made available.

Keywords: 3D Gaussian Splatting, anti-aliased rendering, novel view synthesis, sparse-view reconstruction, feed-forward 3D reconstruction, opacity balancing, band-limiting, rendering artifacts
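The abstract above reports its gains in PSNR (dB). As a reminder of what that number measures, here is a minimal sketch of the conventional PSNR formula, $10\log_{10}(\mathrm{MAX}^2/\mathrm{MSE})$. It assumes images are given as nested lists of floats in $[0, 1]$; the function name `psnr` and the toy inputs are illustrative and not taken from the paper.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    ref = [v for row in reference for v in row]
    out = [v for row in rendered for v in row]
    if not ref or len(ref) != len(out):
        raise ValueError("images must be non-empty and have the same size")
    mse = sum((a - b) ** 2 for a, b in zip(ref, out)) / len(ref)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [[0.5, 0.5], [0.5, 0.5]]
rendered = [[0.6, 0.6], [0.6, 0.6]]
value = psnr(reference, rendered)  # MSE is about 0.01, so roughly 20 dB
```

Since the scale is logarithmic, a gain of 5.4 to 7.5 dB corresponds to roughly a 3.5x to 5.6x reduction in mean squared error.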

212. ❌ Assessing Multimodal Chronic Wound Embeddings with Expert Triplet Agreement

Authors: Fabian Kabus, Julia Hindel, Jelena Bratulić, Meropi Karakioulaki, Ayush Gupta, Cristina Has, Thomas Brox, Abhinav Valada, Harald Binder Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29376v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

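Scoring rationale: The paper evaluates multimodal embeddings for the rare skin disease RDEB; its core contribution is the TriDerm framework. Keyword relevance: (1) highly relevant (10 points): 'AI for Science', a direct application to bioinformatics/medicine; (2) moderately relevant (8 points): 'Large Language Models', since LLMs are used to process clinical text and comparison queries; (3) limited relevance (5 points): 'Pre-training/Domain Adaptation' for adapting visual foundation models to RDEB, 'Supervised Fine-tuning' because the framework involves training, and 'Explainable AI' because it learns interpretable representations; (4) unrelated (0 points): the remaining keywords are not covered.

!!! tip deepseek-chat TL;DR

For the rare skin disease RDEB, the paper proposes TriDerm, a multimodal framework that learns interpretable wound representations by integrating wound images, boundary masks, and expert reports; fusing the visual and textual modalities reaches 73.5% agreement with experts, outperforming off-the-shelf single-modality foundation models.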

摘要翻译

隐性营养不良型大疱性表皮松解症（RDEB）是一种罕见的遗传性皮肤病，临床医生通过图像和临床文本寻找相似病例能极大获益。然而，现成的基础模型难以可靠地捕捉这种异质性长尾疾病中具有临床意义的特征，且与专家意见一致性的结构化测量颇具挑战。为弥补这些不足，我们提出利用专家序数比较（三元组判断）来评估嵌入空间，该方法收集快速并能编码隐性的临床相似性知识。我们进一步提出了TriDerm——一个多模态框架，通过整合伤口图像、边界掩膜和专家报告，从小规模队列中学习可解释的伤口表征。在视觉方面，TriDerm通过伤口级注意力池化和非对比表征学习，使视觉基础模型适应RDEB任务。对于文本，我们使用比较查询提示大语言模型，并通过软序数嵌入（SOE）恢复具有医学意义的表征。研究表明，视觉与文本模态能捕捉伤口表型的互补特征，融合双模态可使专家一致性达到73.5%，较最佳现成单模态基础模型提升超过5.6个百分点。我们已将专家标注工具、模型代码及代表性数据集样本公开提供。

摘要 (Abstract)

Recessive dystrophic epidermolysis bullosa (RDEB) is a rare genetic skin disorder for which clinicians greatly benefit from finding similar cases using images and clinical text. However, off-the-shelf foundation models do not reliably capture clinically meaningful features for this heterogeneous, long-tail disease, and structured measurement of agreement with experts is challenging. To address these gaps, we propose evaluating embedding spaces with expert ordinal comparisons (triplet judgments), which are fast to collect and encode implicit clinical similarity knowledge. We further introduce TriDerm, a multimodal framework that learns interpretable wound representations from small cohorts by integrating wound imagery, boundary masks, and expert reports. On the vision side, TriDerm adapts visual foundation models to RDEB using wound-level attention pooling and non-contrastive representation learning. For text, we prompt large language models with comparison queries and recover medically meaningful representations via soft ordinal embeddings (SOE). We show that visual and textual modalities capture complementary aspects of wound phenotype, and that fusing both modalities yields 73.5% agreement with experts, outperforming the best off-the-shelf single-modality foundation model by over 5.6 percentage points. We make the expert annotation tool, model code and representative dataset samples publicly available.

Keywords: multimodal embeddings, chronic wound assessment, RDEB, expert triplet agreement, visual foundation models, large language models, clinical similarity, medical AI
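The abstract above evaluates embedding spaces against expert triplet judgments. One plausible reading, sketched below, is the fraction of expert triplets (anchor, more similar, less similar) that the embedding orders the same way. The Euclidean distance, the function name `triplet_agreement`, and the toy wound ids are assumptions for illustration; the paper may define agreement differently.

```python
import math


def triplet_agreement(embeddings, triplets):
    """Fraction of (anchor, closer, farther) triplets the embedding agrees with.

    embeddings: dict mapping item id -> coordinate tuple
    triplets:   expert judgments saying `closer` is more similar to `anchor`
                than `farther` is
    """
    if not triplets:
        raise ValueError("need at least one triplet")
    agree = 0
    for anchor, closer, farther in triplets:
        d_close = math.dist(embeddings[anchor], embeddings[closer])
        d_far = math.dist(embeddings[anchor], embeddings[farther])
        if d_close < d_far:
            agree += 1
    return agree / len(triplets)


# Toy 2-D embedding of four hypothetical wound cases.
embeddings = {"w1": (0.0, 0.0), "w2": (1.0, 0.0), "w3": (5.0, 0.0), "w4": (0.0, 2.0)}
triplets = [
    ("w1", "w2", "w3"),  # embedding agrees: 1 < 5
    ("w1", "w4", "w3"),  # embedding agrees: 2 < 5
    ("w3", "w1", "w2"),  # embedding disagrees: 5 > 4
    ("w2", "w3", "w1"),  # embedding disagrees: 4 > 1
]
agreement = triplet_agreement(embeddings, triplets)
```

Under this definition a random embedding would score about 50%, which gives context for the 73.5% agreement reported in the abstract.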

213. ❌ StereoVGGT: A Training-Free Visual Geometry Transformer for Stereo Vision

Authors: Ziyang Chen, Yansong Qu, You Shen, Xuan Cheng, Liujuan Cao Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29368v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

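Scoring rationale: The paper studies stereo vision, a computer vision task, using the visual foundation model VGGT as the backbone and proposing a training-free feature adjustment method. Although it involves foundation models and pre-training, the keywords specifically target large language model (LLM) techniques, whereas this paper studies a visual foundation model (VFM), a different modality and technical area. Only the keyword 'Pre-training OR Continual Pre-training OR Domain Adaptation' has some connection to the pre-training mentioned in the paper, but the paper's focus is applying a pre-trained model rather than pre-training itself, hence 5 points. All other keywords relate directly to LLM techniques and are unrelated to this vision work, so they score 0.

!!! tip deepseek-chat TL;DR

To address the lack of explicit geometric constraints in existing stereo vision models, the paper proposes StereoVGGT, a training-free feature adjustment method built on the pre-trained visual geometry foundation model VGGT, and reports top-ranked performance on the KITTI benchmark.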

摘要翻译

在三维设备发展的推动下，包含立体匹配与立体转换在内的立体视觉任务已成为关键的研究前沿。当代立体视觉骨干网络通常依赖于单目深度估计（MDE）模型或视觉基础模型（VFM）。关键在于，这些模型主要在缺乏相机姿态显式监督的情况下进行预训练。鉴于此类几何知识对于立体视觉不可或缺，显式空间约束的缺失构成了现有架构的显著性能瓶颈。考虑到Visual Geometry Grounded Transformer（VGGT）是一种在包含相机姿态在内的广泛三维先验知识上预训练的基础模型，我们探究了其作为立体视觉任务鲁棒骨干网络的潜力。然而，实验结果表明，将其直接应用于立体视觉任务时性能欠佳。我们观察到VGGT在特征提取过程中存在更显著的几何细节退化问题。这一特性与双目立体视觉的需求相冲突，从而限制了其在相关任务中的效能。为弥合此差距，我们提出了StereoVGGT，一个专为立体视觉定制的特征骨干网络。通过利用冻结的VGGT并引入免训练的特征调整流程，我们减轻了几何退化现象，并有效利用了模型内嵌的潜在相机标定知识。基于StereoVGGT的立体匹配网络在KITTI基准测试中取得了所有已发表方法中的第一名，验证了StereoVGGT可作为立体视觉任务的高效骨干网络。

摘要 (Abstract)

Driven by the advancement of 3D devices, stereo vision tasks including stereo matching and stereo conversion have emerged as a critical research frontier. Contemporary stereo vision backbones typically rely on either monocular depth estimation (MDE) models or visual foundation models (VFMs). Crucially, these models are predominantly pretrained without explicit supervision of camera poses. Given that such geometric knowledge is indispensable for stereo vision, the absence of explicit spatial constraints constitutes a significant performance bottleneck for existing architectures. Recognizing that the Visual Geometry Grounded Transformer (VGGT) operates as a foundation model pretrained on extensive 3D priors, including camera poses, we investigate its potential as a robust backbone for stereo vision tasks. Nevertheless, empirical results indicate that its direct application to stereo vision yields suboptimal performance. We observe that VGGT suffers from a more significant degradation of geometric details during feature extraction. Such characteristics conflict with the requirements of binocular stereo vision, thereby constraining its efficacy for related tasks. To bridge this gap, we propose StereoVGGT, a feature backbone specifically tailored for stereo vision. By leveraging the frozen VGGT and introducing a training-free feature adjustment pipeline, we mitigate geometric degradation and harness the latent camera calibration knowledge embedded within the model. A StereoVGGT-based stereo matching network achieved the $1^{st}$ rank among all published methods on the KITTI benchmark, validating that StereoVGGT serves as a highly effective backbone for stereo vision.

Keywords: Stereo Vision, Visual Geometry Grounded Transformer, VGGT, Foundation Model, Training-Free, Feature Adjustment, Camera Poses, Stereo Matching

214. ❌ Uncertainty-Aware Trajectory Prediction: A Unified Framework Harnessing Positional and Semantic Uncertainties

Authors: Jintao Sun, Hu Zhang, Gangyi Ding, Zhedong Zheng Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29362v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

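Scoring rationale: The paper focuses on trajectory prediction and proposes a unified framework that models positional and semantic uncertainty to improve prediction robustness. It involves computer vision, autonomous driving, and uncertainty modeling, but not large language models (LLMs), innovations in deep learning principles, or concrete AI for Science applications. All keywords concern large-model techniques, training methods, inference optimization, AI agents, or scientific AI applications, whereas this paper studies a conventional trajectory prediction problem and is unrelated to them.

!!! tip deepseek-chat TL;DR

The paper proposes a unified framework that jointly models positional and semantic uncertainty to make trajectory prediction more robust; on the nuScenes dataset it is shown to quantify map uncertainty effectively and to improve the prediction performance of existing models.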

摘要翻译

轨迹预测旨在基于一段时间内的历史运动数据和环境上下文，对动态实体（如车辆与行人）的未来运动进行预测。该领域的核心挑战在于实时地图固有的不确定性，其主要源于两个方面：（1）由传感器限制或环境遮挡导致的位置误差，以及（2）因场景上下文误读而产生的语义错误。为应对这些挑战，我们提出一种新颖的统一框架，该框架联合建模位置与语义不确定性，并将其显式整合至轨迹预测流程中。我们的方法采用双头架构，以双通道方式独立估计语义与位置预测，并以端到端形式推导预测方差作为不确定性指标。这些不确定性随后与语义及位置预测相融合，以增强轨迹预测的鲁棒性。我们在nuScenes真实世界驾驶数据集上评估了这一不确定性感知框架，并在四种地图估计方法与两种轨迹预测基线模型上进行了广泛实验。结果表明，我们的方法（1）能通过位置与语义维度有效量化地图不确定性，且（2）在多项指标上持续提升现有轨迹预测模型的性能，包括最小平均位移误差（minADE）、最小最终位移误差（minFDE）以及漏检率（MR）。代码将在https://github.com/JT-Sun/UATP公开。

摘要 (Abstract)

Trajectory prediction seeks to forecast the future motion of dynamic entities, such as vehicles and pedestrians, given a temporal horizon of historical movement data and environmental context. A central challenge in this domain is the inherent uncertainty in real-time maps, arising from two primary sources: (1) positional inaccuracies due to sensor limitations or environmental occlusions, and (2) semantic errors stemming from misinterpretations of scene context. To address these challenges, we propose a novel unified framework that jointly models positional and semantic uncertainties and explicitly integrates them into the trajectory prediction pipeline. Our approach employs a dual-head architecture to independently estimate semantic and positional predictions in a dual-pass manner, deriving prediction variances as uncertainty indicators in an end-to-end fashion. These uncertainties are subsequently fused with the semantic and positional predictions to enhance the robustness of trajectory forecasts. We evaluate our uncertainty-aware framework on the nuScenes real-world driving dataset, conducting extensive experiments across four map estimation methods and two trajectory prediction baselines. Results verify that our method (1) effectively quantifies map uncertainties through both positional and semantic dimensions, and (2) consistently improves the performance of existing trajectory prediction models across multiple metrics, including minimum Average Displacement Error (minADE), minimum Final Displacement Error (minFDE), and Miss Rate (MR). Code will be available at https://github.com/JT-Sun/UATP.

Keywords: trajectory prediction, uncertainty modeling, positional uncertainty, semantic uncertainty, dual-head architecture, autonomous driving, nuScenes dataset, robustness
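The abstract above reports minADE, minFDE, and Miss Rate. The sketch below shows how these are commonly computed for an agent with several candidate trajectories. It is illustrative only: the helper names are hypothetical, and the miss criterion used here (best final displacement above a 2.0 threshold) is an assumption, since benchmarks differ in how they define a miss.

```python
import math


def ade(pred, gt):
    """Average displacement error: mean point-wise distance over the horizon."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def fde(pred, gt):
    """Final displacement error: distance at the last time step."""
    return math.dist(pred[-1], gt[-1])


def min_ade(candidates, gt):
    """Best ADE over the K candidate trajectories for one agent."""
    return min(ade(p, gt) for p in candidates)


def min_fde(candidates, gt):
    """Best FDE over the K candidate trajectories for one agent."""
    return min(fde(p, gt) for p in candidates)


def miss_rate(all_candidates, all_gts, threshold=2.0):
    """Share of agents whose best final displacement exceeds the threshold (assumed criterion)."""
    misses = sum(
        1 for cands, gt in zip(all_candidates, all_gts)
        if min_fde(cands, gt) > threshold
    )
    return misses / len(all_gts)


# Agent A: two candidates; the second ends 1.0 away from the ground truth.
gt_a = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
cands_a = [
    [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],
    [(0.0, 0.0), (1.0, 0.0), (2.0, 1.0)],
]
# Agent B: a single candidate that ends 3.0 away, counted as a miss.
gt_b = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)]
cands_b = [[(0.0, 0.0), (0.0, 1.0), (3.0, 2.0)]]

best_ade_a = min_ade(cands_a, gt_a)
best_fde_a = min_fde(cands_a, gt_a)
mr = miss_rate([cands_a, cands_b], [gt_a, gt_b])
```

Taking the minimum over candidates rewards a model if any one of its predicted modes is close to the truth, which is why these metrics are reported together with the number of candidates.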

215. ❌ FOSCU: Feasibility of Synthetic MRI Generation via Duo-Diffusion Models for Enhancement of 3D U-Nets in Hepatic Segmentation

Authors: Youngung Han, Kyeonghun Kim, Seoyoung Ju, Yeonju Jean, Minkyung Cha, Seohyoung Park, Hyeonseok Jung, Nam-Joon Kim, Woo Kyoung Jeong, Ken Ying-Kai Liao, Hyuk-Jae Lee Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29343v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

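Scoring rationale: The paper focuses on medical image segmentation, using a diffusion model (Duo-Diffusion) to generate synthetic MRI data for training a 3D U-Net and so address clinical data scarcity. It is unrelated to the great majority of keywords (large-model principles, training methods, inference optimization, alignment, agent systems, and so on), which mainly target large language model techniques in natural language processing. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the work applies AI to medical image analysis (which can be seen as adjacent to bioinformatics), but it is not a core match, so it receives 5 points (some connection).

!!! tip deepseek-chat TL;DR

The study proposes the FOSCU framework, which uses the Duo-Diffusion model to generate synthetic MRI data and segmentation labels and combines them with an enhanced 3D U-Net training pipeline to address data scarcity and high annotation cost in medical image segmentation; experiments show that adding synthetic data improves segmentation performance and image fidelity.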

摘要翻译

医学图像分割面临若干根本性挑战，包括通过影像归档与通信系统（PACS）获取临床数据受限、标注成本高昂以及数据短缺。这些系统性障碍严重阻碍了鲁棒分割算法的开发。为应对这些挑战，我们提出FOSCU框架，其整合了Duo-Diffusion——一种结合ControlNet的三维隐空间扩散模型，能够同步生成高分辨率、解剖结构逼真的合成MRI体数据及对应的分割标签，并采用增强型三维U-Net训练流程。Duo-Diffusion利用分割条件引导的扩散机制，确保生成数据具有空间一致性和精确的解剖细节。在720例腹部MRI扫描上的实验评估表明，使用真实数据与合成数据联合训练的模型，相较于仅使用真实数据的模型，平均Dice分数提升0.67%，同时Fréchet起始距离（FID）降低36.4%，反映出图像保真度的显著增强。

Abstract

Medical image segmentation faces fundamental challenges including restricted access, costly annotation, and data shortage to clinical datasets through Picture Archiving and Communication Systems (PACS). These systemic barriers significantly impede the development of robust segmentation algorithms. To address these challenges, we propose FOSCU, which integrates Duo-Diffusion, a 3D latent diffusion model with ControlNet that simultaneously generates high-resolution, anatomically realistic synthetic MRI volumes and corresponding segmentation labels, and an enhanced 3D U-Net training pipeline. Duo-Diffusion employs segmentation-conditioned diffusion to ensure spatial consistency and precise anatomical detail in the generated data. Experimental evaluation on 720 abdominal MRI scans shows that models trained with combined real and synthetic data yield a mean Dice score gain of 0.67% over those using only real data, and achieve a 36.4% reduction in Fréchet Inception Distance (FID), reflecting enhanced image fidelity.

Keywords: medical image segmentation, synthetic MRI generation, diffusion models, 3D U-Net, abdominal MRI, data augmentation, Fréchet Inception Distance, Dice score
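
The abstract reports its segmentation result as a mean Dice score. As a reminder of what that metric measures, here is a minimal Python sketch for binary masks. The function name `dice_score` and the toy masks are illustrative assumptions; this is not FOSCU's evaluation code, and the empty-mask convention is one common choice among several.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    # Voxels marked as foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


# Toy flattened masks: 8 predicted voxels, 12 reference voxels, 8 shared.
pred = [1] * 8 + [0] * 4 + [0] * 4
target = [1] * 8 + [1] * 4 + [0] * 4
print(f"Dice = {dice_score(pred, target):.2f}")  # prints "Dice = 0.80"
```

A 3D volume would be flattened (or iterated voxel by voxel) before being passed in; per-case scores are then averaged to obtain a mean Dice.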

216. ❌ HSFM: Hard-Set-Guided Feature-Space Meta-Learning for Robust Classification under Spurious Correlations

Authors: Aryan Yazdan Parast, Khawar Islam, Soyoun Won, Basim Azam, Naveed Akhtar Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29313v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

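Scoring rationale: The paper studies spurious correlations in deep neural networks and proposes a feature-space meta-learning method (HSFM) to improve the classifier head. Its core content concerns the robustness of deep learning models, feature representation learning, and meta-learning, none of which relates directly to the scoring keywords (which mainly target large language models and related techniques). The paper does not touch any keyword area, including large models, language models, prompt engineering, alignment, reasoning, agents, compression, or AI-for-science applications.

!!! tip deepseek-chat TL;DR

To address deep neural networks' reliance on spurious features, the paper proposes a hard-set-guided feature-space meta-learning method (HSFM) that performs augmentation directly in feature space to improve how the classifier head handles spurious correlations, thereby improving performance under distribution shift and on minority-group examples.

Abstract (Chinese translation)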

深度神经网络常依赖虚假特征进行预测，这导致其在分布偏移及虚假相关性不成立的样本（例如少数群体示例）上表现脆弱。近期研究表明，即使在此类场景下，经验风险最小化（Empirical Risk Minimization, ERM）训练模型的特征提取器仍能学习到丰富且信息量大的表征，而大部分失败可能归因于分类器头部。具体而言，在保持骨干网络冻结的同时重新训练轻量级分类头，可显著提升模型在偏移分布和少数群体上的性能。基于这一观察，我们提出一种双层元学习方法，直接在特征空间中进行数据增强以改善分类器头部对虚假相关性的处理能力。该方法通过学习支持侧的特征编辑，使得分类器在编辑后的特征上进行少量内部循环更新后，能在困难样本上获得更低损失并提升最差群体性能。由于该方法在骨干网络输出端而非像素空间或通过端到端优化进行操作，其具有高效性和稳定性，仅需在单GPU上训练数分钟。我们进一步通过基于CLIP的可视化验证了本方法，结果表明学习到的特征空间更新会引发与虚假属性对齐的、语义层面有意义的特征偏移。

Abstract

Deep neural networks often rely on spurious features to make predictions, which makes them brittle under distribution shift and on samples where the spurious correlation does not hold (e.g., minority-group examples). Recent studies have shown that, even in such settings, the feature extractor of an Empirical Risk Minimization (ERM)-trained model can learn rich and informative representations, and that much of the failure may be attributed to the classifier head. In particular, retraining a lightweight head while keeping the backbone frozen can substantially improve performance on shifted distributions and minority groups. Motivated by this observation, we propose a bilevel meta-learning method that performs augmentation directly in feature space to improve spurious correlation handling in the classifier head. Our method learns support-side feature edits such that, after a small number of inner-loop updates on the edited features, the classifier achieves lower loss on hard examples and improved worst-group performance. By operating at the backbone output rather than in pixel space or through end-to-end optimization, the method is highly efficient and stable, requiring only a few minutes of training on a single GPU. We further validate our method with CLIP-based visualizations, showing that the learned feature-space updates induce semantically meaningful shifts aligned with spurious attributes.

Keywords: spurious correlations, feature-space meta-learning, robust classification, distribution shift, minority-group examples, bilevel optimization, CLIP visualizations, hard-set guidance
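
The abstract evaluates the method by "worst-group performance". One common way to measure this is worst-group accuracy: the accuracy of the group on which the classifier does worst. The sketch below assumes each sample carries a group identifier (for instance a label and spurious-attribute pair); the function name and the toy data are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def worst_group_accuracy(preds, labels, groups):
    """Return (group, accuracy) for the group with the lowest accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, label, group in zip(preds, labels, groups):
        total[group] += 1
        correct[group] += int(pred == label)
    if not total:
        raise ValueError("no samples given")
    per_group = {g: correct[g] / total[g] for g in total}
    worst = min(per_group, key=per_group.get)
    return worst, per_group[worst]


# Toy data: the majority group is classified perfectly, the minority group is not.
preds = [1, 1, 0, 0, 1, 0]
labels = [1, 1, 0, 0, 1, 1]
groups = ["majority"] * 4 + ["minority"] * 2
print(worst_group_accuracy(preds, labels, groups))  # ('minority', 0.5)
```

Average accuracy on this toy data is 5/6, which shows how a model can look strong overall while failing on the group where the spurious correlation does not hold.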

217. ❌ Self-Consistency for LLM-Based Motion Trajectory Generation and Verification

Authors: Jiaju Ma, R. Kenny Jones, Jiajun Wu, Maneesh Agrawala Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29301v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

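Scoring rationale: The paper studies the application of LLMs to a visual domain (motion trajectory generation) and adapts the self-consistency technique from natural language reasoning to visual tasks. It is therefore highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points), because it explicitly uses an LLM to generate motion trajectories, and to "Self-Correction OR Self-Improvement OR Self-Reflection" (10 points), because its core method builds on self-consistency, which the scorer places under self-improvement/reflection. Other keywords, such as MoE, SLMs, Scaling Laws, training methods, inference acceleration, and AI for Science, are not addressed in the paper and score 0.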
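
The arithmetic behind the score line can be sketched as follows. The pass threshold uses the formula stated in the report's configuration (5.0 + 0.8 × total weight, with a total weight of 27.0). The per-keyword rule `weight * relevance` is an assumption that merely matches the rows shown in the table; the report does not state it, and the function names are hypothetical.

```python
def pass_threshold(total_weight, base=5.0, factor=0.8):
    """Pass mark as given in the report's configuration section."""
    return base + factor * total_weight


def paper_score(rows):
    """Sum of per-keyword scores; weight * relevance is an assumed rule."""
    return sum(weight * relevance for weight, relevance in rows)


# Rows as listed for this paper: two keywords at relevance 10.0, the other 25 at 0.0,
# and every row showing a weight of 0.0.
rows = [(0.0, 10.0), (0.0, 10.0)] + [(0.0, 0.0)] * 25
threshold = pass_threshold(27.0)
score = paper_score(rows)
print(f"score {score:.1f} / threshold {threshold:.1f}, pass = {score >= threshold}")
```

Under this assumed rule, a listed weight of 0.0 cancels even a 10/10 relevance, which is consistent with the 0.0 total shown for this paper.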

!!! tip deepseek-chat TL;DR

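The paper studies how to adapt self-consistency from natural language reasoning to visual domains in order to improve both the generation and the verification of LLM-produced motion trajectories. The method improves trajectory generation accuracy by 4-6% and yields an 11% precision gain in verification over VLM baselines.

Abstract (Chinese translation)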

自洽性已被证明是一种以轻量级、无监督方式提升大语言模型在自然语言推理任务上性能的有效技术。本研究探讨如何将自洽性方法适配到视觉领域。具体而言，我们关注大语言模型生成的运动图形轨迹的生成与验证。给定一个提示（例如“让圆以螺旋路径移动”），我们首先从大语言模型中采样多样化的运动轨迹，随后通过聚类识别出一致性轨迹组。我们的核心见解是将与提示相关联的形状族建模为一个原型轨迹搭配一组几何变换（例如刚性变换、相似变换和仿射变换）。在此框架下，若一条轨迹可通过变换组所允许的形变转换为另一条轨迹，则两者可被视为一致的。我们提出一种算法，利用候选变换组之间的层次关系，自动恢复形状族。该方法将基于大语言模型的轨迹生成准确率提升了4-6%。我们进一步扩展本方法以支持验证任务，相比视觉语言模型基线，其精确度提高了11%。代码与数据集已公开于 https://majiaju.io/trajectory-self-consistency。

Abstract

Self-consistency has proven to be an effective technique for improving LLM performance on natural language reasoning tasks in a lightweight, unsupervised manner. In this work, we study how to adapt self-consistency to visual domains. Specifically, we consider the generation and verification of LLM-produced motion graphics trajectories. Given a prompt (e.g., “Move the circle in a spiral path”), we first sample diverse motion trajectories from an LLM, and then identify groups of consistent trajectories via clustering. Our key insight is to model the family of shapes associated with a prompt as a prototype trajectory paired with a group of geometric transformations (e.g., rigid, similarity, and affine). Two trajectories can then be considered consistent if one can be transformed into the other under the warps allowable by the transformation group. We propose an algorithm that automatically recovers a shape family, using hierarchical relationships between a set of candidate transformation groups. Our approach improves the accuracy of LLM-based trajectory generation by 4-6%. We further extend our method to support verification, observing 11% precision gains over VLM baselines. Our code and dataset are available at https://majiaju.io/trajectory-self-consistency .

Keywords: Self-consistency, LLM, Motion trajectory generation, Visual domains, Trajectory verification, Geometric transformations, Prototype trajectory, Clustering
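
The consistency notion in the abstract (two trajectories agree if one can be warped into the other by an allowed transformation) can be illustrated for a single transformation group. The sketch below is a simplification under stated assumptions: 2D trajectories of equal length with known point-to-point correspondence, and only the similarity group (rotation, uniform scale, translation). The function names and the tolerance `tol` are hypothetical, and this is not the paper's shape-family recovery algorithm, which according to the abstract exploits hierarchical relationships between several candidate groups.

```python
import cmath
import math


def similarity_residual(traj_a, traj_b):
    """Relative least-squares error of the best fit b ~= s * a + t.

    Points are treated as complex numbers, so one complex factor s encodes
    rotation and uniform scale, and centering removes the translation t.
    """
    if len(traj_a) != len(traj_b) or len(traj_a) < 2:
        raise ValueError("trajectories need equal length and at least 2 points")
    a = [complex(x, y) for x, y in traj_a]
    b = [complex(x, y) for x, y in traj_b]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    a_c = [z - mean_a for z in a]
    b_c = [w - mean_b for w in b]
    energy_a = sum(abs(z) ** 2 for z in a_c)
    energy_b = sum(abs(w) ** 2 for w in b_c)
    if energy_a == 0 or energy_b == 0:
        raise ValueError("degenerate trajectory (all points identical)")
    # Closed-form least-squares solution for the complex factor s.
    s = sum(z.conjugate() * w for z, w in zip(a_c, b_c)) / energy_a
    error = sum(abs(w - s * z) ** 2 for z, w in zip(a_c, b_c))
    return error / energy_b


def is_consistent(traj_a, traj_b, tol=1e-3):
    """True if traj_b is (nearly) a similarity transform of traj_a."""
    return similarity_residual(traj_a, traj_b) < tol


# A spiral, a rotated/scaled/shifted copy of it, and an unrelated straight line.
spiral = [(0.5 * k * math.cos(0.5 * k), 0.5 * k * math.sin(0.5 * k)) for k in range(25)]
factor = cmath.rect(2.0, math.pi / 6)  # scale by 2, rotate by 30 degrees
shift = complex(5.0, -3.0)
warped = []
for x, y in spiral:
    w = factor * complex(x, y) + shift
    warped.append((w.real, w.imag))
line = [(float(k), 0.0) for k in range(25)]

print(is_consistent(spiral, warped))
print(is_consistent(spiral, line))
```

With these toy inputs the warped copy should be accepted and the straight line rejected. Sampled trajectories that pass such a test could then be grouped into one cluster, which is the role clustering plays in the abstract's description.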

218. ❌ MotionScale: Reconstructing Appearance, Geometry, and Motion of Dynamic Scenes with Scalable 4D Gaussian Splatting

Authors: Haoran Zhou, Gim Hee Lee Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29296v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

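Scoring rationale: The paper "MotionScale: Reconstructing Appearance, Geometry, and Motion of Dynamic Scenes with Scalable 4D Gaussian Splatting" belongs to computer vision and neural rendering, specifically dynamic 4D scene reconstruction. Its core contribution is a scalable framework based on 4D Gaussian Splatting that reconstructs the appearance, geometry, and motion of dynamic scenes from monocular video. Key techniques include Gaussian splatting, motion field modeling, a progressive optimization strategy, camera pose refinement, and shadow modeling. The scoring keywords, by contrast, all concern large models, deep learning technical principles, or AI for Science, and the paper's content has no direct connection to them. It does not mention large models, language models, training techniques, alignment methods, inference acceleration, AI agents, or AI-for-science applications, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes MotionScale, a scalable 4D Gaussian Splatting framework that reconstructs the appearance, geometry, and motion of dynamic scenes from monocular video. Cluster-centric basis transformations and a progressive optimization strategy significantly improve reconstruction quality and temporal stability.

Abstract (Chinese translation)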

从单目视频中实现动态四维场景的真实重建对于理解物理世界至关重要。尽管神经渲染领域近期取得了进展，但现有方法在复杂环境中仍难以恢复精确的三维几何结构和时间一致的运动。为应对这些挑战，我们提出了MotionScale，这是一个四维高斯泼溅框架，能够高效扩展至大场景和长序列，同时保持高保真的结构与运动一致性。我们方法的核心是一个可扩展的运动场，其通过以聚类为中心的基变换进行参数化，能够自适应扩展以捕捉多样且演变的运动模式。为确保长时间序列下的鲁棒重建，我们引入了一种渐进式优化策略，包含两个解耦的传播阶段：1）背景扩展阶段，该阶段适应新可见区域、优化相机位姿，并显式建模瞬态阴影；2）前景传播阶段，通过专门的三阶段细化流程强制保持运动一致性。在具有挑战性的真实世界基准测试上进行的大量实验表明，MotionScale在重建质量和时间稳定性方面均显著优于现有先进方法。项目页面：https://hrzhou2.github.io/motion-scale-web/。

Abstract

Realistic reconstruction of dynamic 4D scenes from monocular videos is essential for understanding the physical world. Despite recent progress in neural rendering, existing methods often struggle to recover accurate 3D geometry and temporally consistent motion in complex environments. To address these challenges, we propose MotionScale, a 4D Gaussian Splatting framework that scales efficiently to large scenes and extended sequences while maintaining high-fidelity structural and motion coherence. At the core of our approach is a scalable motion field parameterized by cluster-centric basis transformations that adaptively expand to capture diverse and evolving motion patterns. To ensure robust reconstruction over long durations, we introduce a progressive optimization strategy comprising two decoupled propagation stages: 1) A background extension stage that adapts to newly visible regions, refines camera poses, and explicitly models transient shadows; 2) A foreground propagation stage that enforces motion consistency through a specialized three-stage refinement process. Extensive experiments on challenging real-world benchmarks demonstrate that MotionScale significantly outperforms state-of-the-art methods in both reconstruction quality and temporal stability. Project page: https://hrzhou2.github.io/motion-scale-web/.

Keywords: 4D Gaussian Splatting, dynamic scene reconstruction, motion field, progressive optimization, temporal consistency, monocular video, scalable framework, neural rendering

219. ❌ MaskAdapt: Learning Flexible Motion Adaptation via Mask-Invariant Prior for Physics-Based Characters

Authors: Soomin Park, Eunseong Lee, Kwang Bin Lee, Sung-Hee Lee Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29272v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

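Scoring rationale: The paper "MaskAdapt: Learning Flexible Motion Adaptation via Mask-Invariant Prior for Physics-Based Characters" focuses on a motion adaptation framework for physics-based humanoid character control, using a two-stage residual learning paradigm (a mask-invariant base policy plus a residual policy). The scoring keywords all relate directly to large language models (LLMs), deep learning technical principles, or AI applications in science, whereas this paper addresses robot control, motion generation, and physics simulation. It does not involve LLMs, deep learning model training/optimization techniques, or AI applications in bioinformatics or cheminformatics, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes MaskAdapt, a framework that uses a mask-invariant prior and residual learning to achieve flexible motion adaptation for physics-based humanoid characters, showing strong robustness and adaptability in motion composition and text-driven partial goal tracking.

Abstract (Chinese translation)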

本文提出MaskAdapt框架，一种基于物理仿真的人形控制器柔性运动适配方法。该框架采用两阶段残差学习范式：第一阶段，我们通过随机身体部位掩码与正则化项训练掩码不变的基础策略，该正则化项强制不同掩码条件下的动作分布保持一致，从而获得鲁棒的运动先验。该先验在观测信息缺失时仍保持稳定，并能预判后续对这些区域的适配需求。第二阶段，在冻结的基础控制器之上训练残差策略，使其仅针对目标身体部位进行运动调整，同时保持其他部位的原始行为。我们通过两个应用展示该设计的通用性：（一）运动组合，通过变化掩码实现在单一运动序列中进行多部位适配；（二）文本驱动的局部目标跟踪，使指定身体部位跟随由预训练文本条件自回归运动生成器提供的运动学目标。实验表明，MaskAdapt在掩码观测下展现出强大的鲁棒性与适应性，能够生成多样行为，并在目标运动适配任务上优于现有方法。

Abstract

We present MaskAdapt, a framework for flexible motion adaptation in physics-based humanoid control. The framework follows a two-stage residual learning paradigm. In the first stage, we train a mask-invariant base policy using stochastic body-part masking and a regularization term that enforces consistent action distributions across masking conditions. This yields a robust motion prior that remains stable under missing observations, anticipating later adaptation in those regions. In the second stage, a residual policy is trained atop the frozen base controller to modify only the targeted body parts while preserving the original behaviors elsewhere. We demonstrate the versatility of this design through two applications: (i) motion composition, where varying masks enable multi-part adaptation within a single sequence, and (ii) text-driven partial goal tracking, where designated body parts follow kinematic targets provided by a pre-trained text-conditioned autoregressive motion generator. Through experiments, MaskAdapt demonstrates strong robustness and adaptability, producing diverse behaviors under masked observations and delivering superior targeted motion adaptation compared to prior work.

Keywords: physics-based characters, motion adaptation, mask-invariant prior, residual learning, humanoid control, motion composition, text-driven tracking, robust policy
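
The second stage described in the abstract, a residual policy that modifies only the targeted body parts on top of a frozen base controller, can be pictured with a small sketch. It assumes an additive residual gated by a per-joint boolean mask; the abstract does not specify how the two outputs are combined, so the composition rule, the function name, and the toy values below are all hypothetical.

```python
def compose_action(base_action, residual_action, target_mask):
    """Add the residual only on masked (targeted) joints; keep the base elsewhere."""
    if not (len(base_action) == len(residual_action) == len(target_mask)):
        raise ValueError("all inputs must have one entry per joint")
    return [
        base + (residual if targeted else 0.0)
        for base, residual, targeted in zip(base_action, residual_action, target_mask)
    ]


# Four toy joints; only joints 1 and 2 (say, one arm) are targeted for adaptation.
base_action = [0.25, -0.5, 1.0, 0.0]      # output of the frozen base policy
residual_action = [0.5, 0.5, 0.5, 0.5]    # output of the residual policy
target_mask = [False, True, True, False]
print(compose_action(base_action, residual_action, target_mask))  # [0.25, 0.0, 1.5, 0.0]
```

Changing `target_mask` over time would correspond to the motion composition use case, in which different body parts are adapted within a single sequence.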

220. ❌ GazeCLIP: Gaze-Guided CLIP with Adaptive-Enhanced Fine-Grained Language Prompt for Deepfake Attribution and Detection

Authors: Yaning Zhang, Linlin Shen, Zitong Yu, Chunjie Ma, Zan Gao Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29295v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

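Scoring rationale: The paper focuses on deepfake detection and attribution in computer vision, using a CLIP model together with gaze (visual attention) cues. It does not involve large language models, innovations in deep learning technical principles, or scientific applications. The keywords all relate to large language models, deep learning technical principles, or AI for Science, while this paper is a pure computer vision application, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a gaze-guided, CLIP-based method for deepfake attribution and detection that fuses gaze vectors with adaptively enhanced language prompts to improve generalization to unseen forgery methods. On the authors' benchmark it outperforms the state of the art by 6.56% ACC and 5.32% AUC on average under the attribution and detection settings, respectively.

Abstract (Chinese translation)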

当前深度伪造溯源或检测方法因仅局限于视觉模态的有限探索，往往对新型生成方法泛化能力较差。这些方法倾向于粗略评估模型在未见过的先进生成器上的溯源或检测性能，且未能充分考虑两项任务的协同作用。为此，我们提出一种新颖的视线引导CLIP模型，结合自适应增强的细粒度语言提示，用于细粒度深度伪造溯源与检测（DFAD）。具体而言，我们构建了一个新颖的细粒度基准测试，用于评估网络在扩散模型与流模型等新型生成器上的DFAD性能。此外，我们引入了一种基于CLIP的视线感知模型，旨在增强对未见过的面部伪造攻击的泛化能力。基于一项新观察——真实与伪造视线向量存在显著分布差异，且GAN与扩散模型生成的面部图像中目标视线的保留程度差异巨大，我们设计了一个视觉感知编码器，利用固有的视线差异挖掘跨外观与视线域的全局伪造嵌入。我们提出一种视线感知图像编码器（GIE），将通过视线编码器提取的伪造视线提示与常见的伪造图像嵌入相融合，以捕捉通用的溯源模式，使特征能够转换到更稳定、更通用的DFAD特征空间。我们构建了一个语言精炼编码器（LRE），通过自适应增强的词语选择器生成动态增强的语言嵌入，以实现精确的视觉-语言匹配。在我们构建的基准测试上进行的大量实验表明，在溯源与检测设置下，我们的模型平均性能分别以6.56%的准确率（ACC）和5.32%的曲线下面积（AUC）优于现有最优方法。代码将在GitHub上公开。

Abstract

Current deepfake attribution or deepfake detection works tend to exhibit poor generalization to novel generative methods due to the limited exploration in visual modalities alone. They tend to assess the attribution or detection performance of models on unseen advanced generators, coarsely, and fail to consider the synergy of the two tasks. To this end, we propose a novel gaze-guided CLIP with adaptive-enhanced fine-grained language prompts for fine-grained deepfake attribution and detection (DFAD). Specifically, we conduct a novel and fine-grained benchmark to evaluate the DFAD performance of networks on novel generators like diffusion and flow models. Additionally, we introduce a gaze-aware model based on CLIP, which is devised to enhance the generalization to unseen face forgery attacks. Built upon the novel observation that there are significant distribution differences between pristine and forged gaze vectors, and the preservation of the target gaze in facial images generated by GAN and diffusion varies significantly, we design a visual perception encoder to employ the inherent gaze differences to mine global forgery embeddings across appearance and gaze domains. We propose a gaze-aware image encoder (GIE) that fuses forgery gaze prompts extracted via a gaze encoder with common forged image embeddings to capture general attribution patterns, allowing features to be transformed into a more stable and common DFAD feature space. We build a language refinement encoder (LRE) to generate dynamically enhanced language embeddings via an adaptive-enhanced word selector for precise vision-language matching. Extensive experiments on our benchmark show that our model outperforms the state-of-the-art by 6.56% ACC and 5.32% AUC in average performance under the attribution and detection settings, respectively. Codes will be available on GitHub.

Keywords: deepfake detection, deepfake attribution, CLIP, gaze-guided, vision-language matching, generalization, diffusion models, forgery embeddings

221. ❌ ConInfer: Context-Aware Inference for Training-Free Open-Vocabulary Remote Sensing Segmentation

Authors: Wenyang Chen, Zhanxuan Hu, Yaping Zhang, Hailong Ning, Yonghang Tai Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29271v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

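Scoring rationale: The paper focuses on remote sensing image segmentation in computer vision, using vision-language models (VLMs) for training-free open-vocabulary segmentation. Although it is an AI application, the keywords target specific aspects of large language models (LLMs), such as technical principles, training methods, inference optimization, alignment, and agent systems, and the paper involves none of these. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': remote sensing is an Earth-science application and can be viewed as a subfield of AI for Science, but it is not core bioinformatics or cheminformatics, so it receives 5 points (some relevance). The remaining keywords are unrelated to the paper's content.

!!! tip deepseek-chat TL;DR

To address inaccurate segmentation caused by isolated patch-wise predictions in remote sensing imagery, the paper proposes ConInfer, a context-aware inference framework that models semantic dependencies between spatial units. Across multiple benchmark datasets it improves on baselines such as SegEarth-OV by an average of 2.80% on open-vocabulary semantic segmentation and 6.13% on object extraction.

Abstract (Chinese translation)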

基于视觉语言模型的无训练开放词汇遥感分割（OVRSS）已成为实现遥感影像类别无关语义理解的一种前景广阔的研究范式。现有方法主要侧重于增强特征表示或减少模态差异，以提高图块级别的预测精度。然而，这种独立预测方案与遥感数据的内在特性存在根本性错位。在实际应用中，遥感场景通常规模宏大，并表现出强烈的空间及语义关联性，使得孤立的图块级预测难以实现精确分割。为应对这一局限，我们提出了ConInfer——一种面向OVRSS的上下文感知推理框架，该框架在多个空间单元上进行联合预测，并显式建模单元间的语义依赖关系。通过融入全局上下文信息，我们的方法在复杂遥感环境中显著提升了分割的一致性、鲁棒性和泛化能力。在多个基准数据集上的大量实验表明，我们的方法持续超越了基于逐像素视觉语言模型的先进基线方法（如SegEarth-OV），在开放词汇语义分割和对象提取任务上分别实现了平均2.80%和6.13%的性能提升。实现代码已发布于：https://github.com/Dog-Yang/ConInfer

Abstract

Training-free open-vocabulary remote sensing segmentation (OVRSS), empowered by vision-language models, has emerged as a promising paradigm for achieving category-agnostic semantic understanding in remote sensing imagery. Existing approaches mainly focus on enhancing feature representations or mitigating modality discrepancies to improve patch-level prediction accuracy. However, such independent prediction schemes are fundamentally misaligned with the intrinsic characteristics of remote sensing data. In real-world applications, remote sensing scenes are typically large-scale and exhibit strong spatial as well as semantic correlations, making isolated patch-wise predictions insufficient for accurate segmentation. To address this limitation, we propose ConInfer, a context-aware inference framework for OVRSS that performs joint prediction across multiple spatial units while explicitly modeling their inter-unit semantic dependencies. By incorporating global contextual cues, our method significantly enhances segmentation consistency, robustness, and generalization in complex remote sensing environments. Extensive experiments on multiple benchmark datasets demonstrate that our approach consistently surpasses state-of-the-art per-pixel VLM-based baselines such as SegEarth-OV, achieving average improvements of 2.80% and 6.13% on open-vocabulary semantic segmentation and object extraction tasks, respectively. The implementation code is available at: https://github.com/Dog-Yang/ConInfer

Keywords: open-vocabulary remote sensing segmentation, vision-language models, context-aware inference, semantic dependencies, training-free, spatial correlations, SegEarth-OV, joint prediction

222. ❌ Unbiased Model Prediction Without Using Protected Attribute Information

Authors: Puspita Majumdar, Surbhi Mittal, Mayank Vatsa, Richa Singh Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29270v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

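Scoring rationale: The paper studies bias mitigation in deep learning models, proposing a debiasing algorithm (NPAD) that does not rely on protected attribute information and applying it to facial attribute prediction. The scoring keywords all concern specific areas such as large-model techniques, training methods, inference optimization, and AI applications, whereas this paper is a fairness-algorithm study on conventional deep learning models. It does not involve large models, LLMs, MoE, quantization, inference acceleration, AI for Science, or any other keyword topic, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes NPAD, a debiasing algorithm for deep learning models that does not rely on protected attribute information. It uses auxiliary information from non-protected attributes and two loss functions (DACL and FRL) to optimize for fairness, and it significantly reduces bias across gender and age subgroups in facial attribute prediction on the LFWA and CelebA datasets.

Abstract (Chinese translation)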

深度学习领域中的偏见问题持续存在，模型在不同人口统计子组间仍表现出显著性能差异。为此，学界已提出多种算法以提升深度模型的公平性。然而，这些算法大多依赖受保护属性信息进行偏见缓解，这严重限制了其在现实场景中的应用。为解决这一问题，我们提出了一种名为基于非保护属性的去偏算法的新方法，该算法无需使用受保护属性信息即可实现偏见缓解。所提出的NPAD算法利用非保护属性提供的辅助信息，通过优化模型来减轻偏见。此外，我们提出了两种不同的损失函数——基于属性聚类损失的去偏与过滤冗余损失，以优化模型实现公平性目标。我们在LFWA和CelebA数据集上针对面部属性预测任务进行了多项实验，结果观察到模型在不同性别和年龄子组间的偏见显著降低。

Abstract

The problem of bias persists in the deep learning community as models continue to provide disparate performance across different demographic subgroups. Therefore, several algorithms have been proposed to improve the fairness of deep models. However, a majority of these algorithms utilize the protected attribute information for bias mitigation, which severely limits their application in real-world scenarios. To address this concern, we have proposed a novel algorithm, termed as Non-Protected Attribute-based Debiasing (NPAD) algorithm for bias mitigation, that does not require the protected attribute information. The proposed NPAD algorithm utilizes the auxiliary information provided by the non-protected attributes to optimize the model for bias mitigation. Further, two different loss functions, Debiasing via Attribute Cluster Loss (DACL) and Filter Redundancy Loss (FRL) have been proposed to optimize the model for fairness goals. Multiple experiments are performed on the LFWA and CelebA datasets for facial attribute prediction, and a significant reduction in bias across different gender and age subgroups is observed.

Keywords: bias mitigation, fairness, deep learning, protected attribute, non-protected attribute, debiasing algorithm, facial attribute prediction, demographic subgroups

223. ❌ Omni-NegCLIP: Enhancing CLIP with Front-Layer Contrastive Fine-Tuning for Comprehensive Negation Understanding

Authors: Jingqi Xu Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29258v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

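Scoring rationale: The paper focuses on improving CLIP's understanding of negation through contrastive fine-tuning, a targeted study of vision-language models (VLMs). The only highly relevant scoring keyword is "Post-training OR Supervised Fine-tuning OR SFT", because the core of the paper is fine-tuning CLIP (treated here as a form of supervised fine-tuning). The other keywords mainly target LLM techniques, applications, or evaluation, whereas this paper studies a VLM and does not involve LLM-related techniques such as MoE, scaling laws, alignment, reasoning methods, or agent systems. It also does not involve AI applications in science (e.g. bioinformatics). All keywords other than supervised fine-tuning are therefore scored 0.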
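
The arithmetic behind the score line can be sketched as follows. The pass threshold (5.0 + 0.8 × total weight) and the 27 keywords come from the report header. The per-keyword rule `score = weight × relevance` is an assumption that is merely consistent with the printed rows (weight 0.0 × relevance 10.0 gives 0.0). The function names are hypothetical.

```python
def pass_threshold(total_weight: float) -> float:
    """Passing score as stated in the report header: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def paper_score(rows: list[tuple[float, float]]) -> float:
    """Sum of per-keyword scores.

    ASSUMPTION: each keyword contributes weight * relevance; the report
    does not state this rule explicitly.
    """
    return sum(weight * relevance for weight, relevance in rows)


# 27 keywords with weight 1.0 each, as listed in the report header.
threshold = pass_threshold(27 * 1.0)

# Rows as printed for this paper: one keyword at relevance 10.0, 26 at 0.0,
# all with a printed weight of 0.0.
rows = [(0.0, 10.0)] + [(0.0, 0.0)] * 26
score = paper_score(rows)
passed = score >= threshold

print(f"score={score:.1f} threshold={threshold:.1f} passed={passed}")
```

Under this reading, a printed weight of 0.0 zeroes out every row regardless of relevance, which would explain why a paper with a 10.0/10 relevance still totals 0.0.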

!!! tip deepseek-chat TL;DR

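To address CLIP's weakness in understanding negation in image captions, the paper proposes Omni-NegCLIP, a front-layer contrastive fine-tuning method that substantially improves the model's understanding of presence-based and absence-based negation while preserving general image-text retrieval performance.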

摘要翻译

视觉语言模型（Vision-Language Models, VLMs）已在广泛的多模态任务中展现出强大能力。然而，近期研究表明，诸如CLIP等视觉语言模型在理解自然语言中常见的否定表达方面表现不佳。本文提出Omni-NegCLIP，这是一种通过改进CLIP原始InfoNCE对比损失进行微调的模型，旨在提升CLIP对两类否定的理解能力：存在性否定与缺席性否定——前者对应图像中实际存在但被否定的对象表达，后者对应图像中可能合理存在但实际缺席的对象表达。具体而言，我们设计了存在性对比目标，使图像嵌入更接近其原始描述嵌入，同时远离对应的存在性否定描述嵌入；以及缺席性对比目标，使图像嵌入同时与原始描述及缺席性否定描述嵌入对齐，同时保持两种文本嵌入间的语义区分。基于我们观察到CLIP文本编码器的前部Transformer层对否定文本具有比后部更强的学习能力，我们在每个训练步骤中采用组合对比目标对CLIP文本编码器的前部Transformer层进行微调。实验结果表明，与预训练CLIP相比，Omni-NegCLIP在存在性否定和缺席性否定任务上的性能分别提升高达52.65%和12.50%，且未牺牲图文检索的通用能力，甚至将其提升达19.62%。与先前研究相比，Omni-NegCLIP展现出更全面的多类型否定任务理解能力。

摘要 (Abstract)

Vision-Language Models (VLMs) have demonstrated strong capabilities across a wide range of multimodal tasks. However, recent studies have shown that VLMs, such as CLIP, perform poorly in understanding negation expressions, which are common in natural language. In this work, we propose Omni-NegCLIP, a fine-tuned CLIP model that improves CLIP’s understanding of two types of negation, namely presence-based negation and absence-based negation, which correspond to negated expressions of objects that are actually present in an image and those that may plausibly exist in an image but are in fact absent, respectively, by modifying CLIP’s original InfoNCE contrastive loss. Specifically, we design a presence-based contrastive objective that pulls image embeddings closer to their original caption embeddings while pushing them away from the corresponding presence-based negated caption embeddings, and an absence-based contrastive objective that aligns image embeddings with both original and absence-based negated caption embeddings while maintaining a semantic distinction between the two text embeddings. Based on our observation that the front transformer layers of CLIP text encoder have stronger learning ability for negated text than the later layers, we fine-tune the front transformer layers of the CLIP text encoder at each training step using the combined contrastive objective. Experimental results show that, compared with pretrained CLIP, Omni-NegCLIP improves performance on presence-based negation and absence-based negation tasks by up to 52.65% and 12.50%, respectively, without sacrificing general capability in image-text retrieval and even improving it by up to 19.62%. Compared with prior works, Omni-NegCLIP demonstrates a more comprehensive ability to understand multiple types of negation tasks.
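
The presence-based objective described above (pull the image toward its original caption, push it away from the negated caption) can be illustrated with a minimal InfoNCE-style sketch. The two-term form, the temperature value, and the toy embeddings are assumptions for illustration only; the paper's actual loss may differ.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def presence_contrastive_loss(image, caption, negated_caption, temperature=0.07):
    """InfoNCE-style loss with the negated caption as the only negative.

    ASSUMPTION: this two-term form and the temperature are illustrative;
    the paper's actual objective may differ.
    """
    pos = math.exp(cosine(image, caption) / temperature)
    neg = math.exp(cosine(image, negated_caption) / temperature)
    return -math.log(pos / (pos + neg))


# Toy embeddings (made up): the image is close to the original caption.
image = [1.0, 0.0]
caption = [0.9, 0.1]
negated = [0.1, 0.9]

loss_good = presence_contrastive_loss(image, caption, negated)
loss_bad = presence_contrastive_loss(image, negated, caption)
print(loss_good < loss_bad)
```

The loss is small when the image embedding sits nearer the original caption than the negated one, and grows when the two are swapped, which is the behaviour the objective is meant to reward.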

Keywords: Vision-Language Models, CLIP, negation understanding, contrastive fine-tuning, presence-based negation, absence-based negation, front-layer tuning, image-text retrieval

224. ❌ Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism

Authors: Tao Chen, Kun Zhang, Qiong Wu, Xiao Chen, Chao Chang, Xiaoshuai Sun, Yiyi Zhou, Rongrong Ji Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29252v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	10.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

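Scoring rationale: The paper focuses on long video understanding for multimodal large language models (MLLMs). Its core contribution is FlexMem, a visual memory mechanism for handling video input of unbounded length. The highly relevant keywords are: 1) "Large Language Models" (the paper explicitly studies MLLMs); 2) "Context Window Extension" (long video understanding is essentially a context-extension problem); 3) "KV Cache Compression" (the paper uses visual KV caches as the memory source and compresses them). Other keywords such as MoE, SLMs, Scaling Laws, training methods, reasoning techniques, and AI for Science are not covered.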

!!! tip deepseek-chat TL;DR

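To address the challenge of long video understanding in MLLMs, the paper proposes FlexMem, a method based on a visual memory mechanism that handles videos of unbounded length, processes more than 1,000 frames on a single GPU, and outperforms existing efficient methods.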

摘要翻译

长视频理解是制约多模态大语言模型发展的核心挑战。本文从视觉记忆机制的角度研究该问题，提出了一种无需训练的新方法——灵活记忆。该方法旨在模拟人类观看视频的行为模式，即持续观看视频内容并回忆最相关的记忆片段以回答问题。通过这种方式，灵活记忆能够帮助多模态大语言模型实现无限时长的视频理解，突破了传统方法需一次性处理全部视频信息且存在输入长度限制的瓶颈。具体而言，灵活记忆首先将视觉键值缓存视为记忆源，通过双路径压缩设计实现高效记忆迁移与存储。随后，针对包括主流流式视频任务在内的多样化视频理解需求，该方法探索了不同的记忆读取策略。为验证其有效性，我们将灵活记忆应用于两种主流视频多模态大语言模型，并在五个长视频任务及一个流式视频任务上进行了广泛实验。结果表明，在单张3090 GPU上，灵活记忆相比现有高效视频理解方法能实现显著性能提升，并可处理超过1000帧的视频数据。该方法还使基础多模态大语言模型在部分基准测试中达到甚至超越了当前最先进模型（如GPT-4o和Gemini-1.5 Pro）的性能水平。

摘要 (Abstract)

Long video understanding is a key challenge that plagues the advancement of Multimodal Large language Models (MLLMs). In this paper, we study this problem from the perspective of visual memory mechanism, and proposed a novel and training-free approach, termed Flexible Memory (FlexMem). In principle, FlexMem aims to mimic human behavior of video watching, i.e., continually watching video content and recalling the most relevant memory fragments to answer the question. In this way, FlexMem can help MLLMs achieve video understanding of infinite lengths, unlike previous methods that process all video information at once and have input upper-limit. Concretely, FlexMem first consider the visual KV caches as the memory sources, and realize the effective memory transfer and writing via a dual-pathway compression design. Afterwards, FlexMem also explores different memory reading strategies for the diverse video understanding tasks, including the popular streaming one. To validate FlexMem, we apply it to two popular video-MLLMs, and conduct extensive experiments on five long video and one streaming video task. The experimental results show that on a single 3090 GPU, our FlexMem can achieve obvious improvements than existing efficient video understanding methods and process more than 1k frames, which also helps the base MLLMs achieve comparable or even better performance than SOTA MLLMs on some benchmarks, e.g., GPT-4o and Gemini-1.5 Pro.
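
The write-then-recall pattern described in the abstract (store compressed memory while watching, then read the most relevant fragments for a question) can be sketched as below. This is not FlexMem itself: `mean_pool` is an assumed stand-in for the dual-pathway compression, the `MemoryBank` class is hypothetical, and dot-product top-k is only one possible reading strategy.

```python
import heapq


def mean_pool(frames: list[list[float]]) -> list[float]:
    """Compress a chunk of frame features into one vector (ASSUMED stand-in
    for the paper's dual-pathway compression, which is not described here)."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]


class MemoryBank:
    """Append-only memory of compressed video chunks (hypothetical design)."""

    def __init__(self) -> None:
        self.entries: list[tuple[int, list[float]]] = []

    def write(self, chunk_id: int, frames: list[list[float]]) -> None:
        """Store a compact summary of one chunk as it streams in."""
        self.entries.append((chunk_id, mean_pool(frames)))

    def read(self, query: list[float], k: int) -> list[int]:
        """Return ids of the k chunks whose summary best matches the query."""
        def score(entry: tuple[int, list[float]]) -> float:
            return sum(q * m for q, m in zip(query, entry[1]))
        best = heapq.nlargest(k, self.entries, key=score)
        return [chunk_id for chunk_id, _ in best]


# Stream three toy chunks, then recall the two most relevant to a query.
bank = MemoryBank()
bank.write(0, [[1.0, 0.0], [0.8, 0.2]])
bank.write(1, [[0.0, 1.0], [0.1, 0.9]])
bank.write(2, [[0.6, 0.4], [0.4, 0.6]])
recalled = bank.read(query=[1.0, 0.0], k=2)
print(recalled)
```

Because only the recalled chunks are passed on to the model, the amount of context used per question stays bounded no matter how long the stream runs.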

Keywords: Multimodal Large Language Models, Long Video Understanding, Visual Memory Mechanism, KV Cache Compression, Flexible Memory, Streaming Video, Infinite Length, Training-free Approach

225. ❌ Monocular Building Height Estimation from PhiSat-2 Imagery: Dataset and Method

Authors: Yanjiao Song, Bowen Cai, Timo Balz, Zhenfeng Shao, Neema Simon Sumari, James Magidi, Walter Musakwa Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29245v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

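Scoring rationale: The paper focuses on monocular building height estimation from PhiSat-2 satellite imagery and contributes a dataset (PHDataset) and a Two-Stream Ordinal Network (TSONet). Its core is computer vision and remote sensing, involving image segmentation, height regression, and feature interaction modules. The keywords all relate to large models, deep learning techniques, or AI applications in science, but the paper does not mention large models, LLMs, MoE, training methods, alignment, reasoning, agents, compression, or similar topics. The only potentially relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", since building height estimation can be seen as an application of AI in science (remote sensing / geographic information). However, the paper does not use these terms explicitly and centers on conventional CV methods rather than large-model innovation, so it receives 5 points (some relevance). All other keywords are unrelated and receive 0.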

!!! tip deepseek-chat TL;DR

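The study tackles monocular building height estimation from PhiSat-2 optical imagery by building a global dataset (PHDataset) and proposing a Two-Stream Ordinal Network (TSONet), which in experiments significantly reduces error and improves segmentation accuracy.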

摘要翻译

基于光学影像的单目建筑物高度估计对于城市形态表征具有重要意义，但由于高度线索模糊、城市间建筑形态差异巨大以及建筑高度呈长尾分布，该任务仍具挑战性。PhiSat-2因其全球覆盖、4.75米空间分辨率和七波段光谱观测能力，成为该任务中一个前景广阔的开放数据源，但其潜力尚未得到系统评估。为填补这一空白，本研究构建了PhiSat-2高度数据集（PHDataset）并提出了一种双流有序网络（TSONet）。PHDataset包含来自全球26个城市的9,475个已配准的图像-标签图斑对。TSONet联合建模建筑基底分割与高度估计，并引入了跨流交换模块（CSEM）和特征增强分箱优化模块（FEBR），以实现基底感知的特征交互和有序高度优化。在PHDataset上的实验表明，TSONet取得了最佳综合性能，与最强竞争结果相比，其平均绝对误差和均方根误差分别降低了13.2%和9.7%，交并比和F1分数则提升了14.0%和10.1%。消融实验进一步验证了CSEM、FEBR以及有序回归与基底辅助联合使用的有效性。补充分析表明，PhiSat-2通过平衡结合与建筑物相关的空间细节和多光谱观测，有益于单目建筑物高度估计。总体而言，本研究证实了PhiSat-2在单目建筑物高度估计方面的潜力，并为未来研究提供了一个专用数据集和一种有效方法。

摘要 (Abstract)

Monocular building height estimation from optical imagery is important for urban morphology characterization but remains challenging due to ambiguous height cues, large inter-city variations in building morphology, and the long-tailed distribution of building heights. PhiSat-2 is a promising open-access data source for this task because of its global coverage, 4.75 m spatial resolution, and seven-band spectral observations, yet its potential has not been systematically evaluated. To address this gap, we construct a PhiSat-2-Height dataset (PHDataset) and propose a Two-Stream Ordinal Network (TSONet). PHDataset contains 9,475 co-registered image-label patch pairs from 26 cities worldwide. TSONet jointly models footprint segmentation and height estimation, and introduces a Cross-Stream Exchange Module (CSEM) and a Feature-Enhanced Bin Refinement (FEBR) module for footprint-aware feature interaction and ordinal height refinement. Experiments on PHDataset show that TSONet achieves the best overall performance, reducing MAE and RMSE by 13.2% and 9.7%, and improving IoU and F1-score by 14.0% and 10.1% over the strongest competing results. Ablation studies further verify the effectiveness of CSEM, FEBR, and the joint use of ordinal regression and footprint assistance. Additional analyses indicate that PhiSat-2 benefits monocular building height estimation through its balanced combination of building-relevant spatial detail and multispectral observations. Overall, this study confirms the potential of PhiSat-2 for monocular building height estimation and provides a dedicated dataset and an effective method for future research.
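
Ordinal regression, which the abstract names as part of TSONet, is commonly implemented by turning a continuous target into a set of "is the value above threshold k?" questions. The sketch below shows that generic encoding and decoding. The bin edges, the 0.5 decision rule, and the function names are assumptions and are not taken from the paper.

```python
def ordinal_encode(height_m: float, thresholds: list[float]) -> list[int]:
    """Cumulative binary targets: 1 where the height exceeds a threshold."""
    return [1 if height_m > t else 0 for t in thresholds]


def ordinal_decode(probs: list[float], thresholds: list[float], top_m: float) -> float:
    """Map per-threshold probabilities back to a height estimate.

    Counts thresholds passed with probability > 0.5 and returns the midpoint
    of the selected bin.
    """
    rank = sum(1 for p in probs if p > 0.5)
    edges = [0.0] + thresholds + [top_m]
    return (edges[rank] + edges[rank + 1]) / 2.0


# Hypothetical bin edges in metres, denser for low buildings because
# building heights are long-tailed.
thresholds = [3.0, 6.0, 12.0, 24.0, 48.0]

target = ordinal_encode(10.0, thresholds)
estimate = ordinal_decode([0.98, 0.91, 0.30, 0.05, 0.01], thresholds, top_m=96.0)
print(target, estimate)
```

Non-uniform bins like these give the model finer resolution where most buildings fall while still covering rare tall ones.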

Keywords: monocular building height estimation, PhiSat-2 imagery, PHDataset, Two-Stream Ordinal Network, footprint segmentation, ordinal regression, remote sensing, urban morphology

226. ❌ Diffusion Mental Averages

Authors: Phonphrm Thawatdamrongkit, Sukit Seripanitkarn, Supasorn Suwajanakorn Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29239v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

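Scoring rationale: The paper studies how diffusion models can generate a "mental average" of a concept. It belongs to computer vision / generative modeling rather than to core innovations in LLMs or deep learning fundamentals. The only relevant keyword is "PEFT OR LoRA OR Parameter-efficient Fine-tuning", because the abstract mentions using LoRA (Low-Rank Adaptation) to bridge CLIP clusters into diffusion space; LoRA is only an auxiliary technique here, not the paper's core contribution. The remaining keywords, which concern LLMs, reasoning, alignment, scientific AI applications and so on, are unrelated and are scored 0.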

!!! tip deepseek-chat TL;DR

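The paper proposes Diffusion Mental Averages (DMA), which uses trajectory alignment within the diffusion model's semantic space to generate sharp, realistic "mental averages" of concepts, addressing the blurry results produced by existing data-averaging methods.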

摘要翻译

扩散模型能否生成其自身对概念的“心理平均”——一种与典型样本同样清晰且逼真的表征？我们提出了扩散心理平均（Diffusion Mental Averages, DMA），作为对这一问题的模型中心式解答。现有方法旨在对图像集合进行平均，但当应用于同一提示词生成的扩散样本时，其输出结果往往模糊不清。这些以数据为中心的技术在模型外部操作，忽略了生成过程本身。与之相反，DMA 在扩散模型的语义空间内进行平均，这一空间特性已得到近期研究的揭示。由于该空间随时间步演变且缺乏直接解码器，我们将平均问题转化为轨迹对齐：通过优化多个噪声隐变量，使其去噪轨迹逐步收敛至共享的从粗到细的语义，从而生成一个清晰的原型样本。我们将该方法扩展至多模态概念（例如包含多个品种的“狗”概念），通过在语义丰富的空间（如CLIP）中对样本进行聚类，并运用文本反转（Textual Inversion）或低秩适应（LoRA）技术将CLIP聚类桥接至扩散空间。据我们所知，这是首个能够生成一致且逼真的平均结果的方法，即使对于抽象概念亦如此，它既可作为一种具体的视觉摘要，也可作为探究模型偏见与概念表征的观察窗口。

摘要 (Abstract)

Can a diffusion model produce its own “mental average” of a concept-one that is as sharp and realistic as a typical sample? We introduce Diffusion Mental Averages (DMA), a model-centric answer to this question. While prior methods aim to average image collections, they produce blurry results when applied to diffusion samples from the same prompt. These data-centric techniques operate outside the model, ignoring the generative process. In contrast, DMA averages within the diffusion model’s semantic space, as discovered by recent studies. Since this space evolves across timesteps and lacks a direct decoder, we cast averaging as trajectory alignment: optimize multiple noise latents so their denoising trajectories progressively converge toward shared coarse-to-fine semantics, yielding a single sharp prototype. We extend our approach to multimodal concepts (e.g., dogs with many breeds) by clustering samples in semantically-rich spaces such as CLIP and applying Textual Inversion or LoRA to bridge CLIP clusters into diffusion space. This is, to our knowledge, the first approach that delivers consistent, realistic averages, even for abstract concepts, serving as a concrete visual summary and a lens into model biases and concept representation.

Keywords: Diffusion Models, Mental Averages, Trajectory Alignment, Semantic Space, CLIP, LoRA, Concept Representation, Model Biases

227. ❌ M2H-MX: Multi-Task Dense Visual Perception for Real-Time Monocular Spatial Understanding

Authors: U. V. B. L. Udugama, George Vosselman, Francesco Nex Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29236v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

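Scoring rationale: The paper belongs to computer vision and robotic perception. It studies a multi-task dense visual perception model (M2H-MX) for real-time monocular spatial understanding, covering depth estimation, semantic segmentation, and SLAM integration. The scoring keywords all concern LLMs, deep learning techniques, or AI applications in science, none of which the paper addresses: it mentions no language models (large, small, or foundation), no training techniques (pre-training, fine-tuning, alignment, RLHF, PEFT, etc.), no inference optimization (attention mechanisms, quantization, decoding acceleration), no reasoning methods (chain of thought, System 2 thinking, MCTS), and no agent systems, hallucination mitigation, interpretability, world models, model merging, or in-context learning; nor does it involve scientific AI applications such as bioinformatics or cheminformatics. All keywords therefore receive a relevance of 0.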

!!! tip deepseek-chat TL;DR

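The paper proposes M2H-MX, a multi-task dense visual perception model for real-time monocular spatial understanding. By improving the interaction between depth and semantic prediction, it significantly improves prediction accuracy on NYUDv2 and SLAM system performance on ScanNet.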

摘要翻译

单目相机因其低成本与易部署特性在机器人感知领域备受青睐，然而从单一图像流中实现可靠、实时的空间理解仍具挑战。尽管近年来的多任务密集预测模型已提升了逐像素深度与语义估计性能，但将这些进展转化为稳定的单目建图系统仍非易事。
本文提出M2H-MX，一种用于单目空间理解的实时多任务感知模型。该模型在保留多尺度特征表示的同时，于轻量级解码器中引入了寄存器门控全局上下文与受控的跨任务交互机制，使得深度与语义预测能在严格延迟约束下相互增强。其输出通过紧凑的感知-建图接口，可直接集成至未经修改的单目SLAM（同步定位与建图）流程中。
我们评估了密集预测精度与系统闭环性能。在NYUDv2数据集上，M2H-MX-L模型取得了领先性能，相较于代表性多任务基线模型，语义平均交并比（mIoU）提升6.6%，深度均方根误差（RMSE）降低9.4%。在ScanNet数据集的实际单目建图系统中部署时，M2H-MX相比强大的单目SLAM基线将平均轨迹误差降低了60.7%，同时生成更清晰的度量-语义地图。这些结果表明，现代多任务密集预测技术能够可靠地应用于机器人系统的实时单目空间感知任务。

摘要 (Abstract)

Monocular cameras are attractive for robotic perception due to their low cost and ease of deployment, yet achieving reliable real-time spatial understanding from a single image stream remains challenging. While recent multi-task dense prediction models have improved per-pixel depth and semantic estimation, translating these advances into stable monocular mapping systems is still non-trivial. This paper presents M2H-MX, a real-time multi-task perception model for monocular spatial understanding. The model preserves multi-scale feature representations while introducing register-gated global context and controlled cross-task interaction in a lightweight decoder, enabling depth and semantic predictions to reinforce each other under strict latency constraints. Its outputs integrate directly into an unmodified monocular SLAM pipeline through a compact perception-to-mapping interface. We evaluate both dense prediction accuracy and in-the-loop system performance. On NYUDv2, M2H-MX-L achieves state-of-the-art results, improving semantic mIoU by 6.6% and reducing depth RMSE by 9.4% over representative multi-task baselines. When deployed in a real-time monocular mapping system on ScanNet, M2H-MX reduces average trajectory error by 60.7% compared to a strong monocular SLAM baseline while producing cleaner metric-semantic maps. These results demonstrate that modern multi-task dense prediction can be reliably deployed for real-time monocular spatial perception in robotic systems.
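
For readers unfamiliar with the two dense-prediction metrics quoted above, the following is a minimal sketch of how mean IoU and RMSE are typically computed over flattened per-pixel predictions. The toy labels and depths are made up, and the paper's exact evaluation protocol (class set, masking, averaging) is not specified here.

```python
import math


def mean_iou(pred: list[int], truth: list[int], num_classes: int) -> float:
    """Mean intersection-over-union over classes present in pred or truth."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, truth) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, truth) if p == c or t == c)
        if union > 0:  # skip classes that appear in neither map
            ious.append(inter / union)
    return sum(ious) / len(ious)


def rmse(pred: list[float], truth: list[float]) -> float:
    """Root mean squared error between predicted and true depths."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred))


# Toy per-pixel semantic labels and depths (made up for illustration).
labels_pred = [0, 0, 1, 1]
labels_true = [0, 1, 1, 1]
depth_pred = [1.0, 2.0]
depth_true = [1.0, 4.0]

miou = mean_iou(labels_pred, labels_true, num_classes=2)
depth_rmse = rmse(depth_pred, depth_true)
print(round(miou, 4), round(depth_rmse, 4))
```

Higher mIoU and lower RMSE are better, which is why the abstract reports one as an improvement and the other as a reduction.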

Keywords: monocular spatial understanding, multi-task dense prediction, real-time perception, depth estimation, semantic segmentation, SLAM integration, robotic perception, lightweight decoder

228. ❌ CCDNet: Learning to Detect Camouflage against Distractors in Infrared Small Target Detection

Authors: Zikai Liao, Zhaozheng Yin Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29228v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

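Scoring rationale: The paper focuses on the computer vision task of infrared small target detection. It proposes CCDNet to handle target camouflage and distractors, using conventional deep learning components: Weighted Multi-branch Perceptrons, an Aggregation-and-Refinement Fusion Neck, and a Contrastive-aided Distractor Discriminator. The scoring keywords all concern large-model topics such as LLMs, training optimization, inference acceleration, alignment, and agent systems, which are unrelated to this detection task, so all keywords receive a relevance of 0.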

!!! tip deepseek-chat TL;DR

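The paper proposes CCDNet, a network that combines Weighted Multi-branch Perceptrons, an Aggregation-and-Refinement Fusion Neck, and a Contrastive-aided Distractor Discriminator to address false alarms caused by target camouflage and distractors in infrared small target detection. Experiments show that it outperforms existing methods.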

摘要翻译

红外目标检测（IRSTD）任务在野外救援与海上搜救等领域具有关键应用价值。然而，由于红外目标对比度低、易与复杂背景融合形成有效伪装，其检测面临严峻挑战。此外，具有相似特征的其他物体（干扰物）可能引发虚警，进一步降低检测性能。为解决这些问题，本文提出一种新颖的伪装感知抗干扰网络（Camouflage-aware Counter-Distraction Network，CCDNet）。我们设计了一种采用加权多分支感知器（Weighted Multi-branch Perceptrons，WMPs）的主干网络，通过聚合自调节的多层次特征来精确表征目标与背景。基于这些丰富特征，我们进一步提出一种新颖的聚合-精化融合颈部模块（Aggregation-and-Refinement Fusion Neck，ARFN），用于从浅层/深层特征图中精化结构/语义信息，并双向重建目标与背景间的关联关系，在抑制复杂背景的同时突出目标，从而提升检测精度。此外，我们提出一种新的对比辅助干扰物判别器（Contrastive-aided Distractor Discriminator，CaDD），通过在真实目标与背景之间执行局部与全局的自适应相似度计算，以更精确地区分干扰物，从而降低虚警率。在红外图像数据集上的大量实验证实，CCDNet的性能优于其他先进方法。

摘要 (Abstract)

Infrared target detection (IRSTD) tasks have critical applications in areas like wilderness rescue and maritime search. However, detecting infrared targets is challenging due to their low contrast and tendency to blend into complex backgrounds, effectively camouflaging themselves. Additionally, other objects with similar features (distractors) can cause false alarms, further degrading detection performance. To address these issues, we propose a novel Camouflage-aware Counter-Distraction Network (CCDNet) in this paper. We design a backbone with Weighted Multi-branch Perceptrons (WMPs), which aggregates self-conditioned multi-level features to accurately represent the target and background. Based on these rich features, we then propose a novel Aggregation-and-Refinement Fusion Neck (ARFN) to refine structures/semantics from shallow/deep features maps, and bidirectionally reconstruct the relations between the targets and the backgrounds, highlighting the targets while suppressing the complex backgrounds to improve detection accuracy. Furthermore, we present a new Contrastive-aided Distractor Discriminator (CaDD), enforcing adaptive similarity computation both locally and globally between the real targets and the backgrounds to more precisely discriminate distractors, so as to reduce the false alarm rate. Extensive experiments on infrared image datasets confirm that CCDNet outperforms other state-of-the-art methods.

Keywords: Infrared target detection, Camouflage-aware, Distractor discrimination, Weighted Multi-branch Perceptrons, Aggregation-and-Refinement Fusion Neck, Contrastive-aided Distractor Discriminator, False alarm reduction

229. ❌ LightHarmony3D: Harmonizing Illumination and Shadows for Object Insertion in 3D Gaussian Splatting

Authors: Tianyu Huang, Zhenyang Ren, Zhenchen Wan, Jiyang Zheng, Wenjie Wang, Runnan Chen, Mingming Gong, Tongliang Liu Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29209v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

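Scoring rationale: The paper focuses on illumination consistency and shadow rendering in 3D Gaussian Splatting (3DGS) scenes. It belongs to computer vision and graphics and has no direct connection to any of the listed large-model and deep learning keywords (LLMs, MoE, RLHF, RAG, etc.) or to AI for Science applications. It does not address language models, model training, inference optimization, alignment, agent systems, or scientific AI applications.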

!!! tip deepseek-chat TL;DR

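The paper proposes LightHarmony3D, a framework whose generative module predicts an HDR environment map, addressing the challenge of physically consistent lighting and shadows when inserting mesh objects into 3D Gaussian Splatting scenes. It also introduces the first dedicated benchmark for this task, and experiments show state-of-the-art realism and multi-view consistency.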

摘要翻译

三维高斯泼溅（3D Gaussian Splatting，简称3DGS）能够实现场景几何与外观的高保真重建。基于此能力，将外部网格对象插入重建后的3DGS场景中，可为增强现实/虚拟现实（AR/VR）、虚拟场景布置及数字内容创作等沉浸式应用实现交互式编辑与内容增强。然而，为网格插入实现物理一致的光照与阴影仍具挑战，因为这需要精确的场景光照估计与多视角一致渲染。为解决这一难题，我们提出了LightHarmony3D——一种在3DGS场景中实现光照一致网格插入的新型框架。我们方法的核心在于所提出的生成模块，该模块通过单次前向传播即可在插入位置预测完整的360度高动态范围（HDR）环境贴图。通过利用生成先验而非迭代优化，我们的方法能高效捕捉场景主导光照，并为插入的网格提供基于物理的着色与阴影，同时保持多视角一致性。此外，我们首次为3DGS中的网格插入任务构建了专用基准测试，为评估光照一致性与照片真实感提供了标准化框架。在多个真实世界重建数据集上的大量实验表明，LightHarmony3D在真实感与多视角一致性方面均达到了当前最优水平。

摘要 (Abstract)

3D Gaussian Splatting (3DGS) enables high-fidelity reconstruction of scene geometry and appearance. Building on this capability, inserting external mesh objects into reconstructed 3DGS scenes enables interactive editing and content augmentation for immersive applications such as AR/VR, virtual staging, and digital content creation. However, achieving physically consistent lighting and shadows for mesh insertion remains challenging, as it requires accurate scene illumination estimation and multi-view consistent rendering. To address this challenge, we present LightHarmony3D, a novel framework for illumination-consistent mesh insertion in 3DGS scenes. Central to our approach is our proposed generative module that predicts a full 360° HDR environment map at the insertion location via a single forward pass. By leveraging generative priors instead of iterative optimization, our method efficiently captures dominant scene illumination and enables physically grounded shading and shadows for inserted meshes while maintaining multi-view coherence. Furthermore, we introduce the first dedicated benchmark for mesh insertion in 3DGS, providing a standardized evaluation framework for assessing lighting consistency and photorealism. Extensive experiments across multiple real-world reconstruction datasets demonstrate that LightHarmony3D achieves state-of-the-art realism and multi-view consistency.

Keywords: 3D Gaussian Splatting, illumination consistency, mesh insertion, HDR environment map, multi-view rendering, photorealistic editing, generative module, lighting estimation

230. ❌ Multi-Layered Memory Architectures for LLM Agents: An Experimental Evaluation of Long-Term Context Retention

Authors: Sunil Tiwari, Payal Fofadiya Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29194v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

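Scoring rationale: The paper's core topic is a long-term memory architecture for LLM agents, so it is highly relevant to 'LLM Agents', 'Large Language Models', and 'Context Window Extension' (10 points). It addresses semantic drift in long-horizon dialogue through layered memory and retrieval mechanisms, which relates to 'Retrieval-Augmented Generation', 'Chain of Thought', and 'System 2 Thinking' (8 points). Other keywords such as MoE, quantization, and AI for Science are not covered (0 points).

!!! tip deepseek-chat TL;DR

The paper targets semantic drift and unstable memory in LLM agents during long-horizon dialogue and proposes a multi-layer memory framework. Experiments show that it improves long-term retention and reasoning stability under a constrained context budget.

Abstract (Chinese translation)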

长程对话系统在跨多轮会话中普遍面临语义漂移与记忆稳定性不足的问题。本文提出一种多层记忆框架，通过将对话历史解耦为工作记忆、情景记忆与语义记忆三层，并引入自适应检索门控与记忆保持正则化机制。该架构在控制跨会话语义漂移的同时，实现了有界上下文增长与计算效率的平衡。在LOCOMO、LOCCO和LoCoMo数据集上的实验表明，该框架显著提升了系统性能：任务成功率（Success Rate）达46.85%，整体F1值提升至0.618（其中多跳推理F1为0.594），六周期记忆保持率达到56.90%，同时将错误记忆率降低至5.1%，上下文使用率控制在58.40%。实验结果证实了该框架在有限上下文预算下，能够有效增强长期记忆保持与推理稳定性。

Abstract

Long-horizon dialogue systems suffer from semantic drift and unstable memory retention across extended sessions. This paper presents a Multi-Layer Memory Framework that decomposes dialogue history into working, episodic, and semantic layers with adaptive retrieval gating and retention regularization. The architecture controls cross-session drift while maintaining bounded context growth and computational efficiency. Experiments on LOCOMO, LOCCO, and LoCoMo show improved performance, achieving 46.85 Success Rate, 0.618 overall F1 with 0.594 multi-hop F1, and 56.90% six-period retention while reducing false memory rate to 5.1% and context usage to 58.40%. Results confirm enhanced long-term retention and reasoning stability under constrained context budgets.

Keywords: LLM Agents, Long-Term Context Retention, Multi-Layer Memory Framework, Semantic Drift, Adaptive Retrieval, Memory Retention, Dialogue Systems, Context Budget
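
The split into working, episodic, and semantic layers with gated retrieval can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the `LayeredMemory` class, the word-overlap (Jaccard) gate, and the `working_size` and `gate_threshold` parameters are assumptions made for the example, and retention regularization is not modeled.

```python
import re
from collections import deque
from dataclasses import dataclass, field


def tokens(text: str) -> set[str]:
    """Lowercased word set used by the toy relevance gate."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Word-overlap similarity in [0, 1]."""
    return len(a & b) / len(a | b) if a | b else 0.0


@dataclass
class LayeredMemory:
    working_size: int = 4        # hypothetical: number of recent turns kept verbatim
    gate_threshold: float = 0.2  # hypothetical: minimum overlap needed to recall a turn
    working: deque = field(default_factory=deque)
    episodic: list = field(default_factory=list)
    semantic: dict = field(default_factory=dict)

    def add_turn(self, text: str) -> None:
        """New turns enter working memory; the oldest turn spills into episodic memory."""
        self.working.append(text)
        if len(self.working) > self.working_size:
            self.episodic.append(self.working.popleft())

    def add_fact(self, key: str, value: str) -> None:
        """Semantic memory stores distilled key/value facts."""
        self.semantic[key] = value

    def build_context(self, query: str) -> list[str]:
        """Gate retrieval: only facts and old turns that overlap the query are recalled."""
        q = tokens(query)
        facts = [f"{k}: {v}" for k, v in self.semantic.items() if k.lower() in q]
        recalled = [t for t in self.episodic if jaccard(q, tokens(t)) >= self.gate_threshold]
        return facts + recalled + list(self.working)


mem = LayeredMemory(working_size=2)
for turn in ["My dog is named Biscuit", "I work as a nurse",
             "The weather is nice today", "Let us plan dinner"]:
    mem.add_turn(turn)
mem.add_fact("dog", "Biscuit")
context = mem.build_context("What is my dog named")
```

Because the gate filters episodic turns before they reach the prompt, the context size is bounded by the working window plus whatever passes the gate, which is the property the abstract calls bounded context growth.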

231. ❌ Developing Adaptive Context Compression Techniques for Large Language Models (LLMs) in Long-Running Interactions

Authors: Payal Fofadiya, Sunil Tiwari Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29193v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	5.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

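Scoring rationale: The paper studies LLM performance problems in long-context interactions and proposes adaptive context compression. It is highly relevant to 'Large Language Models' and 'Context Window Extension' (10 points) because it directly targets the long-context challenges of LLMs. It is partly related to 'KV Cache Compression' and 'Speculative Decoding' (5 points): it concerns inference-efficiency optimization, but these are not its core methods. Other keywords such as MoE, SFT, and RAG are not covered in the paper (0 points).

!!! tip deepseek-chat TL;DR

The paper addresses the performance degradation that large language models suffer in long-running interactions as the context grows. It proposes an adaptive context compression framework that improves conversational stability and retrieval performance on several benchmarks while reducing computational overhead.

Abstract (Chinese translation)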

大型语言模型（LLM）在长程交互中常因上下文长度增加、内存饱和及计算开销上升而导致性能下降。本文提出一种自适应上下文压缩框架，该框架整合了重要性感知记忆选择、连贯性敏感过滤与动态预算分配机制，旨在控制上下文增长的同时保留关键对话信息。该方法在LOCOMO、LOCCO和LongBench基准测试上进行了评估，以检验其回答质量、检索准确性、连贯性保持能力及运行效率。实验结果表明，与现有的基于记忆和压缩的方法相比，所提出的方法在降低标记使用量和推理延迟的同时，实现了对话稳定性与检索性能的持续提升。这些发现表明，在持久性LLM交互中，自适应上下文压缩能够在长期记忆保持与计算效率之间达成有效平衡。

Abstract

Large Language Models (LLMs) often experience performance degradation during long-running interactions due to increasing context length, memory saturation, and computational overhead. This paper presents an adaptive context compression framework that integrates importance-aware memory selection, coherence-sensitive filtering, and dynamic budget allocation to retain essential conversational information while controlling context growth. The approach is evaluated on LOCOMO, LOCCO, and LongBench benchmarks to assess answer quality, retrieval accuracy, coherence preservation, and efficiency. Experimental results demonstrate that the proposed method achieves consistent improvements in conversational stability and retrieval performance while reducing token usage and inference latency compared with existing memory and compression-based approaches. These findings indicate that adaptive context compression provides an effective balance between long-term memory preservation and computational efficiency in persistent LLM interactions.

Keywords: Large Language Models, context compression, long-running interactions, memory selection, computational efficiency, inference latency, adaptive framework, conversational stability
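
Importance-aware selection under a token budget can be illustrated with a small greedy sketch. Everything here is an assumption made for the example: the importance scores are supplied by hand, token cost is approximated by whitespace word count, and the greedy importance-per-token rule is a stand-in, not the selection procedure used in the paper. Coherence filtering and dynamic budget allocation are not modeled.

```python
def compress_context(turns: list[str], importance: list[float], budget: int) -> list[str]:
    """Keep the most important turns that fit in `budget` tokens, in original order."""
    # Approximate token cost by word count (an assumption; real systems use a tokenizer).
    cost = [len(turn.split()) for turn in turns]
    # Rank turns by importance per token so short, important turns are picked first.
    ranked = sorted(range(len(turns)),
                    key=lambda i: importance[i] / max(cost[i], 1),
                    reverse=True)
    kept, used = set(), 0
    for i in ranked:
        if used + cost[i] <= budget:
            kept.add(i)
            used += cost[i]
    # Restore chronological order so the compressed dialogue still reads coherently.
    return [turns[i] for i in sorted(kept)]


turns = ["a b c d", "e f", "g h i", "j"]
scores = [0.1, 0.9, 0.6, 0.05]   # hypothetical importance scores
compressed = compress_context(turns, scores, budget=5)
```

With a budget of 5 tokens the sketch keeps the second and third turns, which together use the whole budget, and drops the two low-importance turns.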

232. ❌ 3D Architect: An Automated Approach to Three-Dimensional Modeling

Authors: Sunil Tiwari, Payal Fofadiya, Vicky Vishwakarma Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29191v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

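Scoring rationale: The paper "3D Architect: An Automated Approach to Three-Dimensional Modeling" belongs to computer vision and computer graphics. It proposes a method for automatically reconstructing 3D models from orthographic views, using Harris corner detection, geometric projection, envelope construction, and computational geometry. All scoring keywords concern large models, the technical principles of deep learning, or their applications in science, whereas this paper involves no large models, deep learning, AI techniques, or related applications, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes an automated method for reconstructing 3D models from orthographic views using corner detection, geometric projection, and computational geometry, and renders the resulting 3D object with OpenGL.

Abstract (Chinese translation)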

本文旨在利用一组正交视图实现三维物体的重建。首先在输入视图上应用角点检测器（哈里斯检测器）以获取控制点。这些控制点被垂直投影至各自对应的视图方向，从而构建出三维包络体。通过计算这些相互垂直的包络体的交集，获得描述三维物体的一组空间点集。随后运用计算几何方法，基于该点集重建物体的表面模型。最终，使用OpenGL对三维物体进行渲染。

Abstract

The aim of our paper is to render an object in 3-dimension using a set of its orthographic views. Corner detector (Harris Detector) is applied on the input views to obtain control points. These control points are projected perpendicular to respective views, in order to construct an envelope. A set of points describing the object in 3-dimension, are obtained from the intersection of these mutually perpendicular envelopes. These set of points are used to regenerate the surfaces of the object using computational geometry. At the end, the object in 3-dimension is rendered using OpenGL.

Keywords: 3D modeling, orthographic views, Harris corner detector, computational geometry, OpenGL rendering, automated reconstruction, control points, envelope construction
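
The idea of intersecting mutually perpendicular envelopes can be sketched on a voxel grid. This is a simplified illustration under stated assumptions: it works on filled binary silhouettes rather than on Harris control points, the index conventions (`front[z][x]`, `side[z][y]`, `top[y][x]`) are chosen for the example, and surface regeneration and OpenGL rendering are left out.

```python
def carve(front, side, top):
    """Return the voxels (x, y, z) consistent with all three orthographic silhouettes.

    front[z][x]: silhouette seen along the y axis
    side[z][y]:  silhouette seen along the x axis
    top[y][x]:   silhouette seen along the z axis
    """
    nz, nx, ny = len(front), len(front[0]), len(top)
    # Each silhouette, extruded along its viewing direction, forms one envelope;
    # a voxel survives only if it lies inside all three envelopes.
    return {
        (x, y, z)
        for x in range(nx)
        for y in range(ny)
        for z in range(nz)
        if front[z][x] and side[z][y] and top[y][x]
    }


# A 2x2x2 grid where the front view is missing one cell (x=1, z=0).
front = [[1, 0],
         [1, 1]]
side = [[1, 1],
        [1, 1]]
top = [[1, 1],
       [1, 1]]
voxels = carve(front, side, top)
```

The missing cell in the front view removes a full column of voxels along the y axis, leaving 6 of the 8 voxels.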

233. ❌ SLVMEval: Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation

Authors: Ryosuke Matsuda, Keito Kudo, Haruto Yoshida, Nobuyuki Shimizu, Jun Suzuki Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29186v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

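Scoring rationale: The paper focuses on an evaluation benchmark for text-to-long-video generation (SLVMEval), covering video quality assessment, synthetic data generation, and crowdsourced validation, but it does not involve large models, the technical principles of deep learning, or AI applications in science. All keywords concern large-model techniques, training methods, inference optimization, alignment, agent systems, compression, AI for Science, and similar topics, whereas this paper studies a meta-evaluation benchmark for video generation evaluation and has no direct connection to those areas.

!!! tip deepseek-chat TL;DR

The paper proposes the SLVMEval benchmark for meta-evaluating the evaluation systems used for text-to-long-video generation. Experiments find that existing evaluation systems fall short of human accuracy in 9 of the 10 aspects.

Abstract (Chinese translation)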

本文提出合成长视频元评估（SLVMEval），这是一个用于元评估文本到视频（T2V）评价系统的基准。所提出的SLVMEval基准专注于评估这些系统在长达10,486秒（约3小时）视频上的表现。该基准针对一个基本要求，即这些系统能否在人类易于判断的场景中准确评估视频质量。我们采用基于成对比较的元评估框架。基于密集视频描述数据集，我们通过合成方式对源视频进行降质处理，在10个不同方面创建受控的“高质量与低质量”视频对。随后，我们通过众包筛选并仅保留那些降质效果清晰可辨的视频对，从而建立一个有效的最终测试平台。利用此测试平台，我们评估了现有评价系统在对这些视频对进行排序时的可靠性。实验结果表明，人类评估者能以84.7%-96.8%的准确率识别更优的长视频，而在10个方面中的9个方面，现有系统的准确率均低于人类评估水平，揭示了文本到长视频评估中存在的不足。

Abstract

This paper proposes the synthetic long-video meta-evaluation (SLVMEval), a benchmark for meta-evaluating text-to-video (T2V) evaluation systems. The proposed SLVMEval benchmark focuses on assessing these systems on videos of up to 10,486 s (approximately 3 h). The benchmark targets a fundamental requirement, namely, whether the systems can accurately assess video quality in settings that are easy for humans to assess. We adopt a pairwise comparison-based meta-evaluation framework. Building on dense video-captioning datasets, we synthetically degrade source videos to create controlled “high-quality versus low-quality” pairs across 10 distinct aspects. Then, we employ crowdsourcing to filter and retain only those pairs in which the degradation is clearly perceptible, thereby establishing an effective final testbed. Using this testbed, we assess the reliability of existing evaluation systems in ranking these pairs. Experimental results demonstrate that human evaluators can identify the better long video with 84.7%-96.8% accuracy, and in nine of the 10 aspects, the accuracy of these systems falls short of human assessment, revealing weaknesses in text-to-long-video evaluation.

Keywords: text-to-video generation, video evaluation benchmark, meta-evaluation, long video, synthetic degradation, pairwise comparison, crowdsourcing, human assessment
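
The pairwise meta-evaluation described in the abstract reduces to a simple check: for each pair, does the evaluator score the original video above its degraded counterpart? A minimal sketch follows. The aspect names, the dictionary representation of a video, and the toy evaluator are placeholders for illustration, not the benchmark's data or API.

```python
from collections import defaultdict


def pairwise_accuracy(evaluator, pairs):
    """Per-aspect fraction of pairs where the evaluator prefers the better video.

    pairs: iterable of (aspect, better_video, degraded_video)
    evaluator: callable mapping a video to a numeric quality score
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for aspect, better, degraded in pairs:
        totals[aspect] += 1
        if evaluator(better) > evaluator(degraded):
            hits[aspect] += 1
    return {aspect: hits[aspect] / totals[aspect] for aspect in totals}


# Hypothetical pairs; each "video" is reduced to a single toy feature.
toy_pairs = [
    ("aspect_1", {"sharpness": 0.9}, {"sharpness": 0.4}),
    ("aspect_1", {"sharpness": 0.5}, {"sharpness": 0.7}),
    ("aspect_2", {"sharpness": 0.8}, {"sharpness": 0.2}),
]
accuracy = pairwise_accuracy(lambda video: video["sharpness"], toy_pairs)
```

Reporting accuracy per aspect, as here, is what allows a statement such as "below human accuracy in nine of the 10 aspects".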

234. ❌ Refined Detection for Gumbel Watermarking

Authors: Tor Lattimore Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.30017v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

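Scoring rationale: The paper studies the detection mechanism for the Gumbel watermarking scheme, which belongs to the field of watermarking, and has no direct connection to any of the scoring keywords (all of which concern the technical principles, applications, or related methods of large models and deep learning).

!!! tip deepseek-chat TL;DR

For the Gumbel watermarking scheme proposed by Aaronson (2022), the paper proposes a simple detection mechanism that is proven to be near-optimal among model-agnostic watermarking schemes.

Abstract (Chinese translation)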

我们针对Aaronson（2022）提出的Gumbel水印方案提出了一种简单的检测机制。在假设下一个词元分布为独立同分布采样的前提下，该新机制被证明在所有与模型无关的水印方案中，在问题依赖意义上接近最优。

Abstract

We propose a simple detection mechanism for the Gumbel watermarking scheme proposed by Aaronson (2022). The new mechanism is proven to be near-optimal in a problem-dependent sense among all model-agnostic watermarking schemes under the assumption that the next-token distribution is sampled i.i.d.

Keywords: Gumbel watermarking, detection mechanism, model-agnostic, near-optimal, next-token distribution, i.i.d. sampling, watermarking scheme
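
For context, the Gumbel watermark and its commonly described baseline detector can be sketched in a few lines. At each step the sampler picks the token maximizing r^(1/p), where r is a keyed pseudorandom number in (0, 1) and p is the token's probability; the detector sums -ln(1 - r) over the observed tokens, which averages about 1 per token for unwatermarked text and more for watermarked text. This sketch does not reproduce the refined detector proposed in the paper. The toy uniform vocabulary, the SHA-256 based `prf`, and seeding by token position instead of the preceding tokens are simplifying assumptions.

```python
import hashlib
import math
import random

VOCAB = ["a", "b", "c", "d"]
PROBS = [0.25, 0.25, 0.25, 0.25]  # toy next-token distribution, identical at every step


def prf(key: str, position: int, token: str) -> float:
    """Keyed pseudorandom number strictly inside (0, 1)."""
    digest = hashlib.sha256(f"{key}|{position}|{token}".encode()).digest()
    n = int.from_bytes(digest[:8], "big") >> 12  # keep 52 bits
    return (n + 0.5) / 2**52


def sample_watermarked(key: str, length: int) -> list[str]:
    """Gumbel-max style sampling: choose argmax over tokens of r ** (1 / p)."""
    out = []
    for t in range(length):
        out.append(max(VOCAB,
                       key=lambda tok: prf(key, t, tok) ** (1.0 / PROBS[VOCAB.index(tok)])))
    return out


def detection_score(key: str, tokens: list[str]) -> float:
    """Baseline statistic: sum of -ln(1 - r) for the tokens that were actually emitted."""
    return sum(-math.log(1.0 - prf(key, t, tok)) for t, tok in enumerate(tokens))


length = 400
watermarked = sample_watermarked("secret", length)
plain = random.Random(0).choices(VOCAB, weights=PROBS, k=length)
wm_score = detection_score("secret", watermarked)
plain_score = detection_score("secret", plain)
```

In this toy setup the watermarked score is expected to be roughly twice the sequence length while the unwatermarked score stays near the sequence length, so a threshold between the two separates them.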

235. ❌ Aligning Validation with Deployment: Target-Weighted Cross-Validation for Spatial Prediction

Authors: Alexander Brenning, Thomas Suesse Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29981v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

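Scoring rationale: The paper focuses on improving cross-validation for spatial prediction and proposes Target-Weighted CV (TWCV) to address the mismatch between validation-task and deployment-task distributions. Its content falls entirely within statistics, validation methods for machine learning, and spatial data analysis, and it does not involve large models, the technical principles of deep learning, AI applications, or related keywords. All keywords revolve around large-model techniques, training methods, inference optimization, and AI applications, and have no connection to this paper's study of statistical validation methods.

!!! tip deepseek-chat TL;DR

The paper addresses the bias that arises in cross-validation for spatial prediction when validation-task and deployment-task distributions do not match. It proposes Target-Weighted CV (TWCV), which uses calibration weighting and spatially buffered resampling to estimate deployment risk more accurately, and validates it in a simulation study and an environmental pollutant mapping case study.

Abstract (Chinese translation)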

交叉验证（CV）通常用于在缺乏独立测试数据时估计预测风险。其有效性依赖于以下假设：验证任务与部署期间遇到的预测任务来自同一分布。在空间预测及其他结构化数据场景中，该假设常被违反，导致对部署风险的估计产生偏差。我们提出目标加权交叉验证（Target-Weighted CV, TWCV），这是一种能够解释验证任务与部署任务分布差异的部署风险估计方法，从而同时考虑（1）协变量偏移和（2）任务难度偏移。我们通过协变量和空间配置等描述符来刻画预测任务。TWCV通过对验证损失赋予权重，使得验证任务的加权经验分布与目标域上的对应分布相匹配。权重通过校准加权法获得，从而得到一个以部署风险为目标的重要性加权估计量。由于TWCV要求对部署分布的支持集具有充分覆盖，我们将其与空间缓冲重采样相结合，以提升任务难度分布的多样性。在一项模拟研究中，传统估计量及空间估计量均因采样方式不同而表现出显著偏差，而缓冲TWCV在所有场景中均保持近似无偏。一项环境污染制图的案例研究进一步证实，验证任务与部署任务分布间的差异会影响性能评估，且缓冲TWCV能更好地反映目标域上的预测任务。这些结果确立了任务分布不匹配是空间预测中交叉验证偏差的主要来源，并表明校准加权与合适的验证任务生成器相结合，可为数据集偏移下的预测风险估计提供可行途径。

Abstract

Cross-validation (CV) is commonly used to estimate predictive risk when independent test data are unavailable. Its validity depends on the assumption that validation tasks are sampled from the same distribution as prediction tasks encountered during deployment. In spatial prediction and other settings with structured data, this assumption is frequently violated, leading to biased estimates of deployment risk. We propose Target-Weighted CV (TWCV), an estimator of deployment risk that accounts for discrepancies between validation and deployment task distributions, thus accounting for (1) covariate shift and (2) task-difficulty shift. We characterize prediction tasks by descriptors such as covariates and spatial configuration. TWCV assigns weights to validation losses such that the weighted empirical distribution of validation tasks matches the corresponding distribution over a target domain. The weights are obtained via calibration weighting, yielding an importance-weighted estimator that targets deployment risk. Since TWCV requires adequate coverage of the deployment distribution’s support, we combine it with spatially buffered resampling that diversifies the task difficulty distribution. In a simulation study, conventional as well as spatial estimators exhibit substantial bias depending on sampling, whereas buffered TWCV remains approximately unbiased across scenarios. A case study in environmental pollution mapping further confirms that discrepancies between validation and deployment task distributions can affect performance assessment, and that buffered TWCV better reflects the prediction task over the target domain. These results establish task distribution mismatch as a primary source of CV bias in spatial prediction and show that calibration weighting combined with a suitable validation task generator provides a viable approach to estimating predictive risk under dataset shift.

Keywords: cross-validation, spatial prediction, target-weighted CV, deployment risk, task distribution mismatch, calibration weighting, covariate shift, dataset shift
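
The weighting idea can be illustrated with the simplest form of calibration, post-stratification on a single categorical task descriptor. This is a sketch under assumptions, not the estimator from the paper: the descriptor bins (`"near"`, `"far"`), the target shares, and the function name are invented for the example, and the paper calibrates on richer descriptors such as covariates and spatial configuration.

```python
from collections import Counter


def target_weighted_risk(losses, bins, target_share):
    """Weighted mean of validation losses whose bin distribution matches the target.

    losses:       validation loss per task
    bins:         descriptor bin per task (e.g. distance class to training data)
    target_share: share of each bin in the deployment (target) domain, summing to 1
    """
    counts = Counter(bins)
    # The weighting only works if every bin present in the target is covered
    # by at least one validation task.
    missing = [b for b, share in target_share.items() if share > 0 and counts[b] == 0]
    if missing:
        raise ValueError(f"no validation tasks cover target bins: {missing}")
    weights = [target_share.get(b, 0.0) / counts[b] for b in bins]
    return sum(w * loss for w, loss in zip(weights, losses))


losses = [1.0, 1.0, 1.0, 4.0]
bins = ["near", "near", "near", "far"]
plain_cv = sum(losses) / len(losses)                                   # 1.75
weighted = target_weighted_risk(losses, bins, {"near": 0.5, "far": 0.5})
```

Here the validation tasks are mostly easy ("near") while half of the deployment domain is hard ("far"), so the unweighted estimate of 1.75 understates the target-weighted risk of 2.5. The coverage check mirrors the abstract's point that the method needs adequate coverage of the deployment distribution's support.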

236. ❌ Meteorology-Driven GPT4AP: A Multi-Task Forecasting LLM for Atmospheric Air Pollution in Data-Scarce Settings

Authors: Prasanjit Dey, Soumyabrata Dev, Bianca Schoen-Phelan Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29974v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	15.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

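Scoring rationale: GPT4AP builds on a pre-trained GPT-2 model and uses a parameter-efficient fine-tuning method (rsLoRA) for multi-task air pollution forecasting, which makes it an application of large models in science. Core relevant keywords: 1) 'Large Language Models' (uses GPT-2; weight 1.0, relevance 10); 2) 'Pre-training' (built on a pre-trained model; weight 1.0, relevance 10); 3) 'Post-training' (performs supervised fine-tuning; weight 1.0, relevance 10); 4) 'PEFT' (uses rsLoRA for parameter-efficient fine-tuning; weight 1.0, relevance 15); 5) 'AI for Science' (applied to the scientific domain of atmospheric pollution forecasting; weight 1.0, relevance 10). Other keywords such as MoE, SLMs, and RAG are not covered and have a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes GPT4AP, a parameter-efficient multi-task forecasting framework based on a pre-trained GPT-2 and rsLoRA for air pollution forecasting in data-scarce settings. It outperforms the reported baselines in the few-shot and zero-shot settings and remains competitive in long-term forecasting, where specialized time-series models are slightly more accurate.

Abstract (Chinese translation)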

准确预测空气污染对于环境监测与政策制定至关重要，然而在观测数据稀疏的区域，数据驱动模型往往泛化能力有限。本文提出气象驱动的空气污染预测GPT模型（GPT4AP），这是一个基于预训练GPT-2主干网络和高斯秩稳定低秩自适应（rsLoRA）的参数高效多任务预测框架。该模型冻结自注意力层和前馈网络层，仅适配轻量级位置编码模块与输出模块，显著减少了可训练参数量。GPT4AP在六个真实空气质量监测数据集上进行了小样本、零样本和长期预测场景的评估。在使用10%训练数据的小样本场景中，GPT4AP取得了平均MSE/MAE为0.686/0.442的指标，优于DLinear（0.728/0.530）和ETSformer（0.734/0.505）。在零样本跨站点迁移任务中，本模型获得平均MSE/MAE为0.529/0.403，相较于现有基线模型展现出更强的泛化能力。在使用全量训练数据的长期预测任务中，GPT4AP保持竞争力，取得平均MAE为0.429，而专用时间序列模型仅呈现轻微更低的误差。这些结果表明，GPT4AP提供了一种数据高效的预测方法，在有限监督和领域偏移条件下表现稳健，同时在数据充足场景中仍保持具有竞争力的预测精度。

Abstract

Accurate forecasting of air pollution is important for environmental monitoring and policy support, yet data-driven models often suffer from limited generalization in regions with sparse observations. This paper presents Meteorology-Driven GPT for Air Pollution (GPT4AP), a parameter-efficient multi-task forecasting framework based on a pre-trained GPT-2 backbone and Gaussian rank-stabilized low-rank adaptation (rsLoRA). The model freezes the self-attention and feed-forward layers and adapts lightweight positional and output modules, substantially reducing the number of trainable parameters. GPT4AP is evaluated on six real-world air quality monitoring datasets under few-shot, zero-shot, and long-term forecasting settings. In the few-shot regime using 10% of the training data, GPT4AP achieves an average MSE/MAE of 0.686/0.442, outperforming DLinear (0.728/0.530) and ETSformer (0.734/0.505). In zero-shot cross-station transfer, the proposed model attains an average MSE/MAE of 0.529/0.403, demonstrating improved generalization compared with existing baselines. In long-term forecasting with full training data, GPT4AP remains competitive, achieving an average MAE of 0.429, while specialized time-series models show slightly lower errors. These results indicate that GPT4AP provides a data-efficient forecasting approach that performs robustly under limited supervision and domain shift, while maintaining competitive accuracy in data-rich settings.

Keywords: air pollution forecasting, GPT-2, parameter-efficient fine-tuning, rsLoRA, multi-task learning, few-shot learning, zero-shot transfer, time-series prediction
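
The difference between standard LoRA and rank-stabilized LoRA (rsLoRA) lies in how the low-rank update is scaled: LoRA multiplies the update by alpha / r, rsLoRA by alpha / sqrt(r). The sketch below shows only that scaling on plain Python lists. It is a generic illustration with made-up matrices, not the GPT4AP architecture, and it does not model which modules the paper freezes or adapts.

```python
import math


def lora_scale(alpha: float, rank: int, rank_stabilized: bool = True) -> float:
    """Scaling applied to the low-rank update B @ A."""
    return alpha / math.sqrt(rank) if rank_stabilized else alpha / rank


def matvec(matrix, vector):
    """Matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def adapted_forward(W, A, B, x, alpha, rank_stabilized=True):
    """Compute (W + scale * B @ A) @ x with W frozen and only A, B trainable."""
    rank = len(A)                      # A has shape (rank, d_in), B has shape (d_out, rank)
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = lora_scale(alpha, rank, rank_stabilized)
    return [b + scale * d for b, d in zip(base, delta)]


# Toy example with rank 1 (all values are illustrative).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[1.0], [0.0]]
y = adapted_forward(W, A, B, [1.0, 2.0], alpha=2.0)
```

At rank 64 with alpha 16, `lora_scale` gives 2.0 for rsLoRA and 0.25 for standard LoRA, which shows how the standard scaling shrinks the update much faster as the rank grows.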

237. ❌ Do covariates explain why these groups differ? The choice of reference group can reverse conclusions in the Oaxaca-Blinder decomposition

Authors: Manuel Quintero, Advik Shreekumar, William T. Stephenson, Tamara Broderick Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29972v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models	0.0	0.0/10	0.0
Mixture of Experts	0.0	0.0/10	0.0
Small Language Models	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training	0.0	0.0/10	0.0
Post-training	0.0	0.0/10	0.0
Instruction Tuning	0.0	0.0/10	0.0
RLHF	0.0	0.0/10	0.0
PEFT	0.0	0.0/10	0.0
Retrieval-Augmented Generation	0.0	0.0/10	0.0
Context Window Extension	0.0	0.0/10	0.0
KV Cache Compression	0.0	0.0/10	0.0
Chain of Thought	0.0	0.0/10	0.0
System 2 Thinking	0.0	0.0/10	0.0
Monte Carlo Tree Search AND LLM	0.0	0.0/10	0.0
Self-Correction	0.0	0.0/10	0.0
LLM Agents	0.0	0.0/10	0.0
Tool Use	0.0	0.0/10	0.0
Multi-agent Systems	0.0	0.0/10	0.0
Quantization	0.0	0.0/10	0.0
Speculative Decoding	0.0	0.0/10	0.0
Hallucination Mitigation	0.0	0.0/10	0.0
Mechanistic Interpretability	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging	0.0	0.0/10	0.0
In-context Learning	0.0	0.0/10	0.0
AI for Science	0.0	0.0/10	0.0

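Scoring rationale: The paper studies the Oaxaca-Blinder decomposition, a method from statistics, and examines how the choice of reference group affects conclusions; it is research on traditional statistical methods. It does not involve large models, deep learning, AI techniques, or AI-for-Science applications, and is unrelated to all scoring keywords.

!!! tip deepseek-chat TL;DR

The paper studies how the choice of reference group in the Oaxaca-Blinder decomposition affects statistical conclusions. It finds that different reference groups can lead to substantively different conclusions, but that such discrepancies are rare in the real-data analyses studied.

Abstract (Chinese translation)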

科学家常需解释为何两组结果存在差异。例如，两家医院患者死亡率的差异可能源于患者自身特征（协变量）的差异，也可能源于医疗护理（给定协变量下的结果）的差异。Oaxaca-Blinder分解（OBD）是区分这些因素的标准工具。众所周知，OBD需要选择其中一组作为参照组，而数值结果会随参照组的选择而变化。据我们所知，目前尚未有系统研究探讨OBD参照组的选择是否会导致不同的实质性结论，以及该问题的普遍性如何。本文通过真实数据与模拟数据的存在性证明，表明OBD参照组的选择确实可能产生实质不同的结论，且这些差异并非完全由模型设定错误或数据量小所驱动。我们证明，在高达一半的参数空间中会出现实质性不同的结论，但在所研究的真实数据分析中，这类差异较为罕见。通过考察现实数据生成过程如何偏向于那些在OBD下不会改变结论的参数，我们解释了这种经验上的罕见性。

Abstract

Scientists often want to explain why an outcome is different in two groups. For instance, differences in patient mortality rates across two hospitals could be due to differences in the patients themselves (covariates) or differences in medical care (outcomes given covariates). The Oaxaca–Blinder decomposition (OBD) is a standard tool to tease apart these factors. It is well known that the OBD requires choosing one of the groups as a reference, and the numerical answer can vary with the reference. To the best of our knowledge, there has not been a systematic investigation into whether the choice of OBD reference can yield different substantive conclusions and how common this issue is. In the present paper, we give existence proofs in real and simulated data that the OBD references can yield substantively different conclusions and that these differences are not entirely driven by model misspecification or small data. We prove that substantively different conclusions occur in up to half of the parameter space, but find these discrepancies rare in the real-data analyses we study. We explain this empirical rarity by examining how realistic data-generating processes can be biased towards parameters that do not change conclusions under the OBD.

Keywords: Oaxaca-Blinder decomposition, reference group, statistical analysis, covariate explanation, group differences, model misspecification, data-generating processes
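
How the reference group can flip the conclusion is easiest to see in a toy twofold decomposition with one covariate. The gap in mean outcomes is split into an explained part, (mean x_A - mean x_B) times the reference group's slope, and an unexplained remainder. The data and function names below are invented for illustration and are not taken from the paper.

```python
def mean(values):
    return sum(values) / len(values)


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def oaxaca_blinder(xa, ya, xb, yb, reference):
    """Twofold decomposition of mean(ya) - mean(yb) into (explained, unexplained)."""
    if reference not in ("A", "B"):
        raise ValueError("reference must be 'A' or 'B'")
    _, slope_a = fit_line(xa, ya)
    _, slope_b = fit_line(xb, yb)
    gap = mean(ya) - mean(yb)
    ref_slope = slope_a if reference == "A" else slope_b
    explained = (mean(xa) - mean(xb)) * ref_slope   # part attributed to covariates
    return explained, gap - explained               # remainder attributed to coefficients


# Toy data: group A follows y = 1 + 2x, group B follows y = 5 - x.
xa, ya = [2.0, 3.0, 4.0], [5.0, 7.0, 9.0]
xb, yb = [0.0, 1.0, 2.0], [5.0, 4.0, 3.0]
with_ref_a = oaxaca_blinder(xa, ya, xb, yb, reference="A")
with_ref_b = oaxaca_blinder(xa, ya, xb, yb, reference="B")
```

The overall gap is 3 in both cases, but with A as reference the covariate explains +4 of it, while with B as reference it explains -2, so the two choices disagree even on the sign of the explained component.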

238. ❌ Think Anywhere in Code Generation

Authors: Xue Jiang, Tianyu Zhang, Ge Li, Mengyang Liu, Taozhi Chen, Zhenhua Xu, Binhua Li, Wenpin Jiao, Zhi Jin, Yongbin Li, Yihong Dong Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29957v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

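Scoring rationale: The paper proposes the Think-Anywhere mechanism, whose core contribution is improving how LLMs reason during code generation by letting them think on demand at any position. It is highly relevant to "Large Language Models" (the core object of study) and to "Chain of Thought" and "System 2 Thinking" (it directly improves the reasoning mechanism). It is somewhat relevant to "Post-training" and "RLHF" (it uses cold-start training and outcome-based RL rewards) and to "Self-Correction" and "Explainable AI" (it touches on self-improvement and enhanced interpretability). Other keywords such as MoE, SLMs, and RAG are not covered.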

!!! tip deepseek-chat TL;DR

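To address the shortcomings of existing LLM reasoning mechanisms (such as upfront thinking) in code generation, the paper proposes Think-Anywhere, which lets the model reason on demand at any position while generating code. The method is trained with cold-start training followed by reinforcement learning, achieves state-of-the-art performance on multiple benchmarks, and offers enhanced interpretability.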

Abstract (Chinese translation)

近期推理大语言模型（LLMs）的进展主要依赖于前置思考模式，即在给出最终答案前完成推理过程。然而，该方法在代码生成任务中存在明显局限：由于问题的完整复杂性往往仅在代码实现过程中才显现，前置思考常显不足；同时，该方法无法在难度差异显著的代码生成过程中自适应地分配推理资源。本文提出“随处思考”（Think-Anywhere）——一种新颖的推理机制，使大语言模型能在代码生成过程中的任意标记位置按需触发思考。我们通过两阶段方法实现该机制：首先通过冷启动训练使大语言模型模仿推理模式，继而利用基于结果的强化学习奖励驱动模型自主探索触发推理的时机与位置。在四大主流代码生成基准测试（LeetCode、LiveCodeBench、HumanEval及MBPP）上的大量实验表明，“随处思考”在现有推理方法与近期后训练方法中均达到最优性能，且在不同大语言模型间展现出一致的泛化能力。进一步分析揭示，该机制使模型能在高熵值位置自适应触发推理，从而提供更强的可解释性。

Abstract

Recent advances in reasoning Large Language Models (LLMs) have primarily relied on upfront thinking, where reasoning occurs before the final answer. However, this approach suffers from critical limitations in code generation, where upfront thinking is often insufficient as problems’ full complexity only reveals itself during code implementation. Moreover, it cannot adaptively allocate reasoning effort throughout the code generation process where difficulty varies significantly. In this paper, we propose Think-Anywhere, a novel reasoning mechanism that enables LLMs to invoke thinking on-demand at any token position during code generation. We achieve Think-Anywhere by first teaching LLMs to imitate the reasoning patterns through cold-start training, then leveraging outcome-based RL rewards to drive the model’s autonomous exploration of when and where to invoke reasoning. Extensive experiments on four mainstream code generation benchmarks (i.e., LeetCode, LiveCodeBench, HumanEval, and MBPP) show that Think-Anywhere achieves state-of-the-art performance over both existing reasoning methods and recent post-training approaches, while demonstrating consistent generalization across diverse LLMs. Our analysis further reveals that Think-Anywhere enables the model to adaptively invoke reasoning at high-entropy positions, providing enhanced interpretability.

Keywords: Large Language Models, code generation, reasoning mechanism, on-demand thinking, reinforcement learning, state-of-the-art performance, interpretability, adaptive reasoning
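
The abstract's remark about "high-entropy positions" can be made concrete with a small analysis sketch. It computes the Shannon entropy of each next-token distribution and lists the positions above a threshold. This is only a hypothetical diagnostic: according to the abstract, the model learns when to think through RL rewards, not through a fixed entropy threshold, and the function names and the threshold here are assumptions.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in bits) of one next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def high_entropy_positions(step_distributions, threshold_bits):
    """Indices of generation steps whose entropy exceeds the threshold."""
    return [
        index
        for index, probs in enumerate(step_distributions)
        if token_entropy(probs) > threshold_bits
    ]


# Hypothetical distributions over a 4-token vocabulary at three steps:
# fully confident, maximally uncertain, and split between two tokens.
steps = [
    [1.0, 0.0, 0.0, 0.0],
    [0.25, 0.25, 0.25, 0.25],
    [0.5, 0.5, 0.0, 0.0],
]
uncertain_steps = high_entropy_positions(steps, threshold_bits=1.5)
```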

239. ❌ Task Scarcity and Label Leakage in Relational Transfer Learning

Authors: Francisco Galuppo Azevedo, Clarissa Lima Loures, Denis Oliveira Correa Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29914v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

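Scoring rationale: The paper studies relational foundation models, which fall under foundation models, so it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points). It involves pretraining (pretrained tabular encoders) and transfer learning, so it is related to "Pre-training OR Continual Pre-training OR Domain Adaptation" (8 points). Other keywords such as MoE, SFT, RAG, inference acceleration, and AI for Science are not mentioned in the title or abstract and are scored 0.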

!!! tip deepseek-chat TL;DR

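The paper studies label leakage caused by task scarcity in relational foundation models and proposes a gradient projection method to suppress it, improving transfer performance on RelBench by 0.145 AUROC.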

Abstract (Chinese translation)

训练关系型基础模型需要学习能够跨任务迁移的表征，但可用的监督通常仅限于每个数据库中的少量预测目标。这种任务稀缺性会导致学习到的表征编码任务特定的捷径，从而损害甚至在同一模式内的迁移效果，我们将此问题称为标签泄漏。我们使用K-Space对此进行研究，该模块化架构将冻结的预训练表格编码器与轻量级消息传递核心相结合。为抑制泄漏，我们引入了一种梯度投影方法，该方法可从表征更新中移除标签预测方向。在RelBench基准上，该方法将数据集内迁移的平均AUROC提升了+0.145，通常能恢复接近单任务性能的水平。我们的结果表明，制约关系型基础模型的因素不仅是有限的数据，还包括有限的任务多样性。

Abstract

Training relational foundation models requires learning representations that transfer across tasks, yet available supervision is typically limited to a small number of prediction targets per database. This task scarcity causes learned representations to encode task-specific shortcuts that degrade transfer even within the same schema, a problem we call label leakage. We study this using K-Space, a modular architecture combining frozen pretrained tabular encoders with a lightweight message-passing core. To suppress leakage, we introduce a gradient projection method that removes label-predictive directions from representation updates. On RelBench, this improves within-dataset transfer by +0.145 AUROC on average, often recovering near single-task performance. Our results suggest that limited task diversity, not just limited data, constrains relational foundation models.

Keywords: relational foundation models, transfer learning, task scarcity, label leakage, gradient projection, pretrained encoders, within-dataset transfer, RelBench
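
The gradient projection idea mentioned in the abstract, removing a label-predictive direction from a representation update, can be sketched with plain vector algebra. The abstract does not say how the label-predictive directions are estimated, so this sketch assumes a single known direction; `project_out` is a hypothetical helper, not the paper's implementation.

```python
def project_out(gradient, direction):
    """Remove the component of `gradient` that lies along `direction`.

    Returns g - (g . d / d . d) * d, which is orthogonal to `direction`.
    """
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0.0:
        # A zero direction carries no information, so nothing is removed.
        return list(gradient)
    scale = sum(g * d for g, d in zip(gradient, direction)) / norm_sq
    return [g - scale * d for g, d in zip(gradient, direction)]


# Hypothetical update and label-predictive direction.
update = [3.0, 4.0]
label_direction = [1.0, 0.0]
filtered_update = project_out(update, label_direction)
```

After projection the filtered update has zero inner product with the assumed label direction, so a step along it cannot move the representation along that direction.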

240. ❌ Real-Time Explanations for Tabular Foundation Models

Authors: Luan Borges Teodoro Reis Sena, Francisco Galuppo Azevedo Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29946v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

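Scoring rationale: The paper's core topic is ShapPFN, a real-time interpretability method for tabular foundation models. It is highly relevant to "Foundation Models" (weight 1.0; 10 points) because it explicitly studies tabular foundation models. It also focuses on model interpretability, making it highly relevant to "Mechanistic Interpretability OR Explainable AI" (weight 1.0; 10 points). It mentions scientific machine learning, giving it some relevance to "AI for Science" (weight 1.0; 5 points), although it does not go into a specific scientific subfield such as bioinformatics. Other keywords such as MoE, SFT, RAG, and quantization are not covered in the abstract and are scored 0.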

!!! tip deepseek-chat TL;DR

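To address the high computational cost of existing explanation methods (such as SHAP) for tabular foundation models, the paper proposes ShapPFN, which integrates Shapley value regression into the architecture so that predictions and explanations are produced in a single forward pass. It achieves competitive performance on standard benchmarks and generates explanations over 1000× faster than KernelSHAP.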

Abstract (Chinese translation)

可解释性是科学机器学习的核心，因为理解模型*为何*做出预测能够促进假设的生成与验证。尽管表格基础模型展现出强大的性能，但现有的解释方法（如SHAP）计算成本高昂，限制了交互式探索。我们提出了ShapPFN，这是一个将沙普利值回归直接集成到其架构中的基础模型，能在单次前向传播中同时生成预测和解释。在标准基准测试中，ShapPFN在保持竞争力的性能的同时，能以超过KernelSHAP 1000倍的速度（0.06秒 vs 610秒）生成高保真度的解释（$R^2$=0.96，余弦相似度=0.99）。我们的代码可在 https://github.com/kunumi/ShapPFN 获取。

Abstract

Interpretability is central to scientific machine learning, as understanding *why* models make predictions enables hypothesis generation and validation. While tabular foundation models show strong performance, existing explanation methods like SHAP are computationally expensive, limiting interactive exploration. We introduce ShapPFN, a foundation model that integrates Shapley value regression directly into its architecture, producing both predictions and explanations in a single forward pass. On standard benchmarks, ShapPFN achieves competitive performance while producing high-fidelity explanations ($R^2$ = 0.96, cosine = 0.99) over 1000× faster than KernelSHAP (0.06 s vs. 610 s). Our code is available at https://github.com/kunumi/ShapPFN.

Keywords: tabular foundation models, interpretability, Shapley values, real-time explanations, ShapPFN, scientific machine learning, model explanations
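
To show why post-hoc Shapley explanations are costly, here is a minimal sketch of exact Shapley values computed by enumerating every feature subset. It is a textbook computation, not ShapPFN's architecture and not KernelSHAP (which approximates these values by sampling); the value function `toy_value` is a hypothetical two-feature model with an interaction term.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values; value_fn maps a frozenset of feature indices to a payoff.

    The loop visits every subset of the other features, so the cost grows
    exponentially with n_features.
    """
    values = []
    for feature in range(n_features):
        others = [j for j in range(n_features) if j != feature]
        phi = 0.0
        for size in range(n_features):
            # Probability that exactly `size` other features precede this one.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                gain = value_fn(coalition | {feature}) - value_fn(coalition)
                phi += weight * gain
        values.append(phi)
    return values


def toy_value(coalition):
    """Hypothetical model output when only the features in `coalition` are known."""
    payoff = 0.0
    if 0 in coalition:
        payoff += 2.0
    if 1 in coalition:
        payoff += 1.0
    if 0 in coalition and 1 in coalition:
        payoff += 4.0  # interaction term, split evenly by the Shapley rule
    return payoff


attributions = shapley_values(toy_value, n_features=2)
```

The attributions sum to the difference between the full-coalition payoff and the empty-coalition payoff, which is the efficiency property that makes Shapley values attractive as explanations.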

241. ❌ $p$-adic Character Neural Network

Authors: Tomoki Mihara Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29905v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

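Scoring rationale: The paper "p-adic Character Neural Network" proposes a neural network framework based on p-adic numbers, uses a p-adic character as the activation function, and proves a p-adic universal approximation theorem. The work belongs to pure mathematics and theoretical computer science, focusing on p-adic number theory and the mathematical foundations of neural networks, and has no direct connection to any of the scoring keywords (all of which concern large models, deep learning techniques, applications, or optimization methods). The keywords target the technology stack of modern deep learning and large language models, whereas this paper studies the theoretical application of abstract mathematical structures to neural networks and does not involve any specific technique, method, or application scenario named in the keywords.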

!!! tip deepseek-chat TL;DR

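The paper proposes a new framework for p-adic neural networks that uses a p-adic character as the activation function, proves a universal approximation theorem for it, and reduces the problem to the feasibility of polynomial equations over a finite ring.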

Abstract (Chinese translation)

我们提出了一种新的$p$进神经网络框架。与S. Albeverio、A. Khrennikov和B. Tirrozi提出的原始$p$进神经网络使用以精度超参数为索引的特征函数族作为激活函数不同，我们采用$p$进整数拓扑阿贝尔群$\mathbb{Z}_p$上的单射$p$进特征作为激活函数。我们证明了该$p$进神经网络表述的$p$进通用逼近定理，并将其归结为模$p$幂的有限整数环上多项式方程的可解性问题。

Abstract

We propose a new framework for $p$-adic neural networks. Unlike the original $p$-adic neural network by S. Albeverio, A. Khrennikov, and B. Tirrozi, which uses a family of characteristic functions indexed by precision hyperparameters as activation functions, we use a single injective $p$-adic character on the topological Abelian group $\mathbb{Z}_p$ of $p$-adic integers as the activation function. We prove a $p$-adic universal approximation theorem for this formulation of the $p$-adic neural network, and reduce it to the feasibility problem of polynomial equations over the finite ring of integers modulo a power of $p$.

Keywords: p-adic neural network, p-adic character, activation function, universal approximation theorem, topological Abelian group, polynomial equations, finite ring, mathematical foundation

242. ❌ Curvature-Guided LoRA: Steering in the pretrained NTK subspace

Authors: Frédéric Zheng, Alexandre Proutière Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29824v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

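Scoring rationale: The paper's core topic is CG-LoRA, an improvement on LoRA (parameter-efficient fine-tuning), so it is highly relevant to "PEFT OR LoRA OR Parameter-efficient Fine-tuning" (10 points). It concerns fine-tuning of large models, so it is relevant to "Large Language Models OR LLMs OR Foundation Models" (8 points) and "Post-training OR Supervised Fine-tuning OR SFT" (8 points). It mentions pretrained models, giving it some relevance to "Pre-training OR Continual Pre-training OR Domain Adaptation" (5 points). Other keywords such as MoE, SLMs, RAG, and alignment are not mentioned in the abstract and are scored 0.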

!!! tip deepseek-chat TL;DR

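To address the performance shortfall of parameter-efficient fine-tuning methods such as LoRA, the paper proposes the curvature-guided method CG-LoRA, which uses prediction alignment and a second-order formulation to achieve better fine-tuning results and faster convergence.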

Abstract (Chinese translation)

参数高效微调方法（如LoRA）能够实现对大型预训练模型的高效适配，但其性能往往不及全参数微调。现有方法主要关注参数更新的对齐，这仅能间接控制模型预测。本文中，我们提出了预测对齐问题，其目标是在输出层面上将通过参数高效微调获得的预测器与全参数微调所得的预测器相匹配。我们证明，该目标自然导出一个曲率感知的二阶优化形式，其中最优的低秩更新对应于一种牛顿式、经曲率白化的梯度。基于这一洞见，我们提出了曲率引导的LoRA（Curvature-Guided LoRA, CG-LoRA），该方法利用局部曲率信息来选择和缩放适配方向。我们的方法计算高效，且避免了显式的二阶矩阵构建。在标准自然语言理解基准上的初步实验表明，相较于现有LoRA变体，该方法在性能和收敛速度上均有提升。

Abstract

Parameter-efficient fine-tuning methods such as LoRA enable efficient adaptation of large pretrained models but often fall short of full fine-tuning performance. Existing approaches focus on aligning parameter updates, which only indirectly control model predictions. In this work, we introduce the prediction alignment problem, aiming to match the predictor obtained via PEFT to that of full fine-tuning at the level of outputs. We show that this objective naturally leads to a curvature-aware, second-order formulation, where optimal low-rank updates correspond to a Newton-like, curvature-whitened gradient. Based on this insight, we propose Curvature-Guided LoRA (CG-LoRA), which selects and scales adaptation directions using local curvature information. Our method is computationally efficient and avoids explicit second-order matrix construction. Preliminary experiments on standard natural language understanding benchmarks demonstrate improved performance and faster convergence compared to existing LoRA variants.

Keywords: LoRA, Parameter-efficient fine-tuning, Curvature-guided, Prediction alignment, Second-order optimization, CG-LoRA, Fine-tuning performance, Newton-like gradient
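
The phrase "Newton-like, curvature-whitened gradient" in the abstract can be illustrated on a toy problem. The sketch below rescales each gradient coordinate by a diagonal curvature estimate, which is a strong simplifying assumption: the abstract says CG-LoRA avoids explicit second-order matrices but does not describe its curvature estimate or its low-rank selection rule, so this is not the paper's algorithm. All names and numbers are hypothetical.

```python
def curvature_whitened_step(gradient, curvature_diag, damping=0.0):
    """Divide each gradient coordinate by its (damped) diagonal curvature."""
    return [g / (h + damping) for g, h in zip(gradient, curvature_diag)]


# Toy quadratic loss: 0.5 * sum(h_i * (w_i - t_i) ** 2), with very different
# curvature along the two coordinates.
weights = [0.0, 0.0]
target = [1.0, -2.0]
curvature = [4.0, 0.25]

# Gradient of the toy loss at the current weights.
gradient = [h * (w - t) for h, w, t in zip(curvature, weights, target)]

# One whitened step reaches the minimizer of this quadratic exactly, whereas a
# plain gradient step would need a learning rate tuned per coordinate.
step = curvature_whitened_step(gradient, curvature)
updated = [w - s for w, s in zip(weights, step)]
```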

243. ❌ DiSGMM: A Method for Time-varying Microscopic Weight Completion on Road Networks

Authors: Yan Lin, Jilin Hu, Shengnan Guo, Christian S. Jensen, Youfang Lin, Huaiyu Wan Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29837v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

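Scoring rationale: The paper "DiSGMM: A Method for Time-varying Microscopic Weight Completion on Road Networks" focuses on microscopic weight completion in traffic networks and proposes a method that combines sparsity-aware embeddings with spatiotemporal modeling to estimate the distribution of road-segment weights. The work involves traffic engineering, spatiotemporal data modeling, and machine learning (in particular Gaussian mixture models), but it does not involve large language models (LLMs), innovations in deep learning techniques, AI for Science, or any other keyword. The paper is unrelated to the keyword themes of large models, deep learning techniques, and AI applications in science, so every relevance score is 0.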

!!! tip deepseek-chat TL;DR

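The paper studies time-varying microscopic weight completion in traffic networks and proposes DiSGMM, which combines sparsity-aware embeddings with spatiotemporal modeling, can effectively estimate complex distributions of road-segment weights, and outperforms existing methods on real-world datasets.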

Abstract (Chinese translation)

微观路网权重代表从个体车辆获取的细粒度、时变交通状况。例如车辆通过路段时关联的行程速度。这些权重支持包括交通微观仿真与具有可靠性保证的车辆路径规划在内的任务。我们研究时变微观权重补全问题：在单个时间段内，可用权重通常仅覆盖部分路段。权重补全旨在恢复当前时间段内每个路段权重的分布。该问题面临两大挑战：(i) 需应对两层稀疏性——权重在网络层面（许多路段缺乏权重）和路段层面（单个路段可能因权重不足而无法进行准确分布估计）均存在缺失；(ii) 需获得闭式且能灵活捕捉复杂条件（包括重尾和多峰分布）的权重分布表征。
为应对这些挑战，我们提出DiSGMM模型，该模型将稀疏感知嵌入与时空建模相结合，利用稀疏已知权重以及学习到的路段属性和长程相关性进行分布估计。DiSGMM将微观权重的分布表征为可学习的高斯混合模型，从而提供能够灵活捕捉复杂条件的闭式分布。在两个真实数据集上的实验表明，DiSGMM的性能优于现有先进方法。

Abstract

Microscopic road-network weights represent fine-grained, time-varying traffic conditions obtained from individual vehicles. An example is travel speeds associated with road segments as vehicles traverse them. These weights support tasks including traffic microsimulation and vehicle routing with reliability guarantees. We study the problem of time-varying microscopic weight completion. During a time slot, the available weights typically cover only some road segments. Weight completion recovers distributions for the weights of every road segment at the current time slot. This problem involves two challenges: (i) contending with two layers of sparsity, where weights are missing at both the network layer (many road segments lack weights) and the segment layer (a segment may have insufficient weights to enable accurate distribution estimation); and (ii) achieving a weight distribution representation that is closed-form and can capture complex conditions flexibly, including heavy tails and multiple clusters. To address these challenges, we propose DiSGMM that combines sparsity-aware embeddings with spatiotemporal modeling to leverage sparse known weights alongside learned segment properties and long-range correlations for distribution estimation. DiSGMM represents distributions of microscopic weights as learnable Gaussian mixture models, providing closed-form distributions capable of capturing complex conditions flexibly. Experiments on two real-world datasets show that DiSGMM can outperform state-of-the-art methods.

Keywords: time-varying microscopic weight completion, road networks, sparsity-aware embeddings, spatiotemporal modeling, Gaussian mixture models, distribution estimation, traffic conditions, DiSGMM
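
The appeal of a Gaussian mixture as a closed-form weight distribution can be seen in a short sketch: both the density and the cumulative probability are finite sums of standard functions, even when the distribution has several clusters. The mixture parameters below are hypothetical travel-speed values for a single road segment, and the code does not reproduce how DiSGMM predicts those parameters.

```python
import math


def gmm_pdf(x, weights, means, stds):
    """Density of a one-dimensional Gaussian mixture at x."""
    return sum(
        w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for w, m, s in zip(weights, means, stds)
    )


def gmm_cdf(x, weights, means, stds):
    """Probability that a draw from the mixture is at most x."""
    return sum(
        w * 0.5 * (1.0 + math.erf((x - m) / (s * math.sqrt(2.0))))
        for w, m, s in zip(weights, means, stds)
    )


# Hypothetical segment: free-flow speeds near 50 km/h most of the time and a
# congested cluster near 15 km/h.
mix_weights = [0.7, 0.3]
mix_means = [50.0, 15.0]
mix_stds = [5.0, 4.0]

# Closed-form probability that the travel speed falls below 20 km/h, the kind
# of quantity that routing with reliability guarantees would query.
prob_slow = gmm_cdf(20.0, mix_weights, mix_means, mix_stds)
```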

244. ❌ Loss Gap Parity for Fairness in Heterogeneous Federated Learning

Authors: Brahim Erraji, Michaël Perrot, Aurélien Bellet Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29818v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

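Scoring rationale: The paper studies fairness in federated learning and proposes an algorithm named EAGLE to minimize disparities in loss gaps across clients. Its content is entirely about federated learning, distributed machine learning, fairness algorithms, and heterogeneous data settings, and it does not involve large language models, deep learning techniques, AI applications in science, or any related keyword. All keywords concern large models, deep learning techniques, or AI applications in science, whereas this paper is conventional federated learning research, so every relevance score is 0.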

!!! tip deepseek-chat TL;DR

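For heterogeneous federated learning, where clients expect the global model to perform well on their local data, the paper proposes a new algorithm, EAGLE, which regularizes the global model to minimize disparities in loss gaps across clients and thereby achieves fairness in relative improvement. Its effectiveness is validated both theoretically and experimentally.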

Abstract (Chinese translation)

尽管客户端参与联邦学习旨在提升其在本地罕见数据上的性能，他们通常仍保持自利性，期望全局模型在其自身数据上表现良好。这促使我们确立一个目标：确保所有客户端获得相近的损失差距——即全局模型与仅使用其本地数据可训练出的最优模型之间的性能差异。为此，我们提出EAGLE，一种新颖的联邦学习算法，通过显式正则化全局模型以最小化客户端间损失差距的差异。我们的方法在异构环境中尤为有效，其中客户端的最优本地模型可能存在错位。与现有仅追求损失平等、却可能损害多数客户端性能的方法不同，EAGLE致力于实现相对改进的公平性。我们在非凸损失函数下为EAGLE提供了理论收敛保证，并通过一种新颖的异质性度量刻画了其迭代过程相对于标准联邦学习目标的表现。实证研究表明，在凸与非凸场景下，相较于强基线方法，EAGLE通过优先优化距离其本地最优损失最远的客户端，有效降低了客户端间损失差距的差异，同时保持了具有竞争力的效用。

Abstract

While clients may join federated learning to improve performance on data they rarely observe locally, they often remain self-interested, expecting the global model to perform well on their own data. This motivates an objective that ensures all clients achieve a similar loss gap, that is, the difference in performance between the global model and the best model they could train using only their local data. To this end, we propose EAGLE, a novel federated learning algorithm that explicitly regularizes the global model to minimize disparities in loss gaps across clients. Our approach is particularly effective in heterogeneous settings, where the optimal local models of the clients may be misaligned. Unlike existing methods that encourage loss parity, potentially degrading performance for many clients, EAGLE targets fairness in relative improvements. We provide theoretical convergence guarantees for EAGLE under non-convex loss functions, and characterize how its iterates perform relative to the standard federated learning objective using a novel heterogeneity measure. Empirically, we demonstrate that EAGLE reduces the disparity in loss gaps among clients by prioritizing those furthest from their local optimal loss, while maintaining competitive utility in both convex and non-convex cases compared to strong baselines.

Keywords: federated learning, fairness, heterogeneous data, loss gap parity, EAGLE algorithm, client disparity, global model regularization, convergence guarantees
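
The loss gap defined in the abstract is simple to compute once each client's global-model loss and best local-model loss are known. The sketch below computes the gaps, a disparity measure, and the client furthest from its local optimum. The variance is used here only as an assumed stand-in for a disparity measure, since the abstract does not give EAGLE's regularizer, and all loss values are hypothetical.

```python
def loss_gaps(global_losses, best_local_losses):
    """Per-client gap: loss of the global model minus loss of the best local model."""
    return [g - l for g, l in zip(global_losses, best_local_losses)]


def gap_disparity(gaps):
    """Variance of the loss gaps across clients (assumed disparity measure)."""
    mean_gap = sum(gaps) / len(gaps)
    return sum((gap - mean_gap) ** 2 for gap in gaps) / len(gaps)


def furthest_client(gaps):
    """Index of the client whose global-model loss is furthest above its local optimum."""
    return max(range(len(gaps)), key=gaps.__getitem__)


# Hypothetical losses for three clients.
global_losses = [1.0, 0.5, 0.75]
best_local_losses = [0.25, 0.25, 0.25]

gaps = loss_gaps(global_losses, best_local_losses)
disparity = gap_disparity(gaps)
priority = furthest_client(gaps)
```

Loss parity would try to equalize `global_losses` directly, while loss gap parity equalizes `gaps`, which accounts for how well each client could have done on its own.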

245. ❌ AMShortcut: An Inference- and Training-Efficient Inverse Design Model for Amorphous Materials

Authors: Yan Lin, Jonas A. Finkler, Tao Du, Jilin Hu, Morten M. Smedskjaer Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29812v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

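Scoring rationale: The paper "AMShortcut: An Inference- and Training-Efficient Inverse Design Model for Amorphous Materials" focuses on inverse design of amorphous materials and proposes an efficient probabilistic generative model. Its core is the application of generative models to materials science, which belongs to AI for Science. Among all keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" is highly relevant (10 points), because the paper is clearly an application of AI to science (materials science). The other keywords concern large language models, training techniques, inference optimization, alignment, agents, and other specific techniques or concepts that the paper does not cover, so they are scored 0. The weighted total is computed as 10.0 (only one keyword is relevant).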

!!! tip deepseek-chat TL;DR

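The paper proposes AMShortcut, an efficient probabilistic generative model for inverse design of amorphous materials. It accurately infers material structures with only a few sampling steps and handles arbitrary combinations of properties after a single training run.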

Abstract (Chinese translation)

非晶态材料是缺乏长程原子序但具有复杂短程与中程序的一类固体。与可通过仅包含数个至数百个原子的晶胞描述的晶体材料不同，非晶态材料需要包含至少数百乃至数千个原子的更大模拟体系。基于概率生成模型的非晶态材料逆向设计，旨在根据一组目标性质生成非晶态材料的原子位置与元素组成。该方法已成为推动非晶态材料在能源存储与热管理等领域应用的前沿途径。本文提出AMShortcut——一种推理与训练高效的非晶态材料概率生成模型。AMShortcut仅需少量采样步骤即可精确推断非晶态材料中多样化的短程与中程结构，从而避免了因采样步骤过多导致的推理效率瓶颈。该模型仅需通过一次训练即可学习所有相关性质，并能在推理阶段根据目标性质的任意组合进行条件生成，无需为每种性质组合单独训练模型。在三个具有不同结构与性质的非晶材料数据集上的实验表明，AMShortcut成功实现了其设计目标。

Abstract

Amorphous materials are solids that lack long-range atomic order but possess complex short- and medium-range order. Unlike crystalline materials that can be described by unit cells containing a few to hundreds of atoms, amorphous materials require larger simulation cells with at least hundreds or often thousands of atoms. Inverse design of amorphous materials with probabilistic generative models aims to generate the atomic positions and elements of amorphous materials given a set of desired properties. It has emerged as a promising approach for facilitating the application of amorphous materials in domains such as energy storage and thermal management. In this paper, we introduce AMShortcut, an inference- and training-efficient probabilistic generative model for amorphous materials. AMShortcut enables accurate inference of diverse short- and medium-range structures in amorphous materials with only a few sampling steps, mitigating the need for an excessive number of sampling steps that hinders inference efficiency. AMShortcut can be trained once with all relevant properties and perform inference conditioned on arbitrary combinations of desired properties, mitigating the need for training one model for each combination. Experiments on three amorphous materials datasets with diverse structures and properties demonstrate that AMShortcut achieves its design goals.

Keywords: amorphous materials, inverse design, probabilistic generative model, inference-efficient, training-efficient, AMShortcut, short-range order, medium-range order

246. ❌ Multimodal Machine Learning for Early Prediction of Metastasis in a Swedish Multi-Cancer Cohort

Authors: Franco Rugolon, Korbinian Randl, Braslav Jovanovic, Ioanna Miliou, Panagiotis Papapetrou Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29793v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

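Scoring rationale: The paper uses multimodal machine learning (traditional and deep learning classifiers) to predict cancer metastasis risk, an application of AI in biomedicine, so it is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10). It uses SHAP for explainability analysis, which gives it some relevance to 'Mechanistic Interpretability OR Explainable AI' (5). The remaining keywords concern large-model techniques (LLMs, MoE, fine-tuning, inference optimization) or specific AI methods (RAG, CoT, agents), none of which the paper addresses, so they score 0.

!!! tip deepseek-chat TL;DR

The study develops a multimodal machine learning framework that uses electronic health record data to predict metastasis risk one month before diagnosis in four cancers (breast, colon, lung, and prostate). Deep learning models and intermediate fusion achieve the best predictive performance for most cancer types, with F1 scores of up to 0.845.

Abstract (Chinese translation)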

多模态机器学习通过整合电子健康记录（EHR）中的结构化与非结构化数据，为患者状态提供整体性视角。我们提出一个框架，利用EHR中六个月的临床病史数据，在诊断前一个月预测转移风险。本研究分析了卡罗林斯卡大学医院（瑞典斯德哥尔摩）收集的四种癌症队列数据：乳腺癌（n = 743）、结肠癌（n = 387）、肺癌（n = 870）和前列腺癌（n = 1890）。数据集包含人口统计学特征、合并症、实验室结果、用药信息及临床文本。我们比较了传统分类器与深度学习分类器在单模态及多模态组合中的表现，采用多种融合策略，并遵循个体预后或诊断多变量预测模型透明报告（TRIPOD）2a设计规范，以80-20的开发-验证分割确保评估的严谨性与可重复性。性能评估指标包括AUROC、AUPRC、F1分数、敏感性和特异性。随后，我们采用多模态适配的SHAP方法分析分类器的决策依据。中间融合策略在乳腺癌（0.845）、结肠癌（0.786）和前列腺癌（0.845）上获得最高的F1分数，展现出强大的预测性能。对于肺癌，中间融合的F1分数为0.819，而纯文本模型表现最佳，F1分数达0.829。深度学习分类器始终优于传统模型。样本量最小的结肠癌队列性能最低，凸显了充足训练数据的重要性。SHAP分析显示，不同模态的相对重要性因癌症类型而异。融合策略各有优劣：中间融合能持续提供最佳结果，但策略选择应与数据特征及机构需求相匹配。

Abstract

Multimodal Machine Learning offers a holistic view of a patient’s status, integrating structured and unstructured data from electronic health records (EHR). We propose a framework to predict metastasis risk one month prior to diagnosis, using six months of clinical history from EHR data. Data from four cancer cohorts collected at Karolinska University Hospital (Stockholm, Sweden) were analyzed: breast (n = 743), colon (n = 387), lung (n = 870), and prostate (n = 1890). The dataset included demographics, comorbidities, laboratory results, medications, and clinical text. We compared traditional and deep learning classifiers across single modalities and multimodal combinations, using various fusion strategies and a Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) 2a design, with an 80-20 development-validation split to ensure a rigorous, repeatable evaluation. Performance was evaluated using AUROC, AUPRC, F1 score, sensitivity, and specificity. We then employed a multimodal adaptation of SHAP to analyze the classifiers’ reasoning. Intermediate fusion achieved the highest F1 scores on breast (0.845), colon (0.786), and prostate cancer (0.845), demonstrating strong predictive performance. For lung cancer, the intermediate fusion achieved an F1 score of 0.819, while the text-only model achieved the highest, with an F1 score of 0.829. Deep learning classifiers consistently outperformed traditional models. Colon cancer, the smallest cohort, had the lowest performance, highlighting the importance of sufficient training data. SHAP analysis showed that the relative importance of modalities varied across cancer types. Fusion strategies offer distinct strengths and weaknesses. Intermediate fusion consistently delivered the best results, but strategy choices should align with data characteristics and organizational needs.

Keywords: Multimodal Machine Learning, Metastasis Prediction, Electronic Health Records, Deep Learning Classifiers, SHAP Analysis, Cancer Cohort, Intermediate Fusion, Clinical Text Analysis

247. ❌ Big2Small: A Unifying Neural Network Framework for Model Compression

Authors: Jing-Xiao Liao, Haoran Wang, Tao Li, Daoming Lyu, Yi Zhang, Chengjun Cai, Feng-Lei Fan Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29768v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

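Scoring rationale: The paper's core topic is model compression, which is highly relevant to 'Quantization OR Model Compression OR Low-bit Weights' (10). It mentions the development of foundation models, which gives it some relevance to 'Large Language Models OR LLMs OR Foundation Models' (5). Compressed models can be deployed on devices, which gives it some relevance to 'Small Language Models OR SLMs OR On-device AI' (5). The other keywords, such as MoE, Scaling Laws, training methods, inference acceleration, and AI for Science, are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper constructs a unifying, measure-theoretic mathematical framework for model compression and, building on it, proposes Big2Small, which encodes the weights of large models with implicit neural representations and reconstructs them at inference time. On image classification and segmentation tasks it achieves competitive accuracy and compression ratios.

Abstract (Chinese translation)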

随着基础模型的发展，模型压缩已成为一项关键需求。基于不同的启发式方法，研究者提出了多种模型压缩技术，如低秩分解、剪枝、量化、遍历动态系统以及知识蒸馏。为了将该领域从零散状态提升至具有原则性的学科，我们构建了一个基于测度论的、统一的模型压缩数学框架。我们进一步证明，每种模型压缩技术在数学上都等价于一个受到正则化约束的神经网络。基于这种数学与结构上的等价性，我们提出了一种经过实验验证的无数据模型压缩框架，称为 Big2Small。该框架将隐式神经表示从数据域转换到网络参数域。Big2Small 训练紧凑的隐式神经表示来编码大型模型的权重，并在推理过程中重建这些权重。为了提高重建保真度，我们引入了异常值感知预处理来处理极端权重值，以及一种频率感知损失函数来保留高频细节。在图像分类和分割任务上的实验表明，与最先进的基线方法相比，Big2Small 在精度和压缩率方面均取得了有竞争力的结果。

Abstract

With the development of foundational models, model compression has become a critical requirement. Various model compression approaches have been proposed such as low-rank decomposition, pruning, quantization, ergodic dynamic systems, and knowledge distillation, which are based on different heuristics. To elevate the field from fragmentation to a principled discipline, we construct a unifying mathematical framework for model compression grounded in measure theory. We further demonstrate that each model compression technique is mathematically equivalent to a neural network subject to a regularization. Building upon this mathematical and structural equivalence, we propose an experimentally-verified data-free model compression framework, termed Big2Small, which translates Implicit Neural Representations (INRs) from data domain to the domain of network parameters. Big2Small trains compact INRs to encode the weights of larger models and reconstruct the weights during inference. To enhance reconstruction fidelity, we introduce Outlier-Aware Preprocessing to handle extreme weight values and a Frequency-Aware Loss function to preserve high-frequency details. Experiments on image classification and segmentation demonstrate that Big2Small achieves competitive accuracy and compression ratios compared to state-of-the-art baselines.
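The idea of storing a compact function of weight coordinates instead of the weights themselves can be illustrated with a toy sketch. This is not the authors' method: a fixed Fourier-feature basis with a linear least-squares readout stands in for a trained INR, the names `coordinate_grid`, `fourier_features`, `encode`, and `decode` are hypothetical, and the outlier-aware preprocessing and frequency-aware loss from the abstract are not reproduced. The toy weight matrix is deliberately smooth so that it is exactly representable; real network weights generally are not.

```python
import numpy as np

def coordinate_grid(rows, cols):
    """Normalized (row, col) coordinates in [0, 1) for every matrix entry."""
    u, v = np.meshgrid(np.arange(rows) / rows, np.arange(cols) / cols, indexing="ij")
    return np.stack([u.ravel(), v.ravel()], axis=1)

def fourier_features(coords, max_freq):
    """Cosine and sine features for all integer frequency pairs up to max_freq."""
    ks = np.arange(-max_freq, max_freq + 1)
    freqs = np.array([(a, b) for a in ks for b in ks], dtype=float)
    phase = 2.0 * np.pi * coords @ freqs.T
    return np.concatenate([np.cos(phase), np.sin(phase)], axis=1)

def encode(weights, max_freq):
    """Fit the coordinate-to-weight map; the coefficient vector is what gets stored."""
    phi = fourier_features(coordinate_grid(*weights.shape), max_freq)
    theta, *_ = np.linalg.lstsq(phi, weights.ravel(), rcond=None)
    return theta

def decode(theta, shape, max_freq):
    """Reconstruct the full weight matrix from the stored coefficients."""
    phi = fourier_features(coordinate_grid(*shape), max_freq)
    return (phi @ theta).reshape(shape)

# Toy "weight matrix": smooth by construction (an assumption made for illustration).
grid = coordinate_grid(32, 32)
toy_weights = (np.sin(2 * np.pi * grid[:, 0]) + 0.5 * np.cos(4 * np.pi * grid[:, 1])).reshape(32, 32)

theta = encode(toy_weights, max_freq=2)
reconstructed = decode(theta, toy_weights.shape, max_freq=2)
print("stored coefficients:", theta.size, "original entries:", toy_weights.size)
print("max reconstruction error:", np.abs(reconstructed - toy_weights).max())
```

The compression ratio in this sketch is simply the number of matrix entries divided by the number of stored coefficients; a neural INR trades a larger decoder for the ability to fit far less regular weights.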

Keywords: model compression, foundational models, quantization, neural network framework, Implicit Neural Representations, weight encoding, data-free compression, outlier-aware preprocessing

248. ❌ HyperKKL: Learning KKL Observers for Non-Autonomous Nonlinear Systems via Hypernetwork-Based Input Conditioning

Authors: Yahia Salaheldin Shaaban, Abdelrahman Sayed Sayed, M. Umar B. Niazi, Karl Henrik Johansson Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29744v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

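Scoring rationale: The paper studies neural-network implementations of state observers for nonlinear systems, specifically KKL observers for non-autonomous systems. Although it uses neural networks (a hypernetwork), its field is control theory and dynamical systems rather than large language models, the technical principles of deep learning, or AI applications in science. All keywords concern large language models, deep learning techniques, or AI for science, and are unrelated to the paper's control-systems topic.

!!! tip deepseek-chat TL;DR

The paper proposes two hypernetwork-based designs of neural KKL observers (HyperKKL_obs and HyperKKL_dyn) for non-autonomous nonlinear systems with exogenous inputs. Experiments show that input conditioning reduces the state estimation error, measured by SMAPE, by 29% on average compared with static autonomous maps.

Abstract (Chinese translation)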

Kazantzis-Kravaris/Luenberger（KKL）观测器是一类针对非线性系统的状态观测器，其依赖于一个单射映射将非线性动力学变换至稳定的拟线性潜空间，再通过该变换映射的左逆从潜空间获取原始坐标下的状态估计。当前基于学习的映射设计方法专为自治系统而设，难以推广至受控或非自治系统。本文针对受外生输入影响动力学的非自治系统，提出了两种基于学习的神经KKL观测器设计。为此，我们提出了一种基于超网络的框架（$HyperKKL$），并采用两种输入条件化策略。首先，增强型观测器方法（$HyperKKL_{obs}$）在保持静态变换映射的同时，向潜观测器动力学添加输入依赖的修正项。其次，动态观测器方法（$HyperKKL_{dyn}$）利用超网络生成输入依赖的编码器与解码器权重，从而得到时变的变换映射。我们推导了状态估计误差的理论最坏情况上界。在四个非线性基准系统上的数值评估表明，输入条件化相较于静态自治映射能持续提升估计精度，在所有非零输入工况下平均对称平均绝对百分比误差（SMAPE）降低了29%。

Abstract

Kazantzis-Kravaris/Luenberger (KKL) observers are a class of state observers for nonlinear systems that rely on an injective map to transform the nonlinear dynamics into a stable quasi-linear latent space, from where the state estimate is obtained in the original coordinates via a left inverse of the transformation map. Current learning-based methods for these maps are designed exclusively for autonomous systems and do not generalize well to controlled or non-autonomous systems. In this paper, we propose two learning-based designs of neural KKL observers for non-autonomous systems whose dynamics are influenced by exogenous inputs. To this end, a hypernetwork-based framework ($HyperKKL$) is proposed with two input-conditioning strategies. First, an augmented observer approach ($HyperKKL_{obs}$) adds input-dependent corrections to the latent observer dynamics while retaining static transformation maps. Second, a dynamic observer approach ($HyperKKL_{dyn}$) employs a hypernetwork to generate encoder and decoder weights that are input-dependent, yielding time-varying transformation maps. We derive a theoretical worst-case bound on the state estimation error. Numerical evaluations on four nonlinear benchmark systems show that input conditioning yields consistent improvements in estimation accuracy over static autonomous maps, with an average symmetric mean absolute percentage error (SMAPE) reduction of 29% across all non-zero input regimes.
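The reported metric can be made concrete with a short sketch. The abstract does not say which SMAPE variant is used, so the definition below (absolute error divided by the mean of the absolute true and estimated values, averaged and expressed in percent) is one common convention and an assumption here. The helper `relative_reduction` is hypothetical and reads the 29% figure as a relative reduction, which the abstract does not confirm.

```python
import numpy as np

def smape(y_true, y_pred, eps=1e-12):
    """Symmetric mean absolute percentage error in percent (one common variant)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    # eps guards against division by zero when both values are exactly zero.
    return 100.0 * np.mean(np.abs(y_pred - y_true) / np.maximum(denom, eps))

def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline

# Illustrative numbers only; they are not taken from the paper.
true_state = np.array([1.0, 2.0, 4.0])
estimate = np.array([1.1, 1.8, 4.4])
print("SMAPE (%):", smape(true_state, estimate))
```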

Keywords: KKL observers, non-autonomous nonlinear systems, hypernetwork, input conditioning, state estimation, neural observers, dynamic transformation maps, estimation accuracy

249. ❌ mlr3mbo: Bayesian Optimization in R

Authors: Marc Becker, Lennart Schneider, Martin Binder, Lars Kotthoff, Bernd Bischl Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29730v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

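Scoring rationale: The paper "mlr3mbo: Bayesian Optimization in R" focuses on the development, evaluation, and application of a Bayesian optimization toolbox and belongs to traditional machine learning optimization. All scoring keywords revolve around large models, the technical principles of deep learning, and their applications (such as AI for Science), none of which this paper touches. It has no connection to any technique in the keyword list (LLMs, MoE, RLHF, RAG, and so on), nor does it mention scientific AI applications such as bioinformatics or cheminformatics. All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper presents mlr3mbo, an R toolbox for Bayesian optimization, and shows through benchmarks that it achieves state-of-the-art performance.

Abstract (Chinese translation)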

我们推出mlr3mbo——一个用于R语言贝叶斯优化的综合性模块化工具箱。该工具箱支持单目标与多目标优化、多点提案、批处理与异步并行化、输入输出转换以及鲁棒的错误处理功能。mlr3mbo不仅适用于应用场景中多种标准贝叶斯优化变体，研究人员还可通过其灵活的构建模块自定义贝叶斯优化算法。本文除介绍该软件的设计原理、架构模块外，还基于代理模型基准测试套件YAHPO Gym对软件进行了两项全面的实证评估。为确定数值优化与混合分层优化场景下的鲁棒默认配置，并深入探究各项参数设置的具体影响，我们在mlr3mbo配置空间上执行坐标下降搜索并分析其结果。此外，通过将mlr3mbo与HEBO、SMAC3、Ax、Optuna等多种优化器进行基准测试对比，我们证明该工具实现了业界领先的性能表现。

Abstract

We present mlr3mbo, a comprehensive and modular toolbox for Bayesian optimization in R. mlr3mbo supports single- and multi-objective optimization, multi-point proposals, batch and asynchronous parallelization, input and output transformations, and robust error handling. While it can be used for many standard Bayesian optimization variants in applied settings, researchers can also construct custom BO algorithms from its flexible building blocks. In addition to an introduction to the software, its design principles, and its building blocks, the paper presents two extensive empirical evaluations of the software on the surrogate-based benchmark suite YAHPO Gym. To identify robust default configurations for both numeric and mixed-hierarchical optimization regimes, and to gain further insights into the respective impacts of individual settings, we run a coordinate descent search over the mlr3mbo configuration space and analyze its results. Furthermore, we demonstrate that mlr3mbo achieves state-of-the-art performance by benchmarking it against a wide range of optimizers, including HEBO, SMAC3, Ax, and Optuna.

Keywords: Bayesian optimization, R package, multi-objective optimization, parallelization, benchmarking, YAHPO Gym, mlr3mbo, optimization algorithms

250. ❌ Unbounded Density Ratio Estimation and Its Application to Covariate Shift Adaptation

Authors: Ren-Rui Liu, Jun Fan, Lei Shi, Zheng-Chu Guo Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29725v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

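Scoring rationale: The paper focuses on unbounded density ratio estimation in statistical learning and its application to covariate shift adaptation, which belongs to traditional machine learning and statistical learning. It covers classical statistical learning topics such as density ratio estimation, covariate shift, importance weighting, and convergence analysis, and does not involve large models, deep learning, AI for Science, or related techniques. Since all keywords concern those areas, every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies unbounded density ratio estimation, a key challenge in statistical learning. It proposes a three-step estimation method, applies it to importance-weighted regression under covariate shift, and establishes non-asymptotic convergence guarantees with optimal or near-optimal rates.

Abstract (Chinese translation)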

本文聚焦于无界密度比估计问题——这一统计学习中研究不足但至关重要的挑战——及其在协变量偏移适应中的应用。现有文献大多假设密度比要么一致有界，要么无界但精确已知。这些条件在实践中常被违背，导致理论保证与现实适用性之间存在差距。与此不同，本研究直接处理无界密度比，并将其整合到重要性加权中以实现有效的协变量偏移适应。我们提出一种三步估计方法，该方法利用来自源分布和目标分布的未标记数据：（1）估计相对密度比；（2）应用截断操作以控制其无界性；（3）将截断估计量转换回标准密度比。所得密度比估计量随后被用作协变量偏移下回归的重要性权重。我们为所提出的密度比估计量及由此得到的回归函数估计量建立了严格的非渐近收敛保证，证明了其达到最优或接近最优的收敛速率。我们的研究结果为协变量偏移下的密度比估计与学习提供了新的理论见解，将经典学习理论拓展至更实际且更具挑战性的场景。

Abstract

This paper focuses on the problem of unbounded density ratio estimation – an understudied yet critical challenge in statistical learning – and its application to covariate shift adaptation. Much of the existing literature assumes that the density ratio is either uniformly bounded or unbounded but known exactly. These conditions are often violated in practice, creating a gap between theoretical guarantees and real-world applicability. In contrast, this work directly addresses unbounded density ratios and integrates them into importance weighting for effective covariate shift adaptation. We propose a three-step estimation method that leverages unlabeled data from both the source and target distributions: (1) estimating a relative density ratio; (2) applying a truncation operation to control its unboundedness; and (3) transforming the truncated estimate back into the standard density ratio. The estimated density ratio is then employed as importance weights for regression under covariate shift. We establish rigorous, non-asymptotic convergence guarantees for both the proposed density ratio estimator and the resulting regression function estimator, demonstrating optimal or near-optimal convergence rates. Our findings offer new theoretical insights into density ratio estimation and learning under covariate shift, extending classical learning theory to more practical and challenging scenarios.
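Steps (2) and (3) of the procedure, followed by importance-weighted regression, can be sketched as below. The sketch assumes the usual relative density ratio with mixing parameter `alpha`, r_alpha = w / (alpha * w + (1 - alpha)), where w is the standard density ratio. The abstract does not give the paper's exact definition, truncation level, or estimator, so the formula, the threshold `tau`, and all function names are assumptions. Step (1), estimating r_alpha from unlabeled data, is replaced here by computing it from known toy ratios.

```python
import numpy as np

def relative_ratio(w, alpha):
    """Relative density ratio r_alpha as a function of the standard ratio w (assumed form)."""
    return w / (alpha * w + (1.0 - alpha))

def truncate(r_alpha, tau):
    """Step 2: clip the relative ratio to [0, tau]; tau must stay below 1 / alpha."""
    return np.clip(r_alpha, 0.0, tau)

def to_standard_ratio(r_alpha, alpha):
    """Step 3: invert r = w / (alpha * w + 1 - alpha) to get w = (1 - alpha) r / (1 - alpha r)."""
    return (1.0 - alpha) * r_alpha / (1.0 - alpha * r_alpha)

def weighted_least_squares(X, y, weights):
    """Importance-weighted linear regression via row rescaling by sqrt(weight)."""
    sw = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef

alpha, tau = 0.5, 1.5
w_true = np.array([0.1, 1.0, 10.0, 1000.0])   # toy ratios, one of them very large
r = relative_ratio(w_true, alpha)              # bounded above by 1 / alpha = 2
w_trunc = to_standard_ratio(truncate(r, tau), alpha)

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 2.0 * X[:, 0]
coef = weighted_least_squares(X, y, w_trunc)
print("truncated importance weights:", w_trunc)
print("weighted regression coefficient:", coef)
```

With `alpha = 0.5` and `tau = 1.5`, no truncated weight can exceed (1 - alpha) * tau / (1 - alpha * tau) = 3, which is how truncating the bounded relative ratio controls the unbounded standard ratio.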

Keywords: unbounded density ratio estimation, covariate shift adaptation, importance weighting, non-asymptotic convergence guarantees, statistical learning, density ratio estimator, regression function estimator, convergence rates

251. ❌ Nonnegative Matrix Factorization in the Component-Wise L1 Norm for Sparse Data

Authors: Giovanni Seraghiti, Kévin Dubrulle, Arnaud Vandaele, Nicolas Gillis Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29715v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

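Scoring rationale: The paper studies a variant of nonnegative matrix factorization (NMF) under the L1 norm and its application to sparse data. It belongs to traditional machine learning and optimization and has no direct connection to the scoring keywords, all of which revolve around large models, deep learning techniques, and their applications. The paper involves no large models, deep learning, or AI for Science content, so all keywords receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a weighted L1-norm nonnegative matrix factorization model (wL1-NMF) and an efficient coordinate descent algorithm (sCD) for sparse data with heavy-tailed noise or outliers. It shows that L1-NMF promotes sparse factors and that the complexity of sCD scales with the number of nonzero entries in the data.

Abstract (Chinese translation)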

非负矩阵分解（Nonnegative matrix factorization, NMF）通过两个非负因子矩阵 $WH$ 的乘积来近似一个非负矩阵 $X$，其中 $W$ 有 $r$ 列，$H$ 有 $r$ 行。本文研究采用逐元素 L1 范数作为误差度量（L1-NMF）的 NMF 方法，该方法适用于受重尾噪声（如拉普拉斯噪声或椒盐噪声）污染或存在异常值的数据。我们的第一个贡献是证明了 L1-NMF 即使当 $r=1$ 时也是 NP 难问题，这与使用最小二乘的标准 NMF 形成对比。第二个贡献是证明了对于稀疏输入矩阵，L1-NMF 会强烈促使因子矩阵具有稀疏性，从而有利于模型的可解释性。然而，如果数据受到虚假零值的影响，过于稀疏的解可能会降低模型性能。我们的第三个贡献是针对稀疏数据提出了一种新的、更通用的 L1-NMF 模型，称为加权 L1-NMF（weighted L1-NMF, wL1-NMF），该模型通过对数据中零值对应的 $WH$ 元素添加惩罚参数来控制分解的稀疏性。第四个贡献是为 wL1-NMF 提出了一种新的坐标下降（coordinate descent, CD）方法，称为稀疏坐标下降（sparse CD, sCD），其中每个子问题通过加权中位数算法求解。据我们所知，sCD 是首个计算复杂度随数据中非零元素数量变化的 L1-NMF 算法，使其能高效处理大规模稀疏数据。我们在合成数据和真实数据上进行了广泛的数值实验，以验证我们提出的新模型（wL1-NMF）和算法（sCD）的有效性。

Abstract

Nonnegative matrix factorization (NMF) approximates a nonnegative matrix, $X$, by the product of two nonnegative factors, $WH$, where $W$ has $r$ columns and $H$ has $r$ rows. In this paper, we consider NMF using the component-wise L1 norm as the error measure (L1-NMF), which is suited for data corrupted by heavy-tailed noise, such as Laplace noise or salt and pepper noise, or in the presence of outliers. Our first contribution is an NP-hardness proof for L1-NMF, even when $r=1$, in contrast to the standard NMF that uses least squares. Our second contribution is to show that L1-NMF strongly enforces sparsity in the factors for sparse input matrices, thereby favoring interpretability. However, if the data is affected by false zeros, too sparse solutions might degrade the model. Our third contribution is a new, more general, L1-NMF model for sparse data, dubbed weighted L1-NMF (wL1-NMF), where the sparsity of the factorization is controlled by adding a penalization parameter to the entries of $WH$ associated with zeros in the data. The fourth contribution is a new coordinate descent (CD) approach for wL1-NMF, denoted as sparse CD (sCD), where each subproblem is solved by a weighted median algorithm. To the best of our knowledge, sCD is the first algorithm for L1-NMF whose complexity scales with the number of nonzero entries in the data, making it efficient in handling large-scale, sparse data. We perform extensive numerical experiments on synthetic and real-world data to show the effectiveness of our new proposed model (wL1-NMF) and algorithm (sCD).
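The role of the weighted median in a coordinate descent step can be seen in the simplest scalar subproblem: minimizing the sum of |x_j - w * h_j| over w >= 0 for fixed nonnegative h. Rewriting each term as h_j * |x_j / h_j - w| shows that the minimizer is a weighted median of the ratios x_j / h_j with weights h_j, projected onto w >= 0. The sketch below is a dense, unweighted illustration under that standard fact, with hypothetical function names. It is not the paper's sCD algorithm, which additionally exploits sparsity and the wL1-NMF penalty on zero entries.

```python
import numpy as np

def weighted_median(values, weights):
    """Lower weighted median: a minimizer of sum_i weights[i] * |values[i] - m|."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cum = np.cumsum(w)
    # First position where the cumulative weight reaches half of the total weight.
    idx = np.searchsorted(cum, 0.5 * cum[-1])
    return v[idx]

def update_scalar(x, h):
    """argmin over w >= 0 of sum_j |x[j] - w * h[j]| for nonnegative h."""
    x = np.asarray(x, dtype=float)
    h = np.asarray(h, dtype=float)
    mask = h > 0                      # terms with h[j] == 0 do not depend on w
    if not mask.any():
        return 0.0
    w = weighted_median(x[mask] / h[mask], h[mask])
    return max(0.0, w)                # the objective is convex, so projection is exact

def l1_cost(x, h, w):
    return np.abs(np.asarray(x, dtype=float) - w * np.asarray(h, dtype=float)).sum()

# One entry (100) is an outlier; the L1 fit ignores it, unlike least squares.
x = [2.0, 4.0, 6.0, 100.0]
h = [1.0, 2.0, 3.0, 1.0]
w_star = update_scalar(x, h)
print("L1 update:", w_star, "cost:", l1_cost(x, h, w_star))
```

For comparison, the least-squares update for the same data is the ratio of sum(x * h) to sum(h * h), which is pulled well away from 2 by the single outlier.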

Keywords: Nonnegative Matrix Factorization, L1 norm, sparse data, weighted L1-NMF, coordinate descent, sparsity, outliers, large-scale data

252. ❌ Disentangled Graph Prompting for Out-Of-Distribution Detection

Authors: Cheng Yang, Yu Hao, Qi Zhang, Chuan Shi Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29644v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

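Scoring rationale: The paper focuses on OOD detection for graph neural networks (GNNs) and proposes a graph prompting method based on the pre-training+prompting paradigm. Although it involves pre-training and prompting, these apply to GNNs rather than large language models (LLMs). The paper does not involve large models, innovations in the technical principles of deep learning, or applications of large models in other domains, nor does it mention the other techniques among the scoring keywords (MoE, quantization, inference acceleration, alignment, and so on). Therefore, apart from 'Pre-training OR Continual Pre-training OR Domain Adaptation', which receives 5 (moderate relevance) because the paper involves pre-training, all keywords are irrelevant (0).

!!! tip deepseek-chat TL;DR

To improve out-of-distribution detection with graph neural networks, the paper proposes Disentangled Graph Prompting, which generates class-specific and class-agnostic prompt graphs to capture fine-grained in-distribution patterns. Across ten datasets it achieves a relative AUC improvement of 3.63% over the best graph OOD detection baseline.

Abstract (Chinese translation)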

当测试数据与训练数据来自不同分布时，深度神经网络（DNNs）在实际应用中会面临显著的安全风险。因此，亟需能够在测试时识别分布外（OOD）样本并向系统发出警报的OOD检测技术。现有的图OOD检测方法通常从多角度刻画细粒度的分布内（ID）模式，并训练端到端的图神经网络（GNNs）进行预测。然而，由于训练期间无法获取OOD数据，缺乏显式监督信号可能导致端到端编码器的性能欠佳。为解决此问题，我们遵循预训练+提示（pre-training+prompting）范式，利用预训练的GNN编码器，并提出了解耦图提示（Disentangled Graph Prompting, DGP）方法，借助ID图标签来捕获细粒度的ID模式。具体而言，我们设计了两个提示生成器，分别通过修改输入图的边权重来生成类特定（class-specific）和类无关（class-agnostic）的提示图。我们还设计了若干有效的损失函数来训练提示生成器并避免平凡解。我们在十个数据集上进行了广泛实验，证明了所提出的DGP的优越性，其相对于最佳图OOD检测基线的AUC相对提升了3.63%。消融实验和超参数实验进一步验证了DGP的有效性。代码发布于https://github.com/BUPT-GAMMA/DGP。

Abstract

When testing data and training data come from different distributions, deep neural networks (DNNs) will face significant safety risks in practical applications. Therefore, out-of-distribution (OOD) detection techniques, which can identify OOD samples at test time and alert the system, are urgently needed. Existing graph OOD detection methods usually characterize fine-grained in-distribution (ID) patterns from multiple perspectives, and train end-to-end graph neural networks (GNNs) for prediction. However, due to the unavailability of OOD data during training, the absence of explicit supervision signals could lead to sub-optimal performance of end-to-end encoders. To address this issue, we follow the pre-training+prompting paradigm to utilize pre-trained GNN encoders, and propose Disentangled Graph Prompting (DGP), to capture fine-grained ID patterns with the help of ID graph labels. Specifically, we design two prompt generators that respectively generate class-specific and class-agnostic prompt graphs by modifying the edge weights of an input graph. We also design several effective losses to train the prompt generators and prevent trivial solutions. We conduct extensive experiments on ten datasets to demonstrate the superiority of our proposed DGP, which achieves a relative AUC improvement of 3.63% over the best graph OOD detection baseline. Ablation studies and hyper-parameter experiments further show the effectiveness of DGP. Code is available at https://github.com/BUPT-GAMMA/DGP.

Keywords: out-of-distribution detection, graph neural networks, pre-training, prompting, disentangled graph prompting, OOD detection, GNN, graph prompting

253. ❌ Central limit theorems for the outputs of fully convolutional neural networks with time series input

Authors: Annika Betken, Giorgio Micali, Johannes Schmidt-Hieber Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29612v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

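Scoring rationale: The paper studies theoretical properties of fully convolutional neural networks (FCNs) with time series input, specifically the proof of a central limit theorem. Although it involves deep learning, it focuses on the theoretical analysis of conventional convolutional networks rather than large language models (LLMs) or related techniques. All keywords concern LLMs, their training methods, optimization techniques, application scenarios, or specific domains (such as AI for Science), none of which the paper addresses. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper proves that the outputs of fully convolutional neural networks with global average pooling are asymptotically Gaussian when the inputs come from short-range dependent linear processes, and proposes a global weighted pooling layer with learnable coefficients as a generalization.

Abstract translation (Chinese)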

深度学习被广泛应用于时间序列学习任务，如分类与预测。尽管已取得诸多实证成果，目前针对时间序列的理论研究仍相对有限。在本研究中，我们证明：若网络输入由短程依赖线性过程生成，则采用全局平均池化（Global Average Pooling，GAP）的全卷积神经网络（Fully Convolutional Neural Networks，FCNs）的输出具有渐近高斯性，且当观测时间序列长度趋于无穷时，该极限可达。该证明借鉴了理论时间序列文献中的现有工具。基于此理论，我们提出一种对GAP层的推广方法，即通过引入具有缓慢变化、可学习系数的全局加权池化步骤来实现。

Abstract

Deep learning is widely deployed for time series learning tasks such as classification and forecasting. Despite the empirical successes, only little theory has been developed so far in the time series context. In this work, we prove that if the network inputs are generated from short-range dependent linear processes, the outputs of fully convolutional neural networks (FCNs) with global average pooling (GAP) are asymptotically Gaussian and the limit is attained if the length of the observed time series tends to infinity. The proof leverages existing tools from the theoretical time series literature. Based on our theory, we propose a generalization of the GAP layer by considering a global weighted pooling step with slowly varying, learnable coefficients.

Keywords: fully convolutional neural networks, time series, central limit theorem, global average pooling, asymptotic Gaussianity, short-range dependent linear processes, global weighted pooling
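The asymptotic Gaussianity claim in the abstract can be checked numerically on a toy case. The sketch below is an illustration under stated assumptions, not the paper's setup: the input is an MA(1) process (one simple short-range dependent linear process), the "network" is a single 1-D convolution with ReLU followed by global average pooling, and the kernel, series length, and replication count are arbitrary choices. The function names are hypothetical.

```python
import math
import random


def ma1_series(n, theta, rng):
    """Short-range dependent linear process X_t = eps_t + theta * eps_{t-1}."""
    eps = [rng.gauss(0.0, 1.0) for _ in range(n + 1)]
    return [eps[t + 1] + theta * eps[t] for t in range(n)]


def conv_relu_gap(series, kernel):
    """One 1-D convolution (valid padding), ReLU, then global average pooling."""
    k = len(kernel)
    activations = [
        max(0.0, sum(kernel[j] * series[t + j] for j in range(k)))
        for t in range(len(series) - k + 1)
    ]
    return sum(activations) / len(activations)


def simulate_gap_outputs(n=400, reps=500, theta=0.5,
                         kernel=(0.5, -0.3, 0.8), seed=0):
    """Pooled network output for `reps` independent input series of length n."""
    rng = random.Random(seed)
    return [conv_relu_gap(ma1_series(n, theta, rng), kernel) for _ in range(reps)]


def coverage_within(outputs, z=1.96):
    """Fraction of outputs within z sample standard deviations of the sample mean."""
    mean = sum(outputs) / len(outputs)
    var = sum((o - mean) ** 2 for o in outputs) / (len(outputs) - 1)
    std = math.sqrt(var)
    return sum(1 for o in outputs if abs(o - mean) <= z * std) / len(outputs)
```

If the pooled outputs are close to Gaussian, `coverage_within(simulate_gap_outputs())` should land near 0.95; a value far from that for long series would suggest the toy setup violates the short-range dependence assumption.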

254. ❌ The Geometry of Polynomial Group Convolutional Neural Networks

Authors: Yacoub Hendi, Daniel Persson, Magdalena Larfors Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29566v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

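Scoring rationale: The paper studies a mathematical framework for polynomial group convolutional neural networks (PGCNNs). It belongs to pure mathematical theory and the study of conventional neural network architectures, and is unrelated to all scoring keywords (which center on large models, innovations in deep learning principles, and their applications in various domains). It does not address large models, LLMs, MoE, SLMs, scaling laws, pre-training or post-training, alignment, RLHF, PEFT, RAG, context extension, attention optimization, reasoning methods, agents, quantization, inference acceleration, hallucination mitigation, interpretability, world models, model merging, in-context learning, or AI-for-science applications.

!!! tip deepseek-chat TL;DR

The paper establishes a new mathematical framework, based on graded group algebras, for polynomial group convolutional neural networks (PGCNNs) over an arbitrary finite group, computes the dimension of the neuromanifold, and describes the general fiber of the parametrization.

Abstract translation (Chinese)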

我们研究了针对任意有限群$G$的多项式群卷积神经网络（PGCNNs）。特别地，我们引入了一种基于分次群代数语言的新数学框架来描述PGCNNs。该框架基于哈达玛积（Hadamard product）和克罗内克积（Kronecker product）——两者通过一个线性映射相关联——给出了该架构的两种自然参数化方式。我们计算了相应神经流形的维度，证实其仅取决于网络层数和群的阶数。此外，我们描述了克罗内克参数化在正则群作用和缩放变换下的一般纤维结构，并对哈达玛参数化的类似结构提出了猜想。通过对小群和浅层网络的显式计算，我们的猜想得到了支持。

Abstract

We study polynomial group convolutional neural networks (PGCNNs) for an arbitrary finite group $G$. In particular, we introduce a new mathematical framework for PGCNNs using the language of graded group algebras. This framework yields two natural parametrizations of the architecture, based on Hadamard and Kronecker products, related by a linear map. We compute the dimension of the associated neuromanifold, verifying that it depends only on the number of layers and the size of the group. We also describe the general fiber of the Kronecker parametrization up to the regular group action and rescaling, and conjecture the analogous description for the Hadamard parametrization. Our conjecture is supported by explicit computations for small groups and shallow networks.

Keywords: Polynomial Group Convolutional Neural Networks, PGCNNs, graded group algebras, neuromanifold, Hadamard product, Kronecker product, finite group, mathematical framework
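To make the object of study concrete, here is a minimal sketch of a polynomial group convolutional network on the cyclic group Z_n. It assumes each layer is a group convolution followed by an elementwise power (a Hadamard-style polynomial activation); the paper's exact architecture and parametrizations may differ, and all function names are hypothetical.

```python
def group_conv_cyclic(weights, signal):
    """Group convolution on Z_n: (w * x)(g) = sum_h w(h) * x((g - h) mod n)."""
    n = len(signal)
    return [
        sum(weights[h] * signal[(g - h) % n] for h in range(n))
        for g in range(n)
    ]


def polynomial_layer(weights, signal, degree):
    """Group convolution followed by an elementwise power activation."""
    return [value ** degree for value in group_conv_cyclic(weights, signal)]


def pgcnn_forward(layer_weights, signal, degree=2):
    """Apply the polynomial layers in order; one filter on Z_n per layer."""
    for weights in layer_weights:
        signal = polynomial_layer(weights, signal, degree)
    return signal


def shift(signal, s):
    """Regular action of Z_n on signals: (s . x)(g) = x((g - s) mod n)."""
    n = len(signal)
    return [signal[(g - s) % n] for g in range(n)]
```

In this toy model, shifting the input shifts the output by the same amount (equivariance). Multiplying the first-layer filter by c while dividing the second-layer filter by c**degree leaves the network function unchanged, which shows why a parametrization of this kind cannot be injective.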

255. ❌ Total Variation Guarantees for Sampling with Stochastic Localization

Authors: Jakob Kellermann Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29555v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

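Scoring rationale: The paper presents a theoretical convergence analysis of SLIPS, a sampling algorithm based on stochastic localization, and belongs to the theory of probabilistic sampling and generative models. Its content revolves entirely around diffusion models, stochastic localization, total variation distance, and convergence analysis, and has no direct connection to any of the scoring keywords (all of which concern large models, deep learning principles, AI applications, and so on). It does not touch on language models, model training, alignment, reasoning, agents, compression, AI-for-science applications, or any related topic.

!!! tip deepseek-chat TL;DR

The paper establishes the first convergence guarantee in total variation distance for SLIPS, a sampling algorithm based on stochastic localization, and proves that under minimal assumptions the number of steps needed to reach ε accuracy grows linearly with the dimension (up to logarithmic factors).

Abstract translation (Chinese)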

受基于分数的生成模型成功的启发，近期已有多种基于扩散的算法被提出，用于从可获取未归一化密度的概率测度中进行采样。其中，Grenioux等人引入了SLIPS，一种基于随机局部化的采样算法。尽管SLIPS展现出强大的实证性能，此前尚未提供严格的收敛性分析。在本工作中，我们通过首次在总变差距离上为SLIPS建立收敛保证，填补了这一空白。在对目标测度的最小假设下，我们的界表明，达到$\varepsilon$精度所需的迭代步数在忽略对数因子后与维度呈线性比例。该分析利用了基于分数的生成模型理论中的技术，并进一步为实践中观察到的离散化点最优选择提供了理论见解。

Abstract

Motivated by the success of score-based generative models, a number of diffusion-based algorithms have recently been proposed for the problem of sampling from a probability measure whose unnormalized density can be accessed. Among them, Grenioux et al. introduced SLIPS, a sampling algorithm based on Stochastic Localization. While SLIPS exhibits strong empirical performance, no rigorous convergence analysis has previously been provided. In this work, we close this gap by establishing the first guarantee for SLIPS in total variation distance. Under minimal assumptions on the target, our bound implies that the number of steps required to achieve an $\varepsilon$-guarantee scales linearly with the dimension, up to logarithmic factors. The analysis leverages techniques from the theory of score-based generative models and further provides theoretical insights into the empirically observed optimal choice of discretization points.

Keywords: stochastic localization, sampling algorithms, total variation distance, convergence analysis, score-based generative models, diffusion models, dimensional scaling, discretization points
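The stochastic localization idea behind SLIPS can be illustrated on a target where the denoiser is known in closed form. The sketch below is not SLIPS itself, which has to estimate the denoiser from an unnormalized density. It assumes a target uniform on {-1, +1}, for which the observation process Y_t = t X + W_t has posterior mean E[X | Y_t = y] = tanh(y), and it uses a uniform Euler grid. The time horizon, step size, and function names are illustrative assumptions.

```python
import math
import random


def localize_two_point(total_time=20.0, dt=0.01, rng=None):
    """Euler scheme for dY = tanh(Y) dt + dW, started at Y_0 = 0.

    For a target uniform on {-1, +1}, the drift tanh(y) is the exact
    posterior mean of X given Y_t = y, so no score estimation is needed.
    """
    rng = rng or random.Random()
    y = 0.0
    steps = int(round(total_time / dt))
    for _ in range(steps):
        y += math.tanh(y) * dt + math.sqrt(dt) * rng.gauss(0.0, 1.0)
    # Y_T / T approaches a draw from the target as total_time grows.
    return y / total_time


def draw_samples(count=200, seed=0):
    """Run independent localization paths and return their end points."""
    rng = random.Random(seed)
    return [localize_two_point(rng=rng) for _ in range(count)]
```

Each returned value should sit near +1 or -1, with the two signs appearing roughly equally often. The convergence analysis in the paper concerns how the placement and number of discretization points affect the error of such a scheme, which this uniform grid does not try to optimize.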

256. ❌ Capturing Multivariate Dependencies of EV Charging Events: From Parametric Copulas to Neural Density Estimation

Authors: Martin Výboh, Gabriela Grmanová Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29554v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

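Scoring rationale: The paper focuses on statistical modeling of EV charging events, using Vine copulas and the CODINE (Copula Density Neural Estimation) framework to capture multivariate dependencies. Although CODINE involves neural density estimation, the core of the paper is statistical modeling and dependence-structure analysis rather than large models, deep learning principles, or innovative applications of AI in science. All keywords concern LLMs, deep learning techniques, AI-for-science applications, and similar topics, whereas this paper sits at the intersection of traditional statistical modeling and a limited use of neural networks, with no direct link to the given keywords.

!!! tip deepseek-chat TL;DR

To model the complex non-linear dependencies among the variables of EV charging events (arrival time, duration, and energy demand), the paper applies Vine copulas and the CODINE neural density estimation framework to this domain for the first time. On three real-world datasets, these methods preserve tail behavior and correlation structure better than traditional parametric approaches and are competitive with state-of-the-art benchmark models.

Abstract translation (Chinese)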

对电动汽车充电行为进行精确的事件建模，对于电网可靠性和智能充电设计至关重要。传统统计方法虽能捕捉边缘分布，却往往难以刻画充电变量间复杂的非线性依赖关系，特别是到达时间、持续时间和能量需求之间的关联。本文通过首次将藤Copula（Vine copulas）与Copula密度神经估计框架（CODINE, Copula Density Neural Estimation）引入电动汽车领域，填补了这一研究空白。我们在三个不同的真实世界数据集上评估了这些高容量依赖模型。结果表明，通过显式聚焦于联合依赖结构的建模，藤Copula与CODINE超越了已有的参数化模型族，并在与条件高斯混合模型网络等先进基准模型的对比中保持高度竞争力。我们证明，这些方法能更优地保留尾部行为和相关性结构，为不同基础设施场景下的合成充电事件生成提供了一个稳健的框架。

Abstract

Accurate event-based modeling of electric vehicle (EV) charging is essential for grid reliability and smart-charging design. While traditional statistical methods capture marginal distributions, they often fail to model the complex, non-linear dependencies between charging variables, specifically arrival times, durations, and energy demand. This paper addresses this gap by introducing the first application of Vine copulas and Copula Density Neural Estimation framework (CODINE) to the EV domain. We evaluate these high-capacity dependence models across three diverse real-world datasets. Our results demonstrate that by explicitly focusing on modeling the joint dependence structure, Vine copulas and CODINE outperform established parametric families and remain highly competitive against state-of-the-art benchmarks like conditional Gaussian Mixture Model Networks. We show that these methods offer superior preservation of tail behaviors and correlation structures, providing a robust framework for synthetic charging event generation in varied infrastructure contexts.

Keywords: EV charging modeling, multivariate dependencies, Vine copulas, Copula Density Neural Estimation (CODINE), synthetic event generation, tail behavior preservation, correlation structures, non-linear dependencies
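The copula idea of modeling the dependence structure separately from the marginals can be sketched with the simplest parametric case. The example below uses a bivariate Gaussian copula, not the Vine copulas or CODINE evaluated in the paper. The marginals (exponential duration, uniform energy) and all parameter values are hypothetical choices made only to show the mechanics.

```python
import math
import random
from statistics import NormalDist


def gaussian_copula_pairs(n, rho, rng):
    """Draw n pairs (u, v) of uniforms coupled by a Gaussian copula."""
    std_normal = NormalDist()
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        # The normal CDF maps each correlated normal to a uniform on [0, 1].
        pairs.append((std_normal.cdf(z1), std_normal.cdf(z2)))
    return pairs


def to_charging_events(pairs, mean_duration_h=2.0, max_energy_kwh=60.0):
    """Map uniforms to hypothetical marginals through inverse CDFs.

    Duration ~ Exponential(mean_duration_h); energy ~ Uniform(0, max_energy_kwh).
    """
    events = []
    for u, v in pairs:
        u = min(u, 1.0 - 1e-12)  # guard against log(0)
        duration = -mean_duration_h * math.log(1.0 - u)
        energy = max_energy_kwh * v
        events.append((duration, energy))
    return events
```

Swapping the marginals changes the shape of each variable without touching the dependence between them. A Gaussian copula has no tail dependence, which is one reason more flexible families such as vine copulas are of interest when tail behavior matters.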

257. ❌ Learning Surrogate LPV State-Space Models with Uncertainty Quantification

Authors: E. Javier Olucha, Valentin Preda, Amritam Das, Roland Tóth Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29532v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

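Scoring rationale: The paper studies Bayesian estimation of linear parameter-varying (LPV) state-space models, focusing on system modeling, uncertainty quantification, and control-theoretic applications. It covers nonlinear system modeling, Bayesian methods, uncertainty propagation, and controller design, and belongs to classical control engineering and system identification. All scoring keywords concern LLMs, deep learning principles, AI-for-science applications, and similar topics, none of which the paper addresses, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper proposes a Bayesian approach that jointly estimates an LPV state-space model and its scheduling parameters from input-output data while quantifying model uncertainty, and validates it on a two-dimensional nonlinear mass-spring-damper system.

Abstract translation (Chinese)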

线性变参数（Linear Parameter-Varying, LPV）框架能够为复杂的非线性高维系统构建代理模型，从而促进高效的稳定性与性能分析以及控制器设计。尽管数据驱动的LPV建模已取得显著进展，现有方法并未量化所获LPV模型的不确定性。因此，评估模型在分析与控制中的可靠性或检测训练范围外的运行状态，需要大量验证工作并依赖用户经验。本文提出一种贝叶斯方法，用于联合估计LPV状态空间模型及其调度变量，直接从输入-输出数据中提供模型不确定性的表征及预测响应的置信区间。该方法同时考虑了测量噪声引起的偶然性不确定性，以及有限训练数据和结构偏差导致的认知性不确定性。所得模型既保持了控制器综合所需的LPV结构，又能实现计算高效的仿真与不确定性传播。本文通过二维非线性质量-弹簧-阻尼器系统互联结构的代理建模案例验证了所提方法的有效性。

Abstract

The Linear Parameter-Varying (LPV) framework enables the construction of surrogate models of complex nonlinear and high-dimensional systems, facilitating efficient stability and performance analysis together with controller design. Despite significant advances in data-driven LPV modelling, existing approaches do not quantify the uncertainty of the obtained LPV models. Consequently, assessing model reliability for analysis and control or detecting operation outside the training regime requires extensive validation and user expertise. This paper proposes a Bayesian approach for the joint estimation of LPV state-space models together with their scheduling, providing a characterization of model uncertainty and confidence bounds on the predicted model response directly from input-output data. Both aleatoric uncertainty due to measurement noise and epistemic uncertainty arising from limited training data and structural bias are considered. The resulting model preserves the LPV structure required for controller synthesis while enabling computationally efficient simulation and uncertainty propagation. The approach is demonstrated on the surrogate modelling of a two-dimensional nonlinear interconnection of mass-spring-damper systems.

Keywords: Linear Parameter-Varying, LPV, state-space models, uncertainty quantification, Bayesian approach, surrogate modeling, controller synthesis, nonlinear systems

258. ❌ Sampling at intermediate temperatures is optimal for training large language models in protein structure prediction

Authors: L. Ghiringhelli, A. Zambon, G. Tiana Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29529v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

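Scoring rationale: The paper studies the training mechanisms of transformer models for protein structure prediction, an application of large models to science. Core relevant keywords: 1) 'Large Language Models' (8 points): the paper studies transformer models, which fall under large models; 2) 'Pre-training' (8 points): it studies the training of transformers on protein sequence data; 3) 'Mechanistic Interpretability' (8 points): it uses a statistical mechanics framework to analyze the loss landscape and explain the mechanisms behind transformer performance; 4) 'AI for Science' (10 points): it applies directly to protein structure prediction in bioinformatics. Other keywords such as MoE, SFT, and RAG are not covered and receive 0 points.

!!! tip deepseek-chat TL;DR

Using a statistical mechanics framework, the paper studies the training mechanisms of transformer models on protein sequence data, finds that sampling at intermediate temperatures optimizes learning performance, and shows how the embedding dimension and the attention matrix affect protein contact map prediction.

Abstract translation (Chinese)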

我们采用统计力学框架研究基于蛋白质序列数据训练的Transformer模型的参数空间，通过朗之万动力学在不同温度下对损失函数景观进行采样，以表征低损失流形，并理解Transformer在蛋白质结构预测中性能优越的内在机制。研究发现，与全连接网络不同，Transformer的损失函数中不存在类一阶相变，这产生了一个具有良好学习特性的中间温度区间。我们证明，当嵌入维度处于最优值时，大多数网络层的参数在这些温度下高度保守，并提供了确定该最优维度的操作方法。最后，我们发现相较于学习最优的嵌入维度，在更高温度和更大嵌入维度下，注意力矩阵对蛋白质接触图的预测能力更强。

Abstract

We investigate the parameter space of transformer models trained on protein sequence data using a statistical mechanics framework, sampling the loss landscape at varying temperatures by Langevin dynamics to characterize the low-loss manifold and understand the mechanisms underlying the superior performance of transformers in protein structure prediction. We find that, at variance with feedforward networks, the lack of a first–order–like transition in the loss of the transformer produces a range of intermediate temperatures with good learning properties. We show that the parameters of most layers are highly conserved at these temperatures if the dimension of the embedding is optimal, and we provide an operative way to find this dimension. Finally, we show that the attention matrix is more predictive of the contact maps of the protein at higher temperatures and for higher dimensions of the embedding than those optimal for learning.

Keywords: transformer models, protein structure prediction, statistical mechanics, loss landscape, Langevin dynamics, attention matrix, contact maps, embedding dimension
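Sampling a loss landscape at a given temperature with Langevin dynamics, as described in the abstract, can be sketched on a one-parameter toy loss. The example assumes the standard unadjusted Langevin update, theta <- theta - step * grad L(theta) + sqrt(2 * step * T) * noise, and a quadratic loss in place of a transformer loss. The step size, run length, and function names are illustrative assumptions.

```python
import math
import random


def langevin_samples(grad_loss, theta0, temperature, step, n_steps, rng):
    """Unadjusted Langevin dynamics targeting exp(-loss / temperature)."""
    theta = theta0
    noise_scale = math.sqrt(2.0 * step * temperature)
    samples = []
    for _ in range(n_steps):
        theta = theta - step * grad_loss(theta) + noise_scale * rng.gauss(0.0, 1.0)
        samples.append(theta)
    return samples


def quadratic_loss(theta):
    """Toy loss with a single minimum at theta = 0."""
    return 0.5 * theta * theta


def quadratic_grad(theta):
    """Gradient of the toy quadratic loss."""
    return theta


def mean_loss_at(temperature, seed=0, step=0.1, n_steps=40000, burn_in=1000):
    """Average loss along the chain after discarding a burn-in period."""
    rng = random.Random(seed)
    samples = langevin_samples(quadratic_grad, 0.0, temperature, step, n_steps, rng)
    kept = samples[burn_in:]
    return sum(quadratic_loss(t) for t in kept) / len(kept)
```

For this quadratic loss the average sampled loss is close to T / 2, so lower temperatures concentrate the chain near the minimum while higher temperatures explore a wider region. Comparing `mean_loss_at` across temperatures is the toy analogue of characterizing the low-loss manifold at varying temperature.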

259. ❌ Variational Graph Neural Networks for Uncertainty Quantification in Inverse Problems

Authors: David Gonzalez, Alba Muixi, Beatriz Moya, Elias Cueto Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29515v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

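Scoring rationale: The paper applies variational graph neural networks to uncertainty quantification for inverse problems in computational mechanics, an application of deep learning to scientific computing. Among all keywords, only 'AI for Science OR Bioinformatics OR Cheminformatics' is relevant, because the paper falls under AI for Science, applied specifically to solid mechanics. The remaining keywords concern LLM-related techniques, training methods, inference optimization, agent systems, and so on, whereas the paper involves no language model or text processing and focuses on deterministic prediction and uncertainty estimation with graph neural networks for physical parameter identification.

!!! tip deepseek-chat TL;DR

The paper proposes a variational graph neural network architecture for inverse problems in solid mechanics that recovers physical parameters with high precision and provides confidence intervals consistent with the physics of the problem.

Abstract translation (Chinese)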

深度机器学习技术在计算力学中的日益广泛应用，显著加速了数年前仍被视为难以处理的各类问题的仿真模拟。然而，在工程或医学数字孪生等关键应用中，快速响应并不足够；还必须提供可靠的结果。在某些情况下，传统的确定性方法可能并非最优选择，因为它们无法提供对其预测或结果的置信度度量，特别是在逆问题中——例如，当解可能不唯一，或初始数据因噪声存在而未必完全可靠时。经典的深度神经网络同样缺乏量化其预测不确定性的明确方法。本研究提出一种变分图神经网络（Variational Graph Neural Network, VGNN）架构，该架构在其结构中集成了变分层，以对权重概率分布进行建模。与计算成本高昂的完全贝叶斯网络不同，我们的方法策略性地仅在解码器中引入变分层，从而能以相对较低的成本估计认知不确定性与统计不确定性。
本研究通过两个固体力学案例对所提方法进行了验证：一是二维弹性问题中非线性分布的弹性模量值识别，二是三维超弹性梁上所受载荷的位置与量化分析。两个案例均仅使用各测试的位移场作为输入数据。结果表明，该模型不仅能高精度还原物理参数，还能提供与问题物理特性一致的置信区间，同时能够定位外加载荷的位置并估算其大小，为实验给出相应的置信区间。

Abstract

The increasingly wide use of deep machine learning techniques in computational mechanics has significantly accelerated simulations of problems that were considered unapproachable just a few years ago. However, in critical applications such as Digital Twins for engineering or medicine, fast responses are not enough; reliable results must also be provided. In certain cases, traditional deterministic methods may not be optimal as they do not provide a measure of confidence in their predictions or results, especially in inverse problems where the solution may not be unique or the initial data may not be entirely reliable due to the presence of noise, for instance. Classic deep neural networks also lack a clear measure to quantify the uncertainty of their predictions. In this work, we present a variational graph neural network (VGNN) architecture that integrates variational layers into its architecture to model the probability distribution of weights. Unlike computationally expensive full Bayesian networks, our approach strategically introduces variational layers exclusively in the decoder, allowing us to estimate cognitive uncertainty and statistical uncertainty at a relatively lower cost. In this work, we validate the proposed methodology in two cases of solid mechanics: the identification of the value of the elastic modulus with nonlinear distribution in a 2D elastic problem and the location and quantification of the loads applied to a 3D hyperelastic beam, in both cases using only the displacement field of each test as input data. The results show that the model not only recovers the physical parameters with high precision, but also provides confidence intervals consistent with the physics of the problem, as well as being able to locate the position of the applied load and estimate its value, giving a confidence interval for that experiment.

Keywords: Variational Graph Neural Networks, Uncertainty Quantification, Inverse Problems, Computational Mechanics, Solid Mechanics, Elastic Modulus Identification, Hyperelastic Beam, Confidence Intervals

260. ❌ Model Predictive Path Integral PID Control for Learning-Based Path Following

Authors: Teruki Kato, Koshi Oishi, Seigo Ito Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29499v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

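Scoring rationale: The paper studies a model predictive path integral PID control method from robot control, used for learning-based path following. It covers control theory, optimization algorithms, system identification, and robotics applications, but does not involve large language models, deep learning, AI for Science, or any technique named in the scoring keywords. All keywords concern large models, deep learning, AI alignment, reasoning, compression, and similar topics, whereas this is purely a robot control study, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper proposes MPPI-PID, a method that combines model predictive path integral control with PID control to optimize PID gains for robot path following. Experiments show tracking performance comparable to conventional MPPI with smaller input increments and markedly better sample efficiency.

Abstract translation (Chinese)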

经典比例-积分-微分（PID）控制广泛应用于工业领域，但为实现更高性能，常需采用模型预测控制（MPC）。尽管基于梯度的方法是实时优化的标准方案，但基于采样的方法近年来备受关注。特别是模型预测路径积分（MPPI）控制，它能实现无需梯度的优化，并兼容不可微模型与目标函数。然而，直接对控制输入序列进行采样可能导致输入不连续，且优化维度会随预测时域线性增加。本研究提出MPPI-PID控制方法，通过在每个控制步骤应用MPPI优化PID增益，从而以低维增益空间优化替代直接的高维输入序列优化。该框架通过PID结构提升了采样效率并生成更平滑的控制输入。本文同时提供了理论分析，包括：统一MPPI与MPPI-PID的信息论阐释、优化维度对采样效率影响的分析，以及PID结构对输入连续性的影响机制。所提方法在基于学习的微型叉车路径跟踪任务中进行了验证，采用融合物理模型与神经网络的残差学习动力学模型，并利用真实驾驶数据完成系统辨识。数值路径跟踪实验表明：相较于固定增益PID，MPPI-PID提升了跟踪性能；在显著降低输入增量的同时，其性能与传统MPPI相当。此外，即使在采样数大幅减少的情况下，该方法仍能保持优越性能，证明了其更高的采样效率。

Abstract

Classical proportional–integral–derivative (PID) control is widely employed in industrial applications; however, achieving higher performance often motivates the adoption of model predictive control (MPC). Although gradient-based methods are the standard for real-time optimization, sampling-based approaches have recently gained attention. In particular, model predictive path integral (MPPI) control enables gradient-free optimization and accommodates non-differentiable models and objective functions. However, directly sampling control input sequences may yield discontinuous inputs and increase the optimization dimensionality in proportion to the prediction horizon. This study proposes MPPI–PID control, which applies MPPI to optimize PID gains at each control step, thereby replacing direct high-dimensional input-sequence optimization with low-dimensional gain-space optimization. This formulation enhances sample efficiency and yields smoother inputs via the PID structure. We also provide theoretical insights, including an information-theoretic interpretation that unifies MPPI and MPPI–PID, an analysis of the effect of optimization dimensionality on sample efficiency, and a characterization of input continuity induced by the PID structure. The proposed method is evaluated on the learning-based path following of a mini forklift using a residual-learning dynamics model that integrates a physical model with a neural network. System identification is performed with real driving data. Numerical path-following experiments demonstrate that MPPI–PID improves tracking performance compared with fixed-gain PID and achieves performance comparable to conventional MPPI while significantly reducing input increments. Furthermore, the proposed method maintains favorable performance even with substantially fewer samples, demonstrating its improved sample efficiency.

Keywords: Model Predictive Path Integral Control, PID Control, Path Following, Sample Efficiency, Optimization Dimensionality, Residual-learning Dynamics Model, System Identification, Robotics
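The core move described in the abstract, running MPPI over three PID gains instead of over a full input sequence, can be sketched as follows. This is a simplified illustration, not the authors' implementation: the plant is a scalar integrator rather than the learned forklift model, and the cost weights, noise scale, temperature `lam`, and clamping of gains at zero are all assumptions. Each rollout also restarts the PID integral state, which a real controller would carry over.

```python
import math
import random


def rollout_cost(gains, x0, target, horizon, dt, increment_weight=0.01):
    """Simulate an integrator plant x' = u under PID control; return the cost."""
    kp, ki, kd = gains
    x, integral, prev_error, prev_u = x0, 0.0, target - x0, 0.0
    cost = 0.0
    for _ in range(horizon):
        error = target - x
        integral += error * dt
        u = kp * error + ki * integral + kd * (error - prev_error) / dt
        # Tracking error plus a penalty on input increments.
        cost += error * error + increment_weight * (u - prev_u) ** 2
        x += u * dt
        prev_error, prev_u = error, u
    return cost


def mppi_pid_update(gains, x0, target, rng, n_samples=64, sigma=0.2,
                    lam=1.0, horizon=20, dt=0.1):
    """One MPPI step in gain space: sample, weight by exp(-cost / lam), average."""
    candidates = [
        tuple(max(0.0, g + sigma * rng.gauss(0.0, 1.0)) for g in gains)
        for _ in range(n_samples)
    ]
    costs = [rollout_cost(c, x0, target, horizon, dt) for c in candidates]
    best = min(costs)  # subtract the minimum for numerical stability
    weights = [math.exp(-(c - best) / lam) for c in costs]
    total = sum(weights)
    return tuple(
        sum(w * c[i] for w, c in zip(weights, candidates)) / total
        for i in range(3)
    )


def run_closed_loop(steps=50, seed=0, dt=0.1):
    """Re-optimize the gains at every control step, then apply one PID input."""
    rng = random.Random(seed)
    gains, x, target = (1.0, 0.0, 0.0), 0.0, 1.0
    integral, prev_error = 0.0, target - x
    for _ in range(steps):
        gains = mppi_pid_update(gains, x, target, rng, dt=dt)
        error = target - x
        integral += error * dt
        u = gains[0] * error + gains[1] * integral + gains[2] * (error - prev_error) / dt
        x += u * dt
        prev_error = error
    return x, gains
```

The search space here has three dimensions regardless of the horizon, whereas sampling the input sequence directly would need one dimension per horizon step. That difference is the source of the sample-efficiency argument in the abstract.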

261. ❌ Why not to use Cosine Similarity between Label Representations

Authors: Beatrix M. G. Nielsen Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29488v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

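Scoring rationale: The paper studies the relationship between the cosine similarity of label representations (unembeddings) in softmax classifiers and the model's output probabilities, which places it within machine learning interpretability and Explainable AI. Its core result is a proof that cosine similarity cannot reliably explain model behaviour, so it is partially related to the keyword 'Mechanistic Interpretability OR Explainable AI' (relevance 5), because it addresses whether internal representations can be used to interpret output behaviour. However, the paper does not specifically address large language models (LLMs), innovations in deep learning techniques, or applications of large models in science. It treats softmax classifiers in general (including image classifiers and autoregressive language models) and does not focus on LLM-specific techniques (such as MoE, Scaling Laws, or RLHF), applications (such as AI for Science), or deployment issues (such as On-device AI). All keywords other than the Explainable AI one therefore receive 0.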

!!! tip deepseek-chat TL;DR

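The paper proves that, for any classifier using softmax (including image classifiers and autoregressive language models), the cosine similarity between label representations carries no information about the probabilities the model assigns, and therefore should not be used to explain model behaviour.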

Abstract (Chinese translation)

余弦相似度常被用于衡量向量间的相似性，这些向量可能是神经网络模型的表征。然而，模型表征的余弦相似度并不能保证反映模型的行为特征。本文证明，在使用softmax分类器（无论是图像分类器还是自回归语言模型）时，测量标签表征（本文中称为“反嵌入向量”）之间的余弦相似度无法提供模型所分配概率的任何信息。具体而言，我们证明对于任意softmax分类器模型，给定两个标签表征，总可以构建另一个模型，使其对所有标签和输入保持相同的概率输出，但此时这两个表征间的余弦相似度可被任意设定为1或-1。我们提供了表征间余弦相似度极高或极低的具体模型示例，并展示了如何构建等效模型使其余弦相似度变为-1或1。这种平移歧义可通过中心化标签表征来消除，然而，即使表征间余弦相似度较低，对应标签仍可能对相同输入具有高概率。即使固定表征长度，仍无法保证高（或低）余弦相似度会使标签对相同输入具有高（或低）概率。这意味着在使用softmax分类器时，不应通过标签表征间的余弦相似度值来解释模型概率。

Abstract

Cosine similarity is often used to measure the similarity of vectors. These vectors might be the representations of neural network models. However, it is not guaranteed that cosine similarity of model representations will tell us anything about model behaviour. In this paper we show that when using a softmax classifier, be it an image classifier or an autoregressive language model, measuring the cosine similarity between label representations (called unembeddings in the paper) does not give any information on the probabilities assigned by the model. Specifically, we prove that for any softmax classifier model, given two label representations, it is possible to make another model which gives the same probabilities for all labels and inputs, but where the cosine similarity between the representations is now either 1 or -1. We give specific examples of models with very high or low cosine similarity between representations and show how we can make equivalent models where the cosine similarity is now -1 or 1. This translation ambiguity can be fixed by centering the label representations; however, labels with representations with low cosine similarity can still have high probability for the same inputs. Fixing the length of the representations still does not give a guarantee that high (or low) cosine similarity will give high (or low) probability to the labels for the same inputs. This means that when working with softmax classifiers, cosine similarity values between label representations should not be used to explain model probabilities.

Keywords: cosine similarity, label representations, softmax classifier, model interpretation, unembeddings, probability, neural network, explainable AI
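
The translation ambiguity mentioned in the abstract can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's proof: it assumes logits of the form `u_i · h` (no bias term), and the toy unembeddings `U`, hidden vector `h`, and shift vectors are made up for the example. Adding the same vector `c` to every unembedding adds the same constant `c · h` to every logit, which softmax ignores, yet the cosine similarity between two chosen unembeddings can be pushed to exactly -1 (shift by minus their midpoint) or arbitrarily close to 1 (a large common shift).

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))


def softmax(logits):
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def probs(unembeddings, h):
    """Label probabilities of a bias-free softmax head with logits u_i . h."""
    return softmax([dot(u, h) for u in unembeddings])


def shift(unembeddings, c):
    """Add the same vector c to every label representation."""
    return [[ui + ci for ui, ci in zip(u, c)] for u in unembeddings]


U = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # toy unembeddings for 3 labels
h = [0.3, -0.7]                               # toy hidden representation

c_neg = [-(a + b) / 2 for a, b in zip(U[0], U[1])]   # minus the midpoint of U[0], U[1]
c_pos = [100.0, 100.0]                               # large common shift

U_neg, U_pos = shift(U, c_neg), shift(U, c_pos)

print(cosine(U[0], U[1]), cosine(U_neg[0], U_neg[1]), cosine(U_pos[0], U_pos[1]))
print(probs(U, h), probs(U_neg, h), probs(U_pos, h))
```

The three probability vectors agree up to floating-point rounding, while the three cosine values differ widely, which is why the raw cosine between unembeddings says nothing about the probabilities in this setting.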

262. ❌ Survival In-Context: Prior-fitted In-context Learning Tabular Foundation Model for Survival Analysis

Authors: Dmitrii Seletkov, Paul Hager, Rickmer Braren, Daniel Rueckert, Raphael Rehms Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29475v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

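Scoring rationale: The paper proposes SIC, a tabular foundation model for survival analysis. It is an application of AI for Science in biomedicine and is highly relevant to the keyword 'AI for Science OR Bioinformatics OR Cheminformatics' (10). The model follows the prior-fitted paradigm and is pretrained on synthetic data, making it highly relevant to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (10). It makes predictions through in-context learning, which is highly relevant to 'In-context Learning OR Many-shot Learning' (10). The paper involves the foundation model concept and is therefore related to 'Large Language Models OR LLMs OR Foundation Models' (8). Other keywords, such as MoE, SFT, RAG, and quantization, are not covered in the paper and receive 0.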

!!! tip deepseek-chat TL;DR

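The paper proposes SIC, a tabular foundation model for survival analysis based on prior fitting and in-context learning; after pretraining on synthetic data, it achieves performance comparable or superior to classical and deep survival models on real-world survival datasets.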

Abstract (Chinese translation)

生存分析在众多医学应用中至关重要，但由于数据有限、删失现象以及表格协变量的异质性，其对现代机器学习仍具挑战性。尽管先验拟合范式——即依赖在大量合成数据集上进行预训练模型——近期已推动了面向分类与回归的表格基础模型的发展，但其在时间-事件建模中的适用性仍不明确。我们提出了一种灵活的生存数据生成框架，该框架定义了丰富的生存先验，并能够明确控制协变量与时间-事件分布。基于此先验，我们引入了生存上下文学习模型（Survival In-Context, SIC），这是一种专为生存分析设计的先验拟合上下文学习模型，完全基于合成数据进行预训练。SIC通过单次前向传播即可生成个体化生存预测，无需任务特定训练或超参数调优。在广泛的实际生存数据集评估中，SIC相较于经典及深度生存模型展现出竞争性或更优的性能，尤其在中等规模数据场景下表现突出，这凸显了先验拟合基础模型在生存分析中的潜力。代码将在论文发表时公开提供。

Abstract

Survival analysis is crucial for many medical applications but remains challenging for modern machine learning due to limited data, censoring, and the heterogeneity of tabular covariates. While the prior-fitted paradigm, which relies on pretraining models on large collections of synthetic datasets, has recently facilitated tabular foundation models for classification and regression, its suitability for time-to-event modeling remains unclear. We propose a flexible survival data generation framework that defines a rich survival prior with explicit control over covariates and time-event distributions. Building on this prior, we introduce Survival In-Context (SIC), a prior-fitted in-context learning model for survival analysis that is pretrained exclusively on synthetic data. SIC produces individualized survival prediction in a single forward pass, requiring no task-specific training or hyperparameter tuning. Across a broad evaluation on real-world survival datasets, SIC achieves competitive or superior performance compared to classical and deep survival models, particularly in medium-sized data regimes, highlighting the promise of prior-fitted foundation models for survival analysis. The code will be made available upon publication.

Keywords: survival analysis, tabular foundation model, in-context learning, prior-fitted paradigm, synthetic data pretraining, medical applications, time-to-event modeling, individualized survival prediction

263. ❌ From Big Data to Fast Data: Towards High-Quality Datasets for Machine Learning Applications from Closed-Loop Data Collection

Authors: Philipp Reis, Jacqueline Henle, Stefan Otten, Eric Sax Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29474v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

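Scoring rationale: The paper focuses on data collection strategy in automotive systems engineering (the Fast Data concept), aiming to improve dataset quality for machine learning applications through real-time, context-aware, closed-loop data selection. Although it mentions vision-language and multimodal language models, its core is data collection methodology rather than large-model technology itself or a concrete application of it. All keywords concern specific aspects of large models, such as technical principles, training methods, inference optimization, and application paradigms, and have no direct connection to the paper's data engineering topic, so every keyword receives a relevance of 0.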

!!! tip deepseek-chat TL;DR

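Addressing the data quality problems facing machine learning applications in automotive systems engineering, the paper proposes the Fast Data approach: real-time, context-aware, closed-loop data selection performed on the vehicle, which yields datasets that are more relevant, cover critical scenarios more fully, and have higher information density, while reducing the cost of collecting irrelevant data.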

Abstract (Chinese translation)

机器学习模型（如视觉语言模型与多模态语言模型）能力的不断提升，对汽车系统工程中的数据提出了日益增长的需求，使得所采集数据的质量与相关性成为此类系统开发与验证的关键赋能因素。传统的大数据方法侧重于大规模数据采集与离线处理，而智能数据方法虽改进了数据选择策略，却仍依赖于集中式的离线后处理。
本文提出了面向汽车系统工程的快速数据概念。该方法将数据选择与记录环节转移至作为数据源的车辆端。通过在是否记录及记录何种数据的问题上实现实时、情境感知的决策，数据采集能够与数据质量目标及采集策略在闭环中直接对齐。由此产生的数据集具有更高的相关性、更完善的关键场景覆盖度以及更强的信息密度，同时减少了无关数据及相关成本。所提出的方法为设计符合现代机器学习算法需求的数据采集策略提供了结构化基础，支持高效的数据获取，并有助于在汽车系统工程中实现可扩展且经济高效的机器学习开发流程。

Abstract

The increasing capabilities of machine learning models, such as vision-language and multimodal language models, are placing growing demands on data in automotive systems engineering, making the quality and relevance of collected data key enablers for the development and validation of such systems. Traditional Big Data approaches focus on large-scale data collection and offline processing, while Smart Data approaches improve data selection strategies but still rely on centralized and offline post-processing. This paper introduces the concept of Fast Data for automotive systems engineering. The approach shifts data selection and recording onto the vehicle as the data source. By enabling real-time, context-aware decisions on whether and which data should be recorded, data collection can be directly aligned with data quality objectives and collection strategies within a closed loop. This results in datasets with higher relevance, improved coverage of critical scenarios, and increased information density, while at the same time reducing irrelevant data and associated costs. The proposed approach provides a structured foundation for designing data collection strategies that are aligned with the needs of modern machine learning algorithms. It supports efficient data acquisition and contributes to scalable and cost-effective ML development processes in automotive systems engineering.

Keywords: Fast Data, data collection, automotive systems engineering, machine learning, data quality, closed-loop, real-time decision, dataset relevance

264. ❌ mtslearn: Machine Learning in Python for Medical Time Series

Authors: Zhongheng Jiang, Yuechao Zhao, Donglin Xie, Chenxi Sun, Rongchen Lu, Silu Luo, Zisheng Liang, Shenda Hong Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29432v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

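Scoring rationale: The paper presents mtslearn, a machine learning toolkit for medical time-series data. It concentrates on a conventional machine learning workflow (a unified data interface, feature engineering, model training, and visualization) and does not involve large models, innovations in deep learning techniques, or any of the specific techniques named in the scoring keywords (such as LLMs, MoE, or RLHF). Its only connection is that it is an application of AI in a scientific (medical) domain, so only 'AI for Science OR Bioinformatics OR Cheminformatics' receives 5 (partially related); all other keywords are entirely unrelated (0).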

!!! tip deepseek-chat TL;DR

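To address heterogeneous, inconsistently formatted medical time-series data and the steep learning curve of existing machine learning tools, the paper develops mtslearn, an end-to-end integrated toolkit whose unified data interface and modular design simplify data engineering tasks, lower the barrier for clinicians, and accelerate the adoption of advanced algorithms in clinical practice.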

Abstract (Chinese translation)

医学时间序列数据捕捉了患者病情的动态演变过程，在现代临床决策支持系统中发挥着至关重要的作用。然而，现实世界中的临床数据具有高度异质性且格式不一致。此外，现有的机器学习工具通常学习曲线陡峭，工作流程碎片化。因此，尖端人工智能技术与临床应用之间仍存在显著差距。为解决这一问题，我们推出了mtslearn——一个专为医学时间序列数据设计的端到端集成工具包。首先，该框架提供了统一的数据接口，可自动解析和对齐宽格式（wide）、长格式（long）以及扁平格式（flat）数据。这一设计显著降低了数据清洗的负担。在此基础上，mtslearn提供了从数据读取、特征工程到模型训练与结果可视化的完整流程。此外，它还提供了灵活的自定义算法接口。通过模块化设计，mtslearn将复杂的数据工程任务简化为数行代码。这显著降低了编程经验有限的临床医生的使用门槛，使他们能够更专注于探索医学假设，并加速先进算法向真实世界临床实践的转化。mtslearn已在https://github.com/PKUDigitalHealth/mtslearn公开提供。

Abstract

Medical time-series data captures the dynamic progression of patient conditions, playing a vital role in modern clinical decision support systems. However, real-world clinical data is highly heterogeneous and inconsistently formatted. Furthermore, existing machine learning tools often have steep learning curves and fragmented workflows. Consequently, a significant gap remains between cutting-edge AI technologies and clinical application. To address this, we introduce mtslearn, an end-to-end integrated toolkit specifically designed for medical time-series data. First, the framework provides a unified data interface that automates the parsing and alignment of wide, long, and flat data formats. This design significantly reduces data cleaning overhead. Building on this, mtslearn provides a complete pipeline from data reading and feature engineering to model training and result visualization. Furthermore, it offers flexible interfaces for custom algorithms. Through a modular design, mtslearn simplifies complex data engineering tasks into a few lines of code. This significantly lowers the barrier to entry for clinicians with limited programming experience, empowering them to focus more on exploring medical hypotheses and accelerating the translation of advanced algorithms into real-world clinical practice. mtslearn is publicly available at https://github.com/PKUDigitalHealth/mtslearn.

Keywords: medical time-series, machine learning toolkit, clinical decision support, data interface, feature engineering, model training, clinical application, end-to-end pipeline
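
For readers unfamiliar with the wide and long layouts mentioned in the abstract, the following sketch shows one common reading of the distinction using only the Python standard library. It does not use or imitate the mtslearn API; the function name `long_to_wide`, the column names, and the sample measurements are all hypothetical. In the long layout each row holds one measurement; in the wide layout each row holds every variable for one patient and time point, with `None` marking a missing measurement.

```python
from collections import defaultdict


def long_to_wide(records):
    """Pivot (patient_id, time, variable, value) rows into one row per (patient_id, time)."""
    variables = sorted({var for _, _, var, _ in records})
    table = defaultdict(dict)
    for pid, t, var, val in records:
        table[(pid, t)][var] = val
    rows = []
    for pid, t in sorted(table):
        row = {"patient_id": pid, "time": t}
        for var in variables:
            row[var] = table[(pid, t)].get(var)   # None marks a missing measurement
        rows.append(row)
    return rows


# Hypothetical long-format measurements.
long_records = [
    ("p1", 0, "heart_rate", 82), ("p1", 0, "spo2", 97),
    ("p1", 1, "heart_rate", 85),
    ("p2", 0, "spo2", 95),
]

wide_rows = long_to_wide(long_records)
for row in wide_rows:
    print(row)
```

Aligning irregularly sampled variables onto a shared set of rows like this is the kind of data-cleaning step that the abstract says the toolkit's unified interface automates.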

265. ❌ Multi-AUV Cooperative Target Tracking Based on Supervised Diffusion-Aided Multi-Agent Reinforcement Learning

Authors: Jiaao Ma, Chuan Lin, Guangjie Han, Shengchao Zhu, Zhenyu Wang, Chen An Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29426v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

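Scoring rationale: The paper focuses on applying multi-agent reinforcement learning (MARL) to underwater target tracking and is unrelated to most of the large-model and deep learning keywords. It is highly relevant only to 'Multi-agent Systems OR Agent Coordination' (10), because cooperative multi-AUV tracking is its core. It is partially related to 'AI for Science OR Bioinformatics OR Cheminformatics' (5), since it applies AI to marine science and exploration, though not to bioinformatics or cheminformatics. No other keyword is addressed, so the rest receive 0.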

!!! tip deepseek-chat TL;DR

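Targeting the challenges of non-stationarity, sparse rewards, and fragility to disturbances in cooperative target tracking with multiple autonomous underwater vehicles, the paper proposes a hierarchical architecture based on supervised diffusion-aided multi-agent reinforcement learning, which achieves better tracking precision than existing methods in simulation.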

Abstract (Chinese translation)

近年来，水下网络与多智能体强化学习（Multi-Agent Reinforcement Learning, MARL）的进展显著拓展了多自主水下航行器（AUV）在海洋勘探与目标跟踪领域的应用。然而，当前基于MARL的协同跟踪面临三个关键挑战：1）分散式协调中的非平稳性问题，即局部策略更新会破坏队友观测空间的稳定性，阻碍算法收敛；2）由于水下能见度有限与传感器范围受限导致的稀疏奖励探索效率低下，引发高方差学习；3）水流扰动脆弱性与手工设计奖励依赖相结合，使得在未建模水动力条件下实际部署的鲁棒性下降。为应对这些挑战，本文提出一种分层MARL架构，包含全局训练调度、多智能体协调、局部决策与实时执行四层。该架构通过层次化分解优化任务分配与AUV间协同。在此基础上，我们提出监督扩散辅助多智能体强化学习（Supervised Diffusion-Aided MARL, SDA-MARL）算法，其具备三项创新：1）采用双决策架构与分离经验池，通过结构化经验回放缓解非平稳性；2）引入监督学习机制引导扩散模型的反向去噪过程，生成高保真训练样本以加速收敛；3）融合行为克隆损失的扰动鲁棒策略学习，利用高质量回放动作指导深度确定性策略梯度网络更新，消除对手工设计奖励的依赖。本文基于SDA-MARL提出的跟踪算法在综合水下仿真实验中，相比现有先进方法实现了更优的跟踪精度。

Abstract

In recent years, advances in underwater networking and multi-agent reinforcement learning (MARL) have significantly expanded multi-autonomous underwater vehicle (AUV) applications in marine exploration and target tracking. However, current MARL-driven cooperative tracking faces three critical challenges: 1) non-stationarity in decentralized coordination, where local policy updates destabilize teammates’ observation spaces, preventing convergence; 2) sparse-reward exploration inefficiency from limited underwater visibility and constrained sensor ranges, causing high-variance learning; and 3) water disturbance fragility combined with handcrafted reward dependency that degrades real-world robustness under unmodeled hydrodynamic conditions. To address these challenges, this paper proposes a hierarchical MARL architecture comprising four layers: global training scheduling, multi-agent coordination, local decision-making, and real-time execution. This architecture optimizes task allocation and inter-AUV coordination through hierarchical decomposition. Building on this foundation, we propose the Supervised Diffusion-Aided MARL (SDA-MARL) algorithm featuring three innovations: 1) a dual-decision architecture with segregated experience pools mitigating nonstationarity through structured experience replay; 2) a supervised learning mechanism guiding the diffusion model’s reverse denoising process to generate high-fidelity training samples that accelerate convergence; and 3) disturbance-robust policy learning incorporating behavioral cloning loss to guide the Deep Deterministic Policy Gradient network update using high-quality replay actions, eliminating handcrafted reward dependency. The tracking algorithm based on SDA-MARL proposed in this paper achieves superior precision compared to state-of-the-art methods in comprehensive underwater simulations.

Keywords: Multi-AUV, Cooperative Target Tracking, Multi-Agent Reinforcement Learning, Supervised Diffusion, Hierarchical Architecture, Underwater Simulation, Policy Learning, Experience Replay

266. ❌ Causality-inspired Federated Learning for Dynamic Spatio-Temporal Graphs

Authors: Yuxuan Liu, Wenchao Xu, Haozhao Wang, Zhiming He, Zhaofeng Shi, Chongyang Xu, Peichao Wang, Boyuan Zhang Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29384v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

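Scoring rationale: The paper focuses on federated graph learning (FGL) and dynamic spatio-temporal graphs (STG), proposing a causality-inspired framework, SC-FSGL, to address client heterogeneity and negative transfer. Its core techniques are graph neural networks (GNNs), federated learning, causal intervention, and representation disentanglement; it does not involve large language models (LLMs), innovations in deep learning techniques, or applications of AI in science. All scoring keywords relate to large models, deep learning techniques, or AI for science, whereas this paper studies a specific federated learning problem in graph learning, so every keyword receives a relevance of 0.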

!!! tip deepseek-chat TL;DR

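To address client heterogeneity and negative transfer in federated learning over dynamic spatio-temporal graphs, the paper proposes SC-FSGL, a causality-inspired framework that improves generalization through representation-level interventions and disentangled causal knowledge, and outperforms existing methods on multiple datasets.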

Abstract (Chinese translation)

联邦图学习（Federated Graph Learning, FGL）已成为一种在保护数据隐私的同时实现图神经网络去中心化训练的强大范式。然而，现有的FGL方法主要针对静态图设计，并依赖于参数平均或分布对齐，这些方法隐含地假设所有特征在客户端间具有同等的可迁移性，忽视了现实世界图中存在的时空异质性以及客户端特定知识。在本研究中，我们发现此类假设会引发虚假表征纠缠、客户端特定干扰和负迁移的恶性循环，从而降低动态时空图联邦学习（Federated Learning over Dynamic Spatio-Temporal Graphs, FSTG）中的泛化性能。为解决这一问题，我们提出了一种新颖的因果启发框架SC-FSGL，该框架通过表征层面的干预，明确地将可迁移的因果知识与客户端特定噪声解耦。具体而言，我们引入了一个条件分离模块，通过客户端条件掩码模拟软干预，从而将不变的时空因果因子与虚假信号分离，并缓解由客户端异质性引起的表征纠缠。此外，我们提出了一个因果码本，通过聚类因果原型并利用对比学习对齐局部表征，以促进跨客户端一致性，并推动多样时空模式间的知识共享。在五个具有不同异质性的时空图（Spatio-Temporal Graph, STG）数据集上的实验表明，SC-FSGL优于现有最先进方法。

Abstract

Federated Graph Learning (FGL) has emerged as a powerful paradigm for decentralized training of graph neural networks while preserving data privacy. However, existing FGL methods are predominantly designed for static graphs and rely on parameter averaging or distribution alignment, which implicitly assume that all features are equally transferable across clients, overlooking both the spatial and temporal heterogeneity and the presence of client-specific knowledge in real-world graphs. In this work, we identify that such assumptions create a vicious cycle of spurious representation entanglement, client-specific interference, and negative transfer, degrading generalization performance in Federated Learning over Dynamic Spatio-Temporal Graphs (FSTG). To address this issue, we propose a novel causality-inspired framework named SC-FSGL, which explicitly decouples transferable causal knowledge from client-specific noise through representation-level interventions. Specifically, we introduce a Conditional Separation Module that simulates soft interventions through client-conditioned masks, enabling the disentanglement of invariant spatio-temporal causal factors from spurious signals and mitigating representation entanglement caused by client heterogeneity. In addition, we propose a Causal Codebook that clusters causal prototypes and aligns local representations via contrastive learning, promoting cross-client consistency and facilitating knowledge sharing across diverse spatio-temporal patterns. Experiments on five Spatio-Temporal Graph (STG) datasets with diverse heterogeneity show that SC-FSGL outperforms state-of-the-art methods.

Keywords: Federated Graph Learning, Dynamic Spatio-Temporal Graphs, Causality-inspired Framework, Representation Disentanglement, Client Heterogeneity, Negative Transfer, Conditional Separation Module, Causal Codebook

267. ❌ Deep Learning-Assisted Improved Differential Fault Attacks on Lightweight Stream Ciphers

Authors: Kok Ping Lim, Dongyang Jia, Iftekhar Salam Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29382v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

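Scoring rationale: The paper studies deep learning-assisted differential fault attacks on lightweight stream ciphers, that is, an application of deep learning to cryptography. Its content is entirely unrelated to the vast majority of keywords (large-model technical principles, training methods, inference optimization, alignment, agents, and so on), because those keywords refer specifically to large language models (LLMs) and related techniques. The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', since cryptanalysis can be viewed as an application of AI in a scientific field (specifically computer security and cryptography); however, the paper is not typical bioinformatics or cheminformatics and does not explicitly mention the broad notion of 'AI for Science', so it receives 5 (partially related). All other keywords receive 0.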

!!! tip deepseek-chat TL;DR

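The paper studies deep learning-assisted differential fault attacks on three lightweight stream ciphers (ACORNv3, MORUSv2, and ATOM); it trains multilayer perceptron models to identify fault locations and optimizes the number of fault injections, reducing attack complexity for ACORN and MORUS relative to prior work and providing the first experimental results on ATOM.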

Abstract (Chinese translation)

轻量级密码原语在资源受限环境中被广泛部署，尤其应用于物联网（IoT）设备。由于其公开可访问性，这些设备易受物理攻击，特别是故障攻击。近年来，基于深度学习的密码分析技术已展现出良好效果，但其在故障攻击中的应用仍有限，尤其针对流密码。本研究探讨了在宽松故障模型下，对三种轻量级流密码——ACORNv3、MORUSv2和ATOM——实施深度学习辅助差分故障攻击的可行性。该模型中，单比特翻转故障被注入未知位置。我们训练多层感知机（MLP）模型以识别故障位置。实验结果表明，训练后的模型对ACORNv3、MORUSv2和ATOM的识别准确率分别达到0.999880、0.999231和0.823568，优于传统的基于特征的方法。在秘密信息恢复阶段，我们引入基于阈值的方法以优化恢复所需故障注入次数。结果显示：ACORN的初始状态可通过21至34次故障恢复；MORUS需要213至248次故障，最多仅需猜测6比特。相较于现有研究，两种攻击均降低了攻击复杂度。对于ATOM，结果表明其具有更高的安全裕度，因为非线性反馈移位寄存器（NFSR）中的大多数状态比特仅能在精确控制模型下恢复。据我们所知，本研究首次提供了针对ATOM的差分故障攻击实验结果。

Abstract

Lightweight cryptographic primitives are widely deployed in resource-constrained environments, particularly in Internet of Things (IoT) devices. Due to their public accessibility, these devices are vulnerable to physical attacks, especially fault attacks. Recently, deep learning-based cryptanalytic techniques have demonstrated promising results; however, their application to fault attacks remains limited, particularly for stream ciphers. In this work, we investigate the feasibility of deep learning-assisted differential fault attacks on three lightweight stream ciphers, namely ACORNv3, MORUSv2 and ATOM, under a relaxed fault model, where a single-bit bit-flipping fault is injected at an unknown location. We train multilayer perceptron (MLP) models to identify the fault locations. Experimental results show that the trained models achieve high identification accuracies of 0.999880, 0.999231 and 0.823568 for ACORNv3, MORUSv2 and ATOM, respectively, and outperform traditional signature-based methods. For the secret recovery process, we introduce a threshold-based method to optimize the number of fault injections required to recover the secret information. The results show that the initial state of ACORN can be recovered with 21 to 34 faults, while MORUS requires 213 to 248 faults, with at most 6 bits of guessing. Both attacks reduce the attack complexity compared to existing works. For ATOM, the results show that it possesses a higher security margin, as the majority of state bits in the Non-linear Feedback Shift Register (NFSR) can only be recovered under a precise control model. To the best of our knowledge, this work provides the first experimental results of differential fault attacks on ATOM.

Keywords: deep learning, differential fault attack, lightweight stream ciphers, cryptanalysis, fault location identification, multilayer perceptron, ACORN, MORUS, ATOM

268. ❌ Finite-time analysis of Multi-timescale Stochastic Optimization Algorithms

Authors: Kaustubh Kartikey, Shalabh Bhatnagar Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29380v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

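Rationale: The paper gives a finite-time analysis of multi-timescale stochastic optimization algorithms, which belongs to classical optimization theory and is unrelated to every scoring keyword (all of which concern large models, deep learning techniques, and their applications). It covers no large-model, deep learning, or AI for Science content, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper gives a finite-time convergence analysis of two multi-timescale stochastic optimization algorithms based on smoothed functional estimates (a gradient method and a Newton method), establishes error bounds, and validates the theory experimentally.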

Abstract (Chinese translation)

本文对两种基于仿真的优化平滑函数随机逼近算法进行了有限时间分析。第一种是基于梯度的双时间尺度方法，第二种是基于牛顿法的三时间尺度算法，该算法同时估计目标函数 $J$ 的梯度和海森矩阵。两种算法均涉及梯度/海森矩阵的零阶估计。尽管已有研究证明了这些算法的渐近收敛性，但在零阶设定下，双时间尺度随机优化算法的有限时间保证此前尚未得到解决。针对我们的牛顿算法，我们推导了海森估计器的均方误差界，并建立了 $\min\limits_{0 \le m \le T} \mathbb{E}\| \nabla J(\theta(m)) \|^2$ 的有限时间界，证明其收敛至一阶驻点。该分析明确刻画了多个时间尺度之间的相互作用以及估计误差的传播过程。我们进一步确定了能够平衡主要误差项并实现接近最优收敛速率的步长选择方案。在同一框架下，我们也为梯度算法提供了相应的有限时间保证。理论结果通过在连续山地车环境中的实验得到了进一步验证。

Abstract

We present a finite-time analysis of two smoothed functional stochastic approximation algorithms for simulation-based optimization. The first is a two-timescale gradient-based method, while the second is a three-timescale Newton-based algorithm that estimates both the gradient and the Hessian of the objective function $J$. Both algorithms involve zeroth-order estimates for the gradient/Hessian. Although the asymptotic convergence of these algorithms has been established in prior work, finite-time guarantees of two-timescale stochastic optimization algorithms in zeroth-order settings have not been provided previously. For our Newton algorithm, we derive mean-squared error bounds for the Hessian estimator and establish a finite-time bound on $\min\limits_{0 \le m \le T} \mathbb{E}\| \nabla J(\theta(m)) \|^2$, showing convergence to first-order stationary points. The analysis explicitly characterizes the interaction between multiple timescales and the propagation of estimation errors. We further identify step-size choices that balance dominant error terms and achieve near-optimal convergence rates. We also provide corresponding finite-time guarantees for the gradient algorithm under the same framework. The theoretical results are further validated through experiments on the Continuous Mountain Car environment.

Keywords: stochastic optimization, multi-timescale algorithms, finite-time analysis, gradient estimation, Hessian estimation, convergence rates, simulation-based optimization, Continuous Mountain Car
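
To make the idea of a zeroth-order "smoothed functional" gradient estimate concrete, here is a minimal sketch. It assumes the common one-sided Gaussian form, in which $\nabla J(\theta)$ is approximated by averaging $\eta \, (J(\theta + \delta\eta) - J(\theta)) / \delta$ over $\eta \sim \mathcal{N}(0, I)$. The paper's exact estimators, timescale coupling, and step sizes are not given in the abstract, so the function `sf_gradient`, the noiseless test objective `quadratic`, and all parameter values are illustrative assumptions.

```python
import random


def sf_gradient(f, theta, delta=0.05, num_samples=20000, seed=0):
    """One-sided Gaussian smoothed functional estimate of grad f(theta).

    Only function evaluations are used (zeroth order); no derivatives of f.
    """
    rng = random.Random(seed)
    dim = len(theta)
    base = f(theta)  # f(theta) is reused for every perturbation
    grad = [0.0] * dim
    for _ in range(num_samples):
        eta = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        perturbed = [t + delta * e for t, e in zip(theta, eta)]
        scale = (f(perturbed) - base) / delta
        for i in range(dim):
            grad[i] += scale * eta[i]
    return [g / num_samples for g in grad]


def quadratic(theta):
    # Hypothetical objective with known gradient 2 * theta.
    return sum(t * t for t in theta)


estimate = sf_gradient(quadratic, [1.0, -2.0])
```

For this objective the estimate should land close to the true gradient `[2.0, -4.0]`, with Monte Carlo error that shrinks as `num_samples` grows. In the simulation-based setting of the paper, each evaluation of the objective would itself be noisy.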

269. ❌ AP-DRL: A Synergistic Algorithm-Hardware Framework for Automatic Task Partitioning of Deep Reinforcement Learning on Versal ACAP

Authors: Enlai Li, Zhe Lin, Sharad Sinha, Wei Zhang Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29369v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

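Rationale: The paper focuses on hardware acceleration and quantization optimization for deep reinforcement learning (DRL), specifically an automatic task partitioning framework for the heterogeneous AMD Versal ACAP architecture. Although it involves model compression (quantization) and inference acceleration, these techniques target DRL training rather than large language models (LLMs). The keywords all concern LLM techniques, AI for Science applications, or general large-model methods, whereas the core of this paper is hardware co-design for DRL, with no direct link to LLMs. Therefore, apart from "Quantization" and "Speculative Decoding", which receive 5 points for touching on general optimization techniques, all other keywords score 0.

!!! tip deepseek-chat TL;DR

The paper proposes the AP-DRL framework, which accelerates deep reinforcement learning training on the heterogeneous AMD Versal ACAP platform through automatic task partitioning and hardware-aware quantization optimization, achieving speedups of up to 4.17× while maintaining training convergence.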

Abstract (Chinese translation)

深度强化学习已在多个领域展现出卓越成就。然而，训练与推理过程的紧密耦合使得加速DRL训练成为DRL优化的关键挑战。两大问题阻碍了高效的DRL训练：（1）不同DRL算法乃至同一算法内部操作间计算强度的显著差异，使硬件平台选择复杂化；（2）DRL的宽动态范围在传统FP16+FP32混合精度量化下可能导致严重的奖励误差。现有研究主要集中于针对特定计算单元加速DRL或优化推理阶段量化，本文提出AP-DRL框架以应对上述挑战。
AP-DRL是一种自动任务划分框架，其利用AMD Versal自适应计算加速平台（ACAP）的异构架构（集成CPU、FPGA与AI引擎），通过智能硬件感知优化加速DRL训练。我们首先对各类DRL工作负载在CPU、FPGA和AI引擎上的性能进行瓶颈分析，以此指导AP-DRL跨组件任务划分与量化优化的设计原则。该框架通过基于设计空间探索的性能剖析与基于整数线性规划（ILP）的划分模型，依据操作的计算特性将其匹配至最优计算单元，从而解决平台选择难题。针对量化挑战，AP-DRL采用硬件感知算法，通过协调FP32（CPU）、FP16（FPGA/DSP）和BF16（AI引擎）精度操作，充分利用Versal ACAP对这些精度格式的原生支持。综合实验表明，AP-DRL在保持训练收敛性的同时，相比可编程逻辑基准可实现最高4.17倍的加速，相比AI引擎基准可实现最高3.82倍的加速。

Abstract

Deep reinforcement learning has demonstrated remarkable success across various domains. However, the tight coupling between training and inference processes makes accelerating DRL training an essential challenge for DRL optimization. Two key issues hinder efficient DRL training: (1) the significant variation in computational intensity across different DRL algorithms and even among operations within the same algorithm complicates hardware platform selection, while (2) DRL’s wide dynamic range could lead to substantial reward errors with conventional FP16+FP32 mixed-precision quantization. While existing work has primarily focused on accelerating DRL for specific computing units or optimizing inference-stage quantization, we propose AP-DRL to address the above challenges. AP-DRL is an automatic task partitioning framework that harnesses the heterogeneous architecture of AMD Versal ACAP (integrating CPUs, FPGAs, and AI Engines) to accelerate DRL training through intelligent hardware-aware optimization. Our approach begins with bottleneck analysis of CPU, FPGA, and AIE performance across diverse DRL workloads, informing the design principles for AP-DRL’s inter-component task partitioning and quantization optimization. The framework then addresses the challenge of platform selection through design space exploration-based profiling and ILP-based partitioning models that match operations to optimal computing units based on their computational characteristics. For the quantization challenge, AP-DRL employs a hardware-aware algorithm coordinating FP32 (CPU), FP16 (FPGA/DSP), and BF16 (AI Engine) operations by leveraging Versal ACAP’s native support for these precision formats. Comprehensive experiments indicate that AP-DRL can achieve speedup of up to 4.17$\times$ over programmable logic and up to 3.82$\times$ over AI Engine baselines while maintaining training convergence.

Keywords: Deep Reinforcement Learning, Automatic Task Partitioning, Heterogeneous Architecture, AMD Versal ACAP, Quantization Optimization, Hardware-aware Optimization, Training Acceleration, Mixed-precision
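
The abstract's concern about DRL's wide dynamic range under FP16 can be illustrated with a small sketch of half-precision rounding. This is not the AP-DRL algorithm; it only shows, using Python's standard `struct` module, that values outside the FP16 range are lost. The helper `to_fp16` and the sample values are illustrative assumptions.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to IEEE 754 half precision and back."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct rejects magnitudes beyond the FP16 range;
        # model that here as saturation to infinity.
        return math.copysign(math.inf, x)


large_return = to_fp16(70000.0)  # above the largest finite FP16 value (65504)
tiny_update = to_fp16(1e-8)      # below the smallest FP16 subnormal (about 6e-8)
unit = to_fp16(1.0)              # exactly representable
```

A large cumulative return overflows and a very small update flushes to zero, which is the kind of error that motivates keeping some operations in FP32 or BF16.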

270. ❌ LGFNet: Local-Global Fusion Network with Fidelity Gap Delta Learning for Multi-Source Aerodynamics

Authors: Qinye Zhu, Yu Xiang, Jun Zhang, Wenyong Wang Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29303v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

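Rationale: LGFNet focuses on computational fluid dynamics (CFD) data fusion, proposing a local-global fusion network for multi-scale feature decomposition and a fidelity gap delta learning strategy. Although the paper is an application of AI to science (aerodynamics), its content revolves entirely around a domain-specific deep learning architecture and training strategy. It involves no large language models (LLMs), large-model principles (such as MoE, scaling laws, or fine-tuning methods), reasoning techniques (such as CoT or RAG), agent systems, or model optimization techniques (such as quantization or attention mechanisms). The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper applies deep learning to a scientific problem in aerodynamics; since that is not its core innovation, it receives 5 points (some relevance). All other keywords are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a Local-Global Fusion Network (LGFNet) and a fidelity gap delta learning strategy to address the difficulty of balancing local fidelity against global dependency when fusing multi-source aerodynamic data (CFD, wind tunnel, and flight tests); experiments show state-of-the-art accuracy and uncertainty reduction.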

Abstract (Chinese translation)

在空气动力学领域，精确融合计算流体动力学（CFD）数据、风洞试验数据与飞行试验数据，对于获取全飞行包线内局部流动结构与全局气动趋势的全面认知至关重要。然而，现有方法往往难以平衡高分辨率的局部保真度与宽范围的全局关联性，导致要么损失尖锐的流动间断特征，要么无法捕捉长程拓扑相关性。为此，我们提出局部-全局融合网络（Local-Global Fusion Network, LGFNet），通过多尺度特征分解来提取这种双重性质的气动知识。具体而言，LGFNet结合了集成滑动窗口机制的空间感知层与基于自注意力的关系推理层，在增强细粒度局部特征（如激波）连续性的同时，捕获长程流动信息。此外，本文提出保真度间隙增量学习（fidelity gap delta learning, FGDL）策略，将CFD数据视为“低频载体”，以显式逼近非线性差异。该方法在继承仿真基准基础物理趋势的同时，避免了非物理的光滑化效应。实验表明，LGFNet在多种气动场景下，均于精度与不确定性降低方面达到了最先进（state-of-the-art, SOTA）性能。

Abstract

The precise fusion of computational fluid dynamics (CFD) data, wind tunnel test data, and flight test data in aerodynamics is essential for obtaining comprehensive knowledge of both localized flow structures and global aerodynamic trends across the entire flight envelope. However, existing methodologies often struggle to balance high-resolution local fidelity with wide-range global dependency, leading to either a loss of sharp discontinuities or an inability to capture long-range topological correlations. We propose Local-Global Fusion Network (LGFNet) for multi-scale feature decomposition to extract this dual-natured aerodynamic knowledge. To this end, LGFNet combines a spatial perception layer that integrates a sliding window mechanism with a relational reasoning layer based on self-attention, simultaneously reinforcing the continuity of fine-grained local features (e.g., shock waves) and capturing long-range flow information. Furthermore, the fidelity gap delta learning (FGDL) strategy is proposed to treat CFD data as a “low-frequency carrier” to explicitly approximate nonlinear discrepancies. This approach prevents unphysical smoothing while inheriting the foundational physical trends from the simulation baseline. Experiments demonstrate that LGFNet achieves state-of-the-art (SOTA) performance in both accuracy and uncertainty reduction across diverse aerodynamic scenarios.

Keywords: computational fluid dynamics, aerodynamics, multi-source data fusion, local-global fusion, fidelity gap delta learning, deep learning, feature decomposition, self-attention
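
The delta-learning idea in the abstract (keep a low-fidelity baseline and learn only the gap to higher-fidelity data) can be sketched generically. The following is not LGFNet or the paper's FGDL formulation: it swaps the neural network for a closed-form linear fit, and `low_fidelity`, the sample data, and the gap `0.5 * x - 0.2` are all hypothetical stand-ins.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit ys ~ slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def low_fidelity(x):
    # Hypothetical stand-in for a CFD baseline curve.
    return x * x


# Hypothetical high-fidelity samples: baseline plus a linear gap 0.5 * x - 0.2.
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
high = [low_fidelity(x) + 0.5 * x - 0.2 for x in xs]

# Fit only the fidelity gap, not the full response.
gaps = [h - low_fidelity(x) for x, h in zip(xs, high)]
slope, intercept = fit_line(xs, gaps)


def predict(x):
    """High-fidelity prediction = baseline trend + learned gap."""
    return low_fidelity(x) + slope * x + intercept
```

Because the model only has to represent the gap, the baseline's overall trend carries over to the prediction unchanged, including outside the sampled range.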

271. ❌ From Physics to Surrogate Intelligence: A Unified Electro-Thermo-Optimization Framework for TSV Networks

Authors: Mohamed Gharib, Leonid Popryho, Inna Partin-Vaisband Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29268v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

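Rationale: The paper focuses on electronic design automation (EDA), proposing a physics-informed graph neural network (GNN) surrogate framework for electro-thermal optimization of TSV networks. The keywords all relate directly to large language models (LLMs), deep learning principles, or AI applications in science, but the paper only applies a GNN to a specific engineering problem and does not touch on LLMs, MoE, scaling laws, training techniques, inference optimization, agent systems, model compression, or similar topics. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper applies AI (a GNN) to a scientific/engineering problem (electronic design); since this is not core bioinformatics or cheminformatics, it receives 5 points (some relevance). The other keywords are entirely unrelated and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a scalable electro-thermal modeling and optimization framework that combines physics-informed analytical modeling with a graph neural network (GNN) surrogate to efficiently explore and optimize the layout and geometry of high-density TSV arrays, reducing per-design evaluation time by more than six orders of magnitude.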

Abstract (Chinese translation)

高密度硅通孔（TSV）技术实现了2.5D/3D异质集成，但由于电耦合、插入损耗和自热效应，也带来了严重的信号完整性与热可靠性挑战。传统的全波有限元法（FEM）仿真精度高，但在进行大规模设计空间探索时计算成本过高。本研究提出了一种可扩展的电热建模与优化框架，该框架结合了基于物理原理的解析建模、图神经网络（GNN）代理模型以及全波签核验证。通过多导体解析模型计算TSV阵列的宽带S参数和有效各向异性热导率，在阵列尺寸高达15x15的范围内，相对弗罗贝尼乌斯误差（RFE）达到5%-10%。一个基于物理原理的图神经网络代理模型（TSV-PhGNN）在解析数据上训练，并利用HFSS仿真进行微调，可泛化至更大尺寸阵列，其RFE低于2%且方差几乎恒定。该代理模型被集成到一个多目标帕累托优化框架中，优化目标包括反射系数、插入损耗、最坏情况串扰（近端串扰/远端串扰）以及有效热导率。可在数分钟内探索数百万种TSV配置，实现全面的布局与几何优化，而仅使用FEM则无法完成。最终设计通过Ansys HFSS和Mechanical进行验证，结果高度吻合。所提出的框架能够实现TSV阵列的快速电热协同设计，同时将单次设计评估时间减少了超过六个数量级。

Abstract

High-density through-substrate vias (TSVs) enable 2.5D/3D heterogeneous integration but introduce significant signal-integrity and thermal-reliability challenges due to electrical coupling, insertion loss, and self-heating. Conventional full-wave finite-element method (FEM) simulations provide high accuracy but become computationally prohibitive for large design-space exploration. This work presents a scalable electro-thermal modeling and optimization framework that combines physics-informed analytical modeling, graph neural network (GNN) surrogates, and full-wave sign-off validation. A multi-conductor analytical model computes broadband S-parameters and effective anisotropic thermal conductivities of TSV arrays, achieving $5\%$ to $10\%$ relative Frobenius error (RFE) across array sizes up to $15 \times 15$. A physics-informed GNN surrogate (TSV-PhGNN), trained on analytical data and fine-tuned with HFSS simulations, generalizes to larger arrays with RFE below $2\%$ and nearly constant variance. The surrogate is integrated into a multi-objective Pareto optimization framework targeting reflection coefficient, insertion loss, worst-case crosstalk (NEXT/FEXT), and effective thermal conductivity. Millions of TSV configurations can be explored within minutes, enabling exhaustive layout and geometric optimization that would be infeasible using FEM alone. Final designs are validated with Ansys HFSS and Mechanical, showing strong agreement. The proposed framework enables rapid electro-thermal co-design of TSV arrays while reducing per-design evaluation time by more than six orders of magnitude.

Keywords: TSV networks, electro-thermal modeling, graph neural network (GNN), physics-informed surrogate, multi-objective optimization, high-frequency simulation, thermal conductivity, design-space exploration
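
The accuracy figures in the abstract are quoted as relative Frobenius error (RFE). A minimal sketch of that metric follows, assuming the usual definition $\|A_{\text{pred}} - A_{\text{ref}}\|_F / \|A_{\text{ref}}\|_F$; the abstract does not spell out the exact normalization the authors use, and the matrices below are made-up illustrative values, not data from the paper.

```python
import math


def relative_frobenius_error(predicted, reference):
    """||predicted - reference||_F / ||reference||_F for equally shaped matrices."""
    diff_sq = 0.0
    ref_sq = 0.0
    for row_p, row_r in zip(predicted, reference):
        for p, r in zip(row_p, row_r):
            diff_sq += abs(p - r) ** 2  # abs() also handles complex S-parameters
            ref_sq += abs(r) ** 2
    return math.sqrt(diff_sq / ref_sq)


# Made-up 2x2 example: each diagonal entry is off by 0.1.
reference = [[1.0, 0.0], [0.0, 1.0]]
predicted = [[1.1, 0.0], [0.0, 0.9]]
rfe = relative_frobenius_error(predicted, reference)
```

Under this definition the example evaluates to roughly 0.1, that is, a 10% RFE, which sits at the upper end of the range the abstract reports for the analytical model.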

272. ❌ Lie Generator Networks for Nonlinear Partial Differential Equations

Authors: Shafayeth Jamil, Rehan Kapadia Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29264v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

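Rationale: The paper proposes a Lie Generator Network (LGN-KM) for nonlinear partial differential equations, which falls within AI for Science, specifically physical system modeling and scientific computing, so it is highly relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (10 points). The paper emphasizes interpretability through architectural decomposition, which has some relevance to "Mechanistic Interpretability OR Explainable AI" (5 points): its goal is to make the learned dynamics interpretable, although it does not target the interpretability of large models specifically. The paper does not involve large models or innovations in deep learning principles (such as LLMs, MoE, training methods, inference optimization, or agents), so all other keywords receive 0.

!!! tip deepseek-chat TL;DR

Addressing the lack of an eigenspectrum theory for nonlinear PDEs comparable to that of linear systems, the paper proposes the Lie Generator Network (LGN-KM), which lifts nonlinear dynamics into a linear latent space and learns the Koopman generator. This gives stable, interpretable modeling and recovers the dissipation scaling and dispersion relation in Navier-Stokes turbulence.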

Abstract (Chinese translation)

线性动力系统完全由其本征谱表征，该谱可直接从动力学生成子中获取。对于由偏微分方程控制的非线性系统，尚不存在等效理论。我们提出李生成子网络-库普曼（LGN-KM），这是一种神经算子，可将非线性动力学提升至线性潜在空间，并通过分解 $L_k = S - D_k$ 学习连续时间库普曼生成子（$L_k$），其中 $S$ 为斜对称矩阵，表示保守的模态间耦合，$D_k$ 为正定对角矩阵，编码模态耗散。这种架构分解确保了稳定性，并通过对所学动力学的直接谱访问实现可解释性。在二维纳维-斯托克斯湍流中，该生成子仅从轨迹数据中（无需物理监督）即可恢复已知的耗散标度律和完整的多分支色散关系。在不同流态下独立训练的模型恢复了匹配的规范不变谱结构，揭示了库普曼提升中的规范自由度。由于该生成子具有可证明的稳定性，它能保证长时程稳定性、在任意时间进行连续时间评估，并实现基于物理信息的跨粘度模型迁移。

Abstract

Linear dynamical systems are fully characterized by their eigenspectra, accessible directly from the generator of the dynamics. For nonlinear systems governed by partial differential equations, no equivalent theory exists. We introduce Lie Generator Network–Koopman (LGN-KM), a neural operator that lifts nonlinear dynamics into a linear latent space and learns the continuous-time Koopman generator ($L_k$) through a decomposition $L_k = S - D_k$, where $S$ is a skew-symmetric matrix representing conservative inter-modal coupling, and $D_k$ is a positive-definite diagonal matrix encoding modal dissipation. This architectural decomposition enforces stability and enables interpretability through direct spectral access to the learned dynamics. On two-dimensional Navier–Stokes turbulence, the generator recovers the known dissipation scaling and a complete multi-branch dispersion relation from trajectory data alone with no physics supervision. Independently trained models at different flow regimes recover matched gauge-invariant spectral structure, exposing a gauge freedom in the Koopman lifting. Because the generator is provably stable, it enables guaranteed long-horizon stability, continuous-time evaluation at arbitrary times, and physics-informed cross-viscosity model transfer.

Keywords: Lie Generator Network, Koopman operator, nonlinear partial differential equations, neural operator, interpretability, Navier-Stokes turbulence, stability, spectral analysis
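
Why the decomposition $L_k = S - D_k$ enforces stability can be checked numerically. For real $z$ with $\dot z = L_k z$, the skew-symmetric part contributes nothing to the energy rate, so $\frac{d}{dt}\tfrac12\|z\|^2 = z^\top L_k z = -z^\top D_k z < 0$. The sketch below assumes real-valued matrices and a softplus parameterization to keep the diagonal positive. Both are assumptions, as are the helper names, since the abstract does not describe how the authors parameterize $S$ and $D_k$.

```python
import math
import random


def softplus(x):
    """Smooth map from any real number to a strictly positive one."""
    return math.log1p(math.exp(x))


def build_generator(raw_coupling, raw_damping):
    """Return (L, damping) with L = S - D, S skew-symmetric, D positive diagonal."""
    n = len(raw_damping)
    damping = [softplus(d) for d in raw_damping]
    # S = A - A^T is skew-symmetric for any square A; its diagonal is zero.
    gen = [[raw_coupling[i][j] - raw_coupling[j][i] for j in range(n)]
           for i in range(n)]
    for i in range(n):
        gen[i][i] -= damping[i]
    return gen, damping


def quadratic_form(matrix, z):
    """Compute z^T M z."""
    n = len(z)
    return sum(z[i] * matrix[i][j] * z[j] for i in range(n) for j in range(n))


rng = random.Random(0)
n = 4
raw_coupling = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
raw_damping = [rng.gauss(0.0, 1.0) for _ in range(n)]
gen, damping = build_generator(raw_coupling, raw_damping)

z = [rng.gauss(0.0, 1.0) for _ in range(n)]
energy_rate = quadratic_form(gen, z)                      # z^T L z
expected = -sum(d * zi * zi for d, zi in zip(damping, z))  # -z^T D z
```

Here `energy_rate` matches `expected` up to rounding and is negative, so the latent energy decays regardless of the unconstrained parameters in `raw_coupling` and `raw_damping`.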

273. ❌ Real-Time Surrogate Modeling for Fast Transient Prediction in Inverter-Based Microgrids Using CNN and LightGBM

Authors: Osasumwen Cedric Ogiesoba-Eguakun, Kaveh Ashenayi, Suman Rath Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29255v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

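Rationale: The paper is a deep learning application that uses a CNN and LightGBM for real-time microgrid prediction. As an application of AI to engineering science it has some relevance to "AI for Science" (5 points), but it does not involve large-model keywords such as large language models (LLMs), MoE, SLMs, scaling laws, pre-training, post-training, alignment, RLHF, PEFT, RAG, context extension, attention optimization, reasoning methods, agent systems, model compression, inference acceleration, hallucination mitigation, interpretability, world models, model merging, or in-context learning, so those keywords all score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a data-driven surrogate modeling framework based on a CNN and LightGBM for fast prediction of the dynamic behavior of inverter-based microgrids, achieving speedups of more than 1000× and real-time performance, suitable for monitoring, fault analysis, and control applications.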

Abstract (Chinese translation)

基于逆变器的微电网实时监测对于系统稳定性、故障响应和运行决策至关重要。然而，捕捉快速逆变器动态所需的电磁暂态（EMT）仿真计算量大，不适合实时应用。本文提出了一种数据驱动的代理建模框架，利用卷积神经网络（CNN）和轻量梯度提升机（LightGBM）快速预测微电网行为。模型在一个高保真EMT数字孪生数据集上训练，该数据集模拟了包含十个分布式发电单元的微电网在十一种运行和扰动场景（包括故障、噪声和通信延迟）下的行为。采用滑动窗口方法来预测关键系统变量，包括电压幅值、频率、总有功功率和电压暂降。结果表明，模型性能随预测变量类型的不同而变化。CNN对电压等时间相关信号表现出高精度，其$R^2$值达到0.84；而LightGBM在结构化和扰动相关变量上表现更优，频率预测的$R^2$达到0.999，电压暂降预测达到0.75。结合使用的CNN+LightGBM混合模型在所有变量上均展现出稳定的性能。除了准确性，这些代理模型在计算效率上也实现了显著提升：LightGBM获得了超过$1000\times$的加速比，运行速度快于实时；混合模型则实现了超过$500\times$的加速比，并具备近实时性能。这些发现表明，数据驱动的代理模型能有效表征微电网动态，支持实时及超实时预测，因此非常适用于基于逆变器的电力系统中的监测、故障分析和控制等应用。

Abstract

Real-time monitoring of inverter-based microgrids is essential for stability, fault response, and operational decision-making. However, electromagnetic transient (EMT) simulations, required to capture fast inverter dynamics, are computationally intensive and unsuitable for real-time applications. This paper presents a data-driven surrogate modeling framework for fast prediction of microgrid behavior using convolutional neural networks (CNN) and Light Gradient Boosting Machine (LightGBM). The models are trained on a high-fidelity EMT digital twin dataset of a microgrid with ten distributed generators under eleven operating and disturbance scenarios, including faults, noise, and communication delays. A sliding-window method is applied to predict important system variables, including voltage magnitude, frequency, total active power, and voltage dip. The results show that model performance changes depending on the type of variable being predicted. The CNN demonstrates high accuracy for time-dependent signals such as voltage, with an $R^2$ value of 0.84, whereas LightGBM shows better performance for structured and disturbance-related variables, achieving an $R^2$ of 0.999 for frequency and 0.75 for voltage dip. A combined CNN+LightGBM model delivers stable performance across all variables. Beyond accuracy, the surrogate models also provide major improvements in computational efficiency. LightGBM achieves more than $1000\times$ speedup and runs faster than real time, while the hybrid model achieves over $500\times$ speedup with near real-time performance. These findings show that data-driven surrogate models can effectively represent microgrid dynamics. They also support real-time and faster-than-real-time predictions. As a result, they are well-suited for applications such as monitoring, fault analysis, and control in inverter-based power systems.

Keywords: surrogate modeling, inverter-based microgrids, convolutional neural networks, LightGBM, real-time prediction, electromagnetic transient simulation, computational efficiency, fault analysis
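
The sliding-window step mentioned in the abstract can be sketched as follows: a time series is cut into fixed-width input windows, each paired with the value a few steps ahead as the prediction target. The window width, horizon, and voltage samples are illustrative assumptions, not values from the paper, and the function name `sliding_windows` is hypothetical.

```python
def sliding_windows(series, width, horizon=1):
    """Split a series into (input window, future target) training pairs."""
    inputs, targets = [], []
    last_start = len(series) - width - horizon
    for start in range(last_start + 1):
        inputs.append(series[start:start + width])
        targets.append(series[start + width + horizon - 1])
    return inputs, targets


# Made-up per-unit voltage samples containing a dip.
voltage = [1.00, 0.99, 0.97, 0.80, 0.85, 0.95]
X, y = sliding_windows(voltage, width=3)
```

With six samples and a width of three, this yields three pairs; the first window `[1.00, 0.99, 0.97]` is paired with the target `0.80`. The resulting windows could feed either a convolutional model or, once flattened, a gradient-boosted tree model.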

274. ❌ Stochastic Dimension Implicit Functional Projections for Exact Integral Conservation in High-Dimensional PINNs

Authors: Zhangyong Liang Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29237v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

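Rationale: The paper addresses a computational methodology problem: how to exactly enforce macroscopic conservation laws (such as mass and energy) when solving high-dimensional partial differential equations (PDEs) with neural networks, PINNs in particular. It proposes a stochastic dimension implicit functional projection framework, SDIFP, and introduces a doubly-stochastic unbiased gradient estimator (DS-UGE) to improve training scalability. Its core concerns are numerical methods for PDE solving, computational efficiency, and conservation constraints, which belong to computational science and scientific computing. The keywords all relate directly to large models, deep learning principles, training and alignment methods, inference optimization, agent systems, or specific AI applications (such as bioinformatics). The paper involves no large language models (LLMs), foundation models, MoE, training techniques (such as pre-training, fine-tuning, alignment, or RLHF), inference acceleration techniques (such as attention optimization or quantization), agents, tool use, or multi-agent systems. The only marginally relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper falls within the broad scope of AI for Science (using neural networks to solve scientific computing problems) but not within its core bioinformatics or cheminformatics subfields, so it receives 5 points (some relevance). All other keywords are entirely unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper tackles the computational scalability of exactly enforcing integral conservation laws (such as mass and energy) in high-dimensional physics-informed neural networks (PINNs). It proposes the Stochastic Dimension Implicit Functional Projection (SDIFP) framework and a doubly-stochastic unbiased gradient estimator (DS-UGE), which yield a mesh-free, memory-efficient, conservative solver for high-dimensional PDEs that preserves inference efficiency.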

Abstract (Chinese translation)

在高维环境中，强制神经偏微分方程（PDE）求解器精确满足宏观守恒律（如质量和能量）在计算上具有挑战性。传统的离散投影方法依赖于确定性求积，其扩展性差且限制了无网格方法（如物理信息神经网络，PINNs）的运用。此外，高阶算子会带来沉重的内存开销，而通用优化方法对于非凸守恒流形往往缺乏收敛保证。为解决这些问题，我们提出了随机维度隐式函数投影（SDIFP）框架。SDIFP并非对离散向量进行投影，而是对连续网络输出施加全局仿射变换。通过分离蒙特卡洛（MC）求积，该方法为积分约束提供了闭式解，从而绕过了对空间网格的依赖。为实现可扩展的训练，我们引入了一种双重随机无偏梯度估计器（DS-UGE）。通过将空间采样与微分算子子采样解耦，DS-UGE将内存复杂度从 $\mathcal{O}(M \times N_{\mathcal{L}})$ 降低至 $\mathcal{O}(N \times |\mathcal{I}|)$。SDIFP 能够减轻采样方差，保持解的规律性，并维持 $\mathcal{O}(1)$ 的推理效率，为求解保守的高维偏微分方程提供了一种可扩展的无网格方法。

Abstract

Enforcing exact macroscopic conservation laws, such as mass and energy, in neural partial differential equation (PDE) solvers is computationally challenging in high dimensions. Traditional discrete projections rely on deterministic quadrature that scales poorly and restricts mesh-free formulations like PINNs. Furthermore, high-order operators incur heavy memory overhead, and generic optimization often lacks convergence guarantees for non-convex conservation manifolds. To address this, we propose the Stochastic Dimension Implicit Functional Projection (SDIFP) framework. Instead of projecting discrete vectors, SDIFP applies a global affine transformation to the continuous network output. This yields closed-form solutions for integral constraints via detached Monte Carlo (MC) quadrature, bypassing spatial grid dependencies. For scalable training, we introduce a doubly-stochastic unbiased gradient estimator (DS-UGE). By decoupling spatial sampling from differential operator subsampling, the DS-UGE reduces memory complexity from $\mathcal{O}(M \times N_{\mathcal{L}})$ to $\mathcal{O}(N \times |\mathcal{I}|)$. SDIFP mitigates sampling variance, preserves solution regularity, and maintains $\mathcal{O}(1)$ inference efficiency, providing a scalable, mesh-free approach for solving conservative high-dimensional PDEs.

Keywords: Physics-Informed Neural Networks (PINNs), High-Dimensional PDEs, Integral Conservation Laws, Stochastic Projection, Monte Carlo Quadrature, Memory-Efficient Training, Mesh-Free Methods, Doubly-Stochastic Gradient Estimator
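
The abstract above describes applying a global affine transformation to the network output so that an integral constraint is met in closed form through Monte Carlo quadrature. A minimal sketch of the simplest such case, a pure additive shift that fixes the total mass, is given below. The function names, the stand-in "network" `raw`, the unit-hypercube domain, and the choice of a shift-only transform are all assumptions made for illustration; the source does not specify the paper's actual transform.

```python
import math
import random


def mc_integral(f, points, volume):
    """Monte Carlo estimate of the integral of f over a domain of the given volume."""
    return volume * sum(f(x) for x in points) / len(points)


def project_to_mass(f, points, volume, target_mass):
    """Shift f by a constant so its MC integral on `points` equals target_mass.

    Assumption: in an autodiff framework the estimate below would be treated
    as a constant (no gradient flows through it), which is one reading of
    "detached" quadrature.
    """
    shift = (target_mass - mc_integral(f, points, volume)) / volume
    return lambda x: f(x) + shift


random.seed(0)
dim = 10  # hypothetical dimension; no spatial grid is ever built
points = [[random.random() for _ in range(dim)] for _ in range(2000)]


def raw(x):
    """Stand-in for an unconstrained network output."""
    return math.sin(sum(x))


projected = project_to_mass(raw, points, volume=1.0, target_mass=1.0)
mass_before = mc_integral(raw, points, 1.0)
mass_after = mc_integral(projected, points, 1.0)
```

On the sample set used for the correction, `mass_after` matches the target up to floating-point error, and evaluating `projected` costs one extra addition per point.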

275. ❌ Robust and Consistent Ski Rental with Distributional Advice

Authors: Jihwan Kim, Chenglin Fan Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29233v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

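Rationale: The paper studies a classic online decision problem (ski rental), which belongs to operations research, algorithm design, and online optimization. It focuses on using distributional predictions to improve deterministic and randomized algorithms, and proposes the Clamp Policy and a Water-Filling Algorithm. Every keyword in the list concerns large models, deep learning, the technical principles of AI, or AI applications in science, none of which the paper touches, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

For the classic online decision problem of ski rental, the paper proposes a systematic framework that uses distributional predictions, designs algorithms that stay robust while improving consistency, and shows across several distributions that they outperform existing point-prediction methods.

Abstract (Chinese translation)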

滑雪租赁问题是研究不确定性下在线决策的经典模型，其核心在于权衡重复租赁成本与一次性购买之间的取舍。传统算法聚焦于最坏情况下的竞争比，而近期“学习增强”方法则利用点估计预测，但两者均未能在保持严格鲁棒性保证的同时充分利用完整分布预测所蕴含的信息。本文通过建立一个系统性框架来填补这一空白，该框架将未知质量的分布性建议整合到确定性算法与随机化算法中。
在确定性场景下，我们首先在完美分布预测的假设下形式化该问题，并推导出一种高效算法以计算最优的阈值购买日。通过严格的性能分析，我们明确了预测分布所需满足的充分条件，使得在此条件下期望竞争比（ECR）能够达到经典的最优随机化算法界限。为处理不完美的预测，我们提出钳位策略（Clamp Policy），该策略将购买阈值限制在一个由可调参数控制的安全区间内。我们证明该策略兼具鲁棒性与一致性：即使预测误差较大时仍能保持良好的性能，且在预测趋于准确时能够逼近最优性能。
在随机化场景下，我们通过注水算法（Water-Filling Algorithm）刻画最优停止分布，该算法在严格满足鲁棒性约束的同时优化期望成本。基于多种分布（高斯分布、几何分布及双峰分布）的实验结果表明，相较于现有的点预测基线方法，我们的框架在保持相当鲁棒性的同时，显著提升了算法的一致性。

Abstract

The ski rental problem is a canonical model for online decision-making under uncertainty, capturing the fundamental trade-off between repeated rental costs and a one-time purchase. While classical algorithms focus on worst-case competitive ratios and recent “learning-augmented” methods leverage point-estimate predictions, neither approach fully exploits the richness of full distributional predictions while maintaining rigorous robustness guarantees. We address this gap by establishing a systematic framework that integrates distributional advice of unknown quality into both deterministic and randomized algorithms. For the deterministic setting, we formalize the problem under perfect distributional prediction and derive an efficient algorithm to compute the optimal threshold-buy day. We provide a rigorous performance analysis, identifying sufficient conditions on the predicted distribution under which the expected competitive ratio (ECR) matches the classic optimal randomized bound. To handle imperfect predictions, we propose the Clamp Policy, which restricts the buying threshold to a safe range controlled by a tunable parameter. We show that this policy is both robust, maintaining good performance even with large prediction errors, and consistent, approaching the optimal performance as predictions become accurate. For the randomized setting, we characterize the stopping distribution via a Water-Filling Algorithm, which optimizes expected cost while strictly satisfying robustness constraints. Experimental results across diverse distributions (Gaussian, geometric, and bi-modal) demonstrate that our framework improves consistency significantly over existing point-prediction baselines while maintaining comparable robustness.

Keywords: ski rental problem, online decision-making, distributional advice, robustness, consistency, competitive ratio, deterministic algorithms, randomized algorithms
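
To make the deterministic setting in the abstract above concrete, the sketch below scans candidate buy days under a known season-length distribution and then clamps the chosen day to a safe range. It assumes a rental cost of 1 per day. The function names, the toy distribution, and the range [ceil(λB), ceil(B/λ)] are illustrative assumptions, not the paper's algorithm or its definition of the Clamp Policy.

```python
import math


def expected_cost(buy_day, buy_price, season_pmf):
    """Expected cost of renting on days 1..buy_day-1 and buying on buy_day.

    season_pmf maps season length (days) to probability; rent costs 1 per day.
    """
    total = 0.0
    for days, prob in season_pmf.items():
        if days < buy_day:
            cost = days  # the season ended before we bought
        else:
            cost = (buy_day - 1) + buy_price
        total += prob * cost
    return total


def best_threshold(buy_price, season_pmf):
    """Brute-force scan for the buy day with the lowest expected cost."""
    last = max(season_pmf) + 1  # buying after the longest season means never buying
    return min(range(1, last + 1),
               key=lambda k: expected_cost(k, buy_price, season_pmf))


def clamp_threshold(threshold, buy_price, lam):
    """Hypothetical clamp: keep the threshold inside [ceil(lam*B), ceil(B/lam)]."""
    low = math.ceil(lam * buy_price)
    high = math.ceil(buy_price / lam)
    return max(low, min(high, threshold))


# Toy predicted distribution: a short season or a long one, equally likely.
pmf = {3: 0.5, 30: 0.5}
price = 10
k_star = best_threshold(price, pmf)
k_safe = clamp_threshold(k_star, price, lam=0.5)
```

With these toy numbers the unclamped optimum buys on day 4 and the clamp moves it to day 5, giving up a little expected cost so that the threshold cannot drift arbitrarily far from the buy price when the prediction is wrong.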

276. ❌ Derived Fields Preserve Fine-Scale Detail in Budgeted Neural Simulators

Authors: Wenshuo Wang, Fan Zhang Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29224v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

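Rationale: The paper studies the fine-scale fidelity of neural networks used for partial differential equation (PDE) simulation under a fixed storage budget, and proposes a state-design framework called Derived-Field Optimization (DerivOpt). Its core is the use and optimization of neural simulators in scientific computing (specifically fluid dynamics simulation), that is, deep learning applied to science. It is therefore unrelated to the vast majority of the keyword list, which centers on the principles, training, alignment, reasoning, and application paradigms of LLMs. The only weak link is "AI for Science OR Bioinformatics OR Cheminformatics": the paper falls under AI for scientific computing (computational fluid dynamics), but that is not its core contribution, which is the state-design method for the simulator, so it receives 5 (some relevance). All other keywords have no direct connection to the paper and receive 0.

!!! tip deepseek-chat TL;DR

Neural simulators lose fine-scale detail under a fixed storage budget. The paper proposes Derived-Field Optimization, a framework that designs the simulation state by optimizing which physical fields are carried. It markedly improves fine-scale fidelity on PDEBench and shows that state design is a key bottleneck under such budget constraints.

Abstract (Chinese translation)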

在固定存储预算下实现精细尺度保真的神经模拟仍然具有挑战性。现有许多方法通过改进架构、训练目标或推演策略来降低高频误差。然而，在预算受限的粗化-量化-解码流程中，精细细节往往在构建承载状态时便已丢失。在经典的周期性不可压缩纳维-斯托克斯场景中，我们证明原始场与派生场在同一算子作用下会经历系统性不同的保留频带畸变。基于这一观察，我们提出了派生场优化（DerivOpt）——一个通用的状态设计框架，该框架在校准的信道模型下，选择承载哪些物理场以及如何在各场间分配存储预算。在PDEBench全时间相关前向子集上的实验表明，DerivOpt不仅提升了整体平均推演归一化均方根误差，更在精细尺度保真度上对一系列强基线模型取得了决定性优势。更重要的是，这些改进在输入时刻（即推演学习开始前）已清晰可见。这表明在严格存储预算下，承载状态往往是主导性瓶颈。这些结果揭示了一个更广泛的结论：在预算受限的神经模拟中，承载状态设计应与架构、损失函数和推演策略共同被视为首要设计维度。

Abstract

Fine-scale-faithful neural simulation under fixed storage budgets remains challenging. Many existing methods reduce high-frequency error by improving architectures, training objectives, or rollout strategies. However, under budgeted coarsen-quantize-decode pipelines, fine detail can already be lost when the carried state is constructed. In the canonical periodic incompressible Navier-Stokes setting, we show that primitive and derived fields undergo systematically different retained-band distortions under the same operator. Motivated by this observation, we formulate Derived-Field Optimization (DerivOpt), a general state-design framework that chooses which physical fields are carried and how storage budget is allocated across them under a calibrated channel model. Across the full time-dependent forward subset of PDEBench, DerivOpt not only improves pooled mean rollout nRMSE, but also delivers a decisive advantage in fine-scale fidelity over a broad set of strong baselines. More importantly, the gains are already visible at input time, before rollout learning begins. This indicates that the carried state is often the dominant bottleneck under tight storage budgets. These results suggest a broader conclusion: in budgeted neural simulation, carried-state design should be treated as a first-class design axis alongside architecture, loss, and rollout strategy.

Keywords: neural simulation, budgeted simulation, fine-scale fidelity, derived-field optimization, PDE simulation, state design, storage budget, Navier-Stokes

277. ❌ Software Vulnerability Detection Using a Lightweight Graph Neural Network

Authors: Miles Farmer, Ekincan Ufuktepe, Anne Watson, Hialo Muniz Carvalho, Vadim Okun, Zineb Maasaoui, Kannappan Palaniappan Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29216v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

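Rationale: The paper uses a graph neural network (GNN) for software vulnerability detection. Relevance to LLMs is fairly high (8) because the abstract explicitly notes the popularity of LLMs in vulnerability detection and uses them as the performance baseline. There is some relevance (5) to "Small Language Models OR SLMs OR On-device AI", since the proposed VulGNN is described as lightweight and deployable at the edge, which matches small and edge AI. There is some relevance (5) to "Quantization OR Model Compression OR Low-bit Weights", since the paper stresses small model size (100 times smaller than LLMs) and efficiency, which touches on lightweight-model ideas. The remaining keywords are unrelated (0): the paper concentrates on the design, evaluation, and efficiency of a GNN for one application (vulnerability detection) and does not address MoE, training methods, inference optimization, agent systems, or similar topics.

!!! tip deepseek-chat TL;DR

The paper proposes VulGNN, a lightweight graph-neural-network model for software vulnerability detection that performs close to large language models while being 100 times smaller and easier to retrain quickly and customize for deployment.

Abstract (Chinese translation)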

鉴于其基础能力、开源可用性及模型多样性，大语言模型（LLMs）已成为漏洞检测研究中的热门选择，但其庞大的计算需求限制了可扩展性。利用代码天然的图关系结构，我们提出的基于图神经网络（GNN）的漏洞检测深度学习模型VulGNN，在性能上可达到与大语言模型几乎相当的水平，但模型尺寸缩小了100倍，且能够快速重新训练与定制。本文阐述了VulGNN的架构，并对各组件、学习率以及在不同代码数据集上的泛化能力进行了消融研究。作为一种轻量化的漏洞分析模型，VulGNN高效且可部署于边缘，能够集成到现实世界的软件开发流程中。

Abstract

Large Language Models (LLMs) have emerged as a popular choice in vulnerability detection studies given their foundational capabilities, open source availability, and variety of models, but have limited scalability due to extensive compute requirements. Using the natural graph relational structure of code, we show that our proposed graph neural network (GNN) based deep learning model VulGNN for vulnerability detection can achieve performance almost on par with LLMs, but is 100 times smaller in size and fast to retrain and customize. We describe the VulGNN architecture, ablation studies on components, learning rates, and generalizability to different code datasets. As a lightweight model for vulnerability analysis, VulGNN is efficient and deployable at the edge as part of real-world software development pipelines.

Keywords: software vulnerability detection, graph neural network, lightweight model, large language models, edge deployment, deep learning, code analysis, efficiency

278. ❌ Improving Ensemble Forecasts of Abnormally Deflecting Tropical Cyclones with Fused Atmosphere-Ocean-Terrain Data

Authors: Qixiang Li, Shuwei Huo, Chong Wang, Xiaofeng Li, Yuan Zhou Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29200v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

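Rationale: The paper focuses on tropical cyclone forecasting with deep learning, which is an AI for Science application, so it has some relevance (5) to "AI for Science OR Bioinformatics OR Cheminformatics". It does not involve large language models (LLMs), model architectures (such as MoE), training methods (such as pre-training, fine-tuning, or alignment), reasoning techniques (such as CoT or RAG), model optimization (such as quantization or acceleration), agent systems, or any other listed large-model keyword. Its core is a specialized deep learning model for meteorology rather than general large-model research.

!!! tip deepseek-chat TL;DR

Existing deep learning methods cannot accurately forecast abnormally deflecting tropical cyclones. The study builds AOT-TCs, the first dataset to fuse heterogeneous multi-source atmosphere, ocean, and terrain data, and proposes a forecasting model with an explicit coupling architecture. On all Northwest Pacific tropical cyclone cases from 2017 to 2024 it achieves state-of-the-art performance, with markedly better accuracy for both normal and abnormally deflecting cyclones.

Abstract (Chinese translation)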

基于深度学习的台风预报方法展现出显著潜力与应用优势，其计算成本远低于数值天气预报模型且运行速度更快。然而现有深度学习方法仍存在关键局限：仅能处理单一类型的序列轨迹数据或同质气象变量，且无法实现对异常转向台风的精准预报。为应对这些挑战，我们提出两项突破性贡献。首先，我们构建了面向西北太平洋海域台风预报的多模态多源数据集AOT-TCs。作为该领域首个此类数据集，其创新性地融合了来自大气、海洋和陆地的异质变量，从而获得全面且信息丰富的气象数据集。其次，基于AOT-TCs数据集，我们提出了一种能同时处理正常与异常转向台风的预报模型。这是首个采用显式大气-海洋-地形耦合架构的台风预报模型，使其能有效捕捉跨物理域的复杂相互作用。对2017至2024年西北太平洋全部台风案例的广泛实验表明，我们的模型在台风预报中实现了最先进的性能：不仅显著提升了正常台风的预报精度，更突破了异常转向台风预报的技术瓶颈。

Abstract

Deep learning-based tropical cyclone (TC) forecasting methods have demonstrated significant potential and application advantages, as they feature much lower computational cost and faster operation speed than numerical weather prediction models. However, existing deep learning methods still have key limitations: they can only process a single type of sequential trajectory data or homogeneous meteorological variables, and fail to achieve accurate forecasting of abnormal deflected TCs. To address these challenges, we present two groundbreaking contributions. First, we have constructed a multimodal and multi-source dataset named AOT-TCs for TC forecasting in the Northwest Pacific basin. As the first dataset of its kind, it innovatively integrates heterogeneous variables from the atmosphere, ocean, and land, thus obtaining a comprehensive and information-rich meteorological dataset. Second, based on the AOT-TCs dataset, we propose a forecasting model that can handle both normal and abnormally deflected TCs. This is the first TC forecasting model to adopt an explicit atmosphere-ocean-terrain coupling architecture, enabling it to effectively capture complex interactions across physical domains. Extensive experiments on all TC cases in the Northwest Pacific from 2017 to 2024 show that our model achieves state-of-the-art performance in TC forecasting: it not only significantly improves the forecasting accuracy of normal TCs but also breaks through the technical bottleneck in forecasting abnormally deflected TCs.

Keywords: tropical cyclone forecasting, deep learning, multimodal dataset, atmosphere-ocean-terrain coupling, abnormal deflection, AOT-TCs, Northwest Pacific, state-of-the-art performance

279. ❌ Biomimetic PINNs for Cell-Induced Phase Transitions: UQ-R3 Sampling with Causal Gating

Authors: Anci Lin, Xiaohong Liu, Zhiwen Zhang, Weidong Zhao, Wenju Zhao Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29184v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

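Rationale: The paper focuses on physics-informed neural networks (PINNs) for cell-induced phase transitions, which falls under AI for Science, specifically biophysical simulation. It is unrelated to the vast majority of keywords (LLMs, MoE, SFT, RLHF, RAG, inference acceleration, hallucination mitigation, and so on), which concern large language models and related techniques (training, alignment, inference, applications). It mentions no language model, no innovation in foundation-model techniques, and no large-model application in any domain. It has some relevance only to "AI for Science OR Bioinformatics OR Cheminformatics", because it applies AI (PINNs) to a biophysical problem (cell phase transitions), but it is not core bioinformatics or cheminformatics, hence 5 (some relevance).

!!! tip deepseek-chat TL;DR

Nonconvex multi-well energies in cell-induced phase transitions produce sharp interfaces and microstructures. The paper proposes biomimetic physics-informed neural networks (Bio-PINNs), which use a progressive distance gate and an uncertainty-driven adaptive sampling strategy to recover sharp transition layers and morphologies, significantly outperforming existing baselines.

Abstract (Chinese translation)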

细胞诱导相变中的非凸多势阱能量会导致尖锐界面、精细尺度微结构以及距离依赖的细胞间耦合，这些均为物理信息学习带来了显著挑战。现有方法常在近场模式中出现过度平滑现象。为此，我们提出仿生物理信息神经网络（Bio-PINNs），该变分框架通过渐进式距离门控将时间因果性编码为显式空间因果性。此外，Bio-PINNs利用界面长度尺度的形变-不确定性代理来定位易产生微结构的区域，为显式二阶导数正则化提供了计算高效的替代方案。我们为由此产生的不确定性驱动型“保留-重采样-释放”自适应配点策略提供了理论保证，该策略确保在门控机制下实现持续覆盖，并建立了定量的近-远场增长界限。在单细胞与多细胞基准测试、不同间距条件及多种正则化体系中，Bio-PINNs均能稳定恢复尖锐的过渡层和束缚形态，显著优于当前最先进的自适应及无门控基线方法。

Abstract

Nonconvex multi-well energies in cell-induced phase transitions give rise to sharp interfaces, fine-scale microstructures, and distance-dependent inter-cell coupling, all of which pose significant challenges for physics-informed learning. Existing methods often suffer from over-smoothing in near-field patterns. To address this, we propose biomimetic physics-informed neural networks (Bio-PINNs), a variational framework that encodes temporal causality into explicit spatial causality via a progressive distance gate. Furthermore, Bio-PINNs leverage a deformation-uncertainty proxy for the interfacial length scale to target microstructure-prone regions, providing a computationally efficient alternative to explicit second-derivative regularization. We provide theoretical guarantees for the resulting uncertainty-driven "retain-resample-release" adaptive collocation strategy, which ensures persistent coverage under gating and establishes a quantitative near-to-far growth bound. Across single- and multi-cell benchmarks, diverse separations, and various regularization regimes, Bio-PINNs consistently recover sharp transition layers and tether morphologies, significantly outperforming state-of-the-art adaptive and ungated baselines.

Keywords: Biomimetic PINNs, cell-induced phase transitions, physics-informed neural networks, uncertainty quantification, adaptive collocation, nonconvex multi-well energies, sharp interfaces, microstructure recovery

280. ❌ IMPACT: Influence Modeling for Open-Set Time Series Anomaly Detection

Authors: Xiaohui Zhou, Yijie Wang, Hongzuo Xu, Weixuan Liang, Xiaoli Li, Guansong Pang Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29183v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

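Rationale: The paper focuses on time series anomaly detection and proposes a new framework, IMPACT, that uses influence modeling to handle open-set anomaly detection and training-data contamination. It covers time series analysis, anomaly detection, influence functions, data generation, and decontamination, but does not mention large models, deep learning principles, or any specific AI for Science application. Every keyword concerns large-model techniques, deep learning principles, or AI in a specific scientific field, and the paper's research area and methodology are unrelated to all of them.

!!! tip deepseek-chat TL;DR

The paper proposes IMPACT, a new framework that uses influence modeling to address two challenges in open-set time series anomaly detection: contaminated training data and the generation of realistic unseen anomaly patterns. Experiments show that it significantly outperforms existing state-of-the-art methods.

Abstract (Chinese translation)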

开放集异常检测（Open-set anomaly detection, OSAD）是一种新兴范式，旨在利用训练中见过的异常类别的有限标记数据，以在测试阶段识别已见及未见异常。现有方法依赖简单的增强技术生成模拟未见异常的伪异常样本。尽管在图像数据中表现出潜力，但这些方法在时间序列数据中效果不佳，因其未能保持序列的时序特性，导致生成无意义或不现实的异常模式。当训练数据被未标记的异常污染时，这些问题进一步加剧。本研究提出一种新颖框架 $\textbf{IMPACT}$，其基于$\underline{\textbf{i}}$nfluence $\underline{\textbf{m}}$odeling（影响建模）实现开放集时间序列$\underline{\textbf{a}}$nomaly dete$\underline{\textbf{ct}}$ion（异常检测），以应对上述挑战。其核心思想在于：$\textbf{i)}$ 学习一个能够准确估计单个训练样本对建模过程影响的影响函数，随后$\textbf{ii)}$ 利用这些影响分数生成语义上具有差异性且符合现实的时间序列未见异常，同时将高影响样本重新用作监督异常数据以进行异常净化。大量实验表明，IMPACT 显著优于现有先进方法，在不同OSAD设置及污染率下均展现出更优的检测准确性。

Abstract

Open-set anomaly detection (OSAD) is an emerging paradigm designed to utilize limited labeled data from anomaly classes seen in training to identify both seen and unseen anomalies during testing. Current approaches rely on simple augmentation methods to generate pseudo anomalies that replicate unseen anomalies. Despite being promising in image data, these methods are found to be ineffective in time series data due to the failure to preserve its sequential nature, resulting in trivial or unrealistic anomaly patterns. They are further plagued when the training data is contaminated with unlabeled anomalies. This work introduces $\textbf{IMPACT}$, a novel framework that leverages $\underline{\textbf{i}}$nfluence $\underline{\textbf{m}}$odeling for o$\underline{\textbf{p}}$en-set time series $\underline{\textbf{a}}$nomaly dete$\underline{\textbf{ct}}$ion, to tackle these challenges. The key insight is to $\textbf{i)}$ learn an influence function that can accurately estimate the impact of individual training samples on the modeling, and then $\textbf{ii)}$ leverage these influence scores to generate semantically divergent yet realistic unseen anomalies for time series while repurposing high-influential samples as supervised anomalies for anomaly decontamination. Extensive experiments show that IMPACT significantly outperforms existing state-of-the-art methods, showing superior accuracy under varying OSAD settings and contamination rates.

Keywords: open-set anomaly detection, time series, influence modeling, anomaly generation, data contamination, IMPACT, influence function, anomaly decontamination

281. ❌ Dummy-Aware Weighted Attack (DAWA): Breaking the Safe Sink in Dummy Class Defenses

Authors: Yunrui Yu, Xuxiang Feng, Pengda Qin, Pengyang Wang, Kafeng Wang, Cheng-zhong Xu, Hang Su, Jun Zhu Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29182v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

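Rationale: The paper focuses on adversarial machine learning and defense evaluation: it studies the weaknesses of Dummy Classes defenses and proposes a new attack, DAWA. The scoring keywords all concern large models, deep learning principles, or AI for science, whereas this paper evaluates adversarial defenses for conventional image classifiers (CIFAR-10) and is unrelated to all of them. It involves no large-model technique, training method, inference optimization, AI agent, or scientific AI application.

!!! tip deepseek-chat TL;DR

The paper shows that Dummy Classes-based adversarial defenses have their robustness overestimated under existing evaluation strategies, and proposes the DAWA attack, which lowers the measured robustness of a leading defense of this kind from 58.61% to 29.52% on CIFAR-10.

Abstract (Chinese translation)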

随着能够利用现有评估方法局限性的新型防御范式不断涌现，对抗鲁棒性评估面临严峻挑战。本文揭示，基于虚拟类（Dummy Classes）的防御方法——即引入一个额外的“虚拟”类别作为对抗样本的安全接收池——在AutoAttack等传统评估策略下会获得显著高估的鲁棒性。其根本局限源于这些攻击方法仅专注于误导真实类别标签，而这恰恰与防御机制相契合：成功的攻击仅仅被虚拟类所捕获。为弥补这一缺陷，我们提出了虚拟类感知加权攻击（Dummy-Aware Weighted Attack, DAWA），这是一种新颖的评估方法，在合成对抗样本时通过自适应加权同时针对真实标签和虚拟标签进行攻击。大量实验表明，DAWA能有效突破此类防御范式：在CIFAR-10数据集上，针对l_infty扰动（epsilon=8/255），DAWA将一种领先的基于虚拟类的防御方法所测得的鲁棒性从58.61%降至29.52%。本研究为评估此类新兴防御提供了更可靠的基准，并凸显了鲁棒性评估方法需要持续演进的重要性。

Abstract

Adversarial robustness evaluation faces a critical challenge as new defense paradigms emerge that can exploit limitations in existing assessment methods. This paper reveals that Dummy Classes-based defenses, which introduce an additional “dummy” class as a safety sink for adversarial examples, achieve significantly overestimated robustness under conventional evaluation strategies like AutoAttack. The fundamental limitation stems from these attacks’ singular focus on misleading the true class label, which aligns perfectly with the defense mechanism: successful attacks are simply captured by the dummy class. To address this gap, we propose Dummy-Aware Weighted Attack (DAWA), a novel evaluation method that simultaneously targets both the true label and dummy label with adaptive weighting during adversarial example synthesis. Extensive experiments demonstrate that DAWA effectively breaks this defense paradigm, reducing the measured robustness of a leading Dummy Classes-based defense from 58.61% to 29.52% on CIFAR-10 under $\ell_\infty$ perturbation ($\epsilon = 8/255$). Our work provides a more reliable benchmark for evaluating this emerging class of defenses and highlights the need for continuous evolution of robustness assessment methodologies.

Keywords: Adversarial Robustness, Dummy Classes Defense, DAWA Attack, Evaluation Method, CIFAR-10, AutoAttack, Safety Sink, Adversarial Examples

282. ❌ Segmentation of Gray Matters and White Matters from Brain MRI data

Authors: Chang Sun, Rui Shi, Tsukasa Koike, Tetsuro Sekine, Akio Morita, Tetsuya Sakai Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29171v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

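Scoring rationale: The paper proposes a modified MedSAM model for multi-class brain tissue segmentation, which falls under AI for Science (medical image analysis) and is highly relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (10 points). Because it builds on MedSAM, a foundation model, and involves pre-training and fine-tuning, it is relevant to "Large Language Models OR LLMs OR Foundation Models", "Pre-training OR Continual Pre-training OR Domain Adaptation", and "Post-training OR Supervised Fine-tuning OR SFT" (8 points each). The paper freezes the pre-trained image encoder and fine-tunes the prompt encoder and decoder, which reflects the idea of parameter-efficient fine-tuning and gives it some relevance to "PEFT OR LoRA OR Parameter-efficient Fine-tuning" (5 points). The remaining keywords mainly concern large-model techniques (such as MoE, inference acceleration, and alignment) or specific applications (such as agents and tool use); they are unrelated to this medical image segmentation task and all receive 0 points.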
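
The arithmetic behind the table above can be sketched as follows. This is a minimal sketch, not the report generator's actual code. The pass-mark formula (5.0 + 0.8 × total weight, with 27 keywords of weight 1.0) comes from the report's configuration header. The per-row rule `score = weight × relevance` and the function names are assumptions, chosen only because they are consistent with the rows shown: every weight in this table is 0.0, so every row scores 0.0 regardless of relevance.

```python
def passing_score(total_weight: float) -> float:
    """Pass mark as stated in the report configuration: 5.0 + 0.8 x total weight."""
    return 5.0 + 0.8 * total_weight


def row_score(weight: float, relevance: float) -> float:
    """ASSUMED per-keyword rule: weight x relevance (not confirmed by the report)."""
    return weight * relevance


# (weight, relevance) pairs for the rows of the table above with non-zero relevance.
rows = [(0.0, 8.0), (0.0, 8.0), (0.0, 8.0), (0.0, 5.0), (0.0, 10.0)]

total = sum(row_score(w, r) for w, r in rows)
threshold = passing_score(27 * 1.0)  # 27 keywords, each configured with weight 1.0

print(f"total={total:.1f} threshold={threshold:.1f} passed={total >= threshold}")
```

Under these assumptions a paper can be rated highly relevant to a keyword and still contribute nothing to the total when the weight recorded in its table is 0.0.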

!!! tip deepseek-chat TL;DR

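This study proposes a modified MedSAM foundation model that extends the mask decoder to three classes and fine-tunes it, enabling multi-class segmentation of gray matter and white matter in brain MRI and reaching Dice scores of up to 0.8751 on the IXI dataset.

Abstract translation (Chinese)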

从磁共振成像中精确分割灰质与白质等脑组织，对于研究大脑解剖结构、诊断神经系统疾病及监测病程进展至关重要。传统方法（如FSL FAST）能够生成组织概率图，但通常需要针对特定任务进行调整，且在多样化的成像条件下面临挑战。近期的基础模型（如MedSAM）提供了一种基于提示的方法，能够利用大规模预训练优势。本文提出一种改进的MedSAM模型，专用于多类别脑组织分割。我们的预处理流程包括：使用FSL BET进行颅骨剥离，采用FSL FAST生成组织概率图，并将其转换为带有多类别标签（背景、灰质和白质）的二维轴向、矢状和冠状切片。我们将MedSAM的掩码解码器扩展至三个类别，冻结预训练的图像编码器，并对提示编码器与解码器进行微调。在IXI数据集上的实验取得了最高0.8751的Dice分数。本研究表明，仅需最小限度的结构修改，MedSAM等基础模型即可适用于多类别医学图像分割任务。我们的研究结果提示，此类模型在未来工作中可进一步扩展至更广泛的医学影像应用场景。

Abstract

Accurate segmentation of brain tissues such as gray matter and white matter from magnetic resonance imaging is essential for studying brain anatomy, diagnosing neurological disorders, and monitoring disease progression. Traditional methods, such as FSL FAST, produce tissue probability maps but often require task-specific adjustments and face challenges with diverse imaging conditions. Recent foundation models, such as MedSAM, offer a prompt-based approach that leverages large-scale pretraining. In this paper, we propose a modified MedSAM model designed for multi-class brain tissue segmentation. Our preprocessing pipeline includes skull stripping with FSL BET, tissue probability mapping with FSL FAST, and converting these into 2D axial, sagittal, coronal slices with multi-class labels (background, gray matter, and white matter). We extend MedSAM’s mask decoder to three classes, freezing the pre-trained image encoder and fine-tuning the prompt encoder and decoder. Experiments on the IXI dataset achieve Dice scores up to 0.8751. This work demonstrates that foundation models like MedSAM can be adapted for multi-class medical image segmentation with minimal architectural modifications. Our findings suggest that such models can be extended to more diverse medical imaging scenarios in future work.

Keywords: brain tissue segmentation, gray matter, white matter, MRI, MedSAM, foundation model, fine-tuning, multi-class segmentation
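
The Dice score reported in the abstract above measures overlap between a predicted and a reference label map. The following is a minimal sketch of the standard per-class Dice coefficient, 2|P∩T| / (|P| + |T|), in plain Python. The label maps, the function name, and the handling of empty classes are illustrative assumptions; the abstract does not say how the authors aggregate Dice across classes or slices.

```python
def dice_per_class(pred, target, num_classes):
    """Return the Dice coefficient 2|P∩T| / (|P| + |T|) for each class label."""
    scores = []
    for c in range(num_classes):
        p = [label == c for label in pred]
        t = [label == c for label in target]
        intersection = sum(1 for a, b in zip(p, t) if a and b)
        denom = sum(p) + sum(t)
        # Convention (assumption): a class absent from both maps counts as a perfect match.
        scores.append(1.0 if denom == 0 else 2.0 * intersection / denom)
    return scores


# Hypothetical flattened label maps: 0 = background, 1 = gray matter, 2 = white matter.
pred = [0, 0, 1, 1, 2, 2, 2, 0]
target = [0, 1, 1, 1, 2, 2, 0, 0]

scores = dice_per_class(pred, target, num_classes=3)
print([round(s, 4) for s in scores])
```

A real pipeline would apply the same formula to 2D or 3D arrays; the flat lists here keep the example dependency-free.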

283. ❌ Efficient and Scalable Granular-ball Graph Coarsening Method for Large-scale Graph Node Classification

Authors: Guan Wang, Shuyin Xia, Lei Qian, Guoyin Wang, Yi Liu, Yi Wang, Wei Wang Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29148v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

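Scoring rationale: The paper focuses on a graph coarsening method for graph convolutional networks (GCNs) that improves training efficiency and scalability for large-scale graph node classification. All keywords concern large models, deep learning techniques, or scientific applications, whereas the paper does not involve any large model (such as an LLM), any of the listed deep learning techniques (such as MoE, Scaling Laws, or PEFT), or any scientific application (such as AI for Science). All keywords therefore receive a relevance of 0.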

!!! tip deepseek-chat TL;DR

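The paper proposes an efficient and scalable granular-ball graph coarsening method that trains a GCN with a linear-time multi-granularity graph coarsening algorithm and randomly sampled subgraphs, significantly improving training efficiency and performance for large-scale graph node classification.

Abstract translation (Chinese)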

图卷积网络（Graph Convolutional Network, GCN）是一种能够有效处理图数据任务的模型，并已成功获得应用。然而，针对大规模图数据集，GCN仍面临计算开销高的挑战，尤其是在图的卷积层数较多时。目前已有多种先进方法采用各类采样技术或图粗化技术来缓解训练过程中的不便，但这些方法中，有的忽略了图结构中的多粒度信息，而部分粗化方法的时间复杂度仍然较高。针对这些问题，基于我们先前的工作，本文提出了一种名为“面向大规模图节点分类的高效可扩展粒度球图粗化方法”的新框架。具体而言，该方法首先利用一种多粒度粒度球图粗化算法对原始图进行粗化，得到多个子图。此阶段的时间复杂度为线性，远低于现有图粗化方法。随后，由这些粒度球构成的子图被随机采样以形成小批量数据，用于训练GCN。我们的算法能够自适应地显著缩小原始图的规模，从而提升GCN的训练效率与可扩展性。最终，在多个数据集上的节点分类实验结果表明，本文所提方法展现出优越的性能。代码可在 https://anonymous.4open.science/r/1-141D/ 获取。

Abstract

Graph Convolutional Network (GCN) is a model that can effectively handle graph data tasks and has been successfully applied. However, for large-scale graph datasets, GCN still faces the challenge of high computational overhead, especially when the number of convolutional layers in the graph is large. Currently, there are many advanced methods that use various sampling techniques or graph coarsening techniques to alleviate the inconvenience caused during training. However, among these methods, some ignore the multi-granularity information in the graph structure, and the time complexity of some coarsening methods is still relatively high. In response to these issues, based on our previous work, in this paper, we propose a new framework called Efficient and Scalable Granular-ball Graph Coarsening Method for Large-scale Graph Node Classification. Specifically, this method first uses a multi-granularity granular-ball graph coarsening algorithm to coarsen the original graph to obtain many subgraphs. The time complexity of this stage is linear and much lower than that of the existing graph coarsening methods. Then, subgraphs composed of these granular-balls are randomly sampled to form minibatches for training GCN. Our algorithm can adaptively and significantly reduce the scale of the original graph, thereby enhancing the training efficiency and scalability of GCN. Ultimately, the experimental results of node classification on multiple datasets demonstrate that the method proposed in this paper exhibits superior performance. The code is available at https://anonymous.4open.science/r/1-141D/.

Keywords: Graph Convolutional Network, GCN, graph coarsening, granular-ball, large-scale graph, node classification, training efficiency, scalability

284. ❌ Quality-Controlled Active Learning via Gaussian Processes for Robust Structure-Property Learning in Autonomous Microscopy

Authors: Jawad Chowdhury, Ganesh Narasimha, Jan-Chi Yang, Yongtao Liu, Rama Vasudevan Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29135v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

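Scoring rationale: The paper focuses on materials science research in autonomous microscopy and proposes a gated active learning framework that combines curiosity-driven sampling with a physics-model-based quality-control filter to handle noisy data and make structure-property learning more reliable. All keywords relate directly to large models and deep learning techniques, but the paper does not involve any large model, deep learning, or related technique (such as LLMs, MoE, training methods, inference optimization, or agent systems). The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper applies AI to science (materials science); since it is not core bioinformatics or cheminformatics, it receives 5 points (some relevance). The other keywords are entirely unrelated and receive 0 points.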

!!! tip deepseek-chat TL;DR

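To address the degradation of structure-property learning caused by noisy data in autonomous microscopy, the paper proposes a gated active learning framework that combines curiosity-driven sampling with physics-model-based quality control. On experimental data it outperforms random sampling and standard active learning, and it was successfully deployed in real-time experiments, improving prediction reliability.

Abstract translation (Chinese)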

自主实验系统在材料研究中日益普及以加速科学发现，但其性能常受限于低质量、高噪声数据。这一问题在数据密集型结构-性能学习任务中尤为突出，例如图像到谱（Image-to-Spectrum, Im2Spec）与谱到图像（Spectrum-to-Image, Spec2Im）转换任务，传统的主动学习策略可能错误地优先选择低质量测量数据。本文提出一种门控主动学习框架，将好奇心驱动采样与基于简谐振荡模型拟合的物理信息质量控制过滤器相结合，使系统能在数据采集过程中自动排除低保真度数据。通过对预先获取的、含空间局部噪声的钛酸铅薄膜带激励压电响应谱（band-excitation piezoresponse spectroscopy, BEPS）数据集进行评估，结果表明该方法优于随机采样、标准主动学习及多任务学习策略。门控方法通过在训练和采集阶段处理噪声，提升了Im2Spec和Spec2Im任务的性能，实现了更可靠的正向与逆向预测。相比之下，标准主动学习器常将噪声误判为不确定性，最终采集损害模型性能的劣质样本。鉴于其良好的适用性，我们进一步将该框架部署于铋铁氧体薄膜的实时实验中，证明了其在真实自主显微实验中的有效性。总体而言，本研究推动了自驱动实验室向混合自主模式的转变，其中物理信息质量评估与主动决策协同工作，以实现更可靠的科学发现。

Abstract

Autonomous experimental systems are increasingly used in materials research to accelerate scientific discovery, but their performance is often limited by low-quality, noisy data. This issue is especially problematic in data-intensive structure-property learning tasks such as Image-to-Spectrum (Im2Spec) and Spectrum-to-Image (Spec2Im) translations, where standard active learning strategies can mistakenly prioritize poor-quality measurements. We introduce a gated active learning framework that combines curiosity-driven sampling with a physics-informed quality control filter based on the Simple Harmonic Oscillator model fits, allowing the system to automatically exclude low-fidelity data during acquisition. Evaluations on a pre-acquired dataset of band-excitation piezoresponse spectroscopy (BEPS) data from PbTiO3 thin films with spatially localized noise show that the proposed method outperforms random sampling, standard active learning, and multitask learning strategies. The gated approach enhances both Im2Spec and Spec2Im by handling noise during training and acquisition, leading to more reliable forward and inverse predictions. In contrast, standard active learners often misinterpret noise as uncertainty and end up acquiring bad samples that hurt performance. Given its promising applicability, we further deployed the framework in real-time experiments on BiFeO3 thin films, demonstrating its effectiveness in real autonomous microscopy experiments. Overall, this work supports a shift toward hybrid autonomy in self-driving labs, where physics-informed quality assessment and active decision-making work hand-in-hand for more reliable discovery.

Keywords: autonomous microscopy, active learning, quality control, structure-property learning, physics-informed model, noise handling, real-time experiments, self-driving labs
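
The gating idea described in the abstract above can be illustrated with a toy acquisition loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the `uncertainty` and `fit_quality` fields, the 0.8 threshold, and the function name are hypothetical, and the uncertainties are held fixed rather than re-estimated after each acquisition as a real active learner would do.

```python
def gated_acquisition(pool, budget, quality_threshold=0.8):
    """Visit pool locations in order of decreasing uncertainty and keep only
    measurements whose (hypothetical) model-fit quality passes the gate."""
    accepted, rejected = [], []
    for point in sorted(pool, key=lambda p: p["uncertainty"], reverse=True):
        if len(accepted) == budget:
            break
        if point["fit_quality"] >= quality_threshold:
            accepted.append(point["id"])
        else:
            rejected.append(point["id"])
    return accepted, rejected


# Hypothetical candidate locations. Point "a" is noisy: high uncertainty, poor fit.
pool = [
    {"id": "a", "uncertainty": 0.9, "fit_quality": 0.30},
    {"id": "b", "uncertainty": 0.7, "fit_quality": 0.95},
    {"id": "c", "uncertainty": 0.4, "fit_quality": 0.90},
    {"id": "d", "uncertainty": 0.2, "fit_quality": 0.99},
]

accepted, rejected = gated_acquisition(pool, budget=2)
print("accepted:", accepted, "rejected:", rejected)
```

An ungated learner with the same budget would keep points "a" and "b", which is the failure mode the abstract describes: noise is mistaken for uncertainty and the poor-quality sample enters the training set.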

285. ❌ Adaptive Delayed-Update Cyclic Algorithm for Variational Inequalities

Authors: Yi Wei, Xufeng Cai, Jelena Diakonikolas Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29128v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

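Scoring rationale: The paper studies a cyclic block coordinate algorithm (ADUCA) for variational inequality problems. It belongs to the field of optimization algorithms and is entirely unrelated to the scoring keywords, all of which center on large models, deep learning techniques, and their applications. The paper does not touch on large-model techniques, training methods, inference optimization, alignment, or AI applications.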

!!! tip deepseek-chat TL;DR

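The paper proposes the Adaptive Delayed-Update Cyclic Algorithm (ADUCA) for variational inequality problems with monotone Lipschitz operators. The algorithm requires no parameter tuning and is proven to attain (near) optimal global oracle complexity.

Abstract translation (Chinese)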

循环块坐标法是一类基础的一阶算法，因其简洁性和强大的实证性能而在实践中被广泛使用。然而，其理论行为仍难以解释，并且步长的设置——除了经典的最小化坐标下降法之外——通常需要仔细调整或线搜索机制。在本研究中，我们开发了 $\texttt{ADUCA}$（自适应延迟更新循环算法），这是一种用于解决具有单调Lipschitz算子的广泛Minty变分不等式类的循环算法。$\texttt{ADUCA}$ 是无参数的：它不需要全局或块级Lipschitz常数，并且除了初始化阶段外，不使用每轮线搜索。该算法的一个关键特征是使用延迟一个完整周期的算子信息，这使得算法与并行和分布式实现兼容，并因块间同步要求减弱而具有吸引力。我们证明，$\texttt{ADUCA}$ 实现了（近似）最优的全局预言机复杂度，作为目标误差 $ε>0$ 的函数，对于单调算子其复杂度按 $1/ε$ 缩放，而对于强单调算子则按 $\log^2(1/ε)$ 缩放。

Abstract

Cyclic block coordinate methods are a fundamental class of first-order algorithms, widely used in practice for their simplicity and strong empirical performance. Yet, their theoretical behavior remains challenging to explain, and setting their step sizes – beyond classical coordinate descent for minimization – typically requires careful tuning or line-search machinery. In this work, we develop $\texttt{ADUCA}$ (Adaptive Delayed-Update Cyclic Algorithm), a cyclic algorithm addressing a broad class of Minty variational inequalities with monotone Lipschitz operators. $\texttt{ADUCA}$ is parameter-free: it requires no global or block-wise Lipschitz constants and uses no per-epoch line search, except at initialization. A key feature of the algorithm is using operator information delayed by a full cycle, which makes the algorithm compatible with parallel and distributed implementations, and attractive due to weakened synchronization requirements across blocks. We prove that $\texttt{ADUCA}$ attains (near) optimal global oracle complexity as a function of target error $\epsilon > 0$, scaling with $1/\epsilon$ for monotone operators, or with $\log^2(1/\epsilon)$ for operators that are strongly monotone.

Keywords: cyclic block coordinate methods, variational inequalities, monotone Lipschitz operators, parameter-free algorithm, adaptive delayed-update, oracle complexity, distributed implementations
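
For readers unfamiliar with the terms in the abstract above, the standard textbook definitions are as follows. These are not quoted from the paper, whose exact problem class may include additional terms; the symbols $\mathcal{X}$, $F$, $L$, and $\mu$ are notation introduced here. Given a closed convex set $\mathcal{X}$ and an operator $F$, the Minty variational inequality asks for a point $x^\ast \in \mathcal{X}$ such that

$$\langle F(x),\, x - x^\ast \rangle \ge 0 \quad \text{for all } x \in \mathcal{X}.$$

The operator is monotone, $L$-Lipschitz, and $\mu$-strongly monotone, respectively, when for all $x, y \in \mathcal{X}$

$$\langle F(x) - F(y),\, x - y \rangle \ge 0, \qquad \|F(x) - F(y)\| \le L\,\|x - y\|, \qquad \langle F(x) - F(y),\, x - y \rangle \ge \mu\,\|x - y\|^2 .$$

In this notation, the rates stated in the abstract scale as $1/\epsilon$ in the monotone case and as $\log^2(1/\epsilon)$ in the strongly monotone case.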

286. ❌ Sampling-Horizon Neural Operator Predictors for Nonlinear Control under Delayed Inputs

Authors: Luke Bhan, Peter Quawas, Miroslav Krstic, Yuanyuan Shi Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29119v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

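Scoring rationale: The paper studies nonlinear control systems with delayed inputs and sampled measurements and proposes two neural-operator-based predictor-feedback designs; it belongs to control theory and neural operator applications. All scoring keywords relate to large language models, deep learning techniques, or AI-for-science applications, none of which this paper covers, so every keyword receives a relevance of 0.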

!!! tip deepseek-chat TL;DR

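For nonlinear control systems with delayed inputs and sampled measurements, the paper proposes two neural-operator predictor-feedback designs, establishes semi-global practical stability, and demonstrates accurate tracking with a 25× computational speedup on a 6-link nonlinear robotic manipulator.

Abstract translation (Chinese)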

现代控制系统常在输入延迟与采样状态测量下运行。一种常见的延迟补偿策略是预测器反馈；然而，其实用实现需要在线求解隐式常微分方程，导致难以承受的计算成本。此外，预测器公式通常假设状态测量可连续获取，而实际中测量值可能因硬件故障而呈现采样化、非均匀或暂时缺失的特点。在本研究中，我们针对具有延迟输入和采样测量的非线性系统，提出了两种神经算子预测器反馈设计方案。在第一种设计中，我们引入了一种采样区间预测算子，该算子将当前测量值与输入历史映射至下一采样区间内的预测状态轨迹。在第二种设计中，神经算子仅近似延迟补偿预测器，再将其与测量间隔内的闭环流相结合。第一种方法要求均匀采样，但其残差边界可直接随算子近似误差成比例缩放。相比之下，第二种方法能够适应非均匀但有界的采样时序，其代价是放大近似误差，这揭示了控制工程师在采样灵活性与近似敏感性之间面临的实际权衡。针对两种方案，我们建立了半全局实际稳定性，并给出了显式的神经算子误差相关边界。在六连杆非线性机器人操纵器上的数值实验表明，相较于基准方法，所提方案实现了精确跟踪，并获得了高达25倍的计算加速。

Abstract

Modern control systems frequently operate under input delays and sampled state measurements. A common delay-compensation strategy is predictor feedback; however, practical implementations require solving an implicit ODE online, resulting in intractable computational cost. Moreover, predictor formulations typically assume continuously available state measurements, whereas in practice measurements may be sampled, irregular, or temporarily missing due to hardware faults. In this work, we develop two neural-operator predictor-feedback designs for nonlinear systems with delayed inputs and sampled measurements. In the first design, we introduce a sampling-horizon prediction operator that maps the current measurement and input history to the predicted state trajectory over the next sampling interval. In the second design, the neural operator approximates only the delay-compensating predictor, which is then composed with the closed-loop flow between measurements. The first approach requires uniform sampling but yields residual bounds that scale directly with the operator approximation error. In contrast, the second accommodates non-uniform, but bounded sampling schedules at the cost of amplified approximation error, revealing a practical tradeoff between sampling flexibility and approximation sensitivity for the control engineer. For both schemes, we establish semi-global practical stability with explicit neural operator error-dependent bounds. Numerical experiments on a 6-link nonlinear robotic manipulator demonstrate accurate tracking and substantial computational speedup of 25$\times$ over a baseline approach.

Keywords: nonlinear control, delayed inputs, sampled measurements, neural operator, predictor feedback, computational speedup, robotic manipulator, stability analysis
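
The "implicit ODE" mentioned in the abstract above can be made concrete with the classical continuous-time predictor for a constant input delay. This is the standard textbook form, given here as background rather than as the paper's sampled-data formulation; the symbols $X$, $U$, $D$, $f$, $\kappa$, and $P$ are notation introduced for this sketch. For a plant $\dot{X}(t) = f\big(X(t),\, U(t - D)\big)$ with nominal feedback law $\kappa$, the predictor state $P(t) = X(t + D)$ satisfies

$$P(t) = X(t) + \int_{t-D}^{t} f\big(P(\theta),\, U(\theta)\big)\, d\theta, \qquad U(t) = \kappa\big(P(t)\big).$$

The relation is implicit because $P$ appears on both sides, so it has to be solved numerically at every control step. That online cost is what a learned operator, mapping the current measurement and the input history to the prediction, is meant to avoid.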

287. ❌ Predictor-Based Output-Feedback Control of Linear Systems with Time-Varying Input and Measurement Delays via Neural-Approximated Prediction Horizons

Authors: Luke Bhan, Miroslav Krstic, Yuanyuan Shi Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29117v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	2.0/10	0.0

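Scoring rationale: The paper studies predictor-feedback control for linear systems with time-varying input and measurement delays. Its main contributions are two methods for approximating the prediction horizon (numerical integration and a data-driven neural-operator method) and a proof of global exponential stability of the closed-loop system. Its core is control theory, time-delay systems, and neural network approximation, which is largely unrelated to the large-model, deep learning, and AI-application topics in the keyword list. The only slightly relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper uses neural networks (an AI method) to solve a scientific (control theory) problem; the relevance is weak and not central, so it receives 2 points. All other keywords are entirely unrelated and receive 0 points.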

!!! tip deepseek-chat TL;DR

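For linear systems with time-varying input and measurement delays, the paper proposes two prediction-horizon approximation methods, one based on numerical integration and one on neural operators, uses them to design an output-feedback predictor, and proves global exponential stability of the closed-loop system when the approximation error is sufficiently small.

Abstract translation (Chinese)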

由于简洁性和强稳定性保证，自20世纪50年代以来，预测器反馈方法一直是时滞系统中广受欢迎的研究手段。然而，对于时变时滞系统，其实现需要计算由时滞函数反函数定义的预测时域，而该反函数通常难以以闭合形式获得，必须进行近似处理。在本研究中，我们将时滞反函数映射构建为一个算子学习问题，并研究预测时域近似下的预测器反馈。我们提出了两种方法：（i）基于等效常微分方程时间积分的数值方法，以及（ii）利用神经算子学习反函数映射的数据驱动方法。我们证明，这两种方法都能在紧集上达到任意逼近精度，并在计算成本与可扩展性方面具有互补性的权衡。基于这些近似方法，我们进一步为输入和测量均存在时滞的系统设计了一种输出反馈预测器。我们证明，当预测时域的近似误差足够小时，所得到的闭环系统具有全局指数稳定性。最后，数值实验验证了所提方法，并阐明了其在精度与计算效率之间的权衡关系。

Abstract

Due to simplicity and strong stability guarantees, predictor feedback methods have stood as a popular approach for time delay systems since the 1950s. For time-varying delays, however, implementation requires computing a prediction horizon defined by the inverse of the delay function, which is rarely available in closed form and must be approximated. In this work, we formulate the inverse delay mapping as an operator learning problem and study predictor feedback under approximation of the prediction horizon. We propose two approaches: (i) a numerical method based on time integration of an equivalent ODE, and (ii) a data-driven method using neural operators to learn the inverse mapping. We show that both approaches achieve arbitrary approximation accuracy over compact sets, with complementary trade-offs in computational cost and scalability. Building on these approximations, we then develop an output-feedback predictor design for systems with delays in both the input and the measurement. We prove that the resulting closed-loop system is globally exponentially stable when the prediction horizon is approximated with sufficiently small error. Lastly, numerical experiments validate the proposed methods and illustrate their trade-offs between accuracy and computational efficiency.

Keywords: predictor feedback, time-varying delays, linear systems, neural operators, output-feedback control, global exponential stability, inverse delay mapping, approximation accuracy
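
The numerical route described in the abstract above, time integration of an equivalent ODE, can be sketched as follows. With $\phi(t) = t - D(t)$, the inverse $\sigma(t) = \phi^{-1}(t)$ satisfies $\dot{\sigma}(t) = 1 / \big(1 - D'(\sigma(t))\big)$, which follows from differentiating $\phi(\sigma(t)) = t$. Whether this is exactly the ODE used in the paper is an assumption; the delay function `D`, the RK4 integrator, and all function names below are hypothetical choices for illustration.

```python
import math


def D(t):
    """Hypothetical time-varying delay with |D'(t)| <= 0.5 < 1."""
    return 1.0 + 0.5 * math.sin(t)


def dD(t):
    """Derivative of the hypothetical delay."""
    return 0.5 * math.cos(t)


def phi(t):
    """Delayed time phi(t) = t - D(t)."""
    return t - D(t)


def initial_inverse(t0, iters=100):
    """Solve sigma = t0 + D(sigma) by fixed-point iteration (a contraction since |D'| < 1)."""
    sigma = t0
    for _ in range(iters):
        sigma = t0 + D(sigma)
    return sigma


def inverse_by_ode(t0, t1, steps=1000):
    """Integrate d(sigma)/dt = 1 / (1 - D'(sigma)) from t0 to t1 with classical RK4."""
    def rhs(s):
        return 1.0 / (1.0 - dD(s))

    sigma = initial_inverse(t0)
    h = (t1 - t0) / steps
    for _ in range(steps):
        k1 = rhs(sigma)
        k2 = rhs(sigma + 0.5 * h * k1)
        k3 = rhs(sigma + 0.5 * h * k2)
        k4 = rhs(sigma + h * k3)
        sigma += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return sigma


sigma_T = inverse_by_ode(0.0, 5.0)
residual = abs(phi(sigma_T) - 5.0)  # should be close to zero if sigma_T inverts phi
print(f"phi^-1(5.0) ~ {sigma_T:.6f}, residual = {residual:.2e}")
```

The residual check `phi(sigma_T) ≈ t` verifies the inverse without needing a closed form, which is the situation the abstract describes.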

288. ❌ Efficient Bilevel Optimization with KFAC-Based Hypergradients

Authors: Disen Liao, Felix Dangel, Yaoliang Yu Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29108v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

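Scoring rationale: The paper focuses on algorithmic improvements to bilevel optimization and proposes a hypergradient computation method based on KFAC (Kronecker-factored approximate curvature) to improve computational efficiency and performance. It covers general machine learning problems such as optimization algorithms, second-order derivative approximation, meta-learning, and AI safety, but none of the keyword topics: large-model (LLM) techniques, training methods, inference optimization, alignment, agent systems, or AI-for-science applications. All keywords relate directly to large models or to deep learning applied in specific domains, whereas this paper is general optimization research, so every keyword scores 0.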

!!! tip deepseek-chat TL;DR

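The paper proposes a KFAC-based second-order approximation for efficiently computing hypergradients in bilevel optimization. It significantly improves computational efficiency while maintaining performance, and is validated on multiple tasks including meta-learning and AI safety.

Abstract translation (Chinese)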

双层优化（BO）广泛应用于许多机器学习问题。然而，扩展BO需要重复计算超梯度，这涉及求解逆海森-向量积（IHVPs）。在实践中，这些操作通常使用粗略的近似方法进行估算，例如单步梯度展开或恒等/短诺依曼展开，这些方法忽略了曲率信息。我们基于隐函数定理算法，提出引入克罗内克分解近似曲率（KFAC），从而获得具有曲率感知的超梯度，其在性能与效率的权衡上优于共轭梯度（CG）或诺依曼方法，并持续超越梯度展开法。我们在多种任务中评估了该方法，包括元学习和人工智能安全问题。在包括BERT在内的模型上，我们证明了曲率信息在大规模场景中具有重要价值，而KFAC能够以适度的内存和运行时开销提供该信息。我们的实现可在https://github.com/liaodisen/NeuralBo获取。

Abstract

Bilevel optimization (BO) is widely applicable to many machine learning problems. Scaling BO, however, requires repeatedly computing hypergradients, which involves solving inverse Hessian-vector products (IHVPs). In practice, these operations are often approximated using crude surrogates such as one-step gradient unrolling or identity/short Neumann expansions, which discard curvature information. We build on implicit function theorem-based algorithms and propose to incorporate Kronecker-factored approximate curvature (KFAC), yielding curvature-aware hypergradients with a better performance efficiency trade-off than Conjugate Gradient (CG) or Neumann methods and consistently outperforming unrolling. We evaluate this approach across diverse tasks, including meta-learning and AI safety problems. On models up to BERT, we show that curvature information is valuable at scale, and KFAC can provide it with only modest memory and runtime overhead. Our implementation is available at https://github.com/liaodisen/NeuralBo.

Keywords: Bilevel Optimization, Hypergradients, KFAC, Inverse Hessian-vector Products, Meta-learning, AI Safety, Curvature-aware, Optimization Algorithms
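
The quantities named in the abstract above can be written out with standard identities. These are background formulas, not the paper's exact derivation; the symbols $f$, $g$, $\lambda$, $\theta$, $A$, $G$, and $V$ are notation introduced here. For the bilevel problem $\min_\lambda F(\lambda) = f(\lambda, \theta^\ast(\lambda))$ with $\theta^\ast(\lambda) = \arg\min_\theta g(\lambda, \theta)$, the implicit function theorem gives the hypergradient

$$\nabla F(\lambda) = \nabla_\lambda f - \nabla^2_{\lambda\theta}\, g \,\big[\nabla^2_{\theta\theta}\, g\big]^{-1} \nabla_\theta f,$$

with all derivatives evaluated at $(\lambda, \theta^\ast(\lambda))$. The term $\big[\nabla^2_{\theta\theta}\, g\big]^{-1} \nabla_\theta f$ is the inverse Hessian-vector product (IHVP). If the curvature of one layer is approximated by a Kronecker product $A \otimes G$ of two small symmetric invertible factors, the inverse can be applied without ever forming the large matrix:

$$(A \otimes G)^{-1} \operatorname{vec}(V) = \operatorname{vec}\big(G^{-1} V A^{-1}\big),$$

where $\operatorname{vec}$ stacks columns and $V$ is the vector reshaped to the layer's weight shape. This turns one large linear solve into two small ones per layer, which is consistent with the modest memory and runtime overhead the abstract reports.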

289. ❌ HCLSM: Hierarchical Causal Latent State Machines for Object-Centric World Modeling

Authors: Jaber Jaber, Osama Jaber Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29090v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

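Scoring rationale: The paper focuses on a world model architecture (HCLSM) and is highly relevant to the keyword "World Models AND General World Models" (10 points), since this is its core topic. The other keywords mainly concern large language models (LLMs) and related techniques (such as fine-tuning, alignment, reasoning, and agents), whereas this paper studies world models for video prediction and robotic manipulation and does not involve language models or text processing, so all other keywords receive 0 points.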

!!! tip deepseek-chat TL;DR

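The paper proposes HCLSM (Hierarchical Causal Latent State Machines), a world model architecture that uses object-centric decomposition, hierarchical temporal dynamics, and causal structure learning to address object entanglement, missing causal structure, and flattened temporal dynamics in existing world models. It achieves low prediction loss and efficient inference on a robotic manipulation benchmark.

Abstract translation (Chinese)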

能够从视频中预测未来状态的世界模型，目前仍受限于扁平的潜在表征：这些表征往往将物体纠缠在一起、忽略因果结构，并将时间动态压缩至单一尺度。我们提出HCLSM，一种基于三个相互关联原则构建的世界模型架构：通过带有空间广播解码的槽注意力实现以物体为中心的分解；通过结合选择性状态空间模型（用于连续物理过程）、稀疏变换器（用于离散事件）和压缩变换器（用于抽象目标）的三级引擎实现分层时间动态；以及通过图神经网络交互模式进行因果结构学习。HCLSM引入了一种两阶段训练协议：在动态预测开始之前，通过空间重建迫使槽实现功能专门化。我们在Open X-Embodiment数据集中的PushT机器人操作基准上训练了一个6800万参数的模型，实现了0.008的下一状态预测均方误差损失，同时展现出涌现的空间分解能力（SBD损失：0.0075）和习得的事件边界。为SSM扫描定制的Triton内核相比顺序PyTorch实现了38倍的加速。整个系统包含51个模块，共计8478行Python代码，并配有171个单元测试。代码：https://github.com/rightnow-ai/hclsm

Abstract

World models that predict future states from video remain limited by flat latent representations that entangle objects, ignore causal structure, and collapse temporal dynamics into a single scale. We present HCLSM, a world model architecture that operates on three interconnected principles: object-centric decomposition via slot attention with spatial broadcast decoding, hierarchical temporal dynamics through a three-level engine combining selective state space models for continuous physics, sparse transformers for discrete events, and compressed transformers for abstract goals, and causal structure learning through graph neural network interaction patterns. HCLSM introduces a two-stage training protocol where spatial reconstruction forces slot specialization before dynamics prediction begins. We train a 68M-parameter model on the PushT robotic manipulation benchmark from the Open X-Embodiment dataset, achieving 0.008 MSE next-state prediction loss with emerging spatial decomposition (SBD loss: 0.0075) and learned event boundaries. A custom Triton kernel for the SSM scan delivers 38x speedup over sequential PyTorch. The full system spans 8,478 lines of Python across 51 modules with 171 unit tests. Code: https://github.com/rightnow-ai/hclsm

Keywords: world models, object-centric decomposition, hierarchical temporal dynamics, causal structure learning, slot attention, state space models, robotic manipulation, next-state prediction
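
The abstract compares a custom Triton kernel against a "sequential PyTorch" SSM scan. As a rough illustration of what a sequential scan looks like, the sketch below runs a generic diagonal linear recurrence one step at a time in NumPy. It is an assumption-laden stand-in, not HCLSM's code: the recurrence form, the function name `sequential_scan`, and the toy inputs are all hypothetical.

```python
import numpy as np

def sequential_scan(a, b):
    """Compute h[t] = a[t] * h[t-1] + b[t] with h[-1] = 0, one time step at a time.

    a, b: arrays of shape (T, D). This generic recurrence is an assumption;
    the paper's actual SSM parameterization is not given in the abstract.
    """
    h = np.zeros_like(b)
    prev = np.zeros_like(b[0])
    for t in range(len(b)):          # strictly sequential: step t needs step t-1
        prev = a[t] * prev + b[t]
        h[t] = prev
    return h

# Toy inputs (hypothetical): constant decay 0.5 and constant input 1.0.
a = np.full((4, 2), 0.5)
b = np.ones((4, 2))
states = sequential_scan(a, b)
print(states[:, 0])
```

The loop-carried dependency on `prev` is what makes the naive version slow on a GPU; parallel scan kernels restructure this recurrence rather than change its result.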

290. ❌ Realistic Market Impact Modeling for Reinforcement Learning Trading Environments

Authors: Lucas Riera Abbade, Anna Helena Reali Costa Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.29086v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

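Scoring rationale: The paper focuses on reinforcement learning in financial trading environments, in particular on integrating nonlinear market impact models to improve transaction cost modeling. Its content centers entirely on RL algorithms (A2C, PPO, DDPG, SAC, TD3), trading environments (MACE, margin trading, portfolio optimization), and financial modeling (the Almgren-Chriss framework, the square-root impact law). All of the given keywords relate directly to large models, deep learning techniques, or AI applications in science, whereas this paper involves no large models, language models, training or fine-tuning techniques, inference optimization, AI agents, or scientific AI applications, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

By integrating nonlinear market impact models based on the Almgren-Chriss framework, the paper develops three Gymnasium-compatible trading environments to address the failure of RL trading agents under realistic execution, which arises when backtests assume negligible or fixed transaction costs. The results show that the cost model materially changes algorithm rankings and trading behavior, and they underline the importance of hyperparameter optimization.

Abstract (Chinese translation)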

强化学习（RL）在交易领域展现出潜力，然而大多数开源回测环境假设交易成本可忽略或固定，导致智能体习得的交易行为在真实执行中失效。本文介绍了三个与Gymnasium兼容的交易环境——MACE（市场调整成本执行）股票交易、保证金交易和投资组合优化——这些环境整合了基于Almgren-Chriss框架及经实证验证的平方根冲击定律的非线性市场冲击模型。每个环境均提供可插拔的成本模型、具有指数衰减的永久冲击追踪功能，以及全面的交易级日志记录。我们在纳斯达克100指数上评估了五种深度强化学习算法（A2C、PPO、DDPG、SAC、TD3），将固定的10个基点基准与采用Optuna调参的AC模型进行对比。研究结果表明：（i）成本模型在所有三种环境中均显著改变了绝对性能及算法的相对排名；（ii）AC模型产生了截然不同的交易行为，例如日交易成本从20万美元降至8千美元，换手率从19%下降至1%；（iii）超参数优化对于约束异常交易行为至关重要，可使成本降低高达82%；（iv）算法与成本模型的交互作用具有强烈的环境特异性，例如在保证金交易中，DDPG的样本外夏普比率在AC模型下从-2.1跃升至0.3，而SAC的夏普比率则从-0.5下降至-1.2。我们将整套环境作为FinRL-Meta的开源扩展公开发布。

Abstract

Reinforcement learning (RL) has shown promise for trading, yet most open-source backtesting environments assume negligible or fixed transaction costs, causing agents to learn trading behaviors that fail under realistic execution. We introduce three Gymnasium-compatible trading environments – MACE (Market-Adjusted Cost Execution) stock trading, margin trading, and portfolio optimization – that integrate nonlinear market impact models grounded in the Almgren-Chriss framework and the empirically validated square-root impact law. Each environment provides pluggable cost models, permanent impact tracking with exponential decay, and comprehensive trade-level logging. We evaluate five DRL algorithms (A2C, PPO, DDPG, SAC, TD3) on the NASDAQ-100, comparing a fixed 10 bps baseline against the AC model with Optuna-tuned hyperparameters. Our results show that (i) the cost model materially changes both absolute performance and the relative ranking of algorithms across all three environments; (ii) the AC model produces dramatically different trading behavior, e.g., daily costs dropping from $200k to $8k with turnover falling from 19% to 1%; (iii) hyperparameter optimization is essential for constraining pathological trading, with costs dropping up to 82%; and (iv) algorithm-cost model interactions are strongly environment-specific, e.g., DDPG’s OOS Sharpe jumps from -2.1 to 0.3 under AC in margin trading while SAC’s drops from -0.5 to -1.2. We release the full suite as an open-source extension to FinRL-Meta.

Keywords: Reinforcement Learning, Trading Environments, Market Impact Models, Almgren-Chriss Framework, Transaction Costs, DRL Algorithms, Hyperparameter Optimization, FinRL-Meta
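
The abstract contrasts a fixed 10 bps cost with a nonlinear model based on the square-root impact law. A minimal sketch of that law in its commonly quoted form, impact ≈ Y · σ · sqrt(Q / V), is shown below. The function name, the coefficient `y_coeff`, and the example numbers are assumptions for illustration; they are not the paper's environment API or calibrated parameters.

```python
import math

def sqrt_impact_bps(order_shares, daily_volume, daily_volatility, y_coeff=1.0):
    """Square-root law: impact ~ Y * sigma * sqrt(Q / V), returned in basis points.

    order_shares (Q), daily_volume (V), daily_volatility (sigma) and y_coeff (Y)
    are hypothetical inputs chosen for this sketch.
    """
    if order_shares < 0 or daily_volume <= 0:
        raise ValueError("order_shares must be >= 0 and daily_volume must be > 0")
    participation = order_shares / daily_volume
    return y_coeff * daily_volatility * math.sqrt(participation) * 1e4

FIXED_COST_BPS = 10.0  # the fixed baseline mentioned in the abstract

# An order of 1% of daily volume versus one of 4% (2% daily volatility).
small = sqrt_impact_bps(10_000, 1_000_000, 0.02)
large = sqrt_impact_bps(40_000, 1_000_000, 0.02)
print(f"small order: {small:.1f} bps, large order: {large:.1f} bps, fixed: {FIXED_COST_BPS} bps")
```

Under this form, quadrupling the order size doubles the per-share impact, whereas a fixed-bps model charges the same rate regardless of size. That size dependence is one plausible reason an agent trained under the nonlinear model would trade less aggressively.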

291. ❌ Is the Modality Gap a Bug or a Feature? A Robustness Perspective

Authors: Rhea Chowers, Oshri Naparstek, Udi Barzelay, Yair Weiss Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.29080v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

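Scoring rationale: The paper studies the modality gap in multi-modal models such as CLIP and its relationship to robustness, which belongs to multi-modal representation learning. Although it involves deep learning, its content has no direct connection to any of the scoring keywords, which mainly target the principles, training methods, inference optimization, and applications of large language models (LLMs). The paper does not mention LLMs, MoE, SLMs, scaling laws, training methods (pre-training, fine-tuning, alignment, and so on), inference techniques, agent systems, model compression, hallucination mitigation, interpretability, or any other keyword-related concept. Its focus is the representation-space properties of vision-language multi-modal models, not LLM technology.

!!! tip deepseek-chat TL;DR

The paper studies the modality gap, the separation between image and text embedding distributions in multi-modal models. It shows that under certain conditions the gap is monotonically related to robustness, and that a simple post-processing step that shrinks the gap can significantly improve robustness without any loss of clean accuracy.

Abstract (Chinese translation)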

许多现代多模态模型（如CLIP）致力于寻求一个使两种模态对齐的嵌入空间。令人略感意外的是，几乎所有现有模型都表现出显著的模态鸿沟：在共享嵌入空间中，图像的分布与文本的分布明显分离。尽管近期有一系列论文探讨此现象，但鸿沟存在的原因以及在后处理中消除鸿沟是否会提升下游任务性能，仍不明确。本文证明，在一定条件下，最小化对比损失会生成一种表征，其中两种模态被一个全局鸿沟向量分隔，该向量与它们的嵌入表示正交。我们还表明，在此条件下，模态鸿沟与模型鲁棒性呈单调相关：缩小鸿沟不会改变模型在原始数据上的准确率，但会降低嵌入表示受扰动时模型输出发生改变的可能性。实验表明，对于许多实际视觉语言模型（VLMs），通过简单的后处理步骤——将一种模态的表示向另一模态的均值方向移动——即可显著提升模型鲁棒性，且不损失原始准确率。

Abstract

Many modern multi-modal models (e.g. CLIP) seek an embedding space in which the two modalities are aligned. Somewhat surprisingly, almost all existing models show a strong modality gap: the distribution of images is well-separated from the distribution of texts in the shared embedding space. Despite a series of recent papers on this topic, it is still not clear why this gap exists nor whether closing the gap in post-processing will lead to better performance on downstream tasks. In this paper we show that under certain conditions, minimizing the contrastive loss yields a representation in which the two modalities are separated by a global gap vector that is orthogonal to their embeddings. We also show that under these conditions the modality gap is monotonically related to robustness: decreasing the gap does not change the clean accuracy of the models but makes it less likely that a model will change its output when the embeddings are perturbed. Our experiments show that for many real-world VLMs we can significantly increase robustness by a simple post-processing step that moves one modality towards the mean of the other modality, without any loss of clean accuracy.

Keywords: modality gap, multi-modal models, CLIP, embedding space, robustness, contrastive loss, post-processing, vision-language models
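
The post-processing step described in the abstract, moving one modality toward the mean of the other, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the function name, the interpolation parameter `alpha`, and the random toy embeddings are hypothetical, and the sketch omits any re-normalization the authors may apply.

```python
import numpy as np

def shift_toward_other_mean(text_emb, image_emb, alpha=1.0):
    """Translate text embeddings along the gap vector (image mean - text mean).

    alpha=0 leaves the embeddings unchanged; alpha=1 makes the two means coincide.
    Both the name and the alpha parameter are assumptions of this sketch.
    """
    gap = image_emb.mean(axis=0) - text_emb.mean(axis=0)
    return text_emb + alpha * gap

# Toy embeddings (hypothetical): two clusters separated by a constant offset.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(5, 4)) + 2.0
text_emb = rng.normal(size=(5, 4)) - 2.0

closed = shift_toward_other_mean(text_emb, image_emb, alpha=1.0)
gap_before = np.linalg.norm(image_emb.mean(axis=0) - text_emb.mean(axis=0))
gap_after = np.linalg.norm(image_emb.mean(axis=0) - closed.mean(axis=0))
print(gap_before, gap_after)
```

Because the shift is a single global translation, distances among the text embeddings themselves are unchanged. Only their position relative to the image embeddings moves.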

292. ❌ How much of persistent homology is topology? A quantitative decomposition for spin model phase transitions

Authors: Matthew Loftus Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.29072v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

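Scoring rationale: The paper studies the application of topological data analysis (TDA) to phase transition detection in classical spin models, specifically how much of the persistent homology (PH) signal is topological. It belongs to computational physics and TDA and is unrelated to deep learning or large-model techniques. The only slightly relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper applies AI/computational methods to a scientific problem (phase transition detection), but these are not core large-model or deep learning techniques, so it receives 5 points (some relevance). All other keywords are entirely unrelated and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper asks how much of the persistent homology signal used to detect phase transitions in classical spin models is genuinely topological. Using a quantitative decomposition against density-matched shuffled null models, it finds that H_0 statistics are mainly density-driven, while H_1 statistics are partially topological, with a topological fraction that grows with system size.

Abstract (Chinese translation)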

点云持续同调（PH）——通过计算自旋位置点云上的阿尔法复形或里普斯复形——自Donato等人（2016）以来已被广泛应用于检测经典自旋模型中的相变，后续研究将这种检测能力归因于持续图的拓扑内容。我们提出了一个尚未被提及的简单问题：PH信号中究竟有多大比例是真正拓扑的？我们引入了f_topo，这是一种定量分解方法，通过将真实自旋构型与密度匹配的随机重排零模型进行比较，分离出任何PH统计量中密度驱动与拓扑贡献的部分。通过对二维伊辛模型（系统尺寸L = 16-128，十个温度）和波茨模型（q = 3, 5）的研究，我们发现H_0统计量——包括总持续性、持续熵、特征数量——有94-100%是密度驱动的（f_topo < 0.07）。密度匹配的随机重排零模型在相同位置并以与真实构型相当的峰值高度检测到临界温度T_c，这表明仅凭密度信息就足以实现相变检测。然而，H_1统计量则部分具有拓扑性：其拓扑分数随系统尺寸增长，满足delta(TP_{H_1}) ~ L^{0.53}，并遵循有限尺寸标度坍塌规律delta(T, L) = L^{0.53} g(tL^{1/nu})，坍塌质量CV = 0.27。最长的持续条具有强烈的拓扑性（f_topo > 1），且与关联长度标度相关。尺度分辨分析表明，随着L增大，拓扑超额会从大尺度特征向小尺度特征转移。我们建议相变拓扑数据分析（TDA）领域将随机重排零模型采纳为标准实践方法，并在寻求真实拓扑信息时使用H_1而非H_0统计量。

Abstract

Point-cloud persistent homology (PH) – computing alpha or Rips complexes on spin-position point clouds – has been widely applied to detect phase transitions in classical spin models since Donato et al. (2016), with subsequent studies attributing the detection to the topological content of the persistence diagram. We ask a simple question that has not been posed: what fraction of the PH signal is genuinely topological? We introduce f_topo, a quantitative decomposition that separates the density-driven and topological contributions to any PH statistic by comparing real spin configurations against density-matched shuffled null models. Across the 2D Ising model (system sizes L = 16-128, ten temperatures) and Potts models (q = 3, 5), we find that H_0 statistics – total persistence, persistence entropy, feature count – are 94-100% density-driven (f_topo < 0.07). The density-matched shuffled null detects T_c at the identical location and with comparable peak height as real configurations, showing that density alone is sufficient for phase transition detection. However, H_1 statistics are partially topological: the topological fraction grows with system size as delta(TP_{H_1}) ~ L^{0.53} and follows a finite-size scaling collapse delta(T, L) = L^{0.53} g(tL^{1/nu}) with collapse quality CV = 0.27. The longest persistence bar is strongly topological (f_topo > 1) and scales with the correlation length. A scale-resolved analysis reveals that the topological excess shifts from large-scale to small-scale features as L increases. We propose that the TDA-for-phase-transitions community adopt shuffled null models as standard practice, and that H_1 rather than H_0 statistics be used when genuine topological information is sought.

Keywords: persistent homology, topological data analysis, phase transitions, spin models, Ising model, Potts models, density-driven signal, topological fraction

293. ❌ On the Mirage of Long-Range Dependency, with an Application to Integer Multiplication

Authors: Zichao Wei Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.29069v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

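Scoring rationale: The paper studies how integer multiplication is represented for neural networks. It challenges the view that long-range dependency is intrinsic to the difficulty of multiplication and proposes that changing the computational spacetime representation (a 2D outer-product grid) makes the operations local. Its content focuses on how neural network architectures (such as Transformer and Mamba) perform on a specific task and on a theoretical discussion of computational representations. All scoring keywords relate directly to large-model principles, training methods, application areas, or specific techniques (MoE, RLHF, RAG, and so on), whereas this paper involves no large models, innovations in deep learning principles, or applications in scientific domains, and mentions none of the specific techniques in the scoring keywords. Therefore every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper challenges the view that integer multiplication is hard to learn because of long-range dependency. It shows that a 2D outer-product grid representation makes multiplication local, allowing a small neural cellular automaton to achieve perfect length generalization, while Transformer and other architectures fail under the same representation.

Abstract (Chinese translation)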

整数乘法长期以来被视为神经网络的难题，其困难性被广泛归因于进位链引发的O(n)长程依赖。我们认为这一诊断是错误的：长程依赖并非乘法的固有属性，而是由计算时空（computational spacetime）选择所产生的幻象。我们形式化了幻象的概念并给出构造性证明：当两个n位二进制整数以二维外积网格形式排布时，长乘法的每一步运算均可坍缩为$3 \times 3$局部邻域操作。在此表示下，仅含321个可学习参数的神经细胞自动机（neural cellular automaton）实现了高达训练范围$683\times$的完美长度泛化。五种替代架构——包括Transformer（6,625参数）、Transformer+RoPE以及Mamba——在相同表示下均告失败。我们进一步分析了局部成功案例如何将研究社区锁定在错误诊断中，并主张任何被诊断为需要长程依赖的任务，都应首先检验该依赖是任务固有的，还是由计算时空所诱发的。

Abstract

Integer multiplication has long been considered a hard problem for neural networks, with the difficulty widely attributed to the O(n) long-range dependency induced by carry chains. We argue that this diagnosis is wrong: long-range dependency is not an intrinsic property of multiplication, but a mirage produced by the choice of computational spacetime. We formalize the notion of mirage and provide a constructive proof: when two n-bit binary integers are laid out as a 2D outer-product grid, every step of long multiplication collapses into a $3 \times 3$ local neighborhood operation. Under this representation, a neural cellular automaton with only 321 learnable parameters achieves perfect length generalization up to $683\times$ the training range. Five alternative architectures – including Transformer (6,625 params), Transformer+RoPE, and Mamba – all fail under the same representation. We further analyze how partial successes locked the community into an incorrect diagnosis, and argue that any task diagnosed as requiring long-range dependency should first be examined for whether the dependency is intrinsic to the task or induced by the computational spacetime.

Keywords: integer multiplication, long-range dependency, neural networks, computational spacetime, neural cellular automaton, length generalization, Transformer, Mamba
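
To make the "2D outer-product grid" layout concrete, the sketch below places the partial-product bits of two n-bit integers on a grid and checks that summing them with the right weights recovers the product. It illustrates the layout only. The $3 \times 3$ local update rule and the neural cellular automaton from the abstract are not reproduced, and all function names are hypothetical.

```python
def to_bits(x, n):
    """Little-endian list of the n lowest bits of x."""
    return [(x >> i) & 1 for i in range(n)]

def outer_product_grid(a, b, n):
    """grid[i][j] = a_i AND b_j, the partial-product bit that carries weight 2**(i + j)."""
    a_bits, b_bits = to_bits(a, n), to_bits(b, n)
    return [[ai & bj for bj in b_bits] for ai in a_bits]

def product_from_grid(grid):
    """Sum every cell with its positional weight; cells on one anti-diagonal share a weight."""
    return sum(bit << (i + j)
               for i, row in enumerate(grid)
               for j, bit in enumerate(row))

grid = outer_product_grid(13, 11, 4)
for row in grid:
    print(row)
print(product_from_grid(grid))
```

Each cell depends on exactly one bit from each operand, so building the grid is purely local. The global summation used here for checking is what the paper claims can be replaced by repeated local neighborhood operations.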

294. ❌ ARCS: Autoregressive Circuit Synthesis with Topology-Aware Graph Attention and Spec Conditioning

Authors: Tushar Dhananjay Pathak Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.29068v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

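Scoring rationale: ARCS focuses on automated generation of analog circuit designs using deep learning (a graph VAE, a flow-matching model, a Graph Transformer) and reinforcement learning (GRPO), an application of AI to scientific computing and engineering. It involves none of the keywords on LLM techniques, training methods, inference optimization, alignment, or agent systems. The only relevance is to "AI for Science", because circuit design is a scientific/engineering application, but the paper does not explicitly mention bioinformatics or cheminformatics, so it receives 5 points (some relevance). All other keywords are entirely unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

ARCS is a fast automated analog circuit generation system based on deep learning and reinforcement learning. Using topology-aware graph attention and GRPO, it maintains high simulation validity while producing designs roughly 600x to 1000x faster than search-based methods.

Abstract (Chinese translation)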

本文提出ARCS系统，一种用于摊销式模拟电路生成的方法，能够在毫秒级时间内生成完整且可进行SPICE仿真的设计（包括拓扑结构与元件参数），而基于搜索的方法通常需要数分钟。该系统采用混合流程，结合了两个学习生成器（图变分自编码器与流匹配模型）和基于SPICE的排序机制，仅需8次SPICE评估即可在32种拓扑结构上实现99.9%的仿真有效性（奖励分数6.43/8.0），其评估次数比遗传算法减少40倍。在单模型推理场景中，采用拓扑感知图变换器（Graph Transformer）结合“三选一”候选方案选择策略，可在97毫秒内达到85%的仿真有效性，速度比随机搜索提升600倍以上。核心技术创新是组相对策略优化（Group Relative Policy Optimization, GRPO）：本文揭示了REINFORCE算法的一个关键失效模式（跨拓扑奖励分布失配），并通过拓扑分组优势归一化方法解决了该问题，仅用500步强化学习（比常规少10倍）就将仿真有效性较REINFORCE提升了9.6个百分点。此外，通过语法约束解码结合拓扑感知的令牌掩码机制，从结构上保证了100%的设计结构有效性。虽然ARCS在单设计质量上尚未完全达到基于搜索的优化水平（奖励分数5.48对比7.48），但其超过1000倍的速度优势为快速原型设计、设计空间探索以及为搜索方法提供热启动创造了条件（在仿真次数减少49%的情况下即可恢复遗传算法96.6%的设计质量）。

Abstract

I present ARCS, a system for amortized analog circuit generation that produces complete, SPICE-simulatable designs (topology and component values) in milliseconds rather than the minutes required by search-based methods. A hybrid pipeline combining two learned generators (a graph VAE and a flow-matching model) with SPICE-based ranking achieves 99.9% simulation validity (reward 6.43/8.0) across 32 topologies using only 8 SPICE evaluations, 40x fewer than genetic algorithms. For single-model inference, a topology-aware Graph Transformer with Best-of-3 candidate selection reaches 85% simulation validity in 97ms, over 600x faster than random search. The key technical contribution is Group Relative Policy Optimization (GRPO): I identify a critical failure mode of REINFORCE (cross-topology reward distribution mismatch) and resolve it with per-topology advantage normalization, improving simulation validity by +9.6pp over REINFORCE in only 500 RL steps (10x fewer). Grammar-constrained decoding additionally guarantees 100% structural validity by construction via topology-aware token masking. ARCS does not yet match the per-design quality of search-based optimization (5.48 vs. 7.48 reward), but its >1000x speed advantage enables rapid prototyping, design-space exploration, and warm-starting search methods (recovering 96.6% of GA quality with 49% fewer simulations).

Keywords: analog circuit synthesis, graph transformer, reinforcement learning, GRPO, topology-aware generation, SPICE simulation, autoregressive generation, design automation
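
The abstract attributes the REINFORCE failure to reward distributions that differ across topologies and resolves it with per-topology advantage normalization. A minimal NumPy sketch of that normalization step is given below. The function name, the `eps` stabilizer, and the topology labels are assumptions for illustration, not the author's implementation.

```python
import numpy as np

def per_topology_advantages(rewards, topology_ids, eps=1e-8):
    """Standardize rewards within each topology group: (r - group mean) / (group std + eps)."""
    rewards = np.asarray(rewards, dtype=float)
    topology_ids = np.asarray(topology_ids)
    advantages = np.zeros_like(rewards)
    for topo in np.unique(topology_ids):
        mask = topology_ids == topo
        group = rewards[mask]
        advantages[mask] = (group - group.mean()) / (group.std() + eps)
    return advantages

# Hypothetical batch: one topology earns high rewards on average, the other low.
rewards = [7.0, 8.0, 1.0, 2.0]
topology_ids = ["buck", "buck", "ldo", "ldo"]
adv = per_topology_advantages(rewards, topology_ids)
print(adv)
```

With a single global baseline, every sample from the low-reward topology would receive a negative advantage regardless of its quality relative to its peers. Normalizing within each group removes that cross-topology offset.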

295. ❌ Data-informed lifting line theory

Authors: Arjun Sharma, Jonas A. Actor, Peter A. Bosler Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.29051v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

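Scoring rationale: The paper proposes a data-driven framework that extends the predictive capability of classical lifting-line theory (LLT) by incorporating higher-fidelity aerodynamic data, using a neural network with convolutional and fully connected layers to learn corrections to LLT outputs. Its core is aerodynamic simulation and the use of neural networks in scientific computing, which falls under AI for Science, so it has some relevance to "AI for Science OR Bioinformatics OR Cheminformatics" (5 points). However, the paper does not involve LLMs, innovations in deep learning principles (MoE, scaling laws, pre-training), model optimization techniques (PEFT, quantization), inference techniques (RAG, CoT), alignment techniques (RLHF, instruction tuning), or agent systems (LLM agents), so all of those keywords score 0.

!!! tip deepseek-chat TL;DR

The study proposes a data-driven neural network framework that corrects classical lifting-line theory (LLT) by incorporating higher-fidelity aerodynamic data. It captures higher-order three-dimensional effects in regimes where LLT is inaccurate, such as low aspect ratio and high sweep, while retaining LLT's computational efficiency, which makes it suitable for aerodynamic optimization and early-stage aircraft design.

Abstract (Chinese translation)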

本文提出一种数据驱动框架，通过融合面元法模拟获得的高精度气动数据，将经典升力线理论（LLT）的预测能力拓展至更广泛的气动工况。我们开发了一种包含卷积层与全连接层的神经网络架构，该架构由两个并行子网络构成，分别处理展向配置点数据以及全局几何/气动输入参数（如攻角、弦长、扭转角、翼型分布和后掠角）。在测试的多种构型中，此架构在学习对LLT输出结果的修正方面表现最为有效。训练后的模型能够捕捉LLT失效工况（如低展弦比、大后掠角）下展向升阻力分布的高阶三维效应，并对超出LLT适用范围及训练数据范围的机翼构型展现出良好的泛化能力。该方法保持了LLT的计算效率，便于集成至气动优化循环与早期飞机设计研究中。该框架为将高精度修正嵌入低阶方法提供了实用路径，并可扩展至其他气动预测任务（如螺旋桨性能预测）。

Abstract

We present a data-driven framework that extends the predictive capability of classical lifting-line theory (LLT) to a wider aerodynamic regime by incorporating higher-fidelity aerodynamic data from panel method simulations. A neural network architecture with a convolutional layer followed by fully connected layers is developed, comprising two parallel subnetworks to separately process spanwise collocation points and global geometric/aerodynamic inputs such as angle of attack, chord, twist, airfoil distribution, and sweep. Among several configurations tested, this architecture is most effective in learning corrections to LLT outputs. The trained model captures higher-order three-dimensional effects in spanwise lift and drag distributions in regimes where LLT is inaccurate, such as low aspect ratios and high sweep, and generalizes well to wing configurations outside both the LLT regime and the training data range. The method retains LLT’s computational efficiency, enabling integration into aerodynamic optimization loops and early-stage aircraft design studies. This approach offers a practical path for embedding high-fidelity corrections into low-order methods and may be extended to other aerodynamic prediction tasks, such as propeller performance.

Keywords: data-driven framework, lifting-line theory, neural network, aerodynamic prediction, computational efficiency, wing configurations, higher-fidelity data, aerodynamic optimization

296. ❌ A Latent Risk-Aware Machine Learning Approach for Predicting Operational Success in Clinical Trials based on TrialsBank

Authors: Iness Halimi, Emmanuel Piffo, Oumnia Boudersa, Yvan Marcel Carre Vilmorin, Melissa Ait-ikhlef, Karima Kone, Andy Tan, Augustin Medina, Juliette Hernando, Sheila Ernest, Vatche Bartekian, Karine Lalonde, Mireille E Schnitzer, Gianolli Dorcelus Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.29041v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

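Scoring rationale: The paper focuses on predicting the operational success of clinical trials with traditional machine learning methods (XGBoost, CatBoost, Explainable Boosting Machines) and does not involve large models, deep learning, or any of the listed large-model keywords. The only relevance is that it falls within the broad scope of "AI for Science", since it applies AI to biomedical research (clinical trials), so that keyword receives 5 points (some relevance). All other keywords are entirely unrelated and receive 0 points.

!!! tip deepseek-chat TL;DR

The study proposes a latent risk-aware machine learning framework that uses more than 180 features from the TrialsBank database to prospectively predict the operational success of clinical trials. It achieves F1-scores of 0.91-0.93 across Phase I-III trials, demonstrating the feasibility of early risk assessment.

Abstract (Chinese translation)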

临床试验具有成本高昂、周期漫长和操作风险显著的特点，然而在试验启动前可靠地预测其成功率的先验方法仍然有限。现有的人工智能方法通常聚焦于孤立指标或特定开发阶段，且常依赖于试验设计阶段无法获取的变量，限制了实际应用性。我们提出了一种分层潜在风险感知机器学习框架，用于前瞻性预测临床试验的操作成功率。该框架使用由Sorintellis开发的专有AI就绪数据库TrialsBank中精选的13,700项试验数据子集。操作成功定义为能够按照计划时间表、招募目标和方案要求，完成从启动、执行到数据库锁定的完整临床试验过程。该方法将操作成功预测分解为两个建模阶段：首先，利用试验启动前可获取的180余项药物与试验层面特征，预测中间潜在操作风险因子；随后将这些预测的潜在风险整合至下游模型中，以估算操作成功概率。研究采用分阶段数据划分策略以防止信息泄露，并使用XGBoost、CatBoost和可解释增强机（Explainable Boosting Machines）进行模型基准测试。该框架在I-III期临床试验中均表现出优异的样本外预测性能，F1分数分别达到0.93、0.92和0.91。潜在风险驱动因素的引入提升了对操作失败的判别能力，且在独立推理评估中保持稳健性能。这些结果表明，通过潜在风险感知人工智能框架能够前瞻性预测临床试验操作成功率，从而实现早期风险评估并支持数据驱动的临床开发决策。

Abstract

Clinical trials are characterized by high costs, extended timelines, and substantial operational risk, yet reliable prospective methods for predicting trial success before initiation remain limited. Existing artificial intelligence approaches often focus on isolated metrics or specific development stages and frequently rely on variables unavailable at the trial design phase, limiting real-world applicability. We present a hierarchical latent risk-aware machine learning framework for prospective prediction of clinical trial operational success using a curated subset of TrialsBank, a proprietary AI-ready database developed by Sorintellis, comprising 13,700 trials. Operational success was defined as the ability to initiate, conduct, and complete a clinical trial according to planned timelines, recruitment targets, and protocol specifications through database lock. This approach decomposes operational success prediction into two modeling stages. First, intermediate latent operational risk factors are predicted using more than 180 drug- and trial-level features available before trial initiation. These predicted latent risks are then integrated into a downstream model to estimate the probability of operational success. A staged data-splitting strategy was employed to prevent information leakage, and models were benchmarked using XGBoost, CatBoost, and Explainable Boosting Machines. Across Phase I-III, the framework achieves strong out-of-sample performance, with F1-scores of 0.93, 0.92, and 0.91, respectively. Incorporating latent risk drivers improves discrimination of operational failures, and performance remains robust under independent inference evaluation. These results demonstrate that clinical trial operational success can be prospectively forecasted using a latent risk-aware AI framework, enabling early risk assessment and supporting data-driven clinical development decision-making.

Keywords: clinical trial operational success, latent risk-aware machine learning, prospective prediction, TrialsBank database, XGBoost CatBoost, hierarchical modeling, operational risk factors, data-driven clinical development

297. ❌ From Astronomy to Astrology: Testing the Illusion of Zodiac-Based Personality Prediction with Machine Learning

Authors: Abhinna Sundar Samantaray, Finnja Annika Fluhrer, Dhruv Saini, Omkar Charaple, Anish Kumar Singh, Dhruv Vansraj Rathore Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.29033v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

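Scoring rationale: The paper uses machine-learning methods (logistic regression, random forest, neural networks) to test the validity of zodiac-based personality prediction, framed as a critical social experiment. Its content is unrelated to every scoring keyword, all of which concern the technical principles, optimization methods, or scientific applications of large models and deep learning. It does not involve large-model techniques, training methods, inference optimization, alignment, parameter-efficient fine-tuning, RAG, long context, attention mechanisms, reasoning methods, agents, quantization, hallucination mitigation, interpretability, world models, model merging, in-context learning, or any concrete AI for Science application. The experiments rely only on conventional machine-learning classifiers, so this is a basic machine-learning application rather than innovative research on large models or deep learning.

!!! tip deepseek-chat TL;DR

The study uses machine-learning classifiers to test zodiac-based personality prediction and finds that predictive performance is on par with random guessing. This indicates that zodiac systems lack a reliable predictive basis and that their apparent success stems from cognitive biases and interpretive flexibility.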

摘要翻译

占星学长期以来被用于解读人类性格、评估匹配度并指导社会决策。基于黄道十二宫的系统在世界许多地区——包括南亚社会——仍具有文化影响力，在这些社会中，占星推理可能影响婚姻配对、命名习俗、仪式择时以及更广泛的人生规划。尽管其持续存在，占星学从未建立起物理上合理的机制或统计上可靠的预测基础。在本研究中，我们采用受控的机器学习框架检验基于黄道十二宫的性格预测。我们构建了一个合成数据集，其中个体被分配从100项广泛人类特征池中抽取的黄道星座（zodiac signs）和性格标签。每个星座与10个常见描述词的子集相关联，这些描述词有意与其他星座分配的词汇重叠，从而复现了实际占星系统的模糊性特征。随后，我们训练逻辑回归（Logistic Regression）、随机森林（Random Forest）和神经网络分类器，以基于星座特征及干扰协变量推断性格标签。在所有实验中，预测性能均处于或接近随机预期水平，而打乱标签的对照组产生了相当的准确度。我们认为，占星学表面上的成功并非源于可测量的预测结构，而是源于特质的普遍性、类别重叠、巴纳姆效应（Barnum effect）和确认偏误（confirmation bias）等认知偏差，以及占星师与评论家的解释灵活性。我们得出结论：基于黄道十二宫的系统无法为预测人类行为提供可靠信息，而是作为一种文化上持久的叙事框架发挥作用。本文旨在进行一次幽默的学术演练。

摘要 (Abstract)

Astrology has long been used to interpret human personality, estimate compatibility, and guide social decision-making. Zodiac-based systems in particular remain culturally influential across much of the world, including in South Asian societies where astrological reasoning can shape marriage matching, naming conventions, ritual timing, and broader life planning. Despite this persistence, astrology has never established either a physically plausible mechanism or a statistically reliable predictive foundation. In this work, we examine zodiac-based personality prediction using a controlled machine-learning framework. We construct a synthetic dataset in which individuals are assigned zodiac signs and personality labels drawn from a shared pool of 100 broadly human traits. Each sign is associated with a subset of 10 common descriptors, intentionally overlapping with those assigned to other signs, thereby reproducing the ambiguity characteristic of practical astrological systems. We then train Logistic Regression, Random Forest, and neural-network classifiers to infer personality labels from zodiac-based features and nuisance covariates. Across all experiments, predictive performance remains at or near random expectation, while shuffled-label controls yield comparable accuracies. We argue that the apparent success of astrology arises not from measurable predictive structure, but from trait universality, category overlap, cognitive biases such as the Barnum effect and confirmation bias, and the interpretive flexibility of astrologers and pundits. We conclude that zodiac-based systems do not provide reliable information for predicting human behavior and instead function as culturally durable narrative frameworks. This paper is intended as a humorous academic exercise.

Keywords: astrology, zodiac, personality prediction, machine learning, synthetic dataset, cognitive biases, Barnum effect, cultural narrative
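
The shuffled-label control mentioned in the abstract above can be illustrated with a small standard-library sketch. It is not the authors' code: the dataset here draws each trait label independently of the sign (a simplifying assumption, since the paper uses overlapping descriptor subsets), and the "classifier" is a hypothetical majority-vote rule per sign rather than logistic regression or a random forest. All function names are made up for illustration.

```python
import random
from collections import Counter, defaultdict

N_SIGNS, N_TRAITS = 12, 100  # 12 zodiac signs, pool of 100 traits (as in the abstract)

def make_dataset(n, seed=0):
    """Synthetic people; ASSUMPTION: the trait label is independent of the sign."""
    rng = random.Random(seed)
    return [(rng.randrange(N_SIGNS), rng.randrange(N_TRAITS)) for _ in range(n)]

def fit_majority_per_sign(train):
    """Toy classifier: remember the most frequent training label for each sign."""
    by_sign = defaultdict(Counter)
    for sign, trait in train:
        by_sign[sign][trait] += 1
    return {sign: counts.most_common(1)[0][0] for sign, counts in by_sign.items()}

def accuracy(model, test):
    """Fraction of test people whose trait equals the prediction for their sign."""
    hits = sum(1 for sign, trait in test if model.get(sign) == trait)
    return hits / len(test)

def shuffle_labels(train, seed=1):
    """Control: permute labels so any sign-trait link in the data is destroyed."""
    rng = random.Random(seed)
    labels = [trait for _, trait in train]
    rng.shuffle(labels)
    return [(sign, label) for (sign, _), label in zip(train, labels)]

data = make_dataset(20_000)
train, test = data[:15_000], data[15_000:]
acc_real = accuracy(fit_majority_per_sign(train), test)
acc_shuffled = accuracy(fit_majority_per_sign(shuffle_labels(train)), test)
chance = 1 / N_TRAITS
print(f"real={acc_real:.3f} shuffled={acc_shuffled:.3f} chance={chance:.3f}")
```

If the real-label accuracy is indistinguishable from both the shuffled-label accuracy and the chance level, the features carry no usable signal, which is the pattern the abstract reports.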

298. ❌ Improving Efficiency of GPU Kernel Optimization Agents using a Domain-Specific Language and Speed-of-Light Guidance

Authors: Siva Kumar Sastry Hari, Vignesh Balaji, Sana Damani, Qijing Huang, Christos Kozyrakis Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.29010v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

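Scoring rationale: The paper's core topic is the use of LLM agents for GPU kernel optimization, an application of large models to a specific domain with clear technical novelty. It is highly relevant to "Large Language Models" and "LLM Agents" (10 points) because it explicitly uses LLM agents for optimization. It is somewhat relevant to "Chain of Thought" and "System 2 Thinking" (5 points) because the agent's reasoning process is involved, and to "Tool Use" and "In-context Learning" (5 points) because the DSL serves as a tool and is learned in context. Other keywords such as MoE, SFT, and RAG are not covered and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper studies how a domain-specific language (DSL) and Speed-of-Light (SOL) guidance can make LLM agents more efficient at optimizing GPU kernels. Experiments show that the approach substantially improves performance and reduces compute cost.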

摘要翻译

利用LLM智能体优化GPU内核是一个在庞大设计空间中进行迭代的过程。每个候选方案都必须经过生成、编译、验证和分析，因此减少尝试次数可以同时节省运行时间和成本。我们得出两个关键观察。首先，智能体所操作的抽象层级至关重要。若层级过低，大语言模型会将推理资源浪费在影响甚微的细节上；若层级过高，则可能遗漏重要的优化选项。其次，智能体难以轻易判断何时达到收益递减点，从而在持续搜索中浪费资源。
这些观察催生了两个旨在提升效率的设计原则：(1) 一种紧凑的领域特定语言，它可在上下文中学习，使模型能在更高层级进行推理，同时保留关键的优化控制手段；(2) 光速性能边界指导，它利用第一性原理性能界限来引导搜索并分配预算。我们在$μ$CUTLASS中实现了这些原则。$μ$CUTLASS是一种面向CUTLASS支持的GPU内核的领域特定语言，配有编译器，涵盖内核配置、收尾融合以及多级流水线。我们运用光速性能边界指导来预估性能提升空间并引导优化尝试，对接近光速边界的问题降低优先级，并标记那些在基准测试中取巧的内核。
在59个KernelBench问题上，在保持相同迭代预算的情况下，使用GPT-5-mini从生成低级代码切换为生成DSL代码，将相对于PyTorch的0.40倍几何平均性能衰退转变为1.27倍加速。加入光速边界引导的搜索策略后，加速比进一步提升至1.56倍。在不同模型层级上，$μ$CUTLASS结合光速边界引导能使较弱的模型以更低的令牌成本超越更强的基线智能体。采用光速边界引导的预算分配策略可节省19-43%的令牌，同时保留至少95%的几何平均加速比，其中最佳策略实现了1.68倍的效率提升。最后，光速边界分析有助于检测基准测试取巧的情况，即某些内核可能看似运行很快，却未能执行预期的计算。

摘要 (Abstract)

Optimizing GPU kernels with LLM agents is an iterative process over a large design space. Every candidate must be generated, compiled, validated, and profiled, so fewer trials will save both runtime and cost. We make two key observations. First, the abstraction level that agents operate at is important. If it is too low, the LLM wastes reasoning on low-impact details. If it is too high, it may miss important optimization choices. Second, agents cannot easily tell when they reach the point of diminishing returns, wasting resources as they continue searching. These observations motivate two design principles to improve efficiency: (1) a compact domain-specific language (DSL) that can be learned in context and lets the model reason at a higher level while preserving important optimization levers, and (2) Speed-of-Light (SOL) guidance that uses first-principles performance bounds to steer and budget search. We implement these principles in $μ$CUTLASS, a DSL with a compiler for CUTLASS-backed GPU kernels that covers kernel configuration, epilogue fusion, and multi-stage pipelines. We use SOL guidance to estimate headroom and guide optimization trials, deprioritize problems that are near SOL, and flag kernels that game the benchmark. On 59 KernelBench problems with the same iteration budgets, switching from generating low-level code to DSL code using GPT-5-mini turns a 0.40x geomean regression into a 1.27x speedup over PyTorch. Adding SOL-guided steering raises this to 1.56x. Across model tiers, $μ$CUTLASS + SOL-guidance lets weaker models outperform stronger baseline agents at lower token cost. SOL-guided budgeting saves 19-43% of tokens while retaining at least 95% of geomean speedup, with the best policy reaching a 1.68x efficiency gain. Lastly, SOL analysis helps detect benchmark-gaming cases, where kernels may appear fast while failing to perform the intended computation.

Keywords: LLM agents, GPU kernel optimization, domain-specific language, Speed-of-Light guidance, efficiency improvement, CUTLASS, in-context learning, performance bounds
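
To make the idea of a first-principles performance bound concrete, here is a minimal roofline-style sketch for a matrix multiplication. It is an illustration only, not the paper's SOL model: the device numbers, the minimum-traffic assumption (each matrix is read or written exactly once), the `near_sol` threshold, and the `triage` labels are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Device:
    peak_flops: float     # FLOP/s (hypothetical value supplied by the caller)
    mem_bandwidth: float  # bytes/s (hypothetical value supplied by the caller)

def gemm_sol_seconds(m, n, k, bytes_per_elem, dev):
    """Lower bound on GEMM runtime: the slower of the compute and memory limits."""
    flops = 2.0 * m * n * k                                 # one multiply-add per (i, j, l)
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # ASSUMPTION: minimal traffic
    return max(flops / dev.peak_flops, bytes_moved / dev.mem_bandwidth)

def triage(measured_seconds, sol_seconds, near_sol=1.1):
    """Classify a kernel by its headroom, i.e. measured time divided by the bound."""
    ratio = measured_seconds / sol_seconds
    if ratio < 1.0:
        return "suspicious"    # faster than the bound: the kernel may skip work
    if ratio <= near_sol:
        return "deprioritize"  # little headroom left, further search is unlikely to pay
    return "optimize"

dev = Device(peak_flops=100e12, mem_bandwidth=1e12)
sol = gemm_sol_seconds(4096, 4096, 4096, bytes_per_elem=2, dev=dev)
for measured in (0.5e-3, 1.4e-3, 3.0e-3):
    print(f"{measured:.1e} s -> {triage(measured, sol)}")
```

A measured time below the bound cannot come from a kernel that performs the full computation, which is why such cases are worth flagging as possible benchmark gaming.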

299. ❌ ParetoEnsembles.jl: A Julia Package for Multiobjective Parameter Estimation Using Pareto Optimal Ensemble Techniques

Authors: Jeffrey D. Varner Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29986v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

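Scoring rationale: The paper focuses on developing a Julia package for multiobjective parameter estimation (ParetoEnsembles.jl) that uses Pareto optimal ensemble techniques and is applied to biological system models (gene expression and blood coagulation). All keywords relate directly to large models, deep learning principles, or AI applications, but the paper involves no large models, deep learning, LLMs, MoE, training methods, inference optimization, or AI agents. The only connection is that the paper belongs to scientific computing and bioinformatics, so "AI for Science OR Bioinformatics OR Cheminformatics" receives 5 points (somewhat relevant) and every other keyword receives 0 points (unrelated).

!!! tip deepseek-chat TL;DR

The paper presents ParetoEnsembles.jl, a Julia package for multiobjective parameter estimation that uses an improved Pareto optimal ensemble technique to generate parameter ensembles characterizing model uncertainty. It is demonstrated on a cell-free gene expression model and a blood coagulation model and validated against experimental data.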

摘要翻译

自然系统与人造系统的数学模型通常包含许多可调参数，这些参数必须从多个可能相互冲突的数据集中进行估计。相较于仅报告单一的最优拟合参数向量，生成一组能够共同映射竞争目标间权衡关系的参数集合往往更具信息价值。本文介绍ParetoEnsembles.jl——一个开源的Julia软件包，它采用帕累托最优集合技术生成此类参数集合。该技术基于模拟退火算法，无需梯度信息。本实现将原始优势关系从弱帕累托优势修正为严格帕累托优势，通过增量更新方案将每次迭代的排序成本从$O(n^2 m)$降低至$O(nm)$，并增加了多链并行执行以提升前沿覆盖度。我们通过两个案例展示该软件包的功能：一个拟合实验数据的无细胞基因表达模型，以及一个包含十个待估速率常数和三个目标的血液凝固级联模型。一项受控合成数据研究揭示了参数可识别性结构：尽管单个速率常数存在数倍偏差，但模型预测精度仍可达7%。五次重复的覆盖度分析证实，时序特征能被可靠覆盖，而峰值振幅存在系统性过度自信问题。基于已发表实验性凝血酶生成数据的验证表明，尽管存在固有模型近似误差，该参数集合对未参与训练的实验条件的预测误差仍能控制在10%以内。通过使集合生成变得轻量化且易于使用，ParetoEnsembles.jl旨在降低机理建模中常规不确定性表征的技术门槛。

摘要 (Abstract)

Mathematical models of natural and man-made systems often have many adjustable parameters that must be estimated from multiple, potentially conflicting datasets. Rather than reporting a single best-fit parameter vector, it is often more informative to generate an ensemble of parameter sets that collectively map out the trade-offs among competing objectives. This paper presents ParetoEnsembles.jl, an open-source Julia package that generates such ensembles using Pareto Optimal Ensemble Techniques (POETs), a simulated-annealing-based algorithm that requires no gradient information. The implementation corrects the original dominance relation from weak to strict Pareto dominance, reduces the per-iteration ranking cost from $O(n^2 m)$ to $O(nm)$ through an incremental update scheme, and adds multi-chain parallel execution for improved front coverage. We demonstrate the package on a cell-free gene expression model fitted to experimental data and a blood coagulation cascade model with ten estimated rate constants and three objectives. A controlled synthetic-data study reveals parameter identifiability structure, with individual rate constants off by several-fold yet model predictions accurate to 7%. A five-replicate coverage analysis confirms that timing features are reliably covered while peak amplitude is systematically overconfident. Validation against published experimental thrombin generation data demonstrates that the ensemble predicts held-out conditions to within 10% despite inherent model approximation error. By making ensemble generation lightweight and accessible, ParetoEnsembles.jl aims to lower the barrier to routine uncertainty characterization in mechanistic modeling.

Keywords: Pareto optimal ensemble, multiobjective parameter estimation, Julia package, simulated annealing, uncertainty characterization, biological modeling, gene expression model, blood coagulation cascade
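
The dominance relation and the ranking cost discussed in the abstract above can be sketched as follows. This is written in Python for illustration, not taken from the Julia package, and it rests on two assumptions: "weak" dominance is read as "no worse in every objective", and the stricter relation additionally requires being strictly better in at least one objective. The `add_point` update is one plausible way to obtain an $O(nm)$ step; the abstract does not describe the package's actual scheme.

```python
def weakly_dominates(a, b):
    """a is no worse than b in every objective (minimization). Duplicates dominate each other."""
    return all(x <= y for x, y in zip(a, b))

def strictly_dominates(a, b):
    """a is no worse everywhere and strictly better in at least one objective."""
    return weakly_dominates(a, b) and any(x < y for x, y in zip(a, b))

def pareto_rank(errors):
    """Full re-ranking, O(n^2 m): rank = number of points that dominate each point."""
    return [
        sum(strictly_dominates(other, e) for j, other in enumerate(errors) if j != i)
        for i, e in enumerate(errors)
    ]

def add_point(errors, ranks, new):
    """Incremental update, O(n m): compare the new point once with each stored point."""
    new_rank = 0
    for i, e in enumerate(errors):
        if strictly_dominates(e, new):
            new_rank += 1
        elif strictly_dominates(new, e):
            ranks[i] += 1
    errors.append(new)
    ranks.append(new_rank)

errors = [(1.0, 2.0), (1.0, 2.0), (2.0, 1.0), (3.0, 3.0)]
print(pareto_rank(errors))

inc_errors, inc_ranks = [], []
for e in errors:
    add_point(inc_errors, inc_ranks, e)
print(inc_ranks)
```

Under the weak relation the two identical points would each count as dominated by the other and drop off the front; under the stricter relation both keep rank 0.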

300. ❌ Growth-rate distributions at stationarity

Authors: Edgardo Brigatti Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29916v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

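Scoring rationale: The paper studies growth-rate distributions generated by stationary time series in ecology and statistics, proposes new analytical tools and a generalized logistic distribution model, and identifies macroecological patterns. Its content lies entirely within traditional statistics and ecology and involves no large models, deep learning, AI techniques, or related applications. All keywords concern large-model technology, AI methods, or AI for science, and have no connection to this statistical-ecology study.

!!! tip deepseek-chat TL;DR

The paper studies the statistical properties of growth-rate distributions in stationary time series, proposes the generalized logistic distribution as an effective model for them, and identifies key stylized macroecological patterns together with the stochastic differential equations capable of reproducing them.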

摘要翻译

我们提出了一套新的分析工具，用于描述由平稳时间序列生成的增长率分布。分析表明，偏离正态性的现象并非如某些传统观点所暗示的异常行为，而是可以通过清晰且普适的统计原理加以解释。相反，严格的正态性反而是特定建模选择所导致的结果。以平稳伽马分布或重尾丰度分布为特征的系统，其对数增长率分布能够很好地由广义逻辑分布描述；该分布既能刻画帐篷状数据集，也能描述接近正态的数据集，并可作为这些观测值的一个有效零模型。这些结果证明，对于足够大的时间滞后，增长率分布在实践中将不再具有时间依赖性，并呈现出有限方差。基于此分析，我们识别出一些关键的程式化宏观生态模式，以及能够复现这些模式的特定随机微分方程。随后，我们引入了一种启发式模型选择的实用工作流程。该方法尤其适用于数据追踪质量有限的系统，在这些系统中应用复杂的推断方法具有挑战性。

摘要 (Abstract)

We propose new analytical tools for describing growth-rate distributions generated by stationary time-series. Our analysis shows how deviations from normality are not pathological behaviour, as suggested by some traditional views, but instead can be accounted for by clean and general statistical considerations. In contrast, strict normality is the effect of specific modelling choices. Systems characterized by stationary Gamma or heavy-tailed abundance distributions produce log-growth-rate distributions well described by a generalized logistic distribution, which can describe tent-shaped or nearly normal datasets and serves as a useful null model for these observables. These results prove that, for large enough time lags, in practice, growth-rate distributions cease to be time-dependent and exhibit finite variance. Based on this analysis, we identify some key stylized macroecological patterns and specific stochastic differential equations capable of reproducing them. A pragmatic workflow for heuristic selection between these models is then introduced. This approach is particularly useful for systems with limited data-tracking quality, where applying sophisticated inference methods is challenging.

Keywords: growth-rate distributions, stationary time-series, generalized logistic distribution, macroecological patterns, stochastic differential equations, statistical analysis, null model, finite variance
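
A quick simulation can make the Gamma-to-logistic link in the abstract above tangible. The sketch assumes that abundances at two well-separated times behave like independent draws from the same stationary Gamma distribution (my reading of the "large enough time lags" regime, not a statement from the paper). For shape parameter 1 the log-ratio of two such draws follows the standard logistic distribution, whose variance is $\pi^2/3$; the paper's exact parameterization of the generalized logistic family is not given in the abstract.

```python
import math
import random

def log_growth_rates(shape, n, seed=0):
    """log(x_later / x_earlier) for pairs of independent Gamma(shape, 1) abundances."""
    rng = random.Random(seed)
    return [
        math.log(rng.gammavariate(shape, 1.0) / rng.gammavariate(shape, 1.0))
        for _ in range(n)
    ]

def mean_and_variance(xs):
    """Sample mean and unbiased sample variance."""
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return m, v

rates = log_growth_rates(shape=1.0, n=100_000)
mean, var = mean_and_variance(rates)
logistic_var = math.pi ** 2 / 3  # variance of the standard logistic distribution
print(f"mean={mean:.3f} var={var:.3f} logistic_var={logistic_var:.3f}")
```

The sample variance stays finite and close to the logistic value, and the heavier-than-Gaussian tails of this distribution are what produce the tent-like shape on a logarithmic density plot.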

301. ❌ FcsIT: An Open-Source, Cross-Platform Tool for Correlation and Analysis of Fluorescence Correlation Spectroscopy Data

Authors: Tomasz Kalwarczyk Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29684v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

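Scoring rationale: The paper "FcsIT: An Open-Source, Cross-Platform Tool for Correlation and Analysis of Fluorescence Correlation Spectroscopy Data" focuses on a software tool for analyzing fluorescence correlation spectroscopy data, covering Python programming, a GUI, data reading, correlation calculation, and model fitting. All scoring keywords concern large models, deep learning, AI principles, or AI applications in science, none of which this paper touches, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper presents FcsIT, an open-source, cross-platform tool for correlating and fitting fluorescence correlation spectroscopy data, and confirms its usability and accuracy through validation experiments.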

摘要翻译

FcsIT是一款平台无关的开源工具，用于计算荧光相关光谱数据的相关性与拟合分析。该软件基于Python编写，并采用强大的Dear PyGUI引擎构建用户界面。它支持读取及关联TTTR（时间标记时间分辨）数据，并能对光子时间轨迹数据进行TCSPC（时间相关单光子计数）滤波处理。通过将循环区块自助法应用于相关数据及其方差的计算，所得数据质量可与商业软件相媲美。直观的拟合界面为大规模数据集提供了高效分析能力，并内置九种预定义数学模型用于拟合相关曲线。此外，该工具允许用户以友好的方式自定义添加模型。通过对模拟FCS数据及真实FCS实验的验证，FcsIT工具的实用性和其对广泛FCS用户的潜在吸引力得到了证实。

摘要 (Abstract)

FcsIT is a platform-independent, open-source tool for calculating the correlation and fitting fluorescence correlation spectroscopy data. The software is written in Python and uses a powerful Dear PyGUI engine for its interface. It provides reading and correlating the TTTR data, as well as TCSPC filtering of the photon time-trace data. The circular-block bootstrap method applied to the calculation of correlation data and its variance results in data quality comparable to that obtained with commercially available software. An intuitive fitting interface provides efficient analysis of large datasets and includes nine predefined mathematical models for fitting correlation curves. Moreover, it allows users to add their own models in a user-friendly manner. Validation of the FcsIT tool against simulated FCS data and real FCS experiments confirms its usability and potential appeal to a wide variety of FCS users.

Keywords: Fluorescence Correlation Spectroscopy, FCS, open-source tool, correlation analysis, data fitting, Python, Dear PyGUI, circular-block bootstrap
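
The circular-block bootstrap named in the abstract above can be sketched with the standard library alone. This is a generic illustration, not FcsIT's implementation: the AR(1) trace stands in for real photon-count data, the lag-1 autocorrelation stands in for a full correlation curve, and the block length and replicate count are arbitrary choices.

```python
import random
import statistics

def circular_block_resample(series, block_len, rng):
    """Concatenate blocks that start at random positions and wrap around the end."""
    n = len(series)
    out = []
    while len(out) < n:
        start = rng.randrange(n)
        out.extend(series[(start + j) % n] for j in range(block_len))
    return out[:n]

def autocorr_lag1(xs):
    """Lag-1 autocorrelation, used here as a stand-in statistic."""
    m = statistics.fmean(xs)
    num = sum((a - m) * (b - m) for a, b in zip(xs, xs[1:]))
    den = sum((a - m) ** 2 for a in xs)
    return num / den

def bootstrap_variance(series, statistic, block_len, n_boot=200, seed=0):
    """Variance of the statistic across circular-block bootstrap replicates."""
    rng = random.Random(seed)
    values = [
        statistic(circular_block_resample(series, block_len, rng))
        for _ in range(n_boot)
    ]
    return statistics.variance(values)

# Hypothetical correlated trace: AR(1) process with coefficient 0.8.
gen = random.Random(42)
trace, x = [], 0.0
for _ in range(500):
    x = 0.8 * x + gen.gauss(0.0, 1.0)
    trace.append(x)

var_hat = bootstrap_variance(trace, autocorr_lag1, block_len=25)
print(f"lag-1 autocorrelation = {autocorr_lag1(trace):.3f}, bootstrap variance = {var_hat:.5f}")
```

Resampling whole blocks instead of single points preserves short-range correlations inside each block, and the wrap-around gives every observation the same chance of being included.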

302. ❌ Sampling from the Solution Space and Metabolic Environments of Genome-Scale Metabolic Models

Authors: Haris Zafeiropoulos, Daniel Rios Garza Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29546v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

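Scoring rationale: The paper focuses on flux sampling methods for genome-scale metabolic models and their applications, which places it in bioinformatics. All keywords relate to large models, deep learning principles, or AI applications, none of which the paper involves. The only potentially relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper falls within bioinformatics. However, the paper does not explicitly use AI or machine learning and instead relies on traditional constraint-based optimization and statistical sampling, so this keyword receives 5 points (somewhat relevant). All other keywords are unrelated to the paper's topic and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper surveys state-of-the-art flux sampling methods for genome-scale metabolic models, shows how random sampling can explore the spectrum of phenotypes a species can exhibit, and applies these methods under different conditions.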

摘要翻译

通量采样是一种基于分布、从代谢模型的解空间中随机选取有效数量点的分析方法。与大多数基于约束的分析不同，通量采样无需优化目标函数，从而能够探索物种可能表现出的全部表型谱。然而，采样也可被限制在某个子空间内，其中所选目标至少达到其最优值的特定比例。在研究特定功能最优表型时，这种定向方法具有重要价值。与仅返回单一解的通量平衡分析相反，采样利用统计效力来揭示原本可能被掩盖的表型。这在改变物种生存条件（培养基）时尤为有用。本文重点介绍在不同场景下对基因组尺度代谢模型应用通量采样的一些前沿方法，并展示通量采样的具体应用实例。

摘要 (Abstract)

Flux sampling is an analysis that, based on a distribution, picks randomly an efficient number of points from the solution space of a metabolic model. Unlike most constraint-based analyses, flux sampling does not require an objective function to optimize, allowing for the exploration of the whole spectrum of the phenotypes a species can exhibit. However, sampling can also be restricted to a subspace where a chosen objective reaches at least a specified fraction of its optimum. This targeted approach adds value when investigating phenotypes that are optimal for a specific function. Contrary to Flux Balance Analysis, which returns a single solution, sampling leverages statistical power to uncover phenotypes that otherwise would be masked. This can be especially useful when changing the conditions (medium) in which a species lives. Here, we highlight some state-of-the-art methods for applying flux sampling at Genome-Scale Metabolic Models in different scenarios, and we showcase flux sampling applications

Keywords: flux sampling, genome-scale metabolic models, solution space, phenotype exploration, constraint-based analysis, statistical power, metabolic environments
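
As a toy illustration of the two sampling modes described in the abstract above, the sketch below samples a three-reaction network uniformly and then restricts the samples to the subspace where a chosen objective reaches a given fraction of its optimum. Everything here is hypothetical: the network (one uptake flux split into growth and a byproduct), the bound `UPTAKE_MAX`, and the use of simple rejection sampling. Genome-scale models are far too high-dimensional for rejection sampling and are typically handled with Markov chain samplers such as hit-and-run.

```python
import random

UPTAKE_MAX = 10.0  # hypothetical upper bound on the uptake flux

def sample_fluxes(n, min_fraction=0.0, seed=0):
    """Sample (v_in, v_growth, v_byproduct) with v_in = v_growth + v_byproduct.

    min_fraction = 0 explores the whole solution space; a positive value keeps
    only points where v_growth reaches that fraction of its optimum.
    """
    rng = random.Random(seed)
    optimum = UPTAKE_MAX  # growth is maximal when all uptake is routed to it
    samples = []
    while len(samples) < n:
        v_growth = rng.uniform(0.0, UPTAKE_MAX)
        v_byproduct = rng.uniform(0.0, UPTAKE_MAX)
        v_in = v_growth + v_byproduct  # steady state holds by construction
        if v_in > UPTAKE_MAX:
            continue  # violates the uptake bound
        if v_growth < min_fraction * optimum:
            continue  # outside the near-optimal subspace
        samples.append((v_in, v_growth, v_byproduct))
    return samples

unrestricted = sample_fluxes(1000)
near_optimal = sample_fluxes(1000, min_fraction=0.5)
mean_by_all = sum(s[2] for s in unrestricted) / len(unrestricted)
mean_by_opt = sum(s[2] for s in near_optimal) / len(near_optimal)
print(f"mean byproduct flux: all={mean_by_all:.2f}, near-optimal={mean_by_opt:.2f}")
```

Comparing the two sample sets shows how the distribution of a non-objective flux shifts once the objective is constrained, which a single flux-balance solution cannot reveal.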

303. ❌ Retrospective Economic Evaluation of Group Testing in the COVID-19 Pandemic

Authors: Michael Balzer, Kainat Khowaja, Christiane Fuchs Journal/Source: arxiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28930v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

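Scoring rationale: The paper is an economic evaluation of group testing during the COVID-19 pandemic and belongs to public health economics and epidemiology. Its content centers on mathematical modeling, economic cost analysis, group testing algorithms, and Monte Carlo experiments, with no large models, deep learning, AI techniques, or related principles. All scoring keywords concern large-model technology, whereas this is purely an economics and epidemiology study, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper develops a mathematical model to evaluate the economic cost of COVID-19 group testing. It finds that testing algorithms with shorter quarantine durations are preferable once income loss is taken into account, and that conventional evaluations based only on deterministic costs underestimate the total economic cost.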

摘要翻译

大流行期间的疾病监测是公共卫生政策的重要组成部分。由于资源限制，个体层面的诊断检测往往难以实施。为突破这些限制，可采用分组检测方法。从支付方视角进行的经济成本评估通常仅关注确定性成本，忽视了因隔离和工作场所中断导致的生产力损失所带来的重大经济影响。本文旨在建立一个用于分组检测回顾性经济评估的数学模型，该模型同时纳入确定性成本和基于收入的经济损失。研究重新审视了分组检测算法，并在优化池规模下进行模拟以确定所需检测数量。德国社会经济面板的收入数据被整合到数学模型中，用以捕捉经济损失。随后，通过评估2019冠状病毒病大流行在德国的经济成本，进行了混合蒙特卡罗实验。蒙特卡罗实验表明，当纳入基于收入的经济损失时，分组检测算法的最优选择会发生显著变化。仅考虑确定性成本的评估会系统性低估总经济成本。若计入基于收入的经济损失，隔离持续时间较长的算法相比隔离时间较短的算法吸引力更低。研究结果表明，当前的评估低估了真实经济成本。即使需要更多检测次数，持续时间更短、阶段更少的分组检测算法仍更受青睐。这些结果凸显了将基于收入的经济损失纳入数学模型的重要性。

摘要 (Abstract)

Surveillance of diseases in a pandemic is an important part of public health policy. Diagnostic testing at the individual level is often infeasible due to resource constraints. To circumvent these constraints, group testing can be applied. The economic cost evaluation from the payer’s perspective typically focuses only on deterministic costs which overlooks the substantial economic impact of productivity losses resulting from quarantine and workplace disruptions. The objective of this article is to develop a mathematical model for a retrospective economic evaluation of group testing that incorporates both deterministic costs and income-based economic loss. Group testing algorithms are revisited and simulated at optimized pool sizes to determine the required number of tests. Income data from the German Socio-Economic Panel are integrated into a mathematical model to capture the economic loss. Afterward, hybrid Monte Carlo experiments are conducted by evaluating the economic cost in the Coronavirus disease 2019 pandemic in Germany. Monte Carlo experiments show that the optimal choice of group testing algorithms changes substantially when income-based economic losses are included. Evaluations considering only deterministic costs systematically underestimate the total economic cost. Algorithms with a longer quarantine duration are less attractive than shorter quarantine duration if income-based economic loss is accounted for. The findings show that current evaluations underestimate the true economic cost. Group testing algorithms with shorter duration and fewer stages are preferred, even when they require a larger number of tests. These results underscore the importance of incorporating income-based economic loss into a mathematical model.

Keywords: group testing, economic evaluation, COVID-19 pandemic, Monte Carlo experiments, income-based economic loss, mathematical model, quarantine duration, public health policy
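
The trade-off described in the abstract above can be illustrated with classic two-stage (Dorfman) pooling, chosen here only because it is the simplest group testing scheme; the abstract does not say which algorithms the paper evaluates. The cost model is a deliberate simplification with hypothetical numbers: each tested person is assumed to lose one day of income per day spent waiting for a final result.

```python
def dorfman_tests_per_person(prevalence, pool_size):
    """Expected tests per person: one pooled test, plus retests if the pool is positive."""
    if pool_size == 1:
        return 1.0  # individual testing
    p_pool_positive = 1.0 - (1.0 - prevalence) ** pool_size
    return 1.0 / pool_size + p_pool_positive

def best_pool_size(prevalence, max_pool=50):
    """Pool size that minimizes the expected number of tests per person."""
    return min(range(1, max_pool + 1),
               key=lambda n: dorfman_tests_per_person(prevalence, n))

def total_cost_per_person(tests_per_person, cost_per_test, daily_income, waiting_days):
    """Deterministic test cost plus income lost while waiting (ASSUMED cost model)."""
    return tests_per_person * cost_per_test + daily_income * waiting_days

prevalence = 0.01
pool = best_pool_size(prevalence)
pooled_tests = dorfman_tests_per_person(prevalence, pool)

# Hypothetical prices: 50 per test, 100 income per day; pooling takes one extra day.
individual_total = total_cost_per_person(1.0, 50.0, 100.0, waiting_days=1)
pooled_total = total_cost_per_person(pooled_tests, 50.0, 100.0, waiting_days=2)
print(f"pool size={pool}, tests/person={pooled_tests:.4f}")
print(f"individual total={individual_total:.2f}, pooled total={pooled_total:.2f}")
```

With these made-up numbers pooling wins on test cost alone but loses once the extra waiting day is priced in, which mirrors the qualitative finding that faster schemes can be preferable even when they need more tests.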

304. ❌ Negative Electronic Friction and Non-Markovianity in Nonequilibrium Systems

Authors: Riley J. Preston, Samuel L. Rudge, Daniel S. Kosov, Michael Thoss Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29951v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

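Scoring rationale: This paper studies negative electronic friction and non-Markovian effects in molecule-metal surface interactions under nonequilibrium conditions. It belongs to condensed matter physics and molecular dynamics and is unrelated to any of the scoring keywords (all of which concern large models, deep learning, and related techniques), so every keyword receives a relevance of 0.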
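
The pass/fail verdict follows from the report configuration: pass threshold = 5.0 + 0.8 × total weight, with a total weight of 27.0. A minimal sketch of that check is given below. It assumes that a paper's total is simply the sum of the per-keyword score column and that a total equal to the threshold counts as a pass. The function names are hypothetical, and the rule that produces each per-keyword score is not reproduced here.

```python
def pass_threshold(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Threshold from the report configuration: base + factor * total weight."""
    return round(base + factor * total_weight, 1)


def total_score(rows):
    """Sum the last column of tab-separated rows: keyword, weight, relevance, score."""
    return sum(float(row.rsplit("\t", 3)[3]) for row in rows)


def verdict(rows, total_weight: float) -> str:
    # Assumption: a total equal to the threshold counts as a pass.
    return "pass" if total_score(rows) >= pass_threshold(total_weight) else "fail"


if __name__ == "__main__":
    rows = [
        "Large Language Models OR LLMs OR Foundation Models\t0.0\t0.0/10\t0.0",
        "AI for Science OR Bioinformatics OR Cheminformatics\t0.0\t0.0/10\t0.0",
    ]
    print(pass_threshold(27.0), total_score(rows), verdict(rows, 27.0))
```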

!!! tip deepseek-chat TL;DR

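The paper investigates the link between negative electronic friction and non-Markovian effects for molecules interacting with metal surfaces under nonequilibrium conditions. It finds that the mechanism responsible for negative Markovian electronic friction also introduces significant non-Markovian contributions, and it uses a molecular nanojunction model to confirm that these effects strongly influence the nonequilibrium dynamics.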

Abstract

We address the connection between negative electronic friction and non-Markovian effects in the nonadiabatic vibrational dynamics of molecules interacting with metal surfaces under nonequilibrium conditions. We show that a generic nonequilibrium mechanism leading to negative Markovian electronic friction, where molecular vibrations couple directly to inelastic electronic transitions, also introduces significant non-Markovian contributions to the electronic friction. To demonstrate these ideas, we investigate nonequilibrium charge transport through a molecular nanojunction containing a vibrationally coupled donor-acceptor model, where negative electronic friction reflects driving of the vibrational mode beyond standard Joule heating. By comparison to numerically exact, fully quantum hierarchical equations of motion simulations, we verify that these non-Markovian effects have a significant impact on the nonequilibrium dynamics and even on the stability of the resulting Langevin equation.

Keywords: negative electronic friction, non-Markovian effects, nonequilibrium systems, molecular nanojunction, vibrational dynamics, hierarchical equations of motion, Langevin equation, charge transport

305. ❌ TD$Δ$SCF: Time-Dependent Density Functional Theory with a Non-Aufbau Reference for near-degenerate states

Authors: Shuto Shibasaki, Fumiya Mohri, Takashi Tsuchimochi Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29948v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

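Scoring rationale: This paper improves time-dependent density functional theory (TDDFT), a computational chemistry method, to address the challenge of near-degenerate electronic structures. It is unrelated to nearly all of the keywords, which concern artificial intelligence, machine learning, and large language models, whereas this work is purely a study of quantum chemistry methods. The only possible link is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper belongs to computational chemistry and is therefore part of scientific computing. However, it uses conventional quantum chemistry methods rather than AI, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

To address near-degenerate electronic structures, a long-standing challenge for conventional density functional theory, the paper proposes a new time-dependent ΔSCF method that uses a non-Aufbau reference state to improve the description of singlet states. Across several test cases it shows weaker functional dependence and better performance than conventional methods, but it also has limitations such as a systematic overestimation of singlet energies.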

Abstract

Near-degenerate electronic structures remain a major challenge for conventional single-reference density functional theory (DFT). To address this problem, we propose time-dependent $Δ$SCF (TD$Δ$SCF), a novel linear-response scheme in which a non-Aufbau $Δ$SCF determinant serves as the reference for a subsequent TDDFT calculation. In contrast to collinear spin-flip (SF)-TDDFT, this formulation preserves the usual Coulomb and exchange-correlation response contributions while describing the target states from an electronically excited reference. We examine the performance of TD$Δ$SCF for several prototypical problems involving near-degeneracy, including the torsional potential of ethylene, singlet–triplet gaps of representative diradicals, geometry optimizations of benzyne isomers, and bond-dissociation curves of hydrogen fluoride and F$_2$. Across these tests, TD$Δ$SCF shows markedly weaker functional dependence than SF-TDDFT and often yields a more balanced description of challenging singlet states. In particular, it provides smooth torsional potentials, improved singlet–triplet gaps, a consistent monocyclic structure for singlet $m$-benzyne, and a more satisfactory description of bond dissociation without the spurious low-lying states found in SF-TDDFT. At the same time, the method exhibits a systematic tendency to overestimate singlet energies and can lose accuracy when the underlying $Δ$SCF reference is not well suited to the final state. We also identify a numerical instability that can arise in non-Aufbau calculations and trace its origin to the exchange-correlation potential near uncompensated nodal regions. These results highlight both the promise and the practical limitations of TD$Δ$SCF as a low-cost method for singlet states with near-degenerate electronic structures.

Keywords: Time-Dependent Density Functional Theory, TDΔSCF, non-Aufbau reference, near-degenerate states, singlet states, linear-response scheme, spin-flip TDDFT, electronic structure

306. ❌ Gap edge eigenpairs from density matrix purification using moments of the Dirac distribution

Authors: Lionel Alexandre Truflandier Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29849v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

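Scoring rationale: This paper studies a density matrix purification method in electronic structure calculations for resolving the eigenstates at the band gap edges, which places it in computational chemistry and physics. None of the keywords concerning large models, deep learning, or AI techniques and applications is relevant. Only the last keyword, "AI for Science OR Bioinformatics OR Cheminformatics", has some connection to scientific computing, but the paper involves no AI methods, so it receives 5 points (some relevance). All other keywords are entirely unrelated and receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes a density matrix purification method based on moments of the Dirac distribution for efficiently resolving the eigenstates at the band gap edges of an electronic spectrum, and it demonstrates the method's robustness and computational efficiency on molecular benchmarks.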

Abstract

In this work, we propose a simple method to resolve the eigenstates located at the band gap edges of an electronic eigenspectrum using only the quasi-purified one-particle density matrix as input. The theoretical framework relies on the decomposition of the occupation number variance into a particle moment and a hole moment. These moments, when purified using power narrowing iterations, make it possible to isolate the highest occupied and lowest unoccupied single-state projectors, giving ready access to the corresponding eigenpairs. We demonstrate that when degeneracy is encountered, power narrowing remains able to deliver relevant mixed states. From a benchmark of selected molecules, we show that the method is robust and efficient, since it requires no more than a dozen matrix-matrix multiplications at worst. The possibility of reducing the computational cost using a Lanczos subspace approach is discussed. The very low algorithmic complexity of power narrowing makes it very easy to implement in electronic structure codes or libraries that already feature Fermi operator expansion or density matrix purification.

Keywords: density matrix purification, eigenstates, band gap edges, electronic eigenspectrum, occupation number variance, power narrowing, Lanczos subspace, computational efficiency
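
For readers unfamiliar with density matrix purification, the sketch below shows the standard McWeeny iteration, P ← 3P² − 2P³, which drives a nearly idempotent matrix towards an exact projector. This is the textbook scheme, not the power narrowing procedure proposed in the paper; the 2×2 matrix and all function names are illustrative assumptions.

```python
def matmul(a, b):
    """Product of two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def idempotency_error(p):
    """Largest entry of |P^2 - P|; zero for an exact projector."""
    p2 = matmul(p, p)
    n = len(p)
    return max(abs(p2[i][j] - p[i][j]) for i in range(n) for j in range(n))


def mcweeny_purify(p, tol=1e-12, max_iter=50):
    """Iterate P <- 3 P^2 - 2 P^3 until P is idempotent within `tol`."""
    n = len(p)
    for _ in range(max_iter):
        if idempotency_error(p) < tol:
            break
        p2 = matmul(p, p)
        p3 = matmul(p2, p)
        p = [[3.0 * p2[i][j] - 2.0 * p3[i][j] for j in range(n)]
             for i in range(n)]
    return p


# Toy symmetric matrix with eigenvalues (occupations) 0.9 and 0.1.
# Illustrative values only, not data from the paper.
P0 = [[0.612, 0.384],
      [0.384, 0.388]]
P = mcweeny_purify(P0)
```

Each step costs two matrix-matrix multiplications. Counting such multiplications is the cost measure the abstract uses when it reports that the proposed method needs at most about a dozen of them.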

307. ❌ Short-lived memory in multidimensional spectra encodes full signal evolution

Authors: Thomas Sayer, Ethan H. Fink, Zachary R. Wiethorn, Devin R. Williams, Anthony J. Dominic, Luke Guerrieri, Yi Ji, Veronica Policht, Jennifer Ogilvie, Gabriela Schlau-Cohen, Amber Krummel, Andrés Montoya-Castillo Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29814v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

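Scoring rationale: This paper develops a new spectroscopic technique, the spectral generalized master equation (GME), to improve two-dimensional spectroscopy measurements, which places it in experimental physics and chemical spectroscopy. Its content covers spectroscopy, chemical kinetics, and materials science and is unrelated to all of the large-model and deep-learning keywords. Only the last keyword, "AI for Science OR Bioinformatics OR Cheminformatics", has some connection: the work is an innovation in scientific instrumentation and methods and could be viewed as "AI for Science" in a broad sense (technology applied to science). However, the paper uses neither AI nor large models, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

To address the high experimental cost and long acquisition times of current multidimensional spectroscopy, the paper develops a new spectral generalized master equation (GME) technique that uses short-waiting-time 2D spectra to accurately predict the full spectral evolution at arbitrary waiting times, greatly reducing experimental cost and improving temporal resolution.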

Abstract

Ultrafast multidimensional spectroscopies are powerful tools that can access charge and energy flow in complex materials, shifting chemical kinetics, and even many-body interactions in correlated matter. However, current implementations typically involve complex apparatuses and long averaging times. As a result, these methods have been limited to detailed mechanistic investigations of a few samples, precluding the broad characterization of molecular systems and/or the optimization of material ones. For example, converging the statistical noise in 2D spectra becomes exponentially expensive with increasing waiting times, and extended laser exposure heightens the probability of sample degradation. We address this fundamental challenge by developing a new technique, the spectral generalized master equation (GME), that enables one to employ short-waiting time 2D spectra to determine the full evolution of 2D spectra over arbitrary waiting times with high temporal resolution. In addition to reducing the cost of experiments by multiple orders of magnitude, our approach accurately removes statistical noise, reducing the need for time averaging, while circumventing the increasing convergence costs with longer waiting times. We provide a rigorous theoretical footing for the spectral GME and demonstrate its applicability on theoretically generated and experimentally measured 2D electronic and 2D infrared spectra. We anticipate that this advance has the potential to enable the investigation of systems that are too delicate for study with current multidimensional spectroscopies and accelerate the progress of 2D spectroscopy-based microscopies that can offer highly resolved excitation dynamics with spatial resolution over heterogeneous environments.

Keywords: ultrafast multidimensional spectroscopy, 2D spectra, spectral generalized master equation, waiting time, temporal resolution, experimental cost reduction, statistical noise removal, excitation dynamics

308. ❌ Investigating the Electrochemical Double Layer with Quantum-Chemical Simulations and Implicit Solvation Models

Authors: Alessandro Mangiameli, Christopher J. Stein Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29674v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

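Scoring rationale: This paper studies quantum-chemical simulations of the electrochemical double layer with implicit solvation models. It belongs to computational chemistry and is unrelated to most of the keywords, which concern large models, deep learning, and AI techniques. It has some connection only to the last keyword, "AI for Science OR Bioinformatics OR Cheminformatics", because the work is a scientific computing application. However, it does not explicitly use AI or machine learning and instead relies on conventional physical models (DRISM, Poisson-Boltzmann, molecular dynamics), so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The paper assesses the DRISM implicit electrolyte model for simulating the electrochemical double layer and finds that using pair-specific metal-ion parameters instead of the default mixing rules improves the model's accuracy and yields more symmetric charging behavior.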

Abstract

We assess the dielectrically consistent reference interaction site model (DRISM) as an implicit electrolyte framework for modeling the electrochemical double layer, and compare it with the Poisson-Boltzmann model and explicit molecular dynamics results from the literature. We use the gold-electrolyte interface as the main test case and analyze solvent and ionic density profiles, the differential capacitance, and the solvation contribution to CO adsorption. The results show a strong sensitivity to the Lennard-Jones parametrization of metal-ion and metal-water interactions. In particular, we find the default Lorentz-Berthelot mixing rules to be inadequate: they lead to excessive Na+ accumulation at the interface, which results in an increase of the differential capacitance at negative electrode potentials. We demonstrate that introducing pair-specific metal-ion parameters yields more symmetric charging behavior and provides greater flexibility. Our findings suggest that using pair-specific parameters, rather than relying on Lorentz-Berthelot mixing rules, improves the accuracy of the model and opens the way for future studies with this improved yet equally performant model.

Keywords: electrochemical double layer, quantum-chemical simulations, implicit solvation models, DRISM, Poisson-Boltzmann, differential capacitance, Lorentz-Berthelot mixing rules, metal-ion parameters
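
The contrast between default mixing rules and pair-specific parameters can be made concrete with a small sketch. The Lorentz-Berthelot rules take the arithmetic mean of the Lennard-Jones σ values and the geometric mean of the ε values; a pair-specific table overrides that default for selected pairs. The site names, numerical values, and function names below are placeholders chosen for illustration, not parameters from the paper.

```python
import math


def lorentz_berthelot(sigma_i, eps_i, sigma_j, eps_j):
    """Default mixing: arithmetic mean for sigma, geometric mean for epsilon."""
    return 0.5 * (sigma_i + sigma_j), math.sqrt(eps_i * eps_j)


def pair_parameters(site_i, site_j, site_params, overrides=None):
    """Return (sigma_ij, eps_ij), preferring a pair-specific entry if present."""
    key = frozenset((site_i, site_j))
    if overrides and key in overrides:
        return overrides[key]
    return lorentz_berthelot(*site_params[site_i], *site_params[site_j])


# Placeholder (sigma, epsilon) values, NOT physical parameters.
SITE_PARAMS = {"Au": (3.0, 0.2), "Na+": (2.4, 0.1)}
# Hypothetical pair-specific entry for the metal-ion pair.
OVERRIDES = {frozenset(("Au", "Na+")): (3.2, 0.05)}

default_pair = pair_parameters("Au", "Na+", SITE_PARAMS)
specific_pair = pair_parameters("Au", "Na+", SITE_PARAMS, OVERRIDES)
```

Keying the override table on an unordered pair means the lookup gives the same result for ("Au", "Na+") and ("Na+", "Au").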

309. ❌ Quantum Sensing with Triplet Pair States: A Theoretical Study

Authors: Maria Grazia Concilio, Yiwen Wang, Siyuan Wang, Xueqian Kong Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29509v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

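Scoring rationale: This paper studies quantum sensing, which falls under physics and quantum computing, and is unrelated to all of the large-model and deep-learning keywords. The only potentially relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", since quantum sensing can be seen as a scientific application. However, the paper uses only theoretical modeling and simulation rather than AI methods, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

This theoretical study compares quantum sensing architectures based on pentacene dimers and pentacene monomers. It finds that the dimer offers a superior interaction cross-section for detecting small ensembles of nuclear spins, establishing a theoretical baseline for using high-spin multi-excitonic states as chemically tunable, high-sensitivity quantum probes.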

Abstract

Molecular quantum sensors represent a promising frontier for the detection of nuclear magnetic resonance signals and alternating current magnetic fields at the nanoscale, potentially reaching single-proton sensitivity. Although the triplet states of molecular pentacene provide a viable sensing architecture, the triplet pair states produced by singlet fission of pentacene dimers could enable more flexible quantum manipulations through entanglement. In this work, we model the quantum sensing efficacy of a spin-polarized quintet manifold in a photoexcited pentacene dimer generated via intramolecular singlet fission. Using a Lindblad master equation approach, we simulate the evolution of the triplet pair state under standard dynamical decoupling sequences, including spin echo, XY4, and XY8 and provide a direct performance comparison to the traditional pentacene monomer benchmark. While both architectures exhibit comparable sensitivity for isolated single-spin detection, our findings indicate that the dimer architecture provides a superior interaction cross-section for detecting small ensembles of nuclear spins. Analytical expressions derived for fluorescence modulation demonstrate that sensitivity is optimized in the low-magnetic field regime and scales with the number of pulses in the sensing protocol. This study establishes a theoretical baseline for utilizing high-spin multi-excitonic states as chemically tunable, high-sensitivity quantum probes.

Keywords: quantum sensing, triplet pair states, pentacene dimer, singlet fission, Lindblad master equation, dynamical decoupling, fluorescence modulation, high-spin multi-excitonic states
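
The dynamical decoupling sequences named in the abstract differ in the rotation axes of their π pulses. The sketch below lists the conventional axis patterns and evenly spaced pulse times. The choice of X as the spin-echo axis, the timing convention, and the function names are assumptions made for illustration; the paper's actual pulse parameters are not reproduced.

```python
# Rotation axes of the pi pulses in one block of each sequence.
BASE_BLOCKS = {
    "spin_echo": ["X"],  # single refocusing pulse; axis X is an assumed convention
    "XY4": ["X", "Y", "X", "Y"],
    "XY8": ["X", "Y", "X", "Y", "Y", "X", "Y", "X"],
}


def pulse_axes(sequence: str, repeats: int = 1):
    """Axes of all pi pulses when the base block is repeated `repeats` times."""
    return BASE_BLOCKS[sequence] * repeats


def pulse_times(n_pulses: int, tau: float):
    """Pulse centres at (k + 1/2) * tau, so the total duration is n_pulses * tau."""
    return [(k + 0.5) * tau for k in range(n_pulses)]


if __name__ == "__main__":
    axes = pulse_axes("XY8", repeats=2)
    times = pulse_times(len(axes), tau=1.0)
    print(list(zip(times, axes)))
```

Repeating a block increases the number of pulses, which is the quantity the abstract says the sensitivity scales with.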

310. ❌ Local thermal probe in a one-dimensional chain: An efficient dissipaton-based approach

Authors: Hao-Yang Qi, Zi-Fan Zhu, Yao Wang, Rui-Xue Xu, YiJing Yan Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29458v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

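Scoring rationale: This paper studies heat transport between a one-dimensional molecular chain and a local probe using a dissipaton-based quantum approach, which places it in condensed matter physics and quantum transport. All scoring keywords concern large models, deep learning, and AI techniques and their applications, whereas the paper involves no artificial intelligence, machine learning, or large language model content at all, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies heat transport for a local thermal probe in a one-dimensional molecular chain, develops a fully nonperturbative, non-Markovian dissipaton-based quantum approach, and analyzes how temperature, frequency, onsite energy modification, and higher-order couplings affect heat transport.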

Abstract

We study a system consisting of an infinite one-dimensional molecular chain and a locally coupled probe. Starting from the Hamiltonian of the chain-probe composite and the corresponding spectral densities, we evaluate the heat current between the probe and the chain. For this purpose, we develop a dissipaton-based quantum approach that is fully nonperturbative and non-Markovian. The dissipaton algebra yields a set of hierarchically coupled equations of motion for the dissipaton moments, with cross-tier connections in an iterative manner if higher-order chain-probe interactions are included. Numerical results demonstrate the effects of temperature, frequency, onsite energy modification and higher-order couplings on heat transport. This work provides a general framework for thermal transport and other properties in locally probed systems and can be straightforwardly extended to higher-dimensional materials and electronic transport problems with strong many-body effects.

Keywords: thermal transport, one-dimensional chain, local probe, dissipaton-based approach, nonperturbative, non-Markovian, heat current, quantum approach

311. ❌ Layer-selective hydrogenation and proton transport in twisted bilayer graphene

Authors: J. Tong, G. Chen, H. Li, E. Hoenig, M. Alhashmi, X. Zhang, D. Bahamon, G. R. Tainton, S. Sullivan-Allsop, Y. Mayamei, D. R. da Costa, L. F. Vega, S. J. Haigh, D. Domaretskiy, F. M. Peeters, M. Lozada-Hidalgo Journal/Source: arxiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29342v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

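Scoring rationale: This paper studies layer-selective hydrogenation and proton transport in twisted bilayer graphene, which places it in condensed matter physics and materials science. All scoring keywords concern large models, deep learning, and related techniques (such as training methods, inference optimization, alignment, and applications), none of which the paper touches on, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies a new mechanism in twisted bilayer graphene in which layer-selective hydrogenation and proton transport, controlled through the electric field and charge density, enable configurable logic gates.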

摘要 (Abstract)

Recent work investigated graphene’s hydrogenation with independent control of the electric field, E, and charge density, n, in the crystal and showed that the process is controlled by n. Here, we demonstrate layer-selective conductor-insulator transitions in twisted bilayer graphene, driven by hydrogenation at fixed n under strong E. This process is accompanied by proton transport through the bilayer, enabling several parallel and configurable logic gates in the devices. Selectivity arises because the large twist angle decouples the two layers’ electronic systems, enabling independent control of their charge densities. Polarisation by the field then induces a charge imbalance at fixed total n, triggering hydrogenation when one of the layers’ charge densities reaches the threshold for monolayer hydrogenation. Our results introduce a new type of electrode-electrolyte interface in which electrochemical processes are controlled with two decoupled 2D electron gases, opening new design opportunities for energy and information processing devices.

Keywords: twisted bilayer graphene, layer-selective hydrogenation, proton transport, electric field control, charge density, conductor-insulator transition, logic gates, 2D electron gases

312. ❌ GPU Accelerated Minimal Auxiliary Basis Approach TDDFT for Large Organic Molecules

Authors: Zehao Zhou, Xiaojie Wu, Yanheng Li, Xinran Wei, Cheng Fan, Fusong Ju, Qiming Sun, Yi Qin Gao Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29257v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

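Scoring rationale: This paper is a computational chemistry work that develops a GPU-accelerated TDDFT method for excited-state calculations on large molecules. It is unrelated to almost all of the keywords (large-model techniques, training methods, inference optimization, alignment, and so on). The only keyword that might apply is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper belongs to computational chemistry and scientific computing, i.e. AI/computation applied to science (specifically chemistry and materials science). However, the paper does not explicitly use terms such as 'AI' or 'Bioinformatics' and relies on conventional quantum chemistry methods, so it receives 5 points (some relevance).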
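
The score table above can be reproduced with a short sketch. It assumes that each row's score is weight × relevance and that a paper's total is the sum over rows; the report does not state this rule, so treat it as a guess. The pass-mark formula (5.0 + 0.8 × total weight) is the one given in the report's configuration, and the function names are hypothetical.

```python
# Sketch of the scoring arithmetic.
# ASSUMPTION: row score = weight * relevance. The report shows the three
# columns but does not say how they are combined.

def pass_mark(total_weight, base=5.0, factor=0.8):
    """Pass mark as given in the report config: 5.0 + 0.8 * total weight."""
    return base + factor * total_weight


def total_score(rows):
    """Sum of weight * relevance over (weight, relevance) rows."""
    return sum(weight * relevance for weight, relevance in rows)


# Rows as displayed for paper 312: 26 rows at 0.0/10 and one row at 5.0/10,
# with every displayed weight equal to 0.0.
rows_312 = [(0.0, 0.0)] * 26 + [(0.0, 5.0)]

threshold = pass_mark(27.0)  # 27 keywords, total weight 27.0
score = total_score(rows_312)
passed = score >= threshold
print(f"score={score:.1f} threshold={threshold:.1f} passed={passed}")
```

Under this assumption a displayed weight of 0.0 zeroes the row regardless of relevance, which is consistent with the 0.0 score shown for the 5.0/10 row.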

!!! tip deepseek-chat TL;DR

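This paper presents a GPU-accelerated TDDFT-risp method that enables excited-state calculations for large organic molecules of 300 to 3000 atoms, completing in minutes to hours on a single A100 GPU.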

Abstract

We introduce a GPU-accelerated implementation of time-dependent density functional theory with the minimal auxiliary basis approach (TDDFT-risp) in GPU4PySCF, together with large system demonstrations carried out using the Tamm–Dancoff approximation (TDA-risp). The method combines GPU-accelerated three-center integral evaluation, tensor contractions, exchange-space truncation, omission of hydrogen atoms from the auxiliary basis, and a host memory assisted Davidson solver. On the EXTEST42 benchmark set, a conservative 40 eV exchange cutoff yields excitation-energy errors relative to standard TDA of about 0.03–0.05 eV for low-lying states. For systems of 300 to 3000 atoms, we demonstrate that TDA-risp calculations of 15 low-lying excited states with $ω$B97XD/def2-SVP complete on a single A100 GPU with wall times ranging from minutes to hours. These results position GPU-TDDFT-risp as a practical route toward excited-state calculations for large organic and biomolecular systems with thousands of atoms.

Keywords: GPU-accelerated, TDDFT, minimal auxiliary basis, large organic molecules, excited-state calculations, Tamm-Dancoff approximation, A100 GPU, computational chemistry

313. ❌ Electronic Collective Variables for Chemical Reactions

Authors: YaoKun Lei, Yi Isaac Yang Journal/Source: arXiv Published: 2026-03-31 arXiv link: http://arxiv.org/abs/2603.29143v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

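Scoring rationale: This paper focuses on electronic collective variables for chemical reactions and uses a neural-network model trained on QM/MM data, which makes it an application of AI to science. It is unrelated to almost all of the keywords (large-model principles, training methods, inference optimization, alignment, agent systems, and so on). It has some relevance only to 'AI for Science OR Bioinformatics OR Cheminformatics', because it involves an AI application in cheminformatics, but it is not core large-model technology, hence 5 points.

!!! tip deepseek-chat TL;DR

This paper proposes an electronic collective variable framework based on atomic charges to describe electron redistribution during chemical reactions. With atomic charges supplied by a neural-network model trained on QM/MM data, it characterizes reaction progress across several reaction environments and reduces reliance on handcrafted geometric descriptors.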

Abstract

Chemical reaction sampling critically depends on collective variables (CVs) that capture the slow degrees of freedom governing reactive transformations. However, existing reaction CVs are often defined in geometric space or learned in a system-specific manner, which limits their transferability and leaves open the more fundamental question of how reaction progress should be represented. From a physical perspective, chemical reactions are defined by electron redistribution. Here, we introduce a charge-space electronic collective variable that describes the electronic component of reaction progress in a common linear form based on atomic charges. To enable its use in enhanced sampling, atomic charges and the corresponding CV gradients are provided by a neural-network model trained on QM/MM data within an iterative sampling-training workflow. Across multiple reactions in aqueous and enzymatic environments, we show that this electronic CV can be constructed in a common charge-space form, with the corresponding coefficients assigned in a simple manner from charge differences between relevant states. Our simulations further show that reaction progress generally involves coupled electronic and conformational components, and that the same framework can also be extended to restrain side reactions. These findings support charge-based electronic CVs as a physically motivated framework for describing the electronic component of chemical reaction progress with reduced reliance on handcrafted geometric descriptors.

Keywords: electronic collective variables, chemical reactions, atomic charges, neural-network model, QM/MM data, enhanced sampling, reaction progress, charge-space
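
The linear charge-space form described in the abstract can be illustrated with a small sketch. The abstract only says that the CV is linear in atomic charges, with coefficients taken from charge differences between relevant states. The normalization below (0 at the reactant, 1 at the product), the function names, and the charge values are assumptions made for illustration, not the authors' definition.

```python
# Illustrative charge-space collective variable (CV).
# ASSUMPTION: coefficients are the product-minus-reactant charge differences,
# scaled so the CV is 0 at the reactant state and 1 at the product state.

def cv_coefficients(q_reactant, q_product):
    """Per-atom coefficients derived from charge differences between states."""
    dq = [p - r for r, p in zip(q_reactant, q_product)]
    norm_sq = sum(d * d for d in dq)
    if norm_sq == 0.0:
        raise ValueError("reactant and product charges are identical")
    return [d / norm_sq for d in dq]


def electronic_cv(charges, coeffs, q_reactant):
    """CV that is linear in the atomic charges, up to a constant offset."""
    return sum(c * (q - r) for c, q, r in zip(coeffs, charges, q_reactant))


# Hypothetical atomic charges (in units of e) for a three-atom fragment.
q_reactant = [-0.8, 0.3, 0.5]
q_product = [-0.2, 0.4, -0.2]
coeffs = cv_coefficients(q_reactant, q_product)

# A configuration halfway between the two charge states.
q_mid = [(r + p) / 2 for r, p in zip(q_reactant, q_product)]
print(electronic_cv(q_reactant, coeffs, q_reactant))
print(electronic_cv(q_mid, coeffs, q_reactant))
print(electronic_cv(q_product, coeffs, q_reactant))
```

In the paper the charges and CV gradients come from a neural-network model; here they are plain lists so that the linear structure is easy to see.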

314. ❌ Hydrogen-helium immiscibility boundary in planets

Authors: Xiaoyu Wang, Sebastien Hamel, Bingqing Cheng Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28927v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

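Scoring rationale: This paper studies where the hydrogen-helium immiscibility boundary lies in planetary interiors, using large-scale molecular dynamics simulations driven by machine learning potentials. Its core is computational physics and planetary science. Although it uses machine learning (machine learning potentials), the keywords all focus on large language models (LLMs) and related techniques (fine-tuning, inference optimization, agent systems, and so on). The only keyword that might apply is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper is an application of AI to science (computational physics); AI is a tool here rather than the core contribution, so it receives 5 points (some relevance). The remaining keywords relate directly to LLM technology, which the paper does not address, so they score 0.

!!! tip deepseek-chat TL;DR

Using large-scale molecular dynamics driven by machine learning potentials, this paper locates the hydrogen-helium immiscibility boundary in planetary interiors, finds demixing temperatures about 2000 K lower than in previous simulations, and indicates that helium rain is plausible in Saturn but unlikely in Jupiter.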

Abstract

The location of the hydrogen-helium (H/He) immiscibility boundary controls whether and where helium rain occurs in giant planets, yet it remains uncertain because high-pressure experiments are challenging and ab initio simulations are limited in system size and simulation time. We map this boundary by computing composition-dependent chemical potentials from large-scale molecular dynamics driven by machine learning potentials trained on three density functional approximations (PBE, vdW-DF, and the hybrid HSE). The three functionals yield consistent immiscibility boundaries, and the demixing temperatures are typically ~2000 K lower than previous ab initio simulations using small system sizes across the pressure range of 100-1000 GPa. Fitting the H/He mixing free energy to a Redlich-Kister regular solution model rationalizes the thermodynamic driving force for phase separation and provides a predictive representation of the boundary. Comparison with current planetary interior profiles indicates that helium rain is plausible in Saturn but unlikely in the warmer interior of Jupiter. Our results narrow the uncertainty in the H/He immiscibility boundary and provide inputs for planetary models that couple demixing, heat transport, and composition gradients in gas giants.

Keywords: hydrogen-helium immiscibility, giant planets, molecular dynamics, machine learning potentials, helium rain, phase separation, planetary interior, thermodynamic modeling
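
The Redlich-Kister regular solution model mentioned in the abstract has a standard textbook form: an ideal mixing term plus an excess term x(1-x) Σ_k L_k (1-2x)^k. The sketch below evaluates that form and checks the sign of the curvature at x = 0.5, which indicates whether the mixture tends to demix at that composition. The coefficient value and temperatures are hypothetical placeholders, not fitted values from the paper.

```python
import math

R = 8.314462618  # molar gas constant, J/(mol K)


def mixing_free_energy(x, temperature, coeffs):
    """Molar mixing free energy of a binary mixture with mole fraction x.

    coeffs[k] is the Redlich-Kister coefficient L_k in J/mol.
    """
    if not 0.0 < x < 1.0:
        raise ValueError("x must lie strictly between 0 and 1")
    ideal = R * temperature * (x * math.log(x) + (1 - x) * math.log(1 - x))
    excess = x * (1 - x) * sum(L * (1 - 2 * x) ** k for k, L in enumerate(coeffs))
    return ideal + excess


def curvature(x, temperature, coeffs, h=1e-4):
    """Central-difference second derivative of the mixing free energy in x."""
    g = lambda y: mixing_free_energy(y, temperature, coeffs)
    return (g(x + h) - 2 * g(x) + g(x - h)) / h**2


coeffs = [40000.0]  # hypothetical L_0 only, i.e. a symmetric regular solution

# With only L_0, the curvature at x = 0.5 is 4RT - 2*L_0, so the mixture
# becomes unstable below T_c = L_0 / (2R).
critical_temperature = coeffs[0] / (2 * R)
cold = curvature(0.5, 2000.0, coeffs)  # negative curvature: demixing
hot = curvature(0.5, 3000.0, coeffs)   # positive curvature: miscible
print(f"T_c = {critical_temperature:.0f} K, cold = {cold:.0f}, hot = {hot:.0f}")
```

Higher-order coefficients (L_1, L_2, ...) make the boundary asymmetric in composition; the simple T_c expression above holds only for the single-coefficient case.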

315. ❌ Quantum coherence governs macroscopic polymorphism in organic semiconductors

Authors: Hai Wang, Tianhong Huang, Jiawei Chang Journal/Source: arXiv Published: 2026-03-30 arXiv link: http://arxiv.org/abs/2603.28834v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

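Scoring rationale: This paper studies how quantum coherence affects macroscopic polymorphism in organic semiconductors, at the intersection of materials science and quantum physics. All keywords concern large models, deep learning, and related techniques, none of which the paper addresses. The only keyword that might apply is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper belongs to a scientific field (materials science), but it uses no AI methods, so it receives 5 points (some relevance). The remaining keywords are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

This paper reports that quantum coherence is the key factor determining the macroscopic polymorphism of the organic semiconductor copper phthalocyanine, and proposes the DIME framework for manipulating environmental decoherence, which enabled the rational design and synthesis of a new polymorph, ω-CuPc.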

Abstract

The wave-particle duality of massive macromolecules – such as the fullerene C$_{60}$ – is a well-established quantum phenomenon. However, whether the quantum behavior of large organic molecules actively dictates the macroscopic structure and function of synthetic materials remains unknown. In organic semiconductors, crystal polymorphism fundamentally determines optoelectronic performance, yet classical thermodynamic models consistently fail to resolve the microscopic origins of phase selection. This includes the long-standing anomaly of divergent polymorph formation under identical thermodynamic parameters across different reactor scales. Here we show that the polymorphism of copper phthalocyanine (CuPc, 576 Da) – a planar macromolecule comparable in mass to C$_{60}$ – is governed by quantum coherence during atmospheric-pressure organic vapor phase deposition (OVPD). We establish the Dissipative structure field-Induced Multipartite Entanglement (DIME) framework, which integrates ambient blackbody radiation, molecular de Broglie wavelengths, and frontier orbital directionality to model field-driven quantum entanglement. We demonstrate that exceptionally weak environmental decoherence at room temperature preserves the coherence of molecular matter waves, enabling the self-assembly of ultralong ($>1$ cm) single-crystalline $η$-CuPc nanowires. By leveraging the DIME framework to manipulate environmental decoherence, we rationally designed an OVPD reactor to synthesize a previously undiscovered polymorph, designated $ω$-CuPc. Our findings reveal that multipartite quantum entanglement acts as the decisive regulator of organic crystal assembly, opening a deterministic, quantum-level pathway for engineering organic semiconductor polymorphism.

Keywords: quantum coherence, organic semiconductors, polymorphism, copper phthalocyanine, molecular entanglement, crystal assembly, DIME framework, decoherence

316. ❌ Enhancing Spin Coherence of Optically-Addressed Molecular Qubit by Nuclear Spin Hyperpolarization

Authors: Boning Li, Patrick Hautle, Duhan Zhang, Liangping Zhu, Ashley Beers, Zeyu Wang, Paola Cappellaro, Tom Wenckebach, Yifan Quan Journal/Source: arXiv Published: 2026-03-29 arXiv link: http://arxiv.org/abs/2603.27872v2

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

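Scoring rationale: This paper studies enhancing the spin coherence of molecular qubits, which belongs to experimental physics and quantum information science. All keywords relate to large models, deep learning, or the principles and applications of AI, none of which this paper addresses. The only keyword that might apply is 'AI for Science OR Bioinformatics OR Cheminformatics', because the work is a scientific application (quantum technology), but the paper uses no AI methods, so it receives only 5 points (some relevance). The remaining keywords are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

This paper studies how nuclear spin hyperpolarization suppresses decoherence caused by the nuclear spin bath in molecular qubits. With the proton spin bath polarized to 60%, the measured spin-echo decay time increased by 25%, providing a tunable design framework for high-coherence molecular spin systems.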

Abstract

Optically addressable molecular triplet spins provide a chemically tunable platform for quantum application, but their coherence is often limited by interactions with surrounding spin baths. Here we demonstrate controlled suppression of nuclear-bath-induced decoherence in photoexcited triplet spins of pentacene co-crystallized in high-purity naphthalene single crystals. By hyperpolarizing the proton spin bath through triplet dynamic nuclear polarization (triplet-DNP), magnetic noise generated by the nuclear spins is suppressed, leading to an extension of the electron spin transverse coherence time. Experimentally, we observe a 25% enhancement of the spin-echo decay time with 60% polarization of the proton spin bath. The measured scaling of the spin-echo decay time ($T_2$) with nuclear polarization quantitatively follows the predicted dependence derived from the polarization-controlled nuclear second moment. Both the enhancement and the absolute value of the coherence time are quantitatively reproduced by cluster correlation expansion (CCE) simulations. These results establish nuclear spin hyperpolarization as a general and actively tunable approach to engineering coherence in molecular qubits. This work provides a broadly applicable design framework for high-coherence molecular and solid-state spin systems.

Keywords: molecular qubit, spin coherence, nuclear spin hyperpolarization, triplet-DNP, decoherence suppression, spin-echo decay time, cluster correlation expansion, solid-state spin systems

Token usage statistics

Total: 1,005,540 tokens (input 694,011 / output 311,529)

Model	Input	Output	Total
deepseek-chat	563,304	311,529	874,833
glm-4.7	130,707	0	130,707
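
The totals above can be cross-checked against the per-model rows with a few lines of Python. The figures are copied from the table; the variable names are illustrative.

```python
# Per-model (input, output) token counts, copied from the table above.
usage = {
    "deepseek-chat": (563_304, 311_529),
    "glm-4.7": (130_707, 0),
}

total_input = sum(inp for inp, _ in usage.values())
total_output = sum(out for _, out in usage.values())
grand_total = total_input + total_output

# Per-model totals, matching the table's last column.
per_model_total = {name: inp + out for name, (inp, out) in usage.items()}

print(f"input={total_input:,} output={total_output:,} total={grand_total:,}")
```

The per-model sums reproduce the reported input, output, and overall totals.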

📋 All papers

1. ✅ Quantization with Unified Adaptive Distillation to enable multi-LoRA based one-for-all Generative Vision Models on edge

2. ✅ One-for-All: A Lightweight Stabilized and Parameter-Efficient Pre-trained LLM for Time Series Forecasting

3. ✅ Learning to Generate Formally Verifiable Step-by-Step Logic Reasoning via Structured Formal Intermediaries

4. ✅ SceneTeract: Agentic Functional Affordances and VLM Grounding in 3D Scenes

5. ✅ 6GAgentGym: Tool Use, Data Synthesis, and Agentic Learning for Network Management

6. ✅ Self-Improving Code Generation via Semantic Entropy and Behavioral Consensus

7. ✅ Aligning Multimodal Sequential Recommendations via Robust Direct Preference Optimization with Sparse MoE

8. ✅ DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA

9. ✅ Distilling Human-Aligned Privacy Sensitivity Assessment from Large Language Models

10. ✅ An Empirical Study of Multi-Agent Collaboration for Automated Research

11. ✅ Theory of Mind and Self-Attributions of Mentality are Dissociable in LLMs

12. ✅ Concept frustration: Aligning human concepts and machine representations

13. ❌ AGFT: Alignment-Guided Fine-Tuning for Zero-Shot Adversarial Robustness of Vision-Language Models

14. ❌ Beyond Idealized Patients: Evaluating LLMs under Challenging Patient Behaviors in Medical Consultations

15. ❌ A Comprehensive Information-Decomposition Analysis of Large Vision-Language Models

16. ❌ Architecting Secure AI Agents: Perspectives on System-Level Defenses Against Indirect Prompt Injection Attacks

17. ❌ NeuroBRIDGE: Behavior-Conditioned Koopman Dynamics with Riemannian Alignment for Early Substance Use Initiation Prediction from Longitudinal Functional Connectome

18. ❌ Reward-Based Online LLM Routing via NeuralUCB

19. ❌ Designing FSMs Specifications from Requirements with GPT 4.0

20. ❌ How Symmetry Governs the Dihedral Angle Dependence of Intermolecular Spin-Orbit Coupling

21. ❌ Perspective of Fermi’s golden rule and its generalizations in chemical physics

22. ❌ Automatic Identification of Parallelizable Loops Using Transformer-Based Source Code Representations

23. ❌ Aligned, Orthogonal or In-conflict: When can we safely optimize Chain-of-Thought?

24. ❌ Tucker Attention: A generalization of approximate attention mechanisms

25. ❌ The Triadic Cognitive Architecture: Bounding Autonomous Action via Spatio-Temporal and Epistemic Friction

26. ❌ Hybrid Framework for Robotic Manipulation: Integrating Reinforcement Learning and Large Language Models

27. ❌ Phyelds: A Pythonic Framework for Aggregate Computing

28. ❌ Scalable AI-assisted Workflow Management for Detector Design Optimization Using Distributed Computing

29. ❌ Enhancing Structural Mapping with LLM-derived Abstractions for Analogical Reasoning in Narratives

30. ❌ Extending MONA in Camera Dropbox: Reproduction, Learned Approval, and Design Implications for Reward-Hacking Mitigation

31. ❌ Quantifying Cross-Modal Interactions in Multimodal Glioma Survival Prediction via InterSHAP: Evidence for Additive Signal Integration

32. ❌ Trimodal Deep Learning for Glioma Survival Prediction: A Feasibility Study Integrating Histopathology, Gene Expression, and MRI

33. ❌ Physiological and Semantic Patterns in Medical Teams Using an Intelligent Tutoring System

34. ❌ Four Generations of Quantum Biomedical Sensors

35. ❌ Structured Intent as a Protocol-Like Communication Layer: Cross-Model Robustness, Framework Comparison, and the Weak-Model Compensation Effect

36. ❌ Rethinking AI Literacy Education in Higher Education: Bridging Risk Perception and Responsible Adoption

37. ❌ Bethe Ansatz with a Large Language Model

38. ❌ ScoringBench: A Benchmark for Evaluating Tabular Foundation Models with Proper Scoring Rules

39. ❌ End-to-End Image Compression with Segmentation Guided Dual Coding for Wind Turbines

40. ❌ Uncertainty Gating for Cost-Aware Explainable Artificial Intelligence

41. ❌ Training deep learning based dynamic MR image reconstruction using synthetic fractals

42. ❌ SISA: A Scale-In Systolic Array for GEMM Acceleration

43. ❌ ATP-Bench: Towards Agentic Tool Planning for MLLM Interleaved Generation

44. ❌ C-TRAIL: A Commonsense World Framework for Trajectory Planning in Autonomous Driving

45. ❌ ShapE-GRPO: Shapley-Enhanced Reward Allocation for Multi-Candidate LLM Training

46. ❌ Towards Empowering Consumers through Sentence-level Readability Scoring in German ESG Reports

47. ❌ Owl-AuraID 1.0: An Intelligent System for Autonomous Scientific Instrumentation and Scientific Data Analysis

48. ❌ From Density Matrices to Phase Transitions in Deep Learning: Spectral Early Warnings and Interpretability

49. ❌ Reasoning-Driven Synthetic Data Generation and Evaluation

50. ❌ From Skeletons to Semantics: Design and Deployment of a Hybrid Edge-Based Action Detection System for Public Safety

51. ❌ Tracking vs. Deciding: The Dual-Capability Bottleneck in Searchless Chess Transformers

52. ❌ TSHA: A Benchmark for Visual Language Models in Trustworthy Safety Hazard Assessment Scenarios

53. ❌ CausalPulse: An Industrial-Grade Neurosymbolic Multi-Agent Copilot for Causal Diagnostics in Smart Manufacturing

54. ❌ BotVerse: Real-Time Event-Driven Simulation of Social Agents

55. ❌ Spontaneous Functional Differentiation in Large Language Models: A Brain-Like Intelligence Economy

56. ❌ Reinforced Reasoning for End-to-End Retrosynthetic Planning

57. ❌ Symphony for Medical Coding: A Next-Generation Agentic System for Scalable and Explainable Medical Coding

58. ❌ Exploring the Impact of Skin Color on Skin Lesion Segmentation

59. ❌ Measuring the metacognition of AI

60. ❌ A First Step Towards Even More Sparse Encodings of Probability Distributions

61. ❌ KEditVis: A Visual Analytics System for Knowledge Editing of Large Language Models