MLSanity

Read It Back: Pretrained MLLMs Are Zero-Shot Reward Models for Text-to-Image Generation

Runhui Huang, Qihui Zhang, Zhe Liu, Yu Gao, Jie Wu, Hengshuang Zhao (cs.CV)

In this paper, we propose SpectraReward, a training-free reward function that turns pretrained MLLMs into off-the-shelf reward models for image-generation reinforcement learning. Instead of asking the MLLM to judge a generated image or answer decomposed verification questions, SpectraReward measures how well the original prompt can be recovered from the generated image through a single image-conditioned, teacher-forced forward pass. We use the average image-conditioned prompt log-likelihood as the reward, directly reusing the MLLM's pretrained image-text alignment ability without preference labels or reward-model fine-tuning. We further introduce Self-SpectraReward, a special case for unified multimodal models where the policy's own understanding branch serves as the reward model for its generation branch, forming a closed-loop self-improving framework without external reward models or external knowledge. Extensive experiments validate SpectraReward through a broad image-generation RL study covering two diffusion models, three RL algorithms, nine reward MLLM backbones from four MLLM families spanning 4B to 235B parameters, and five out-of-distribution text-to-image benchmarks. Results show that both SpectraReward and Self-SpectraReward significantly and consistently improve generation performance and outperform prior MLLM-derived reward training methods. Further analysis reveals that larger reward MLLMs are not always better, while Self-SpectraReward can match or surpass much larger external reward models, suggesting that reward-policy alignment is a key factor for effective image-generation RL. Project Page: https://huangrh99.github.io/SpectraReward/
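The reward described in this abstract, the average image-conditioned log-likelihood of the prompt tokens under teacher forcing, can be illustrated with a minimal sketch. It assumes the MLLM forward pass has already produced one logit vector per prompt position; the function names and the toy logits are hypothetical and are not taken from the paper's code.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def mean_prompt_log_likelihood(step_logits, prompt_token_ids):
    """Average log p(token_t | image, tokens_<t) over the prompt tokens.

    step_logits[t] is assumed to be the logit vector the model emits at
    position t of a teacher-forced pass (hypothetical input layout).
    """
    if not prompt_token_ids or len(step_logits) != len(prompt_token_ids):
        raise ValueError("need one logit vector per prompt token")
    total = 0.0
    for logits, token_id in zip(step_logits, prompt_token_ids):
        total += log_softmax(logits)[token_id]
    return total / len(prompt_token_ids)

# Toy example: vocabulary of 4 tokens, prompt of 2 tokens.
uniform = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
peaked = [[5.0, 0.0, 0.0, 0.0], [0.0, 5.0, 0.0, 0.0]]
reward_uniform = mean_prompt_log_likelihood(uniform, [0, 1])
reward_peaked = mean_prompt_log_likelihood(peaked, [0, 1])
```

In this toy setting, an image that makes the prompt tokens more predictable yields the higher reward, which is the behavior the abstract relies on.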

Published: July 13, 2026

Last updated: July 13, 2026

Latent-Identity Tuning in Text-to-Image Personalization Models

Daniel Garibi, Ronen Kamenetsky, Hadar Averbuch-Elor, Daniel Cohen-Or, Or Patashnik (cs.CV, cs.GR)

Generating and editing a person's face demands high precision, as even minor modifications can significantly alter a subject's perceived identity. Current personalization and editing methods built on general-purpose text-to-image models, however, often lack the precision required for fine-grained facial edits. We present a method for fine-grained identity tuning in text-to-image personalization models. Unlike standard image editing, which operates on a given image, identity tuning modifies the latent representation of a specific identity, enabling the generation of diverse images that consistently depict the same edited identity. To enable fine-grained latent identity tuning, we explore the latent space of a pre-trained, frozen encoder for text-to-image personalization. Our approach requires no additional training. Instead, it leverages the existing architecture of a frozen encoder to uncover latent semantic directions. This space consists of a set of latent tokens that play distinct roles in capturing different aspects of an identity and often correspond to specific spatial or semantic facial regions. We show that meaningful directions can be identified within this space and within subspaces defined by selected tokens, enabling localized, fine-grained, and semantically coherent edits. We validate our approach through qualitative and quantitative experiments that demonstrate diverse localized facial edits while preserving cross-image identity consistency. Project page at: https://garibida.github.io/IdentityTuning/

Published: July 13, 2026

Last updated: July 13, 2026

Mixture of Frames Policy: Multi-Frame Action Denoising for Bimanual Mobile Manipulation

Dian Wang, Jisang Park, Xiaomeng Xu, Han Zhang, Shuran Song, Jeannette Bohg (cs.RO)

Robotic manipulation is inherently multi-frame: local actions may be simple in an end-effector frame, while transport, upright-object handling, and whole-body coordination are better represented in a base-aligned frame. However, modern diffusion-based visuomotor policies typically commit to a single predefined action frame, forcing one denoiser to model action distributions that are often unnecessarily complex in that frame. We propose Mixture of Frames Policy (MoF), a diffusion policy that performs synchronized action denoising across multiple coordinate frames. MoF maintains a single canonical diffusion state, re-expresses it in several task-relevant frames, applies frame-specialized denoisers, and fuses their noise predictions back in the canonical frame. To make this possible for intermediate noisy diffusion states, we introduce a column-based 6D rotation representation within an SE(3) action parameterization that supports exact, differentiable frame transformations without requiring noisy rotations to lie on the SO(3) manifold. Across nine simulated bimanual manipulation tasks, we show that the best action frame is task-dependent and that MoF improves over oracle frame selection and standard Mixture-of-Experts (MoE) baselines. We further evaluate MoF on two real-world bimanual mobile manipulation tasks, demonstrating that it outperforms all constituent single-frame baselines. Project homepage: https://mofpo.github.io
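One way to see why a column-based rotation representation can be re-expressed across frames even when it is noisy: left-multiplying a rotation matrix by a frame rotation acts on each column independently and linearly. The sketch below assumes the 6D representation stores the first two columns of the rotation matrix, which is a common convention but an assumption here; the helper names are hypothetical and this is not the authors' implementation.

```python
def matvec(A, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(A[i][j] * v[j] for j in range(3)) for i in range(3)]

def transpose(A):
    """Transpose a 3x3 matrix; for a rotation this is also its inverse."""
    return [[A[j][i] for j in range(3)] for i in range(3)]

def change_frame_6d(A, r6):
    """Re-express a 6D rotation (two stacked columns) under frame rotation A.

    The map is linear in r6, so it is well defined even when the two
    columns are noisy and not orthonormal (off the SO(3) manifold).
    """
    c1, c2 = r6[:3], r6[3:]
    return matvec(A, c1) + matvec(A, c2)

# Frame rotation: 90 degrees about the z-axis.
A = [[0.0, -1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]

# A noisy, non-orthonormal 6D state, as might occur mid-denoising.
noisy = [0.9, 0.2, -0.1, 0.3, 1.1, 0.05]
in_other_frame = change_frame_6d(A, noisy)
round_trip = change_frame_6d(transpose(A), in_other_frame)
```

The round trip recovers the original noisy state without any projection step, which is the property the abstract describes as exact frame transformation for intermediate diffusion states.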

Published: July 13, 2026

Last updated: July 13, 2026

Requential Coding: Pushing the Limits of Model Compression with Self-Generated Training Data

Shikai Qiu, Marc Finzi, Yujia Zheng, Kun Zhang, Andrew Gordon Wilson (cs.LG)

Compression is fundamental to intelligence. A model that can represent its training data as a short code has discovered regularities that enable generalization. Large neural networks may learn functions far simpler than their parameter counts suggest, but it is challenging to construct codes that realize this simplicity. Parameter-based methods such as quantization produce code lengths that scale with model size, insensitive to how much information the parameters store. Prequential coding bypasses this issue by compressing the training trajectory, but codes the exact data sequence regardless of how much the model learns, yielding large codes when the data has high entropy. We introduce requential coding, where a teacher model selects training samples drawn from the student's own distribution. The student's code records only these selections, which cost bits only where teacher and student disagree. The resulting code length is independent of parameter count and data entropy, and often orders of magnitude shorter than the prequential counterpart, with an advantage that grows with scale. This compression sheds light on phenomena inaccessible to prior compressors. Holding loss fixed, larger models and ensembles compress to much smaller sizes despite more parameters. Plugged into a PAC-Bayes bound, the requential code yields state-of-the-art generalization guarantees for billion-parameter LLMs, outperforming bounds built on aggressive post-training quantization even granted zero error. The bound tightens with scale in the compute-optimal regime, as models become increasingly compressible relative to dataset size. The same code predicts that models gradually overfit when trained for multiple epochs. It also isolates the learnable information in a dataset from its unpredictable, random content, revealing that lower-entropy text holds far more learnable structure than higher-entropy image data.

Published: July 13, 2026

Last updated: July 13, 2026

Metacognition in LLMs: Foundations, Progress, and Opportunities

Gabrielle Kaili-May Liu, Areeb Gani, Jacqueline Lu, Jordan Thomas, Mark Steyvers, Arman Cohan (cs.CL, cs.AI)

Metacognition is a foundational component of intelligence critical to effective learning, problem solving, decision-making, communication, and more. In recent years, it has become increasingly recognized as a cornerstone of capable, transparent AI systems. Yet while LLMs have made significant progress across diverse real-world tasks, it is not yet clear when, how, or to what extent they can exhibit or be endowed with effective metacognitive abilities, nor how such abilities can be adapted to advance the fundamental capabilities, reliability, and intelligence of AI systems. This paper bridges this gap by presenting the first comprehensive overview of the current state of knowledge on metacognition for LLMs. We analyze and taxonomize the landscape of this emerging field and summarize recent technical advancements, including methods and benchmarks to measure and evaluate LLMs' metacognitive abilities, techniques to elicit, improve, and apply metacognition in LLMs, and findings and implications of ongoing research. We also discuss applications, open questions and challenges, and promising directions for future work. Our aim is to provide a detailed and up-to-date review of this topic and stimulate meaningful research and discussion. An organized list of papers can be found at https://github.com/yale-nlp/LLM-Metacognition.

Published: July 13, 2026

Last updated: July 13, 2026

Invariant Learning Dynamics of Transformers in Inductive Reasoning Tasks

Tiberiu Musat, Tiago Pimentel, Nicholas Zucchet, Thomas Hofmann (cs.LG, cs.AI)

We present a theoretical framework to explain the emergence of inductive reasoning abilities in Transformer language models. While previous works on Transformer learning dynamics have so far been mostly tied to specific tasks, we study a generalized class of inductive tasks that unifies several synthetic tasks known in the literature, including in-context n-grams and multi-hop reasoning. In this class, we theoretically prove that the training dynamics of attention models can be confined to a highly interpretable, low-dimensional invariant manifold. On this manifold, the learning dynamics are captured by a handful of interpretable coordinates rather than millions of parameters, making both theoretical and empirical analysis more tractable. Using this framework, we characterize how data statistics govern the competition between in-context and in-weights learning, we study how random initializations determine the "winning" circuit when multiple solutions are possible, and we demonstrate that the coordinate frame associated with the manifold can be used to automatically detect which circuits have been learned in trained models. By casting circuit formation as a low-dimensional dynamical phenomenon, we take a step toward a predictive theory of how Transformers learn.

Published: July 13, 2026

Last updated: July 13, 2026

A Minimalist Retargeting-Guided Reinforcement Learning Recipe for Dexterous Manipulation

Yunhai Feng, Natalie Leung, Jiaxuan Wang, Lujie Yang, Haozhi Qi, Preston Culbertson (cs.RO, cs.AI, cs.LG)

Recent work in humanoid whole-body control has found success with a simple recipe: retarget human motion to robot kinematic references, then train policies via reinforcement learning (RL) to track them. But how does this recipe transfer to dexterous manipulation? The answer is not obvious, as manipulation involves complex, contact-rich dynamics and requires delicate regulation of contact modes and forces. We present REGRIND, a minimalist retargeting-guided RL pipeline that learns dexterous manipulation policies from a single human demonstration. REGRIND retargets human hand-object motion to a robot reference that preserves hand-object spatial and contact relationships, trains a residual RL policy in simulation to track object-centric keypoints along that reference, and transfers the resulting policy zero-shot to hardware with careful system identification. The resulting policies produce fluid, human-like behavior on two different multi-fingered hands across contact-rich tool-use tasks, including operating a pair of scissors and turning a screwdriver. Through systematic hardware experiments, we identify and analyze the key factors that govern sim-to-real transfer in dexterous manipulation, offering practical guidance for retargeting-based learning in contact-rich settings. Videos and code are available at https://yunhaifeng.com/REGRIND.

Published: July 13, 2026

Last updated: July 13, 2026

A Durability and Cross-Language Transfer Benchmark for a Validated Teaching-Feedback Classification Protocol

Esteban U. Vega Barajas (cs.CL, cs.LG)

Institutions collect far more open-ended teaching-evaluation feedback than they read. A prior study introduced a validated protocol for classifying such comments by thematic category and sentiment, built from a documented annotation guide, an intra-annotator reliability measurement, stratified cross-validation, and a held-out evaluation on a Spanish institutional corpus with a frozen-encoder design. Two questions limit its reuse: whether a protocol fixed to 2019-era frozen embeddings stays competitive as representation methods advance, and whether it transfers to a second language. We re-run it on the original Spanish data across three representation generations (sparse lexical features, frozen transformer embeddings, and prompted large language models) and transfer its sentiment task to English with a balanced 45,000-comment corpus checked against an aspect-labeled education dataset. Treating paired comparisons as descriptive, we find the protocol durable: a 2026 frontier model posts the highest thematic F1 on the hardest Spanish task, yet shows no sentiment advantage over a cheap model and no descriptive separation from it on English, so model choice is a deployment decision, not a property of the method.

Published: July 13, 2026

Last updated: July 13, 2026

Inside the Unfair Judge: A Mechanistic Interpretability Account of LLM-as-Judge Bias

Zixiang Xu, Sixian Li, Huaxing Liu, Xiang Wang, Shuai Li, Zirui Song, Xiuying Chen (cs.LG, cs.AI, cs.CL)

Existing studies of LLM-as-judge scoring bias work predominantly at the input-output level: they perturb inputs, measure score deltas, and propose prompt-level mitigations. We argue that the same biases admit a representation-level account in the judge's hidden state, complementary to the input-output view and operationally useful in ways it does not afford. We report three findings, across seven judges, seven bias types, and nine benchmarks. Geometry: baseline judging inputs occupy a tight activation manifold while biased inputs are displaced along a low-dimensional, type-specific subspace that sharpens with depth and is recovered consistently by three families of estimators. Causal control: steering hidden states along this subspace drives scoring in both directions, forward shifts reproducing biased scoring on clean inputs and reverse shifts restoring baseline scoring on biased ones, while matched-norm random directions produce shifts an order of magnitude smaller. Operational: a simple linear projection onto the same bias-direction features anticipates judge failures on three entirely unseen benchmarks, substantially outperforming text-based alternatives. Reading bias as activation geometry, rather than as input-output noise, unifies geometric structure, causal control, and operational prediction within a single framework. The project page is available at https://xzx34.github.io/unfair-judge/

Published: July 13, 2026

Last updated: July 13, 2026

Evidence-Backed Video Question Answering

Shijie Wang, Honglu Zhou, Ziyang Wang, Ran Xu, Caiming Xiong, Silvio Savarese, Chen Sun, Juan Carlos Niebles (cs.CV, cs.AI)

Current Video Large Language Models (Video LLMs) excel in question answering (QA) but largely operate as black boxes, providing textual answers without verifiable visual grounding. Existing explainability efforts rely on textual rationales or sparse bounding boxes, which struggle to capture complex video dynamics such as occlusions and non-rigid deformations. We propose Evidence-Backed Video Question Answering (E-VQA), a novel task requiring models to jointly output a semantic answer and precise spatio-temporal evidence: temporal segments and dense, tracked object segmentation masklets. To support this, we introduce ST-Evidence, the first human-verified benchmark for both discriminative and generative pixel-level grounding. Evaluations of state-of-the-art models reveal a critical decoupling between QA accuracy and true visual perception that scaling alone fails to bridge. To address this, we develop scalable, automated generation pipelines to create ST-Evidence-Instruct, a 160k-scale dataset bridging high-level reasoning with fine-grained grounding. Fine-tuning grounded Video LLMs on this data yields substantial gains over the corresponding size-matched UniPixel baselines (e.g., +27.2 t-mean and +13.8 J&F on a 7B model), establishing a robust baseline for explainable, evidence-backed video understanding. Code and data are available at https://github.com/SalesforceAIResearch/EVQA.

Published: July 13, 2026

Last updated: July 13, 2026

Can LLMs Perform Deep Technical Comprehension of Computer Architecture Papers?

Nishant Aggarwal, Ayushi Dubal, Sreeraj Kannakarankodi, Ian McDougall, Adarsh Mittal, Vishnu Ramadas, Noah Scott, Ranganath Selagamsetty, Weichu Yang, Karthikeyan Sankaralingam (cs.CY, cs.AR, cs.MA)

Can large language models perform deep technical comprehension of computer architecture papers -- not summarization, but structured critique that names the core mechanism, surfaces buried assumptions, and connects a contribution beyond its own scope? We study Gauntlet, an open-source pipeline that analyzes a paper through five independent expert-persona reviewers and an adversarial synthesis stage. On 20 ISCA 2025 and HPCA 2026 papers, ten researchers each wrote their own analyses and then judged, for papers other than their own, the human analysis against Gauntlet's. Across the 20 comparisons evaluators preferred Gauntlet in 15 (human in 4, one tie); its advantage is significant on per-analyst totals (paired Wilcoxon, p < 0.01) and largest on Critical Rigor, vanishing only on Calibration. Where humans win, it is on trust and usefulness rather than depth: a confident wrong claim, a mechanism described but not taught, or unprioritized breadth. A 98-paper automated ablation shows the gain comes from the multi-agent structure -- the pipeline beats the same model run as a single rich-persona agent on 96% of papers -- and specifically from its synthesis pass. We release all analyses, scores, and the rubric as a community resource.

Published: July 13, 2026

Last updated: July 13, 2026

Causal Discovery in Mixtures of Populations

Bijan Mazaheri, Spencer Gordon, Yuval Rabani, Leonard Schulman (cs.LG, cs.CC, math.ST)

Causal discovery aims to learn causal structures up to certain symmetries. Diverse populations or changing environments give rise to heterogeneous data in the following sense: each population/environment is a "source" which idiosyncratically determines the forms of causal effects. From this perspective, the source is a latent common cause for every observed variable. While some methods for causal discovery can work around latent confounding in special cases, a global confounder poses a significant challenge. The only known ways to deal with latent global confounding involve making assumptions that limit structural equations and/or noise functions. We demonstrate that globally confounded causal structures can still be identified with arbitrary structural equations and noise functions, so long as the number of latent classes remains small relative to the size and sparsity of the underlying DAG. The approach relies on agglomerating variables into large-enough matrices of moments, whose ranks directly reveal graphical properties of the causal structure. We also provide a statistical test for the rank of these matrices.

Published: November 13, 2023

Last updated: July 13, 2026

Robust bipedal locomotion on flowable slopes via foot-driven terrain manipulation

Deniz Kerimoglu, Junnosuke Kamohara, Jiyeon Maeng, Ziwon Yoon, Seth Hutchinson, Ye Zhao, Daniel I. Goldman (cs.RO)

Bipedal robots are challenging to control because they operate close to instability, where small variations in foot-terrain contact can rapidly destabilize locomotion. On rigid terrain, bipedal robots mitigate this fragility by using well-established contact mechanics and control strategies. On flowable surfaces such as granular slopes, foot contact can induce large surface deformations and solid-fluid-like transitions, coupling terrain effects with robot dynamics and leading to underperformance or failure. This is partly due to the lack of reliable methods to represent the dynamics of flowable terrain, making it difficult to account for terrain effects in locomotion design. Here, we investigate how controlling terrain response can improve bipedal locomotion on granular slopes by studying the terradynamics of cleated feet, that is, feet with thin plates emanating from the soles. Systematic studies of a small-scale (1.4 kg) robophysical biped reveal that cleats with sparse and dense spacing lead to excessive terrain yielding and resistance, respectively, degrading performance and leading to failure. An intermediate cleat spacing distributes interaction forces to maintain substrate stresses near (or below) the yield threshold, enabling walking on granular slopes up to 30 degrees. Guided by these principles, we design a foot that actively adjusts cleat depth and accommodates both rigid and granular terrain. We also demonstrate that the principles of effective foot-terrain interaction translate to a larger (15 kg) autonomous biped. Our study presents an alternative to conventional body-centric robot control approaches, which regulate terrain-induced effects through body motion, by instead regulating terrain interactions through a limb-centric approach.

Published: July 13, 2026

Last updated: July 13, 2026

Need for Speed Sort: A Recursive Distribution-Based Sorting Algorithm

Fran Sučić, Leo Vitasović, Nikola Petrušić (cs.DS)

We present Need for Speed Sort (NFS Sort), a recursive distribution-based sorting algorithm designed for numeric arrays. The algorithm partitions elements into equal-width value intervals, recursively refines dense buckets, and propagates analytical interval bounds between recursive calls, avoiding repeated scans for local minima and maxima. NFS Sort combines a fragment-based, cache-conscious scatter procedure for large subarrays with a lower-overhead auxiliary-array approach for smaller inputs. Small buckets are deferred to a final insertion-sort cleanup, while a comparison-based fallback is activated when recursive partitioning repeatedly fails to reduce the problem size. This mechanism guarantees a worst-case running time of O(n log n) and auxiliary space usage of O(log n). Experimental evaluation on synthetic inputs and real-world datasets from the SOSD benchmark suite compares NFS Sort with Balanced Learned Sort, IPS4o, Boost Spreadsort, PDQSort, and std::sort. The results show that NFS Sort is competitive with or better than established state-of-the-art sorting methods across dataset sizes and distributions, outperforming the learned baseline particularly on smaller inputs while retaining strong performance at larger scales. Overall, NFS Sort combines efficient recursive distribution, practical memory management, and robust worst-case guarantees for high-performance numeric sorting.
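The general shape of the strategy described in this abstract (equal-width partitioning, recursion on dense buckets with bounds derived from the parent interval, deferred insertion sort for small buckets, and a comparison-based fallback) can be sketched as follows. This is a simplified illustration only: it uses plain Python lists instead of the in-place scatter procedures, so it does not reproduce the stated O(log n) auxiliary space or the performance characteristics, and the thresholds `small` and `depth` are hypothetical choices. It assumes finite numeric input.

```python
def _insertion_sort(a):
    """In-place insertion sort; cheap when the input is nearly sorted."""
    for i in range(1, len(a)):
        x, j = a[i], i - 1
        while j >= 0 and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x

def _distribute(items, lo, hi, small, depth, out):
    n = len(items)
    if n <= small:
        out.extend(items)            # deferred to the final cleanup pass
        return
    width = (hi - lo) / n if hi > lo else 0.0
    if depth == 0 or width <= 0.0:
        out.extend(sorted(items))    # comparison-based fallback
        return
    buckets = [[] for _ in range(n)]
    for x in items:
        i = min(n - 1, max(0, int((x - lo) / width)))
        buckets[i].append(x)
    if max(len(b) for b in buckets) == n:
        out.extend(sorted(items))    # partitioning made no progress
        return
    for i, b in enumerate(buckets):
        if b:
            # Child bounds come from the parent interval, not a rescan.
            _distribute(b, lo + i * width, lo + (i + 1) * width,
                        small, depth - 1, out)

def distribution_sort(values, small=16, depth=8):
    """Return a sorted copy of a list of finite numbers."""
    if not values:
        return []
    out = []
    _distribute(list(values), min(values), max(values), small, depth, out)
    _insertion_sort(out)             # final cleanup over small buckets
    return out

data = [((i * 7919) % 1000) / 7.0 - 50.0 for i in range(500)]
result = distribution_sort(data)
```

Because buckets are emitted in interval order, the array handed to the final insertion sort is already nearly sorted, which is what keeps that cleanup pass cheap in this sketch.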

Published: July 13, 2026

Last updated: July 13, 2026

AdvancedMathBench: A Benchmark Suite for Advanced Mathematical Proof Generation and Verification

Lingkai Kong, Zijian Wu, Yuzhe Gu, Haiteng Zhao, Wenyong Huang, Shuang Sun, Zhicheng Xiong, Xiaotian Zhang, Shuya Zhao, Yan Wang, Disheng Xu, Wenwei Zhang, Kai Chen (cs.CL)

Large language models (LLMs) have achieved remarkable performance on high-school and olympiad-style mathematics, yet their capabilities on advanced mathematics remain poorly understood. Existing benchmarks, however, fall short in both scope and evaluation granularity: they provide limited disciplinary coverage and often rely on final-answer correctness or coarse judgments, leaving the validity of the reasoning process inadequately assessed. To bridge this gap, we introduce AdvancedMathBench, a benchmark suite designed to evaluate advanced mathematical reasoning capabilities. Its core proof-generation benchmark, ProverBench, contains 296 problems spanning undergraduate and doctoral qualifying-exam levels. To provide reliable evaluation of the proofs, we develop a dedicated automatic verification pipeline trained on large-scale expert annotations to produce both correctness verdicts and fine-grained assessments of proof errors, which exhibits strong agreement with human experts on held-out proof trajectories. We further introduce VerifierBench, consisting of 888 model-generated proof trajectories paired with expert ground truth, to evaluate whether models can correctly judge proof validity and provide sound verification rationales. Experiments show that AdvancedMathBench remains challenging for frontier models. On proof generation, the best-performing model, GPT-5.5-xhigh, achieves only 75.8 and 66.1 on the UGD and QE splits, respectively, indicating substantial room for improvement on advanced mathematical proof construction. On proof verification, the best model attains a Balanced F1 of only 65.1, and models generally exhibit low true negative rates, suggesting that critical error detection remains a major bottleneck.

Published: July 13, 2026

Last updated: July 13, 2026

Are DeepFakes Realistic Enough? Exploring Semantic Mismatch as a Novel Challenge

Sharayu Nilesh Deshmukh, Kailash A. Hambarde, Joana C. Costa, Hugo Proença, Tiago Roxo (cs.CV)

Current DeepFake detection scenarios are mostly binary, yet data manipulation can vary across audio, video, or both, and this variability is not captured in binary settings. Four-class audio-visual formulations address this by discriminating manipulation type, but introduce an unresolved problem: models may rely solely on data source integrity to detect DeepFakes without evaluating their semantic consistency. If the DeepFake origin is not in the data source but in its content, can semantic mismatch be assessed by the state-of-the-art? This paper proposes a new evaluation setup, extending the four-class formulation by explicitly modeling semantic-level inconsistency between authentic modalities with the introduction of a new class: Real Audio-Real Video with Semantic Mismatch (RARV-SMM). We assess the robustness of state-of-the-art models in this new realistic DeepFake setting, using the FakeAVCeleb dataset, highlighting the limitations of existing approaches when faced with semantic mismatch data. We further introduce three RARV-SMM variants that expose distinct architectural vulnerabilities as audio-visual divergence increases. We also propose a semantic reinforcement strategy that incorporates the semantic mismatch class and ImageBind embeddings to probe whether an explicit semantic coherence signal improves detection across architectures with different detection strategies, on FakeAVCeleb and LAV-DF, contributing toward more realistic DeepFake detectors. The source code is available at https://github.com/sharayu-20/deepfake-semantic-mismatch.

Published: April 30, 2026

Last updated: July 13, 2026

PiCSAR: Probabilistic Confidence Selection And Ranking for Reasoning Chains

Joshua Ong Jun Leang, Zheng Zhao, Aryo Pradipta Gema, Sohee Yang, Wai-Chung Kwan, Xuanli He, Wenda Li, Pasquale Minervini, Eleonora Giunchiglia, Shay B. Cohen (cs.CL, cs.AI)

Best-of-n sampling improves the accuracy of large language models (LLMs) and large reasoning models (LRMs) by generating multiple candidate solutions and selecting the one with the highest reward. The key challenge for reasoning tasks is designing a scoring function that can identify correct reasoning chains without access to ground-truth answers. We propose Probabilistic Confidence Selection And Ranking (PiCSAR): a simple, training-free method that scores each candidate generation using the joint log-likelihood of the reasoning and final answer. The joint log-likelihood of the reasoning and final answer naturally decomposes into reasoning confidence and answer confidence. PiCSAR achieves substantial gains across diverse benchmarks (+10.18 on MATH500, +9.81 on AIME2025), outperforming baselines with at least 2x fewer samples in 16 out of 20 comparisons. Our analysis reveals that correct reasoning chains exhibit significantly higher reasoning and answer confidence, justifying the effectiveness of PiCSAR.
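The decomposition mentioned in this abstract follows from the chain rule: log p(reasoning, answer | question) = log p(reasoning | question) + log p(answer | question, reasoning). A minimal best-of-n selection sketch under that reading is shown below. It assumes per-token log-probabilities are already available for each candidate; the function names and toy numbers are hypothetical, and any normalization the authors may apply is not modeled.

```python
def joint_log_likelihood(reasoning_logprobs, answer_logprobs):
    """Sum of the two chain-rule terms for one candidate.

    reasoning_logprobs: per-token log p of the reasoning chain.
    answer_logprobs: per-token log p of the final answer, conditioned
    on the question and the reasoning chain.
    """
    reasoning_confidence = sum(reasoning_logprobs)
    answer_confidence = sum(answer_logprobs)
    return reasoning_confidence + answer_confidence

def select_best(candidates):
    """Return the index of the candidate with the highest joint score."""
    return max(range(len(candidates)),
               key=lambda i: joint_log_likelihood(*candidates[i]))

# Each candidate is (reasoning token log-probs, answer token log-probs).
candidates = [
    ([-0.5, -0.5], [-1.0]),   # joint score -2.0
    ([-0.1, -0.2], [-0.3]),   # joint score -0.6
    ([-0.1, -0.1], [-2.5]),   # confident reasoning, unsure answer: -2.7
]
best_index = select_best(candidates)
best_score = joint_log_likelihood(*candidates[best_index])
```

The third candidate shows why both terms matter in this sketch: a fluent reasoning chain does not win if the model is unsure of the answer it leads to.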

Published: August 29, 2025

Last updated: July 13, 2026

Beyond the Single Camera: Agentic Multi-View Reasoning in Sports Video Understanding

Kerui Chen, Jinglu Wang, Xiaoyi Zhang, Yan Lu (cs.CV)

Recent Multimodal Large Language Models (MLLMs) achieve strong performance on single-view video understanding benchmarks. However, sports videos involve dense occlusion, rapid motion, and complex interactions that are difficult to resolve from a single viewpoint. In practice, sports events are recorded from multiple camera angles, providing complementary evidence used by referees. Yet, no existing benchmark evaluates MLLMs on multi-view sports video understanding. To address this gap, we introduce SportMV-Bench, a comprehensive benchmark built from official match recordings through a dedicated pipeline combining LLM-based generation, MLLM-based verification, and human filtering to ensure quality and consistency. SportMV-Bench contains 787 multi-view video bundles and 2592 question-answer pairs across three categories: Perception-Aware Recognition (PAR), Rule-aware Event Interpretation (REI), and Adjudicative Decision Reasoning (ADR). Our analysis shows that current MLLMs fail to effectively exploit multi-view information, with the bottlenecks lying in fine-grained visual perception and view selection rather than logical reasoning or domain knowledge. We propose SportMV-Agent, an agentic framework that orchestrates an iterative loop of active view selection, perception tool execution, and evidence-grounded reasoning, achieving a significant 14.46% relative improvement over the strongest MLLM baseline.

Published: July 13, 2026

Last updated: July 13, 2026

Input-Aware Dynamic Backdoor Attack Against Quantum Neural Networks

Junrui Zhang, Zemin Chen, Lusi Li, Mohammad Ghasemigol, Daniel Takabi, Rui Ning (quant-ph, cs.LG)

Quantum Neural Networks (QNNs) are a promising framework for quantum machine learning on near-term quantum devices, but their security risks remain insufficiently understood. Studies have shown that QNNs are vulnerable to backdoor attacks, yet existing quantum backdoors mostly rely on a fixed trigger shared by all poisoned inputs. This fixed-trigger design is a major weakness because many defenses detect or weaken the repeated patterns such triggers leave in data representations. Although input-aware dynamic backdoors have been studied in classical neural networks, transferring them to QNNs is difficult because quantum learning introduces new obstacles. In particular, measurement compresses the post-ansatz quantum state into a limited classical output, weakening supervision for a trigger generator, while individual density matrices fluctuate with the input and make per-sample contrastive learning unstable. To address these challenges, we propose Q-DIBA, the first input-aware dynamic backdoor attack for QNNs. Q-DIBA jointly trains a classical trigger generator and a victim QNN through a three-mode mini-batch strategy that supports clean behavior, attack activation, and trigger specificity. To provide stable quantum-level supervision, Q-DIBA introduces an ensemble density contrastive loss that operates on post-ansatz quantum states before measurement and contrasts mode-averaged density matrices rather than individual samples. Experiments on MNIST and Fashion-MNIST across multiple QNN architectures show that Q-DIBA achieves high clean accuracy, strong attack success, and high cross-trigger accuracy, demonstrating effectiveness, stealthiness, and input specificity. The attack also remains resilient against defenses including visual inspection, spectral-signature detection, and fine-tuning, suggesting that input-aware quantum backdoors are an important threat to secure QNN deployment.

Published: July 13, 2026

Last updated: July 13, 2026

Accelerating Sampling-Based Control via Learned Linear Koopman Dynamics

Wenjian Hao, Yuxuan Fang, Zehui Lu, Shaoshuai Mou (cs.RO, eess.SY)

This paper presents an efficient model predictive path integral (MPPI) control framework for systems with complex nonlinear dynamics. To improve the computational efficiency of classic MPPI while preserving control performance, we replace the nonlinear dynamics used for trajectory propagation with a learned linear deep Koopman operator (DKO) model, enabling faster rollout and more efficient trajectory sampling. The DKO dynamics are learned directly from interaction data, eliminating the need for analytical system models. The resulting controller, termed MPPI-DK, is evaluated in simulation on pendulum balancing and surface vehicle navigation tasks, and validated on hardware through reference-tracking experiments on a quadruped robot. Experimental results demonstrate that MPPI-DK achieves control performance close to MPPI with true dynamics while substantially reducing computational cost, enabling efficient real-time control on robotic platforms.
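The idea of sampling rollouts through a linear model can be illustrated with a minimal sketch. It assumes the lifted state is already available (the learned encoder is omitted), and the matrices `A` and `B`, the quadratic cost, and all hyperparameters are hypothetical values chosen for illustration, not the paper's learned DKO model.

```python
import math
import random


def rollout_cost(A, B, z0, controls, target):
    """Propagate z_{t+1} = A z_t + B u_t and sum a quadratic tracking cost."""
    z = list(z0)
    cost = 0.0
    for u in controls:
        # One linear step in the lifted space: matrix-vector product plus input term.
        z = [sum(A[i][j] * z[j] for j in range(len(z))) + B[i] * u
             for i in range(len(z))]
        cost += sum((zi - ti) ** 2 for zi, ti in zip(z, target)) + 0.01 * u * u
    return cost


def mppi_step(A, B, z0, nominal, target, num_samples=256, sigma=0.5, lam=1.0, seed=0):
    """One MPPI update: sample perturbed control sequences and weight them by cost."""
    rng = random.Random(seed)
    noises = [[rng.gauss(0.0, sigma) for _ in nominal] for _ in range(num_samples)]
    costs = [rollout_cost(A, B, z0, [u + e for u, e in zip(nominal, eps)], target)
             for eps in noises]
    best = min(costs)  # subtract the minimum for numerical stability
    weights = [math.exp(-(c - best) / lam) for c in costs]
    total = sum(weights)
    # New nominal sequence: cost-weighted average of the sampled perturbations.
    updated = [u + sum(w * eps[t] for w, eps in zip(weights, noises)) / total
               for t, u in enumerate(nominal)]
    return updated, costs


# Hypothetical 2-D lifted state (position, velocity) with a scalar control.
A = [[1.0, 0.1], [0.0, 1.0]]
B = [0.0, 0.1]
z0 = [1.0, 0.0]
target = [0.0, 0.0]
controls, sampled_costs = mppi_step(A, B, z0, [0.0] * 10, target)
```

Because each rollout step is a single matrix-vector product, the per-sample cost of propagation stays small, which is the kind of saving the abstract attributes to replacing the nonlinear dynamics.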

Published: March 05, 2026

Last updated: July 13, 2026

LoRA-Based Cascaded Multimodal Fusion for Action Recognition in Medical Training Environments

Divya Mereddy, Jeevan Beedareddy (cs.CV, cs.AI)

This paper presents a cascaded Low-Rank Adaptation (LoRA)-based multimodal fusion framework for action and activity recognition in healthcare-oriented training environments. The proposed architecture combines parameter-efficient modality-specific adaptation with sequential fusion, enabling modalities to be integrated in stages without retraining previously learned components. Rather than assuming a fixed fusion structure, the framework first integrates more closely related modalities and then incorporates additional heterogeneous modalities, supporting scalable adaptation across datasets with different modality sets. We evaluate the framework on two healthcare-oriented training environment datasets: NurViD and the Nurse Training dataset. Across these datasets, preliminary results suggest that the proposed cascaded fusion strategy improves over individual modality models and provides competitive performance relative to previously reported dataset-specific baselines. Overall, these findings indicate that cascaded LoRA-based fusion is a promising parameter-efficient approach for integrating heterogeneous modalities in medical training action and activity recognition tasks. GitHub: https://github.com/anonymous0-ai/LoRA-Based-Cascaded-Multimodal-Fusion-.git

Published: July 13, 2026

Last updated: July 13, 2026

HASTE: A Platform for Rapid Post-Disaster Building Damage Assessment

Caleb Robinson, Anthony Ortiz, Simone Fobi Nsutezo, Cameron Birge, Meygha Machado, Marcelo Duarte, Joaquin Rivero Rodriguez, Anthony Cintron Roman, Kevin White, Inbal Becker-Reshef, Juan M. Lavista Ferres (cs.CV)

When a large disaster strikes, responders need a map of which buildings are damaged within hours. The models that do well on public benchmarks assume matched before-and-after imagery and a training set drawn from similar past events, and neither is usually available for a new disaster in its first day. We present HASTE (High-speed Assessment and Satellite Tracking for Emergencies), a no-code web platform that lets analysts who are not machine learning engineers produce per-building damage maps from post-disaster satellite imagery. HASTE implements two methods that share one interface. The first requires the user to label polygons over the post-disaster scene, trains a small semantic segmentation model on that single scene, runs it over the whole image, and joins the per-pixel output to existing building footprints. The second embeds every footprint with a pretrained vision model, requires the user to label a handful of buildings, and fits a logistic regression in the browser that scores the rest of the scene in seconds. We describe the platform, both methods, and the engineering that supports them. We also report preliminary experiments on xBD showing that foundation-model embeddings pooled over footprints separate damaged from intact buildings using post-disaster imagery alone, matching a fully supervised ResNet-50 baseline with a twentieth of its labels. HASTE and its predecessors have supported more than thirty real-world disaster responses since 2023, spanning earthquakes, hurricanes, cyclones, floods, wildfires, and tornadoes, delivering results to humanitarian partners within hours to days of imagery becoming available. We close with the directions we think are most promising, including vision-language assessment, active learning, and damage models for roads and other infrastructure. HASTE is open source at https://github.com/microsoft/haste.
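The second method (a logistic regression fitted on a handful of labeled footprint embeddings) can be sketched as follows. This is a toy illustration only: the 2-D vectors stand in for real pretrained embeddings, and the function names, learning rate, and step count are assumptions rather than HASTE's actual in-browser implementation.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def fit_logistic(embeddings, labels, lr=0.5, steps=200):
    """Fit weights and bias by batch gradient descent on the logistic loss."""
    dim = len(embeddings[0])
    w, b = [0.0] * dim, 0.0
    n = len(embeddings)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(embeddings, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i] / n
            grad_b += err / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def damage_score(w, b, embedding):
    """Probability-like score that a building is damaged."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, embedding)) + b)


# Toy 2-D "embeddings"; real footprint embeddings would be much higher-dimensional.
labeled = [[1.0, 0.8], [0.9, 1.2], [-1.0, -0.7], [-1.1, -0.9]]
labels = [1, 1, 0, 0]  # 1 = damaged, 0 = intact
w, b = fit_logistic(labeled, labels)
```

Once `w` and `b` are fitted, scoring every remaining footprint is one dot product per building, which is consistent with the abstract's claim that the rest of the scene can be scored in seconds.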

Published: July 13, 2026

Last updated: July 13, 2026

Cycle-World: Mitigating Error Accumulation in Long-term Video World Models via Reverse-Prediction Cycle Consistency

Zihan Su, Teng Hu, Jiangning Zhang, Ruiyan Wang, Ran Yi, Lizhuang Ma, Dacheng Tao (cs.CV)

Autoregressive diffusion models have enabled high-quality video generation, yet their sequential nature inherently suffers from error accumulation. In long-horizon video synthesis, minor prediction deviations compound over time, inevitably leading to unconstrained generative drift, structural collapse, and severe visual degradation. To address this, we propose Cycle-World, a novel framework designed for stable and temporally consistent long-video generation. Our approach tackles error drift by enforcing strict temporal reversibility across both the training and inference phases. Theoretically, we demonstrate that forward generative drift can be strictly bottlenecked by a cycle-consistency objective. During training, we integrate an efficient reverse-prediction model to implicitly embed causal constraints into the forward generator, compelling it to produce reversible sequences that tightly adhere to the natural video manifold. At inference time, we repurpose this frozen reverse model as a runtime corrector. Through gradient-based cycle guidance, it iteratively refines the generated latent representations, actively suppressing accumulated errors before they are committed to the historical context. Extensive experiments on the VBench benchmark demonstrate that Cycle-World's dual-phase synergy significantly mitigates error drift, achieving state-of-the-art overall generation quality and long-horizon temporal consistency in 60-second synthesis.

Published: July 13, 2026

Last updated: July 13, 2026

MicroCharNet: Less is More for License Plate Character Detection

Huy Che, Dinh-Duy Phan, Duc-Lung Vu (cs.CV)

License plate character detection is a crucial component of intelligent transportation systems, where high accuracy and computational efficiency are required for real-time deployment. Although recent deep learning-based methods have substantially improved detection performance, many high-accuracy models rely on large-scale architectures that incur substantial computational overhead, limiting their applicability to resource-constrained devices. In this paper, we propose MicroCharNet, an ultra-lightweight model specifically designed for license plate character detection. The proposed architecture employs a compact backbone composed of C2f blocks, integrated with a CoordAtt module to enhance feature extraction while preserving spatial information. A lightweight C3k2-based neck fuses multi-level features, followed by a single-level anchor-free detection head that enables end-to-end prediction. Experiments conducted on the UFPR-ALPR dataset demonstrate that MicroCharNet achieves competitive detection accuracy with only 0.08M parameters and 0.096 GFLOPs, while outperforming several recent YOLO-based baselines. Hardware-level evaluations further confirm its efficiency for real-time deployment on edge devices. These results indicate that carefully designed ultra-lightweight architectures can effectively balance accuracy and efficiency in license plate character detection. The source code is available at https://github.com/chequanghuy/MicroCharNet.

Published: July 13, 2026

Last updated: July 13, 2026

Transformer-Guided Swarm Intelligence for Frugal Neural Architecture Search

Romain Amigon (cs.LG, cs.AI, cs.NE)

Neural Architecture Search (NAS) has automated the design of deep learning models but traditionally requires massive computational resources, often measured in thousands of GPU-days. In this paper, we propose a frugal and memetic NAS framework designed to democratize architecture design on consumer-grade hardware. Our approach combines the global macro-search capabilities of an autoregressive Transformer controller, trained via Reinforcement Learning (RL), with the local micro-exploitation of an Artificial Bee Colony (ABC) algorithm. To prevent premature convergence during the RL phase, we introduce a dynamic entropy mechanism that forces topological exploration upon detection of performance stagnation. Evaluated on a standard GPU (NVIDIA RTX 3060), our hybrid method effectively resolves the "cold-start" problem inherent in metaheuristics. By algorithmically penalizing network depth, our framework actively mitigates model bloat: on the CIFAR-10 dataset, it discovers an efficient architecture reaching 84.85

Published: July 13, 2026

Last updated: July 13, 2026

Active Noise Floor Estimation for Reliability-Optimal POMDPs: A Value-of-Noise-Information Approach

Hyung-Jin Yoon (eess.SY, cs.RO)

Finite Reliability Representations (FRR) certify when a cell-constant policy is sufficient for reliable decision-making in a partially observed system with a known physical noise floor. In practice, however, sensing and execution noise can be latent and context-dependent. This paper develops a certificate-aware active disambiguation framework for an unknown physical noise parameter theta = (sigma_y, sigma_u), with the sensor-only case obtained by fixing sigma_u. We define the Value of Noise Information (VoNI) as the expected excess FRR certificate gap caused by using a reliability cover calibrated to the current estimate rather than to the realized noise parameter. We bound VoNI using action-value model mismatch and FRR radius inflation, showing that noise estimation has low decision value in sub-crossover regimes where the FRR certificate is insensitive to theta, but becomes valuable when posterior uncertainty can invalidate the current cover. A bi-level decision maker uses a posterior over theta, obtained from innovation statistics, execution residuals, or another online estimator, and triggers diagnostic probing only when uncertainty threatens the FRR certificate. We also interpret VoNI as a tractable, certificate-aware approximation to a high-level finite POMDP for latent sensing-execution regime disambiguation. Under stationary, identifiable, and persistently exciting regimes, we establish posterior consistency and convergence of the induced policy loss to the FRR approximation floor. Closed-loop UGV simulations with EKF-based innovation residuals show earlier detection of abrupt sensing-noise jumps, lower drift-tracking error, and substantially fewer probing actions than posterior-entropy exploration over 50 Monte Carlo trials.

Published: July 13, 2026

Last updated: July 13, 2026

Representing the Non-dominated Set of Multi-objective Network Problems by Supported Non-dominated Points

David Könen, Lara Löhken, Michael Stiglmayr (cs.DM, cs.NE, math.OC)

In multi-objective combinatorial optimization, unsupported non-dominated points typically outnumber supported points and are often significantly more challenging to compute. Recent studies show that extreme supported non-dominated points provide high-quality representations of the non-dominated set for certain binary problems. We demonstrate that this observation does not generalize to capacitated network optimization problems: representation quality decreases with increasing arc capacities, whereas supported non-dominated points consistently provide high-quality representations with respect to several quality indicators. However, supported point sets may still be too large in practical applications, where only a small, fixed number of alternatives is typically desired. Selecting fixed-size representations from the non-dominated set requires its computationally expensive generation and thus diminishes the computational advantages that representations are intended to provide. We therefore suggest the (extreme) supported points as alternative candidate sets in subset selection problems. Our numerical results show that restricting the candidate set to supported non-dominated points yields fixed-size representations of nearly the same quality as those selected from the complete non-dominated set. Overall, supported non-dominated points serve both as high-quality representations and as reasonable candidate sets for subset selection.

Published: July 13, 2026

Last updated: July 13, 2026

MM-ToolSandBox: A Unified Framework for Evaluating Visual Tool-Calling Agents

Kaixin Ma, Di Feng, Alexander Metz, Jiarui Lu, Eshan Verma, Afshin Dehghan (cs.CV, cs.AI)

We introduce MM-ToolSandBox, a benchmark and evaluation framework for visually grounded tool-calling agents. The framework provides a stateful execution environment spanning 500+ tools across 16 application domains, supporting multi-image, multi-turn tasks where agents must ground progressively arriving visual inputs into executable tool calls while handling realistic conversational phenomena (goal revisions, error corrections, state mutations). An automated scenario generation pipeline produces diverse, visually grounded scenarios through information-flow-guided planning and multi-stage quality filtering, yielding 258 human-verified nominal scenarios and 50 variants targeting interactive UI applications. Evaluating 12 state-of-the-art models, from 4B open-weight to frontier proprietary systems, shows that current models still lack robust visual tool-calling capability: even the best model achieves below 50% success rate. Our failure analysis further reveals that visual precision, not only planning, is a primary bottleneck for capable models: 53% of failures stem from incorrect information extraction from images despite otherwise correct task workflows. A planning-to-precision crossover emerges with scale: smaller models fail at deciding what to do, while larger models fail at perceiving what they see, suggesting fundamentally different research directions for improving models at different capability levels. The framework and the benchmark are publicly available at https://github.com/apple/ml-mmtoolsandbox

Published: July 13, 2026

Last updated: July 13, 2026

Relaxing Faithfulness with Intervention-Only Causal Discovery

Bijan Mazaheri, Jiaqi Zhang, Caroline Uhler (cs.LG, stat.ML)

Causal discovery algorithms learn a network that describes the causal dependencies among random variables. A common workflow involves first utilizing conditional independence properties on observational data to determine partially directed causal relationships, then applying interventions to orient the unknown causal directions. A critical assumption for the first step is faithfulness: a requirement that causally linked variables exhibit statistical dependence. Many natural systems include buffering and stabilizing pathways that cancel out to achieve systemic robustness. This cancellation of pathways violates faithfulness, leading causal discovery algorithms to incorrectly remove causal dependencies. In this paper, we argue that hard interventions contain information about the presence/absence of causal linkage that is overlooked in the first stage of structure discovery. We show that a mild assumption -- called intervention-immediacy faithfulness -- that allows cancellations, is sufficient to nonparametrically identify causal structures with hard interventions. These results position interventions as the primary carriers of information about causal structure, which should take precedence over conditional independence testing. To flip the paradigm, we also specify equivalence classes when the identification criteria are not met due to limitations in the scope of interventions.

Published: July 13, 2026

Last updated: July 13, 2026

Introducing Human-Centeredness in AI-Assisted Lexicography

Antonio San Martin, Catherine Trekker (cs.CL, cs.AI)

This paper proposes a human-centered artificial intelligence (HCAI) framework for AI-assisted lexicography. While generative AI offers significant opportunities to enhance lexicographic work, it also raises concerns regarding the future role of lexicographers and the preservation of linguistic and cultural diversity. Drawing on HCAI principles and previous applications in other language professions, the paper identifies four interrelated dimensions through which AI integration in lexicography can be understood and critically examined: the augmented lexicographer, the sociotechnical context of AI integration, bias, and the design of AI-powered lexicographic tools. The framework argues that AI should augment rather than replace lexicographers, combining high levels of automation with meaningful human control. It further emphasizes the importance of preserving professional agency, mitigating AI-generated biases, and designing tools around the needs of lexicographers. By doing so, the paper provides a foundation for future research and the beneficial integration of AI into lexicographic workflows.

Published: July 13, 2026

Last updated: July 13, 2026

Robust Bayesian Decision Making under Adversarial Uncertainty

Haripriya Harikumar, Sammie Katt, Yasir Zubayr Barlas, Samuel Kaski (cs.LG)

Scientific experiments are often designed to maximize information gain, yet in many applications the primary objective is to support reliable downstream decision-making. Existing decision-aware experimental design and active learning methods typically assume well-specified outcome models and implicitly rely on the stability of the optimal decision under real-world perturbations. In practice, however, experimental outcomes are frequently influenced by hidden or weakly modeled effects, which can substantially alter decision optimality and lead to misleading conclusions. We study sequential adversarially robust decision-aware experimental design, where data acquisition has to take into account information gain against plausible worst-case unexpected effects, modeled here as variation in adversarial variables. Building on Bayesian decision theory, we formalize an adversarially robust optimal decision under this setting and derive a principled Bayesian experimental design criterion. The criterion explicitly targets decision stability rather than nominal optimality. Experiments on synthetic and real-world scientific datasets show that conventional decision-aware design can converge rapidly to high confidence yet fragile decisions, while our robustness-aware approach yields decisions that are significantly more stable and reliable under adversarial variation.

Published: July 09, 2026

Last updated: July 13, 2026

Polylogarithmic-Weight Dicke States in QAC^0 and Arbitrary Symmetric States in QAC^0_f

Lucas Gretta, Meghal Gupta, Malvika Raj Joshi (quant-ph, cs.DS)

An n-qubit Dicke state of weight k is the uniform superposition over all n-bit strings of Hamming weight k. Dicke states are central to quantum algorithms exhibiting speedups, such as Decoded Quantum Interferometry (Jordan et al., Nature, 2025). In the NISQ era, quantum hardware is constrained by both depth and locality, motivating the question of which global operations suffice to prepare such states. QAC^0, the quantum analogue of AC^0, minimally extends local O(1)-depth quantum circuits by allowing arbitrary-width Toffoli (reversible AND) gates. We show that Dicke states of polylog(n) weight can be prepared in QAC^0. This gives the first QAC^0 construction of any super-constant-weight n-qubit Dicke state, since previous constructions relied on the much more powerful FANOUT_n gate. In general, we show that any weight-k Dicke state can be constructed using FANOUT_min(k,n-k) gates. Combined with recent hardness results, this yields a tight characterization: for k ≤ n/2, an n-qubit weight-k Dicke state can be prepared in QAC^0 if and only if FANOUT_k ∈ QAC^0. We develop a limited-fanout state-synthesis toolkit for QAC^0 that yields further constant-depth, poly(n)-ancilla constructions: 1. Every n-qubit symmetric state supported on Hamming weight ≤ k can be prepared using FANOUT_k gates. 2. Every O(log n)-qubit state can be prepared using quantum random-access memory (QRAM_n), which refers to a coherent indexing gate. QRAM_n is a potentially weaker resource than FANOUT_n and can be implemented in QAC^0_f.
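The definition of a Dicke state can be made concrete with a short classical sketch that writes out the amplitude vector. It only illustrates the target state (with an exponentially large vector) and says nothing about the constant-depth circuit constructions; the function name `dicke_state` is chosen here for illustration.

```python
import math
from itertools import combinations


def dicke_state(n, k):
    """Return the 2**n amplitudes of the n-qubit Dicke state of weight k."""
    amplitude = 1.0 / math.sqrt(math.comb(n, k))
    state = [0.0] * (2 ** n)
    # Every choice of k qubit positions set to 1 gets the same amplitude.
    for ones in combinations(range(n), k):
        index = sum(1 << position for position in ones)
        state[index] = amplitude
    return state


state = dicke_state(4, 2)
# Bit strings carrying nonzero amplitude: exactly those of Hamming weight 2.
support = [format(i, "04b") for i, a in enumerate(state) if a != 0.0]
```

For n = 4 and k = 2 the support has C(4, 2) = 6 basis states, each with amplitude 1/sqrt(6).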

Published: April 16, 2026

Last updated: July 13, 2026

FAST: A Framework for Aligned Sampling and Training in Parallel Reinforcement Learning for Autonomous Driving

Bonan Wang, Letian Tao, Bin Shuai, Jiaxin Gao, Wenxin Zhao, Wei Xiong, Kehua Sheng, Bo Zhang, Yang Guan, Shengbo Eben Li (cs.LG, cs.AI)

Deep reinforcement learning is pivotal for closed-loop autonomous driving yet remains constrained by severe bottlenecks in sampling efficiency. Standard parallel sampling mitigates this but suffers from the straggler effect, where the premature termination of a single environment necessitates a synchronized batch re-initialization, leading to suboptimal sample utilization and prohibitive re-initialization latency. To address this, we propose FAST, a synchronous parallel framework tailored for closed-loop simulation. Specifically, FAST employs Dynamic Parallel Sampling Alignment (DPSA) to maintain vectorization synchronization by extending terminated episodes via virtual continuation, thereby decoupling the sampling loop from individual terminations. By dynamically triggering global truncation based on the termination rate of parallel clips, FAST effectively eliminates the bottleneck of premature resets without sacrificing data diversity. Furthermore, to strictly preserve theoretical consistency, we incorporate a Scaled Mask-Padding Optimization (SMPO) that leverages validity masking and adaptive loss normalization to nullify the bias from auxiliary padding data. Empirical evaluations demonstrate that FAST achieves at least a 1.78 times wall-clock speedup over the single-clip baseline while preserving statistical unbiasedness.
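The role of validity masking with loss normalization can be illustrated with a generic masked mean. The abstract does not give SMPO's exact normalization, so the sketch below is an assumption: it simply averages over valid steps so that padded steps from virtually continued episodes do not contribute.

```python
def masked_mean_loss(losses, valid_mask):
    """Average per-step losses over valid steps only, ignoring padded steps."""
    valid_count = sum(valid_mask)
    if valid_count == 0:
        return 0.0  # no real data in this batch, so contribute nothing
    return sum(l * m for l, m in zip(losses, valid_mask)) / valid_count


# Two real transitions followed by two padded ones from a terminated episode.
step_losses = [1.0, 2.0, 100.0, 100.0]
mask = [1, 1, 0, 0]
loss = masked_mean_loss(step_losses, mask)        # uses only the two valid steps
naive = sum(step_losses) / len(step_losses)       # biased by the padding
```

Dividing by the number of valid steps rather than the batch length is what keeps the estimate independent of how much padding a batch happens to contain.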

Published: June 19, 2026

Last updated: July 13, 2026

Agent Step Value: Auditing Evaluator-Channel Reversals in Black-Box Agent Traces

Andrew Zhang, Chengzhan Li (cs.AI)

When evaluator-derived step rewards are pooled or compared across scoring channels, their sign is treated as transportable. Yet the same frozen transition can change sign with the scoring channel. Process rewards vary agent states, while evaluator audits vary scoring configurations; neither first difference isolates their interaction. We define Agent Step Value (ASV) as channel-indexed target-margin gain and identify the missing state-by-channel interaction by replaying complete faces. Of 1,100 PubMed open question-answering transitions, 1,004 were complete across four cyclic layouts. Their mean update is positive under direct scoring (+0.163 [0.102, 0.218]) and negative with an externally generated view (-0.160 [-0.244, -0.079]), giving an interaction of -0.323 [-0.418, -0.232]. Across paired transitions, 507/1,004 cross zero in one direction or the other. Matched templates isolate a generated-minus-quote interaction of -1.121 [-1.534, -0.703]. Its direction remains negative after the readout and evaluator-stack changes, and quote yields a higher paired area under the receiver operating characteristic curve (AUC) for stored success at all three bridge vertices. Because the two views preserve the same task information conditional on the retained state, a representation-invariance null predicts equal responses and success rankings across them. ASV provides a transport audit for evaluator-derived step measurements; causal actor credit lies outside its estimand.

Published: July 05, 2026

Last updated: July 13, 2026

Exact Dynamics of Multi-class Stochastic Gradient Descent

Elizabeth Collins-Woodfin, Inbar Seroussi (stat.ML, cs.LG, math.OC, math.PR)

We develop a framework for analyzing the learning dynamics of high-dimensional problems trained using one-pass stochastic gradient descent (SGD) with data from multiple anisotropic classes. Our main theorem provides exact expressions for quantities of interest, including the risk and the overlap with the true signal, in terms of a deterministic system of ODEs, valid in the high-dimensional limit. The theorem holds for a broad class of optimization problems and extends to settings where the number of classes grows with dimension. To illustrate its utility, we investigate in detail the effect of the data's anisotropic structure on the problems of binary logistic regression and least-squares (LS) loss. We study the LS loss in a linear multiclass setup and derive a learning-rate threshold that depends on the average eigenvalue of the covariance matrices. For binary logistic regression, we study three cases: isotropic covariances, data covariance matrices with a large fraction of zero eigenvalues (denoted as the zero-one model), and covariance matrices with power-law spectra. We show that a structural phase transition occurs. In particular, for the zero-one model and the power-law model with sufficiently large power, SGD aligns more closely with values of the class mean that are projected onto the "clean directions" (i.e., directions of smaller variance). This is supported by analytical studies and numerical simulations, which show the exact asymptotic behavior of the loss in the high-dimensional limit. The effects of data anisotropy that we demonstrate are likely to hold beyond these examples and illustrate one application of the broader theorem that we prove.

Published: October 15, 2025

Last updated: July 13, 2026

Encoder-Side Neuron Identification and Amplification for Acoustic Perception in Large Audio-Language Models

Yu-Han Huang, Chih-Kai Yang, Ke-Han Lu, An-Yu Cheng, Hung-yi Lee (cs.SD, cs.AI)

Large audio-language models (LALMs) often underperform on fine-grained, non-semantic attributes of speech, such as a speaker's emotion, despite strong performance on speech content. Improving this without the cost of retraining calls for an effective inference-time intervention, yet most existing methods intervene only after the audio encoder and operate at a relatively coarse granularity. The encoder itself, where acoustic information is first extracted from the waveform, remains largely unexplored, especially at the level of individual neurons. We introduce IAAN, Identifying and Amplifying Acoustic Neurons, a training-free and label-free method that scores each feed-forward neuron in the audio encoder by contrasting its activation on the real waveform with that on a noise reference lacking the real audio's acoustic information. IAAN then amplifies a small set of the highest-scoring neurons at inference. Across ten non-semantic speech attributes, IAAN improves average accuracy by 25.7 points on Audio-Flamingo-3, 21.4 on Qwen2.5-Omni, and 9.7 on Kimi-Audio. It also improves a model already explicitly fine-tuned to prioritize acoustic evidence. In controlled comparisons, both the encoder locus and neuron-level selectivity prove necessary for this gain. Intervening after the encoder, at the decoding side or inside the language model, yields little to no improvement, or even deteriorates accuracy. The improvement also depends on which specific neurons are amplified, not merely on their number, confirming that IAAN's acoustic score succeeds in identifying the neurons that matter. These results show that a small, precisely targeted intervention inside the audio encoder is an effective and largely untapped way to strengthen the acoustic understanding of LALMs, opening a new direction for inference-time methods that improve acoustic perception through neuron-level access to the encoder.
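The score-then-amplify idea can be sketched on toy activations. The abstract does not state the exact scoring rule, so the absolute difference of mean activations used here, along with the function names, `top_k`, and `gain`, are assumptions for illustration rather than IAAN's actual formula.

```python
def acoustic_scores(real_acts, noise_acts):
    """Score each neuron by the gap between its mean activation on real audio and on noise."""
    num_neurons = len(real_acts[0])
    scores = []
    for j in range(num_neurons):
        real_mean = sum(frame[j] for frame in real_acts) / len(real_acts)
        noise_mean = sum(frame[j] for frame in noise_acts) / len(noise_acts)
        scores.append(abs(real_mean - noise_mean))
    return scores


def amplify_top_neurons(activations, scores, top_k=1, gain=2.0):
    """Scale the top_k highest-scoring neurons by gain, leaving the others unchanged."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    top = set(ranked[:top_k])
    return [[a * gain if j in top else a for j, a in enumerate(frame)]
            for frame in activations]


# Rows are time frames, columns are feed-forward neurons (toy values).
real = [[0.1, 2.0, 0.5], [0.2, 2.2, 0.4]]
noise = [[0.1, 0.3, 0.5], [0.2, 0.1, 0.5]]
scores = acoustic_scores(real, noise)
boosted = amplify_top_neurons(real, scores, top_k=1, gain=2.0)
```

In this toy case only the middle neuron responds differently to real audio than to noise, so it is the one selected and scaled. No labels or training are involved, matching the training-free, label-free setting described above.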

Published: July 13, 2026

Last updated: July 13, 2026

StoryTeller: Training-Free Narrative Grounding for Long-Form Audio Description

Seung Hyun Hahm, Minh T. Dinh, SouYoung Jin (cs.CV, cs.AI)

Long-form audio description (AD) requires more than describing visible actions: it must preserve characters, events, relationships, and story context across scenes so that blind and low-vision (BLV) audiences can follow a film. Modern video-language models (VLMs) are effective on short clips, but they often treat each moment independently, producing descriptions that miss who characters are, why events matter, and how the current scene connects to earlier narrative context. We propose StoryTeller, a training-free framework for story-aware long-form AD. Instead of relying only on local visual cues, StoryTeller maintains a verified narrative memory that carries forward story-relevant information across scenes, enabling later descriptions to remain coherent, grounded, and contextually informative. Given only raw video and a movie title, StoryTeller can optionally retrieve public movie metadata to resolve names and story context, while accepting only facts that are supported by the video through semantic filtering and VLM verification. The method requires no subtitles, scripts, AD transcripts, aligned captions, character banks, precomputed face identities, or task-specific fine-tuning. To evaluate whether generated AD preserves narrative information, we introduce StoryAD-QA, a question-answering benchmark that tests whether a language model can answer story-context questions using only the generated descriptions. Experiments on standard AD benchmarks and diverse long-form videos show that StoryTeller consistently improves narrative coherence, factual grounding, and story comprehension over strong baselines in automatic, QA-based, and human evaluations.

Published: July 13, 2026

Last updated: July 13, 2026

An Exact Instrument for State Usage in Selective State-Space Models, and the Input-Driven Migration It Reveals

Raktim Bhattacharya (cs.LG)

Selective state-space models such as Mamba route information through a bank of first-order modes whose input coupling is set by a learned selection mechanism. We give an exact instrument for measuring how a trained model uses these modes. Because the state matrix is diagonal, each channel's output decomposes exactly into per-mode contributions, and a per-(layer, channel, window) Gram tensor yields the exact output error of dropping any subset of modes, offline, at any budget. Validated against the reference implementation to a relative error of 2.3×10^-7 on the Mamba-1 family where it is exact, the instrument predicts a layer's deployed pruning error to a median relative deviation of 5×10^-7 over 4,464 configurations, its floor set by the reconstruction. Applying the instrument across the Mamba-1 family (130M–2.8B), the deployed 7B Falcon-Mamba, and Mamba-2, we find that trained models re-allocate their state space with the input: which modes carry the signal migrates across contexts, and at the most affected layers a per-input oracle roughly halves the output error of a fixed mode set. Frozen-signal counterfactuals attribute the migration primarily to the input-dependent write map B_t; the timestep usually identified with selectivity carries almost none of it. Input-scheduled mode pruning on this measurement outperforms static, Hankel-based, and layer-adaptive rankings at every scale from 130M to the deployed 7B Falcon-Mamba, and at half the state budget it matches the unpruned model. Because the scheduler reads each window's mode usage from a first pass, this demonstrates realizable headroom; we claim no deployed compute or memory saving.
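The claim that a Gram tensor gives the exact error of dropping any subset of modes can be checked on a toy diagonal recurrence. The sketch below assumes a single channel and fixed (input-independent) parameters `a`, `b`, `c`, whereas Mamba's write map and timestep depend on the input; the layer, channel, and window indexing of the paper is omitted, and all names are illustrative.

```python
from itertools import product


def mode_contributions(a, b, c, inputs):
    """Per-mode outputs of a diagonal SSM: h_n[t] = a_n h_n[t-1] + b_n x_t, y_n[t] = c_n h_n[t]."""
    h = [0.0] * len(a)
    rows = []
    for x in inputs:
        h = [an * hn + bn * x for an, hn, bn in zip(a, h, b)]
        rows.append([cn * hn for cn, hn in zip(c, h)])
    return rows  # rows[t][n]; the full output at time t is sum(rows[t])


def gram(rows):
    """G[m][n] = sum over time of contribution_m * contribution_n."""
    n_modes = len(rows[0])
    return [[sum(r[m] * r[n] for r in rows) for n in range(n_modes)]
            for m in range(n_modes)]


def drop_error(G, dropped):
    """Squared output error of removing the modes in `dropped`, read off the Gram matrix."""
    return sum(G[m][n] for m, n in product(dropped, repeat=2))


a, b, c = [0.9, 0.5, 0.1], [1.0, 0.5, 2.0], [1.0, -1.0, 0.3]
xs = [1.0, 0.0, -1.0, 0.5]
rows = mode_contributions(a, b, c, xs)
G = gram(rows)
dropped = [1, 2]
# Direct computation: the pruned output differs from the full one by the dropped contributions.
direct = sum(sum(r[n] for n in dropped) ** 2 for r in rows)
```

Because the output is a plain sum of per-mode contributions, the squared error of removing a subset is the sum of the corresponding Gram entries, so once `G` is stored any pruning budget can be evaluated offline without rerunning the model.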

Review

PDF

Published: July 13, 2026

Last updated: July 13, 2026

Casting Everything to Online API Services? A Survey of Integrating Localized Speech Recognition Models in Robotic Systems

Sheng Li, Jing Li, Felix Schijve, Jun Hu, Emilia Barakova (cs.RO, eess.AS)

Automatic speech recognition (ASR) has become a critical component of modern robotic systems because it is one of the most natural and intuitive ways for humans to interact with robots. A common approach is to call online API services directly. But is that all we can do? This article provides an overview of how ASR technologies are integrated into various intelligent robots and machines. We discuss the evolution of speech recognition from established approaches to state-of-the-art deep learning models, such as OpenAI's Whisper. We also list large-scale datasets and open-source toolkits that have been widely used in both industry and academia. We structure the survey around ASR model families, deployment strategies in robotics (especially ROS-based, cloud-based, and hybrid solutions), and several real-world robotic platforms. Finally, we outline the challenges of deploying robust speech recognition in robots and discuss future directions, including multimodal interaction in diverse and dynamic environments. This paper can help social robotics researchers better navigate the emerging domain of language-based natural human-robot interaction.

Review

PDF

Published: July 13, 2026

Last updated: July 13, 2026

GTASA: Ground Truth Annotations for Spatiotemporal Analysis, Evaluation and Training of Video Models

Nicolae Cudlenco, Mihai Masala, Marius Leordeanu (cs.CV)

Game engines hold what video models struggle to learn: a complete, explicit world state behind every frame. We turn one into a data instrument. GEST-Engine, our production-grade open-source system, deterministically executes Graphs of Events in Space and Time (GESTs), whether procedurally generated or derived from text, into videos of synchronized multi-actor scenarios, recording ground truth as it renders: 3D entity and camera state, pairwise spatial relations, event-to-frame mappings, instance segmentation, and long descriptions, at zero marginal annotation cost. With it we release GTASA, a 938-video sample of what the system can generate at arbitrary scale, carrying, to our knowledge, the densest spatial-relation coverage of any video dataset: a complete entity-pair relation graph at every frame, ~84x denser than the state of the art, frame-for-frame. We validate GTASA both qualitatively, through human evaluation of physical validity and semantic alignment where frontier neural generators, given the same prompts, largely fail, and quantitatively, with GTASA pretraining improving VLM video captioning. Probing six frozen video encoders across 11 spatio-temporal tasks enabled by GTASA's exact 3D ground truth, a previously untestable inter-entity relational probe of frozen video features, reveals that who-is-near-whom barely rises above chance for all of them. We release the engine, the corpus, and the benchmark, making this gap a measurable, trainable target.

Review

PDF

Published: April 12, 2026

Last updated: July 13, 2026

Forgetting Our Way to Shared Meaning: Effects of Forgetting on Conceptual Alignment in a Non-Partnership Coordination Game

Landon Liu, Mary Kelly, Alan Tsang (cs.MA, cs.CL, cs.GT, cs.HC)

Shared meaning in language requires people to learn and agree on categories. We ask how characteristics of agents' memories change the emergence and evolution of shared meaning. Without a coordination game, models of conceptual semantics cannot explain how shared meaning emerges and changes in groups of people; however, existing games assume that players share payoffs in a partnership setting. We model conceptual alignment as a non-partnership game and illustrate differences in actual and perceived conceptual convergence from counterfactual simulations using agents with varying levels of adaptiveness and memory degradation. We found that adaptive players achieved actual convergence faster and had closer final conceptual regions than non-adaptive players, while non-adaptive players perceived convergence earlier. Weighing novel information less over time resulted in more stable agreements than fixing the weight of novel information. Memory features are critical to the emergence and evolution of actual and perceived convergence.

Review

PDF

Published: July 13, 2026

Last updated: July 13, 2026

AgenticFocus: Object-Preserving Mixed Reality Synthesis from Human FPV Video for Dexterous Humanoid Learning

Iaroslav Kolomiets, Miguel Altamirano Cabrera, Artem Lykov, Jeffrin Sam, Dmitrii Iarchuk, Yara Mahmoud, Daniia Zinniatullina, Mikhail Konenkov, Dzmitry Tsetserukou (cs.RO)

Human egocentric video is a scalable supervision source for humanoid policy learning, but current pipelines struggle with hand-object occlusion, oversimplified motion, or specialized capture hardware. We introduce AgenticFocus, a Mixed Reality synthesis pipeline that converts ordinary first-person-view human videos into robot-trainable demonstrations by restoring occluded object geometry, reconstructing full-hand motion, and retargeting it to a humanoid embodiment through camera-relative alignment and layered compositing. The resulting dataset pairs focused visual observations with synchronized robot actions and states. AgenticFocus achieves lower trajectory error and smoother wrist motion than cross-embodiment baselines, with SPARC scores of -5.18 versus -5.56 and -6.05.

Review

PDF

Published: July 09, 2026

Last updated: July 13, 2026

MIRA: A Modular Open-Source Micro-UAV for Indoor Research

Lucas K. de Oliveira, Felipe A. G. Tommaselli, João Aires Marsicano, Marco S. Tayar, Pedro A. R. Saraiva, Ricardo V. Godoy, Marcelo Becker (cs.RO, eess.SY)

Indoor robotics research increasingly relies on micro-UAVs whose airframe, electronics, and control software are fully open to modification. Off-the-shelf platforms rarely expose the low-level access required for such modifications, while building a custom alternative typically requires substantial engineering effort before flight testing can begin, leaving many laboratories to work within constraints that limit the scope of their research. We present MIRA (Modular Indoor Research Architecture), a low-cost, open-source micro-UAV for indoor research built around a replicable 3D-printed PLA airframe and a containerized low-level software package managing the companion-to-autopilot communication bridge via Micro XRCE-DDS. Designed as a white-box architecture, core subsystems are individually replaceable without firmware refactoring, supporting local fabrication and component substitution from existing lab inventory. We characterize MIRA through manual flight in position-control mode within an optical motion-capture volume, where the communication pipeline sustains a median companion-to-autopilot latency of 0.02 ms and power spectral density analysis confirms the structural vibration energy stays concentrated in a narrow 90 to 110 Hz band, isolated from the sub-20 Hz control bandwidth and within the autopilot safety thresholds.
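The band-concentration check mentioned in this abstract can be sketched with a plain FFT periodogram. Everything below is an assumption for illustration: the signal is synthetic (a weak 5 Hz component standing in for control-bandwidth motion and a strong 100 Hz component standing in for structural vibration), and the sampling rate and band edges are chosen for the example rather than taken from MIRA's flight logs.

```python
import numpy as np

fs = 1000.0                          # assumed sampling rate in Hz
t = np.arange(0, 2.0, 1 / fs)        # two seconds of samples

# Synthetic accelerometer-like signal: slow motion plus a vibration tone.
signal = 0.2 * np.sin(2 * np.pi * 5 * t) + 1.0 * np.sin(2 * np.pi * 100 * t)

# One-sided periodogram: squared magnitude of the real FFT.
freqs = np.fft.rfftfreq(signal.size, d=1 / fs)
power = np.abs(np.fft.rfft(signal)) ** 2

# Fraction of total spectral energy that falls inside the 90 to 110 Hz band.
band = (freqs >= 90) & (freqs <= 110)
band_fraction = power[band].sum() / power.sum()

# Frequency of the strongest spectral peak.
peak_frequency = freqs[np.argmax(power)]
```

On real data a windowed, averaged estimate (for example Welch's method) would usually be preferred over a single raw periodogram; the raw FFT is used here only to keep the sketch short.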

Review

PDF

Published: July 13, 2026

Last updated: July 13, 2026

How Temperature Shapes Ideological Discourse in Retrieval-Augmented Generation?

Elmira Salari, Hazem Amamou, José Victor de Souza, Shruti Kshirsagar, Maria Nunes Delfino, Anderson Avila (cs.CL)

Retrieval-Augmented Generation (RAG) has been increasingly adopted to reduce hallucinations and strengthen the factual grounding of large language models (LLMs). While robustness to errors in the retrieval process has been explored, the impact of ideological bias on LLM outputs has been overlooked. For instance, if the retrieved material contains ideological positions, the RAG system may transmit, amplify, or suppress such ideological discourses in its outputs. In this study, we address this issue by examining how a RAG framework whose knowledge source contains ideological discourses influences LLM-generated answers. To this end, we applied Lexical Multidimensional Analysis (LMDA) to a corpus of 1,117 COVID-19 treatment articles, identifying three ideological discourses. This corpus is then used as the external knowledge source for the RAG. We assessed several LLMs by having the models answer ideological questions at different sampling temperatures. The generated texts were assessed semantically and lexically based on their similarities with ideological reference texts. Our findings show that the RAG framework is prone to transferring ideological discourses into LLM responses, with sampling temperature having a measurable impact on the strength of this transfer. Discursive alignment between generated answers and the reference text is highest at moderate temperatures, where models balance stochasticity with retrieval grounding, and drops at low temperatures, indicating that overly deterministic sampling suppresses discourse transfer.
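For readers unfamiliar with the sampling temperature varied in this study, the sketch below shows the standard temperature-scaled softmax. It is a generic illustration, not the paper's experimental setup; the logits and the two temperature values are made up for the example.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    largest = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token logits for three candidate tokens.
logits = [2.0, 1.0, 0.0]
low = softmax_with_temperature(logits, 0.2)    # near-deterministic sampling
high = softmax_with_temperature(logits, 1.5)   # flatter, more stochastic sampling
```

Lower temperatures concentrate probability on the top-scoring token, while higher temperatures spread it across alternatives. This is the knob the abstract refers to when it contrasts overly deterministic sampling with moderate temperatures.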

Review

PDF

Published: July 13, 2026

Last updated: July 13, 2026

A Compact Top-Loading Robot for Endovascular Interventions: Design, Control and Evaluation

Jonas Fischer, Lennart Karstensen, Franziska Mathis-Ullrich (cs.RO)

Robot-assisted endovascular intervention can potentially reduce radiation exposure, improve surgeon ergonomics, enable telesurgery, support active assistance and autonomy, and enhance procedural precision. However, existing systems often suffer from limited procedural coverage because constrained patient-side setups, restricted flexibility, and complex instrument exchange hinder clinical workflow integration. This work presents a compact robotic system for endovascular interventions that enables continuous translational and rotational manipulation of standard endovascular instruments. The system consists of two alternating carts with pneumatically actuated membrane grippers integrated into rotating gripper gears. Its top-loading design allows rapid exchange of instruments such as guidewires and catheters without changing the robotic setup. A leader-follower control strategy enables continuous motion despite the finite stroke of each cart. The system was evaluated in motion-tracking experiments with guidewires and catheters and in an in vitro vascular phantom. The motion-tracking experiments showed generally smooth translational and rotational motion profiles. Across all tested guidewire and catheter experiments, the mean relative tracking errors were 3.6% for translational motion and 4.1% for rotational motion. In the vascular phantom, robot-assisted navigation reached the target in most trials, demonstrating the feasibility of the proposed manipulation concept under in vitro conditions. The presented robotic system demonstrates technical feasibility for continuous manipulation of standard endovascular instruments in bench-top and in vitro experiments. The compact top-loading design may ease instrument exchange and clinical workflow integration. Future work will focus on improving gripping performance, actuation speed, force feedback, and evaluation in more clinically realistic settings.

Review

PDF

Published: July 13, 2026

Last updated: July 13, 2026

MemoBench: Benchmarking World Modeling in Dynamically Changing Environments

Haoyu Chen, Kaichen Zhou, Hang Hua, Kaile Zhang, Jingwen Qian, Wufei Ma, Haonan Chen, Chunjiang Liu, Yizhou Zhao, Xiaoyuan Wang, Weiyue Li, Alan Yuille, Paul Pu Liang, Yilun Du (cs.CV)

Video generation models aspire to simulate dynamic environments, and several benchmarks now evaluate memory consistency across frames. However, most assess consistency only while the target remains in view, and the few that force objects out of view evaluate static scenes where nothing changes during occlusion. To bridge this gap, we introduce MemoBench, a diagnostic benchmark built around the disappear-and-reappear paradigm in dynamically changing environments: a target object undergoes a physical process, disappears from view, and must be correctly recovered in its updated state upon reappearance. We curate 360 ground-truth clips spanning synthetic and real-world scenes, and design an evaluation suite combining automated metrics with VQA-based assessment across four diagnostic pillars. Evaluation of eight state-of-the-art models reveals key insights and open challenges regarding memory consistency under the disappear-and-reappear paradigm.

Review

PDF

Published: June 25, 2026

Last updated: July 13, 2026

A Model-Free Universal AI

Yegon Kim, Juho Lee (cs.AI)

In general reinforcement learning, all established optimal agents, including AIXI, are model-based, explicitly maintaining and using environment models. This paper introduces Universal AI with Q-Induction (AIQI), the first model-free agent proven to be asymptotically ε-optimal in general RL. AIQI performs universal induction over distributional action-value functions, rather than over policies or environments as in previous work. Under a grain of truth condition, we prove that AIQI is strong asymptotically ε-optimal and asymptotically ε-Bayes-optimal. We also apply our novel proof techniques to show asymptotic ε-optimality of Self-AIXI without any ad-hoc assumptions. Our results significantly expand the diversity of known universal agents.

Review

PDF

Published: February 26, 2026

Last updated: July 13, 2026

Evaluating RE Practices for Explainability: Synthesizing Insights from Daimler Truck into an Explainable RE Framework Proposal

Umm-e- Habiba, Lucas Mauser, Jonas Fritzsch, Justus Bogner, Stefan Wagner (cs.SE, cs.AI)

Explainability has emerged as a critical requirement for AI-based systems, particularly in safety-critical and regulated domains. Although prior research has proposed frameworks, patterns, and user-centered approaches to support explainability, there is limited empirical understanding of how existing Requirements Engineering (RE) practices support explainability requirements across the RE lifecycle, especially in an industrial context. This paper reports early findings from an ongoing industry-based study investigating how explainability requirements are elicited, specified, and validated using established RE techniques. We conducted a multi-phase qualitative study with eight practitioners at Daimler Truck, employing think-aloud protocols and moderated group discussions across requirements elicitation, specification, and validation steps. Our preliminary analysis reveals recurring challenges across all steps, including conceptual ambiguity during elicitation, limited testability and expressiveness during specification, and fragmented validation due to vague criteria and regulatory uncertainty. These findings indicate that current RE practices provide limited support to systematically address explainability requirements. The paper contributes empirical insights into step-specific and cross-cutting challenges and outlines a research vision toward developing an empirically grounded RE framework for explainable AI-based systems.

Review

PDF

Published: July 13, 2026

Last updated: July 13, 2026

Bridging the Gap Between Latent and Explicit Reasoning with Looped Transformers

Ying Fan, Anej Svete, Kangwook Lee (cs.LG, cs.CL)

Language models typically reason via explicit chain-of-thought (CoT), generating intermediate steps token-by-token. Latent CoT offers an alternative: it performs multi-step reasoning in the model's hidden states, replacing decoded tokens with continuous representations for greater efficiency. However, existing latent CoT methods underperform explicit CoT beyond 1B parameters, and the gap widens with scale. Looped, or recurrent-depth, Transformers, which reuse their weights to increase computation depth without adding parameters, are a natural fit for latent reasoning. We therefore ask whether looped Transformers can bridge this gap. We answer affirmatively with a simple recipe: a looped padded Transformer that processes K latent blocks in parallel for R iterations, with a cross-entropy loss on each latent position's gold CoT-step token, similar to explicit CoT supervision. We instantiate it as LOTUS (Looped Transformers with parallel supervision on latents). LOTUS is, to our knowledge, the first latent-CoT method to bridge the gap to explicit CoT at the 3B scale, while cutting thought-phase latency by 2.5x-6.9x from compact math expressions to natural language. Projecting LOTUS's post-loop latents through the base LM head recovers the gold reasoning steps and even surfaces alternative valid intermediate steps, evidence that its latent space is interpretable and CoT-aligned. Ablations confirm that both the looped backbone and the parallel supervision on gold CoT tokens are essential. Code is available at https://github.com/yingfan-bot/lotus.

Review

PDF

Published: June 30, 2026

Last updated: July 13, 2026

A Multi-Model Metric-based Selection Framework for Abstractive Text Summarization

Ahmed Alansary, Ali Hamdi (cs.CL, cs.AI)

Automatic text summarization has become increasingly important due to the rapid growth of digital textual information. This paper presents a Multi-Model Summarization Framework designed to improve the robustness and quality of abstractive text summarization. Relying on a single model often leads to inconsistent summarization quality across articles with varying structures and topics. To address this limitation, the proposed framework integrates multiple fine-tuned transformer-based summarization models and introduces a metric-based selection mechanism. In this framework, each model independently generates a candidate summary for the same input article. The generated summaries are then evaluated using automatic evaluation metrics that capture both lexical similarity and semantic relevance. Based on these scores, the framework selects the highest-quality summary as the final output. The models are fine-tuned and evaluated on the widely used CNN/DailyMail news summarization dataset. Experimental results demonstrate that the proposed framework achieves the highest BERTScore among all compared methods with a score of 88.63%. It also outperforms several LLMs such as GPT3-D2, Falcon-7b, and Mpt-7b, highlighting its effectiveness and robustness. These findings highlight the effectiveness of leveraging multiple transformer-based models within a metric-based selection strategy to improve the quality and robustness of automatic text summarization systems.
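The select-the-best-candidate step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a simple unigram-overlap F1 stands in for the lexical and semantic metrics (the abstract names BERTScore for evaluation), candidates are scored against the source article, and the names `unigram_f1`, `select_best_summary`, `model_a`, and `model_b` are hypothetical.

```python
from collections import Counter

def unigram_f1(candidate, reference):
    """Toy lexical score: F1 over overlapping lowercase whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())       # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def select_best_summary(article, candidates, score_fn=unigram_f1):
    """Score every model's candidate and return the top-scoring model name with all scores."""
    scores = {name: score_fn(text, article) for name, text in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Made-up article and candidate summaries from two hypothetical models.
article = "the city council approved the new budget on monday after a long debate"
candidates = {
    "model_a": "council approved new budget monday",
    "model_b": "weather was sunny over the weekend",
}
best, scores = select_best_summary(article, candidates)
```

Passing the metric as `score_fn` keeps the selection logic independent of the scoring choice, so a semantic metric could be swapped in without touching the rest of the code.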

Review

PDF

Published: June 03, 2026

Last updated: July 13, 2026

From Expressivity to Sample Complexity: Narrow Teachers for Transformers via C-RASP

Michael Rizvi-Martel, Satwik Bhattamishra, Guillaume Rabusseau, Michael Hahn (cs.LG, cs.CL)

A theoretical understanding of Transformers is crucial to better understand the capacities and limitations of large language models (LLMs). There is much work analyzing the expressivity of attention-based models. By proposing handcrafted weights or using computational complexity arguments, many past theoretical works have sought to characterize which tasks are and which are not in the hypothesis class of Transformer models. However, little work investigates the learnability of such solutions. In this work, we make progress towards this goal. Inspired by recent loss landscape analysis work, we propose preliminary sample complexity bounds for learning C-RASP constructions with Transformers.

Review

PDF

Published: July 13, 2026

Last updated: July 13, 2026

From Global to Factor-Wise Expert Composition in Discrete Diffusion Models

Haozhe Huang, Yudong Xu, Abhijoy Mandal, Alán Aspuru-Guzik (cs.LG)

Discrete diffusion models offer a powerful framework for solving complex reasoning tasks, particularly through compositional generation, which combines multiple pre-trained experts to generalize beyond their individual training data. Recent theoretical corrections introduce time-dependent mixing weights to better align composed diffusion dynamics with the intended target. However, these methods are fundamentally limited by working on a per-sample basis, treating each generated state monolithically and ignoring the potential spatial or functional specializations of different experts. In this work, we address this limitation by proposing FactorDiff, a factor-wise composition framework for diffusion models. We posit that samples can be further decomposed into smaller factors, and propose a sampling process that dynamically routes each factor to the most relevant expert. We instantiate this framework with spatial/pixel-level compositions and validate it on the ARC-AGI benchmark, demonstrating that simple factor-specific routing consistently outperforms complex global scalar weighting schemes on tasks that require logical consistency and spatial disentanglement.

Review

PDF

Published: July 13, 2026

Last updated: July 13, 2026

Severity-Aware Curriculum Learning with Multi-Model Response Selection for Medical Text Generation

Ahmed Alansary, Molham Mohamed, Ali Hamdi (cs.AI, cs.CL)

Telehealth systems have become increasingly important for delivering accessible and timely medical information. Existing large language models often struggle to provide consistent and contextually appropriate medical responses across varying levels of case severity. This limitation highlights the need for models that can effectively adapt to the progressive complexity in medical queries. To address this challenge, we introduce a severity-aware multi-model framework that integrates a curriculum training strategy with relevance-based response selection. The proposed framework employs a three-stage curriculum learning strategy, where each model is trained sequentially on mild, moderate, and critical cases to progressively acquire domain knowledge. The approach uses five large language models, each trained independently under the same curriculum. During inference, all models generate candidate responses, and the response with the highest BERTScore is selected as the final output. The framework is trained and evaluated on the MAQA dataset, which provides annotated medical question-answer pairs. Experimental results evaluated using BERTScore demonstrate that the proposed method achieves superior performance compared to both baseline and fine-tuned models, attaining 86.71% in the baseline setting and 90.30% after fine-tuning. These results highlight the effectiveness of combining curriculum learning with multi-model response selection in improving response quality and relevance in medical text generation.

Review

PDF

Published: June 03, 2026

Last updated: July 13, 2026

Toward a Scientific Discovery Engine for Weather and Climate Data: A Visual Analytics Workbench for Embedding-Based Exploration

Nihanth W. Cherukuru, Matt Rehme, Kirsten J. Mayer, David John Gagne, John Schreck, John Clyne, Charlie Becker (physics.data-an, cs.AI, cs.CV, cs.IR)

Earth system science is producing increasingly large, high-dimensional datasets from both physics-based and AI-driven models. While embedding-based representations make these data searchable and serve as foundational building blocks for AI-driven discovery engines, nearest neighbors in latent spaces are not automatically scientifically meaningful. They may reflect real meteorological structures, or simply artifacts of preprocessing, geography, or model bias. Researchers therefore need visual tools to inspect latent space organization, trace search results back to physical evidence, and evaluate candidate representations against one another. We present an open source visual analytics workbench designed to support this provenance-aware scientific retrieval workflow. The system links distinct embedding experiments to shared source data, metadata, spatial contexts, and model configurations. It enables interactive retrieval strategy design by allowing users to issue image-level and localized patch-level queries, apply multi-constraint filters, and inspect analogs through familiar meteorological views. This facilitates a discovery loop where scientists characterize a phenomenon in a well-understood dataset and use its latent signature to probe larger archives. While we demonstrate the workbench through a tropical cyclone retrieval scenario using a vision foundation model (DINOv3) on ERA5 data, the framework is model-agnostic and designed to integrate with other embedding architectures in the future. Finally, we evaluate its out-of-core retrieval backend, demonstrating that interactive visual search over tens of millions of embeddings is highly scalable on commodity hardware.

Review

PDF

Published: May 01, 2026

Last updated: July 13, 2026

InqEduAgent: Adaptive AI Learning Partners with Gaussian Process Augmentation

Wen-Xi Yang, Tian-Fang Zhao, Guan Liu (cs.AI)

Collaborative partnerships play a crucial role in inquiry-oriented education. However, most learning partners are currently assigned through experience-driven heuristics or rule-based machine assistants, which often result in limited knowledge expansion and low adaptability. To address these challenges, this study introduces InqEduAgent, an LLM-empowered generative agent framework designed to simulate and select adaptive learning partners for inquiry-based learning. InqEduAgent integrates a Gaussian process-augmented matching mechanism to model the cognitive and evaluative characteristics of learners, allowing adaptive partner selection based on prior knowledge patterns. Comprehensive experiments demonstrate that InqEduAgent consistently achieves superior performance across diverse learning scenarios and large language model configurations. This study advances human-AI collaborative learning by enabling intelligent pairing between human- and AI-based learning partners, and contributes to adaptive user modeling and personalized recommendation within Web-based educational environments.

Review

PDF

Published: August 05, 2025

Last updated: July 13, 2026

Higher-Order Cell Tracking Transformer

Jordão Bragantini, Ilan Theodoro, Loïc A. Royer (cs.CV)

Reconstructing lineages from live-imaging microscopy requires linking cell detections across time, including through cell divisions. A common approach is to construct a candidate graph and associate cell segmentations (nodes) across frames. However, these and other existing methods overlook two structural obstacles in candidate tracking graphs: (i) cell divisions entangle distinct lineage paths in the node embedding space, and (ii) edges sharing a node have near-random label agreement, so the candidate-graph topology carries no useful information for graph neural networks to aggregate. We propose the Higher-Order Cell Tracking Transformer (HOCT), an edge-centric architecture in which candidate cell links attend to one another under a 3D geometric prior, resolving both issues. Evaluated on the Cell Tracking Challenge and a bacteria division benchmark, HOCT achieves state-of-the-art results without deep pre-trained image encoders. Moreover, the proposed approach is easier to fine-tune, quickly reducing tracking errors by 59

Review

PDF

Published: July 13, 2026

Last updated: July 13, 2026

Paradoxes of Game Theoretic Equilibria and Price of Anarchy

Georgios Piliouras, Ian Gemp, Siqi Liu, Luke Marris (cs.GT, cs.LG, cs.MA, math.DS, math.OC)

For decades, static solution concepts (Nash, Correlated, and Coarse Correlated Equilibria) and the Price of Anarchy (PoA) have formed the bedrock of algorithmic game theory, with no-regret learning proving fast convergence to such game-theoretic equilibria. We show that reducing multi-agent learning to static equilibrium and black-box regret analysis obscures underlying dynamic disequilibrium and game theoretic bounds. First, interior Nash equilibria lack C^1 vector field information, meaning agents cannot distinguish aligned from strictly opposing incentives. Inheriting this geometry, the worst-case pure Nash equilibria dictating robust PoA bounds manifest as topologically unstable strict saddles, and in canonical congestion games, as global repellers supported on almost everywhere strictly dominated strategies. Anchoring efficiency guarantees to these unstable states causes algebraic sensitivity; we prove that accommodating all strictly positive affine costs renders the PoA unbounded. Furthermore, projecting learning trajectories onto the discrete simplex of correlated play systematically accommodates non-rationalizable behavior. Evaluating dynamics via Coarse Correlated Equilibria or proximal refinements fails to preclude strictly dominated strategies. Moreover, optimal O(1/T) swap-regret minimization does not preclude macroscopic turbulence, manifesting as chaotic limit sets even in minimal games. Finally, we examine the non-atomic limit of congestion games. Though considered highly stable with tight sub-linear Θ(p/ln p) PoA bounds (where p is the polynomial degree), we prove that under discrete-time learning, the unique equilibrium destabilizes into Li-Yorke chaos and global attractors whose time-averaged inefficiency degrades exponentially as 2^p. These results necessitate re-evaluating worst-case equilibrium frameworks for dynamically grounded metrics.

Review

PDF

Published: July 13, 2026

Last updated: July 13, 2026

When Local Monitors Miss Compositional Harm: Diagnosing Distributed Backdoors in Multi-Agent Systems

Yibo Hu, Ren Wang (cs.CR, cs.LG, cs.MA)

As multi-agent, tool-using LLM systems are deployed, a common safety net is a runtime monitor that checks each message, tool call, or step on its own. We show this net has a fundamental hole. A distributed backdoor splits a harmful payload across agents, so every local check passes while the assembled object is the attack. The monitor can be right on every step and still miss the attack. The problem is not splitting itself: split fragments can still leak suspicious tokens or provenance edges. The hard case is local benignness. No fragment carries the harm, and what is left looks like ordinary benign traffic. We formalize this as an observability boundary: a monitor catches only what its view can tell apart from benign traffic. We prove that once the fragments look benign in the monitored view, no detector on that view can catch them, however strong it is. Across a controlled testbed, an external benchmark, and end-to-end agent runs, local monitors lose the signal exactly as local evidence disappears, and it returns only when the monitor sees the assembled object. A monitor trained only on benign traffic recovers the attack's code structure across held-out encodings (0.874 mean AUROC). A decoded-view gate, given the encoding family, blocks every tested attack. But seeing more is not enough: full-trace monitors and decoders still fail unless they reach the representation where the payload is exposed. Local safety is not global safety when harm is compositional, and the open problem is finding that representation.
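The gap between per-message checks and the assembled object can be shown with a deliberately simple toy. This is not the paper's testbed or threat model: the monitor is a hypothetical substring check, and `FORBIDDEN_CALL` is a harmless placeholder token. The abstract's harder case, where fragments are genuinely benign rather than merely split, goes beyond what this sketch captures.

```python
def local_monitor(message, banned=("FORBIDDEN_CALL",)):
    """Return True if the message passes a per-message check (contains no banned token)."""
    return not any(token in message for token in banned)

# Three agents each emit one fragment; no fragment contains the banned token.
fragments = ["FORBID", "DEN_", "CALL"]

# A downstream step concatenates the fragments into the object that is actually used.
assembled = "".join(fragments)

all_fragments_pass = all(local_monitor(fragment) for fragment in fragments)
assembled_passes = local_monitor(assembled)
```

Every fragment passes the local check, yet the concatenated string fails it. A monitor that only ever sees individual messages has no view in which the difference is visible.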

Review

PDF

Published: July 13, 2026

Last updated: July 13, 2026

Playful AI in Professional Email: A Field Experiment on Tone and Recipient Engagement

Ziv Ben-Zion, Teddy Lazebnik (cs.AI, cs.HC)

Large language models (LLMs) are rapidly reshaping workplace communication, yet whether AI-assisted writing changes how recipients actually behave, and through what channel, remains unknown. Here, in a randomized crossover field experiment, 121 employees across six companies sent work emails under three conditions over three weeks: unaided writing, GPT-5 rewriting in a playful tone, and GPT-5 rewriting in a professional tone. Across 16,880 emails, playful editing increased emotional positivity (B=+0.068, p<0.001), and professional editing decreased it (B=-0.041, p<0.001), yet neither condition directly altered open rates, reply rates, or response times. Instead, within-sender positivity strongly predicted both opening (OR=2.05) and replying (OR=3.32, p<0.001), a significant indirect pathway through which AI editing shaped behavior, in the absence of any direct effect. These findings suggest that AI-assisted communication shapes workplace engagement not through its use, but through the emotional tone of the language it produces.

Published: July 13, 2026

Last updated: July 13, 2026

HiFi-LLP: High-Fidelity, Low-Cost Latency Predictors with Confidence for Robust HW-NAS

Shambhavi Balamuthu Sampath, Behzad Shomali, Nael Fasfous, Moritz Thoma, Judeson Anthony Fernando, Lukas Frickenstein, Pierpaolo Mori, Manoj Rohit Vemparala, Alexander Frickenstein, Walter Stechele (cs.LG, cs.AR)

With deep neural networks (DNNs) increasingly deployed on edge devices, hardware (HW)-aware optimization techniques, such as HW-aware compression and HW-aware neural architecture search (HW-NAS), have become essential. These methods rely on real feedback from the target hardware to tailor DNN architectures for efficient deployment. While the search can be parallelized, latency measurements via hardware-in-the-loop (HIL) remain a bottleneck due to their sequential nature. Recent approaches use latency predictors to replace costly HIL feedback, but challenges persist: (1) platform-specific predictors often require tens of thousands of samples, and (2) inaccurate predictions can mislead the NAS process. To address these challenges, we introduce HiFi-LLP, a high-fidelity, low-cost latency predictor based on graph attention networks, augmented with a confidence metric. HiFi-LLP outperforms prior platform-specific predictors by up to 9 percentage points (p.p.) in the 10

Published: July 13, 2026

Last updated: July 13, 2026
