Accurate and scalable exchange-correlation with deep learning
Density Functional Theory (DFT) underpins much of modern computational chemistry and materials science. Yet, the reliability of DFT-derived predictions of experimentally measurable properties remains fundamentally limited by the need to approximate the unknown exchange-correlation (XC) functional. The traditional paradigm for improving accuracy has relied on increasingly elaborate hand-crafted functional forms. This approach has led to a longstanding trade-off between computational efficiency and accuracy, which remains insufficient for reliable predictive modelling of laboratory experiments. Here we introduce Skala, a deep learning-based XC functional that surpasses state-of-the-art hybrid functionals in accuracy across the main-group chemistry benchmark set GMTKN55 with an error of 2.8 kcal/mol, while retaining the lower computational cost characteristic of semi-local DFT. This demonstrated departure from the historical trade-off between accuracy and efficiency is enabled by learning non-local representations of electronic structure directly from data, bypassing the need for increasingly costly hand-engineered features. Leveraging an unprecedented volume of high-accuracy reference data from wavefunction-based methods, we establish that modern deep learning enables systematically improvable neural exchange-correlation models as training datasets expand, positioning first-principles simulations to become progressively more predictive.
Published: June 17, 2025
Last updated: April 21, 2026
Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items
Recent advances in image generation and editing have opened new opportunities for virtual try-on. However, existing methods still struggle to meet complex real-world demands. We present Tstars-Tryon 1.0, a commercial-scale virtual try-on system that is robust, realistic, versatile, and highly efficient. First, our system maintains a high success rate across challenging cases like extreme poses, severe illumination variations, motion blur, and other in-the-wild conditions. Second, it delivers highly photorealistic results with fine-grained details, faithfully preserving garment texture, material properties, and structural characteristics, while largely avoiding common AI-generated artifacts. Third, beyond apparel try-on, our model supports flexible multi-image composition (up to 6 reference images) across 8 fashion categories, with coordinated control over person identity and background. Fourth, to overcome the latency bottlenecks of commercial deployment, our system is heavily optimized for inference speed, delivering near real-time generation for a seamless user experience. These capabilities are enabled by an integrated system design spanning end-to-end model architecture, a scalable data engine, robust infrastructure, and a multi-stage training paradigm. Extensive evaluation and large-scale product deployment demonstrate that Tstars-Tryon 1.0 achieves leading overall performance. To support future research, we also release a comprehensive benchmark. The model has been deployed at an industrial scale on the Taobao App, serving millions of users with tens of millions of requests.
Published: April 21, 2026
Last updated: April 21, 2026
AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model
Sparse-view 3D reconstruction is essential for modeling scenes from casual captures, but remains challenging for non-generative reconstruction. Existing diffusion-based approaches mitigate this issue by synthesizing novel views, but they often condition on only one or two capture frames, which restricts geometric consistency and limits scalability to large or diverse scenes. We propose AnyRecon, a scalable framework for reconstruction from arbitrary and unordered sparse inputs that preserves explicit geometric control while supporting flexible conditioning cardinality. To support long-range conditioning, our method constructs a persistent global scene memory via a prepended capture-view cache, and removes temporal compression to maintain frame-level correspondence under large viewpoint changes. Beyond a better generative model, we also find that the interplay between generation and reconstruction is crucial for large-scale 3D scenes. Thus, we introduce a geometry-aware conditioning strategy that couples generation and reconstruction through an explicit 3D geometric memory and geometry-driven capture-view retrieval. To ensure efficiency, we combine 4-step diffusion distillation with context-window sparse attention to reduce quadratic complexity. Extensive experiments demonstrate robust and scalable reconstruction across irregular inputs, large viewpoint gaps, and long trajectories.
Published: April 21, 2026
Last updated: April 21, 2026
CityRAG: Stepping Into a City via Spatially-Grounded Video Generation
We address the problem of generating a 3D-consistent, navigable environment that is spatially grounded: a simulation of a real location. Existing video generative models can produce a plausible sequence that is consistent with a text (T2V) or image (I2V) prompt. However, the capability to reconstruct the real world under arbitrary weather conditions and dynamic object configurations is essential for downstream applications including autonomous driving and robotics simulation. To this end, we present CityRAG, a video generative model that leverages large corpora of geo-registered data as context to ground generation to the physical scene, while maintaining learned priors for complex motion and appearance changes. CityRAG relies on temporally unaligned training data, which teaches the model to semantically disentangle the underlying scene from its transient attributes. Our experiments demonstrate that CityRAG can generate coherent minutes-long, physically grounded video sequences, maintain weather and lighting conditions over thousands of frames, achieve loop closure, and navigate complex trajectories to reconstruct real-world geography.
Published: April 21, 2026
Last updated: April 21, 2026
Generalization at the Edge of Stability
Training modern neural networks often relies on large learning rates, operating at the edge of stability, where the optimization dynamics exhibit oscillatory and chaotic behavior. Empirically, this regime often yields improved generalization performance, yet the underlying mechanism remains poorly understood. In this work, we represent stochastic optimizers as random dynamical systems, which often converge to a fractal attractor set (rather than a point) with a smaller intrinsic dimension. Building on this connection and inspired by Lyapunov dimension theory, we introduce a novel notion of dimension, coined the `sharpness dimension', and prove a generalization bound based on this dimension. Our results show that generalization in the chaotic regime depends on the complete Hessian spectrum and the structure of its partial determinants, highlighting a complexity that cannot be captured by the trace or spectral norm considered in prior work. Experiments across various MLPs and transformers validate our theory while also providing new insights into the recently observed phenomenon of grokking.
Published: April 21, 2026
Last updated: April 21, 2026
Phase Transitions in the Fluctuations of Functionals of Random Neural Networks
We establish central and non-central limit theorems for sequences of functionals of the Gaussian output of an infinitely-wide random neural network on the d-dimensional sphere. We show that the asymptotic behaviour of these functionals as the depth of the network increases depends crucially on the fixed points of the covariance function, resulting in three distinct limiting regimes: convergence to the same functional of a limiting Gaussian field, convergence to a Gaussian distribution, or convergence to a distribution in the Qth Wiener chaos. Our proofs exploit tools that are now classical (Hermite expansions, the Diagram Formula, Stein-Malliavin techniques), but also ideas which have not previously been used in similar contexts: in particular, the asymptotic behaviour is determined by the fixed-point structure of the iterative operator associated with the covariance, whose nature and stability govern the different limiting regimes.
Published: April 21, 2026
Last updated: April 21, 2026
Safe Continual Reinforcement Learning in Non-stationary Environments
Reinforcement learning (RL) offers a compelling data-driven paradigm for synthesizing controllers for complex systems when accurate physical models are unavailable; however, most existing control-oriented RL methods assume stationarity and, therefore, struggle in real-world non-stationary deployments where system dynamics and operating conditions can change unexpectedly. Moreover, RL controllers acting in physical environments must satisfy safety constraints throughout their learning and execution phases, rendering transient violations during adaptation unacceptable. Although continual RL and safe RL have each addressed non-stationarity and safety, respectively, their intersection remains comparatively unexplored, motivating the study of safe continual RL algorithms that can adapt over the system's lifetime while preserving safety. In this work, we systematically investigate safe continual reinforcement learning by introducing three benchmark environments that capture safety-critical continual adaptation and by evaluating representative approaches from safe RL, continual RL, and their combinations. Our empirical results reveal a fundamental tension between maintaining safety constraints and preventing catastrophic forgetting under non-stationary dynamics, with existing methods generally failing to achieve both objectives simultaneously. To address this shortcoming, we examine regularization-based strategies that partially mitigate this trade-off and characterize their benefits and limitations. Finally, we outline key open challenges and research directions toward developing safe, resilient learning-based controllers capable of sustained autonomous operation in changing environments.
Published: April 21, 2026
Last updated: April 21, 2026
Generative Drifting for Conditional Medical Image Generation
Conditional medical image generation plays an important role in many clinically relevant imaging tasks. However, existing methods still face a fundamental challenge in balancing inference efficiency, patient-specific fidelity, and distribution-level plausibility, particularly in high-dimensional 3D medical imaging. In this work, we propose GDM, a generative drifting framework that reformulates deterministic medical image prediction as a multi-objective learning problem to jointly promote distribution-level plausibility and patient-specific fidelity while retaining one-step inference. GDM extends drifting to 3D medical imaging through an attractive-repulsive drift that minimizes the discrepancy between the generator pushforward and the target distribution. To enable stable drifting-based learning in 3D volumetric data, GDM constructs a multi-level feature bank from a medical foundation encoder to support reliable affinity estimation and drifting field computation across complementary global, local, and spatial representations. In addition, a gradient coordination strategy in the shared output space improves optimization balance under competing distribution-level and fidelity-oriented objectives. We evaluate the proposed framework on two representative tasks, MRI-to-CT synthesis and sparse-view CT reconstruction. Experimental results show that GDM consistently outperforms a wide range of baselines, including GAN-based, flow-matching-based, and SDE-based generative models, as well as supervised regression methods, while improving the balance among anatomical fidelity, quantitative reliability, perceptual realism, and inference efficiency. These findings suggest that GDM provides a practical and effective framework for conditional 3D medical image generation.
Published: April 21, 2026
Last updated: April 21, 2026
Tackling multiphysics problems via finite element-guided physics-informed operator learning
This work presents a finite element-guided physics-informed operator learning framework for multiphysics problems with coupled partial differential equations (PDEs) on arbitrary domains. The proposed framework learns an operator from the input space to the solution space with a weighted residual formulation based on the finite element method, enabling discretization-independent prediction beyond the training resolution without relying on labeled simulation data. The present framework for multiphysics problems is implemented in Folax, a JAX-based operator learning platform, and is verified on nonlinear coupled thermo-mechanical problems. Two- and three-dimensional representative volume elements with varying heterogeneous microstructures, and a close-to-reality industrial casting example under varying boundary conditions are investigated as the example problems. We investigate the potential of several neural operators combined with the proposed finite element-guided approach, including Fourier neural operators (FNOs), deep operator networks (DeepONets), and a newly proposed implicit finite operator learning (iFOL) approach based on conditional neural fields. The results demonstrate that FNOs yield highly accurate solution operators on regular domains, where the global features can be efficiently learned in the spectral domain, and iFOL offers efficient parametric operator learning capabilities for complex and irregular geometries. Furthermore, studies on training strategies, network decomposition, and training sample quality reveal that a monolithic training strategy using a single network is sufficient for accurate predictions, while training sample quality strongly influences performance. Overall, the present approach highlights the potential of physics-informed operator learning with a finite element-based loss as a unified and scalable approach for coupled multiphysics simulations.
Published: March 02, 2026
Last updated: April 21, 2026
UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling
Scaling humanoid foundation models is bottlenecked by the scarcity of robotic data. While massive egocentric human data offers a scalable alternative, bridging the cross-embodiment chasm remains a fundamental challenge due to kinematic mismatches. We introduce UniT (Unified Latent Action Tokenizer via Visual Anchoring), a framework that establishes a unified physical language for human-to-humanoid transfer. Grounded in the philosophy that heterogeneous kinematics share universal visual consequences, UniT employs a tri-branch cross-reconstruction mechanism: actions predict vision to anchor kinematics to physical outcomes, while vision reconstructs actions to filter out irrelevant visual confounders. Concurrently, a fusion branch synergizes these purified modalities into a shared discrete latent space of embodiment-agnostic physical intents. We validate UniT across two paradigms: 1) Policy Learning (VLA-UniT): By predicting these unified tokens, it effectively leverages diverse human data to achieve state-of-the-art data efficiency and robust out-of-distribution (OOD) generalization on both humanoid simulation benchmarks and real-world deployments, notably demonstrating zero-shot task transfer. 2) World Modeling (WM-UniT): By aligning cross-embodiment dynamics via unified tokens as conditions, it realizes direct human-to-humanoid action transfer. This alignment ensures that human data seamlessly translates into enhanced action controllability for humanoid video generation. Ultimately, by inducing a highly aligned cross-embodiment representation (empirically verified by t-SNE visualizations revealing the convergence of human and humanoid features into a shared manifold), UniT offers a scalable path to distill vast human knowledge into general-purpose humanoid capabilities.
Published: April 21, 2026
Last updated: April 21, 2026
MapPFN: Learning Causal Perturbation Maps in Context
Planning effective interventions in biological systems requires treatment-effect models that adapt to unseen biological contexts by identifying their specific underlying mechanisms. Yet single-cell perturbation datasets span only a handful of biological contexts, and existing methods cannot leverage new interventional evidence at inference time to adapt beyond their training data. To meta-learn a perturbation effect estimator, we present MapPFN, a prior-data fitted network (PFN) pre-trained on synthetic data generated from a prior over causal perturbations. Given a set of experiments, MapPFN uses in-context learning to predict post-perturbation distributions. Pre-trained on in silico gene knockouts alone, MapPFN identifies differentially expressed genes on par with models trained on real single-cell data. Fine-tuned, it consistently outperforms baselines across downstream datasets. Our code, model and data are available at https://marvinsxtr.github.io/MapPFN.
Published: January 28, 2026
Last updated: April 21, 2026
Greedy Routing in a Sequentially Grown One-Dimensional Random Graph
We analyze greedy routing in a random graph G_n constructed on the vertex set V = {1, 2, ..., n} embedded in Z. Vertices are inserted according to a uniform random permutation pi, and each newly inserted vertex connects to its nearest already-inserted neighbors on the left and right (if they exist). This work addresses a conjecture originating from empirical studies (Ponomarenko et al., 2011; Malkov et al., 2012), which observed through simulations that greedy search in sequentially grown graphs exhibits logarithmic routing complexity across various dimensions. While the original claim was based on experiments and geometric intuition, a rigorous mathematical foundation remained open. Here, we formalize and resolve this conjecture for the one-dimensional case. For a greedy walk GW starting at vertex 1 targeting vertex n -- which at each step moves to the neighbor closest to n -- we prove that the number of steps S_n required to reach n satisfies S_n = Theta(log n) with high probability. Precisely, S_n = L_n + R_n - 2, where L_n and R_n are the numbers of left-to-right and right-to-left minima in the insertion-time permutation. Consequently, E[S_n] = 2H_n - 2 ~ 2 log n and P(S_n >= (2+c) log n) <= n^(-h(c/2) + o(1)) for any constant c > 0, with an analogous lower tail bound for 0 < c < 2, where h(u) = (1+u) ln(1+u) - u is the Bennett rate function. Furthermore, we establish that this logarithmic scaling is robust: for arbitrary or uniformly random start-target pairs, the expected routing complexity remains E[S_{s,t}] = 2 log n + O(1), closely mirroring decentralized routing scenarios in real-world networks where endpoints are chosen dynamically rather than fixed a priori.
Published: April 21, 2026
Last updated: April 21, 2026
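The step-count identity S_n = L_n + R_n - 2 stated in the abstract is easy to check empirically. The sketch below (function names are my own) grows the graph by the stated insertion rule, runs the greedy walk from vertex 1 to n, and compares the step count against the minima counts of the insertion-time permutation:

```python
import bisect
import random

def build_graph(pi):
    """Grow the graph by inserting vertices in the order given by pi; each
    new vertex links to its nearest already-inserted neighbour on each side."""
    adj = {v: set() for v in pi}
    inserted = []                                 # sorted list of inserted vertices
    for v in pi:
        i = bisect.bisect_left(inserted, v)
        if i > 0:                                 # nearest inserted vertex to the left
            adj[v].add(inserted[i - 1]); adj[inserted[i - 1]].add(v)
        if i < len(inserted):                     # nearest inserted vertex to the right
            adj[v].add(inserted[i]); adj[inserted[i]].add(v)
        inserted.insert(i, v)
    return adj

def greedy_steps(adj, n):
    """Greedy walk from vertex 1 to n, always moving to the neighbour closest to n."""
    v, steps = 1, 0
    while v != n:
        v = max(adj[v])                           # labels never exceed n
        steps += 1
    return steps

def minima_count(seq):
    """Number of left-to-right minima of seq (the first element always counts)."""
    count, best = 0, float("inf")
    for t in seq:
        if t < best:
            count, best = count + 1, t
    return count

random.seed(0)
n = 40
for _ in range(100):
    pi = random.sample(range(1, n + 1), n)
    time = {v: i for i, v in enumerate(pi)}       # insertion-time permutation
    times = [time[v] for v in range(1, n + 1)]
    L, R = minima_count(times), minima_count(times[::-1])
    assert greedy_steps(build_graph(pi), n) == L + R - 2
```

Every trial matches the identity exactly, consistent with the Theta(log n) scaling, since the expected number of left-to-right minima of a uniform permutation is the harmonic number H_n.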
Time-Scale Coupling Between States and Parameters in Recurrent Neural Networks
We show that gating mechanisms in recurrent neural networks (RNNs) induce lag-dependent and direction-dependent effective learning rates, even when training uses a fixed, global step size. This behavior arises from a coupling between state-space time-scales (parametrized by the gates) and parameter-space dynamics during gradient descent. By deriving exact Jacobians for leaky-integrator and gated RNNs and applying a first-order expansion, we make explicit how constant, scalar, and multi-dimensional gates reshape gradient propagation, modulate effective step sizes, and introduce anisotropy in parameter updates. These findings reveal that gates act not only as filters of information flow, but also as data-driven preconditioners of optimization, with formal connections to learning-rate schedules, momentum, and adaptive methods such as Adam. Empirical simulations corroborate these predictions: across several sequence tasks, gates produce lag-dependent effective learning rates and concentrate gradient flow into low-dimensional subspaces, matching or exceeding the anisotropic structure induced by Adam. Notably, gating and optimizer-driven adaptivity shape complementary aspects of credit assignment: gates align state-space transport with loss-relevant directions, while optimizers rescale parameter-space updates. Overall, this work provides a unified dynamical systems perspective on how gating couples state evolution with parameter updates, clarifying why gated architectures achieve robust trainability in practice.
Published: August 16, 2025
Last updated: April 21, 2026
Local Updates in Distributed Optimization: Provable Acceleration and Topology Effects
Inspired by the success of performing multiple local optimization steps between communication rounds in federated learning, incorporating such local updates into distributed optimization has recently attracted growing interest. However, unlike federated learning, where local updates can accelerate training by reducing gradient estimation error under minibatch settings, it remains unclear whether similar benefits persist when exact gradients are available. Moreover, existing theoretical results typically require reducing the step size when multiple local updates are employed, which can entirely offset any potential benefit of these additional local updates. In this paper, we focus on the classic DIGing algorithm and leverage the tight performance bounds provided by Performance Estimation Problems (PEP) to show that incorporating local updates can indeed accelerate distributed optimization. To the best of our knowledge, this is the first rigorous demonstration of such acceleration for a broad class of objective functions. Our analysis further reveals that, under an appropriate step size, performing only two local updates is sufficient to achieve the maximal possible improvement, and that additional local updates provide no further gains. Because more updates increase computational cost, these findings offer practical guidance for efficient implementation. We also show that these speed gains depend critically on the network structure, with sparser or less connected graphs, characterized by the spectral properties of the mixing matrix, yielding smaller improvements. Extensive experiments on both synthetic and real-world datasets corroborate the theoretical findings.
Published: January 06, 2026
Last updated: April 21, 2026
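For readers unfamiliar with the DIGing (gradient tracking) iteration studied above, here is a minimal sketch on a toy problem; the quadratic objective, ring topology, and step size are illustrative choices, not taken from the paper.

```python
import numpy as np

# DIGing on a 4-node ring with f_i(x) = 0.5 * ||x - c_i||^2, whose global
# optimum (minimizer of the sum) is the average of the c_i.
rng = np.random.default_rng(0)
m, d, alpha = 4, 2, 0.1
c = rng.normal(size=(m, d))
W = np.array([[0.50, 0.25, 0.00, 0.25],           # Metropolis weights on a ring
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])

grad = lambda X: X - c                            # per-node gradients, stacked row-wise
X = np.zeros((m, d))
Y = grad(X)                                       # tracker initialized at the gradients
for _ in range(2000):
    X_new = W @ X - alpha * Y                     # mix, then step along the tracker
    Y = W @ Y + grad(X_new) - grad(X)             # gradient-tracking update
    X = X_new

x_star = c.mean(axis=0)
assert np.allclose(X, np.tile(x_star, (m, 1)), atol=1e-6)
```

Performing multiple local updates would repeat the gradient step between the two W-mixing steps; the paper's PEP analysis indicates that, with an appropriate step size, two such local updates already capture the full achievable speedup.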
FASTER: Value-Guided Sampling for Fast RL
Some of the most performant reinforcement learning algorithms today can be prohibitively expensive as they use test-time scaling methods such as sampling multiple action candidates and selecting the best one. In this work, we propose FASTER, a method for getting the benefits of sampling-based test-time scaling of diffusion-based policies without the computational cost by tracing the performance gain of action samples back to earlier in the denoising process. Our key insight is that we can model the denoising of multiple action candidates and selecting the best one as a Markov Decision Process (MDP) where the goal is to progressively filter action candidates before denoising is complete. With this MDP, we can learn a policy and value function in the denoising space that predicts the downstream value of action candidates in the denoising process and filters them while maximizing returns. The result is a method that is lightweight and can be plugged into existing generative RL algorithms. Across challenging long-horizon manipulation tasks in online and batch-online RL, FASTER consistently improves the underlying policies and achieves the best overall performance among the compared methods. Applied to a pretrained VLA, FASTER achieves the same performance while substantially reducing training and inference compute requirements. Code is available at https://github.com/alexanderswerdlow/faster .
Published: April 21, 2026
Last updated: April 21, 2026
FB-NLL: A Feature-Based Approach to Tackle Noisy Labels in Personalized Federated Learning
Personalized Federated Learning (PFL) aims to learn multiple task-specific models rather than a single global model across heterogeneous data distributions. Existing PFL approaches typically rely on iterative optimization signals, such as model update trajectories, to cluster users that need to accomplish the same tasks together. However, these learning-dynamics-based methods are inherently vulnerable to low-quality data and noisy labels, as corrupted updates distort clustering decisions and degrade personalization performance. To tackle this, we propose FB-NLL, a feature-centric framework that decouples user clustering from iterative training dynamics. By exploiting the intrinsic heterogeneity of local feature spaces, FB-NLL characterizes each user through the spectral structure of the covariances of their feature representations and leverages subspace similarity to identify task-consistent user groupings. This geometry-aware clustering is label-agnostic and is performed in a one-shot manner prior to training, significantly reducing communication overhead and computational costs compared to iterative baselines. Complementing this, we introduce a feature-consistency-based detection and correction strategy to address noisy labels within clusters. By leveraging directional alignment in the learned feature space and assigning labels based on class-specific feature subspaces, our method mitigates corrupted supervision without requiring estimation of stochastic noise transition matrices. In addition, FB-NLL is model-independent and integrates seamlessly with existing noise-robust training techniques. Extensive experiments across diverse datasets and noise regimes demonstrate that our framework consistently outperforms state-of-the-art baselines in terms of average accuracy and performance stability.
Published: April 21, 2026
Last updated: April 21, 2026
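The spectral, one-shot clustering signal described above can be illustrated on synthetic data. The signature and affinity functions below are my own illustrative stand-ins for the paper's construction: each user is summarized by the leading eigenvectors of their feature covariance, and users are compared via a normalized subspace overlap.

```python
import numpy as np

def user_signature(F, k=2):
    """Top-k eigenvectors of a user's feature covariance (illustrative
    stand-in for the paper's spectral characterization)."""
    C = np.cov(F, rowvar=False)
    _, vecs = np.linalg.eigh(C)                   # eigenvalues ascending
    return vecs[:, -k:]                           # columns: leading eigenvectors

def subspace_affinity(U, V):
    """Overlap of two k-dim subspaces: ||U^T V||_F^2 / k, in [0, 1]."""
    return np.linalg.norm(U.T @ V) ** 2 / U.shape[1]

# two user groups whose features lie near different 2-D subspaces of R^5
rng = np.random.default_rng(0)
B1 = np.linalg.qr(rng.normal(size=(5, 2)))[0]
B2 = np.linalg.qr(rng.normal(size=(5, 2)))[0]
users = [rng.normal(size=(200, 2)) @ B.T + 0.05 * rng.normal(size=(200, 5))
         for B in (B1, B1, B2, B2)]
sigs = [user_signature(F) for F in users]
same = subspace_affinity(sigs[0], sigs[1])        # users sharing a subspace
diff = subspace_affinity(sigs[0], sigs[2])        # users from different subspaces
assert same > diff                                # one-shot, label-free grouping signal
```

Because the signature depends only on unlabeled features, corrupted labels cannot distort the grouping, which is the motivation for decoupling clustering from training dynamics.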
VLA Foundry: A Unified Framework for Training Vision-Language-Action Models
We present VLA Foundry, an open-source framework that unifies LLM, VLM, and VLA training in a single codebase. Most open-source VLA efforts specialize in the action-training stage, often stitching together incompatible pretraining pipelines. VLA Foundry instead provides a shared training stack with end-to-end control, from language pretraining to action-expert fine-tuning. VLA Foundry supports both from-scratch training and pretrained backbones from Hugging Face. To demonstrate the utility of our framework, we train and release two types of models: the first trained fully from scratch through our LLM-->VLM-->VLA pipeline and the second built on the pretrained Qwen3-VL backbone. We evaluate closed-loop policy performance of both models on LBM Eval, an open-data, open-source simulator. We also contribute usability improvements to the simulator and the STEP analysis tools for easier public use. In the nominal evaluation setting, our fully-open from-scratch model is on par with our prior closed-source work, and substituting in the Qwen3-VL backbone yields a strong multi-task tabletop manipulation policy that outperforms our baseline by a wide margin. The VLA Foundry codebase is available at https://github.com/TRI-ML/vla_foundry and all multi-task model weights are released on https://huggingface.co/collections/TRI-ML/vla-foundry. Additional qualitative videos are available on the project website https://tri-ml.github.io/vla_foundry.
Published: April 21, 2026
Last updated: April 21, 2026
Benign Overfitting in Adversarial Training for Vision Transformers
Despite the remarkable success of Vision Transformers (ViTs) across a wide range of vision tasks, recent studies have revealed that they remain vulnerable to adversarial examples, much like Convolutional Neural Networks (CNNs). A common empirical defense strategy is adversarial training, yet the theoretical underpinnings of its robustness in ViTs remain largely unexplored. In this work, we present the first theoretical analysis of adversarial training under simplified ViT architectures. We show that, when trained under a signal-to-noise ratio that satisfies a certain condition and within a moderate perturbation budget, adversarial training enables ViTs to achieve nearly zero robust training loss and robust generalization error under certain regimes. Remarkably, this leads to strong generalization even in the presence of overfitting, a phenomenon known as benign overfitting, previously only observed in CNNs (with adversarial training). Experiments on both synthetic and real-world datasets further validate our theoretical findings.
Published: April 21, 2026
Last updated: April 21, 2026
Adaptive MSD-Splitting: Enhancing C4.5 and Random Forests for Skewed Continuous Attributes
The discretization of continuous numerical attributes remains a persistent computational bottleneck in the induction of decision trees, particularly as dataset dimensions scale. Building upon the recently proposed MSD-Splitting technique -- which bins continuous data using the empirical mean and standard deviation to dramatically improve the efficiency and accuracy of the C4.5 algorithm -- we introduce Adaptive MSD-Splitting (AMSD). While standard MSD-Splitting is highly effective for approximately symmetric distributions, its rigid adherence to fixed one-standard-deviation cutoffs can lead to catastrophic information loss in highly skewed data, a common artifact in real-world biomedical and financial datasets. AMSD addresses this by dynamically adjusting the standard deviation multiplier based on feature skewness, narrowing intervals in dense regions to preserve discriminative resolution. Furthermore, we integrate AMSD into ensemble methods, specifically presenting the Random Forest-AMSD (RF-AMSD) framework. Empirical evaluations on the Census Income, Heart Disease, Breast Cancer, and Forest Covertype datasets demonstrate that AMSD yields a 2-4% accuracy improvement over standard MSD-Splitting, while maintaining near-identical O(N) time complexity reductions compared to the O(N log N) exhaustive search. Our Random Forest extension achieves state-of-the-art accuracy at a fraction of standard computational costs, confirming the viability of adaptive statistical binning in large-scale ensemble learning architectures.
Published: April 21, 2026
Last updated: April 21, 2026
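A minimal sketch of the mean/standard-deviation binning idea with a skewness-adjusted multiplier. The adjustment rule k = 1 / (1 + |skew|) is a hypothetical stand-in (the abstract does not give the paper's exact formula); it narrows the cutpoints as the distribution becomes more skewed, which is the stated design goal.

```python
import statistics

def amsd_cutpoints(values, base_k=1.0):
    """Adaptive mean/std cutpoints. The skewness-based multiplier
    k = base_k / (1 + |skew|) is an illustrative rule, not the paper's."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    n = len(values)
    skew = sum(((x - mu) / sd) ** 3 for x in values) / n if sd else 0.0
    k = base_k / (1.0 + abs(skew))                # narrow the interval under skew
    return mu - k * sd, mu + k * sd

def discretize(x, lo, hi):
    """Map a continuous value to one of three bins: below, inside, above."""
    return 0 if x < lo else (2 if x > hi else 1)

# symmetric data keeps the full +/- 1 sd interval (skew = 0)
lo, hi = amsd_cutpoints([1, 2, 3, 4, 5])

# heavily right-skewed data gets a narrower interval than +/- 1 sd
vals = [1] * 9 + [10]
lo_s, hi_s = amsd_cutpoints(vals)
assert hi_s - lo_s < 2 * statistics.pstdev(vals)
```

Each split then evaluates only the two candidate cutpoints per feature, which is the source of the O(N) cost versus the O(N log N) sort-based exhaustive search.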
ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis
Human video generation remains challenging due to the difficulty of jointly modeling human appearance, motion, and camera viewpoint under limited multi-view data. Existing methods often address these factors separately, resulting in limited controllability or reduced visual quality. We revisit this problem from an image-first perspective, where high-quality human appearance is learned via image generation and used as a prior for video synthesis, decoupling appearance modeling from temporal consistency. We propose a pose- and viewpoint-controllable pipeline that combines a pretrained image backbone with SMPL-X-based motion guidance, together with a training-free temporal refinement stage based on a pretrained video diffusion model. Our method produces high-quality, temporally consistent videos under diverse poses and viewpoints. We also release a canonical human dataset and an auxiliary model for compositional human image synthesis. Code and data are publicly available at https://github.com/Taited/ReImagine.
Published: April 21, 2026
Last updated: April 21, 2026
Discovering a Shared Logical Subspace: Steering LLM Logical Reasoning via Alignment of Natural-Language and Symbolic Views
Large Language Models (LLMs) still struggle with multi-step logical reasoning. Existing approaches either purely refine the reasoning chain in natural language form or attach a symbolic solver as an external module. In this work, we instead ask whether LLMs contain a shared internal logical subspace that simultaneously aligns natural-language and symbolic-language views of the reasoning process. Our hypothesis is that this logical subspace captures logical reasoning capabilities in LLMs that are shared across views while remaining independent of surface forms. To verify this, we employ Canonical Correlation Analysis on the paired residual activations from natural-language and symbolic-language reasoning chains, learning a low-dimensional subspace with maximum cross-view correlation. Furthermore, we design a training-free approach that steers the LLM's reasoning chain along this logical subspace, thereby leveraging the complementary reasoning signals from both views. Experiments on four logical reasoning benchmarks demonstrate the effectiveness of our approach, improving accuracy by up to 11 percentage points and generalizing well on out-of-domain problems.
Published: April 21, 2026
Last updated: April 21, 2026
A Network-Aware Evaluation of Distributed Energy Resource Control in Smart Distribution Systems
Distribution networks with high penetration of Distributed Energy Resources (DERs) increasingly rely on communication networks to coordinate grid-interactive control. While many distributed control schemes have been proposed, they are often evaluated under idealized communication assumptions, making it difficult to assess their performance under realistic network conditions. This work presents an implementation-driven evaluation of a representative virtual power plant (VPP) dispatch algorithm using a co-simulation framework that couples a linearized distribution-system model with packet-level downlink emulation in ns-3. The study considers a modified IEEE 37-node feeder with high photovoltaic penetration and a primal-dual VPP dispatch that simultaneously targets feeder-head active power tracking and voltage regulation. Communication effects are introduced only on the downlink path carrying dual-variable updates, where per-DER packet delays and a hold-last-value strategy are modeled. Results show that, under ideal communication, the dispatch achieves close tracking of the feeder-head power reference while maintaining voltages within the prescribed limits at selected buses. When realistic downlink delay is introduced, the same controller exhibits large oscillations in feeder-head power and more frequent voltage limit violations. These findings highlight that distributed DER control performance can be strongly influenced by communication behavior and motivate evaluation frameworks that explicitly incorporate network dynamics into the assessment of grid-interactive control schemes.
Published: April 21, 2026
Last updated: April 21, 2026
SpecAgent: A Speculative Retrieval and Forecasting Agent for Code Completion
Large Language Models (LLMs) excel at code-related tasks but often struggle in realistic software repositories, where project-specific APIs and cross-file dependencies are crucial. Retrieval-augmented methods mitigate this by injecting repository context at inference time, but the tight inference-time latency budget forces a trade-off: either retrieval quality suffers, or the added latency adversely impacts user experience. We address this limitation with SpecAgent, an agent that improves both latency and code-generation quality by proactively exploring repository files during indexing and constructing speculative context that anticipates future edits in each file. This indexing-time asynchrony permits thorough context computation while masking its latency, and the speculative nature of the context improves code-generation quality. Additionally, we identify the problem of future context leakage in existing benchmarks, which can inflate reported performance. To address this, we construct a synthetic, leakage-free benchmark that enables a more realistic evaluation of our agent against baselines. Experiments show that SpecAgent consistently achieves absolute gains of 9-11% (48-58% relative) compared to the best-performing baselines, while significantly reducing inference latency.
Published: October 20, 2025
Last updated: April 21, 2026
Ultrametric OGP - parametric RDT symmetric binary perceptron connection
In [97,99,100], an fl-RDT framework is introduced to characterize statistical computational gaps (SCGs). Studying symmetric binary perceptrons (SBPs), [100] obtained an algorithmic threshold estimate α_a ≈ α_c^(7) ≈ 1.6093 at the 7th lifting level (for margin κ=1), closely approaching the ≈1.58 local entropy (LE) prediction [18]. In this paper, we further connect parametric RDT to overlap gap properties (OGPs), another key geometric feature of the solution space. Specifically, for any positive integer s, we consider s-level ultrametric OGPs (ult_s-OGPs) and rigorously upper-bound the associated constraint densities α_ult_s. To achieve this, we develop an analytical union-bounding program consisting of combinatorial and probabilistic components. By casting the combinatorial part as a convex problem and the probabilistic part as a nested integration, we conduct numerical evaluations and find that the tightest bounds at the first two levels, α_ult_1 ≈ 1.6578 and α_ult_2 ≈ 1.6219, closely approach the 3rd and 4th lifting level parametric RDT estimates, α_c^(3) ≈ 1.6576 and α_c^(4) ≈ 1.6218. We also observe excellent agreement across other key parameters, including overlap values and the relative sizes of ultrametric clusters. Based on these observations, we propose several conjectures linking ult-OGP and parametric RDT. Specifically, we conjecture that the algorithmic threshold satisfies α_a = lim_s→∞ α_ult_s = lim_r→∞ α_c^(r), and that α_ult_s ≤ α_c^(s+2) (with possible equality for some, maybe even all, s). Finally, we discuss the potential existence of a full isomorphism connecting all key parameters of ult-OGP and parametric RDT.
Published: April 21, 2026
Last updated: April 21, 2026
SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model
Vision-Language-Action (VLA) models offer a promising autonomous driving paradigm for leveraging world knowledge and reasoning capabilities, especially in long-tail scenarios. However, existing VLA models often suffer from high latency in action generation under an autoregressive generation framework and exhibit limited robustness. In this paper, we propose SpanVLA, a novel end-to-end autonomous driving framework integrating autoregressive reasoning with a flow-matching action expert. First, SpanVLA introduces an efficient bridge that leverages the vision and reasoning guidance of the VLM to plan future trajectories using a flow-matching policy conditioned on historical trajectory initialization, which significantly reduces inference time. Second, to further improve the performance and robustness of SpanVLA, we propose a GRPO-based post-training method that enables the model not only to learn from positive driving samples but also to avoid typical negative behaviors and to recover from them. We further introduce mReasoning, a new real-world driving reasoning dataset focusing on complex, reasoning-demanding scenarios and negative-recovery samples. Extensive experiments on NAVSIM (v1 and v2) demonstrate the competitive performance of SpanVLA. Additionally, qualitative results across diverse scenarios highlight the planning performance and robustness of our model.
Published: April 21, 2026
Last updated: April 21, 2026
Quantifying Data Similarity Using Cross Learning
Measuring dataset similarity is fundamental in machine learning, particularly for transfer learning and domain adaptation. In the context of supervised learning, most existing approaches quantify the similarity of two datasets based on their input feature distributions, neglecting label information and feature-response alignment. To address this, we propose the Cross-Learning Score (CLS), which measures dataset similarity through the bidirectional generalization performance of decision rules. We establish its theoretical foundation by linking CLS to cosine similarity between decision boundaries under canonical linear models, providing a geometric interpretation. A robust ensemble-based estimator is developed that is easy to implement and bypasses high-dimensional density estimation entirely. For transfer learning applications, we introduce a "transferable zones" framework that categorizes source datasets into positive, ambiguous, and negative transfer regions. To accommodate deep learning, we extend CLS to encoder-head architectures, aligning with modern representation-based pipelines. Extensive experiments on synthetic and real-world datasets validate the effectiveness of CLS for similarity measurement and transfer assessment.
Published: October 13, 2025
Last updated: April 21, 2026
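The bidirectional idea behind the abstract above can be sketched in a few lines. The combination rule (a plain average of the two transfer accuracies) and all data below are illustrative assumptions, not the paper's exact CLS definition:

```python
# Sketch: fit a classifier on each dataset, score it on the other, and
# combine the two transfer accuracies into a similarity score.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def make_data(w, n=500, d=5):
    X = rng.normal(size=(n, d))
    y = (X @ w + 0.1 * rng.normal(size=n) > 0).astype(int)
    return X, y

w = rng.normal(size=5)          # a shared "true" decision rule
XA, yA = make_data(w)           # dataset A
XB, yB = make_data(w)           # dataset B, generated by the same rule

clf_A = LogisticRegression().fit(XA, yA)
clf_B = LogisticRegression().fit(XB, yB)

# Bidirectional generalization: each model is scored on the *other*
# dataset; similar datasets yield high scores in both directions.
cls_score = 0.5 * (clf_A.score(XB, yB) + clf_B.score(XA, yA))
print(round(cls_score, 3))
```

Because both datasets share the same decision rule here, the score lands near 1; generating B from a different rule would drive it toward chance level.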
OmniGen2: Towards Instruction-Aligned Multimodal Generation
In this work, we introduce OmniGen2, a versatile and open-source generative model designed to provide a unified solution for diverse generation tasks, including text-to-image, image editing, and in-context generation. Unlike OmniGen v1, OmniGen2 features two distinct decoding pathways for text and image modalities, utilizing unshared parameters and a decoupled image tokenizer. This design enables OmniGen2 to build upon existing multimodal understanding models without the need to re-adapt VAE inputs, thereby preserving the original text generation capabilities. To facilitate the training of OmniGen2, we developed comprehensive data construction pipelines, encompassing image editing and in-context generation data. Additionally, we introduce a reflection mechanism tailored for image generation tasks and curate a dedicated reflection dataset based on OmniGen2. Despite its relatively modest parameter size, OmniGen2 achieves competitive results on multiple task benchmarks, including text-to-image and image editing. To further evaluate in-context generation, also referred to as subject-driven tasks, we introduce a new benchmark named OmniContext. OmniGen2 achieves state-of-the-art performance among open-source models in terms of consistency. We will release our models, training code, datasets, and data construction pipeline to support future research in this field. Project Page: https://vectorspacelab.github.io/OmniGen2; GitHub Link: https://github.com/VectorSpaceLab/OmniGen2
Published: June 23, 2025
Last updated: April 21, 2026
Regression with Large Language Models for Materials and Molecular Property Prediction
We demonstrate the ability of large language models (LLMs) to perform material and molecular property regression tasks, a significant deviation from the conventional LLM use case. We benchmark the Large Language Model Meta AI (LLaMA) 3 on several molecular properties in the QM9 dataset and 28 materials properties. Only composition-based input strings are used as model input, and we fine-tune only on the generative loss. We broadly find that LLaMA 3, when fine-tuned using the SMILES representation of molecules, provides useful regression results which can rival standard materials property prediction models like random forest or fully connected neural networks on the QM9 dataset. Not surprisingly, LLaMA 3 errors are 5-10x higher than those of the state-of-the-art models that were trained using a far more granular representation of molecules (e.g., atom types and their coordinates) for the same task. Similarly, LLaMA 3 provides comparable, although slightly worse, accuracy relative to random forest with elemental descriptors when using just the compound's chemical description on our set of 28 materials properties. Interestingly, LLaMA 3 provides improved predictions compared to GPT-3.5 and GPT-4o. This work highlights the versatility of LLMs, suggesting that LLM-like generative models can potentially transcend their traditional applications to tackle complex physical phenomena, thus paving the way for future research and applications in chemistry, materials science and other scientific domains.
Published: September 09, 2024
Last updated: April 21, 2026
Cross-Family Speculative Decoding for Polish Language Models on Apple Silicon: An Empirical Evaluation of Bielik 11B with UAG-Extended MLX-LM
Speculative decoding accelerates LLM inference by using a small draft model to propose k candidate tokens for a target model to verify. While effective for same-tokenizer pairs on high-bandwidth GPUs, its applicability to cross-family pairs with mismatched tokenizers and consumer-grade unified memory remains underexplored. We extend the MLX-LM framework with Universal Assisted Generation (UAG) to enable cross-tokenizer speculative decoding on Apple Silicon. We evaluate Bielik 11B-Instruct (Mistral-based) as the target model, paired with three draft models: Bielik 1.5B (Qwen-based with custom tokenizer), Qwen2.5-1.5B, and Llama 3.2-1B. Experiments on three Polish-language datasets (Wikipedia, pl_alpaca, synthetic) use draft lengths k in {2, 4, 6} to compare naive and context-aware token translation. Results show: (1) context-aware translation consistently improves acceptance rates across all configurations; (2) the Polish-specialized Bielik 1.5B achieves lower acceptance than general-purpose Qwen2.5 and Llama 3.2 drafters; (3) throughput on Apple Silicon is content-dependent, reaching 1.7x speedup for structured text but failing for varied instructions; and (4) verification cost on unified memory does not amortize as theory predicts because both models are memory-bandwidth bound, making sequential drafting expensive relative to batched verification. We propose a hardware-aware speedup formula and characterize conditions for cross-family speculative decoding on Apple Silicon. This is the first systematic evaluation of cross-family speculative decoding for Polish LLMs and the first empirical study of UAG-based decoding on unified memory architectures.
Published: March 22, 2026
Last updated: April 21, 2026
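As a back-of-the-envelope companion to the abstract above, the standard idealized latency model for speculative decoding fits in a few lines. This is not the paper's hardware-aware formula, and it deliberately ignores the unified-memory effects the paper highlights; the acceptance rate a, draft length k, and cost ratio c below are illustrative values:

```python
# Idealized speculative-decoding model: with per-token acceptance rate a
# and draft length k, one verification step yields (1 - a**(k+1)) / (1 - a)
# tokens in expectation; dividing by the relative cost of k draft steps
# plus one verification step gives an idealized speedup.
def expected_tokens(a: float, k: int) -> float:
    return (1 - a ** (k + 1)) / (1 - a)

def speedup(a: float, k: int, c: float) -> float:
    """c = draft-model cost per token relative to the target model."""
    return expected_tokens(a, k) / (k * c + 1)

# Longer drafts are not always better: past a point, rejected draft
# tokens outweigh the savings.
for k in (2, 4, 6):
    print(k, round(speedup(a=0.7, k=k, c=0.1), 2))
```

The paper's empirical finding that verification cost fails to amortize on unified memory amounts to the effective c being much larger than such GPU-style estimates assume.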
Face Anything: 4D Face Reconstruction from Any Image Sequence
Accurate reconstruction and tracking of dynamic human faces from image sequences is challenging because non-rigid deformations, expression changes, and viewpoint variations occur simultaneously, creating significant ambiguity in geometry and correspondence estimation. We present a unified method for high-fidelity 4D facial reconstruction based on canonical facial point prediction, a representation that assigns each pixel a normalized facial coordinate in a shared canonical space. This formulation transforms dense tracking and dynamic reconstruction into a canonical reconstruction problem, enabling temporally consistent geometry and reliable correspondences within a single feed-forward model. By jointly predicting depth and canonical coordinates, our method enables accurate depth estimation, temporally stable reconstruction, dense 3D geometry, and robust facial point tracking within a single architecture. We implement this formulation using a transformer-based model that jointly predicts depth and canonical facial coordinates, trained on multi-view geometry data non-rigidly warped into the canonical space. Extensive experiments on image and video benchmarks demonstrate state-of-the-art performance across reconstruction and tracking tasks, achieving approximately 3× lower correspondence error and faster inference than prior dynamic reconstruction methods, while improving depth accuracy by 16%.
Published: April 21, 2026
Last updated: April 21, 2026
QSLM: A Performance- and Memory-aware Quantization Framework with Tiered Search Strategy for Spike-driven Language Models
Large Language Models (LLMs) have emerged as prominent AI models for solving many natural language tasks due to their high performance (e.g., accuracy) and their ability to generate high-quality responses to given inputs. However, their large computational cost, huge memory footprints, and high processing power/energy demands make embedded deployment challenging. Among several tinyLLMs, recent works have proposed spike-driven language models (SLMs) that significantly reduce the processing power/energy of LLMs. However, their memory footprints still remain too large for low-cost, resource-constrained embedded devices. A manual quantization approach may effectively compress SLM memory footprints, but it requires huge design time and compute power to find a quantization setting for each network, making it unscalable across different networks, performance requirements, and memory budgets. To bridge this gap, we propose QSLM, a novel framework that performs automated quantization for compressing pre-trained SLMs while meeting performance and memory constraints. To achieve this, QSLM first identifies the hierarchy of the given network architecture and the sensitivity of network layers under quantization, then employs a tiered quantization strategy (i.e., global-, block-, and module-level quantization) while leveraging a multi-objective performance-and-memory trade-off function to select the final quantization setting. Experimental results indicate that QSLM reduces memory footprint by up to 86.5% and power consumption by up to 20%, while maintaining performance close to the original non-quantized model across different tasks (e.g., up to 84.4% accuracy for sentiment classification on the SST-2 dataset and a perplexity of 23.2 for text generation on WikiText-2), all within the given performance and memory constraints.
Published: January 02, 2026
Last updated: April 21, 2026
3D Foundation Model for Generalizable Disease Detection in Head Computed Tomography
Head computed tomography (CT) imaging is a widely used imaging modality with a multitude of medical indications, particularly in assessing pathology of the brain, skull, and cerebrovascular system. It is commonly the first-line imaging in neurologic emergencies given its rapidity of image acquisition, safety, cost, and ubiquity. Deep learning models may facilitate detection of a wide range of diseases. However, the scarcity of high-quality labels and annotations, particularly among less common conditions, significantly hinders the development of powerful models. To address this challenge, we introduce FM-CT: a Foundation Model for Head CT for generalizable disease detection, trained using self-supervised learning. Our approach pre-trains a deep learning model on a large, diverse dataset of 361,663 non-contrast 3D head CT scans without the need for manual annotations, enabling the model to learn robust, generalizable features. To investigate the potential of self-supervised learning in head CT, we employ both discrimination with self-distillation and masked image modeling, and construct our model in 3D rather than at the slice level (2D) to exploit the structure of head CT scans more comprehensively and efficiently. The model's downstream classification performance is evaluated using internal and three external datasets, encompassing both in-distribution (ID) and out-of-distribution (OOD) data. Our results demonstrate that the self-supervised foundation model significantly improves performance on downstream diagnostic tasks compared to models trained from scratch and previous 3D CT foundation models on scarce annotated datasets. This work highlights the effectiveness of self-supervised learning in medical imaging and sets a new benchmark for head CT image analysis in 3D, enabling broader use of artificial intelligence for head CT-based diagnosis.
Published: February 04, 2025
Last updated: April 21, 2026
MAGICIAN: Efficient Long-Term Planning with Imagined Gaussians for Active Mapping
Active mapping aims to determine how an agent should move to efficiently reconstruct unknown environments. Most existing approaches rely on greedy next-best-view prediction, resulting in inefficient exploration and incomplete reconstruction. To address this, we introduce MAGICIAN, a novel long-term planning framework that maximizes accumulated surface coverage gain through Imagined Gaussians, a scene representation based on 3D Gaussian Splatting, derived from a pre-trained occupancy network with strong structural priors. This representation enables efficient coverage gain computation for any novel viewpoint via fast volumetric rendering, allowing its integration into a tree-search algorithm for long-horizon planning. We update Imagined Gaussians and refine the trajectory in a closed loop. Our method achieves state-of-the-art performance across indoor and outdoor benchmarks with varying action spaces, highlighting the advantage of long-term planning in active mapping.
Published: March 23, 2026
Last updated: April 21, 2026
Epistemic orientation in parliamentary discourse is associated with deliberative democracy
The pursuit of truth is central to democratic deliberation and governance, yet political discourse reflects varying epistemic orientations, ranging from evidence-based reasoning grounded in verifiable information to intuition-based reasoning rooted in beliefs and subjective interpretation. We introduce a scalable approach to measure epistemic orientation using the Evidence-Minus-Intuition (EMI) score, derived from large language model (LLM) ratings and embedding-based semantic similarity. Applying this approach to 15 million parliamentary speech segments spanning 1946 to 2025 across seven countries, we examine temporal patterns in discourse and its association with deliberative democracy and governance. We find that EMI is positively associated with deliberative democracy within countries over time, with consistent relationships in both contemporaneous and lagged analyses. EMI is also positively associated with the transparency and predictable implementation of laws as a dimension of governance. These findings suggest that the epistemic nature of political discourse is crucial for both the quality of democracy and governance.
Published: April 21, 2026
Last updated: April 21, 2026
On two ways to use determinantal point processes for Monte Carlo integration
The standard Monte Carlo estimator I_N^MC of ∫ f dω relies on independent samples from ω and has variance of order 1/N. Replacing the samples with a determinantal point process (DPP), a repulsive distribution, keeps the estimator consistent, with variance rates that depend on how the DPP is adapted to f and ω. We examine two existing DPP-based estimators: one, by Bardenet & Hardy (2020), attains a rate of 𝒪(N^-(1+1/d)) for smooth f but relies on a fixed DPP; the other, by Ermakov & Zolotukhin (1960), is unbiased with a rate of order 1/N, like Monte Carlo, but its DPP is tailored to f. We revisit these estimators, generalize them to continuous settings, and provide sampling algorithms.
Published: April 21, 2026
Last updated: April 21, 2026
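The DPP-based estimators above are too involved for a short sketch, but the baseline they are compared against is easy to check empirically: the variance of the standard i.i.d. Monte Carlo estimator decays like 1/N. The integrand below is an arbitrary illustrative choice.

```python
# Empirical check of the 1/N variance rate of plain Monte Carlo
# integration (the baseline the DPP estimators improve on).
# Quadrupling N should divide the estimator's variance by roughly 4.
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: np.sin(2 * np.pi * x)   # integral over [0, 1] is 0

def mc_estimate(N):
    return f(rng.uniform(size=N)).mean()

def empirical_var(N, reps=2000):
    return np.var([mc_estimate(N) for _ in range(reps)])

v1, v2 = empirical_var(250), empirical_var(1000)
print(round(v1 / v2, 1))  # close to 4, matching the 1/N rate
```

Under the Bardenet & Hardy rate 𝒪(N^-(1+1/d)), the same quadrupling of N in dimension d = 1 would divide the variance by roughly 16 instead.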
Unveiling Fine-Grained Visual Traces: Evaluating Multimodal Interleaved Reasoning Chains in Multimodal STEM Tasks
Multimodal large language models (MLLMs) have shown promising reasoning abilities, yet evaluating their performance in specialized domains remains challenging. STEM reasoning is a particularly valuable testbed because it provides highly verifiable feedback, but existing benchmarks often permit unimodal shortcuts due to modality redundancy and focus mainly on final-answer accuracy, overlooking the reasoning process itself. To address this challenge, we introduce StepSTEM: a graduate-level benchmark of 283 problems across mathematics, physics, chemistry, biology, and engineering for fine-grained evaluation of cross-modal reasoning in MLLMs. StepSTEM is constructed through a rigorous curation pipeline that enforces strict complementarity between textual and visual inputs. We further propose a general step-level evaluation framework for both text-only chain-of-thought and interleaved image-text reasoning, using dynamic programming to align predicted reasoning steps with multiple reference solutions. Experiments across a wide range of models show that current MLLMs still rely heavily on textual reasoning, with even Gemini 3.1 Pro and Claude Opus 4.6 achieving only 38.29% accuracy. These results highlight substantial headroom for genuine cross-modal STEM reasoning and position StepSTEM as a benchmark for fine-grained evaluation of multimodal reasoning. Source code is available at https://github.com/lll-hhh/STEPSTEM.
Published: April 21, 2026
Last updated: April 21, 2026
Planning in entropy-regularized Markov decision processes and games
We propose SmoothCruiser, a new planning algorithm for estimating the value function in entropy-regularized Markov decision processes and two-player games, given a generative model of the environment. SmoothCruiser exploits the smoothness of the Bellman operator induced by the regularization to achieve a problem-independent sample complexity of order Õ(1/ε^4) for a desired accuracy ε, whereas for non-regularized settings no algorithms with guaranteed polynomial worst-case sample complexity are known.
Published: April 21, 2026
Last updated: April 21, 2026
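The smoothing the abstract above refers to is the soft-max Bellman operator of entropy-regularized MDPs. A toy value-iteration sketch (the MDP, the temperature tau, and all other parameters below are arbitrary illustrative choices, not the paper's algorithm) shows both the contraction and the bounded max-vs-soft-max gap:

```python
# Entropy regularization replaces the hard max in the Bellman backup with
# tau * logsumexp(Q / tau), a smooth surrogate that tends to max as tau -> 0.
import numpy as np

def soft_value(Q, tau):
    # V(s) = tau * logsumexp(Q[s, :] / tau), computed stably.
    m = Q.max(axis=1)
    return m + tau * np.log(np.exp((Q - m[:, None]) / tau).sum(axis=1))

rng = np.random.default_rng(2)
n_states, n_actions, gamma, tau = 4, 3, 0.9, 0.1
R = rng.normal(size=(n_states, n_actions))
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']

# The soft Bellman operator is still a gamma-contraction, so plain
# fixed-point iteration converges to a unique regularized value.
V = np.zeros(n_states)
for _ in range(500):
    Q = R + gamma * (P @ V)
    V = soft_value(Q, tau)

# logsumexp dominates max, and exceeds it by at most tau * log(n_actions).
hard = np.max(R + gamma * (P @ V), axis=1)
print(bool(np.all(V + 1e-9 >= hard) and
           np.all(V <= hard + tau * np.log(n_actions) + 1e-9)))
```

The bounded gap tau·log|A| is one way to see why regularization buys smoothness at only a small, controllable cost in value accuracy.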
Bayesian Event-Based Model for Disease Subtype and Stage Inference
Chronic diseases often progress differently across patients. Rather than randomly varying, there are typically a small number of subtypes for how a disease progresses across patients. To capture this structured heterogeneity, the Subtype and Stage Inference Event-Based Model (SuStaIn) estimates the number of subtypes, the order of disease progression for each subtype, and assigns each patient to a subtype from primarily cross-sectional data. It has been widely applied to uncover the subtypes of many diseases and inform our understanding of them. But how robust is its performance? In this paper, we develop a principled Bayesian subtype variant of the event-based model (BEBMS) and compare its performance to SuStaIn in a variety of synthetic data experiments with varied levels of model misspecification. BEBMS substantially outperforms SuStaIn across ordering, staging, and subtype assignment tasks. Further, we apply BEBMS and SuStaIn to a real-world Alzheimer's data set. We find BEBMS has results that are more consistent with the scientific consensus of Alzheimer's disease progression than SuStaIn.
Published: December 03, 2025
Last updated: April 21, 2026
A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding
Understanding artworks requires multi-step reasoning over visual content and cultural, historical, and stylistic context. While recent multimodal large language models show promise in artwork explanation, they rely on implicit reasoning and internalized knowledge, limiting interpretability and explicit evidence grounding. We propose A-MAR, an Agent-based Multimodal Art Retrieval framework that explicitly conditions retrieval on structured reasoning plans. Given an artwork and a user query, A-MAR first decomposes the task into a structured reasoning plan that specifies the goals and evidence requirements for each step. Retrieval is then conditioned on this plan, enabling targeted evidence selection and supporting step-wise, grounded explanations. To evaluate agent-based multimodal reasoning within the art domain, we introduce ArtCoT-QA. This diagnostic benchmark features multi-step reasoning chains for diverse art-related queries, enabling a granular analysis that extends beyond simple final-answer accuracy. Experiments on SemArt and Artpedia show that A-MAR consistently outperforms static, non-planned retrieval and strong MLLM baselines in final explanation quality, while evaluations on ArtCoT-QA further demonstrate its advantages in evidence grounding and multi-step reasoning ability. These results highlight the importance of reasoning-conditioned retrieval for knowledge-intensive multimodal understanding and position A-MAR as a step toward interpretable, goal-driven AI systems, with particular relevance to cultural industries. The code and data are available at: https://github.com/ShuaiWang97/A-MAR.
Published: April 21, 2026
Last updated: April 21, 2026
Quantum Non-Linear Bandit Optimization
We study non-linear bandit optimization, where the learner maximizes a black-box function with a zeroth-order function oracle, a setting that has been successfully applied in many critical applications such as drug discovery and materials design. Existing works have shown that, with the aid of quantum computing, it is possible to break the classical Ω(√T) regret lower bound and achieve a new O(polylog T) upper bound. However, they usually assume that the objective function sits within a reproducing kernel Hilbert space, and their algorithms suffer from the curse of dimensionality. In this paper, we propose the new Q-NLB-UCB algorithm, which enjoys an input-dimension-free O(polylog T) upper bound, making it applicable to high-dimensional tasks. At the heart of our algorithm design are a quantum Monte Carlo mean estimator, a parametric function approximation technique, and a new quantum non-linear regression oracle, which may be of independent interest in other quantum machine learning problems. We also validate the efficiency of our algorithm against other quantum algorithms on both high-dimensional synthetic and real-world tasks.
Published: March 04, 2025
Last updated: April 21, 2026
An Answer is just the Start: Related Insight Generation for Open-Ended Document-Grounded QA
Answering open-ended questions remains challenging for AI systems because it requires synthesis, judgment, and exploration beyond factual retrieval, and users often refine answers through multiple iterations rather than accepting a single response. Existing QA benchmarks do not explicitly support this refinement process. To address this gap, we introduce a new task, document-grounded related insight generation, where the goal is to generate additional insights from a document collection that help improve, extend, or rethink an initial answer to an open-ended question, ultimately supporting richer user interaction and a better overall question answering experience. We curate and release SCOpE-QA (Scientific Collections for Open-Ended QA), a dataset of 3,000 open-ended questions across 20 research collections. We present InsightGen, a two-stage approach that first constructs a thematic representation of the document collection using clustering, and then selects related context based on neighborhood selection from the thematic graph to generate diverse and relevant insights using LLMs. Extensive evaluation on 3,000 questions using two generation models and two evaluation settings shows that InsightGen consistently produces useful, relevant, and actionable insights, establishing a strong baseline for this new task.
Published: April 21, 2026
Last updated: April 21, 2026
PREF-XAI: Preference-Based Personalized Rule Explanations of Black-Box Machine Learning Models
Explainable artificial intelligence (XAI) has predominantly focused on generating model-centric explanations that approximate the behavior of black-box models. However, such explanations often overlook a fundamental aspect of interpretability: different users require different explanations depending on their goals, preferences, and cognitive constraints. Although recent work has explored user-centric and personalized explanations, most existing approaches rely on heuristic adaptations or implicit user modeling, lacking a principled framework for representing and learning individual preferences. In this paper, we introduce Preference-Based Explainable Artificial Intelligence (PREF-XAI), a novel perspective that reframes explanation as a preference-driven decision problem. Within PREF-XAI, explanations are not treated as fixed outputs, but as alternatives to be evaluated and selected according to user-specific criteria. Building on this perspective, we propose a methodology that combines rule-based explanations with formal preference learning. User preferences are elicited through a ranking of a small set of candidate explanations and modeled via an additive utility function inferred using robust ordinal regression. Experimental results on real-world datasets show that PREF-XAI can accurately reconstruct user preferences from limited feedback, identify highly relevant explanations, and discover novel explanatory rules not initially considered by the user. Beyond the proposed methodology, this work establishes a connection between XAI and preference learning, opening new directions for interactive and adaptive explanation systems.
Published: April 21, 2026
Last updated: April 21, 2026
Mask World Model: Predicting What Matters for Robust Robot Policy Learning
World models derived from large-scale video generative pre-training have emerged as a promising paradigm for generalist robot policy learning. However, standard approaches often focus on high-fidelity RGB video prediction, which can result in overfitting to irrelevant factors such as dynamic backgrounds and illumination changes. These distractions reduce the model's ability to generalize, ultimately leading to unreliable and fragile control policies. To address this, we introduce the Mask World Model (MWM), which leverages video diffusion architectures to predict the evolution of semantic masks instead of pixels. This shift imposes a geometric information bottleneck, forcing the model to capture essential physical dynamics and contact relations while filtering out visual noise. We seamlessly integrate this mask dynamics backbone with a diffusion-based policy head to enable robust end-to-end control. Extensive evaluations demonstrate the superiority of MWM on the LIBERO and RLBench simulation benchmarks, significantly outperforming the state-of-the-art RGB-based world models. Furthermore, real-world experiments and robustness evaluation (via random token pruning) reveal that MWM exhibits superior generalization capabilities and resilience to texture information loss.
Published: April 21, 2026
Last updated: April 21, 2026
IYKYK (But AI Doesn't): Automated Content Moderation Does Not Capture Communities' Heterogeneous Attitudes Towards Reclaimed Language
Reclaimed slur usage is a common and meaningful practice online for many marginalized communities. It serves as a source of solidarity, identity, and shared experience. However, contemporary automated and AI-based moderation tools for online content largely fail to distinguish between reclaimed and hateful uses of slurs, resulting in the suppression of marginalized voices. In this work, we use quantitative and qualitative methods to examine the attitudes of social media users in LGBTQIA+, Black, and women communities around reclaimed slurs targeting these groups, including the f-word, n-word, and b-word. With social media users from these communities, we collect and analyze an annotated online slur usage corpus. The corpus includes annotators' perceptions of whether an online text containing a slur should be flagged as hate speech, as well as contextual features of the slur usage. Across all communities and annotation questions, we observe low inter-annotator agreement, indicating substantial disagreement among in-group annotators. This is compounded by the fact that, absent clear contextual signals of identity and intent, even in-group members may disagree on how to interpret reclaimed slur usage online. Semi-structured interviews with annotators suggest that differences in lived experience and personal history contribute to this variation as well. We find poor alignment between annotator judgments and automated hate speech assessments produced by Perspective API. We further observe that certain features of a text, such as whether the slur usage was derogatory and whether the slur was targeted at oneself, are more strongly associated with whether annotators report the text as hate speech. Together, these findings highlight the inherent subjectivity and contextual nature of how marginalized communities interpret slurs online.
Published: April 17, 2026
Last updated: April 21, 2026
On the Derivation of Tightly-Coupled LiDAR-Inertial Odometry with VoxelMap
This note presents a concise mathematical formulation of tightly-coupled LiDAR-Inertial Odometry within an iterated error-state Kalman filter framework using a VoxelMap representation. Rather than proposing a new algorithm, it provides a clear and self-contained derivation that unifies the geometric modeling and probabilistic state estimation through consistent notation and explicit formulations. The document is intended to serve both as a technical reference and as an accessible entry point for a foundational understanding of the system architecture and estimation principles.
Published: March 16, 2026
Last updated: April 21, 2026
IR-Flow: Bridging Discriminative and Generative Image Restoration via Rectified Flow
In image restoration, single-step discriminative mappings often lack fine details because they learn a conditional expectation, whereas generative paradigms suffer from inefficient multi-step sampling and noise-residual coupling. To address this dilemma, we propose IR-Flow, a novel image restoration method based on Rectified Flow that serves as a unified framework bridging the gap between discriminative and generative paradigms. Specifically, we first construct multilevel data distribution flows, which expand the ability of models to learn from and adapt to various levels of degradation. Subsequently, cumulative velocity fields are proposed to learn transport trajectories across varying degradation levels, guiding intermediate states toward the clean target, while a multi-step consistency constraint is presented to enforce trajectory coherence and boost few-step restoration performance. We show that directly establishing a linear transport flow between degraded and clean image domains not only enables fast inference but also improves adaptability to out-of-distribution degradations. Extensive evaluations on deraining, denoising and raindrop removal tasks demonstrate that IR-Flow achieves competitive quantitative results with only a few sampling steps, offering an efficient and flexible framework that maintains an excellent distortion-perception balance. Our code is available at https://github.com/fanzh03/IR-Flow.
Published: April 21, 2026
Last updated: April 21, 2026
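The key property behind few-step rectified-flow inference can be shown in a few lines. This toy sketch (random vectors standing in for degraded and clean images, not the authors' code) contrasts a straight-line transport path, whose constant velocity makes a single Euler step exact, with a curved path that requires many integration steps.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(size=5)   # "degraded" sample (toy vector)
x1 = rng.normal(size=5)   # "clean" sample (toy vector)

# Rectified-flow path x_t = (1-t) x0 + t x1 has constant velocity
# v = x1 - x0, so one Euler step integrates the ODE exactly.
one_step = x0 + 1.0 * (x1 - x0)
assert np.allclose(one_step, x1)

# A curved path, e.g. x_t = cos(pi t/2) x0 + sin(pi t/2) x1, has a
# time-varying velocity; Euler integration needs many steps.
def curved_velocity(t):
    return (np.pi / 2) * (-np.sin(np.pi * t / 2) * x0
                          + np.cos(np.pi * t / 2) * x1)

def euler(n_steps):
    x, dt = x0.copy(), 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * curved_velocity(k * dt)
    return x

err_1 = np.linalg.norm(euler(1) - x1)
err_100 = np.linalg.norm(euler(100) - x1)
print(err_1, err_100)   # one step: large error; 100 steps: small error
```

In practice a network is trained to regress the constant velocity x1 - x0 from intermediate states, which is what makes the learned flow nearly straight and inference cheap.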
ORSIFlow: Saliency-Guided Rectified Flow for Optical Remote Sensing Salient Object Detection
Optical Remote Sensing Image Salient Object Detection (ORSI-SOD) remains challenging due to complex backgrounds, low contrast, irregular object shapes, and large variations in object scale. Existing discriminative methods directly regress saliency maps, while recent diffusion-based generative approaches suffer from stochastic sampling and high computational cost. In this paper, we propose ORSIFlow, a saliency-guided rectified flow framework that reformulates ORSI-SOD as a deterministic latent flow generation problem. ORSIFlow performs saliency mask generation in a compact latent space constructed by a frozen variational autoencoder, enabling efficient inference with only a few steps. To enhance saliency awareness, we design a Salient Feature Discriminator for global semantic discrimination and a Salient Feature Calibrator for precise boundary refinement. Extensive experiments on multiple public benchmarks show that ORSIFlow achieves state-of-the-art performance with significantly improved efficiency.
Published: March 30, 2026
Last updated: April 21, 2026
MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation
Recent advances in Diffusion Transformers (DiTs) have enabled high-quality joint audio-video generation, producing videos with synchronized audio within a single model. However, existing controllable generation frameworks are typically restricted to video-only control. This restricts comprehensive controllability and often leads to suboptimal cross-modal alignment. To bridge this gap, we present MMControl, which enables users to perform Multi-Modal Control in joint audio-video generation. MMControl introduces a dual-stream conditional injection mechanism. It incorporates both visual and acoustic control signals, including reference images, reference audio, depth maps, and pose sequences, into a joint generation process. These conditions are injected through bypass branches into a joint audio-video Diffusion Transformer, enabling the model to simultaneously generate identity-consistent video and timbre-consistent audio under structural constraints. Furthermore, we introduce modality-specific guidance scaling, which allows users to independently and dynamically adjust the influence strength of each visual and acoustic condition at inference time. Extensive experiments demonstrate that MMControl achieves fine-grained, composable control over character identity, voice timbre, body pose, and scene layout in joint audio-video generation.
Published: April 21, 2026
Last updated: April 21, 2026
Exploring Language-Agnosticity in Function Vectors: A Case Study in Machine Translation
Function vectors (FVs) are vector representations of tasks extracted from model activations during in-context learning. While prior work has shown that multilingual model representations can be language-agnostic, it remains unclear whether the same holds for function vectors. We study whether FVs exhibit language-agnosticity, using machine translation as a case study. Across three decoder-only multilingual LLMs, we find that translation FVs extracted from a single English→Target direction transfer to other target languages, consistently improving the rank of correct translation tokens across multiple unseen languages. Ablation results show that removing the FV degrades translation across languages with limited impact on unrelated tasks. We further show that base-model FVs transfer to instruction-tuned variants and partially generalize from word-level to sentence-level translation.
Published: April 21, 2026
Last updated: April 21, 2026
Learning Hybrid-Control Policies for High-Precision In-Contact Manipulation Under Uncertainty
Reinforcement learning-based control policies have been frequently demonstrated to be more effective than analytical techniques for many manipulation tasks. Commonly, these methods learn neural control policies that predict end-effector pose changes directly from observed state information. For tasks like inserting delicate connectors which induce force constraints, pose-based policies have limited explicit control over force and rely on carefully tuned low-level controllers to avoid executing damaging actions. In this work, we present hybrid position-force control policies that learn to dynamically select when to use force or position control in each control dimension. To improve learning efficiency of these policies, we introduce Mode-Aware Training for Contact Handling (MATCH) which adjusts policy action probabilities to explicitly mirror the mode selection behavior in hybrid control. We validate MATCH's learned policy effectiveness using fragile peg-in-hole tasks under extreme localization uncertainty. We find MATCH substantially outperforms pose-control policies -- solving these tasks with up to 10% higher success rates and 5x fewer peg breaks than pose-only policies under common types of state estimation error. MATCH also demonstrates data efficiency equal to pose-control policies, despite learning in a larger and more complex action space. In over 1600 sim-to-real experiments, we find MATCH succeeds twice as often as pose policies in high-noise settings (68% vs. 33%) and applies ~30% less force on average compared to variable impedance policies on a Franka FR3 in laboratory conditions.
Published: April 21, 2026
Last updated: April 21, 2026
MedFlowSeg: Flow Matching for Medical Image Segmentation with Frequency-Aware Attention
Flow matching has recently emerged as a principled framework for learning continuous-time transport maps, enabling efficient deterministic generation without relying on stochastic diffusion processes. While generative modeling has shown promise for medical image segmentation, particularly in capturing uncertainty and complex anatomical variability, existing approaches are predominantly built upon diffusion models, which incur substantial computational overhead due to iterative sampling and are often constrained by UNet-based parameterizations. In this work, we introduce MedFlowSeg, a conditional flow matching framework that formulates medical image segmentation as learning a time-dependent vector field that transports a simple prior distribution to the target segmentation distribution. This formulation enables one-step deterministic inference while preserving the expressiveness of generative modeling. We further develop a dual-conditioning mechanism to incorporate structured priors into the learned flow. Specifically, we propose a Dual-Branch Spatial Attention module that injects multi-scale structural information into the flow field, and a Frequency-Aware Attention module that models cross-domain interactions between spatial and spectral representations via discrepancy-aware fusion and time-dependent modulation. Together, these components provide an effective parameterization of conditional flows that capture both global anatomical structure and fine-grained boundary details. We provide extensive empirical validation across multiple medical imaging modalities, demonstrating that MedFlowSeg achieves state-of-the-art performance while significantly reducing computational cost compared to diffusion-based methods. Our results highlight the potential of flow matching as a theoretically grounded and computationally efficient alternative for generative medical image segmentation.
Published: April 21, 2026
Last updated: April 21, 2026
Fast and Robust Diffusion Posterior Sampling for MR Image Reconstruction Using the Preconditioned Unadjusted Langevin Algorithm
Purpose: The Unadjusted Langevin Algorithm (ULA) in combination with diffusion models can generate high quality MRI reconstructions with uncertainty estimation from highly undersampled k-space data. However, sampling methods such as diffusion posterior sampling (DPS) or likelihood annealing suffer from long reconstruction times and the need for parameter tuning. The purpose of this work is to develop a robust sampling algorithm with fast convergence. Theory and Methods: In the reverse diffusion process used for sampling the posterior, the exact likelihood is multiplied with the diffused prior at all noise scales. To overcome the issue of slow convergence, preconditioning is used. The method is trained on fastMRI data and tested on retrospectively undersampled brain data of a healthy volunteer. Results: For posterior sampling in Cartesian and non-Cartesian accelerated MRI the new approach outperforms annealed sampling and DPS in terms of reconstruction speed and sample quality. Conclusion: The proposed exact likelihood with preconditioning enables rapid and reliable posterior sampling across various MRI reconstruction tasks without the need for parameter tuning.
Published: December 05, 2025
Last updated: April 21, 2026
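The benefit of preconditioning a Langevin sampler is easy to see on a toy target. The sketch below (my own illustration; a 2-D Gaussian stands in for the MRI posterior, and the exact covariance stands in for the paper's preconditioner) runs the preconditioned unadjusted Langevin update x <- x + tau*P*grad log p(x) + sqrt(2*tau)*P^(1/2)*xi.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ill-conditioned 2-D Gaussian posterior N(mu, Sigma)
# (stand-in for an MRI reconstruction posterior).
mu = np.array([1.0, -2.0])
Sigma = np.array([[100.0, 0.0], [0.0, 0.01]])
Sigma_inv = np.linalg.inv(Sigma)

def grad_log_p(x):
    return -Sigma_inv @ (x - mu)

def preconditioned_ula(n_steps, tau, P):
    """ULA with preconditioner P; P = I recovers plain ULA."""
    P_sqrt = np.linalg.cholesky(P)
    x, samples = np.zeros(2), []
    for _ in range(n_steps):
        xi = rng.standard_normal(2)
        x = x + tau * P @ grad_log_p(x) + np.sqrt(2 * tau) * P_sqrt @ xi
        samples.append(x.copy())
    return np.array(samples)

# With P = I, stability forces tau below the stiff direction's variance
# (0.01), so the slow direction (variance 100) mixes very slowly.
# P = Sigma equalizes both directions and converges in a few steps.
chain = preconditioned_ula(2000, tau=0.5, P=Sigma)
est_mean = chain[500:].mean(axis=0)
print("posterior mean estimate:", est_mean)
```

The unadjusted chain carries an O(tau) bias in its stationary variance, which is why the paper pairs preconditioning with the exact likelihood rather than relying on step-size tuning alone.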
InHabit: Leveraging Image Foundation Models for Scalable 3D Human Placement
Training embodied agents to understand 3D scenes as humans do requires large-scale data of people meaningfully interacting with diverse environments, yet such data is scarce. Real-world motion capture is costly and limited to controlled settings, while existing synthetic datasets rely on simple geometric heuristics that ignore rich scene context. In contrast, 2D foundation models trained on internet-scale data have implicitly acquired commonsense knowledge of human-environment interactions. To transfer this knowledge into 3D, we introduce InHabit, a fully automatic and scalable data generator for populating 3D scenes with interacting humans. InHabit follows a render-generate-lift principle: given a rendered 3D scene, a vision-language model proposes contextually meaningful actions, an image-editing model inserts a human, and an optimization procedure lifts the edited result into physically plausible SMPL-X bodies aligned with the scene geometry. Applied to Habitat-Matterport3D, InHabit produces the first large-scale photorealistic 3D human-scene interaction dataset, containing 78K samples across 800 building-scale scenes with complete 3D geometry, SMPL-X bodies, and RGB images. Augmenting standard training data with our samples improves RGB-based 3D human-scene reconstruction and contact estimation, and in a perceptual user study our data is preferred in 78% of cases over the state of the art.
Published: April 21, 2026
Last updated: April 21, 2026
Conjuring Semantic Similarity
The semantic similarity between sample expressions measures the distance between their latent 'meaning'. These meanings are themselves typically represented by textual expressions. We propose a novel approach whereby the semantic similarity among textual expressions is based not on other expressions they can be rephrased as, but rather based on the imagery they evoke. While this is not possible with humans, generative models allow us to easily visualize and compare generated images, or their distribution, evoked by a textual prompt. Therefore, we characterize the semantic similarity between two textual expressions simply as the distance between image distributions they induce, or 'conjure.' We show that by choosing the Jeffreys divergence between the reverse-time diffusion stochastic differential equations (SDEs) induced by each textual expression, this can be directly computed via Monte-Carlo sampling. Our method contributes a novel perspective on semantic similarity that not only aligns with human-annotated scores, but also opens up new avenues for the evaluation of text-conditioned generative models while offering better interpretability of their learnt representations.
Published: October 21, 2024
Last updated: April 21, 2026
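The Monte-Carlo estimator of the Jeffreys divergence is straightforward once samples and log-densities are available. In this sketch (my own illustration; 1-D Gaussians stand in for the image distributions "conjured" by two prompts, whereas the paper works with reverse-time diffusion SDEs) the symmetric divergence J(p, q) = KL(p||q) + KL(q||p) is estimated by sampling from each side.

```python
import numpy as np

rng = np.random.default_rng(0)

# Gaussian log-density, used as a stand-in for the model's log-likelihood.
def log_pdf(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

mu_p, s_p = 0.0, 1.0   # distribution conjured by "prompt A"
mu_q, s_q = 1.0, 1.0   # distribution conjured by "prompt B"

# Jeffreys divergence J(p, q) = KL(p||q) + KL(q||p), each term estimated
# by Monte-Carlo sampling from the corresponding distribution.
n = 200_000
xp = rng.normal(mu_p, s_p, n)
xq = rng.normal(mu_q, s_q, n)
kl_pq = np.mean(log_pdf(xp, mu_p, s_p) - log_pdf(xp, mu_q, s_q))
kl_qp = np.mean(log_pdf(xq, mu_q, s_q) - log_pdf(xq, mu_p, s_p))
jeffreys = kl_pq + kl_qp

# Closed form for equal-variance Gaussians: (mu_p - mu_q)^2 / sigma^2 = 1.0
print(jeffreys)
```

The symmetry of the Jeffreys divergence is what lets it serve as a semantic distance: neither expression is privileged as the "reference" distribution.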
Reasoning Models Will Sometimes Lie About Their Reasoning
Hint-based faithfulness evaluations have established that Large Reasoning Models (LRMs) may not say what they think: they do not always volunteer information about how key parts of the input (e.g. answer hints) influence their reasoning. Yet, these evaluations also fail to specify what models should do when confronted with hints or other unusual prompt content -- even though versions of such instructions are standard security measures (e.g. for countering prompt injections). Here, we study faithfulness under this more realistic setting in which models are explicitly alerted to the possibility of unusual inputs. We find that such instructions can yield strong results on faithfulness metrics from prior work. However, results on new, more granular metrics proposed in this work paint a mixed picture: although models may acknowledge the presence of hints, they will often deny intending to use them -- even when permitted to use hints and even when it can be demonstrated that they are using them. Our results thus raise broader challenges for CoT monitoring and interpretability.
Published: January 12, 2026
Last updated: April 21, 2026
Budgeted Online Influence Maximization
We introduce a new budgeted framework for online influence maximization, considering the total cost of an advertising campaign instead of the common cardinality constraint on a chosen influencer set. Our approach better models the real-world setting where the cost of influencers varies and advertisers want to find the best value for their overall social advertising budget. We propose an algorithm assuming an independent cascade diffusion model and edge level semi-bandit feedback, and provide both theoretical and experimental results. Our analysis is also valid for the cardinality constraint setting and improves the state of the art regret bound in this case.
Published: April 21, 2026
Last updated: April 21, 2026
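The shift from a cardinality constraint to a budget changes the oracle from "pick the k best seeds" to a knapsack-style selection. A common building block is greedy selection by marginal gain per unit cost, sketched below (my own illustration with made-up numbers; the paper's algorithm additionally handles the independent cascade model, semi-bandit feedback, and exploration, none of which is reproduced here).

```python
# Greedy value-per-cost seed selection under a budget.
def budgeted_greedy(costs, gains, budget):
    """Pick influencers by (estimated) marginal gain per unit cost
    until the budget is exhausted."""
    chosen, spent = [], 0.0
    order = sorted(range(len(costs)),
                   key=lambda i: gains[i] / costs[i], reverse=True)
    for i in order:
        if spent + costs[i] <= budget:
            chosen.append(i)
            spent += costs[i]
    return chosen, spent

# Three cheap mid-value influencers beat one expensive star under this rule.
costs = [10.0, 2.0, 2.0, 2.0]
gains = [12.0, 4.0, 4.0, 3.0]
chosen, spent = budgeted_greedy(costs, gains, budget=10.0)
print(chosen, spent)
```

Note that a pure cardinality constraint (k = 1) would pick the expensive influencer with gain 12, illustrating how the budgeted objective changes which seed sets are optimal.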
Multi-Cycle Spatio-Temporal Adaptation in Human-Robot Teaming
Effective human-robot teaming is crucial for the practical deployment of robots in human workspaces. However, optimizing joint human-robot plans remains a challenge due to the difficulty of modeling individualized human capabilities and preferences. While prior research has leveraged the multi-cycle structure of domains like manufacturing to learn an individual's tendencies and adapt plans over repeated interactions, these techniques typically consider task-level and motion-level adaptation in isolation. Task-level methods optimize allocation and scheduling but often ignore spatial interference in close-proximity scenarios; conversely, motion-level methods focus on collision avoidance while ignoring the broader task context. This paper introduces RAPIDDS, a framework that unifies these approaches by modeling an individual's spatial behavior (motion paths) and temporal behavior (time required to complete tasks) over multiple cycles. RAPIDDS then jointly adapts task schedules and steers diffusion models of robot motions to maximize efficiency and minimize proximity, accounting for these individualized models. We demonstrate the importance of this dual adaptation through an ablation study in simulation and a physical robot scenario using a 7-DOF robot arm. Finally, we present a user study (n=32) showing significant plan improvement compared to non-adaptive systems across both objective metrics, such as efficiency and proximity, and subjective measures, including fluency and user preference. See this paper's companion video at: https://youtu.be/55Q3lq1fINs.
Published: April 21, 2026
Last updated: April 21, 2026
HardNet++: Nonlinear Constraint Enforcement in Neural Networks
Enforcing constraint satisfaction in neural network outputs is critical for safety, reliability, and physical fidelity in many control and decision-making applications. While soft-constrained methods penalize constraint violations during training, they do not guarantee constraint adherence during inference. Other approaches guarantee constraint satisfaction via specific parameterizations or a projection layer, but are tailored to specific forms (e.g., linear constraints), limiting their utility in other general problem settings. Many real-world problems of interest are nonlinear, motivating the development of methods that can enforce general nonlinear constraints. To this end, we introduce HardNet++, a constraint-enforcement method that simultaneously satisfies linear and nonlinear equality and inequality constraints. Our approach iteratively adjusts the network output via damped local linearizations. Each iteration is differentiable, admitting an end-to-end training framework, where the constraint satisfaction layer is active during training. We show that under certain regularity conditions, this procedure can enforce nonlinear constraint satisfaction to arbitrary tolerance. Finally, we demonstrate tight constraint adherence without loss of optimality in a learning-for-optimization context, where we apply this method to a model predictive control problem with nonlinear state constraints.
Published: April 21, 2026
Last updated: April 21, 2026
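The damped local-linearization step described above can be sketched concretely. In this toy (my own illustration; the function name `enforce` and the unit-circle constraint are hypothetical, not from the paper), each iteration solves the linearized constraint J dx = -alpha * g(x) and applies the minimum-norm correction, pulling an unconstrained network output onto a nonlinear equality constraint.

```python
import numpy as np

# Toy nonlinear equality constraint g(x) = ||x||^2 - 1 = 0 (unit circle).
def g(x):
    return np.array([x @ x - 1.0])

def jac(x):
    return 2.0 * x.reshape(1, -1)

def enforce(y, alpha=0.5, n_iter=50):
    """Iteratively correct output y via damped local linearizations:
    dx = -alpha * J^T (J J^T)^{-1} g(x), the minimum-norm solution of
    the linearized constraint. Every step is differentiable in y."""
    x = y.astype(float).copy()
    for _ in range(n_iter):
        J, r = jac(x), g(x)
        x = x - alpha * J.T @ np.linalg.solve(J @ J.T, r)
    return x

y = np.array([3.0, 4.0])          # unconstrained "network output"
x = enforce(y)
print(np.linalg.norm(x), g(x))    # norm ~ 1, constraint residual ~ 0
```

Because each correction is a composition of differentiable operations, gradients flow through the whole enforcement loop, matching the paper's end-to-end training setup in which the constraint layer is active during training.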
Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language
At present, executable visual workflows have emerged as a mainstream paradigm in real-world industrial deployments, offering strong reliability and controllability. However, in current practice, such workflows are almost entirely constructed through manual engineering: developers must carefully design workflows, write prompts for each step, and repeatedly revise the logic as requirements evolve, making development costly, time-consuming, and error-prone. To study whether large language models can automate this multi-round interaction process, we introduce Chat2Workflow, a benchmark for generating executable visual workflows directly from natural language, and propose a robust agentic framework to mitigate recurrent execution errors. Chat2Workflow is built from a large collection of real-world business workflows, with each instance designed so that the generated workflow can be transformed and directly deployed to practical workflow platforms such as Dify and Coze. Experimental results show that while state-of-the-art language models can often capture high-level intent, they struggle to generate correct, stable, and executable workflows, especially under complex or changing requirements. Although our agentic framework yields resolve-rate gains of up to 5.34%, the remaining real-world gap positions Chat2Workflow as a foundation for advancing industrial-grade automation. Code is available at https://github.com/zjunlp/Chat2Workflow.
Published: April 21, 2026
Last updated: April 21, 2026
From Top-1 to Top-K: A Reproducibility Study and Benchmarking of Counterfactual Explanations for Recommender Systems
Counterfactual explanations (CEs) provide an intuitive way to understand recommender systems by identifying minimal modifications to user-item interactions that alter recommendation outcomes. Existing CE methods for recommender systems, however, have been evaluated under heterogeneous protocols, using different datasets, recommenders, metrics, and even explanation formats, which hampers reproducibility and fair comparison. Our paper systematically reproduces, re-implements, and re-evaluates eleven state-of-the-art CE methods for recommender systems, covering both native explainers (e.g., LIME-RS, SHAP, PRINCE, ACCENT, LXR, GREASE) and specific graph-based explainers originally proposed for GNNs. We propose a unified benchmarking framework to assess explainers along three dimensions: explanation format (implicit vs. explicit), evaluation level (item-level vs. list-level), and perturbation scope (user interaction vectors vs. user-item interaction graphs). Our evaluation protocol includes effectiveness, sparsity, and computational complexity metrics, and extends existing item-level assessments to top-K list-level explanations. Through extensive experiments on three real-world datasets and six representative recommender models, we analyze how well previously reported strengths of CE methods generalize across diverse setups. We observe that the trade-off between effectiveness and sparsity depends strongly on the specific method and evaluation setting, particularly under the explicit format; in addition, explainer performance remains largely consistent across item-level and list-level evaluations, and several graph-based explainers exhibit notable scalability limitations on large recommender graphs. Our results refine and challenge earlier conclusions about the robustness and practicality of CE generation methods in recommender systems: https://github.com/L2R-UET/CFExpRec.
Published: April 21, 2026
Last updated: April 21, 2026
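The basic counterfactual search these benchmarks evaluate can be sketched on a tiny dot-product recommender. This toy (my own illustration, loosely in the spirit of ACCENT-style search; all item vectors and helper names are made up) greedily removes the past interactions that contribute most to the current top-1 item's score until the recommendation flips.

```python
import numpy as np

# Toy counterfactual explanation for a dot-product recommender:
# find a small set of interactions whose removal flips the top-1 item.
items = np.array([
    [1.0, 0.0],   # 0: interacted
    [0.8, 0.2],   # 1: interacted
    [0.0, 1.0],   # 2: interacted
    [0.1, 0.9],   # 3: candidate
    [0.9, 0.1],   # 4: candidate
])
history, candidates = [0, 1, 2], [3, 4]

def top1(hist):
    profile = items[hist].mean(axis=0)          # user profile
    scores = {c: profile @ items[c] for c in candidates}
    return max(scores, key=scores.get)

target = top1(history)                           # current recommendation
# Greedily drop the interaction contributing most to the target's score.
removed, hist = [], list(history)
while top1(hist) == target and len(hist) > 1:
    i = max(hist, key=lambda j: items[j] @ items[target])
    hist.remove(i)
    removed.append(i)
print("top-1:", target, "removal:", removed, "new top-1:", top1(hist))
```

Here a single removed interaction suffices, which is the sparsity property the benchmark measures; real CE methods trade this sparsity against how reliably the removal actually changes the recommendation.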