The Forward-Forward (FF) Algorithm is a recently proposed learning procedure for neural networks that employs two forward passes instead of the traditional forward and backward passes used in backpropagation. However, FF remains largely confined to supervised settings, leaving a gap in domains where learning signals arise more naturally, such as reinforcement learning (RL). In this work, inspired by FF's goodness function computed from layer activity statistics, we introduce Action-conditioned Root mean squared Q-Functions (ARQ), a novel value estimation method that applies a goodness function and action conditioning to local RL with temporal difference learning. Despite its simplicity and biological grounding, our approach achieves superior performance compared to state-of-the-art local backprop-free RL methods on the MinAtar and DeepMind Control Suite benchmarks, while also outperforming algorithms trained with backpropagation on most tasks.
Backprop-based reinforcement learning (RL) relies on global error signals that are difficult to reconcile with biologically plausible learning. Recent local learning methods such as Forward-Forward show that meaningful representations can be learned without backpropagation, motivating their extension to RL. We propose Action-Conditioned Root Mean Squared Q-Functions (ARQ), a fully local, backprop-free RL algorithm that reformulates Q-learning around a vector-based, action-conditioned goodness objective. By estimating Q-values through layer-local RMS activations, ARQ removes architectural constraints present in prior local RL methods. Despite relying only on local updates, ARQ achieves competitive performance across standard RL benchmarks.
Inspired by the Forward-Forward algorithm's goodness function, which is computed from layer activity statistics, we propose Action-conditioned Root mean squared Q-Functions (ARQ), a simple vector-based alternative to traditional scalar Q-value predictors, designed for local RL.
ARQ is composed of two key ingredients: (1) a vector-based goodness function, the root mean square of a layer's activations, which serves directly as that layer's Q-value estimate, and (2) action conditioning at the input level, so that each layer's activity reflects a specific state-action pair.
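To make these two ingredients concrete, here is a minimal sketch of a single action-conditioned layer whose Q-value estimate is the root mean square of its activations. The class name `RMSQLayer`, the one-hot action encoding, the hidden width, and the single linear map are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSQLayer(nn.Module):
    """A single action-conditioned layer whose RMS activation is its Q estimate.

    Hypothetical sketch: the one-hot action encoding, hidden width, and single
    linear map are illustrative choices rather than the paper's architecture.
    """
    def __init__(self, in_dim, num_actions, hidden_dim=128):
        super().__init__()
        self.num_actions = num_actions
        # Early fusion: the action is concatenated with the input features.
        self.fc = nn.Linear(in_dim + num_actions, hidden_dim)

    def forward(self, x, action):
        a_onehot = F.one_hot(action, self.num_actions).float()
        h = F.relu(self.fc(torch.cat([x, a_onehot], dim=-1)))
        # Vector-based Q-value: the root mean square of the layer's activations
        # plays the role of the Forward-Forward "goodness".
        q = torch.sqrt((h ** 2).mean(dim=-1) + 1e-8)
        return h, q
```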
ARQ can be readily implemented on top of Artificial Dopamine (AD), taking full advantage of its non-linearities and attention-like mechanisms while maintaining biological plausibility.
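The sketch below illustrates how a stack of such layers can be trained with layer-local temporal-difference updates: each layer regresses its own RMS-based Q estimate toward a bootstrapped target, and activations are detached between layers so no gradient crosses layer boundaries. It builds on the hypothetical `RMSQLayer` above and does not reproduce AD's attention-like cells; the shared greedy target, averaged over per-layer target-network estimates, is an illustrative choice rather than the paper's exact target construction.

```python
import torch
import torch.nn.functional as F

def local_td_step(layers, target_layers, optimizers, batch, gamma=0.99):
    """One layer-local TD(0) update over a batch of transitions.

    `layers` / `target_layers` are matching stacks of RMSQLayer-style modules
    (hypothetical, from the sketch above); `optimizers` holds one optimizer per
    layer. No gradient crosses layer boundaries: each layer's loss is computed
    from its own activations, and its input is detached.
    """
    obs, action, reward, next_obs, done = batch
    num_actions = layers[0].num_actions

    # Bootstrapped target: greedy over actions, averaging per-layer target Qs.
    with torch.no_grad():
        q_per_action = []
        for a in range(num_actions):
            a_vec = torch.full_like(action, a)
            x, qs = next_obs, []
            for tl in target_layers:
                h, q = tl(x, a_vec)
                qs.append(q)
                x = h
            q_per_action.append(torch.stack(qs).mean(dim=0))
        next_q = torch.stack(q_per_action, dim=-1).max(dim=-1).values
        target = reward + gamma * (1.0 - done.float()) * next_q

    # Layer-local updates: each layer fits its own Q estimate to the target.
    x = obs
    for layer, opt in zip(layers, optimizers):
        h, q = layer(x, action)
        loss = F.mse_loss(q, target)
        opt.zero_grad()
        loss.backward()      # gradients stay inside this layer
        opt.step()
        x = h.detach()       # block gradient flow to the next layer
```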
We evaluate ARQ on two challenging benchmarks designed to test RL algorithms in settings where local methods remain viable: MinAtar (5 discrete action games) and the DeepMind Control Suite (5 continuous control tasks).
Key Findings: ARQ consistently outperforms current local RL methods and surpasses conventional backprop-based value-learning methods in most games, demonstrating strong decision-making capabilities without relying on backpropagation. On MinAtar, ARQ shows particularly strong improvements on Breakout, SpaceInvaders, Seaquest, and Asterix. On DMC tasks, ARQ matches or exceeds the performance of SAC and TD-MPC2 while maintaining biological plausibility.
We ablate the effect of action conditioning on our method. Our results show that without input-level action conditioning, the network struggles to differentiate between action-specific Q-values, leading to significantly degraded performance. This demonstrates that early fusion of action information is essential for learning meaningful state-action representations in a local learning framework. Interestingly, the benefit of action conditioning is only mild for AD, while it is substantial for ARQ.
We visualize the hidden layer activations using 2-component PCA on the MinAtar Breakout environment.
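A sketch of this visualization, assuming the layer activations, Q-value estimates, and chosen actions have already been collected into NumPy arrays; the function and variable names here are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

def plot_hidden_pca(hidden, q_values, actions):
    """Project hidden activations to 2D with PCA, colored by Q-value and action.

    `hidden` is an (N, D) array of one layer's activations collected from
    rollout states; `q_values` and `actions` are the matching (N,) arrays.
    """
    coords = PCA(n_components=2).fit_transform(hidden)

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    sc = axes[0].scatter(coords[:, 0], coords[:, 1], c=q_values, cmap="viridis", s=8)
    fig.colorbar(sc, ax=axes[0], label="Q-value")
    axes[0].set_title("Colored by Q-value")

    axes[1].scatter(coords[:, 0], coords[:, 1], c=actions, cmap="tab10", s=8)
    axes[1].set_title("Colored by action")

    for ax in axes:
        ax.set_xlabel("PC 1")
        ax.set_ylabel("PC 2")
    plt.tight_layout()
    plt.show()
```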
Key Observation: Without action conditioning, activations cluster almost entirely by action identity and show no meaningful correlation with Q-values, indicating that action-related variance dominates the representation space. With action conditioning, representations become more state-driven and exhibit a mild positive relationship with Q-values, suggesting that the model can allocate capacity toward value-relevant structure rather than implicitly inferring action identity.
We ablate the choice of goodness function by comparing ARQ (root mean square) with ARQ-MS (mean square). We find that ARQ produces moderate, stable goodness magnitudes throughout training, while ARQ-MS shows large initial spikes followed by sharply reduced variability. This suggests that the square root in RMS normalizes the magnitude of the activations, preventing numerical instability while maintaining sensitivity to layer activity.
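The two variants differ only in the square root. A small sketch (tensor shapes, names, and the epsilon are illustrative) showing why the RMS form keeps magnitudes tame:

```python
import torch

def goodness_rms(h, eps=1e-8):
    """ARQ: root mean square of a layer's activations."""
    return torch.sqrt((h ** 2).mean(dim=-1) + eps)

def goodness_ms(h):
    """ARQ-MS: mean of squared activations (Forward-Forward-style, no sqrt)."""
    return (h ** 2).mean(dim=-1)

# The square root compresses large activations: scaling the layer activity by
# 10x scales the RMS goodness by ~10x, but the MS goodness by ~100x.
h = torch.randn(4, 128)
print(goodness_rms(10 * h) / goodness_rms(h))  # ~10
print(goodness_ms(10 * h) / goodness_ms(h))    # ~100
```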
@inproceedings{
wu2026local,
title={Local Reinforcement Learning with Action-Conditioned Root Mean Squared Q-Functions},
author={Wu, Frank and Ren, Mengye},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=pi4tbBMLsM}
}