The Forward-Forward (FF) Algorithm is a recently proposed learning procedure for neural networks that employs two forward passes instead of the traditional forward and backward passes used in backpropagation. However, FF remains largely confined to supervised settings, leaving a gap in domains where learning signals arise more naturally, such as reinforcement learning (RL). In this work, inspired by FF's goodness function computed from layer activity statistics, we introduce Action-conditioned Root mean squared Q-Functions (ARQ), a novel value estimation method that applies a goodness function and action conditioning to local RL with temporal difference learning. Despite its simplicity and biological grounding, our approach achieves superior performance compared to state-of-the-art local backprop-free RL methods on the MinAtar and DeepMind Control Suite benchmarks, while also outperforming algorithms trained with backpropagation on most tasks.
Backprop-based reinforcement learning (RL) relies on global error signals that are difficult to reconcile with biologically plausible learning. Recent local learning methods such as Forward-Forward show that meaningful representations can be learned without backpropagation, motivating their extension to RL. We propose Action-Conditioned Root Mean Squared Q-Functions (ARQ), a fully local, backprop-free RL algorithm that reformulates Q-learning around a vector-based, action-conditioned goodness objective. By estimating Q-values through layer-local RMS activations, ARQ removes architectural constraints present in prior local RL methods. Despite relying only on local updates, ARQ achieves competitive performance across standard RL benchmarks.
Inspired by the Forward-Forward algorithm's goodness function, which is computed from layer activity statistics, we propose Action-conditioned Root mean squared Q-Functions (ARQ), a simple vector-based alternative to traditional scalar Q-value predictors, designed for local RL.
ARQ is composed of two key ingredients: (1) a vector-based goodness function, the root mean square of a layer's activations, which serves directly as that layer's Q-value estimate, and (2) action conditioning at the input level, so that each layer's activity reflects a specific state-action pair.
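To make these two ingredients concrete, here is a minimal sketch of a single action-conditioned layer whose Q-value estimate is the root mean square of its activations. The class name `RMSQLayer`, the one-hot action encoding, the hidden width, and the single linear map are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSQLayer(nn.Module):
    """A single action-conditioned layer whose RMS activation is its Q estimate.

    Hypothetical sketch: the one-hot action encoding, hidden width, and single
    linear map are illustrative choices rather than the paper's architecture.
    """
    def __init__(self, in_dim, num_actions, hidden_dim=128):
        super().__init__()
        self.num_actions = num_actions
        # Early fusion: the action is concatenated with the input features.
        self.fc = nn.Linear(in_dim + num_actions, hidden_dim)

    def forward(self, x, action):
        a_onehot = F.one_hot(action, self.num_actions).float()
        h = F.relu(self.fc(torch.cat([x, a_onehot], dim=-1)))
        # Vector-based Q-value: the root mean square of the layer's activations
        # plays the role of the Forward-Forward "goodness".
        q = torch.sqrt((h ** 2).mean(dim=-1) + 1e-8)
        return h, q
```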
ARQ can be readily implemented on top of Artificial Dopamine (AD), taking full advantage of its non-linearities and attention-like mechanisms while maintaining biological plausibility.
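The sketch below illustrates how a stack of such layers can be trained with layer-local temporal-difference updates: each layer regresses its own RMS-based Q estimate toward a bootstrapped target, and activations are detached between layers so no gradient crosses layer boundaries. It builds on the hypothetical `RMSQLayer` above and does not reproduce AD's attention-like cells; the shared greedy target, averaged over per-layer target-network estimates, is an illustrative choice rather than the paper's exact target construction.

```python
import torch
import torch.nn.functional as F

def local_td_step(layers, target_layers, optimizers, batch, gamma=0.99):
    """One layer-local TD(0) update over a batch of transitions.

    `layers` / `target_layers` are matching stacks of RMSQLayer-style modules
    (hypothetical, from the sketch above); `optimizers` holds one optimizer per
    layer. No gradient crosses layer boundaries: each layer's loss is computed
    from its own activations, and its input is detached.
    """
    obs, action, reward, next_obs, done = batch
    num_actions = layers[0].num_actions

    # Bootstrapped target: greedy over actions, averaging per-layer target Qs.
    with torch.no_grad():
        q_per_action = []
        for a in range(num_actions):
            a_vec = torch.full_like(action, a)
            x, qs = next_obs, []
            for tl in target_layers:
                h, q = tl(x, a_vec)
                qs.append(q)
                x = h
            q_per_action.append(torch.stack(qs).mean(dim=0))
        next_q = torch.stack(q_per_action, dim=-1).max(dim=-1).values
        target = reward + gamma * (1.0 - done.float()) * next_q

    # Layer-local updates: each layer fits its own Q estimate to the target.
    x = obs
    for layer, opt in zip(layers, optimizers):
        h, q = layer(x, action)
        loss = F.mse_loss(q, target)
        opt.zero_grad()
        loss.backward()      # gradients stay inside this layer
        opt.step()
        x = h.detach()       # block gradient flow to the next layer
```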
We evaluate ARQ on two challenging benchmarks designed to test RL algorithms in settings where local methods remain viable: MinAtar (5 discrete action games) and the DeepMind Control Suite (5 continuous control tasks).
Key Findings: ARQ consistently outperforms current local RL methods and surpasses conventional backprop-based value-learning methods in most games, demonstrating strong decision-making capabilities without relying on backpropagation. On MinAtar, ARQ shows particularly strong improvements on Breakout, SpaceInvaders, Seaquest, and Asterix. On DMC tasks, ARQ matches or exceeds the performance of SAC and TD-MPC2 while maintaining biological plausibility.
We ablate the effect of action conditioning on our method. Our results show that without input-level action conditioning, the network struggles to differentiate between action-specific Q-values, leading to significantly degraded performance. This demonstrates that early fusion of action information is essential for learning meaningful state-action representations in a local learning framework. Interestingly, the benefit of action conditioning is only mild for AD, while it is substantial for ARQ.
We visualize the hidden layer activations using 2-component PCA on the MinAtar Breakout environment.
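A sketch of this visualization, assuming the layer activations, Q-value estimates, and chosen actions have already been collected into NumPy arrays; the function and variable names here are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

def plot_hidden_pca(hidden, q_values, actions):
    """Project hidden activations to 2D with PCA, colored by Q-value and action.

    `hidden` is an (N, D) array of one layer's activations collected from
    rollout states; `q_values` and `actions` are the matching (N,) arrays.
    """
    coords = PCA(n_components=2).fit_transform(hidden)

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    sc = axes[0].scatter(coords[:, 0], coords[:, 1], c=q_values, cmap="viridis", s=8)
    fig.colorbar(sc, ax=axes[0], label="Q-value")
    axes[0].set_title("Colored by Q-value")

    axes[1].scatter(coords[:, 0], coords[:, 1], c=actions, cmap="tab10", s=8)
    axes[1].set_title("Colored by action")

    for ax in axes:
        ax.set_xlabel("PC 1")
        ax.set_ylabel("PC 2")
    plt.tight_layout()
    plt.show()
```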
Key Observation: Without action conditioning, activations cluster almost entirely by action identity and show no meaningful correlation with Q-values, indicating that action-related variance dominates the representation space. With action conditioning, representations become more state-driven and exhibit a mild positive relationship with Q-values, suggesting that the model can allocate capacity toward value-relevant structure rather than implicitly inferring action identity.
We ablate the choice of goodness function by comparing ARQ (root mean square) with ARQ-MS (mean square). We find that ARQ produces moderate, stable goodness magnitudes throughout training, while ARQ-MS shows large initial spikes followed by sharply reduced variability. This suggests that the square root in RMS normalizes the magnitude of the activations, preventing numerical instability while maintaining sensitivity to layer activity.
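The two variants differ only in the square root. A small sketch (tensor shapes, names, and the epsilon are illustrative) showing why the RMS form keeps magnitudes tame:

```python
import torch

def goodness_rms(h, eps=1e-8):
    """ARQ: root mean square of a layer's activations."""
    return torch.sqrt((h ** 2).mean(dim=-1) + eps)

def goodness_ms(h):
    """ARQ-MS: mean of squared activations (Forward-Forward-style, no sqrt)."""
    return (h ** 2).mean(dim=-1)

# The square root compresses large activations: scaling the layer activity by
# 10x scales the RMS goodness by ~10x, but the MS goodness by ~100x.
h = torch.randn(4, 128)
print(goodness_rms(10 * h) / goodness_rms(h))  # ~10
print(goodness_ms(10 * h) / goodness_ms(h))    # ~100
```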
@inproceedings{
wu2026local,
title={Local Reinforcement Learning with Action-Conditioned Root Mean Squared Q-Functions},
author={Wu, Frank and Ren, Mengye},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=pi4tbBMLsM}
}