Deep Learning and Reinforcement Learning in Algorithmic Trading (2018–2025)

A practical survey of deep learning (LSTM/CNN/Transformers) and deep reinforcement learning (DQN/PPO/A2C) for trading across equities, futures, and crypto—what works, what breaks, and how to deploy with realistic risk controls.

Why this matters

From 2018 to 2025, deep learning and deep reinforcement learning (DRL) moved from “interesting research” to a real production toolkit in systematic trading—especially for:

  • Feature extraction from high-dimensional inputs (order books, cross-asset signals, alt data)
  • Policy learning (position sizing / allocation) rather than pure price forecasting
  • Regime-aware behavior (trade less in chop, scale down risk in stress, rebalance dynamically)

The hard part isn’t finding papers with strong backtests. The hard part is building systems that survive non-stationarity, transaction costs, market impact, and evaluation leakage.

This report surveys key model families and what the literature and practitioner experience suggest actually translates into robustness.

Deep learning for prediction: what each family is good at

LSTM / GRU (sequence models)

Typical use: predict returns, direction, volatility; then map forecasts into trades via rules or portfolio optimization.

Strengths

  • Captures temporal dependencies and non-linear dynamics in price/indicator sequences
  • Often strong on medium-horizon signals (hours → days), especially with sensible regularization

Common failure modes

  • Overfits to one regime (e.g., post-2020 liquidity) and degrades sharply out-of-sample
  • Forecast accuracy doesn’t necessarily convert to trade PnL after costs
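The "map forecasts into trades" step above can be sketched as a simple rule: take the forecast's sign, scale by inverse realized volatility, and cap the position. A minimal illustration (the function name, 10% vol target, and cap are hypothetical choices, not from any specific paper):

```python
import math

def forecast_to_position(forecast: float, realized_vol: float,
                         vol_target: float = 0.10, max_pos: float = 1.0) -> float:
    """Map a return forecast into a signed, volatility-scaled position.

    forecast     : predicted next-period return (only its sign is used here)
    realized_vol : recent annualized volatility of the asset
    vol_target   : desired annualized volatility contribution (illustrative)
    max_pos      : hard cap on absolute position size
    """
    if realized_vol <= 0:
        return 0.0
    raw = math.copysign(1.0, forecast) * (vol_target / realized_vol)
    return max(-max_pos, min(max_pos, raw))
```

Note how the trade size is driven by volatility, not by forecast magnitude; this is one reason forecast accuracy alone doesn't determine PnL after costs.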

CNN (local motif extraction)

Typical use: extract short-horizon patterns from time series windows, technical-indicator “images”, or limit order books.

Strengths

  • Great at learning local shapes/patterns (microstructure motifs, short bursts)
  • Works well as a feature extractor feeding an LSTM/Transformer/RL policy

Common failure modes

  • Needs a lot of data; fragile to small distribution shifts
  • Can “learn the backtest” unless the evaluation protocol is strict
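A CNN's first layer is essentially a sliding dot product of learned kernels with local windows, which is why it excels at local shapes. A framework-free sketch (the difference kernel here is hand-picked for illustration; a trained CNN would learn such filters):

```python
def conv1d_valid(series, kernel):
    """Cross-correlate a kernel over a 1-D series ('valid' padding).

    High activations mark windows whose local shape matches the kernel,
    which is how a CNN's first layer flags short-horizon motifs.
    """
    k = len(kernel)
    return [sum(series[i + j] * kernel[j] for j in range(k))
            for i in range(len(series) - k + 1)]

# Toy example: a difference filter that responds to sharp local upslopes.
prices = [0.0, 0.1, 0.0, -0.1, 0.5, 1.0, 0.9]
kernel = [-1.0, 0.0, 1.0]
activations = conv1d_valid(prices, kernel)
```

The largest activation lands on the window containing the sharp up-move, the kind of local feature a CNN stack then composes into higher-level signals.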

Transformers / attention models

Typical use: multi-asset + long context modeling; mixing modalities (prices + macro + text/sentiment).

Strengths

  • Handles long-range dependencies and cross-asset interactions better than classic RNNs
  • Attention can improve interpretability (“what did the model look at?”)

Common failure modes

  • Data hunger and over-parameterization; can look great in-sample and disappoint live
  • Requires careful training design (walk-forward splits, early stopping, robust regularization)
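The walk-forward splits mentioned above can be generated with a few lines; each test block starts strictly after its training block, so no test sample is ever visible during training. A minimal sketch (window lengths are illustrative):

```python
def walk_forward_splits(n_samples: int, train_len: int, test_len: int, step: int):
    """Yield (train_idx, test_idx) index ranges for walk-forward evaluation.

    Every test window lies entirely after its training window,
    preventing look-ahead leakage in the evaluation protocol.
    """
    start = 0
    while start + train_len + test_len <= n_samples:
        train = range(start, start + train_len)
        test = range(start + train_len, start + train_len + test_len)
        yield train, test
        start += step
```

With `n_samples=10, train_len=4, test_len=2, step=2` this yields three rolling train/test pairs; model selection and early stopping should happen inside each training window only.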

Deep reinforcement learning: when policy learning beats prediction

DRL frames trading as sequential decision-making: learn a policy that maximizes a reward (PnL, Sharpe, drawdown-penalized return).

Value-based: DQN and variants

Where it fits: discrete actions (long/short/flat), simpler single-asset strategies, or constrained crypto bots.

What works well in practice

  • Reward shaping that penalizes volatility/drawdown and discourages overtrading
  • State representation that includes volatility, trend, and risk measures—not just price deltas
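The reward-shaping idea above can be sketched as a per-step reward that nets out a volatility penalty and a per-trade charge. The penalty weights are illustrative hyperparameters, not values from the literature:

```python
def shaped_reward(pnl: float, recent_vol: float, traded: bool,
                  vol_penalty: float = 0.5, trade_cost: float = 0.001) -> float:
    """Per-step RL reward: raw PnL minus a volatility penalty and a
    fixed charge whenever the agent changes position.

    The volatility term steers the agent toward risk-adjusted behavior;
    the trade charge discourages overtrading. Drawdown penalties can be
    added the same way with a running-peak tracker.
    """
    reward = pnl - vol_penalty * recent_vol
    if traded:
        reward -= trade_cost
    return reward
```

Even this toy version changes behavior qualitatively: a step with positive PnL in a high-volatility state can net a negative reward, teaching the agent to sit out turbulent periods.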

Typical pitfalls

  • Unstable learning in noisy markets
  • Unrealistic fills and costs make “paper alpha” vanish

Policy gradient / actor-critic: PPO, A2C/A3C, DDPG/TD3, SAC

Where it fits: continuous actions (position sizing, portfolio weights), multi-asset allocation.

Why people like PPO in finance

  • Generally stable and easy to tune relative to many alternatives
  • Works well as a baseline in portfolio environments (especially with constraints)

Typical pitfalls

  • Overtrading if the reward function doesn’t explicitly price turnover
  • Agents learn “cheats” in the simulator unless the environment is very realistic

Market-specific observations

Equities

  • Plenty of data, but news-driven jumps and structural breaks dominate many periods.
  • Deep learning can help with ranking/selection and risk overlays, but “one model to rule them all” is rare.
  • DRL portfolio rebalancing is promising when constraints (turnover, leverage, sector caps) are built in.
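One way to build constraints in is a hard projection applied to the agent's raw action: whatever weights the policy outputs, they are clipped and rescaled before execution. A simplified sketch covering per-asset and leverage caps (turnover and sector caps, also mentioned above, are omitted for brevity):

```python
def apply_constraints(weights, max_weight=0.10, max_leverage=1.0):
    """Clip per-asset weights, then rescale so gross exposure
    (sum of absolute weights) respects the leverage cap.

    Applied to a DRL agent's raw action, a projection like this
    guarantees the constraints hold regardless of what the policy outputs.
    """
    clipped = [max(-max_weight, min(max_weight, w)) for w in weights]
    gross = sum(abs(w) for w in clipped)
    if gross > max_leverage and gross > 0:
        clipped = [w * max_leverage / gross for w in clipped]
    return clipped
```

Because the projection is outside the learned policy, it also acts as a safety layer: constraint satisfaction does not depend on the agent having learned anything.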

Futures / multi-asset

  • Strong use case for DRL: learning volatility-scaled exposure and “trade / don’t trade” behavior across diverse contracts.
  • Evaluation must handle roll/continuous series correctly and include realistic costs.
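One common way to handle rolls is back-adjustment: at each roll date, shift all earlier prices by the gap between the incoming and outgoing contract, so the stitched series has no artificial jump. A simplified sketch (real implementations also handle roll schedules, volume-based roll timing, and ratio adjustment):

```python
def back_adjust(segments):
    """Stitch per-contract price segments into one continuous series.

    segments: list of price lists, oldest contract first; consecutive
    segments are assumed to overlap by exactly one observation
    (the roll date, priced in both contracts).

    On each roll, all earlier prices are shifted by the new-vs-old
    contract gap, so returns across the roll reflect market moves,
    not the contract switch.
    """
    continuous = list(segments[0])
    for nxt in segments[1:]:
        gap = nxt[0] - continuous[-1]               # incoming vs outgoing price
        continuous = [p + gap for p in continuous]  # shift the history
        continuous.extend(nxt[1:])                  # drop the duplicated roll bar
    return continuous
```

Training or evaluating on a series stitched without such adjustment injects a spurious "return" at every roll, which a DRL agent will happily learn to exploit.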

Crypto

  • DRL can look spectacular in certain regimes; robustness across boom/bust cycles is the problem.
  • Sentiment features (e.g., from social media) can help, but they can also introduce look-ahead leakage if timestamps and publication delays aren’t handled carefully.


A pragmatic benchmark table (representative results)

Work/system, approach, market, and high-level reported result:

  • Zhang, Zohren & Roberts (2019): deep RL with risk/volatility scaling; 50 liquid futures (multi-asset). Outperformed classical time-series momentum after costs; learned to stay out in consolidation.
  • Théate & Ernst (2021): DQN variant optimized for risk-adjusted performance; stocks (multi-market). Improved risk-adjusted returns under stricter evaluation.
  • FinRL benchmark/contest (reported 2025): PPO and other DRL baselines; US equities (Dow 30). PPO often a strong baseline; ensembles can reduce drawdown.
  • Sattarov & Choi (2024): multi-level DQN + sentiment + risk-aware reward; Bitcoin. Higher Sharpe vs prior baselines in their setting; highlights the importance of reward design.
  • “FTRL” (Financial Transformer + RL, 2025): Transformer state encoder + RL policy; portfolio setting. Improved returns vs baselines in the paper’s testbed; illustrates attention for state representation.

Note: reported numbers across papers aren’t directly comparable; environments, costs, and evaluation rigor vary widely.

What “best practice” looks like (if you want something that survives live)

  1. Walk-forward evaluation (multiple splits) and a strict “never peek” feature pipeline.
  2. Cost model first, not last: commissions + spreads + slippage; include impact proxies where relevant.
  3. Turnover constraints (explicit penalty) and realistic order execution assumptions.
  4. Regime robustness checks: stress periods, vol spikes, sideways markets; test sensitivity to small perturbations.
  5. Risk controls outside the model: max leverage, vol targeting, drawdown brakes, kill-switch rules.
  6. Monitoring + retraining policy: define drift metrics and a schedule; avoid “retrain whenever it hurts”.
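Point 2 above, the cost model, can start as a single per-trade function applied inside the backtest loop. The coefficients below are placeholders that must be calibrated to the actual venue and instrument:

```python
def trade_cost(notional: float,
               commission_bps: float = 0.5,
               half_spread_bps: float = 1.0,
               impact_bps_per_1m: float = 0.2) -> float:
    """Estimated cost of one trade, in currency units.

    commission + half the bid/ask spread + a linear impact proxy that
    grows with traded notional. All coefficients are illustrative
    defaults, not venue-calibrated values.
    """
    notional = abs(notional)
    commission = notional * commission_bps / 1e4
    spread = notional * half_spread_bps / 1e4
    impact = notional * (notional / 1_000_000) * impact_bps_per_1m / 1e4
    return commission + spread + impact
```

Charging every simulated fill through a function like this, before any strategy tuning, is what separates "paper alpha" from results that have a chance of surviving live.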

Bottom line

  • Deep learning is best treated as a feature engine and forecasting component that needs strong risk overlays.
  • DRL is most compelling when you need policy learning (position sizing / allocation) under constraints.
  • The dominant edge isn’t a specific architecture; it’s rigorous evaluation, realistic execution/cost modeling, and operational discipline.

References

  1. Review of reinforcement learning in trading (notes early-stage + realism gap): https://arxiv.org/abs/2106.00123
  2. Deep RL for continuous futures trading (multi-asset): https://ideas.repec.org/p/arx/papers/1911.10107.html
  3. An application of deep reinforcement learning to algorithmic trading (DQN): https://arxiv.org/abs/2004.06627
  4. FinRL benchmark/contest report (portfolio DRL baselines): https://arxiv.org/pdf/2504.02281
  5. Multi-level deep Q-networks for Bitcoin trading (Scientific Reports, 2024): https://www.nature.com/articles/s41598-024-51408-w
  6. Comparing transformer structures for stock prediction (2025): https://arxiv.org/html/2504.16361v1
  7. Financial Transformer Reinforcement Learning (FTRL) (2025): https://www.sciencedirect.com/science/article/abs/pii/S0925231225011233
  8. Deep reinforcement learning strategy behavior study (2024): https://arxiv.org/html/2407.09557v1
  9. Backtest overfitting comparison in the ML era: https://www.sciencedirect.com/science/article/abs/pii/S0950705124011110