Deep Learning and Reinforcement Learning in Algorithmic Trading (2018–2025)
A practical survey of deep learning (LSTM/CNN/Transformers) and deep reinforcement learning (DQN/PPO/A2C) for trading across equities, futures, and crypto—what works, what breaks, and how to deploy with realistic risk controls.
Why this matters
From 2018 to 2025, deep learning and deep reinforcement learning (DRL) moved from “interesting research” to a real production toolkit in systematic trading—especially for:
- Feature extraction from high-dimensional inputs (order books, cross-asset signals, alt data)
- Policy learning (position sizing / allocation) rather than pure price forecasting
- Regime-aware behavior (trade less in chop, scale down risk in stress, rebalance dynamically)
The hard part isn’t finding papers with strong backtests. The hard part is building systems that survive non-stationarity, transaction costs, market impact, and evaluation leakage.
This report surveys the key model families and highlights what the literature and practitioner experience suggest actually translates into robust live performance.
Deep learning for prediction: what each family is good at
LSTM / GRU (sequence models)
Typical use: predict returns, direction, volatility; then map forecasts into trades via rules or portfolio optimization.
Strengths
- Captures temporal dependencies and non-linear dynamics in price/indicator sequences
- Often strong on medium-horizon signals (hours → days), especially with sensible regularization
Common failure modes
- Overfits to one regime (e.g., post-2020 liquidity) and degrades sharply out-of-sample
- Forecast accuracy doesn’t necessarily convert to trade PnL after costs
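To make the second failure mode concrete, here is a toy pure-Python simulation; all numbers (hit rate, per-trade cost, volatility) are illustrative assumptions, not estimates from any market. A forecaster that is right 51% of the time produces positive gross PnL, yet loses once a round-trip cost is charged on every trade:

```python
# Toy illustration: modest directional accuracy can still lose money
# once per-trade costs are charged. All parameters are illustrative.
import random

random.seed(0)
COST = 0.001       # round-trip cost as a fraction of notional (assumed)
VOL = 0.01         # per-period return scale (assumed)
ACCURACY = 0.51    # forecaster's directional hit rate (assumed)

def simulate(n_periods: int = 10_000):
    gross = net = 0.0
    for _ in range(n_periods):
        true_ret = random.gauss(0.0, VOL)
        # The forecast matches the realized sign with probability ACCURACY.
        correct = random.random() < ACCURACY
        position = (1.0 if true_ret >= 0 else -1.0) * (1.0 if correct else -1.0)
        gross += position * true_ret
        net += position * true_ret - COST   # trade every period, pay the cost
    return gross, net

g, n = simulate()
print(f"gross={g:.2f}  net after costs={n:.2f}")
```

The gap between the two numbers is exactly why a cost model belongs in the evaluation loop from day one, not bolted on at the end.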
CNN (local motif extraction)
Typical use: extract short-horizon patterns from time series windows, technical-indicator “images”, or limit order books.
Strengths
- Great at learning local shapes/patterns (microstructure motifs, short bursts)
- Works well as a feature extractor feeding an LSTM/Transformer/RL policy
Common failure modes
- Needs a lot of data; fragile to small distribution shifts
- Can “learn the backtest” unless the evaluation protocol is strict
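What "local motif extraction" means mechanically can be shown with a single hand-set 1-D convolution kernel; a real CNN learns many such kernels from data, and the "V-reversal" kernel below is purely illustrative:

```python
# Minimal sketch of a 1-D convolution over a price window: slide a small
# kernel and score local shape matches. A trained CNN learns its kernels;
# this one is hand-set to detect V-shaped local bottoms.

def conv1d(series, kernel):
    k = len(kernel)
    return [sum(series[i + j] * kernel[j] for j in range(k))
            for i in range(len(series) - k + 1)]

# Second-difference kernel: large positive response at a local minimum.
v_kernel = [1.0, -2.0, 1.0]

prices = [100, 99, 98, 99, 100, 101, 101, 100, 99]
scores = conv1d(prices, v_kernel)
print(scores)  # the peak score flags the V-bottom at price 98
```

Stacking many learned kernels, nonlinearities, and pooling gives the feature maps that are then fed into an LSTM/Transformer/RL policy, as noted above.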
Transformers / attention models
Typical use: multi-asset + long context modeling; mixing modalities (prices + macro + text/sentiment).
Strengths
- Handles long-range dependencies and cross-asset interactions better than classic RNNs
- Attention can improve interpretability (“what did the model look at?”)
Common failure modes
- Data hunger and over-parameterization; can look great in-sample and disappoint live
- Requires careful training design (walk-forward splits, early stopping, robust regularization)
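The interpretability claim above ("what did the model look at?") comes from the attention weights themselves. A minimal single-head scaled dot-product attention sketch, with toy 2-d vectors standing in for learned projections:

```python
# Single-head scaled dot-product attention in plain Python. The weights
# vector is what practitioners inspect for interpretability. Queries, keys,
# and values here are toy vectors; real models learn these projections.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)          # inspectable: which inputs mattered
    out = [sum(w * v[i] for w, v in zip(weights, values))
           for i in range(len(values[0]))]
    return out, weights

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[0.1], [0.2], [0.3]]
out, weights = attention([1.0, 0.0], keys, values)
print(weights)  # the query attends most to the keys it overlaps with
```

In a multi-asset setting the keys would be per-asset (or per-timestep) embeddings, so the weight vector reads directly as "which assets/periods drove this output".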
Deep reinforcement learning: when policy learning beats prediction
DRL frames trading as sequential decision-making: learn a policy that maximizes a cumulative reward such as PnL, Sharpe ratio, or drawdown-penalized return.
Value-based: DQN and variants
Where it fits: discrete actions (long/short/flat), simpler single-asset strategies, or constrained crypto bots.
What works well in practice
- Reward shaping that penalizes volatility/drawdown and discourages overtrading
- State representation that includes volatility, trend, and risk measures—not just price deltas
Typical pitfalls
- Unstable learning in noisy markets
- Unrealistic fills and costs make “paper alpha” vanish
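The reward-shaping idea above can be sketched in a few lines: start from raw step PnL and subtract penalties for drawdown and turnover. The penalty weights below are illustrative hyperparameters, not values from any cited paper:

```python
# Hedged sketch of risk-aware reward shaping: raw PnL minus drawdown and
# turnover penalties. LAMBDA_DD and LAMBDA_TURN are assumed hyperparameters.

LAMBDA_DD = 0.5     # drawdown penalty weight (assumed)
LAMBDA_TURN = 0.1   # turnover penalty weight (assumed)

def shaped_reward(pnl: float, equity: float, peak_equity: float,
                  position: float, prev_position: float) -> float:
    drawdown = max(0.0, (peak_equity - equity) / peak_equity)
    turnover = abs(position - prev_position)
    return pnl - LAMBDA_DD * drawdown - LAMBDA_TURN * turnover

# The same profitable step is rewarded less while deep in drawdown than at
# the equity high-water mark, and position churn is explicitly priced:
r_at_peak = shaped_reward(0.01, 100.0, 100.0, 1.0, 1.0)
r_in_dd   = shaped_reward(0.01, 80.0, 100.0, 1.0, 1.0)
print(r_at_peak, r_in_dd)
```

The turnover term is what discourages overtrading; without it, value-based agents routinely learn to flip positions every step in noisy markets.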
Policy gradient / actor-critic: PPO, A2C/A3C, DDPG/TD3, SAC
Where it fits: continuous actions (position sizing, portfolio weights), multi-asset allocation.
Why people like PPO in finance
- Generally stable and easy to tune relative to many alternatives
- Works well as a baseline in portfolio environments (especially with constraints)
Typical pitfalls
- Overtrading if the reward function doesn’t explicitly price turnover
- Agents learn “cheats” in the simulator unless the environment is very realistic
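For the continuous-action portfolio case, the raw actor outputs have to be mapped into valid weights before they hit the simulator. One simple, illustrative mapping (softmax for a long-only budget, then a per-asset cap; both choices are assumptions, not a prescribed method):

```python
# One way a PPO-style continuous action becomes valid portfolio weights:
# softmax so weights are long-only and sum to 1, then cap single-asset
# concentration. The cap-and-renormalize step is a simple, non-iterative
# approximation, chosen for clarity.
import math

MAX_WEIGHT = 0.4  # per-asset cap (assumed constraint)

def to_weights(raw_actions):
    m = max(raw_actions)
    exps = [math.exp(a - m) for a in raw_actions]
    total = sum(exps)
    w = [e / total for e in exps]
    w = [min(x, MAX_WEIGHT) for x in w]   # enforce the concentration cap
    s = sum(w)
    return [x / s for x in w]             # restore the full-investment budget

weights = to_weights([2.0, 0.5, 0.0, -1.0])
print(weights)  # non-negative, sums to 1, ordered like the raw actions
```

Hard-coding constraints into the action mapping like this, rather than hoping the reward teaches them, is one of the ways agents are kept from exploiting simulator quirks.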
Market-specific observations
Equities
- Plenty of data, but news-driven jumps and structural breaks dominate many periods.
- Deep learning can help with ranking/selection and risk overlays, but “one model to rule them all” is rare.
- DRL portfolio rebalancing is promising when constraints (turnover, leverage, sector caps) are built in.
Futures / multi-asset
- Strong use case for DRL: learning volatility-scaled exposure and “trade / don’t trade” behavior across diverse contracts.
- Evaluation must handle roll/continuous series correctly and include realistic costs.
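Handling rolls "correctly" usually means back-adjusting the stitched series so the contract gap at each roll does not show up as a fake return. A minimal difference back-adjustment sketch, with made-up contract prices:

```python
# Sketch of difference back-adjustment for a continuous futures series:
# at each roll, shift all earlier prices by the (new - old) contract gap
# so returns across the roll are not spurious. Prices are made up.

def back_adjust(segments):
    """segments: one price list per contract, in time order. The last price
    of each segment shares a timestamp with the first price of the next
    (same moment, two contracts quoted)."""
    adjusted = list(segments[-1])
    for seg in reversed(segments[:-1]):
        gap = adjusted[0] - seg[-1]   # new contract minus old at the roll
        adjusted = [p + gap for p in seg[:-1]] + adjusted
    return adjusted

near = [100.0, 101.0, 102.0]   # expiring contract
next_ = [105.0, 104.0, 106.0]  # next contract; 105 quoted alongside 102
series = back_adjust([near, next_])
print(series)  # the +3 roll gap is pushed back; no fake +3 jump in returns
```

Feeding an unadjusted spliced series to a model, by contrast, teaches it that a tradeable jump happens on every roll date.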
Crypto
- DRL can look spectacular in certain regimes; robustness across boom/bust cycles is the problem.
- Sentiment features (e.g., from social media) can help, but they can also introduce leakage if timestamps and availability lags aren't handled carefully.
A pragmatic benchmark table (representative results)
| Work / system | Approach | Market | What it reported (high level) |
|---|---|---|---|
| Zhang, Zohren & Roberts (2019) | Deep RL with risk/volatility scaling | 50 liquid futures (multi-asset) | Outperformed classical time-series momentum after costs; learned to stay out in consolidation |
| Théate & Ernst (2021) | DQN variant optimized for risk-adjusted performance | Stocks (multi-market) | Improved risk-adjusted returns under stricter evaluation |
| FinRL benchmark/contest (reported 2025) | PPO and other DRL baselines | US equities (Dow 30) | PPO often a strong baseline; ensembles can reduce drawdown |
| Sattarov & Choi (2024) | Multi-level DQN + sentiment + risk-aware reward | Bitcoin | Higher Sharpe vs prior baselines in their setting; highlights reward design importance |
| “FTRL” (Financial Transformer + RL) (2025) | Transformer state encoder + RL policy | Portfolio setting | Improved returns vs baselines in the paper’s testbed; illustrates attention for state representation |
Note: reported numbers across papers aren’t directly comparable; environments, costs, and evaluation rigor vary widely.
What “best practice” looks like (if you want something that survives live)
- Walk-forward evaluation (multiple splits) and a strict “never peek” feature pipeline.
- Cost model first, not last: commissions + spreads + slippage; include impact proxies where relevant.
- Turnover constraints (explicit penalty) and realistic order execution assumptions.
- Regime robustness checks: stress periods, vol spikes, sideways markets; test sensitivity to small perturbations.
- Risk controls outside the model: max leverage, vol targeting, drawdown brakes, kill-switch rules.
- Monitoring + retraining policy: define drift metrics and a schedule; avoid “retrain whenever it hurts”.
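The first item on the list, walk-forward evaluation, can be sketched as a splitter that only ever trains on the past and tests on the immediate future. Window sizes below are placeholders to adapt per strategy horizon:

```python
# Minimal walk-forward splitter: expanding (or rolling) train window,
# fixed-size test window, strictly forward in time, no overlap between
# test windows. Sizes are placeholders, not recommendations.

def walk_forward_splits(n_samples, train_min, test_size, expanding=True):
    splits = []
    start = 0
    train_end = train_min
    while train_end + test_size <= n_samples:
        train = (start, train_end)                  # [start, train_end)
        test = (train_end, train_end + test_size)   # the next unseen block
        splits.append((train, test))
        train_end += test_size
        if not expanding:
            start += test_size                      # rolling window variant
    return splits

for train, test in walk_forward_splits(1000, train_min=500, test_size=100):
    print("train", train, "-> test", test)
```

Pairing this with a "never peek" feature pipeline (features computed only from data timestamped before each test window) is what keeps the evaluation honest across all the splits.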
Bottom line
- Deep learning is best treated as a feature engine and forecasting component that needs strong risk overlays.
- DRL is most compelling when you need policy learning (position sizing / allocation) under constraints.
- The dominant edge isn’t a specific architecture; it’s rigorous evaluation, realistic execution/cost modeling, and operational discipline.
References
- Review of reinforcement learning in trading (notes early-stage + realism gap): https://arxiv.org/abs/2106.00123
- Deep RL for continuous futures trading (multi-asset): https://ideas.repec.org/p/arx/papers/1911.10107.html
- An application of deep reinforcement learning to algorithmic trading (DQN): https://arxiv.org/abs/2004.06627
- FinRL benchmark/contest report (portfolio DRL baselines): https://arxiv.org/pdf/2504.02281
- Multi-level deep Q-networks for Bitcoin trading (Scientific Reports, 2024): https://www.nature.com/articles/s41598-024-51408-w
- Comparing transformer structures for stock prediction (2025): https://arxiv.org/html/2504.16361v1
- Financial Transformer Reinforcement Learning (FTRL) (2025): https://www.sciencedirect.com/science/article/abs/pii/S0925231225011233
- Deep reinforcement learning strategy behavior study (2024): https://arxiv.org/html/2407.09557v1
- Backtest overfitting comparison in the ML era: https://www.sciencedirect.com/science/article/abs/pii/S0950705124011110