Deep Learning and Reinforcement Learning in Algorithmic Trading (2018–2025)
A practical survey of deep learning (LSTM/CNN/Transformers) and deep reinforcement learning (DQN/PPO/A2C) for trading across equities, futures, and crypto—what works, what breaks, and how to deploy with realistic risk controls.
Why this matters
From 2018 to 2025, deep learning and deep reinforcement learning (DRL) moved from “interesting research” to a real production toolkit in systematic trading—especially for:
- Feature extraction from high-dimensional inputs (order books, cross-asset signals, alt data)
- Policy learning (position sizing / allocation) rather than pure price forecasting
- Regime-aware behavior (trade less in chop, scale down risk in stress, rebalance dynamically)
The hard part isn’t finding papers with strong backtests. The hard part is building systems that survive non-stationarity, transaction costs, market impact, and evaluation leakage.
This report surveys the key model families and highlights what the literature and practitioner experience suggest actually translates into robust live performance.
Deep learning for prediction: what each family is good at
LSTM / GRU (sequence models)
Typical use: predict returns, direction, volatility; then map forecasts into trades via rules or portfolio optimization.
Strengths
- Captures temporal dependencies and non-linear dynamics in price/indicator sequences
- Often strong on medium-horizon signals (hours → days), especially with sensible regularization
Common failure modes
- Overfits to one regime (e.g., post-2020 liquidity) and degrades sharply out-of-sample
- Forecast accuracy doesn’t necessarily convert to trade PnL after costs
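To make the second failure mode concrete, here is a toy pure-Python simulation; all numbers (hit rate, per-trade cost, volatility) are illustrative assumptions, not estimates from any market. A forecaster that is right 51% of the time produces positive gross PnL, yet loses once a round-trip cost is charged on every trade:

```python
# Toy illustration: modest directional accuracy can still lose money
# once per-trade costs are charged. All parameters are illustrative.
import random

random.seed(0)
COST = 0.001       # round-trip cost as a fraction of notional (assumed)
VOL = 0.01         # per-period return scale (assumed)
ACCURACY = 0.51    # forecaster's directional hit rate (assumed)

def simulate(n_periods: int = 10_000):
    gross = net = 0.0
    for _ in range(n_periods):
        true_ret = random.gauss(0.0, VOL)
        # The forecast matches the realized sign with probability ACCURACY.
        correct = random.random() < ACCURACY
        position = (1.0 if true_ret >= 0 else -1.0) * (1.0 if correct else -1.0)
        gross += position * true_ret
        net += position * true_ret - COST   # trade every period, pay the cost
    return gross, net

g, n = simulate()
print(f"gross={g:.2f}  net after costs={n:.2f}")
```

The gap between the two numbers is exactly why a cost model belongs in the evaluation loop from day one, not bolted on at the end.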
CNN (local motif extraction)
Typical use: extract short-horizon patterns from time series windows, technical-indicator “images”, or limit order books.
Strengths
- Great at learning local shapes/patterns (microstructure motifs, short bursts)
- Works well as a feature extractor feeding an LSTM/Transformer/RL policy
Common failure modes
- Needs a lot of data; fragile to small distribution shifts
- Can “learn the backtest” unless the evaluation protocol is strict
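What "local motif extraction" means mechanically can be shown with a single hand-set 1-D convolution kernel; a real CNN learns many such kernels from data, and the "V-reversal" kernel below is purely illustrative:

```python
# Minimal sketch of a 1-D convolution over a price window: slide a small
# kernel and score local shape matches. A trained CNN learns its kernels;
# this one is hand-set to detect V-shaped local bottoms.

def conv1d(series, kernel):
    k = len(kernel)
    return [sum(series[i + j] * kernel[j] for j in range(k))
            for i in range(len(series) - k + 1)]

# Second-difference kernel: large positive response at a local minimum.
v_kernel = [1.0, -2.0, 1.0]

prices = [100, 99, 98, 99, 100, 101, 101, 100, 99]
scores = conv1d(prices, v_kernel)
print(scores)  # the peak score flags the V-bottom at price 98
```

Stacking many learned kernels, nonlinearities, and pooling gives the feature maps that are then fed into an LSTM/Transformer/RL policy, as noted above.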
Transformers / attention models
Typical use: multi-asset + long context modeling; mixing modalities (prices + macro + text/sentiment).
Strengths
- Handles long-range dependencies and cross-asset interactions better than classic RNNs
- Attention can improve interpretability (“what did the model look at?”)
Common failure modes
- Data hunger and over-parameterization; can look great in-sample and disappoint live
- Requires careful training design (walk-forward splits, early stopping, robust regularization)
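The interpretability claim above ("what did the model look at?") comes from the attention weights themselves. A minimal single-head scaled dot-product attention sketch, with toy 2-d vectors standing in for learned projections:

```python
# Single-head scaled dot-product attention in plain Python. The weights
# vector is what practitioners inspect for interpretability. Queries, keys,
# and values here are toy vectors; real models learn these projections.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)          # inspectable: which inputs mattered
    out = [sum(w * v[i] for w, v in zip(weights, values))
           for i in range(len(values[0]))]
    return out, weights

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[0.1], [0.2], [0.3]]
out, weights = attention([1.0, 0.0], keys, values)
print(weights)  # the query attends most to the keys it overlaps with
```

In a multi-asset setting the keys would be per-asset (or per-timestep) embeddings, so the weight vector reads directly as "which assets/periods drove this output".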
Deep reinforcement learning: when policy learning beats prediction
DRL frames trading as sequential decision-making: learn a policy that maximizes a cumulative reward such as PnL, Sharpe ratio, or drawdown-penalized return.
Value-based: DQN and variants
Where it fits: discrete actions (long/short/flat), simpler single-asset strategies, or constrained crypto bots.
What works well in practice
- Reward shaping that penalizes volatility/drawdown and discourages overtrading
- State representation that includes volatility, trend, and risk measures—not just price deltas
Typical pitfalls
- Unstable learning in noisy markets
- Unrealistic fills and costs make “paper alpha” vanish
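The reward-shaping idea above can be sketched in a few lines: start from raw step PnL and subtract penalties for drawdown and turnover. The penalty weights below are illustrative hyperparameters, not values from any cited paper:

```python
# Hedged sketch of risk-aware reward shaping: raw PnL minus drawdown and
# turnover penalties. LAMBDA_DD and LAMBDA_TURN are assumed hyperparameters.

LAMBDA_DD = 0.5     # drawdown penalty weight (assumed)
LAMBDA_TURN = 0.1   # turnover penalty weight (assumed)

def shaped_reward(pnl: float, equity: float, peak_equity: float,
                  position: float, prev_position: float) -> float:
    drawdown = max(0.0, (peak_equity - equity) / peak_equity)
    turnover = abs(position - prev_position)
    return pnl - LAMBDA_DD * drawdown - LAMBDA_TURN * turnover

# The same profitable step is rewarded less while deep in drawdown than at
# the equity high-water mark, and position churn is explicitly priced:
r_at_peak = shaped_reward(0.01, 100.0, 100.0, 1.0, 1.0)
r_in_dd   = shaped_reward(0.01, 80.0, 100.0, 1.0, 1.0)
print(r_at_peak, r_in_dd)
```

The turnover term is what discourages overtrading; without it, value-based agents routinely learn to flip positions every step in noisy markets.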
Policy gradient / actor-critic: PPO, A2C/A3C, DDPG/TD3, SAC
Where it fits: continuous actions (position sizing, portfolio weights), multi-asset allocation.
Why people like PPO in finance
- Generally stable and easy to tune relative to many alternatives
- Works well as a baseline in portfolio environments (especially with constraints)
Typical pitfalls
- Overtrading if the reward function doesn’t explicitly price turnover
- Agents learn “cheats” in the simulator unless the environment is very realistic
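For the continuous-action portfolio case, the raw actor outputs have to be mapped into valid weights before they hit the simulator. One simple, illustrative mapping (softmax for a long-only budget, then a per-asset cap; both choices are assumptions, not a prescribed method):

```python
# One way a PPO-style continuous action becomes valid portfolio weights:
# softmax so weights are long-only and sum to 1, then cap single-asset
# concentration. The cap-and-renormalize step is a simple, non-iterative
# approximation, chosen for clarity.
import math

MAX_WEIGHT = 0.4  # per-asset cap (assumed constraint)

def to_weights(raw_actions):
    m = max(raw_actions)
    exps = [math.exp(a - m) for a in raw_actions]
    total = sum(exps)
    w = [e / total for e in exps]
    w = [min(x, MAX_WEIGHT) for x in w]   # enforce the concentration cap
    s = sum(w)
    return [x / s for x in w]             # restore the full-investment budget

weights = to_weights([2.0, 0.5, 0.0, -1.0])
print(weights)  # non-negative, sums to 1, ordered like the raw actions
```

Hard-coding constraints into the action mapping like this, rather than hoping the reward teaches them, is one of the ways agents are kept from exploiting simulator quirks.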
Market-specific observations
Equities
- Plenty of data, but news-driven jumps and structural breaks dominate many periods.
- Deep learning can help with ranking/selection and risk overlays, but “one model to rule them all” is rare.
- DRL portfolio rebalancing is promising when constraints (turnover, leverage, sector caps) are built in.
Futures / multi-asset
- Strong use case for DRL: learning volatility-scaled exposure and “trade / don’t trade” behavior across diverse contracts.
- Evaluation must handle roll/continuous series correctly and include realistic costs.
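Handling rolls "correctly" usually means back-adjusting the stitched series so the contract gap at each roll does not show up as a fake return. A minimal difference back-adjustment sketch, with made-up contract prices:

```python
# Sketch of difference back-adjustment for a continuous futures series:
# at each roll, shift all earlier prices by the (new - old) contract gap
# so returns across the roll are not spurious. Prices are made up.

def back_adjust(segments):
    """segments: one price list per contract, in time order. The last price
    of each segment shares a timestamp with the first price of the next
    (same moment, two contracts quoted)."""
    adjusted = list(segments[-1])
    for seg in reversed(segments[:-1]):
        gap = adjusted[0] - seg[-1]   # new contract minus old at the roll
        adjusted = [p + gap for p in seg[:-1]] + adjusted
    return adjusted

near = [100.0, 101.0, 102.0]   # expiring contract
next_ = [105.0, 104.0, 106.0]  # next contract; 105 quoted alongside 102
series = back_adjust([near, next_])
print(series)  # the +3 roll gap is pushed back; no fake +3 jump in returns
```

Feeding an unadjusted spliced series to a model, by contrast, teaches it that a tradeable jump happens on every roll date.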
Crypto
- DRL can look spectacular in certain regimes; robustness across boom/bust cycles is the problem.
- Sentiment features (e.g., from social media) can help, but they can also introduce leakage if timestamps and availability lags aren't handled carefully.
A pragmatic benchmark table (representative results)
| Work / system | Approach | Market | What it reported (high level) |
|---|---|---|---|
| Zhang, Zohren & Roberts (2019) | Deep RL with risk/volatility scaling | 50 liquid futures (multi-asset) | Outperformed classical time-series momentum after costs; learned to stay out in consolidation |
| Théate & Ernst (2021) | DQN variant optimized for risk-adjusted performance | Stocks (multi-market) | Improved risk-adjusted returns under stricter evaluation |
| FinRL benchmark/contest (reported 2025) | PPO and other DRL baselines | US equities (Dow 30) | PPO often a strong baseline; ensembles can reduce drawdown |
| Sattarov & Choi (2024) | Multi-level DQN + sentiment + risk-aware reward | Bitcoin | Higher Sharpe vs prior baselines in their setting; highlights reward design importance |
| “FTRL” (Financial Transformer + RL) (2025) | Transformer state encoder + RL policy | Portfolio setting | Improved returns vs baselines in the paper’s testbed; illustrates attention for state representation |
Note: reported numbers across papers aren’t directly comparable; environments, costs, and evaluation rigor vary widely.
What “best practice” looks like (if you want something that survives live)
- Walk-forward evaluation (multiple splits) and a strict “never peek” feature pipeline.
- Cost model first, not last: commissions + spreads + slippage; include impact proxies where relevant.
- Turnover constraints (explicit penalty) and realistic order execution assumptions.
- Regime robustness checks: stress periods, vol spikes, sideways markets; test sensitivity to small perturbations.
- Risk controls outside the model: max leverage, vol targeting, drawdown brakes, kill-switch rules.
- Monitoring + retraining policy: define drift metrics and a schedule; avoid “retrain whenever it hurts”.
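The first item on the list, walk-forward evaluation, can be sketched as a splitter that only ever trains on the past and tests on the immediate future. Window sizes below are placeholders to adapt per strategy horizon:

```python
# Minimal walk-forward splitter: expanding (or rolling) train window,
# fixed-size test window, strictly forward in time, no overlap between
# test windows. Sizes are placeholders, not recommendations.

def walk_forward_splits(n_samples, train_min, test_size, expanding=True):
    splits = []
    start = 0
    train_end = train_min
    while train_end + test_size <= n_samples:
        train = (start, train_end)                  # [start, train_end)
        test = (train_end, train_end + test_size)   # the next unseen block
        splits.append((train, test))
        train_end += test_size
        if not expanding:
            start += test_size                      # rolling window variant
    return splits

for train, test in walk_forward_splits(1000, train_min=500, test_size=100):
    print("train", train, "-> test", test)
```

Pairing this with a "never peek" feature pipeline (features computed only from data timestamped before each test window) is what keeps the evaluation honest across all the splits.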
Bottom line
- Deep learning is best treated as a feature engine and forecasting component that needs strong risk overlays.
- DRL is most compelling when you need policy learning (position sizing / allocation) under constraints.
- The dominant edge isn’t a specific architecture; it’s rigorous evaluation, realistic execution/cost modeling, and operational discipline.
References
- Review of reinforcement learning in trading (notes early-stage + realism gap): https://arxiv.org/abs/2106.00123
- Deep RL for continuous futures trading (multi-asset): https://ideas.repec.org/p/arx/papers/1911.10107.html
- An application of deep reinforcement learning to algorithmic trading (DQN): https://arxiv.org/abs/2004.06627
- FinRL benchmark/contest report (portfolio DRL baselines): https://arxiv.org/pdf/2504.02281
- Multi-level deep Q-networks for Bitcoin trading (Scientific Reports, 2024): https://www.nature.com/articles/s41598-024-51408-w
- Comparing transformer structures for stock prediction (2025): https://arxiv.org/html/2504.16361v1
- Financial Transformer Reinforcement Learning (FTRL) (2025): https://www.sciencedirect.com/science/article/abs/pii/S0925231225011233
- Deep reinforcement learning strategy behavior study (2024): https://arxiv.org/html/2407.09557v1
- Backtest overfitting comparison in the ML era: https://www.sciencedirect.com/science/article/abs/pii/S0950705124011110