Reinforcement learning in Trading (part 2 - the last)

In the previous post, I started to write about my trading RL experiment. Today I decided to complete the description of the trading network (I coded it a year ago).

Metrics (or features)

Raw prices are not very helpful as inputs to a NN, and neither are returns: at a 5-minute frequency they are volatile and behave stochastically. A first idea, therefore, is to generate features such as RSI, MACD, and support and resistance levels. For simplicity, I used only support and resistance levels and returns over different periods. For training, the metrics are calculated at the beginning of the series.
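As a rough illustration of the features described above, here is a minimal sketch in plain Python. It is not the code from my repository: the function names, the lookback periods and the definition of support and resistance as a rolling minimum and maximum over the previous bars are all assumptions made for the example.

```python
def period_returns(prices, periods=(1, 3, 12)):
    """Simple returns over several lookback periods (None until enough history)."""
    rows = []
    for t in range(len(prices)):
        rows.append([prices[t] / prices[t - p] - 1.0 if t >= p else None
                     for p in periods])
    return rows


def support_resistance(prices, window=20):
    """Assumed definition: rolling min (support) and max (resistance)
    over the previous `window` bars, excluding the current bar."""
    levels = []
    for t in range(len(prices)):
        if t < window:
            levels.append((None, None))
        else:
            past = prices[t - window:t]
            levels.append((min(past), max(past)))
    return levels


def features(prices, periods=(1, 3, 12), window=20):
    """One feature row per bar that has a full history:
    multi-period returns plus relative distance to support and resistance."""
    rets = period_returns(prices, periods)
    levels = support_resistance(prices, window)
    rows = []
    for t, price in enumerate(prices):
        support, resistance = levels[t]
        if support is None or any(r is None for r in rets[t]):
            continue  # skip the warm-up bars
        rows.append(rets[t] + [price / support - 1.0, price / resistance - 1.0])
    return rows
```

Expressing the levels as relative distances rather than raw prices keeps the inputs on a comparable scale across the series, which is a design choice of this sketch.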

Thousands of different features can be generated, but over short periods most of them capture only noise. For something truly useful, we would need to apply a Kalman filter to a model of the underlying process, for example a mean-reverting Ornstein–Uhlenbeck process. That problem is outside the scope of this post.


I created different versions of the NN: simple DQN, Double DQN, and Dueling Double DQN, each with simple and prioritized replay. Today, these architectures seem naive for serious problems.
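For readers who do not remember how these variants differ, the sketch below shows the standard textbook formulas in plain Python, not my actual network code. The function names and default values (gamma, alpha, eps) are illustrative assumptions.

```python
def dqn_target(reward, next_q_target, gamma=0.99, done=False):
    """Simple DQN: the target network both selects and evaluates the next action."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN: the online network selects the action,
    the target network evaluates it (reduces overestimation)."""
    if done:
        return reward
    best = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best]


def dueling_q(value, advantages):
    """Dueling head: Q(s, a) = V(s) + A(s, a) - mean(A(s, .))."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


def priority_probs(td_errors, alpha=0.6, eps=1e-6):
    """Prioritized replay: sampling probability proportional to (|TD error| + eps)^alpha."""
    weights = [(abs(e) + eps) ** alpha for e in td_errors]
    total = sum(weights)
    return [w / total for w in weights]
```

With simple replay, every stored transition is sampled with equal probability. With prioritized replay, transitions with larger TD errors are sampled more often, as in `priority_probs`.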

Neural Network

I used ordinary feed-forward networks, but some kind of LSTM model would be preferable.


Rewards are the crucial part of training, and the goal of a trading RL network can be tuned to almost anything: staying longer in a position, minimizing portfolio volatility, or maximizing the Sharpe ratio. I used a simple reward: the current profit of the open position.
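To make the reward choices concrete, here is a small sketch. It assumes the position is encoded as +1 (long), -1 (short) or 0 (flat) and that profit is measured relative to the entry price. Both are assumptions for the example and may differ from the original code. A per-period Sharpe ratio (no risk-free rate, no annualization) is included as one possible alternative reward.

```python
import statistics


def position_profit(position, entry_price, price):
    """Reward as the current unrealized profit of the open position.
    position: +1 long, -1 short, 0 flat (assumed encoding)."""
    if position == 0:
        return 0.0
    return position * (price - entry_price) / entry_price


def sharpe_ratio(returns):
    """Alternative reward: mean return divided by its sample standard deviation.
    Returns 0.0 when the volatility is zero."""
    sd = statistics.stdev(returns)
    return statistics.mean(returns) / sd if sd > 0 else 0.0
```

A profit-based reward is easy to compute at every step. A Sharpe-based one needs a window of past returns, so it is usually given per episode or over a rolling window.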


Results from training the Double Dueling DQN with simple replay, after two iterations (2 epochs) over the series, assuming zero commissions and slippage:

Below is a synthetic portfolio graph over the 2 epochs. It literally passes over the same series twice. The initial endowment is 1 mln.
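For clarity, a synthetic portfolio curve of this kind can be built roughly as follows. This is a sketch under stated assumptions, not the exact code behind the graph: the whole capital is invested in each bar, positions are +1, -1 or 0, and commissions and slippage are zero. The name `equity_curve` is hypothetical.

```python
def equity_curve(prices, positions, initial_capital=1_000_000.0):
    """Compound the capital bar by bar.
    positions[t] is the position (+1, -1 or 0) held from bar t to bar t + 1.
    Assumes zero commissions and slippage."""
    equity = [initial_capital]
    for t in range(len(prices) - 1):
        bar_return = prices[t + 1] / prices[t] - 1.0
        equity.append(equity[-1] * (1.0 + positions[t] * bar_return))
    return equity
```

To draw two epochs on one graph, the second pass can start from the final equity of the first, using the positions the agent chose on its second pass over the same prices.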


Trading is a hard puzzle for RL due to its stochastic nature. Neural networks are greedy and prone to overfitting, and RL magnifies every drawback of an ordinary NN by a factor of 100. So I do not believe that RL applies to trading because, unlike other game environments, financial markets are not normal: they have volatility clustering, bubbles, crashes, etc. A more sensible approach is to start with imitation learning and then give the agent more freedom.

The full code can be found here.
