Reinforcement learning in trading (Part 1)

Andrey Babynin
Aug 4, 2019


This is a brief introduction to making a simple bot for trading using reinforcement learning.

Given the success of DeepMind’s agents at various games, building a trading bot looks like an obvious next idea. After all, trading is yet another zero-sum-like game.

At first, I thought it would not take long to get it working, even with a modest profit. That was my first mistake. So the rest of the posts I plan to publish will be about slowly learning RL concepts and trying to make them work. I should say this up front: I haven’t built anything close to a profitable bot, but I have learned a lot.

The simple scheme of the bot contains 6 necessary components:

  1. Environment: the class containing the rules about which actions are possible at each step (e.g. you can’t close if you have no open positions), information about commissions and price slippage, your portfolio value, etc.
  2. Metrics: the inputs for your agent. These can be raw prices or a vector of more sophisticated values: RSI, EMA crossover, etc.
  3. NN: the neural net is the engine of the RL agent.
  4. Agent: contains the instructions for calculating Q-values, exploration probabilities, and strategies for finding the best action (Double or Dueling Q-learning, etc.).
  5. Reward scheme: there are plenty of possibilities to implement: a penalty for being idle, the Sharpe ratio of possible future rewards, etc.
  6. Learning loop

There is also one optional component, which I call settings, that documents all the initial assumptions.

Let’s start with the settings:

It contains the main assumptions about the environment (COMMISSION, SLIPPAGE), the type of agent (DUELING, DOUBLE), the structure of the neural nets (LAYERS), and the source data file (FILE).
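A minimal sketch of what such a settings module might look like. The constant names come from the description above; every value, and the reading of COMMISSION and SLIPPAGE as fractions of the price, is an illustrative assumption rather than the actual configuration.

```python
# settings.py: illustrative values only, not the author's actual numbers

# Environment assumptions (assumed here to be fractions of the deal price)
COMMISSION = 0.001   # 0.1% per trade (hypothetical)
SLIPPAGE = 0.0005    # 0.05% per trade (hypothetical)

# Agent type switches
DUELING = True       # use a dueling network head
DOUBLE = True        # use double Q-learning targets

# Neural net structure: number of units in each hidden layer (hypothetical)
LAYERS = [64, 32]

# Source file with historical prices (hypothetical path)
FILE = "prices.csv"
```

Keeping these in one place makes it easier to see which assumptions a given training run was made under.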

The next step is creating an environment:

The environment contains several attributes: slippage, commission size, initial endowment (it will be important in the later stages of testing), and a history dictionary, which documents all the actions and the corresponding changes in the portfolio.

Ideally, the dictionary should be saved somewhere outside of working memory, for instance in an SQL database.
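As a rough sketch, the attributes listed above could be held in a small class like the one below. The class name, field names, default values, and the layout of the history dictionary are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


def _empty_history():
    # One list per tracked quantity; the keys are hypothetical.
    return {"action": [], "price": [], "portfolio": []}


@dataclass
class Environment:
    commission: float = 0.001      # hypothetical default
    slippage: float = 0.0005       # hypothetical default
    initial_cash: float = 10_000.0 # the initial endowment (hypothetical)
    history: dict = field(default_factory=_empty_history)

    def record(self, action, price, portfolio):
        """Append one time step to the history dictionary."""
        self.history["action"].append(action)
        self.history["price"].append(price)
        self.history["portfolio"].append(portfolio)


env = Environment()
env.record(action=0, price=100.0, portfolio=10_000.0)
```

Using `default_factory` gives each environment its own history, so two instances never share the same lists.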
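One possible way to do this, sketched with Python’s built-in sqlite3 module. The table name, column names, and the shape of the history dictionary are hypothetical; an in-memory database is used here only so the example runs anywhere.

```python
import sqlite3


def save_history(history, db_path=":memory:"):
    """Write a history dict of equal-length lists into an SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS history "
        "(step INTEGER, action INTEGER, price REAL, portfolio REAL)"
    )
    rows = zip(
        range(len(history["action"])),
        history["action"],
        history["price"],
        history["portfolio"],
    )
    conn.executemany("INSERT INTO history VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


# Hypothetical two-step history
sample = {"action": [0, 3], "price": [100.0, 105.0], "portfolio": [1000.0, 1005.0]}
conn = save_history(sample)
n_rows = conn.execute("SELECT COUNT(*) FROM history").fetchone()[0]
```

Passing a file path instead of `":memory:"` would keep the history on disk between runs.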

The next step is creating some utility functions to control the environment:

position_value calculates the value of a position depending on the deal type (long or short).
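A minimal sketch of such a function. The signature and the valuation rule for shorts (entry value plus the gain from a price fall) are assumptions; other conventions are equally plausible.

```python
def position_value(entry_price, current_price, size, is_long):
    """Mark-to-market value of an open position (illustrative convention)."""
    if is_long:
        # A long position is worth what the asset can be sold for now.
        return size * current_price
    # A short position gains when the price falls below the entry price:
    # entry value plus (entry - current) profit per unit.
    return size * (2 * entry_price - current_price)


long_value = position_value(100.0, 110.0, 1, is_long=True)
short_value = position_value(100.0, 110.0, 1, is_long=False)
```

With the price rising from 100 to 110, the long position gains 10 and the short position loses 10 under this convention.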

price_with_com calculates the actual price, taking slippage and commission into account.
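A sketch of one way to do this, assuming both costs are expressed as fractions of the price and always work against the trader: a buyer pays more than the quoted price and a seller receives less. The signature and default values are hypothetical.

```python
def price_with_com(price, is_buy, commission=0.001, slippage=0.0005):
    """Effective execution price after commission and slippage."""
    cost = commission + slippage
    if is_buy:
        return price * (1 + cost)  # buying: pay above the quote
    return price * (1 - cost)      # selling: receive below the quote


buy_price = price_with_com(100.0, is_buy=True)
sell_price = price_with_com(100.0, is_buy=False)
```

The gap between the two effective prices is what a round trip costs even if the quoted price does not move.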

action_space defines the actions available at the next step. If the current action is open (long or short) or hold, the possible next actions are hold and close; otherwise, if the action is close or cash (no position), the agent can open a position (0, 1) or sit in cash (4).
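The rule above can be sketched as a small lookup. The codes 0, 1 (open) and 4 (cash) come from the text; which of 0 and 1 is long and which is short, and the codes 2 for hold and 3 for close, are assumptions made to fill the gap.

```python
# 0/1 = open and 4 = cash follow the text; the rest of the mapping is assumed.
LONG, SHORT, HOLD, CLOSE, CASH = 0, 1, 2, 3, 4


def action_space(last_action):
    """Return the actions allowed after `last_action`."""
    if last_action in (LONG, SHORT, HOLD):
        # A position is open: keep it or close it.
        return [HOLD, CLOSE]
    # No open position (just closed, or sitting in cash).
    return [LONG, SHORT, CASH]


after_open = action_space(LONG)
after_close = action_space(CLOSE)
```

Restricting the action set this way means the agent never has to learn that, say, closing a non-existent position is pointless.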

The next two functions add a layer of statistics: they calculate the cash reward from deals and the holding period for each deal. The reward function calculates the cash reward from a closed position. Note that this cash reward is not the same as the reward the RL agent receives.
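A minimal sketch of the two statistics, assuming a deal is described by its entry and exit prices, its size, its direction, and the time steps at which it was opened and closed. The function names and signatures are hypothetical, and trading costs are left out for brevity.

```python
def cash_reward(entry_price, exit_price, size, is_long):
    """Cash profit or loss of a closed deal, ignoring trading costs."""
    direction = 1 if is_long else -1
    return direction * (exit_price - entry_price) * size


def holding_period(open_step, close_step):
    """Number of time steps the deal was held."""
    return close_step - open_step


long_pnl = cash_reward(100.0, 110.0, 2, is_long=True)
short_pnl = cash_reward(100.0, 110.0, 2, is_long=False)
held = holding_period(open_step=3, close_step=10)
```

These numbers describe the deal in money terms; the signal used to train the agent can be derived from them, but it is a separate design choice.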

The main part of the environment is the step function, which recalculates the position at each time step.
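To make the idea concrete, here is a heavily simplified sketch of a step function. It trades one unit at a time, uses assumed action codes (0 long, 1 short, 2 hold, 3 close, 4 cash), and folds commission and slippage into a single hypothetical `cost` fraction. It is not the implementation described in this post.

```python
LONG, SHORT, HOLD, CLOSE, CASH = 0, 1, 2, 3, 4  # assumed codes


class TinyEnv:
    def __init__(self, prices, cash=1000.0, cost=0.001):
        self.prices = prices
        self.cash = cash
        self.cost = cost          # commission + slippage as a fraction
        self.t = 0                # current time step
        self.last_action = CASH
        self.direction = 0        # +1 long, -1 short, 0 flat
        self.entry_price = None

    def allowed(self):
        if self.last_action in (LONG, SHORT, HOLD):
            return [HOLD, CLOSE]
        return [LONG, SHORT, CASH]

    def step(self, action):
        if action not in self.allowed():
            raise ValueError(f"action {action} not allowed now")
        price = self.prices[self.t]
        if action in (LONG, SHORT):
            self.direction = 1 if action == LONG else -1
            # Costs move the entry price against the trader.
            self.entry_price = price * (1 + self.direction * self.cost)
        elif action == CLOSE:
            exit_price = price * (1 - self.direction * self.cost)
            self.cash += self.direction * (exit_price - self.entry_price)
            self.direction, self.entry_price = 0, None
        self.last_action = action
        self.t += 1
        done = self.t >= len(self.prices)
        return self.cash, done


env = TinyEnv([100.0, 105.0, 110.0], cost=0.0)
env.step(LONG)
env.step(HOLD)
final_cash, done = env.step(CLOSE)
```

Each call validates the action, updates the position, advances time by one step, and reports the cash balance together with a flag marking the end of the price series.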

This is the end of the environment part. In the next post, I will explain how I created Metrics.


