How is Stock Market Data Used to Build AI Trading Models?

—TechRound does not recommend or endorse any financial, investment, gambling, trading or other advice, practices, companies or operators. All articles are purely informational—

AI has revolutionised automated stock trading. Previously, every model had to be hypothesised by a human, then backtested, and finally implemented; AI models can now scan entire markets for anomalies and subtle trading patterns.

 

Data Selection and Acquisition

 

The core raw material for any AI model is high-quality data. This applies to an even greater extent to AI stock market trading, since humans cannot infer or create any of the data (unlike expert-written content). Raw stock price data is available from vendors such as FirstRate Data and QuantQuote, among others.

The volume and density of the data are extremely important: freely available end-of-day data has a limited number of data points (usually just four per day: open, high, low, close). Since AI models can cope with large volumes of data, a much denser data type, such as intraday bar data (an open, high, low and close, or OHLC, for each trading minute) or tick data (the individual trade-by-trade stream), will yield better results.
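As a concrete illustration of the density difference, a raw tick stream can be aggregated into minute OHLC bars in a few lines of pandas. The tick data below is synthetic; real vendor files differ in format, so treat this as a sketch of the resampling step, not a parser for any particular feed.

```python
import pandas as pd
import numpy as np

# Hypothetical tick stream: one trade per second for ten minutes.
# Vendor tick files vary; this synthetic series stands in for the raw feed.
rng = np.random.default_rng(0)
ticks = pd.DataFrame(
    {"price": 100 + rng.normal(0, 0.05, 600).cumsum()},
    index=pd.date_range("2024-01-02 09:30", periods=600, freq="s"),
)

# Aggregate the tick stream into one-minute OHLC bars.
bars = ticks["price"].resample("1min").ohlc()
print(bars.head())
```

The same `resample` call with `"1D"` would collapse the stream to end-of-day bars, which makes the data-point difference between the two formats easy to see.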

It should be noted that the volume of data should fit within the AI model’s context window (i.e. the amount of data the model can ingest at once). Once timestamps and delimiters are counted, a single stock price typically consumes a few tokens, so a 2.5-million-token context window can ingest on the order of one million prices, which equates to roughly 650 trading days of minute OHLC data, or about two and a half years of trading data for a given stock.
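The back-of-envelope budget above can be made explicit. The tokens-per-price figure is an assumption (actual tokenisation depends on the model and the data format), as is the standard 390-minute US trading day:

```python
# Context-budget estimate. Assumptions: ~2.5 tokens per price once timestamps
# and delimiters are counted, 4 prices per OHLC bar, 390 one-minute bars per
# US trading day.
context_tokens = 2_500_000
tokens_per_price = 2.5

prices = context_tokens / tokens_per_price   # ~1,000,000 prices
minute_bars = prices / 4                     # ~250,000 minute OHLC bars
trading_days = minute_bars / 390             # ~641 trading days

print(f"~{prices:,.0f} prices, ~{trading_days:,.0f} trading days of minute OHLC")
```

Re-running the arithmetic with your model's actual context size and a measured tokens-per-price figure is worth doing before deciding how much history to feed in.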

If the model is to search for market pricing anomalies and micro-structure patterns, the raw price data will need to be augmented with macro metadata such as major economic data releases and relevant corporate announcements (such as earnings releases). This guides the model away from analysing major market moves that are driven by macro factors rather than internal market dynamics.

 

Data Preprocessing and Cleaning

 

Commercial stock market data has usually been cleaned, but it is still advisable to scan for major spikes or data gaps. In addition, if the stock has a high dividend yield, the price data should be adjusted to account for the artificial price moves that occur when the stock goes ex-dividend.
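All three checks can be sketched with pandas. The 20% spike threshold, the ex-dividend date and the dividend amount below are illustrative assumptions, and the back-adjustment shown is the simple proportional method (divide pre-ex-date prices by the dividend factor):

```python
import pandas as pd

idx = pd.bdate_range("2024-01-02", periods=8)
close = pd.Series([50.0, 50.5, 50.2, 75.0, 50.8, 51.0, 49.2, 49.5], index=idx)

# 1) Spike scan: flag one-day returns beyond an (assumed) 20% threshold.
returns = close.pct_change()
spikes = returns[returns.abs() > 0.20]

# 2) Gap scan: look for missing business days in the index.
gaps = pd.bdate_range(idx[0], idx[-1]).difference(idx)

# 3) Dividend adjustment: proportionally scale prices before the ex-date
#    (hypothetical $1.50 dividend going ex on the seventh session).
ex_date, dividend = idx[6], 1.50
factor = 1 - dividend / close[close.index < ex_date].iloc[-1]
adjusted = close.where(close.index >= ex_date, close * factor)

print(spikes)
print(len(gaps), "missing sessions")
```

A real pipeline would also check for zero/negative prices and cross-check volumes, but the pattern is the same: scan, flag, then decide whether to repair or drop.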

 

Model Selection and Training

 

The first step is to select the AI model to use for analysing and developing the trading algorithm.

In general, classification and regression models such as Support Vector Machines (SVMs) are best suited to generating buy/sell signals on a single stock. Alternatively, if the goal is to model the returns of a portfolio of multiple stocks, LSTM (Long Short-Term Memory) networks are better suited to analysing related sequential time-series data.
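A minimal sketch of the single-stock SVM case, using scikit-learn: lagged returns as features and the sign of the next move as the label. The synthetic random-walk prices and the five-lag feature window are assumptions for illustration; real features would come from the data pipeline above.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(42)

# Synthetic price path; each row of X holds 5 lagged returns, and y is the
# direction of the following return (1 = buy signal, 0 = sell signal).
prices = 100 + rng.normal(0, 1, 500).cumsum()
rets = np.diff(prices) / prices[:-1]
X = np.column_stack([rets[i : i + len(rets) - 5] for i in range(5)])
y = (rets[5:] > 0).astype(int)

# Scaling matters for SVMs, hence the StandardScaler in the pipeline.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X, y)
signals = model.predict(X)
print(signals[:10])
```

On genuinely random data a model like this learns nothing tradeable; the point is the shape of the problem (lagged features in, directional signal out), not the result.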

For training, the model should only be given approximately 50% of the data; the remaining 50% should be held back and used to independently back-test the model’s trading strategy.
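For time-series data this split must be chronological, not shuffled, otherwise the back-test leaks future information into training. A minimal sketch:

```python
import numpy as np

# Stand-in for feature rows in time order; a real X would come from the
# feature pipeline above.
n = 1000
X = np.arange(n).reshape(-1, 1)
split = n // 2

# Chronological 50/50 split: no shuffling, so the back-test half lies
# strictly after the training half in time.
X_train, X_test = X[:split], X[split:]
print(len(X_train), len(X_test))
```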

 

Deployment and Real-Time Integration


Although many brokerage platforms, such as Interactive Brokers or Alpaca, offer automated trade execution via API, it is advisable to adopt a two-step approach: first implement a semi-automated system in which the algorithm generates trade signals that require manual execution, then move to direct trade execution via the API.
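The semi-automated stage can be as simple as appending each signal to a file that a human reviews before placing the trade. The function and ticker below are hypothetical; in the fully automated stage, the file write would be replaced by a call to the broker's API.

```python
import csv
import datetime

def emit_signal(symbol: str, side: str, qty: int, path: str = "signals.csv") -> None:
    """Append a timestamped trade signal for manual review and execution.

    In the later, fully automated stage this write would be swapped for a
    broker API order submission.
    """
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(
            [datetime.datetime.now(datetime.timezone.utc).isoformat(),
             symbol, side, qty]
        )

emit_signal("ACME", "BUY", 100)  # hypothetical ticker and size
```

Keeping the signal format identical across both stages makes the eventual switch to API execution a one-function change.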

The infrastructure for deploying the model also requires careful consideration, as speed of execution can be critical to some high-frequency trading strategies. As such, most generic cloud platforms such as AWS or Azure will not be appropriate since these are built for regular cloud applications and usually have high and variable latency.

Specialist low-latency platforms with a dedicated server are most appropriate, although a large server instance will not be required initially, as the actual server load of auto-trading applications is usually very low.

In addition to the trading application, it is usually necessary to build simple risk and position tooling, as free brokerage tools are insufficient to manage and monitor the risk of a large number of positions. For the initial deployment, it is usually sufficient to use a freely available risk management library such as PyRisk or Pyfolio.
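Even before adopting a library, a few lines of pandas cover the basics of position monitoring. The tickers and sizes below are hypothetical, and this is a deliberately minimal sketch next to what a full risk library provides:

```python
import pandas as pd

# Hypothetical open positions: signed quantity and last price per ticker.
positions = pd.DataFrame(
    {"qty": [100, -50, 200], "price": [50.0, 120.0, 10.0]},
    index=["AAA", "BBB", "CCC"],
)
positions["exposure"] = positions["qty"] * positions["price"]

gross = positions["exposure"].abs().sum()  # total size at risk, long + short
net = positions["exposure"].sum()          # directional tilt of the book

print(f"gross ${gross:,.0f}, net ${net:,.0f}")
```

Gross exposure is the natural input for a position-size limit, while net exposure shows how directional the book has become; both are worth alerting on from day one.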
