Yes, this is what I am doing at 1am, when market shifts remind me that there is a hidden beauty within the pseudo-chaotic signals that rule our lives. Then I go back to work, which is trying to systematize the stock market into a single network and to understand it better through the static and dynamic properties of that network. What can we find there? Lots of cool stuff: hidden connections, clusters, internal and cross-sector connections. What can we possibly derive from it? Investment considerations, robustness properties of the market, pathologies and more…
I am going to post findings in small chunks, the way I am actually working on this 🙂
So how shall we start? Certainly from the raw data. We need trading records of financial instruments for some period of time. I’ve got 2 sets of data, based on ranges:
- 1 week of 3.5k NASDAQ company stocks and exchange-traded fund (ETF) indexes with 5-minute granularity
- 10 years of daily granularity data for NASDAQ stocks
The next step is to choose the set (or a subset, based on the required scope), gather the data and probably clean up and arrange the formats. Once the data is ready, we need to run cross-correlation. This gives us an NxN matrix with the squared correlation coefficient (R squared, written Rsq below) between each pair of stocks. From this point on we will call each company stock or ETF a “Node” and a connection between two of them an “Edge”. This is because, as I said, we are going to build a network, and those are the terms for its basic components.
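The cross-correlation step can be sketched in a few lines of pandas. This is only a minimal illustration, not necessarily the tooling used for the results here: the function name `rsq_matrix`, the tickers and the numbers are made up, and whether you feed it raw prices or period-to-period returns is a choice you have to make yourself.

```python
import pandas as pd

def rsq_matrix(series: pd.DataFrame) -> pd.DataFrame:
    """Return the N x N matrix of squared Pearson correlations.

    `series` holds one column per instrument and one row per timestamp.
    Hypothetical helper: passing prices or returns is up to the caller.
    """
    corr = series.corr()  # pairwise Pearson correlation between columns
    return corr ** 2      # square it to get Rsq, always within [0, 1]

# Toy data with made-up tickers: BBB is an exact multiple of AAA.
toy = pd.DataFrame({
    "AAA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "BBB": [2.0, 4.0, 6.0, 8.0, 10.0],
    "CCC": [5.0, 3.0, 4.0, 1.0, 2.0],
})
rsq = rsq_matrix(toy)
print(rsq.round(3))
```

The result is symmetric with ones on the diagonal, so only the upper triangle carries information about distinct pairs.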
Applying an Rsq threshold significantly reduces the number of stock pairs that count as correlated. By how much? Well, exponentially. This is important, since it reduces the load on our system and makes the analysis faster. In addition, it gives us the required focus of investigation:

Number of edges (stock connections, Y axis) as a function of the applied Rsq threshold (X axis). The drop is exponential, so it is presented on a logarithmic scale.
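A curve like this one can be produced by counting, for each threshold, how many distinct pairs stay above it. Here is a rough sketch; `edge_counts` and the small matrix are invented for illustration and are not my real data.

```python
import numpy as np

def edge_counts(rsq: np.ndarray, thresholds) -> list:
    """Count node pairs with Rsq above each threshold (hypothetical helper)."""
    # Upper triangle without the diagonal: each pair once, no self-loops.
    upper = np.triu_indices_from(rsq, k=1)
    pair_values = rsq[upper]
    return [int((pair_values > t).sum()) for t in thresholds]

# Made-up 3 x 3 Rsq matrix, just to show the mechanics.
toy_rsq = np.array([
    [1.00, 0.95, 0.50],
    [0.95, 1.00, 0.92],
    [0.50, 0.92, 1.00],
])
thresholds = [0.40, 0.90, 0.93]
counts = edge_counts(toy_rsq, thresholds)
print(dict(zip(thresholds, counts)))
```

Plotting `counts` against `thresholds` with a logarithmic Y axis gives the same kind of picture as above.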
I prefer to work with highly correlated signals, Rsq > 0.9, on the 5-minute granularity data and a bit lower for the annual-scale signals. This is enough data for a single person to dig into during the night. For example, a cleanup at 0.9 Rsq gives 364 nodes (companies and funds) and 5778 edges (connections between them) on my data set.
To start working easily, we need some visualization software. I like Gephi. We import the edge table, with the Rsq values defined as “Weights”. So what does it look like?
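The cleanup itself boils down to turning the Rsq matrix into an edge table and keeping only the rows above the threshold. A minimal sketch follows; `edge_table`, the tickers and the values are hypothetical, and the `Source`/`Target`/`Weight` column names are my assumption, picked because graph tools commonly recognize them in edge tables.

```python
import numpy as np
import pandas as pd

def edge_table(rsq: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """List every node pair whose Rsq exceeds the threshold (hypothetical helper)."""
    values = rsq.to_numpy()
    # Upper triangle only, so each pair appears once and self-loops are skipped.
    rows, cols = np.triu_indices_from(values, k=1)
    pair_values = values[rows, cols]
    keep = pair_values > threshold
    return pd.DataFrame({
        "Source": rsq.index.to_numpy()[rows[keep]],
        "Target": rsq.columns.to_numpy()[cols[keep]],
        "Weight": pair_values[keep],
    })

# Made-up tickers and Rsq values.
tickers = ["AAA", "BBB", "CCC"]
toy_rsq = pd.DataFrame(
    [[1.00, 0.95, 0.50],
     [0.95, 1.00, 0.92],
     [0.50, 0.92, 1.00]],
    index=tickers, columns=tickers,
)
edges = edge_table(toy_rsq, threshold=0.9)
nodes = set(edges["Source"]) | set(edges["Target"])
print(edges)
print(len(nodes), "nodes,", len(edges), "edges")
```

Calling `edges.to_csv("edges.csv", index=False)` would then write the table to a file that can be loaded into a graph tool.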

Cross-correlation network of 364 NASDAQ company stocks and funds on 16 Apr 2014, with correlation higher than 0.9, based on 5-minute granularity sampling. Colors are based on market sector; node size is based on capital value.
Beautiful, isn’t it?
Rsq>0.95 gives a much more focused picture: