Yes, this is what I am doing at 1am, when market shifts remind me that there is a hidden beauty within the pseudo-chaotic signals that rule our lives. Then I go back to the work, which is trying to systematize the stock market into a single network and to understand it better through the static and dynamic properties of that network. What can we find there? Lots of cool stuff. Hidden connections, clusters, internal and cross-sector connections. What can we possibly derive out of it? Investment considerations, robustness properties of the market, pathologies and more…

I am going to post findings in small chunks, the way I am actually working on this 🙂

So how shall we start? Certainly from the raw data. We need trading records of financial instruments for some period of time. I’ve got 2 sets of data, covering different time ranges:

- 1 week of 3.5k NASDAQ company stocks and exchange-traded funds (ETFs) with 5-minute granularity
- 10 years of daily granularity data for NASDAQ stocks

The next step is to choose the set (or a subset, based on the required scope), gather the data and probably clean up and arrange the formats. Once the data is ready, we need to run cross-correlation. This gives us an NxN matrix with the squared correlation coefficient (R squared, Rsq) for each pair of stocks. From this point on we will call each company stock or ETF a “**Node**” and a connection between two of them an “**Edge**”. This is because, as I said, we are going to build a network, and those are the terms for its basic components.
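The cross-correlation step can be sketched roughly as follows. This is only a minimal illustration in plain Python, not the exact pipeline used for the data sets above: the tickers `AAA`, `BBB`, `CCC` and their values are made up, the helper names `pearson_r` and `rsq_matrix` are hypothetical, and the choice to correlate raw prices (rather than, say, returns) is an assumption.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A flat series has no defined correlation; treating it as 0 is an assumption.
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def rsq_matrix(series):
    """Return (symbols, matrix) where matrix[i][j] is R squared for symbols i and j."""
    symbols = sorted(series)
    matrix = [
        [pearson_r(series[a], series[b]) ** 2 for b in symbols]
        for a in symbols
    ]
    return symbols, matrix

# Made-up toy data: one value per time step for each hypothetical ticker.
prices = {
    "AAA": [10.0, 10.5, 11.0, 10.8, 11.4],
    "BBB": [20.0, 21.0, 22.0, 21.6, 22.8],  # exactly 2 x AAA, so Rsq should be ~1
    "CCC": [5.0, 4.0, 6.0, 3.0, 5.5],       # unrelated wiggle
}

symbols, rsq = rsq_matrix(prices)
for name, row in zip(symbols, rsq):
    print(name, [round(v, 3) for v in row])
```

With thousands of instruments a vectorized library would be the practical choice; the nested loop here is only meant to show what each cell of the NxN matrix contains.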

Applying an Rsq threshold significantly reduces the number of stocks that count as correlated. How much? Well, exponentially. This is important, since it reduces the load on our system and makes the analysis faster. In addition, it gives us the required focus of investigation:

I prefer to work with highly correlated signals, Rsq > 0.9, on the 5-minute granularity data, and a bit lower for the annual-scale signals. This is enough data for a single person to dig into during the night. For example, on my data set the 0.9 Rsq cleanup gives 364 nodes (companies and funds) and 5778 edges (connections between them).
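As a rough sketch of that cleanup, the snippet below keeps only the pairs above a threshold and counts the surviving nodes and edges. The matrix values and tickers are invented for illustration, and `edges_above` is a hypothetical helper name, so the counts it prints have nothing to do with the 364 / 5778 figures above.

```python
def edges_above(symbols, rsq, threshold):
    """Return (source, target, rsq) for every unordered pair above the threshold."""
    edges = []
    for i in range(len(symbols)):
        # j starts at i + 1: skip the diagonal and count each pair only once.
        for j in range(i + 1, len(symbols)):
            if rsq[i][j] > threshold:
                edges.append((symbols[i], symbols[j], rsq[i][j]))
    return edges

# Made-up symmetric Rsq matrix for four hypothetical tickers.
symbols = ["AAA", "BBB", "CCC", "DDD"]
rsq = [
    [1.00, 0.97, 0.91, 0.20],
    [0.97, 1.00, 0.88, 0.15],
    [0.91, 0.88, 1.00, 0.30],
    [0.20, 0.15, 0.30, 1.00],
]

for threshold in (0.9, 0.95):
    edges = edges_above(symbols, rsq, threshold)
    # A node survives only if it still has at least one edge.
    nodes = {name for source, target, _ in edges for name in (source, target)}
    print(f"Rsq > {threshold}: {len(nodes)} nodes, {len(edges)} edges")
```

Raising the threshold drops edges first, and nodes disappear only once their last edge is gone, which is why the network gets sparser as well as smaller.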

To get started easily, we need some visualization software. I like Gephi. We import the Edge table, with the Rsq values defined as “Weights”. So what does it look like?

Beautiful, isn't it?

Rsq > 0.95 gives a much more focused picture:

Now we can inspect the network by zooming in and filtering by sector. For example, Health Care and Pharma:

At this stage we can ask ourselves various questions. For example, what do two highly correlated stocks look like, and why are they correlated?

Sometimes it makes perfect sense, e.g. for FOXA (the Class A shares) and FOX (the Class B shares):

In other cases you might find that the intra-day correlation is not representative: it is occasional, or caused by a rare common event, and so there is a need to switch to the annual scale.

If we build a portfolio of Pharma companies, should we take them all? If that does not really make sense, then which part? A representative from each cluster? Well, we can run a clustering algorithm, run a centrality algorithm and probably choose based on those considerations.
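To make the "representative from each cluster" idea a bit more concrete, here is a deliberately crude sketch. It is not the clustering or centrality algorithm I will end up using: connected components stand in for clusters, and weighted degree (the sum of Rsq over a node's edges) stands in for centrality. The edge list, tickers and the `representatives` helper are all made up.

```python
from collections import defaultdict

def representatives(edges):
    """For an undirected weighted edge list, return [(sorted cluster, representative)]."""
    neighbours = defaultdict(dict)
    for source, target, weight in edges:
        neighbours[source][target] = weight
        neighbours[target][source] = weight

    # Weighted degree: a simple stand-in for a real centrality measure.
    strength = {node: sum(ws.values()) for node, ws in neighbours.items()}

    seen = set()
    result = []
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Depth-first search collects one connected component (our crude "cluster").
        cluster, stack = [], [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            cluster.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        cluster.sort()
        # Pick the most strongly connected node; ties go to the first name alphabetically.
        rep = max(cluster, key=lambda n: strength[n])
        result.append((cluster, rep))
    return result

# Made-up edges (source, target, Rsq) forming two separate groups.
edges = [
    ("AAA", "BBB", 0.97), ("AAA", "CCC", 0.93), ("BBB", "CCC", 0.91),
    ("DDD", "EEE", 0.96), ("EEE", "FFF", 0.92),
]

for cluster, rep in representatives(edges):
    print(cluster, "->", rep)
```

On a dense real network, connected components will usually be far too coarse, so a proper community detection method and a richer centrality measure would be the natural replacements for both stand-ins.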

**All those things and more in the next parts…**