Calculating correlation between two stocks with Python!

Posted in #trading 6 years ago (edited)


Today we'll look at how we can use Python to calculate the correlation between two stocks. Our goal is to write a command-line program that takes two .csv files containing closing prices. We also want to be able to specify the start and end dates between which the correlation is calculated.

Here is the link to the script we are talking about:

https://github.com/toslaty/steem/blob/master/corr.py

You might ask yourself where you can get the data. For now we'll just use the pricing data that can be downloaded from Yahoo Finance. Here are the links to the files we'll use in this tutorial.

https://finance.yahoo.com/quote/AAPL/history?p=AAPL
https://finance.yahoo.com/quote/GOOGL/history?p=GOOGL

There are definitely better ways to acquire the data. If you want to know more about it, you can search the web for pandas-datareader. For now we'll stick with the CSV files from Yahoo.

First we import all the libraries we'll use in this tutorial. Those are the following:

impos.png

First we import pandas, a library for analyzing and working with data. The next two lines might look a bit confusing. It looks like we are importing datetime twice, but that isn't the case. The first import is the datetime module. The second one refers to the class inside that module, which is also called datetime. Later we'll look at why we imported it this way.
After that we import argparse, which you might know from this tutorial. argparse is an easy-to-use module that helps us create a command-line interface. We also import math to help us with the calculations later on.
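Since the screenshot may not be visible, here is a rough sketch of what the import section described above could look like. The alias pd is an assumption (it is the usual convention, but the script may differ). The comments point out one detail worth knowing: the second datetime import rebinds the name, so afterwards datetime refers to the class and not the module.

```python
import argparse
import math

import pandas as pd  # assumption: the conventional "pd" alias

import datetime                # binds the name "datetime" to the module
from datetime import datetime  # rebinds the same name to the class

# From here on "datetime" is the class, so we can write datetime(...) or
# datetime.strptime(...) instead of datetime.datetime(...).
is_class = isinstance(datetime, type)
example_day = datetime(2019, 1, 2)
print(is_class, example_day.year)
```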

Now to the main function:

main.png

We define four arguments with the add_argument() method so that we can call our script with the following parameters.

-f is the first company

-s is the second company

-st is the start date in the format YY-MM-DD

-et is the end date in the format YY-MM-DD

clicmd.png
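As a minimal sketch of the four arguments listed above, the parser could be set up roughly like this. The helper name build_parser, the help texts and the sample file names and dates are made up for illustration; only the four flags come from the tutorial. A list is passed to parse_args() here so the example runs without a real command line.

```python
import argparse


def build_parser():
    # Hypothetical helper: the original script defines the arguments in main().
    parser = argparse.ArgumentParser(
        description="Correlation between two stocks"
    )
    parser.add_argument("-f", help="CSV file of the first company")
    parser.add_argument("-s", help="CSV file of the second company")
    parser.add_argument("-st", help="start date")
    parser.add_argument("-et", help="end date")
    return parser


# Simulate: python corr.py -f AAPL.csv -s GOOGL.csv -st 2018-01-02 -et 2018-12-31
args = build_parser().parse_args(
    ["-f", "AAPL.csv", "-s", "GOOGL.csv", "-st", "2018-01-02", "-et", "2018-12-31"]
)
print(args.f, args.s, args.st, args.et)
```

Each flag ends up as an attribute on args named after the flag without the dash, so -st becomes args.st.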

After that we parse the arguments with the parse_args() method. Next we define two variables called start and end. Here we use datetime's strptime() method to convert the date strings from the command line into datetime objects. strptime() takes two arguments: the first is the string to parse and the second is the format specifier.
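Here is a small sketch of that conversion. It assumes four-digit years (the specifier %Y-%m-%d), which is how the dates appear in Yahoo's CSV files. If the script really expects two-digit years, the specifier would be %y-%m-%d instead. The constant name DATE_FORMAT and the sample dates are made up.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d"  # assumption: four-digit year, e.g. 2018-01-02

start = datetime.strptime("2018-01-02", DATE_FORMAT)
end = datetime.strptime("2018-12-31", DATE_FORMAT)
print(start, end)

# The same day written with a two-digit year needs %y instead of %Y.
short_start = datetime.strptime("18-01-02", "%y-%m-%d")
print(short_start == start)
```

If the string does not match the format specifier, strptime() raises a ValueError, so a wrongly formatted date on the command line fails early.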

The next two variables, one and two, each hold the result of a call to our prep_data() function, which is defined in line 9 of the script.

prep.png

The function takes three arguments (the stock's .csv file, the start date and the end date). The first variable we define is called name, where we simply strip the '.csv' part from the file name.

Now we define our dataframe with the fr variable. It calls the pandas read_csv() function to import our data from the .csv file.
Then we use drop() to drop the columns we don't need, because we only want the "Adj Close" column. In the next line we rename the Adj Close column to the stock name.
We now define the rng variable, which is returned by the function. It uses the loc indexer, to which we pass the start and end date, so that we only get the time frame we specified on the command line.
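The steps above can be sketched roughly as follows. This is not the original function but a reconstruction under assumptions: the CSV has Yahoo's usual columns (Date, Open, High, Low, Close, Adj Close, Volume), the Date column is used as a parsed index so that loc can slice by date, and the name is derived with os.path.splitext. The sample rows are made-up numbers written to a temporary file so the example runs on its own.

```python
import os
import tempfile
from datetime import datetime

import pandas as pd


def prep_data(path, start, end):
    # "AAPL.csv" -> "AAPL" (assumption: splitext instead of string stripping)
    name = os.path.splitext(os.path.basename(path))[0]
    # Assumption: use the Date column as a DatetimeIndex for date slicing
    fr = pd.read_csv(path, index_col="Date", parse_dates=True)
    # Keep only the adjusted close and name it after the stock
    fr = fr.drop(columns=["Open", "High", "Low", "Close", "Volume"])
    fr = fr.rename(columns={"Adj Close": name})
    # loc slices on a sorted DatetimeIndex include both end points
    rng = fr.loc[start:end]
    return rng


# Made-up sample data in the assumed Yahoo layout
SAMPLE = """Date,Open,High,Low,Close,Adj Close,Volume
2019-01-02,10.0,11.0,9.5,10.5,10.4,1000
2019-01-03,10.5,11.5,10.0,11.0,10.9,1100
2019-01-04,11.0,12.0,10.5,11.5,11.4,1200
2019-01-07,11.5,12.5,11.0,12.0,11.9,1300
"""

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "AAPL.csv")
    with open(csv_path, "w") as handle:
        handle.write(SAMPLE)
    window = prep_data(csv_path, datetime(2019, 1, 3), datetime(2019, 1, 4))

print(window)
```

One thing to watch for if you write the name step yourself: str.strip('.csv') removes any of the characters ".", "c", "s" and "v" from both ends of the string, not the literal suffix, which is why the sketch uses os.path.splitext.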

Back in our main() function we now call the concat() function, which concatenates the two datasets. We also specify the axis along which to concatenate, so that the two stocks end up side by side as columns.

If you now print() the ind variable you should get the following.

table.png
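In case the table image is not visible, here is a small sketch of the concatenation step with made-up prices. It assumes axis=1, which places the two single-column frames next to each other and aligns them on their shared date index. The variable names one, two and ind follow the tutorial.

```python
import pandas as pd

# Made-up prices on three trading days, standing in for prep_data() results
dates = pd.to_datetime(["2019-01-02", "2019-01-03", "2019-01-04"])
one = pd.DataFrame({"AAPL": [10.4, 10.9, 11.4]}, index=dates)
two = pd.DataFrame({"GOOGL": [20.1, 19.8, 20.6]}, index=dates)

# axis=1 puts the frames side by side, matching rows by date
ind = pd.concat([one, two], axis=1)
print(ind)
```

If one file has a date the other lacks, concat keeps the row and fills the missing price with NaN, so it can be worth checking ind for missing values before calculating anything.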

In the last line of the main() function we simply print the sentence "The correlation between the stocks is :" followed by the return value of the corr_stocks() function, which is defined in line 21.

calc.png

What happens here? In general, we broke the following formula for calculating the correlation (the Pearson correlation coefficient) down into smaller steps.

Correlation Function:

Corr = (n * Sum(X*Y) - Sum(X) * Sum(Y)) / SquareRoot((n * Sum(X^2) - Sum(X)^2) * (n * Sum(Y^2) - Sum(Y)^2))

Where:

n - is the number of days in our case
X, Y - are the prices of the two stocks
Sum - is the sum of what's in the parentheses over all n days

In our function corr_stocks() we break that down into several smaller steps. First we calculate five different sums.
The first one is the sum of all the values in column 0 (Stock A). The second one does the same for the values in column 1 (Stock B).
The third one is Sum(X*Y) from our formula above. It multiplies the two values in each row and then sums the products.
The fourth and fifth are the sums of the squared values of each column.

After that we simply define t as the length of the index. That gives us the n in the formula above. We then calculate the numerator and the denominator of the fraction, and finally define and return the correlation between the two stocks.
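The steps just described can be sketched like this. The function name corr_stocks and the variable t come from the tutorial, while the other variable names and the sample prices are made up, so treat it as an illustration and not as the original code. As a sanity check, the result is compared with pandas' built-in Series.corr(), which also calculates the Pearson correlation by default.

```python
import math

import pandas as pd


def corr_stocks(ind):
    x = ind.iloc[:, 0]  # column 0 (Stock A)
    y = ind.iloc[:, 1]  # column 1 (Stock B)

    # The five sums from the formula
    sum_x = x.sum()
    sum_y = y.sum()
    sum_xy = (x * y).sum()
    sum_x2 = (x ** 2).sum()
    sum_y2 = (y ** 2).sum()

    t = len(ind.index)  # number of days, the n in the formula

    numerator = t * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (t * sum_x2 - sum_x ** 2) * (t * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


# Made-up prices for four days
ind = pd.DataFrame(
    {"AAPL": [10.0, 11.0, 12.0, 13.0], "GOOGL": [20.0, 19.0, 23.0, 24.0]}
)

manual = corr_stocks(ind)
builtin = ind["AAPL"].corr(ind["GOOGL"])
print("The correlation between the stocks is :", manual)
print("Matches pandas:", math.isclose(manual, builtin, rel_tol=1e-9))
```

Note that the denominator becomes zero if one of the columns is constant over the chosen time frame, in which case the division fails or returns a non-finite value. The sketch does not handle that case.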

result.png

So that's it for today! You can leave questions in the comments if you want to.
