In this first Lambda School portfolio project we are practicing the following skills:
- Dataset Selection
- Statistical Hypothesis Formulation
- Draw Conclusions & Effectively Communicate Results
Description of Data
I selected the following two datasets:
The Bitcoin Historical Price Data is from kaggle.com and contains 8 columns covering the period from 2012-01-01 to 2020-12-31:
- Timestamp: Unix Time Stamp
- Open, High, Low, Close: Price characteristic within 60s time window
- Volume: Volume of Currency Transacted within 60s time window
- Weighted_Price: Volume Weighted Average Price
To create the Google Trends search interest file, I selected the past 5 years in the time window options and downloaded a .csv file. It contains 2 columns:
- Week: date in ‘mm/dd/yyyy’ format
- bitcoin: (worldwide): normalized search interest in ‘bitcoin’ worldwide on a 0 to 100 scale, [Weekly Searches] / [Total Searches Last 5 Years]
The two datasets are sampled at very different rates:
- Google Trends data is measured every week
- Bitcoin price data is measured every minute
My Hypothesis: Bitcoin Price and Bitcoin Search Interest are correlated.
Null hypothesis: There is not a statistically significant relationship between Bitcoin Search Interest and Bitcoin Price.
Alternative hypothesis: There is a statistically significant relationship between Bitcoin Search Interest and Bitcoin Price.
A one-sample t-test and an ordinary least squares regression were performed to test the hypothesis.
The following steps were taken to prep the raw data:
- Converted all time series indexes to the pandas datetime type
- Dropped rows with NaNs in the BTC price data
- Aggregated the BTC minute price records by week and calculated the mean BTC price for each week using the pandas resample method
- Merged the Search Interest and Price DataFrames using the Search Interest file’s dates as the merge key
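The prep steps above can be sketched roughly as follows. This is a minimal sketch, not the code used for the project: the column names `Timestamp`, `Weighted_Price` and `Week` come from the dataset descriptions, while the function name `prep_weekly` and the assumption that the Google Trends weeks start on a Sunday (hence `W-SUN`) are mine and should be checked against the downloaded file.

```python
import numpy as np
import pandas as pd


def prep_weekly(price_df: pd.DataFrame, trends_df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate minute prices to weekly means and merge them onto the Trends weeks."""
    price = price_df.copy()
    # Unix seconds -> pandas datetime, then drop rows with missing prices
    price["Timestamp"] = pd.to_datetime(price["Timestamp"], unit="s")
    price = price.dropna(subset=["Weighted_Price"]).set_index("Timestamp")

    trends = trends_df.copy()
    trends["Week"] = pd.to_datetime(trends["Week"], format="%m/%d/%Y")

    # Weekly bins [Sunday, next Sunday), labelled by the Sunday that opens them.
    # ASSUMPTION: the Trends file labels each week by its starting Sunday.
    weekly = (
        price["Weighted_Price"]
        .resample("W-SUN", label="left", closed="left")
        .mean()
        .rename("price")
        .reset_index()
        .rename(columns={"Timestamp": "Week"})
    )
    # Keep only the weeks that appear in the Search Interest file
    return trends.merge(weekly, on="Week", how="inner")
```

If the Trends weeks start on a different weekday, the anchor in `"W-SUN"` would need to change to match, otherwise the inner merge silently returns no rows.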
We can see that Bitcoin Price and Search Interest are somewhat correlated at lower prices, while they are less correlated as search interest picks up.
The statsmodels.formula.api module was used to perform the t-test.
Output for alpha = .05 and 95% confidence interval:
- R² = 36.3%, p < 0.001 (reported as 0.000)
- Slope m: [178.7, 249.0], Intercept: [2855.7, 4162.6]
At the alpha = .05 level, we reject the null hypothesis and conclude there is a statistically significant relationship between BTC price and Google Search Interest.
While this correlation is high for a single sentiment measure, it is too weak to support a trading strategy that uses the model to determine when BTC is overpriced or underpriced. We would almost certainly lose money.
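For readers who want to reproduce this kind of output, here is a minimal sketch of the regression using `statsmodels.formula.api`. The column names `price` and `interest`, the helper name `fit_ols`, and the synthetic demo data are assumptions for illustration only; they are not the project's data or results.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def fit_ols(df: pd.DataFrame, y: str = "price", x: str = "interest", alpha: float = 0.05) -> dict:
    """Fit y ~ x by ordinary least squares and collect the headline numbers."""
    model = smf.ols(f"{y} ~ {x}", data=df).fit()
    ci = model.conf_int(alpha=alpha)  # one row per coefficient: lower, upper
    return {
        "r_squared": model.rsquared,
        "slope": model.params[x],
        "slope_p_value": model.pvalues[x],  # t-test of H0: slope == 0
        "slope_ci": tuple(ci.loc[x]),
        "intercept_ci": tuple(ci.loc["Intercept"]),
    }


# Synthetic demo data (NOT the Bitcoin data): a noisy linear relationship
rng = np.random.default_rng(0)
interest = rng.uniform(0, 100, size=200)
demo = pd.DataFrame({
    "interest": interest,
    "price": 3000 + 200 * interest + rng.normal(0, 500, size=200),
})
result = fit_ols(demo)
print(result)
```

The p-value reported for the slope is the one that decides whether the null hypothesis is rejected at the chosen alpha.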
Volatility and Measurement Frequency
Historically, BTC price has been very volatile. The below figure shows five 50-week periods and the % change in weekly mean price relative to the starting price. Over 50-week periods, price changes ranging from -70% to +1,600% were observed.
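The percent change relative to the starting price of a window can be computed with a few lines of pandas. This is a sketch under assumptions: `weekly_price` is taken to be a Series of weekly mean prices, and the function name and `start` argument are hypothetical.

```python
import pandas as pd


def pct_change_from_start(weekly_price: pd.Series, start: int, n_weeks: int = 50) -> pd.Series:
    """Percent change of each week in the window relative to the window's first week."""
    window = weekly_price.iloc[start:start + n_weeks]
    return (window / window.iloc[0] - 1.0) * 100.0
```

Calling it once per starting week gives one curve per 50-week period, each beginning at 0%.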
In this example, the price of BTC is measured at a much greater frequency than Bitcoin Search Interest. This was resolved by our resample operation, which took the mean of up to 10,080 one-minute BTC price measurements (7 × 24 × 60) over each week.
It follows that if we were to compare our model’s prediction of price to any real-time measurement of price, we would expect the prediction to carry an additional error term equal to the delta between the unknown real-time value of the Google Search Interest and its last measurement.
Upon observation of our initial R² and thinking through implications of our resampling, I considered whether another variable in the Bitcoin dataset would have a stronger correlation with Search. Notable points of conjecture:
- BTC liquidity and market cap have grown over the 5-year window considered, as a first set of institutional investors adopted BTC against a backdrop of constant retail growth
- Asset pricing reflects the price at which a set of market participants were willing to trade their asset, without measuring the quantity traded
These points lead me to suspect that Volume (Price * Quantity) traded would have a stronger correlation with BTC searches and would lose less information via a resample operation using sum. Let’s see:
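A minimal sketch of the volume aggregation, assuming the same weekly bins as for price. The column name `Volume` follows the dataset description above; the function name and the Sunday week anchor are assumptions.

```python
import pandas as pd


def weekly_volume(price_df: pd.DataFrame, volume_col: str = "Volume",
                  timestamp_col: str = "Timestamp") -> pd.Series:
    """Sum the per-minute traded volume inside each weekly bin."""
    ts = pd.DatetimeIndex(pd.to_datetime(price_df[timestamp_col], unit="s"))
    volume = pd.Series(price_df[volume_col].to_numpy(), index=ts).dropna()
    # ASSUMPTION: weeks run [Sunday, next Sunday) and are labelled by the opening Sunday
    return volume.resample("W-SUN", label="left", closed="left").sum().rename("volume")
```

Unlike a weekly mean of price, a weekly sum of volume keeps the total of every minute's activity, which is the sense in which less information is lost.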
Output for alpha = .05 and 95% confidence interval:
- R² = 54.8%, p < 0.001 (reported as 0.000)
- Slope m: [.021, .026], Intercept: [.074, .173]
There are fewer outliers than in Price vs. Search Interest. We reject the null again, and the correlation is stronger.
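Our results suggest that a one-point increase in BTC Search Interest (on its 0 to 100 scale) corresponds to an increase of up to about $250 in BTC price (the upper end of the slope's confidence interval), or about $24 M in volume traded, over the time period considered.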
An R² of about 55% between BTC Volume Traded and Search Interest (a correlation coefficient of roughly 0.74) is quite high. This may suggest that BTC price movement, relative to comparable assets, is largely driven by retail investors. This could be explored by performing the same exercise for precious metals or stocks.
Below are potential follow-on items to improve the BTC price and sentiment correlation. My own investment thesis is that BTC is a speculative asset experiencing boom-bust cycles that ride on top of a very interesting, growing value proposition and infrastructure:
- BTC experienced several boom/bust cycles that likely reset the retail pool of market participants, i.e. people completely exit following a bust. Using the rate of change in Google Trends over time, or a rolling normalization factor, might improve the relationship between search interest and BTC price change
- Would a sentiment indicator be a leading or lagging indicator? Could a model’s correlation be improved by time-phasing a regression?
- If a higher-frequency Google search metric had been used in the model, would it have caused a significant increase in R²? Or would we find that smoothing price over time works better?
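One way to start on the leading/lagging question is to correlate price with search interest shifted by a range of weekly lags. The sketch below is only an illustration of that idea: the column names, the function name `lagged_correlations`, and the synthetic demo series are all assumptions, and nothing here was run on the Bitcoin data.

```python
import numpy as np
import pandas as pd


def lagged_correlations(df: pd.DataFrame, x: str = "interest", y: str = "price",
                        max_lag: int = 8) -> pd.Series:
    """Pearson correlation of y with x shifted by each lag, in weeks.

    A positive lag k pairs y at week t with x at week t - k, so a peak
    at k > 0 would suggest that x leads y by about k weeks.
    """
    return pd.Series(
        {k: df[y].corr(df[x].shift(k)) for k in range(-max_lag, max_lag + 1)},
        name="correlation",
    )


# Synthetic demo (NOT the Bitcoin data): price is built to trail interest by 3 weeks
rng = np.random.default_rng(0)
interest = pd.Series(rng.normal(size=200))
demo = pd.DataFrame({"interest": interest, "price": interest.shift(3)})
corrs = lagged_correlations(demo)
best_lag = corrs.idxmax()
print(best_lag)
```

Passing week-over-week differences (for example `df.diff()`) instead of levels would be one way to test the rate-of-change idea from the first bullet.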
Finally, I am left with the following questions for further exploration as I continue through my Data Science journey:
- Smoothing/compression operations like the resample operation used here for Price lose information. It would seem important to measure the signal-to-noise ratio and quantify it in the input/output variables. When feasible, should input variables be measured at a rate that is greater than or equal to that of the output variables, subject to constraints on signal quality?
- In this exercise, one could rationally argue that the dependent/independent variables should be swapped — using price to predict how often BTC is googled. The two measurements are essentially independent measurements of the same thing: How much do people care about bitcoin? Are there methods or processes to evaluate, measure, and/or isolate co-movement in response to some other known or unknown driving factor?
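One property worth keeping in mind for the swap question: in a regression with a single predictor, R² equals the squared Pearson correlation, so it is identical whichever variable is treated as dependent. The sketch below checks this on synthetic data; the function name `r_squared` and the demo values are assumptions for illustration.

```python
import numpy as np


def r_squared(x: np.ndarray, y: np.ndarray) -> float:
    """R² of the least-squares line that predicts y from x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    return 1.0 - residuals.var() / y.var()


# Synthetic demo (NOT the Bitcoin data)
rng = np.random.default_rng(1)
x = rng.uniform(0, 100, size=300)
y = 5.0 * x + rng.normal(0, 100, size=300)

forward = r_squared(x, y)        # predict y from x
backward = r_squared(y, x)       # predict x from y
r = np.corrcoef(x, y)[0, 1]      # Pearson correlation
print(forward, backward, r ** 2)
```

Because the fit quality is symmetric, the regression alone cannot say which variable drives the other; that has to come from other evidence, such as timing.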