Bitcoin daily price movement prediction using Twitter Sentiment Analysis

Faridoon Farahi
6 min read · Apr 19, 2021

Bitcoin is a decentralized digital currency that was first introduced in 2009 by an unknown person who goes by the alias Satoshi Nakamoto. Online tools and platforms are increasingly used to predict its price. One such tool is Twitter, a social media platform where people post “tweets” to express their opinions and sentiments about all kinds of subjects, including Bitcoin. For this reason, I chose Twitter as the data source for sentiment analysis. The project analyzes the tweets of the 70 most influential people in the world of cryptocurrencies and uses their sentiment to predict the price of Bitcoin. Sentiment analysis follows the lexicon-based approach, which relies on a dictionary of words where each word has a semantic score. The sentiment scores are calculated with VaderSentiment, a Python library. VaderSentiment ships with its own dictionary, but its word scores were assigned without any specific context, so that dictionary is not of great use for this project. Instead, I developed a dedicated dictionary of words with semantic scores specific to Bitcoin and passed it to VaderSentiment to calculate the sentiment score of each tweet. VaderSentiment assigns each word a score between -4 and 4. The sentiment score of a tweet is the sum of the scores of all its words, normalized to the range -1 to 1 using the formula x / sqrt(x² + α), where x is the summed score and α = 15. K-Nearest Neighbor, Logistic Regression, Multi-Layer Perceptron Classifier, and Support Vector Machine Classifier are the algorithms chosen for this project, based on their ability to deal with over-fitting and to perform well on data without noise. Of these, MLP performed best, with an accuracy of 96%.
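To make the scoring step concrete, here is a minimal sketch of lexicon-based scoring followed by the normalization described above. It is a simplified stand-in written in plain Python, not VaderSentiment itself, which also applies extra rules for things like negation, intensifiers, and punctuation. The lexicon entries and the function names are hypothetical and chosen only for illustration; they are not the project's actual dictionary.

```python
import math

# Hypothetical Bitcoin-specific lexicon: word -> score between -4 and 4.
# These entries are illustrative only, not the project's real dictionary.
BITCOIN_LEXICON = {
    "bullish": 3.0,
    "moon": 2.5,
    "buy": 1.5,
    "dump": -2.5,
    "crash": -3.5,
    "bearish": -3.0,
}


def normalize(score, alpha=15):
    """Squash an unbounded summed score into the range -1 to 1."""
    return score / math.sqrt(score * score + alpha)


def tweet_sentiment(text, lexicon=BITCOIN_LEXICON, alpha=15):
    """Sum the lexicon scores of the words in a tweet, then normalize."""
    # Strip common punctuation and lowercase each whitespace-separated token.
    words = (w.strip(".,!?:;\"'()#").lower() for w in text.split())
    # Words that are not in the lexicon contribute nothing to the sum.
    total = sum(lexicon.get(w, 0.0) for w in words)
    return normalize(total, alpha)


print(tweet_sentiment("Bitcoin looks bullish, time to buy!"))  # positive
print(tweet_sentiment("Another crash incoming, dump it."))     # negative
```

Because of the square root in the denominator, the normalized score approaches but never reaches -1 or 1, no matter how many strongly scored words a tweet contains.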

Bitcoin gained popularity in 2017, when its price crossed the $1,000 mark, and the price continued to soar afterwards. As the first cryptocurrency to break the $1,000 mark in the course of a year, Bitcoin attracted the attention of investors, consumers, financial institutions, and businesses alike. Because it is decentralized, transactions can take place without an intermediary such as a bank. The elimination of the middle-man, coupled with the absence of the fees an intermediary would charge, was the reason for the surge in Bitcoin's popularity. People became even more interested because the government had no control over Bitcoin and could not impose any transaction fees. Additionally, the anonymity of the transactions made people more inclined to buy. The naivety and sentiment of people can cloud their judgment, leading them to buy and sell Bitcoin with no understanding of the situation. For example, a broker gets a tip that Bitcoin could really excel. “Excel” is misheard as “sell”, which causes panic, and people start selling Bitcoin. Similarly, “goodbye” is misunderstood as “good buy”, an indicator to buy Bitcoin, and people start buying without giving it a second thought. Such cases show that people are often naïve and can let sentiment get the better of them when deciding to buy or sell Bitcoin.

Using social media for sentiment analysis has its own problems and challenges. One challenge is the ambiguity of language, which can arise from several factors. For instance, it can be difficult to figure out what object is being talked about. Suppose a tweet has the text “It will rise”. The tweet says something will rise, but the subject is unclear. The text could refer to an increase in temperature, or simply to a balloon rising in the air after it is let go. It could be about the price of a stock (which stock?) or about a revolution. Similarly, it could be about the price of Bitcoin or about inflation in a country. Thus, the ambiguity of language is a challenge that can be difficult to deal with.

The dynamic nature of language is another challenge associated with sentiment analysis. Anything that is characterized by constant change is said to be dynamic, and the English language is no exception. New words are added while old ones become obsolete or disappear. In the case of Bitcoin, this dynamism means that people will use new words to refer to Bitcoin. For example, Bitcoin is abbreviated as BTC. Someone might decide to refer to Bitcoin as BC, or bcoin. This could lead to BC or bcoin becoming the new trend, rendering the words Bitcoin, bitcoin, BTC, and btc obsolete. A prime example is how XBT has come into use as a new abbreviation for Bitcoin. Such new trends can be difficult to predict and can further add to the problem of ambiguity.
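One possible way to cope with shifting names is to map every known alias to a single canonical token before scoring. The sketch below assumes a hand-maintained alias list; the list contents and the function names are hypothetical, and a new abbreviation would not be recognized until someone adds it, which is exactly the difficulty described above.

```python
import re

# Hypothetical alias list; it has to be updated by hand as usage shifts.
BITCOIN_ALIASES = ["bitcoin", "btc", "xbt"]

# Match any alias as a whole word, optionally prefixed by '#' or '$',
# ignoring case (so "BTC", "#btc" and "$XBT" all match).
ALIAS_PATTERN = re.compile(
    r"[#$]?\b(?:" + "|".join(map(re.escape, BITCOIN_ALIASES)) + r")\b",
    re.IGNORECASE,
)


def mentions_bitcoin(text):
    """Return True if the text contains any known Bitcoin alias."""
    return ALIAS_PATTERN.search(text) is not None


def canonicalize(text):
    """Replace every known alias with the canonical token 'bitcoin'."""
    return ALIAS_PATTERN.sub("bitcoin", text)


print(canonicalize("$BTC and XBT are up"))
print(mentions_bitcoin("bcoin to the moon"))  # alias not in the list yet
```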

Another challenge in sentiment analysis is the detection of sarcasm. For example, a tweet has the text “The drive here was pretty safe as Jim only drove at 60 km/h”. There is no sarcasm in this tweet, since driving at 60 km/h is safe and usually under the speed limit, the standard for which is 70 km/h. Another tweet has the text “The drive here was safe as Jim only drove at 120 km/h”. This tweet is sarcastic, since 120 km/h is way above the standard speed limit. The difference may be obvious to humans, who can pick up on sarcasm, but it is not obvious to machines. There are methods to detect sarcasm, but they are not always successful.

Twitter has a character limit of 280 characters per tweet. This ensures people send tweets that are not too lengthy. However, only 1% of the users reach this limit. People keep their tweets short and concise. In recent years, emojis (otherwise known as emoticons) have been increasingly used to express emotions, moods, and feelings so much so that people now prefer using emojis rather than words to express themselves. This “lazy” style of expressing oneself makes sentiment analysis much more difficult as it further adds to the problem of ambiguity.

Implementing every available algorithm is not practical: depending on the type of data, some algorithms perform exceptionally well in terms of time and memory, while others take too long or require too much memory. Understanding which classification algorithm suits which type of dataset can save a lot of time and memory. Therefore, I used the following algorithms:

K Nearest Neighbors

When data is free of noise, K Nearest Neighbors (KNN) can perform exceptionally well. Noise-free here means that the class labels contain no values that are unrelated to the data.

KNN performs well on a small dataset that is not complex. This is because the KNN Classifier is a “lazy learner”: it has no explicit training phase, so it does not learn a discriminative function (a function of a set of variables that is evaluated for samples of events or objects and used as an aid in discriminating between or classifying them). Instead, it stores the training data and compares each new sample against it at prediction time. If a dataset is large or complex, the KNN Classifier won’t perform well.
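The “lazy learner” idea can be sketched in a few lines of plain Python. This is an illustration, not the project's code: the two features (a daily sentiment score and a scaled tweet count), the toy data, and the function name are all assumptions made for the example. Note that there is no training step at all; every prediction scans the full training set, which is why the cost grows with the size of the dataset.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # No training phase: measure the distance to every stored sample.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    # Keep the labels of the k closest samples and take the most common one.
    top_labels = [label for _, label in distances[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Hypothetical toy data: (daily sentiment score, scaled tweet count).
train_X = [(0.8, 0.6), (0.6, 0.9), (0.7, 0.4),
           (-0.5, 0.7), (-0.7, 0.3), (-0.6, 0.8)]
train_y = ["up", "up", "up", "down", "down", "down"]

print(knn_predict(train_X, train_y, (0.5, 0.5)))
print(knn_predict(train_X, train_y, (-0.4, 0.5)))
```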

Support Vector Machine

The Support Vector Machine (SVM) Classifier, also known as the Support Vector Classifier (SVC), performs best when the data either has high dimensionality, i.e. a high number of columns (features), or is sparse. Furthermore, SVC can yield excellent results irrespective of the linearity of the data. In other words, whether the data is linearly separable or not, SVC can perform exceptionally well.
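The reason an SVC can handle non-linear data is that a kernel implicitly maps the samples into a space where a linear boundary is enough. The sketch below illustrates only that idea with an explicit feature map and hand-picked weights; it is not a trained SVM, and the toy data, the map, and the weights are assumptions made for the example.

```python
# 1-D toy data: class 1 sits on both sides of class -1, so no single
# threshold on x can separate the two classes.
xs = [-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0]
ys = [1, 1, -1, -1, -1, 1, 1]


def feature_map(x):
    """Explicitly map x to (x, x^2); a kernel does this kind of step implicitly."""
    return (x, x * x)


def linear_rule(features, w=(0.0, 1.0), b=-2.0):
    """A linear decision rule in the mapped space, with hand-picked weights."""
    score = sum(wi * fi for wi, fi in zip(w, features)) + b
    return 1 if score >= 0 else -1


predictions = [linear_rule(feature_map(x)) for x in xs]
print(predictions == ys)
```

In the mapped space the rule reduces to checking whether x² is at least 2, which is a straight line there even though it is not a single cut on the original axis.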

Multi-Layer Perceptron

The Multi-Layer Perceptron (MLP) is a good choice for classification problems and can perform well if the data is in tabular form, such as the data found in CSV files or spreadsheets. While MLP performs very well on image data, it can also yield excellent performance on text data and time-series data.
