BitCoin price-sentiment analysis

DATA MINING PROJECT REPORT

Professor: Hadikadi Amer

Students: Asmir Avdicevic
Anis Borcak
Adna Kolakovic
Mirza Masic

June, 2014
Sarajevo

BitCoin price-sentiment analysis

AA, AB, AK, MM

Contents
1. Project Definition
2. Data location and collection
3. Data preparation, pre-processing, integration and exploration

1. Project Definition
Bitcoin is a peer-to-peer version of electronic cash that allows online
payments to be sent directly from one party to another without going
through a financial institution. During the last quarter of 2013 the price
of bitcoin rose very rapidly and reached a record of $1,242 per coin on
November 29. For comparison, on the same day spot gold prices hit $1,240
per ounce. Currently there are more than 12 million bitcoins in
circulation, and the rate at which new bitcoins are created is halved every
four years until the maximum of 21 million coins is reached. After the
record in November, the price plunged to around $600 and then stagnated
around that level with slight ups and downs. Today the price of bitcoin is
$617, after slight growth in May 2014. Because the price of a virtual
currency surpassed the price of an ounce of gold, if only at one point of
one day, we will try to analyze whether there is a correlation between
Twitter posts, called tweets, and the price of bitcoin. If there is a
correlation, it can be a good starting point for predicting future plunges
or jumps in the bitcoin price.

2. Data location and collection


The data source we chose is Twitter. Twitter is a social platform that
allows users to post short pieces of text called tweets. For data mining
purposes we can use 1% of all tweets and can select tweets containing
certain keywords. The keyword we used is #bitcoin. Because this is
essentially real-time data in raw format, each tweet we collect comes with
a huge amount of junk that we must discard to get the data we need. After
that we must adjust the raw data so it can be inserted into tables that can
later be used for analysis.
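
The keyword selection and junk removal described above can be sketched as
follows. This is a minimal illustration, not the code used in the project:
it assumes the raw stream is stored as one JSON object per line, and the
field names (created_at, text, lang) and the helper slim_tweet are
hypothetical, since the report does not list the attributes that were kept.

```python
import json

# Hypothetical field names; the report does not list the attributes it kept.
KEPT_FIELDS = ("created_at", "text", "lang")
KEYWORD = "#bitcoin"


def slim_tweet(raw_line):
    """Return a reduced record for one raw JSON tweet, or None to discard it."""
    try:
        tweet = json.loads(raw_line)
    except json.JSONDecodeError:
        return None  # malformed line in the raw stream
    if not isinstance(tweet, dict):
        return None
    text = tweet.get("text") or ""
    if KEYWORD not in text.lower():
        return None  # tweet does not mention the tracked keyword
    # Keep only the attributes needed later; everything else is junk here.
    return {field: tweet.get(field) for field in KEPT_FIELDS}


raw = (
    '{"created_at": "2014-05-01T12:00:00", "text": "Buying #Bitcoin today", '
    '"lang": "en", "profile_image_url": "http://example.invalid/a.png"}'
)
record = slim_tweet(raw)
print(record)
```

Records reduced this way have a fixed set of columns, which is what makes
them straightforward to insert into a table.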

3. Data preparation, pre-processing, integration and exploration
Before preprocessing and discarding unnecessary data we had a 13.7 GB file.
To ease the data mining process we had to preprocess the data. During
preprocessing we discarded all irrelevant attributes such as profile
pictures, backgrounds, etc. The file we got after these two steps was
1.76 GB (7.2 million records), and we concluded that this was sufficient,
considering that one tweet with all relevant attributes takes approximately
256 bytes (1.76 GB over 7.2 million records is roughly 250 bytes per
record). The next step was to remove all non-English tweets by language
filtering and to remove spam by taking the most frequent words and using a
naive Bayesian classifier to decide which tweets are spam. To speed things
up, we included missing-data handling within the spam filter. Because we
need a timestamp for our mining process, and time is very hard to fill in
for missing values, we discarded all tweets without a timestamp.
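
A naive Bayesian spam filter over the most frequent words could look roughly
like the sketch below. It is an illustration under stated assumptions, not
the filter used in the project: the report does not say how training labels
were obtained, so the small hand-labeled training set, the function names
and the vocab_size parameter are all hypothetical. The sketch also assumes
that both classes appear in the training data.

```python
import math
from collections import Counter


def train_naive_bayes(labeled_tweets, vocab_size=1000):
    """Count words per class from (text, spam_flag) pairs.

    Only the `vocab_size` most frequent words are kept as features.
    """
    word_totals = Counter()
    class_words = {True: Counter(), False: Counter()}
    class_docs = Counter()
    for text, spam in labeled_tweets:
        words = text.lower().split()
        word_totals.update(words)
        class_words[spam].update(words)
        class_docs[spam] += 1
    vocab = {word for word, _ in word_totals.most_common(vocab_size)}
    return vocab, class_words, class_docs


def is_spam(text, model):
    """Classify one tweet with Laplace-smoothed multinomial naive Bayes."""
    vocab, class_words, class_docs = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label in (True, False):
        # Log prior of the class (assumes the class occurs in training data).
        score = math.log(class_docs[label] / total_docs)
        # Add-one smoothing over the vocabulary.
        denom = sum(class_words[label][w] for w in vocab) + len(vocab)
        for word in text.lower().split():
            if word in vocab:
                score += math.log((class_words[label][word] + 1) / denom)
        scores[label] = score
    return scores[True] > scores[False]


# Hypothetical hand-labeled examples: True marks spam.
training = [
    ("free bitcoin click here now", True),
    ("win free bitcoin click link", True),
    ("bitcoin price fell again today", False),
    ("merchants accept bitcoin payments today", False),
]
model = train_naive_bayes(training)
print(is_spam("click here for free bitcoin", model))
print(is_spam("bitcoin price fell today", model))
```

Working with log probabilities avoids numeric underflow when many word
probabilities are multiplied, and the add-one smoothing keeps a word that
was never seen in one class from forcing that class to zero probability.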

For spam reduction we also discarded all records with a word count lower
than 3 and records whose tweet contains non-ASCII characters, because those
are the ones we cannot analyze with confidence. After these filtering steps
we had around 1.1 million records, which was 252 MB.

With the data acquired after filtering we began sentiment analysis.
Sentiment analysis is done with a list of words rated for valence, arousal
and dominance. After successful sentiment analysis we should get a
three-dimensional map of tweet moods, but sentiment analysis removes
records that cannot be analyzed. After sentiment analysis we were left with
80 MB of data, or 335,000 individual records, each of which has new derived
attributes related to sentiment analysis: mood, mean valence, mean arousal,
mean dominance and intensity of the mood. The attribute mood has 20
different values, and each of those can have a different arousal, valence,
dominance and intensity.

After preparing the data acquired from Twitter, we must take historical
bitcoin price data with time information from one of the largest bitcoin
exchange websites. The next step is to match each tweet with the
corresponding bitcoin price by using the relevant timestamp. Finally, we
need to adjust our data set for WEKA. Because WEKA accepts CSV input, we
convert our data set to that format.
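
The sentiment scoring, price matching and CSV export steps described above
can be sketched together as follows. This is only an illustration under
several assumptions: the lexicon entries and their values are invented for
the example, timestamps are assumed to be integer seconds, and a tweet is
assumed to be matched to the latest price recorded at or before it, since
the report does not state the exact matching rule. The derived mood and
intensity attributes are not reproduced here because the report does not
describe how they were computed.

```python
import bisect
import csv
import io

# Hypothetical lexicon entries: word -> (valence, arousal, dominance).
# The values are illustrative, not taken from the word list used in the report.
VAD_LEXICON = {
    "happy": (8.0, 6.0, 7.0),
    "crash": (2.0, 7.0, 3.0),
    "profit": (7.0, 5.0, 6.0),
}


def mean_vad(text):
    """Average valence, arousal and dominance over the lexicon words in a tweet.

    Returns None when no word is in the lexicon, so the record can be dropped.
    """
    hits = [VAD_LEXICON[w] for w in text.lower().split() if w in VAD_LEXICON]
    if not hits:
        return None
    return tuple(sum(dim) / len(hits) for dim in zip(*hits))


def price_at(timestamp, price_times, prices):
    """Latest known price at or before `timestamp`, or None if there is none.

    `price_times` must be sorted in ascending order.
    """
    i = bisect.bisect_right(price_times, timestamp) - 1
    return prices[i] if i >= 0 else None


def build_csv(tweets, price_times, prices):
    """Write one CSV row per tweet that has both a VAD score and a price."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["timestamp", "valence", "arousal", "dominance", "price"])
    for timestamp, text in tweets:
        vad = mean_vad(text)
        price = price_at(timestamp, price_times, prices)
        if vad is None or price is None:
            continue  # tweet cannot be analyzed or has no matching price
        writer.writerow([timestamp, *vad, price])
    return out.getvalue()


# Hypothetical price history and tweets (timestamps in seconds).
price_times = [1000, 2000, 3000]
prices = [600.0, 610.0, 617.0]
tweets = [
    (1500, "happy profit today"),
    (2500, "nothing to score"),
    (3000, "crash"),
]
csv_text = build_csv(tweets, price_times, prices)
print(csv_text)
```

The second tweet contains no lexicon word and is dropped, which mirrors how
sentiment analysis reduces the number of records. The resulting text, with
its header row, is in a form that WEKA's CSV loader can read.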
