
Big Data: Clustering Using Expectation-Maximization Algorithm With Gaussian Mixture Models for Analysis of Trending Products from E-Commerce in Indonesia

Elvia Nur Anggraini¹, Burhanuddin Dirgantoro², Surya Michrandi Nasution³
School of Electrical Engineering, Telkom University
Bandung, Indonesia
¹ fiyahalfarizi@student.telkomuniversity.ac.id
² burhanuddin@telkomuniversity.ac.id
³ michrandi@telkomuniversity.ac.id

Abstract: E-commerce, or electronic commerce, is a collection of technologies, applications, and businesses that connect companies and individual consumers to conduct electronic transactions, buy and sell goods, and exchange information over the internet. In this work, the Expectation-Maximization algorithm with Gaussian Mixture Models is used to cluster product data into their respective categories. The clustered data is then analyzed so that trending products matching the user's search keywords can be displayed as graphs.

Keywords: e-commerce, trends, Expectation-Maximization algorithm, Gaussian Mixture Models.

I. INTRODUCTION
In the field of trade, internet technology means that business transactions no longer have to be conducted in person. The internet has been widely adopted as a medium for business activity, mainly because of its contribution to efficiency: it saves time, since sellers and buyers do not need to meet face to face, and it removes transportation constraints.

One of the benefits of an online store is that consumers can shop without leaving home and compare the price of a product across stores with only a few clicks. Trading activities or transactions through internet media are known as electronic commerce or e-commerce (Aribowo and Nugroho, 2013).

As the reach of internet services expands, e-commerce is growing very rapidly in Indonesia. Shopping habits have begun to shift from interacting directly with sellers to buying via e-commerce. In addition, some startups have opened online stores that let sellers and buyers transact online without being exposed to fraud related to goods or payments.

In conventional systems, trending products are identified through advertising, word of mouth, or by contacting merchants in stores directly, so the level of subjectivity is very high. This leads to differences of opinion among consumers about which products are considered trending.

Each e-commerce site has its own features that can make it superior to others, so buyers can choose the one that fits their needs. By analyzing data from several e-commerce sites in Indonesia, we can identify trending products through scientific analysis, which is more objective than conventional systems.

II. SYSTEM DESIGN
A. Flow process of Expectation Maximization Process
First, an initial value is set for each datum in the product data, and then the expectation step is performed. When the expectation step has finished, the maximization step recalculates the values from the result of the expectation step. After each iteration, the result of the maximization step is compared against the convergence condition. If the iteration has reached convergence, the process is done; otherwise the next iteration runs, and these steps repeat until convergence.

B. Linear regression process
Linear regression is used to compute the b value (slope) from the past 3 and 6 months of data. The b value is then used to measure and sort the products. The higher the slope, the steeper the graph becomes. The system sorts the products from the largest slope to the smallest.
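The slope calculation and sorting described for the linear regression process can be sketched as follows. The paper does not state the regression formula, so the ordinary least-squares slope b = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²) is assumed here. The function names, the product names, and the monthly sales series are hypothetical and only illustrate the idea. The only values taken from the paper are the sums reported for "s cf0062tu dts pvcy" over the past 3 months, with n = 3 assumed from that window.

```python
from typing import Dict, List, Sequence, Tuple


def slope_from_sums(n: int, sum_x: float, sum_y: float,
                    sum_xy: float, sum_x2: float) -> float:
    """Least-squares slope b = (n*Sxy - Sx*Sy) / (n*Sx2 - Sx^2)."""
    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("need at least two distinct x values")
    return (n * sum_xy - sum_x * sum_y) / denominator


def slope(y: Sequence[float]) -> float:
    """Slope of y against the month index x = 1, 2, ..., n (assumed indexing)."""
    n = len(y)
    x = range(1, n + 1)
    return slope_from_sums(
        n,
        sum(x),
        sum(y),
        sum(xi * yi for xi, yi in zip(x, y)),
        sum(xi * xi for xi in x),
    )


def rank_by_slope(sales: Dict[str, Sequence[float]],
                  months: int) -> List[Tuple[str, float]]:
    """Sort products from the largest slope to the smallest,
    using only the last `months` values of each series."""
    slopes = [(name, slope(series[-months:])) for name, series in sales.items()]
    return sorted(slopes, key=lambda item: item[1], reverse=True)


# Sums reported in the paper's 3-month table (n = 3 is an assumption).
table_b = slope_from_sums(n=3, sum_x=27, sum_y=136, sum_xy=1176, sum_x2=245)
# table_b == -24.0, matching the b (slope) value in that table

# Hypothetical monthly sales counts, oldest first; not data from the paper.
sales = {
    "product_a": [5, 8, 12, 20, 31, 45],
    "product_b": [30, 28, 25, 20, 14, 9],
    "product_c": [10, 10, 11, 10, 12, 11],
}
ranking_6 = rank_by_slope(sales, months=6)
ranking_3 = rank_by_slope(sales, months=3)
```

Passing `months=3` or `months=6` only changes the window that the slope is fitted on, which is why the same product can rank differently in the two lists.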



Slope value from 6 months data (top 3)

No.  Product name               Slope
1.   Lenovo Ideapad S5 N4205U   29.486
2.   HP S CF47TU N4205 KB       16.914
3.   S CF0062TU                 8.228

Slope value from 3 months data (top 3)

No.  Product name               Slope
1.   Lenovo Ideapad S5 N4205U   87
2.   HP S CF47TU N4205 KB       52
3.   Fujitsu A573               18
4.   S CF0062TU DTS PVCY        -24

III. SYSTEM EXPLANATION
The system consists of two stages. First, data is collected by scraping the three biggest e-commerce sites in Indonesia: Bukalapak, Tokopedia, and Shopee. Second, the data is processed and clustered using the expectation-maximization method.

A. Scraping Process
Custom code automates the scraping process, gathering the elements of a single page on an e-commerce website to extract the information needed. In this case, the product name, price, and date of user review are collected.

B. Data Pre-Processing
Before data processing, the data is filtered to remove unidentified elements that would otherwise produce null results. Some data also contains errors, or the same product name written with extra characters such as * or #, so that it does not match the desired results. Filtered data has better quality, which makes it easier to process and to obtain the desired results.
Pre-processing steps include removing special characters and punctuation and, if the product name is in a foreign language, removing common words such as "the" and "of". Duplicate records are also removed so that the data is cleaner before processing.

C. Clustering Process
● Stage (E): the function calculates a value for the data that serves as the reference for the initial expectations (similar to K in K-Means).
● Stage (M): after the expected value is found, this stage calculates the parameters used as a reference to find the data with the highest (maximum) similarity to the initial expectations from stage (E).
● This flow is repeated until the data set converges.

D. Equations
The equation used for the Gaussian distribution is:

$$ N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) $$

where μ is the mean and σ is the standard deviation (σ² is the variance), then:

[Equation not recoverable from the source]

Every datum is evaluated against the probability of each cluster, so the equation becomes:

[Equation not recoverable from the source]

Legend:
T = Transpose
μ = Mean
σ = Standard deviation
x = Datum from the dataset to be evaluated
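The E and M stages can be sketched for one-dimensional data as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the quantile-based initial values, the log-likelihood convergence test, and the synthetic price data are all hypothetical, and only the univariate Gaussian density is used.

```python
import math
import random
from typing import List, Sequence


def gaussian_pdf(x: float, mean: float, std: float) -> float:
    """Univariate Gaussian density N(x | mean, std^2)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def fit_gmm_em(data: Sequence[float], k: int,
               max_iter: int = 200, tol: float = 1e-6):
    """Fit a k-component 1-D Gaussian mixture with EM.

    Returns (weights, means, stds, resp), where resp[i][j] is the
    responsibility of cluster j for datum i.
    """
    n = len(data)
    ordered = sorted(data)
    # Initial values (assumption): evenly spaced quantiles as means,
    # a shared spread, and equal mixing weights.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    overall_mean = sum(data) / n
    overall_std = math.sqrt(sum((x - overall_mean) ** 2 for x in data) / n)
    stds = [overall_std / k + 1e-6] * k
    weights = [1.0 / k] * k
    prev_ll = -math.inf
    resp: List[List[float]] = []
    for _ in range(max_iter):
        # E step: responsibility of each cluster for each datum.
        resp = []
        ll = 0.0
        for x in data:
            dens = [weights[j] * gaussian_pdf(x, means[j], stds[j])
                    for j in range(k)]
            total = max(sum(dens), 1e-300)  # guard against underflow
            resp.append([d / total for d in dens])
            ll += math.log(total)
        # M step: re-estimate weights, means, and standard deviations.
        for j in range(k):
            nk = sum(r[j] for r in resp) + 1e-12
            weights[j] = nk / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nk
            var = sum(r[j] * (x - means[j]) ** 2
                      for r, x in zip(resp, data)) / nk
            stds[j] = math.sqrt(var) + 1e-6
        # Convergence: stop when the log-likelihood stops changing.
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return weights, means, stds, resp


# Hypothetical example: two synthetic price groups, not data from the paper.
rng = random.Random(42)
prices = ([rng.gauss(100.0, 5.0) for _ in range(200)]
          + [rng.gauss(300.0, 10.0) for _ in range(200)])
weights, means, stds, resp = fit_gmm_em(prices, k=2)
# Hard cluster label per datum: the cluster with the largest responsibility.
labels = [max(range(2), key=lambda j: r[j]) for r in resp]
```

Unlike K-Means, each datum keeps a soft membership in every cluster (`resp`); the hard labels are derived only at the end by taking the most responsible cluster.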
IV. ANALYSIS RESULT
A. In the first test, linear regression calculated on the last 3 months of data showed a different trend from the calculation on the last 6 months of data.

B. This difference arises because the slope of the regression line differs when the input is the last 3 months of data rather than the last 6 months.

C. When trend charts are analyzed using moving averages and linear regression on the same data, the results differ for the product "s cf0062tu dts pvcy". With linear regression, the slope of the graph is -24. When trends are calculated for the past 3 months, moving averages show a trend that tends to be flat, which is very different from the original sales chart, where the trend is downward. Because the linear regression prediction uses the slope of the graph, the regression result agrees with the original sales data, i.e. the trend is down.

D. Product "s cf0062tu dts pvcy"

Fig. 2. "s cf0062tu dts pvcy" sales graph from January until November

No.  Name       Value
1.   Sum X      27
2.   Sum X²     245
3.   Sum Y      136
4.   Sum Y²     7360
5.   Sum XY     1176
6.   b (slope)  -24

Fig. 3. Table of sales data for the linear regression analysis of the past 3 months

The slope of product "s cf0062tu dts pvcy" is negative (-24), while the moving average graph shows the trend continuing into the next month.

V. CONCLUSION
Based on the results of this thesis, it can be concluded that:

1. The moving average analysis of the product "s cf0062tu dts pvcy" shows a flat trend that tends to go up, whereas the linear regression analysis shows a decrease because the calculated slope is negative (-24). Predictions using linear regression accurately describe sales conditions for data with high fluctuations.

2. Moving averages are not suitable for determining trends because they can show false-positive trends, as shown in conclusion point 1.

3. Linear regression provides much more realistic results because it measures the slope of the past 3 months of sales data.
