
Introduction to Classification Algorithms

Let's see an introduction to classification algorithms.


How to Build Your Own CDN With Kubernetes
Design and code to deploy a self-hosted content delivery network.
by Ilhaan Rasheed · Oct. 07, 19 · Cloud Zone

kubeCDN
Kubernetes goes worldwide with kubeCDN.
In this blog post, Ilhaan discusses the design and implementation of kubeCDN, a
tool designed to simplify geo-replication of Kubernetes clusters in order to deploy
services with high availability on a global scale.

More Users, More Problems


The internet has transformed how people exchange ideas and share results across the
globe. Nevertheless, this medium of innovation is nowhere near perfect. Since the
internet has become the first stop for most information, any hindrance in getting
what we want quickly can be excruciatingly frustrating. We all know that
frustration of waiting those few extra seconds (or even worse, minutes!) for a
shared document or video stream to reach you. Given that we have become extremely
dependent on internet-based systems for all facets of life, we end up having to
stare at "buffering" animations for a larger portion of our days than we would
like. This UX degradation causes users to become frustrated with services and, in
the presence of competing options, pushes them to pursue other alternatives. No
business wants to lose customers this way, but reducing this type of customer loss
comes with several engineering challenges.

Users suffer when they don't get what they want (Source:
https://i.gifer.com/7Pzx.gif)
One of the main reasons for user-perceived latency in internet service delivery is
that servers are located far from users. If your infrastructure is located in one
part of the world and serves a global customer base, users located on the other
side of the world will see increased latency when using your service. This
increased latency leads to delays in data retrieval, more buffering animations and
frustrated users.

Users located farther from servers see higher latency. (Image source)
Deploying servers closer to users' locations can reduce this latency, but doing so
can be challenging, as managing global infrastructure requires more capital and
personnel investments.

Current Solution: Content Delivery Network (CDN) Providers


One way to overcome this challenge is to use a third-party content delivery network
provider. CDN providers like Akamai, CloudFlare and Fastly allow companies to build
infrastructure in one region and scale their services to a global audience.

You may also enjoy: Build Your Own CDN in 5 Steps


CDN providers minimize user access time for your service by caching static content
across their points of presence (PoPs) worldwide. These PoPs consist of several
edge servers providing cached content to users requesting static assets like
images. If a requested asset is not found, the edge server pulls the asset either
from the origin server (i.e., your server) or from nearby edge servers. This setup
improves the UX for geographically distributed users as they see reduced latencies
and packet loss.
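The cache-then-origin behavior described above can be sketched in a few lines. This is a hypothetical in-memory edge cache for illustration, not how any particular CDN implements it:

```python
# Minimal sketch of an edge server's cache lookup: serve from the local
# cache when possible, otherwise fetch from the origin and cache the result.
class EdgeServer:
    def __init__(self, origin_fetch):
        self.cache = {}                     # asset path -> cached content
        self.origin_fetch = origin_fetch    # callable that reaches the origin

    def get(self, path):
        if path in self.cache:              # cache hit: no round trip to origin
            return self.cache[path]
        content = self.origin_fetch(path)   # cache miss: pull from the origin
        self.cache[path] = content
        return content
```

The second request for the same asset is then served without the long round trip, which is exactly the latency win a PoP provides.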

Read more about CDNs and how they work here.

Issues Using CDN Providers


While CDN providers offer a convenient solution to a difficult problem, there are
issues associated with these services.

The first issue is that, when utilizing a third-party CDN to deliver your service,
you give up some control over your infrastructure. You become reliant upon an
external entity to ensure that your users get the best experience. An even more
important consideration here is the security of data being transmitted from your
service to users. Bugs in a third-party CDN provider's system, such as this one
from 2017, can have serious implications for the security and privacy of your
users.

Another issue with CDN providers is that they may not always have your best
interests in mind. You're likely not their only customer and, at the end of the
day, a third-party CDN is a business: it will prioritize its own profits over the
quality of your service.

Finally, the most important issue with using a CDN provider is the amount of
insight they gain into your business. A CDN provider can determine the locations of
your customers, the times at which they use your service, and the types of devices
they use to access it. Business data like this can be very valuable to your
competitors, and leaking it to a third party can make your business vulnerable to
customer loss.

It is important to keep in mind that, for many teams, these trade-offs might be
worth the convenience of using a CDN provider. However, many top engineering teams
like Netflix and LinkedIn have decided to handle this internally. What if you want
to do the same and build your own CDN?

kubeCDN
This is where kubeCDN comes into play. kubeCDN is a self-hosted content delivery
network based on Kubernetes. As a self-hosted solution, you maintain complete
control over your infrastructure. It eliminates the need for a third-party service
to deliver content, and restores control over the flow of data from your servers to
users' devices.

Design and Architecture


kubeCDN uses Terraform to deploy EKS and other AWS infrastructure components in a
chosen region. Route53, a cloud domain name system (DNS) from AWS, is used to route
users between multiple regions and ExternalDNS is used to automatically create DNS
records when new services are deployed.

The image below illustrates how Terraform is used in kubeCDN.

Terraform and AWS

Terraform is used to deploy EKS infrastructure in selected regions.


While Terraform is used to deploy the infrastructure necessary for kubeCDN, Route53
is used to route user traffic to specific regions. For my demonstration of kubeCDN
shown here, I set up a video server in two AWS regions, us-east-1 and us-west-2. I
set up a hosted zone on Route53 for my domain and set A records for each region
where I deployed clusters. I used a latency-based routing policy to route users to
the region that provided them the lowest latency. In this demonstration, the user
was always routed to the geographically closest region. However, please note that
this may not always be the case. Latency measurements on the internet can change
over time, and these are the measurements that Route53 uses to determine where to
route users when this routing policy is implemented. Read more about this here.
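The latency-based routing setup described above can also be driven from code. The sketch below (hypothetical domain and IPs, not the article's actual configuration) builds the change batch that Route53 expects for two latency-routed A records sharing one name; in practice the dict would be passed to boto3's `change_resource_record_sets()`:

```python
# Sketch: building the Route53 change batch for two latency-based A records.
# The domain and IP addresses are placeholders for illustration.
def latency_record(name, region, ip):
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "A",
            "SetIdentifier": region,  # must be unique among records sharing a name
            "Region": region,         # region whose latency Route53 measures
            "TTL": 60,
            "ResourceRecords": [{"Value": ip}],
        },
    }

change_batch = {
    "Changes": [
        latency_record("video.example.com", "us-east-1", "192.0.2.10"),
        latency_record("video.example.com", "us-west-2", "192.0.2.20"),
    ]
}
```

Route53 then answers each DNS query with whichever record has historically shown the lowest latency for the resolver's network.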

kubeCDN and Route53

kubeCDN uses Route53 to route user traffic to regions that provide lower latency.
The figure above illustrates how Route53 routes user traffic with clusters set up
in the two regions mentioned earlier. The user in San Francisco is routed to the
cluster in us-west-2 instead of the cluster in us-east-1 because it provided a
lower latency when the demonstration was conducted.

There are several other routing policies available on Route53 that can be used with
kubeCDN to accommodate various application requirements. These are shown here.

Problems Solved
kubeCDN makes it easy to scale services and applications globally in minutes. It
took about 15 minutes to deploy the infrastructure needed for the demonstration
mentioned above, a significant reduction compared to a manual deployment using the
AWS console.

Apart from the ease of scaling, kubeCDN can also optimize infrastructure costs by
tearing down regions during periods of low user activity. This level of
infrastructure cost optimization can be crucial when budgets are tight to ensure
maximum profitability.

Development Challenges
Like any other project, I faced a few issues developing kubeCDN. While most were
minor and fairly easy to overcome, some were more challenging and required me to
devise workarounds. Two such challenges are described below.

Terraform Code Refactor


The Terraform code used in kubeCDN to deploy EKS clusters to a region is based on a
repository created for this webinar from HashiCorp. While the repository is very
clear, with detailed instructions, it is not capable of multi-regional deployments.
The target region is hard-coded and would require the repository to be replicated
for each desired region. This is a tedious manual process that is vulnerable to
misconfiguration due to human error. It would also lead to a haphazard management
structure, making it difficult to monitor infrastructure and optimize for cost.

Refactoring the Terraform code is a way to overcome this issue, however, the
challenge lies in how this refactoring would take place. After reading a relevant
post and taking a look at another open-source project with similar requirements, I
determined that the structure shown below would be the best way to organize the
infrastructure portion of the project at this stage.

+-- terraform            (Dir containing all infrastructure code)
    +-- cluster          (EKS cluster components)
    |   +-- main.tf
    |   +-- outputs.tf
    |   +-- rbac.yaml
    |   +-- variables.tf
    +-- main.tf          (Main .tf file where regions are specified)
    +-- variables.tf
(Some files have been removed for brevity)
In the directory structure shown above, main.tf is where all desired regions are
specified. A region deployment is defined using the following snippet:

module "<name-for-region-deployment>" {
  source = "cluster"
  region = "<AWS-region>"
}


This enables easy definition of new regions in a single config file that is also
easy to manage. I'll discuss additional possibilities for improvement in the
Extending kubeCDN section below.
ExternalDNS Issues
ExternalDNS is a tool that dynamically configures DNS records via Kubernetes
resources. The tool can interface with DNS providers such as Route53 and many
others. When deploying a new service to your Kubernetes cluster, such as a web
server, ExternalDNS creates DNS records so your web server is discoverable using
public DNS servers.
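For illustration, a Kubernetes Service can request a DNS record from ExternalDNS through an annotation. A minimal sketch, assuming a hypothetical video-server Service and hostname (not kubeCDN's actual manifest):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: video-server
  annotations:
    # ExternalDNS watches Services for this annotation and creates the record
    external-dns.alpha.kubernetes.io/hostname: video.example.com
spec:
  type: LoadBalancer
  selector:
    app: video-server
  ports:
    - port: 80
      targetPort: 8080
```

When the Service is deployed, ExternalDNS resolves the load balancer's address and writes the corresponding record into the configured provider, such as Route53.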

For kubeCDN, I intended to use ExternalDNS to automatically create DNS records for
different regional deployments and configure Route53 to use a latency-based routing
policy to route users to the region providing the lowest latency. While I was able
to achieve this, I had to work around a few issues with ExternalDNS that prevented
me from fully automating the dynamic configuration of DNS records for services in
different regions.

I wanted to set two A records for my domain, one for each deployed region (my
demonstration only used two regions). Combined with Route53's latency-based
routing policy, this would direct users to the AWS region providing the lowest
latency: one record pointing to my East Coast infrastructure and the other to my
West Coast infrastructure. I incorporated ExternalDNS into kubeCDN and configured
my video test service to use ExternalDNS and set a DNS record when deployed.

I did notice that ExternalDNS overwrites A records for the same domain name, even
if the IP address is different. This seems to be a temporary limitation of the AWS
provider in ExternalDNS. Given that ExternalDNS is a new tool and still incubating
as a project, I decided to manually address this issue for now and will revisit it
again in the future. This GitHub issue refers to the problem and has been closed
since a pull request to fix it was submitted. However, at this point, the pull
request still has not been merged into the master branch of ExternalDNS.

Another limitation of the AWS provider in ExternalDNS, as noted here (open issue on
Github at the time of writing), is the inability to set latency-based routing on
Route53. This was addressed by manually setting the routing policy on AWS Console.

The issues noted here are specific to the AWS provider on ExternalDNS. However, the
same issues may not exist for other cloud providers on ExternalDNS or in future
versions of ExternalDNS. These will be revisited in the future as the project
develops and more features are added.

Extending kubeCDN
Given that kubeCDN was developed over a short period of time (three weeks), there
is still room for improvement and I have several ideas to extend the project and
its capabilities.

Multi-Cloud Support
At the moment, kubeCDN is solely based on AWS. This is mainly due to the fact that
Fellows at Insight are given AWS credits for projects. Using only one cloud
provider poses an issue in the event of a provider outage. Adding support for
multiple cloud providers, such as GCP and Azure, will provide necessary failover
in such scenarios.

This does present a few challenges with respect to the Kubernetes aspect of
kubeCDN. Currently, kubeCDN uses the EKS-managed Kubernetes service from AWS.
Adding support for other cloud providers will require one of the following:

Incorporate managed Kubernetes services from all cloud providers. While this will
simplify deployment, incorporating different managed services into kubeCDN will be
quite challenging.
Use a custom Kubernetes deployment. This will involve using kops to install
Kubernetes directly. It will allow clusters to be uniform across all cloud
providers and enable increased flexibility for unique deployment scenarios. In
addition, kops integration will allow teams to use kubeCDN with on-premise
equipment in case all third-party dependencies need to be eliminated.
While the kops option seems to be best for this project long-term, incorporating it
will be tricky and require extensive development. Even so, kops-based Kubernetes
installation will have great benefits for the project in the future. Multi-region
support will also enable regional failover in the event of a provider-specific
outage.

Regional Auto-Scaling
While scaling infrastructure globally drastically improves UX, it may not be
necessary to run infrastructure in many parts of the world on a 24/7 basis. There
may be times when it's better, from a profitability standpoint, to tear down
infrastructure in a region and have the small, or negligible, number of users there
experience increased latency. This level of control over infrastructure cost
optimization can be extremely beneficial from a financial standpoint.

In order to achieve this, a monitoring solution would be needed to track a chosen
metric. When this metric reaches a predefined threshold in a region, infrastructure
can be dismantled. Additionally, if there is a spike in users from a specific
region, similar thresholds can be used to automatically spin up infrastructure in a
pre-approved region closer to where the user spike is identified.

List of Pre-defined Regions


The current version of kubeCDN has made it easy to deploy EKS clusters to new
regions. However, there are better ways to achieve this than the current method of
replicating three lines of code for each region. The ability to list desired
regions in a file would be a much cleaner approach.

This feature would also tie in well with the regional auto-scaling feature
previously mentioned. Providing kubeCDN with two lists of regions, one where
infrastructure needs to run continuously and one approved for regional auto-
scaling, would immensely simplify infrastructure management.

Federation of Deployed Kubernetes Clusters


Federation makes it easy to manage multiple clusters. Here is a list of the
features that Federation provides. While every feature provided by Federation
would be beneficial to kubeCDN, the one that particularly stands out is the
ability to sync resources across clusters. Keeping resources in sync across
clusters is a huge management challenge, and Federation helps simplify it
immensely. However, Federation is not a mature Kubernetes feature and has some
limitations. Here are some known issues with Federation that the Kubernetes
development team is working to solve.

Takeaways
The current version of kubeCDN simplifies geo-replication of Kubernetes clusters
and allows for easy scaling. Being self-hosted, kubeCDN allows for the
accommodation of unique infrastructure requirements and provides immense
infrastructure management flexibility.

However, there are several limitations and opportunities for enhancement that
would make kubeCDN a more robust tool. Some of these have been discussed in this
post, and I expect the list to grow as I receive feedback. Another major factor
shaping future features will be improvements to the Kubernetes project itself. I
also anticipate that the maturing of Kubernetes Federation will drastically change
the functions and features of kubeCDN in the future.

I sincerely hope that you find kubeCDN to be a useful tool. Feel free to reach out
to me if you have any questions.

Ilhaan Rasheed was an Insight DevOps Engineering Fellow in early 2019. He developed
kubeCDN during his tenure at Insight. Connect with him at ilhaan.com.

Say hello to classification algorithms!


The idea of classification algorithms is pretty simple. You predict the target
class by analyzing the training dataset. This is one of the most essential, if not
the most essential, concepts you study when you learn data science.

You might also like: Top Machine Learning Algorithms You Should Know to Become a
Data Scientist
What Is Classification?
We use the training dataset to derive boundary conditions that can be used to
determine each target class. Once the boundary conditions are determined, the next
task is to predict the target class. This whole process is known as classification.

Target Class Examples:


Analyzing customer data to predict whether a customer will buy computer
accessories (Target classes: Yes or No)
Classifying fruits by features like color, taste, size, and weight (Target
classes: Apple, Orange, Cherry, Banana)
Gender classification from hair length (Target classes: Male or Female)
Let's understand the concept of classification algorithms with gender
classification using hair length (by no means am I trying to stereotype by gender;
this is only an example). To classify gender (the target class) using hair length
as the feature parameter, we could train a model using any classification
algorithm to come up with a set of boundary conditions that differentiate the male
and female classes using hair length as the training feature. In the gender
classification case, the boundary condition would be a particular hair length
value. Suppose the boundary value is 15.0 cm; then we can say that if hair length
is less than 15.0 cm, the predicted gender is male, and otherwise female.
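The boundary condition above amounts to a one-line classifier. A minimal sketch in Python, using the 15.0 cm threshold from the example:

```python
# Toy classifier from the example: a single learned boundary on hair length.
HAIR_LENGTH_BOUNDARY_CM = 15.0

def classify_gender(hair_length_cm):
    return "male" if hair_length_cm < HAIR_LENGTH_BOUNDARY_CM else "female"
```

A real classification algorithm's job is to learn that boundary value from the training data rather than have it hard-coded.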

Classification Algorithms vs Clustering Algorithms


In clustering, the idea is not to predict the target class as in classification;
it's more about grouping similar items together. The guiding condition is that all
items in the same group should be similar, and items from two different groups
should not be similar.

Grouping Examples:

Grouping documents by language (documents in the same language form one group)
Categorizing news articles (articles in the same news category, such as Sports,
form one group)
Let's understand the concept with the example of clustering genders based on hair
length. To determine gender, different similarity measures could be used to
categorize the male and female groups. This could be done by finding the similarity
between two hair lengths and keeping them in the same group if that similarity is
high (i.e., the difference in hair length is small). The same process could
continue until all the hair lengths are properly grouped into two categories.
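The grouping process above can be sketched as a simple two-centroid iteration over one-dimensional hair lengths (illustrative values, in the spirit of k-means):

```python
# A sketch of grouping 1-D hair lengths into two clusters by repeatedly
# assigning values to the nearest centroid (illustrative only).
def cluster_two_groups(values, iters=10):
    c1, c2 = min(values), max(values)        # initial centroids
    g1, g2 = [], []
    for _ in range(iters):
        # assign each value to its nearest centroid
        g1 = [v for v in values if abs(v - c1) <= abs(v - c2)]
        g2 = [v for v in values if abs(v - c1) > abs(v - c2)]
        if not g1 or not g2:                 # degenerate split; stop early
            break
        # move each centroid to the mean of its group
        c1 = sum(g1) / len(g1)
        c2 = sum(g2) / len(g2)
    return g1, g2
```

Notice that no labels are used anywhere: the two groups emerge purely from the similarity of the values, which is the key difference from classification.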

Basic Terminology in Classification Algorithms


Classifier: An algorithm that maps the input data to a specific category.
Classification model: A classification model tries to draw some conclusions from
the input values given for training. It will predict the class labels/categories
for the new data.
Feature: A feature is an individual measurable property of a phenomenon being
observed.
Binary classification: A classification task with two possible outcomes, e.g.,
gender classification (Male/Female).
Multi-class classification: Classification with more than two classes. In multi-
class classification, each sample is assigned to one and only one target label.
E.g., an animal can be a cat or a dog, but not both at the same time.
Multi-label classification: A classification task where each sample is mapped to a
set of target labels (more than one class). E.g., a news article can be about a
sport, a person, and a location at the same time.
Applications of Classification Algorithms
Email spam classification
Predicting whether a bank customer will repay a loan
Cancer tumor cell identification
Sentiment analysis
Drug classification
Facial keypoint detection
Pedestrian detection in autonomous driving
Types of Classification Algorithms
Classification algorithms can be broadly grouped as follows:

Linear Classifiers
Logistic regression
Naive Bayes classifier
Fisher�s linear discriminant
Support vector machines
Least squares support vector machines
Quadratic classifiers
Kernel estimation
k-nearest neighbor
Decision trees
Random forests
Neural networks
Learning vector quantization
Examples of a few popular Classification Algorithms are given below.

Logistic Regression
As confusing as the name might be, you can rest assured: Logistic Regression is a
classification algorithm, not a regression algorithm. It estimates discrete values
(binary values like 0/1, yes/no, true/false) based on a given set of independent
variables. Simply put, it predicts the probability of occurrence of an event by
fitting data to a logit function. Hence, it is also known as logit regression. The
values obtained always lie between 0 and 1, since it predicts a probability.

Let's try and understand this through another example.

Let's say there's a sum on your math test. It can only have two outcomes, right?
Either you solve it or you don't (and let's not assume points for method here).
Now imagine that you are given a wide range of sums in an attempt to understand
which chapters you have understood well. The outcome of this study would be
something like this: if you are given a trigonometry-based problem, you are 70%
likely to solve it. On the other hand, if it is an arithmetic problem, the
probability of you getting an answer is only 30%. This is what Logistic Regression
provides.

If I had to do the math, I would model the log odds of the outcome as a linear
combination of the predictor variables.

odds = p / (1 - p) = probability of event occurrence / probability of event
non-occurrence

logit(p) = ln(odds) = ln(p / (1 - p)) = b0 + b1*X1 + b2*X2 + b3*X3 + ... + bk*Xk

In the equations given above, p is the probability of the presence of the
characteristic of interest.

It chooses parameters that maximize the likelihood of observing the sample values,
rather than parameters that minimize the sum of squared errors (as in ordinary
regression).
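The logit relationship can be checked numerically: the sigmoid inverts the logit, so any linear combination of predictors maps to a probability strictly between 0 and 1. A small Python sketch (the coefficients are made up):

```python
import math

# logit(p) = ln(p / (1 - p)); its inverse is the sigmoid, which squashes
# any real-valued log-odds z into a probability in (0, 1).
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# p = sigmoid(b0 + b1 * x1): a log-odds of 0 corresponds to p = 0.5.
def predict_probability(b0, b1, x1):
    return sigmoid(b0 + b1 * x1)
```

This is why logistic regression's outputs can be read directly as probabilities, such as the 70% and 30% in the test example above.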

Now, a lot of you might wonder: why take a log? For the sake of simplicity, let's
just say that this is one of the best mathematical ways to replicate a step
function. I could go way more in-depth on this, but that would defeat the purpose
of this blog.

# Combine training features and labels
x <- cbind(x_train, y_train)
# Train the model using the training set and check the fit
logistic <- glm(y_train ~ ., data = x, family = "binomial")
summary(logistic)
# Predict output (type = "response" returns probabilities)
predicted <- predict(logistic, x_test, type = "response")
There are many different steps that could be tried in order to improve the model:

include interaction terms
remove features
use regularization techniques
use a non-linear model
Decision Trees
Now, the decision tree is by far one of my favorite algorithms. A type of
supervised learning algorithm mostly used for classification problems, it is
versatile enough to handle both categorical and continuous dependent variables.
What this algorithm does is split the population into two or more homogeneous sets
based on the most significant attributes, making the groups as distinct as
possible.
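"Most significant attribute" is typically judged by an impurity measure such as Gini. A sketch of how a candidate split would be scored (the labels are made up):

```python
# Gini impurity: 1 - sum(p_i^2). A pure group (one class) scores 0 and a
# 50/50 binary group scores 0.5; splits that lower the weighted impurity
# of the child groups are preferred.
def gini(labels):
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# Impurity of a split: child impurities weighted by child sizes.
def weighted_gini(left, right):
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)
```

A tree builder evaluates every candidate split this way and keeps the one with the lowest weighted impurity, then repeats recursively on each child group.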


In the original illustration, the population is classified into four different
groups based on multiple attributes to identify "whether they will play or not."

library(rpart)
# Combine training features and labels
x <- cbind(x_train, y_train)
# Grow the tree
fit <- rpart(y_train ~ ., data = x, method = "class")
summary(fit)
# Predict output (type = "class" returns class labels)
predicted <- predict(fit, x_test, type = "class")
Naive Bayes Classifier
This is a classification technique based on Bayes' theorem, with an assumption of
independence between predictors. In simple terms, a Naive Bayes classifier assumes
that the presence of a particular feature in a class is unrelated to the presence
of any other feature.

For example, a fruit may be considered to be an apple if it is red, round, and
about 3 inches in diameter. Even if these features depend on each other or on the
existence of the other features, a Naive Bayes classifier would consider all of
these properties to contribute independently to the probability that the fruit is
an apple.

Building a Naive Bayes model is simple and particularly useful for very large
data sets. Along with simplicity, Naive Bayes is known to outperform even highly
sophisticated classification methods.

Bayes' theorem provides a way of calculating the posterior probability P(c|x) from
P(c), P(x), and P(x|c). The expression for the posterior probability is as
follows:

P(c|x) = P(x|c) * P(c) / P(x)

Here,

P(c|x) is the posterior probability of the class (target) given the predictor
(attribute).
P(c) is the prior probability of the class.
P(x|c) is the likelihood, which is the probability of the predictor given the
class.
P(x) is the prior probability of the predictor.
Example: Let's work through an example to understand this better. Here, I have a
training data set of weather conditions, namely sunny, overcast, and rainy, and a
corresponding binary variable "Play." Now, we need to classify whether players
will play or not based on the weather conditions. Let's follow the steps below.

Step 1: Convert the data set to a frequency table.

Step 2: Create a likelihood table by finding the probabilities, e.g., the
probability of Overcast is 0.29 and the probability of playing is 0.64.

Step 3: Now, use the Naive Bayesian equation to calculate the posterior probability
for each class. The class with the highest posterior probability is the outcome of
the prediction.


Problem: Players will play if the weather is sunny. Is this statement correct?

We can solve it using the method discussed above:
P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)

Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, and P(Yes) =
9/14 = 0.64.

Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is the higher probability.
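The arithmetic above can be verified with exact fractions; the unrounded inputs give exactly 3/5, which matches the rounded 0.60:

```python
from fractions import Fraction

# Posterior for the worked example:
# P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
p_sunny_given_yes = Fraction(3, 9)
p_yes = Fraction(9, 14)
p_sunny = Fraction(5, 14)

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny   # Fraction(3, 5)
```

The class with the larger posterior wins, so "Yes" (0.6 > 0.5) is the prediction for a sunny day.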

Naive Bayes uses a similar method to predict the probabilities of different
classes based on various attributes. This algorithm is mostly used in text
classification and with problems having multiple classes.

R-Code:
library(e1071)
# Combine training features and labels
x <- cbind(x_train, y_train)
# Fit the model
fit <- naiveBayes(y_train ~ ., data = x)
summary(fit)
# Predict output
predicted <- predict(fit, x_test)
KNN (k-Nearest Neighbors)
k-nearest neighbors is a simple algorithm used for both classification and
regression problems. It stores all available cases and classifies new cases by a
majority vote of their k nearest neighbors, with the case assigned to the class
most common amongst those neighbors as measured by a distance function (Euclidean,
Manhattan, Minkowski, or Hamming).

While the first three distance functions are used for continuous variables, the
Hamming distance is used for categorical variables. If k = 1, then the case is
simply assigned to the class of its nearest neighbor. At times, choosing k turns
out to be a challenge when performing kNN modeling.
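The majority-vote rule can be sketched directly (a toy implementation with Euclidean distance, not a library call; the data points are made up):

```python
import math
from collections import Counter

# Minimal k-NN classifier: majority vote among the k closest training
# points by Euclidean distance (illustrative, for tiny datasets).
def knn_predict(train_points, train_labels, query, k=3):
    dists = sorted(
        (math.dist(p, query), label)
        for p, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]
```

Because every prediction scans the whole training set, this also makes clear why kNN is computationally expensive at prediction time.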


You can understand KNN easily by taking an example of our real lives. If you have a
crush on a girl/boy in class, of whom you have no information, you might want to
talk to their friends and social circles to gain access to their information!

# knn() lives in the 'class' package and takes matrices directly,
# so there is no separate fit/predict step
library(class)
# Predict output: majority vote among the 5 nearest training points
predicted <- knn(train = x_train, test = x_test, cl = y_train, k = 5)
Things to Consider Before Selecting KNN:
KNN is computationally expensive.
Variables should be normalized, or else variables with larger ranges can bias it.
Work more on the pre-processing stage before applying kNN, e.g., outlier and
noise removal.
SVM (Support Vector Machine)
In this algorithm, we plot each data item as a point in n-dimensional space (where
n is the number of features you have), with the value of each feature being the
value of a particular coordinate.

For example, if we only had two features, like the height and hair length of an
individual, we'd first plot these two variables in two-dimensional space, where
each point has two coordinates. (The points closest to the separating boundary are
known as support vectors.)


Now, we will find a line that splits the data between the two differently
classified groups. This will be the line positioned so that its distance from the
closest point in each of the two groups is as large as possible.
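The "farthest from the closest points" idea is the margin. A sketch that scores a candidate separating line ax + by + c = 0 by its distance to the nearest point (the line and points are made up):

```python
import math

# Distance from point (x, y) to the line a*x + b*y + c = 0.
def distance_to_line(a, b, c, point):
    x, y = point
    return abs(a * x + b * y + c) / math.hypot(a, b)

# The margin of a separating line is its distance to the closest point;
# SVM picks the line that maximizes this minimum distance.
def margin(a, b, c, points):
    return min(distance_to_line(a, b, c, p) for p in points)
```

Comparing the margins of candidate lines is exactly how one line is judged "better" than another here; a real SVM solves this maximization as an optimization problem rather than by enumeration.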


In the example shown above, the line that splits the data into two differently
classified groups is the blue line, since the two closest points are the farthest
from it. This line is our classifier. Then, depending on which side of the line
new testing data lands, that is the class we assign to the new data.

So, with this, we come to the end of this classification algorithms article. Try
out the simple R code on your systems now, and you'll no longer call yourself a
newbie in this field.

Further Reading
Intro to Machine Learning for Developers

Machine Learning in a Box (Part 3): Algorithm Learning Styles
