
Lesson 1: Introduction to Big Data

Transcript:
When you hear the term big data, what do you think of? Aggregating data from a
spreadsheet with a million rows? Or maybe processing files that measure in the gigabyte
or terabyte range? Perhaps running reports on a database cluster? While there are many
ways to describe big data, one thing is consistent: if you can do the work on a single
machine, you're not doing big data. The unique challenge of big data is the need to
spread the work across a cluster of machines. The individual work being done is often
fairly simple and common. Typically, you will load data from a file or database, do
some sort of transformation on it, and write the results back to the file or database.
Dividing up a big pile of data and processing it chunk by chunk may not seem like a
particularly difficult task but it moves the problem into the world of distributed computing
which is notoriously complex in subtle ways. During this course, we will build up from a
simple data transformation application that can only do one thing at a time to a fully
distributed application that can run across as many computers and processors as you
have access to. It is possible to perform big data processing with almost any
programming language; for this course, we will be using Python, which is widely used in
data science and in data transformation applications. It is natural for these smaller
applications to grow and evolve into distributed applications as the amount of data
being processed starts to exceed what can reasonably be done by a single computer.
You will be able to apply the concepts you learn here to any programming language in
which you might eventually work.

Page 1: Course Overview


Big data refers to both the storage and analysis of large collections of data. For
perspective, as of 2013, 90% of the world's data had been generated in the previous
two years, and data generation is expected to keep growing exponentially as computing
becomes cheaper. In this course, students will learn how to process large datasets on
their local machine, how to distribute work across multiple computers for better
performance, and how to prevent and recover from errors.
Course Objectives
 Learn the fundamentals of Big Data
 Learn how to process large data files with Python
 Learn why clustering is essential for Big Data
 Learn how to maximize computing efficiency
 Learn how Hadoop can be used to solve Big Data problems
 Learn how to recover from data processing errors

Course Details
 Time to complete: 40 Hours
 Prerequisites: Python 3
 Core Technologies and Tools: Python 3, Socket.io, and VSCode

Introduction to Big Data


What is Big Data?
The two inseparable components of big data are data collection and data analysis. Data
collection is simply aggregating data and storing it. For example, the data may come in
the form of text containing the date and time of an item purchased from a grocery store.
Decisions informed by data are, intuitively, referred to as data-driven decisions.
The second component of big data, data analysis, is where information becomes
knowledge. Data scientists can analyze the data, run custom code to process the data,
and return the final results.
Let's revisit the aforementioned example of the grocery store, which will be referred to
as ACME Groceries. If ACME Groceries has collected both the dates and times of each
item sold in the store, they're empowered to ask questions concerning inventory.
Consider the following questions:
Answerable questions using ACME Groceries data:

 Most produce is sold during which month?
o Solution: Mail out advertisements and coupons the week prior.
 Which two items are frequently purchased together?
o Solution: Place these items adjacent to each other.
 What are the five most frequently purchased items between the hours of 6:00 AM and
8:00 AM?
o Solution: Place the items in a popup kiosk near the front of the store for quick
access in the morning.
 Which brand of toothpaste generates the most money?
o Solution: Move the best seller to the middle shelf (eye-level).
The ACME Groceries dataset may seem trivial in size, but recall that only the dates,
times, and items are being stored. If the types of data collected were to increase (e.g. store
location, social media rating, etc.) and the number of ACME Groceries stores grew
worldwide, the amount of data generated and stored would rise dramatically. With
fine detail and vast records of customer purchase history, companies can spot patterns
and predict customer behavior. Big data makes large-scale insights possible.
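To make this concrete, the morning-kiosk question above can be sketched with a few lines of Python. The purchase records and their (date, time, item) format below are hypothetical stand-ins for the data ACME Groceries is said to collect:

```python
from collections import Counter

# Hypothetical purchase records in (date, time, item) form --
# stand-ins for the dates, times, and items ACME Groceries stores.
purchases = [
    ("2023-06-01", "06:15", "coffee"),
    ("2023-06-01", "06:40", "bagel"),
    ("2023-06-01", "07:05", "coffee"),
    ("2023-06-02", "12:30", "apple"),
    ("2023-06-02", "06:50", "coffee"),
]

# Tally items sold between 6:00 AM and 8:00 AM; zero-padded
# HH:MM strings compare correctly with plain string comparison.
morning_items = Counter(
    item for _, time, item in purchases if "06:00" <= time < "08:00"
)

# The most frequently purchased morning items, best sellers first.
print(morning_items.most_common(5))
```

At a real grocery chain the same tally would run over millions of records, which is exactly where the clustering techniques introduced later become necessary.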

Why Big Data?


A fundamental part of big data analysis is being able to process all of the necessary
data. This means a computer will need to run the custom code (Python in this case) to
parse the relevant information and combine the results together. However, attempting to
process vast amounts of data on a single computer will take long stretches of time. In
some situations, the processing time and cost are too great to justify the usefulness of the
data.
What if the work could be shared between two computers? Ideally, that would cut the time
roughly in half. What if a third computer were added? As you may guess, the more computers
that are added, the shorter the time required to process the same amount of
data. This method of distributing the workload across multiple computers is referred to
as compute clustering, where groups of computers are referred to as clusters. This will
be covered in more detail later in the module.
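The same divide-and-combine idea can be previewed on a single machine with Python's multiprocessing module, where each worker process plays the role of one computer in a cluster. The transform function and the data here are hypothetical placeholders for real work:

```python
from multiprocessing import Pool

def transform(chunk):
    # Stand-in for real processing: uppercase each record in the chunk.
    return [record.upper() for record in chunk]

if __name__ == "__main__":
    records = ["apple", "orange", "banana", "grape"]

    # Split the data into two chunks, one per worker, mimicking
    # how a cluster divides work between machines.
    chunks = [records[:2], records[2:]]

    # Each worker processes its chunk in parallel.
    with Pool(processes=2) as pool:
        partial_results = pool.map(transform, chunks)

    # Combine the partial results back into one list.
    combined = [item for chunk in partial_results for item in chunk]
    print(combined)
```

The split-process-combine shape of this sketch is the same pattern a real cluster follows, just with processes standing in for machines.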

Volume, Variety, Velocity


There are three properties, or dimensions, that describe big data: Volume, Variety,
and Velocity.

1. Volume refers to the size of the datasets being stored. According to an article by
Northeastern University, the total amount of data in the world is projected to rise steeply
to 44 zettabytes by 2020. For those who are unfamiliar with the metric prefixes, 44
zettabytes expanded is 44,000,000,000,000,000,000,000 bytes! We usually think of data
as "big" when we have at least a million records being analyzed at a time.
2. Variety refers to the different forms the data can take, such as video, audio, and text.
For example, Twitter's variety dimension may consist mostly of text, while YouTube's
variety dimension is primarily video.
3. Velocity refers to the speed of data generation and storage. This typically ties into the
"real-time" rate of data transmission. In a 2013 blog post, developers at Twitter shared
that during a big movie release in Japan, 143,199 Tweets were sent within the span of a
single second.
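The zettabyte figure under Volume can be sanity-checked with a line of arithmetic, since each metric prefix step up (kilo, mega, giga, ...) is a factor of 1,000:

```python
# Metric prefixes as powers of ten: kilo = 10**3, ..., zetta = 10**21.
ZETTA = 10 ** 21

# 44 zettabytes written out in raw bytes.
total_bytes = 44 * ZETTA
print(total_bytes)  # 44 followed by 21 zeros
```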

Key Terms
 Data Collection: The aggregation and storage of data.
 Data Analysis: Analyzing the data in order to answer questions.
 Data-driven decisions: A decision that is informed by the analysis of data.
 Compute clustering: Distributing a workload across multiple computers to reduce wait
time.
 Cluster: A group of connected computers.

Question 1 of 2

Which of the following is not one of the 3Vs of Big Data?

 A Velocity
 B Validity
 C Variety
 D Volume

Correct answer: B Validity. While it's important to have valid data, it is not one of the 3Vs.

Question 2 of 2

Which dimension of big data represents the different kinds of data that
can be collected?

 A Velocity
 B Variety
 C Volume
 D None of the above

Correct answer: B Variety.

Page 2: Python Review


Since we will need to process large data sets, Python will be used to do so. This
lesson reviews a small subset of the Python programming language that is relevant
to big data: writing a function, reading standard input, and performing file operations.

Setting Up
1. Create a new folder on your computer which will contain all the content in this
module. Name the folder big-data to stay consistent with the assumptions made
in this module.
2. Ensure Visual Studio Code is installed locally on your machine. If not, click
here to download the installer. Once VS Code is installed, open it and navigate to
the folder created in the previous step (i.e. big-data).
3. Ensure Python 3 is installed by opening the command line tool on your operating
system: Terminal for macOS/Linux users and Command Prompt for Windows
users. Once opened, run the command python3 and observe what happens. If
Python 3 is installed, a prompt of three greater-than characters (>>>) will appear,
indicating the Python interpreter has started and is working properly. If Python 3
is not installed, click here to download the latest version of Python 3.
4. Using the command line tool, navigate into the directory that was created
previously (i.e. big-data).
Note - "folder" and "directory" are two different words for the same thing. If you're using
a mouse to open it, it's called a folder. If you're using a command line tool to open it, it's
called a directory.
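As an optional extra check, the interpreter's version can also be inspected from a script; a minimal sketch:

```python
import sys

# sys.version_info holds the interpreter's version as a tuple,
# e.g. major=3, minor=10. This course assumes Python 3.
print(sys.version_info.major)

if sys.version_info.major < 3:
    raise RuntimeError("Python 3 is required for this course")
```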
Hello Again, World!
Note - Make sure to open the big-data folder
In Visual Studio Code, create a new file called review.py. The purpose of this file will be
to practice the Python concepts reviewed in this lesson.
Reading From a Variable


Once the tools have been set up and configured, let's review how to declare a function.
In the review.py file created in the previous step, we will review the basics of setting up
a function. Define a new function called filter which takes a list as its parameter. The
filter function will create a copy of the list that does not contain items that are "null".
For example, if the list contains "apple, orange, null, banana", the new list will
contain "apple, orange, banana". You are encouraged to write as much code as
possible before moving on to the solution, as this will help identify areas of improvement.
review.py

# Define the function
def filter(list):
    new_list = []
    for item in list:
        if item == "null":
            continue
        else:
            new_list.append(item)
    return new_list

# Print original item list
item_list = ["apple", "orange", "null", "banana"]
print(item_list)

# Print new filtered item list
new_list = filter(item_list)
print(new_list)
The function is now declared at the top of the file. The remaining lines of code test the
function with a list of data and print the results. To run the script, run the command
shown below in your command line tool.
Input:

python3 review.py

Output:

['apple', 'orange', 'null', 'banana']


['apple', 'orange', 'banana']

Reading From Standard Input


Now that the filter function is working properly, it can be useful to expand its
functionality. Python programs are not limited to reading data from the variables in the
program. They can also accept data via standard input, which is passed in through the
command line.
The code below is updated to include a few new things:

 Imported sys to read standard input.
 Removed the item_list variable since the data will come from standard input.
 Added code to process the standard input line by line.

Tip! Type each line of the code below rather than copying it to improve comprehension.
# Import needed libraries ('sys' is for reading standard input from the command line)
import sys

# Define the method
def filter(list):
    new_list = []
    for item in list:
        # Ignore "null" items
        if item == "null":
            continue
        else:
            new_list.append(item)
    return new_list

# Read in standard input line-by-line
standard_input = []
for line in sys.stdin:
    # Stop reading input if no data is entered
    if line == "\n":
        break
    # Add line to the list
    # "line.strip()" will remove trailing white space
    standard_input.append(line.strip())

# Print standard input list
print(standard_input)

# Print new filtered list
new_list = filter(standard_input)
print(new_list)

Comments were added to explain what the code does at each step; be sure you
understand how the code works before moving on. Once reviewed, run the same
command used earlier to run the review.py script. Since standard input is being used,
you will be able to enter text continuously. To stop, press the Enter key on an empty
line, as noted in the code comments.
Input:

python3 review.py
apple
orange
null
banana

Output:

['apple', 'orange', 'null', 'banana']


['apple', 'orange', 'banana']

Reading From Files (File IO)


Now that we've reviewed how to write a function that processes data and how to
read from standard input, the final step is to review file operations. In
the big-data folder where the review.py file is located, create a plain text file
called review.txt. This can be done in Visual Studio Code by clicking "File -> New
File". An empty "Untitled-1" file will be created. Save the file by going to "File -> Save"
and enter the name review.txt. Then, add the content shown below to the file.
review.txt

apple
orange
null
banana

Next, the review.py file will need to be updated to open the file and process its
contents. There are a few changes:

 Remove the unnecessary sys import.
 Add code to retrieve each line from the file and store it in a variable.

review.py

# Define the method
def filter(list):
    new_list = []
    for item in list:
        # Ignore "null" items
        if item == "null":
            continue
        else:
            new_list.append(item)
    return new_list

# Open the file for reading
file_input = []
file = open("review.txt", "r")
for line in file:
    # Add line to the list
    # "line.strip()" will remove trailing white space
    file_input.append(line.strip())
# Close the file now that its contents have been read
file.close()

# Print file input list
print(file_input)

# Print new filtered list
new_list = filter(file_input)
print(new_list)

With the review.py script complete, it's now time to run it. This time there is no
need for standard input, so the results will be displayed as soon as the file has been
opened and processed.
Input:

python3 review.py

Output:

['apple', 'orange', 'null', 'banana']


['apple', 'orange', 'banana']
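As a side note, a common idiom for file reading is the with statement, which closes the file automatically when the block ends, even if an error occurs. This sketch recreates review.txt first so it can run on its own:

```python
# Recreate review.txt so this sketch is self-contained.
with open("review.txt", "w") as out_file:
    out_file.write("apple\norange\nnull\nbanana\n")

file_input = []
# "with" closes the file automatically when the block ends.
with open("review.txt", "r") as in_file:
    for line in in_file:
        file_input.append(line.strip())

print(file_input)
```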

This concludes the Python code review. If any areas caused a long moment of pause,
that's quite alright; practice makes perfect, and that is the purpose of the Hands-On
projects.
Page 3: Lesson 1 Practice Hands-On
Directions
This Hands-On will not be graded, but we encourage you to complete it. After all, the
best way to become a great cybersecurity professional is to practice.

Hands-On: Part 1
Now that you have learned the importance of big data and reviewed crucial parts of the
Python language, it's time to apply that knowledge in a fun but challenging project.
This Hands-On project will involve the tools and skills illustrated in this lesson, such as
Visual Studio Code and the command line.

Description
Leverage your knowledge of the Python programming language to count the
occurrences of a given word. Create a new Python file called project1_part1.py for this
project. The starter code below contains the list of values to be used in this lesson. Your
task will be to count the occurrences of the search_word in the word_list and print the
results. For example, if the search_word is 'energy', print energy,1; but if
the search_word is 'sandwich', print sandwich,0. Your code should be able to handle
different values of search_word.
Starter Code:

search_word = 'the'
count = 0

word_list = ['the', 'universe', 'is', 'all', 'of', 'space', 'and',
             'time', 'and', 'its', 'contents', 'which', 'includes',
             'planets', 'moons', 'stars', 'galaxies', 'the', 'contents',
             'of', 'intergalactic', 'space', 'and', 'all', 'matter',
             'and', 'energy']

# TODO: Count the occurrences of the search_word

# `str` is used to convert the number to a string
print(search_word + ',' + str(count))
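As a hint (not the full solution), Python lists have a built-in count() method that tallies how many times a value appears. The sample_list below is hypothetical data, not the project's word_list:

```python
# count() returns the number of occurrences of a value in a list.
sample_list = ["the", "cat", "and", "the", "dog"]  # hypothetical data
print(sample_list.count("the"))  # prints 2
```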

Hands-On: Part 2
Create a new Python file called project1_part2.py for this project. In this file, create a
function which takes in a list of words and makes a new copy excluding the filter word.
The filter word will be retrieved from reading standard input.

word_list = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the',
             'lazy', 'dog']

Be sure to print both the original list and the filtered list.
Note - Be sure to use .strip() to remove unnecessary white space from standard
input.

Hands-On: Part 3
Create a new text file called project1_part3.txt and copy the following text into it:
project1_part3.txt

1934
2004
????
1987
2009
????
1902
1964
1913
1956
????
????
1965
1920
1963
????
????
1900
2009
1986
????
1907
1994
1938
1964
1913
????
1944

Additionally, create a new Python file called project1_part3.py for this project. In this
file, write the code that will open the text file created above and load all of the
values into a list. Then, create a function which takes in the list of words and makes a
new copy excluding the filter word retrieved from reading standard input.
Note - Be sure to use .strip() to remove unnecessary white space from the file and
from standard input.
NEED HELP? - If you find yourself needing additional support, please reach out to a
mentor using the live chat, or you can submit a ticket through the helpdesk.

Requirements
 Provided Data: Python application must use the unmodified data provided in the lesson.
 Standard Input: Must be able to read in text from standard input.
 Filter: Must be able to filter specific items from the list.
 Print: Print the list data before and after it has been processed.
 File Operations: Code must open file and read data from it.
Grading
 Meets all Requirements: 50% of your grade will be based on meeting the
requirements.
 Timely Submission: 25% of your grade will be based on having a complete solution on
time.
 Style: 25% of your grade will be based on having legible, well-designed code.

Be sure to save your solution, and be prepared to share it with your Instructor or Mentor
during your next class, or check-in.
Tip! Your Instructor or Mentor may test your program with additional inputs, so be sure
to test it thoroughly with different scenarios.
