Transcript:
When you hear the term "Big Data," what do you think of? Aggregating data from a
spreadsheet with a million rows? Or maybe processing files that measure in the gigabyte
or terabyte range? Perhaps running reports on a database cluster? While there are many
ways to describe Big Data, one thing is consistent: if you can do the work on a single
machine, you're not doing Big Data. The unique challenge of Big Data is the need to
spread the work across a cluster of machines. The individual work being done is often
fairly simple and common. Typically, you will load data from a file or database, do
some sort of transformation on it, and write the results back to the file or database.
Dividing up a big pile of data and processing it chunk by chunk may not seem like a
particularly difficult task, but it moves the problem into the world of distributed computing,
which is notoriously complex in subtle ways. During this course, we will build up from a
simple data transformation application that can only do one thing at a time to a fully
distributed application that can run across as many computers and processors as
you have access to. It is possible to perform Big Data processing with almost any
programming language. For this course, we will be using Python. It is widely used in
data science and is often used for data transformation applications. It is natural for
these smaller applications to grow and evolve into distributed applications as
the amount of data being processed starts to exceed what can reasonably be done by a
single computer.
You will be able to take the concepts you learn here and apply them to any
programming language in which you might eventually work.
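To make the "load, transform, write, chunk by chunk" idea concrete, here is a minimal
single-machine sketch. The helper names chunked and transform, the sample records, and
the upper-casing transformation are all illustrative assumptions, not code from the course.

```python
def chunked(records, chunk_size):
    """Yield successive slices of `records`, each at most `chunk_size` items long."""
    for start in range(0, len(records), chunk_size):
        yield records[start:start + chunk_size]


def transform(chunk):
    """Stand-in transformation (an assumption): upper-case every record."""
    return [record.upper() for record in chunk]


# Hypothetical input; a real job would load this from a file or database.
records = ["apple", "orange", "banana", "cherry", "grape"]

results = []
for chunk in chunked(records, chunk_size=2):
    # Here the chunks are processed one after another on one machine.
    # In a distributed setting, each chunk could go to a different machine.
    results.extend(transform(chunk))

print(results)
```

Each chunk is processed independently of the others, which is the property that would
later allow the chunks to be handed to separate machines.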
Course Details
Time to complete: 40 Hours
Prerequisites: Python 3
Core Technologies and Tools: Python 3, Socket.io, and VSCode
1. Volume refers to the size of the datasets being stored. According to an article by
Northeastern University, the total amount of data in the world is projected to rise steeply
to 44 zettabytes by 2020. For those who are unfamiliar with the metric prefixes, 44
zettabytes expanded is 44,000,000,000,000,000,000,000 bytes! We usually think of data
as "big" when we have at least a million records being analyzed at a time.
2. Variety refers to the different forms the data can take, such as video, audio, and text.
For example, Twitter's variety dimension may consist mostly of text while YouTube's
variety dimension is primarily video.
3. Velocity refers to the speed of data generation and storage. This typically ties into the
"real-time" rate of data transmission. In a 2013 blog post, developers at Twitter shared
that during a television broadcast of a movie in Japan, 143,199 Tweets were sent within
the span of a single second.
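The figures quoted above can be checked with a few lines of Python. This is only an
arithmetic sketch: the per-minute number assumes, purely for illustration, that the
one-second peak rate were sustained for a full minute, which the source does not claim.

```python
ZETTA = 10 ** 21  # the metric prefix "zetta" means 10 to the 21st power

# 44 zettabytes written out in full, with thousands separators.
total_bytes = 44 * ZETTA
print(f"{total_bytes:,} bytes")

# Peak rate reported in the Twitter blog post.
tweets_per_second = 143_199

# Hypothetical: what one minute at that peak rate would amount to.
tweets_per_minute = tweets_per_second * 60
print(f"{tweets_per_minute:,} Tweets per minute at the peak rate")
```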
Key Terms
Data Collection: The aggregation and storage of data.
Data Analysis: Analyzing the data in order to answer questions.
Data-driven decisions: A decision that is informed by the analysis of data.
Compute clustering: Distributing a workload across multiple computers to reduce wait
time.
Cluster: A group of connected computers.
Question 1 of 2
A. Velocity
B. Validity (Correct: While it's important to have valid data, it is not one of the 3Vs.)
C. Variety
D. Volume
Question 2 of 2
Which dimension of big data represents the different kinds of data that
can be collected?
A. Velocity
B. Variety (Correct)
C. Volume
D
Setting Up
1. Create a new folder on your computer which will contain all the content in this
module. Name the folder big-data to stay consistent with the assumptions made
in this module.
2. Ensure Visual Studio Code is installed locally on your machine. If it is not, download
the installer from the official Visual Studio Code website. Once VS Code is installed,
open it and navigate to the folder created in the previous step (i.e., big-data).
3. Ensure Python 3 is installed by opening the command line tool on your operating
system: Terminal for macOS/Linux users and Command Prompt for Windows
users. Once it is open, run the command python3 and observe what happens. If
Python 3 is installed, three greater-than characters (>>>) will appear, indicating that the
Python interpreter has started and is installed properly. If Python 3
is not installed, download the latest version of Python 3 from the official Python website.
4. Using the command line tool, navigate into the directory that was created
previously (i.e., big-data).
Note - "folder" and "directory" are two different words for the same thing. If you're using
a mouse to open it, it's called a folder. If you're using a command line tool to open it, it's
called a directory.
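Once the interpreter is running, the version can also be confirmed from Python itself.
The following is a small sketch; the helper name is_python3 is an illustrative assumption.

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True when the major version number is 3."""
    return version_info[0] == 3


# Show the full version string of the interpreter running this code.
print(sys.version)
print("Python 3 detected" if is_python3() else "Python 3 not detected")
```

To leave the interactive interpreter afterwards, type exit() and press Enter.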
Hello Again, World!
Note - Make sure to open the big-data folder
In Visual Studio Code, create a new file called review.py. The purpose of this file will be
python3 review.py
Output:
Tip! Type each line of the code below rather than copying it to improve comprehension.
# Import needed libraries ('sys' is for reading standard input from the command line)
import sys
The comments explain what the code does at each step. Be sure you understand how
the code works at each step, as you should have a working knowledge of it. Once
you have reviewed it, run the same command used initially to run
the review.py Python script. Since standard input is being used, you will be able to
enter text continuously. To stop, press the Enter key without typing any text; this is
noted in a comment on line 18 of the code.
Input:
python3 review.py
apple
orange
null
banana
Output:
apple
orange
null
banana
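A minimal sketch of a script that would behave like the session above, assuming it
simply echoes each entered line and stops at the first empty line. The function name
echo_lines and the in-memory sample input are illustrative assumptions, not the course's
own listing.

```python
import io


def echo_lines(stream):
    """Print each line read from `stream` until an empty line or end of input."""
    entered = []
    for line in stream:
        # Remove the trailing newline and any surrounding white space.
        text = line.strip()
        # An empty line (Enter pressed with no text) ends the loop.
        if text == "":
            break
        print(text)
        entered.append(text)
    return entered


# An in-memory stand-in for standard input, so the sketch runs without typing.
sample_input = io.StringIO("apple\norange\nnull\nbanana\n\nignored\n")
lines = echo_lines(sample_input)
```

In a real script, the same function could be called as echo_lines(sys.stdin) after
import sys, so that it reads what the user types at the command line.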
Next, the review.py file will need to be updated to open a file and process its
contents. There are a few changes:
review.py
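As a sketch of what a file-reading version might look like, the example below opens a
text file, strips each line, and prints the values. The file name sample.txt, its
contents, and the helper read_values are hypothetical; the sample file is written to a
temporary folder only so that the example is self-contained.

```python
import os
import tempfile


def read_values(path):
    """Open the file at `path` and return its non-empty lines, stripped."""
    with open(path, "r", encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


# Create a small hypothetical input file, then read it back.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("apple\norange\n\nbanana\n")
    values = read_values(path)

for value in values:
    print(value)
```

Using with open(...) closes the file automatically, even if an error occurs while
reading.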
With the review.py script complete, it's time to run it. This time, there is no
need for standard input, so the results will be displayed as soon as the file has been
opened and processed.
Input:
python3 review.py
Output:
This concludes the Python code review. If any areas gave you pause, that's quite all
right. Practice makes perfect, and practice is the purpose of the
Hands-On projects.
Page 3: Lesson 1 Practice Hands-On
Directions
This Hands-On will not be graded, but we encourage you to complete it: the
best way to become a great cyber security professional is to practice.
Hands-On: Part 1
Now that you have learned the importance of Big Data and reviewed crucial parts of the
Python language, it's time to apply that knowledge in a fun but challenging project.
This Hands-On project will involve the tools and skills illustrated in this lesson, such as
Visual Studio Code and the command line.
Description
Leverage your knowledge of the Python programming language to count the
occurrences of a given word. Create a new Python file called project1_part1.py for this
project. The starter code below contains the list of values to be used in this lesson. Your
task will be to count the occurrences of the search_word in the word_list and print the
results. For example, if the search_word is 'energy', print energy,1. But if
the search_word is 'sandwich', print sandwich,0. Your code should be able to handle
different values of search_word.
Starter Code:
search_word = 'the'
count = 0
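One possible approach to the counting step is sketched below. The word_list here is a
short hypothetical stand-in, not the list provided with the lesson, and the function
name count_occurrences is an illustrative assumption.

```python
def count_occurrences(words, target):
    """Count how many items in `words` are exactly equal to `target`."""
    count = 0
    for word in words:
        if word == target:
            count += 1
    return count


# Hypothetical stand-in for the lesson's word_list.
word_list = ["the", "energy", "of", "the", "sun"]

for search_word in ("the", "energy", "sandwich"):
    # Print in the "word,count" format described in the project.
    print(f"{search_word},{count_occurrences(word_list, search_word)}")
```

Python lists also offer a built-in word_list.count(search_word) method that returns the
same number; the explicit loop is shown because it mirrors the starter code's count
variable.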
Hands-On: Part 2
Create a new Python file called project1_part2.py for this project. In this file, create a
function which takes in a list of words and makes a new copy excluding the filter word.
The filter word will be retrieved from reading standard input.
Be sure to print both the original list and the filtered list.
Note - Be sure to use .strip() to remove unnecessary white space from standard
input.
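The filtering step might be sketched as follows. The sample list, the function names
filter_words and read_filter_word, and the in-memory stand-in for standard input are all
illustrative assumptions.

```python
import io


def filter_words(words, filter_word):
    """Return a new list containing every word except `filter_word`."""
    return [word for word in words if word != filter_word]


def read_filter_word(stream):
    """Read one line from `stream` and strip surrounding white space."""
    return stream.readline().strip()


# Hypothetical word list; the project supplies its own data.
words = ["apple", "orange", "null", "banana", "null"]

# Stand-in for standard input; a real script could pass sys.stdin instead.
filter_word = read_filter_word(io.StringIO("  null \n"))
filtered = filter_words(words, filter_word)

print(words)     # the original list is left unchanged
print(filtered)  # the copy without the filter word
```

Because filter_words builds a new list rather than removing items in place, the original
list remains available to print alongside the filtered one.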
Hands-On: Part 3
Create a new text file called project1_part3.txt and copy the following text into it:
project1_part3.txt
1934
2004
????
1987
2009
????
1902
1964
1913
1956
????
????
1965
1920
1963
????
????
1900
2009
1986
????
1907
1994
1938
1964
1913
????
1944
Additionally, create a new Python file called project1_part3.py for this project. In this
file, write the code that will open the text file created above and load all of the
values into a list. Then, create a function which takes in the list of words and makes a
new copy excluding the filter word retrieved from reading standard input.
Note - Be sure to use .strip() to remove unnecessary white space from the file and
from standard input.
NEED HELP? - If you find yourself needing additional support, please reach out to a
mentor using the live chat, or you can submit a ticket through the helpdesk.
Requirements
Provided Data: Python application must use the unmodified data provided in the lesson.
Standard Input: Must be able to read in text from standard input.
Filter: Must be able to filter specific items from the list.
Print: Print the list data before and after it has been processed.
File Operations: Code must open file and read data from it.
Grading
Meets all Requirements: 50% of your grade will be based on meeting the
requirements.
Timely Submission: 25% of your grade will be based on having a complete solution on
time.
Style: 25% of your grade will be based on having legible and well-designed code.
Be sure to save your solution, and be prepared to share it with your Instructor or Mentor
during your next class or check-in.
Tip! Your Instructor or Mentor may test your program with additional inputs, so be sure
to test it thoroughly with different scenarios.