Python

What is Python?
Python is a widely used high-level, general-purpose, interpreted, dynamic
programming language.
Its design philosophy emphasizes code readability, and its syntax allows
programmers to express concepts in fewer lines of code than possible in
languages such as C++ or Java.

Python has been awarded the TIOBE Programming Language of the Year
award twice (in 2007 and 2010). The award is given to the language with the
greatest growth in popularity over the course of a year, as measured by
the TIOBE index.
Large organizations that make use of Python include Google, Yahoo!,
CERN and NASA. The social news site Reddit is written largely in
Python.

Python in Today's World

Web Development
Software Development
Scientific & Numeric Computing
Artificial Intelligence
Desktop GUIs

Python is also a scripting language and therefore supports scripts, i.e., programs
written for a special run-time environment that automate the execution of
tasks that could otherwise be executed one by one by a human operator.
A scripting language can also be used to control other programs.
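
As a minimal sketch of a script that controls other programs, the example below starts a separate Python interpreter for each task and collects its output. The helper name run_step and the two sample tasks are hypothetical, chosen only for illustration; subprocess and sys are part of the Python Standard Library.

```python
import subprocess
import sys


def run_step(code):
    """Run a short Python snippet in a separate interpreter and return its output."""
    result = subprocess.run(
        [sys.executable, "-c", code],  # launch another program (here: Python itself)
        capture_output=True,           # collect what the child process prints
        text=True,                     # decode the output as text
        check=True,                    # raise an error if the child process fails
    )
    return result.stdout.strip()


# Tasks that a human operator could also run one by one by hand.
steps = ["print(2 + 3)", "print('done'.upper())"]
outputs = [run_step(step) for step in steps]
print(outputs)
```

The same pattern applies to any command-line program: replace the argument list passed to subprocess.run with the command to automate.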

Why Python?
Python is a popular, general-purpose programming language with an
emphasis on being readable and allowing programmers to use fewer lines of
code to accomplish tasks than in older languages.
Python is an excellent tool for data analysis for four reasons:
Open source
Speed
Support
Scope

Python vs C++/Java
Java
Python programs generally run slower than equivalent Java programs.
Python code is usually 3-5 times shorter than the equivalent Java code.
Python programmers spend no time declaring the types of arguments or variables.

C++
C++ code is generally 5-10 times longer than the equivalent Python code.

Summary: Despite its slower runtime, Python is still sometimes preferred to C++/Java
because it avoids complex syntax, which makes programs easier to write and easier
to read.
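
To illustrate the point about type declarations, here is a small sketch. The function name total is hypothetical. Because no argument or variable types are declared, the same few lines work for integers, floats and strings.

```python
def total(items):
    # No type declarations: this works for any values that support "+".
    result = items[0]
    for item in items[1:]:
        result = result + item
    return result


print(total([1, 2, 3]))        # integers
print(total([1.5, 2.5]))       # floats
print(total(["py", "thon"]))   # strings
```

In a statically typed language such as Java, each of these three cases would normally need declared types or a generic signature.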

Installation
There are two approaches to installing Python:
You can download Python directly from
https://www.python.org/download/ and install the individual components
and libraries you want.
Alternatively, you can download and install a distribution that comes with
pre-installed libraries. I would recommend downloading Anaconda.
Another option is Enthought Canopy Express.

IDLE: Python's Default GUI

One of the more popular environments used for Python computing is the IPython/Jupyter
Notebook.
It is an interactive computational environment in which you can combine code execution,
rich text, mathematics, plots and rich media.

Python Programming and Syntax


Some of the concepts associated with Python and other object-oriented
languages are:
Objects: Everything in Python is an object that has an identity (id) and a
value (mutable or immutable).
Class: A user-defined prototype for an object that defines a set of
attributes that characterize any object of the class. The attributes are data
members (class variables and instance variables) and methods, accessed
via dot notation.
Methods: A special kind of function that is defined in a class definition.
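
The three concepts above can be sketched in a few lines. The class Account and its attribute names are hypothetical and serve only as an illustration.

```python
class Account:
    bank = "Example Bank"            # class variable, shared by all instances

    def __init__(self, owner, balance=0):
        self.owner = owner           # instance variables, one set per object
        self.balance = balance

    def deposit(self, amount):       # method: a function defined inside the class
        self.balance += amount
        return self.balance


a = Account("Ada", 100)
b = Account("Bob")

a.deposit(50)                        # methods are accessed via dot notation
print(a.balance, b.balance)          # each object keeps its own value
print(a.bank == b.bank)              # the class variable is shared
print(id(a) != id(b))                # every object has its own identity
```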

Python Libraries
Because Python is an open source language, its developers have built
many libraries that make common tasks easier.
A library contains multiple modules, which in turn contain sets of dedicated
functions.
Python comes with the Python Standard Library, which contains an extensive set
of modules and built-in functions for carrying out various operations.
Other libraries can be imported into the code once their package has been
installed on the system.
Once a library is imported, its functions can be called in the program.

Python Libraries
Ways to import Python libraries:

In the first manner (import math as m), we define an alias m for the math library.
We can then call functions from the math library (e.g. factorial) through the alias,
as in m.factorial().
In the second manner (from math import *), we import the entire namespace of math,
i.e. we can call factorial() directly without referring to math.
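
The code that originally accompanied this slide did not survive, so the following is a reconstruction that matches the description above and not necessarily the author's exact snippet. The variable names first and second are illustrative.

```python
# First manner: import the library under the alias m.
import math as m

first = m.factorial(5)      # functions are reached through the alias
print(first)

# Second manner: import every public name from math into the current namespace.
from math import *

second = factorial(5)       # no "math." or "m." prefix needed
print(second)
```

The first manner keeps the origin of each function visible, whereas the second can silently overwrite names that are already defined in the program.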

Data Science in Python


Owing to Python's popularity and ease of programming, various Python libraries exist to perform
statistical and analytical operations.
The most common libraries used for these are:
NumPy stands for Numerical Python. Its most powerful feature is the n-dimensional array.
The library also contains basic linear algebra functions, Fourier
transforms, advanced random number capabilities and tools for integration with other
low-level languages such as Fortran, C and C++.
SciPy stands for Scientific Python. It is built on NumPy and is one of the most useful
libraries for a variety of high-level science and engineering modules such as discrete Fourier
transform, linear algebra, optimization and sparse matrices.
Matplotlib for plotting a wide variety of graphs, from histograms to line plots to heat plots.
You can use the Pylab feature in IPython notebook (ipython notebook --pylab=inline) to use these
plotting features inline. If you omit the inline option, pylab converts the IPython environment
into one very similar to Matlab. You can also use LaTeX commands to add math to
your plot.
Pandas for structured data operations and manipulations. It is extensively used for data
munging and preparation. Pandas was added relatively recently to the Python ecosystem and has been
instrumental in boosting Python's usage in the data science community.
Scikit Learn for machine learning. Built on NumPy, SciPy and matplotlib, this library contains
many efficient tools for machine learning and statistical modeling, including classification,
regression, clustering and dimensionality reduction.
Statsmodels for statistical modeling. Statsmodels is a Python module that allows users to
explore data, estimate statistical models, and perform statistical tests. An extensive list of
descriptive statistics, statistical tests, plotting functions, and result statistics is available for
different types of data and each estimator.
Seaborn for statistical data visualization. Seaborn is a library for making attractive and
informative statistical graphics in Python. It is based on matplotlib. Seaborn aims to make
visualization a central part of exploring and understanding data.
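
As a brief sketch of the NumPy features mentioned in the list above (the n-dimensional array and basic linear algebra), the example below assumes NumPy is installed, for instance through Anaconda. The array values are arbitrary illustration data.

```python
import numpy as np

# A 2-dimensional array (2 rows, 2 columns).
a = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(a.shape)        # dimensions of the array
print(a.mean())       # average of all elements

# Basic linear algebra: solve the linear system a @ x = b for x.
b = np.array([5.0, 6.0])
x = np.linalg.solve(a, b)
print(x)

# Check the solution by multiplying back.
print(np.allclose(a @ x, b))
```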

SAS vs R vs Python
SAS
Market leader in the industry
Has a huge array of functions
Easy-to-learn interface
Good technical support
Not always enriched with the latest statistical functions
Expensive

R
Open source counterpart of SAS
Mostly used in academia and research
Latest techniques get released quickly due to its open source nature
Well documented
Cost effective

Python
Usage has been growing over time
Open source platform
Libraries and functions exist to carry out almost any statistical operation
Since the introduction of Pandas it has become very strong in operations on structured data
Cost effective

Big Data
Big data is a collection of large datasets that cannot be
processed using traditional computing techniques. It is not merely data; it
has become a complete subject that involves various tools, techniques and
frameworks.
Structured data: relational data.
Semi-structured data: XML data.
Unstructured data: Word, PDF, text, media logs.
While looking into the technologies that handle big data, we examine the following two
classes of technology:

              Operational    Analytical
Data Scope    Operational    Retrospective
End User      Customer       Data Scientist
Technology    NoSQL          MapReduce, MPP Database

Big Data Technology


Traditional Approach:

Google's Solution: MapReduce Algorithm

MapReduce provides a method of analyzing data that is
complementary to the capabilities provided by SQL. A
system based on MapReduce can be scaled up from
single servers to thousands of high-end and low-end machines.
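
To make the idea concrete, here is a minimal single-process sketch of the MapReduce pattern applied to word counting. It is not distributed and does not use Hadoop; the function names map_phase, shuffle and reduce_phase and the sample documents are hypothetical. In a real cluster, the map and reduce steps would run in parallel on different machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


documents = ["big data is big", "data is everywhere"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because each document is mapped independently and each key is reduced independently, both phases can be spread over many machines without changing the logic.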

Hadoop
Hadoop is an open-source framework that allows users to store and process big data in a
distributed environment across clusters of computers using simple programming
models. It is designed to scale up from single servers to thousands of machines, each
offering local computation and storage.
Hadoop runs applications using the MapReduce algorithm, where the data
is processed in parallel on different CPU nodes.
A distributed file system, HDFS (Hadoop Distributed File System), provides
high-throughput access to application data.

Hadoop and Python


Pydoop is a Python interface to Hadoop that allows you to write MapReduce
applications in pure Python. It offers several features not commonly found in
other Python libraries for Hadoop:
A rich HDFS API
A MapReduce API that allows you to write pure Python record readers / writers,
partitioners and combiners
Transparent Avro (de)serialization
Easy installation-free usage
Being actively improved

Python: Where to Start?


Books:
Python for Data Analysis, O'Reilly (Wes McKinney)
Data Analysis with Open Source Tools, O'Reilly (Philipp K. Janert)

Online Learning Sites:
www.udacity.com [recommended] Category: Data Science, Technology: Python
www.datacamp.com
www.coursera.org

Thank you
