
Tutorial: Using Excel with Python and Pandas

Why learn to work with Excel with Python? Excel is one of the most popular and widely-
used data tools; it’s hard to find an organization that doesn’t work with it in some way. From
analysts, to sales VPs, to CEOs, various professionals use Excel for both quick stats and
serious data crunching.

With Excel being so pervasive, data professionals must be familiar with it. Working with data
in Python or R offers serious advantages over Excel’s UI, so finding a way to work with
Excel using code is critical. Thankfully, there’s a great tool already out there for using Excel
with Python called pandas.

Pandas has excellent methods for reading all kinds of data from Excel files. You can also
export your results from pandas back to Excel, if that’s preferred by your intended audience.
Pandas is great for other routine data analysis tasks, such as:

• quick Exploratory Data Analysis (EDA)
• drawing attractive plots
• feeding data into machine learning tools like scikit-learn
• building machine learning models on your data
• taking cleaned and processed data to any number of data tools

Pandas is better at automating data processing tasks than Excel, including processing Excel
files.

In this tutorial, we are going to show you how to work with Excel files in pandas. We will
cover the following concepts:

• setting up your computer with the necessary software
• reading in data from Excel files into pandas
• data exploration in pandas
• visualizing data in pandas using the matplotlib visualization library
• manipulating and reshaping data in pandas
• moving data from pandas into Excel

Note that this tutorial does not provide a deep dive into pandas. To explore pandas more,
check out our course.

System Prerequisites
We will use Python 3 and Jupyter Notebook to demonstrate the code in this tutorial.
In addition to Python and Jupyter Notebook, you will need the following Python modules:

• matplotlib – data visualization
• NumPy – numerical data functionality
• OpenPyXL – read/write Excel 2010 xlsx/xlsm files
• pandas – data import, clean-up, exploration, and analysis
• xlrd – read Excel data
• xlwt – write to Excel
• XlsxWriter – write to Excel (xlsx) files

There are multiple ways to get set up with all the modules. We cover three of the most
common scenarios below.

• If you have Python installed via the Anaconda distribution, you can install the
required modules using the conda install command. For example, to install pandas,
you would execute the command conda install pandas.
• If you already have a regular, non-Anaconda Python installed on the computer, you
can install the required modules using pip. Open your command line program and
execute the command pip install <module name> to install a module. You should
replace <module name> with the actual name of the module you are trying to install.
For example, to install pandas, you would execute the command pip install pandas.
• If you don’t have Python installed yet, you should get it through Anaconda.
Anaconda provides installers for Windows, Mac, and Linux computers. If you choose
the full installer, you will get all the modules you need, along with Python and
pandas, within a single package. This is the easiest and fastest way to get started.
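
Whichever route you choose, it can help to confirm that the modules are importable before continuing. The snippet below is a minimal sketch using only the standard library; the helper name missing_modules is made up for this illustration, and the list simply mirrors the import names of the modules listed above.

```python
from importlib.util import find_spec

# Import names of the modules listed above (assumed spelling at import time).
REQUIRED = ["matplotlib", "numpy", "openpyxl", "pandas", "xlrd", "xlwt", "xlsxwriter"]


def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED)
print("Missing modules:", ", ".join(missing) if missing else "none")
```

Any name printed here can then be installed with conda install or pip install as described above.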

The Data Set


In this tutorial, we will use a multi-sheet Excel file we created from Kaggle’s IMDB Scores
data. You can download the file here.

Our Excel file has three sheets: ‘1900s,’ ‘2000s,’ and ‘2010s.’ Each sheet has data for movies
from those years.

We will use this data set to find the ratings distribution for the movies, visualize the movies
with the highest ratings and net earnings, and calculate statistical information about the
movies. We will analyze and explore this data using Python and pandas, thus demonstrating
pandas’ capabilities for working with Excel data in Python.

Read data from the Excel file


We need to first import the data from the Excel file into pandas. To do that, we start by
importing the pandas module.

import pandas as pd

We then use pandas’ read_excel function to read in data from the Excel file. The easiest
way to call it is to pass the file name. If no sheet name is specified, it reads the first
sheet in the workbook (as shown below).

excel_file = 'movies.xls'
movies = pd.read_excel(excel_file)

Here, the read_excel function read the data from the Excel file into a pandas DataFrame
object, which is the structure pandas uses by default for tabular data. We then stored this
DataFrame in a variable called movies.

Pandas has a built-in DataFrame.head() method that we can use to easily display the first
few rows of our DataFrame. If no argument is passed, it displays the first five rows. If a
number is passed, it displays that many rows from the top.

movies.head()
   Title                                             Year  Genres               Language  Country  Content Rating  Duration  …  IMDB Score
0  Intolerance: Love’s Struggle Throughout the Ages  1916  Drama|History|War    NaN       USA      Not Rated       123       …  8.0
1  Over the Hill to the Poorhouse                    1920  Crime|Drama          NaN       USA      NaN             110       …  4.8
2  The Big Parade                                    1925  Drama|Romance|War    NaN       USA      Not Rated       151       …  8.3
3  Metropolis                                        1927  Drama|Sci-Fi         German    Germany  Not Rated       145       …  8.3
4  Pandora’s Box                                     1929  Crime|Drama|Romance  German    Germany  Not Rated       110       …  8.0

5 rows × 25 columns

Excel files quite often have multiple sheets, so the ability to read a specific sheet or all
of them is important. To make this easy, the pandas read_excel function takes an argument
called sheet_name (spelled sheetname in older pandas releases) that tells pandas which sheet
to read the data from. You can pass either the sheet name or the sheet number. Sheet numbers
start at zero. If the argument is not given, it defaults to zero and pandas imports the first
sheet.

By default, pandas will automatically assign a numeric index or row label starting with zero.
You may want to leave the default index as such if your data doesn’t have a column with
unique values that can serve as a better index. In case there is a column that you feel would
serve as a better index, you can override the default behavior by setting the index_col
argument to a column. It takes a numeric value for setting a single column as index or a list
of numeric values for creating a multi-index.

In the code below, we choose the first column, ‘Title’, as the index (index=0) by passing
zero to the index_col argument. Current pandas releases spell the sheet argument sheet_name;
older releases used sheetname.

movies_sheet1 = pd.read_excel(excel_file, sheet_name=0, index_col=0)


movies_sheet1.head()
                                                  Year  Genres               Language  Country  Content Rating  Duration  …  IMDB Score
Title
Intolerance: Love’s Struggle Throughout the Ages  1916  Drama|History|War    NaN       USA      Not Rated       123       …  8.0
Over the Hill to the Poorhouse                    1920  Crime|Drama          NaN       USA      NaN             110       …  4.8
The Big Parade                                    1925  Drama|Romance|War    NaN       USA      Not Rated       151       …  8.3
Metropolis                                        1927  Drama|Sci-Fi         German    Germany  Not Rated       145       …  8.3
Pandora’s Box                                     1929  Crime|Drama|Romance  German    Germany  Not Rated       110       …  8.0

5 rows × 24 columns

As noted above, our Excel data file has three sheets. We already read the first sheet into a
DataFrame. Now, using the same syntax, we will read in the remaining two sheets.

movies_sheet2 = pd.read_excel(excel_file, sheet_name=1, index_col=0)


movies_sheet2.head()
                       Year  Genres                   Language  Country  Content Rating  Duration  …  IMDB Score
Title
102 Dalmatians         2000  Adventure|Comedy|Family  English   USA      G               100.0     …  4.8
28 Days                2000  Comedy|Drama             English   USA      PG-13           103.0     …  6.0
3 Strikes              2000  Comedy                   English   USA      R               82.0      …  4.0
Aberdeen               2000  Drama                    English   UK       NaN             106.0     …  7.3
All the Pretty Horses  2000  Drama|Romance|Western    English   USA      PG-13           220.0     …  5.8

5 rows × 24 columns

movies_sheet3 = pd.read_excel(excel_file, sheet_name=2, index_col=0)


movies_sheet3.head()
                                     Year    Genres                              Language  Country  Content Rating  Duration  …  IMDB Score
Title
127 Hours                            2010.0  Adventure|Biography|Drama|Thriller  English   USA      R               94.0      …  7.6
3 Backyards                          2010.0  Drama                               English   USA      R               88.0      …  5.2
3                                    2010.0  Comedy|Drama|Romance                German    Germany  Unrated         119.0     …  6.8
8: The Mormon Proposition            2010.0  Documentary                         English   USA      R               80.0      …  7.1
A Turtle’s Tale: Sammy’s Adventures  2010.0  Adventure|Animation|Family          English   France   PG              88.0      …  6.1

5 rows × 24 columns

Since all three sheets hold similar data for different movies, we will create a single
DataFrame from the three DataFrames we created above. We use the pandas concat function for
this: we pass in a list of the three DataFrames and assign the result to a new DataFrame
object, movies. Because we reuse the name movies, we overwrite the previously created
DataFrame.

movies = pd.concat([movies_sheet1, movies_sheet2, movies_sheet3])

We can check that the concatenation worked by looking at the shape attribute of the combined
DataFrame, which gives us the number of rows and columns.

movies.shape
(5042, 24)
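
The same check can be written as an explicit comparison of row counts. This is a small sketch with two made-up frames standing in for the sheet DataFrames; the titles and years are hypothetical.

```python
import pandas as pd

# Two tiny stand-ins for the per-sheet DataFrames (hypothetical values).
sheet_a = pd.DataFrame({"Year": [1916, 1920]},
                       index=pd.Index(["Film A", "Film B"], name="Title"))
sheet_b = pd.DataFrame({"Year": [2000]},
                       index=pd.Index(["Film C"], name="Title"))

combined = pd.concat([sheet_a, sheet_b])

# The combined frame should hold every row of its parts and the same columns.
rows_match = len(combined) == len(sheet_a) + len(sheet_b)
columns_match = list(combined.columns) == list(sheet_a.columns)
print(combined.shape, rows_match, columns_match)
```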

Using the ExcelFile class to read multiple sheets

We can also use the ExcelFile class to work with multiple sheets from the same Excel file.
We first wrap the Excel file using ExcelFile and then call its parse method once per sheet.

xlsx = pd.ExcelFile(excel_file)
movies_sheets = []
for sheet in xlsx.sheet_names:
    movies_sheets.append(xlsx.parse(sheet))
movies = pd.concat(movies_sheets)

If you are reading an Excel file with a lot of sheets and are creating a lot of DataFrames,
ExcelFile is more convenient and efficient in comparison to read_excel. With ExcelFile,
you only need to pass the Excel file once, and then you can use it to get the DataFrames.
When using read_excel, you pass the Excel file every time and hence the file is loaded
again for every sheet. This can be a huge performance drag if the Excel file has many sheets
with a large number of rows.
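
As a self-contained sketch of this pattern, the example below builds a small two-sheet workbook in memory (hypothetical data, and it assumes the openpyxl engine is installed) and reads it back in two ways: through a single ExcelFile handle, and through read_excel with sheet_name=None, which returns a dictionary of DataFrames keyed by sheet name.

```python
import io

import pandas as pd

# Build a tiny two-sheet workbook in memory (made-up rows, not the movies file).
buffer = io.BytesIO()
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    pd.DataFrame({"Title": ["A", "B"], "Year": [1999, 1998]}).to_excel(
        writer, sheet_name="1900s", index=False)
    pd.DataFrame({"Title": ["C"], "Year": [2005]}).to_excel(
        writer, sheet_name="2000s", index=False)
data = buffer.getvalue()

# Option 1: open the workbook once and parse every sheet from the same handle.
with pd.ExcelFile(io.BytesIO(data)) as xlsx:
    frames = [xlsx.parse(name) for name in xlsx.sheet_names]
combined = pd.concat(frames, ignore_index=True)

# Option 2: sheet_name=None returns a dict mapping sheet name -> DataFrame.
all_sheets = pd.read_excel(io.BytesIO(data), sheet_name=None)
combined_again = pd.concat(list(all_sheets.values()), ignore_index=True)
```

Here ignore_index=True renumbers the rows of the combined frame; leave it out to keep each sheet's original row labels.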

Exploring the data


Now that we have read in the movies data set from our Excel file, we can start exploring it
using pandas. A pandas DataFrame stores the data in a tabular format, just like the way Excel
displays the data in a sheet. Pandas has a lot of built-in methods to explore the DataFrame we
created from the Excel file we just read in.

We already introduced the head method in the previous section, which displays a few rows from
the top of the DataFrame. Let’s look at a few more methods that come in handy while
exploring the data set.

We can use the shape attribute to find out the number of rows and columns of the
DataFrame.

movies.shape
(5042, 25)

This tells us our data set has 5042 records (rows) and 25 columns. This is useful for
reporting the number of records and columns and for comparing them with the source data set.

We can use the tail method to view the bottom rows. If no parameter is passed, only the
bottom five rows are returned.

movies.tail()
      Title                    Year  Genres                                     Language  Country    Content Rating  Duration  …  IMDB Score
1599  War & Peace              NaN   Drama|History|Romance|War                  English   UK         TV-14           NaN       …  8.2
1600  Wings                    NaN   Comedy|Drama                               English   USA        NaN             30.0      …  7.3
1601  Wolf Creek               NaN   Drama|Horror|Thriller                      English   Australia  NaN             NaN       …  7.1
1602  Wuthering Heights        NaN   Drama|Romance                              English   UK         NaN             142.0     …  7.7
1603  Yu-Gi-Oh! Duel Monsters  NaN   Action|Adventure|Animation|Family|Fantasy  Japanese  Japan      NaN             24.0      …  7.0

5 rows × 25 columns

In Excel, you’re able to sort a sheet based on the values in one or more columns. In pandas,
you can do the same thing with the sort_values method. For example, let’s sort our movies
DataFrame based on the Gross Earnings column.

sorted_by_gross = movies.sort_values(['Gross Earnings'], ascending=False)

Since we have the data sorted by values in a column, we can do a few interesting things with
it. For example, we can display the top 10 movies by Gross Earnings.

sorted_by_gross["Gross Earnings"].head(10)
1867 760505847.0
1027 658672302.0
1263 652177271.0
610 623279547.0
611 623279547.0
1774 533316061.0
1281 474544677.0
226 460935665.0
1183 458991599.0
618 448130642.0
Name: Gross Earnings, dtype: float64
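
For a "top N" question like this one, pandas also offers the nlargest method, which gives the same result as sorting in descending order and taking the head. A brief sketch on made-up numbers (the values are hypothetical, not taken from the movies file):

```python
import pandas as pd

# Hypothetical earnings, including one missing value.
gross = pd.Series([120.0, 75.5, None, 300.0, 42.0], name="Gross Earnings")

# Sort in descending order and keep the first three rows...
top_sorted = gross.sort_values(ascending=False).head(3)

# ...or ask for the three largest values directly.
top_direct = gross.nlargest(3)

print(top_direct)
```

Both approaches keep the original row labels, so the result can still be matched back to the full DataFrame.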

We can also create a plot for the top 10 movies by Gross Earnings. Pandas makes it easy to
visualize your data with plots and charts through matplotlib, a popular data visualization
library. With a couple of lines of code, you can start plotting. Moreover, matplotlib plots
work well inside Jupyter Notebooks since you can display the plots right under the code.

First, we import the matplotlib module and set matplotlib to display the plots right in the
Jupyter Notebook.

import matplotlib.pyplot as plt
%matplotlib inline

We will draw a bar plot where each bar will represent one of the top 10 movies. We can do
this by calling the plot method and setting the argument kind to barh. This tells
matplotlib to draw a horizontal bar plot.

sorted_by_gross['Gross Earnings'].head(10).plot(kind="barh")
plt.show()

Let’s create a histogram of IMDB Scores to check the distribution of IMDB Scores across all
movies. Histograms are a good way to visualize the distribution of a data set. We use the
plot method on the IMDB Score series from our movies DataFrame and pass it the
argument kind="hist".

movies['IMDB Score'].plot(kind="hist")
plt.show()

This data visualization suggests that most of the IMDB Scores fall between six and eight.
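
A reading taken from a plot can be backed up with numbers. The sketch below uses a handful of made-up scores (not the real data) to show one way of counting values per bin with pd.cut and of measuring the share of scores between six and eight with between.

```python
import pandas as pd

# Hypothetical scores standing in for movies['IMDB Score'].
scores = pd.Series([5.1, 6.2, 6.8, 7.4, 7.9, 8.3, 4.0, 6.5], name="IMDB Score")

# Same idea as the histogram: count the values that fall in each one-point bin.
bins = pd.cut(scores, bins=list(range(0, 11)))
counts = bins.value_counts(sort=False)

# Fraction of scores from 6 to 8 (both ends included).
share_6_to_8 = scores.between(6, 8).mean()
print(counts)
print(share_6_to_8)
```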

Getting statistical information about the data


Pandas has some very handy methods to look at the statistical data about our data set. For
example, we can use the describe method to get a statistical summary of the data set.

movies.describe()
              Year     Duration  Aspect Ratio        Budget  Gross Earnings  …   IMDB Score
count  4935.000000  5028.000000   4714.000000  4.551000e+03    4.159000e+03  …  5042.000000
mean   2002.470517   107.201074      2.220403  3.975262e+07    4.846841e+07  …     6.442007
std      12.474599    25.197441      1.385113  2.061149e+08    6.845299e+07  …     1.125189
min    1916.000000     7.000000      1.180000  2.180000e+02    1.620000e+02  …     1.600000
25%    1999.000000    93.000000      1.850000  6.000000e+06    5.340988e+06  …     5.800000
50%    2005.000000   103.000000      2.350000  2.000000e+07    2.551750e+07  …     6.600000
75%    2011.000000   118.000000      2.350000  4.500000e+07    6.230944e+07  …     7.200000
max    2016.000000   511.000000     16.000000  1.221550e+10    7.605058e+08  …     9.500000

The describe method displays the following information for each of the columns:

• the count or number of values
• mean
• standard deviation
• minimum, maximum
• 25%, 50%, and 75% quantiles

Please note that this information is calculated only for the numeric columns.

We can also use the corresponding method to access this information one at a time. For
example, to get the mean of a particular column, you can use the mean method on that
column.

movies["Gross Earnings"].mean()
48468407.526809327

Just like mean, there is a method for each of the other statistics we might want to access.
You can read about these methods in our free pandas cheat sheet.
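
As an illustration, here are a few of those methods applied to a small made-up column. The values are hypothetical; note that missing values are skipped by default.

```python
import pandas as pd

# Hypothetical column with one missing value.
gross = pd.Series([10.0, 20.0, 30.0, 40.0, None], name="Gross Earnings")

summary = {
    "count": gross.count(),       # number of non-missing values
    "mean": gross.mean(),
    "median": gross.median(),
    "min": gross.min(),
    "max": gross.max(),
    "q75": gross.quantile(0.75),  # 75% quantile
}
print(summary)
```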

Reading files with no header and skipping records


Earlier in this tutorial, we saw some ways to read a particular kind of Excel file that had
headers and no rows that needed skipping. Sometimes, the Excel sheet doesn’t have any
header row. For such instances, you can tell pandas not to treat the first row as the header
or column names. If the Excel sheet’s first few rows contain data that should not be read
in, you can ask the read_excel function to skip a certain number of rows, starting from the
top.

For example, look at the top few rows of this Excel file.

This file has no header, and its first four rows are not actual records, so they should not
be read in. We can tell read_excel there is no header by setting the header argument to None,
and we can skip the first four rows by setting the skiprows argument to 4.

movies_skip_rows = pd.read_excel("movies-no-header-skip-rows.xls",
header=None, skiprows=4)
movies_skip_rows.head(5)
   0                    1     2                    3         4        5          6    …  24
0  Metropolis           1927  Drama|Sci-Fi         German    Germany  Not Rated  145  …  8.3
1  Pandora’s Box        1929  Crime|Drama|Romance  German    Germany  Not Rated  110  …  8.0
2  The Broadway Melody  1929  Musical|Romance      English   USA      Passed     100  …  6.3
3  Hell’s Angels        1930  Drama|War            English   USA      Passed     96   …  7.8
4  A Farewell to Arms   1932  Drama|Romance|War    English   USA      Unrated    79   …  6.6

5 rows × 25 columns

We skipped four rows from the sheet and used none of the rows as the header. Also, notice
that you can combine different options in a single read statement. To skip rows at the bottom
of the sheet, you can use the skipfooter option (skip_footer in older pandas releases), which
works just like skiprows, the only difference being that the rows are counted from the bottom
upwards.
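
The following sketch combines these options on a small workbook built in memory. The junk rows and records are made up for the illustration, the openpyxl engine is assumed to be installed, and skipfooter is the spelling used by current pandas releases. The names argument supplies column labels at read time.

```python
import io

import pandas as pd

# Hypothetical sheet: two junk rows on top, three records, one footer row, no header.
raw = pd.DataFrame([
    ["Report generated by hand", None],
    ["Do not edit", None],
    ["Metropolis", 1927],
    ["Pandora's Box", 1929],
    ["Hell's Angels", 1930],
    ["End of report", None],
])
buffer = io.BytesIO()
raw.to_excel(buffer, header=False, index=False, engine="openpyxl")

records = pd.read_excel(
    io.BytesIO(buffer.getvalue()),
    header=None,              # the sheet has no header row
    skiprows=2,               # drop the two junk rows at the top
    skipfooter=1,             # drop the footer row at the bottom
    names=["Title", "Year"],  # label the columns while reading
)
print(records)
```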

The column names in the previous DataFrame are numeric and were assigned by pandas as
defaults. We can replace them with descriptive names by assigning a list of column names to
the columns attribute of the DataFrame.

movies_skip_rows.columns = [
    'Title', 'Year', 'Genres', 'Language', 'Country', 'Content Rating',
    'Duration', 'Aspect Ratio', 'Budget', 'Gross Earnings', 'Director',
    'Actor 1', 'Actor 2', 'Actor 3', 'Facebook Likes - Director',
    'Facebook Likes - Actor 1', 'Facebook Likes - Actor 2',
    'Facebook Likes - Actor 3', 'Facebook Likes - cast Total',
    'Facebook likes - Movie', 'Facenumber in posters', 'User Votes',
    'Reviews by Users', 'Reviews by Crtiics', 'IMDB Score']
movies_skip_rows.head()
   Title                Year  Genres               Language  Country  Content Rating  Duration  …  IMDB Score
0  Metropolis           1927  Drama|Sci-Fi         German    Germany  Not Rated       145       …  8.3
1  Pandora’s Box        1929  Crime|Drama|Romance  German    Germany  Not Rated       110       …  8.0
2  The Broadway Melody  1929  Musical|Romance      English   USA      Passed          100       …  6.3
3  Hell’s Angels        1930  Drama|War            English   USA      Passed          96        …  7.8
4  A Farewell to Arms   1932  Drama|Romance|War    English   USA      Unrated         79        …  6.6

5 rows × 25 columns

Now that we have seen how to read a subset of rows from the Excel file, we can learn how to
read a subset of columns.

Reading a subset of columns


Although read_excel defaults to reading and importing all columns, you can choose to import
only certain columns. The usecols argument (called parse_cols in older pandas releases)
controls this. By passing usecols="A:G", we are telling read_excel to read only the first
seven columns, Excel columns A through G (indexes zero through six).

movies_subset_columns = pd.read_excel(excel_file, usecols="A:G")


movies_subset_columns.head()
   Title                                             Year  Genres               Language  Country  Content Rating  Duration
0  Intolerance: Love’s Struggle Throughout the Ages  1916  Drama|History|War    NaN       USA      Not Rated       123
1  Over the Hill to the Poorhouse                    1920  Crime|Drama          NaN       USA      NaN             110
2  The Big Parade                                    1925  Drama|Romance|War    NaN       USA      Not Rated       151
3  Metropolis                                        1927  Drama|Sci-Fi         German    Germany  Not Rated       145
4  Pandora’s Box                                     1929  Crime|Drama|Romance  German    Germany  Not Rated       110

Alternatively, you can pass in a list of numbers, which will let you import columns at
particular indexes.
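
In current pandas releases the column selection argument is called usecols, and it accepts positions, Excel column letters, or column names. The sketch below shows all three on a small in-memory workbook; the sample frame is a stand-in for the movies sheet and the openpyxl engine is assumed to be installed.

```python
import io

import pandas as pd

# Small stand-in for the movies sheet.
sample = pd.DataFrame({
    "Title": ["Metropolis", "Pandora's Box"],
    "Year": [1927, 1929],
    "Language": ["German", "German"],
    "Duration": [145, 110],
})
buffer = io.BytesIO()
sample.to_excel(buffer, index=False, engine="openpyxl")
data = buffer.getvalue()

by_position = pd.read_excel(io.BytesIO(data), usecols=[0, 3])               # zero-based indexes
by_letters = pd.read_excel(io.BytesIO(data), usecols="A,D")                 # Excel column letters
by_name = pd.read_excel(io.BytesIO(data), usecols=["Title", "Duration"])    # header names
print(by_position)
```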

Applying formulas on the columns


One of the much-used features of Excel is to apply formulas to create new columns from
existing column values. In our Excel file, we have Gross Earnings and Budget columns. We
can get Net earnings by subtracting Budget from Gross earnings. We could then apply this
formula in the Excel file to all the rows. We can do this in pandas also as shown below.

movies["Net Earnings"] = movies["Gross Earnings"] - movies["Budget"]

Above, we used pandas to create a new column called Net Earnings, and populated it with the
difference of Gross Earnings and Budget. It’s worth noting the difference here in how
formulas are treated in Excel versus pandas. In Excel, a formula lives in the cell and updates
when the data changes – with Python, the calculations happen and the values are stored – if
Gross Earnings for one movie was manually changed, Net Earnings won’t be updated.
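
This difference is easy to see on a toy example. The sketch below uses two made-up rows: after one Gross Earnings value is edited, the derived column keeps its old value until the calculation is run again.

```python
import pandas as pd

# Hypothetical figures, for illustration only.
films = pd.DataFrame({"Gross Earnings": [100.0, 250.0], "Budget": [40.0, 300.0]})
films["Net Earnings"] = films["Gross Earnings"] - films["Budget"]

# Change one input value after the derived column was computed.
films.loc[0, "Gross Earnings"] = 500.0
stale = films.loc[0, "Net Earnings"]   # still based on the old value

# Re-running the calculation brings the column up to date.
films["Net Earnings"] = films["Gross Earnings"] - films["Budget"]
fresh = films.loc[0, "Net Earnings"]
print(stale, fresh)
```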

Let’s use the sort_values method to sort the data by the new column we created and
visualize the top 10 movies by Net Earnings.

sorted_movies = movies[['Net Earnings']].sort_values(['Net Earnings'], ascending=[False])
sorted_movies.head(10)['Net Earnings'].plot.barh()
plt.show()

Pivot Table in pandas


Advanced Excel users also often use pivot tables. A pivot table summarizes the data of
another table by grouping the data on an index and applying operations such as sorting,
summing, or averaging. You can use this feature in pandas too.

We need to first identify the column or columns that will serve as the index, and the
column(s) on which the summarizing formula will be applied. Let’s start small, by choosing
Year as the index column and Gross Earnings as the summarization column and creating a
separate DataFrame from this data.

movies_subset = movies[['Year', 'Gross Earnings']]


movies_subset.head()
Year Gross Earnings
0 1916.0 NaN
1 1920.0 3000000.0
2 1925.0 NaN
3 1927.0 26435.0
4 1929.0 9950.0

We now call pivot_table on this subset of data. The method pivot_table takes a
parameter index. As mentioned, we want to use Year as the index.

earnings_by_year = movies_subset.pivot_table(index=['Year'])
earnings_by_year.head()
Gross Earnings
Year
1916.0 NaN
1920.0 3000000.0
1925.0 NaN
1927.0 26435.0
1929.0 1408975.0

This gave us a pivot table grouped on Year that summarizes Gross Earnings by its mean, which
is the default aggregation of pivot_table (for 1929, the value 1408975.0 is the average of
9950.0 and 2808000.0). Notice that we didn’t need to specify the Gross Earnings column
explicitly, as pandas automatically identified it as the values on which summarization should
be applied.
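
If a total per year is wanted instead of an average, the aggregation can be chosen with the aggfunc argument. A minimal sketch on three rows, two of which reuse the 1929 values seen in this tutorial while the 1930 value is made up:

```python
import pandas as pd

films = pd.DataFrame({
    "Year": [1929, 1929, 1930],
    "Gross Earnings": [9950.0, 2808000.0, 100.0],  # the 1930 figure is hypothetical
})

# Default aggregation: the mean of each group.
mean_by_year = films.pivot_table(index="Year", values="Gross Earnings")

# Explicit aggregation: the sum of each group.
sum_by_year = films.pivot_table(index="Year", values="Gross Earnings", aggfunc="sum")
print(mean_by_year)
print(sum_by_year)
```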

We can use this pivot table to create some data visualizations. We can call the plot method
on the DataFrame to create a line plot and call the show method to display the plot in the
notebook.

earnings_by_year.plot()
plt.show()

We saw how to pivot with a single column as the index. Things get more interesting if
we use multiple columns. Let’s create another DataFrame subset, but this time we will
choose the columns Country, Language, and Gross Earnings.

movies_subset = movies[['Country', 'Language', 'Gross Earnings']]


movies_subset.head()
Country Language Gross Earnings
0 USA NaN NaN
1 USA NaN 3000000.0
2 USA NaN NaN
3 Germany German 26435.0
4 Germany German 9950.0

We will use the Country and Language columns as the index for the pivot table. We will use
Gross Earnings as the summarization column; however, as we saw earlier, we do not need to
specify this explicitly.

earnings_by_co_lang = movies_subset.pivot_table(index=['Country',
'Language'])
earnings_by_co_lang.head()
Gross Earnings
Country Language
Afghanistan Dari 1.127331e+06
Argentina Spanish 7.230936e+06
Aruba English 1.007614e+07
Australia Aboriginal 6.165429e+06
Dzongkha 5.052950e+05

Let’s visualize this pivot table with a bar plot. Since there are still a few hundred records
in this pivot table, we will plot just a few of them.

earnings_by_co_lang.head(20).plot(kind='bar', figsize=(20,8))
plt.show()

Exporting the results to Excel


If you’re going to be working with colleagues who use Excel, saving Excel files out of
pandas is important. You can export or write a pandas DataFrame to an Excel file using the
pandas to_excel method. Pandas relies on a separate writer module for this: openpyxl or
XlsxWriter for .xlsx files, while xlwt was used for the legacy .xls format. The to_excel
method is called on the DataFrame we want to export. We also need to pass a filename to which
this DataFrame will be written.

movies.to_excel('output.xlsx')

By default, the index is also saved to the output file. However, sometimes the index doesn’t
provide any useful information. For example, the movies DataFrame has a numeric
auto-increment index that was not part of the original Excel data.

movies.head()
   Title                                             Year    Genres               Language  Country  Content Rating  Duration  Aspect Ratio  Budget     Gross Earnings  …
0  Intolerance: Love’s Struggle Throughout the Ages  1916.0  Drama|History|War    NaN       USA      Not Rated       123.0     1.33          385907.0   NaN             …
1  Over the Hill to the Poorhouse                    1920.0  Crime|Drama          NaN       USA      NaN             110.0     1.33          100000.0   3000000.0       …
2  The Big Parade                                    1925.0  Drama|Romance|War    NaN       USA      Not Rated       151.0     1.33          245000.0   NaN             …
3  Metropolis                                        1927.0  Drama|Sci-Fi         German    Germany  Not Rated       145.0     1.33          6000000.0  26435.0         …
4  Pandora’s Box                                     1929.0  Crime|Drama|Romance  German    Germany  Not Rated       110.0     1.33          NaN        9950.0          …

(Columns after the ellipsis: Facebook Likes – Actor 2, Facebook Likes – Actor 3, Facebook
Likes – cast Total, Facebook likes – Movie, Facenumber in posters, User Votes, Reviews by
Users, Reviews by Critics, IMDB Score, Net Earnings.)

5 rows × 26 columns

You can choose to skip the index by passing index=False.

movies.to_excel('output.xlsx', index=False)
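
To see what the index setting changes in practice, here is a hedged sketch that writes a tiny made-up DataFrame to a temporary file and reads it back. The helper round_trip and the toy data are assumptions for illustration, and the code needs an Excel engine such as openpyxl to be installed.

```python
import os
import tempfile

import pandas as pd


def round_trip(df, **to_excel_kwargs):
    """Write df to a temporary .xlsx file and read it straight back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'output.xlsx')
        df.to_excel(path, **to_excel_kwargs)
        return pd.read_excel(path)


# Hypothetical toy data standing in for the movies DataFrame.
toy = pd.DataFrame({'Title': ['Metropolis', 'The Big Parade'], 'Year': [1927, 1925]})

try:
    with_index = round_trip(toy)                  # index becomes an extra, unnamed column
    without_index = round_trip(toy, index=False)  # only the real data columns are written
    print(list(with_index.columns))
    print(list(without_index.columns))
except ImportError as exc:
    # Raised when no Excel engine (for example openpyxl) is installed.
    print('Skipping the round trip:', exc)
```

When the index is written, it comes back as an extra unnamed first column, which is usually the sign that index=False was intended.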

We need to be able to make our output files look nice before we send them out to our
co-workers. We can use the pandas ExcelWriter class along with the XlsxWriter Python
module to apply the formatting.

We can use these advanced output options by creating an ExcelWriter object and using this
object to write to the Excel file.

writer = pd.ExcelWriter('output.xlsx', engine='xlsxwriter')


movies.to_excel(writer, index=False, sheet_name='report')
workbook = writer.book
worksheet = writer.sheets['report']

We can apply customizations by calling add_format on the workbook we are writing to.
Here we are setting header format as bold.

header_fmt = workbook.add_format({'bold': True})


worksheet.set_row(0, None, header_fmt)

Finally, we save the output file by calling the close method on the writer object. (Older
pandas versions also offered writer.save(), which was removed in pandas 2.0.)

writer.close()

As an example, we saved the data with the column headers set in bold. The saved file looks
like the image below.

In the same way, XlsxWriter can be used to apply many other kinds of formatting to the
output Excel file.
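
The steps above can also be written with ExcelWriter as a context manager, which saves and closes the file when the block ends. This is a sketch, not the tutorial’s exact code: the helper write_report, the toy data, and the chosen widths and number format are assumptions, and it requires the XlsxWriter package.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data; the tutorial itself uses the movies DataFrame.
toy = pd.DataFrame({'Title': ['Metropolis', 'The Big Parade'],
                    'Budget': [6000000.0, 245000.0]})


def write_report(df, path):
    """Write df to path with wider columns and a number format (needs XlsxWriter)."""
    # Leaving the with block saves and closes the file, so no explicit call is needed.
    with pd.ExcelWriter(path, engine='xlsxwriter') as writer:
        df.to_excel(writer, index=False, sheet_name='report')
        workbook = writer.book
        worksheet = writer.sheets['report']

        money_fmt = workbook.add_format({'num_format': '#,##0'})
        worksheet.set_column(0, 0, 30)             # column A: width 30
        worksheet.set_column(1, 1, 18, money_fmt)  # column B: width 18, thousands separator


try:
    with tempfile.TemporaryDirectory() as tmp:
        report_path = os.path.join(tmp, 'report.xlsx')
        write_report(toy, report_path)
        report_size = os.path.getsize(report_path)
    print('wrote', report_size, 'bytes')
except ImportError as exc:
    # Raised when the XlsxWriter package is not installed.
    print('Skipping the report:', exc)
```

The context manager form avoids forgetting the final save step, and it also closes the file if an error occurs while formatting.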

pandas.DataFrame.to_excel

DataFrame.to_excel(self, excel_writer, sheet_name='Sheet1', na_rep='', float_format=None,
columns=None, header=True, index=True, index_label=None, startrow=0, startcol=0,
engine=None, merge_cells=True, encoding=None, inf_rep='inf', verbose=True,
freeze_panes=None)

Write object to an Excel sheet.

To write a single object to an Excel .xlsx file it is only necessary to specify a target
file name. To write to multiple sheets it is necessary to create an ExcelWriter object
with a target file name, and specify a sheet in the file to write to.

Multiple sheets may be written to by specifying unique sheet_name. With all data
written to the file it is necessary to save the changes. Note that creating an
ExcelWriter object with a file name that already exists will result in the contents of
the existing file being erased.

Parameters:

excel_writer : str or ExcelWriter object

File path or existing ExcelWriter.

sheet_name : str, default ‘Sheet1’

Name of sheet which will contain DataFrame.

na_rep : str, default ‘’

Missing data representation.

float_format : str, optional

Format string for floating point numbers. For example float_format="%.2f" will format
0.1234 to 0.12.

columns : sequence or list of str, optional

Columns to write.

header : bool or list of str, default True

Write out the column names. If a list of string is given it is assumed to be aliases for the
column names.

index : bool, default True

Write row names (index).

index_label : str or sequence, optional

Column label for index column(s) if desired. If not specified, and header and index are
True, then the index names are used. A sequence should be given if the DataFrame uses
MultiIndex.

startrow : int, default 0

Upper left cell row to dump data frame.

startcol : int, default 0

Upper left cell column to dump data frame.

engine : str, optional

Write engine to use, ‘openpyxl’ or ‘xlsxwriter’. You can also set this via the options
io.excel.xlsx.writer, io.excel.xls.writer, and io.excel.xlsm.writer.

merge_cells : bool, default True

Write MultiIndex and Hierarchical Rows as merged cells.

encoding : str, optional

Encoding of the resulting excel file. Only necessary for xlwt, other writers support unicode
natively.

inf_rep : str, default ‘inf’

Representation for infinity (there is no native representation for infinity in Excel).

verbose : bool, default True

Display more information in the error logs.

freeze_panes : tuple of int (length 2), optional

Specifies the one-based bottommost row and rightmost column that is to be frozen.

New in version 0.20.0.
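
Several of the parameters above can be combined in one call. The sketch below uses made-up data and a hypothetical helper, export_and_reload, to show columns, na_rep, float_format and freeze_panes together; it assumes an Excel engine such as openpyxl is installed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data; the values are illustrative only.
toy = pd.DataFrame({
    'Title': ['Metropolis', 'The Big Parade'],
    'Budget': [6000000.0, 245000.0],
    'Gross Earnings': [26435.0, None],
    'Ratio': [1.3333, 1.85],
})


def export_and_reload(df):
    """Write a formatted subset of df to a temporary file and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'report.xlsx')
        df.to_excel(
            path,
            sheet_name='report',
            columns=['Title', 'Gross Earnings', 'Ratio'],  # Budget is left out
            na_rep='missing',       # text written in place of missing values
            float_format='%.2f',    # floats are rounded to two decimals
            index=False,
            freeze_panes=(1, 0),    # keep the header row visible while scrolling
        )
        return pd.read_excel(path, sheet_name='report')


try:
    back = export_and_reload(toy)
    print(back)
except ImportError as exc:
    # Raised when no Excel engine (for example openpyxl) is installed.
    print('Skipping the export:', exc)
```

The placeholder text 'missing' is chosen here because read_excel does not treat it as a missing-value marker, so it survives the round trip as plain text.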


See also

to_csv

Write DataFrame to a comma-separated values (csv) file.

ExcelWriter

Class for writing DataFrame objects into excel sheets.

read_excel

Read an Excel file into a pandas DataFrame.

read_csv

Read a comma-separated values (csv) file into DataFrame.

Notes

For compatibility with to_csv(), to_excel serializes lists and dicts to strings before
writing.

Once a workbook has been saved it is not possible to write further data without rewriting
the whole workbook.

Examples

Create, write to and save a workbook:

>>> df1 = pd.DataFrame([['a', 'b'], ['c', 'd']],
...                    index=['row 1', 'row 2'],
...                    columns=['col 1', 'col 2'])
>>> df1.to_excel("output.xlsx") # doctest: +SKIP

To specify the sheet name:

>>> df1.to_excel("output.xlsx",
... sheet_name='Sheet_name_1') # doctest: +SKIP

If you wish to write to more than one sheet in the workbook, it is necessary to specify
an ExcelWriter object:

>>> df2 = df1.copy()
>>> with pd.ExcelWriter('output.xlsx') as writer: # doctest: +SKIP
...     df1.to_excel(writer, sheet_name='Sheet_name_1')
...     df2.to_excel(writer, sheet_name='Sheet_name_2')

To set the library that is used to write the Excel file, you can pass the engine keyword
(the default engine is automatically chosen depending on the file extension):

>>> df1.to_excel('output1.xlsx', engine='xlsxwriter') # doctest: +SKIP
