
New York City Taxi Trips Data Analysis

Data Description:
Dataset 1: Trip data for February (one of 12 monthly datasets covering January to
December 2013)

Attributes description:

13990176 rows with 14 attributes

Medallion: a permit to operate a yellow taxi cab in New York City, it is effectively a (randomly assigned) car ID.

Hack license: a license to drive the vehicle, it is effectively a (randomly assigned) driver ID.

Vendor id: e.g., Verifone Transportation Systems (VTS), or Mobile Knowledge Systems Inc (CMT).

rate_code: taximeter rate.

store_and_fwd_flag: unknown attribute.

Pickup datetime: start time of the trip, mm-dd-yyyy hh24:mm:ss EDT.

Dropoff datetime: end time of the trip, mm-dd-yyyy hh24:mm:ss EDT.

Passenger count: number of passengers on the trip, default value is one.

Trip time in secs: trip time measured by the taximeter in seconds.

Trip distance: trip distance measured by the taximeter in miles.

Pickup_longitude and pickup_latitude: GPS coordinates at the start of the trip.

Dropoff longitude and dropoff latitude: GPS coordinates at the end of the trip.

Dataset 2: Fare data for February (one of 12 monthly datasets covering January
to December 2013)

Attributes description:

13990176 rows with 11 attributes

Medallion: a permit to operate a yellow taxi cab in New York City, it is effectively a (randomly assigned) car ID.

Hack license: a license to drive the vehicle, it is effectively a (randomly assigned) driver ID.

Vendor id: e.g., Verifone Transportation Systems (VTS), or Mobile Knowledge Systems Inc (CMT).

Pickup datetime: start time of the trip, mm-dd-yyyy hh24:mm:ss EDT.

Payment type: Cash or credit card.

Fare amount: the meter fare, it should include the Newark surcharge, in USD.

Surcharge: Extra fees, such as rush hour and overnight surcharges, in USD.

Mta tax: Metropolitan commuter transportation mobility tax, in USD.

Tip amount: the tip paid, in USD.

Tolls amount: total price paid for tolls, summed across all tolls for the trip, in USD.

Total amount: all charges that are presented to the passenger at time of fare payment (includes tip for non-cash trips), in USD.

Challenges with respect to the data:

1. Each trip dataset is about 2.2 GB and each fare dataset is about 700 MB to 1 GB. Analyzing both the trip and
fare datasets for a full month on a single machine with 4 GB of RAM was too slow, so I took a sample covering
the first week of February (about 700 MB after merging trip and fare data, ~36 lakh records, i.e. ~3.6 million).
2. The trip time in secs column is ambiguous: it is sometimes recorded in seconds and sometimes in minutes.
3. In some records the pickup and dropoff lat/lon coordinates are recorded as (0, 0).
4. I don't know how the trip distance is calculated, so I used the geolocation coordinates to compute the
straight-line distance.
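The straight-line distance in challenge 4 can be sketched with the haversine formula. This is a minimal Python sketch, not the code used in the analysis (which relied on R packages such as geosphere); the function names and the Earth radius constant are assumptions made for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius in miles (assumed constant)


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def straight_line_miles(pickup, dropoff):
    """Distance between two (lat, lon) tuples.

    Returns None when a point is missing or is the (0, 0) placeholder
    described in challenge 3.
    """
    if pickup is None or dropoff is None:
        return None
    if (0.0, 0.0) in (pickup, dropoff):
        return None
    return haversine_miles(pickup[0], pickup[1], dropoff[0], dropoff[1])
```

For example, `straight_line_miles((40.7580, -73.9855), (40.6413, -73.7781))` gives a distance of roughly 13 to 14 miles, while any trip with a (0, 0) endpoint yields `None` instead of a misleading value.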
APPROACH:
The entire analysis was done using a local disk backend plus some in-memory
computations.
Pickup and dropoff lat/lon locations outside the New York City region are set
to NA. I did not remove the records containing NA, because most of the other
important information in them would be lost.
The trip time is calculated by subtracting the pickup datetime from the dropoff
datetime.
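The two cleaning steps above can be sketched as follows. This is an illustrative Python sketch rather than the original R code: the bounding box reuses the corner coordinates quoted in the map section of this report, the timestamp format is assumed from the attribute description (mm-dd-yyyy hh24:mm:ss), and `None` stands in for NA.

```python
from datetime import datetime

# Corner coordinates quoted in the map section of this report.
LAT_MIN, LAT_MAX = 40.477399, 40.917577
LON_MIN, LON_MAX = -74.259090, -73.700272

# Assumed from the attribute description; adjust if the files differ.
TIME_FORMAT = "%m-%d-%Y %H:%M:%S"


def clean_coordinate(lat, lon):
    """Return (lat, lon), or (None, None) if the point is outside the box."""
    if LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX:
        return lat, lon
    return None, None


def trip_seconds(pickup_text, dropoff_text, fmt=TIME_FORMAT):
    """Trip duration in seconds, computed as dropoff minus pickup."""
    pickup = datetime.strptime(pickup_text, fmt)
    dropoff = datetime.strptime(dropoff_text, fmt)
    return (dropoff - pickup).total_seconds()
```

Masking the coordinates rather than dropping the row keeps the fare, tip and time fields of that trip available. Deriving the duration from the two timestamps avoids relying on the ambiguous trip time column.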

Final dataset:
3590754 rows with 21 attributes (first week of February).
Quantile plots for a few of the attributes (single-variable analysis):
Total medallions: 13306
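The attribute count is consistent with joining the two datasets on the four attributes they share: 14 + 11 - 4 = 21. A minimal Python sketch of such a merge follows; the snake_case column names are hypothetical, and the sketch assumes each key combination identifies at most one fare record.

```python
# Columns present in both the trip and the fare data (names are hypothetical).
KEYS = ("medallion", "hack_license", "vendor_id", "pickup_datetime")


def merge_trip_fare(trips, fares, keys=KEYS):
    """Inner-join two lists of dict records on the shared key columns."""
    fare_index = {tuple(fare[k] for k in keys): fare for fare in fares}
    merged = []
    for trip in trips:
        fare = fare_index.get(tuple(trip[k] for k in keys))
        if fare is not None:
            # Shared keys have equal values, so the union has no conflicts.
            merged.append({**trip, **fare})
    return merged
```

Trips without a matching fare record are dropped by this inner join; a left join would keep them with the fare fields missing.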

The above graph shows that most of the medallions were used very little, while some were used more than 500 times in
the week (~71 times a day).
X axis: each medallion.
Y axis: medallion frequency (number of trips).
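The per-medallion frequencies behind a quantile plot like this one can be sketched in Python as below. The function names and the record layout (a list of dicts) are assumptions for illustration; the original plots were produced with R packages.

```python
from collections import Counter
from statistics import quantiles


def trips_per_id(records, column):
    """Count how many trips each ID (e.g. a medallion) appears in."""
    return Counter(record[column] for record in records)


def count_quantiles(counts, n=10):
    """Cut points that split the per-ID trip counts into n equal groups."""
    return quantiles(sorted(counts.values()), n=n, method="inclusive")
```

The same two functions apply unchanged to hack licenses by passing a different column name. Dividing a weekly count by 7 gives the per-day figure, for example 500 / 7 is about 71.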

Total number of hacks: 29836. Plotting the number of trips made by each hack, for the top 5000 and bottom 5000
hacks, gives the above graph. It shows that

Most of the hacks (drivers) were used very little, whereas a few hacks drove over 400 trips in the week (~57
trips per day) and some averaged over 7 trips a day.

About 70% of cab rides have a single passenger; rides with zero passengers were also reported.
14% of cab rides have 2 passengers.
4% of cab rides have 3 passengers.
2% of cab rides have 4 passengers.
7% of cab rides have 5 passengers.
3% of cab rides have 6 passengers.

~74% of the cab rides last less than 14 minutes.

Most of the cab rides took between 5 and 10 minutes.
Maximum cab ride duration is ~38 minutes.

I'm not sure how trip distance is calculated, but almost 90% of cab rides are 6 miles or less.
Only 8% of cab rides covered between 6 and 11 miles.
2% of cab rides covered between 11 and 18 miles.

The left graph shows an outlier at 480 dollars; removing it with the condition < 100 dollars
gives the right-side plot.
From the right-side graph, the median cab fare is 10 dollars.
93% of the cab rides cost less than ~25 dollars.
4% cost between ~25 and ~40 dollars.
3% cost between ~40 and ~56 dollars.
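The outlier filter and the summary figures above can be sketched as follows. This is a hypothetical Python helper, not the original R code; the 100 dollar cap comes from the text, while the function name and the 25 dollar default threshold are chosen for illustration.

```python
from statistics import median


def fare_summary(fares, cap=100.0, threshold=25.0):
    """Drop fares at or above cap, then return (median, share below threshold)."""
    kept = [fare for fare in fares if fare < cap]
    share_below = sum(fare < threshold for fare in kept) / len(kept)
    return median(kept), share_below
```

Calling `fare_summary([5.0, 10.0, 10.0, 30.0, 480.0])` discards the 480 dollar fare and reports the median and the share of the remaining fares below 25 dollars.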

Calculating tip quantiles by payment type:

Removing the outlier in the left graph with the condition tip amount < 25 gives the right-side graph.

Summary (payment type)

levels : 5
> freqTable head <
CRD : 1972931
CSH : 1606335
NOC : 7494
DIS : 2339

This gives some interesting initial insight: credit card and cash are used in broadly similar proportions (about 55% and 45% of trips).
For payment types other than CRD and UNK, tip amounts are zero.
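Grouping tips by payment type, after applying the tip amount < 25 filter, can be sketched in Python as below. The column names `payment_type` and `tip_amount` and the choice of summary statistics are assumptions for illustration.

```python
from collections import defaultdict
from statistics import median


def tip_stats_by_payment(records, max_tip=25.0):
    """Per payment type: number of trips, median tip and largest tip."""
    groups = defaultdict(list)
    for record in records:
        if record["tip_amount"] < max_tip:
            groups[record["payment_type"]].append(record["tip_amount"])
    return {
        ptype: {"n": len(tips), "median": median(tips), "max": max(tips)}
        for ptype, tips in groups.items()
    }
```

A payment type whose largest tip is 0 in this summary has no recorded tips at all. Cash tips are typically not captured by the meter, which would explain zeros for CSH.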

Relationship between two variables:


1) Pickup longitude and pickup latitude

Each point in the graph is a pickup location, plotted from its pickup latitude and longitude.
Pickup locations are used to map the city streets and to analyze traffic mobility within them. This is how the latitude and
longitude columns are used in my analysis.

This graph is a more refined representation of the previous graph.

Points represent pickup locations (lat/lon).
Each color band is the pickup frequency for a location in the city. The green region seems to be
interesting.
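The pickup frequency per location can be approximated by counting points per grid cell. The sketch below uses a square grid in Python as a simpler stand-in for the hexagonal binning of the hexbin package listed in the references; the cell size is an assumed value.

```python
import math
from collections import Counter


def grid_counts(points, cell_deg=0.005):
    """Count (lat, lon) points per square cell of side cell_deg degrees.

    Points with a missing coordinate (None) are skipped.
    """
    counts = Counter()
    for lat, lon in points:
        if lat is None or lon is None:
            continue
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        counts[cell] += 1
    return counts
```

`grid_counts(points).most_common(10)` then lists the ten busiest cells, which correspond to the most intensely colored regions of such a plot.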

Zooming in on Manhattan: North West (lat = 40.917577, lon = -74.259090), South East (lat = 40.477399, lon = -73.700272)

Each point is a pickup location. Each line is a street in Manhattan. The thick black region is the busiest area in
Manhattan. Taking the full dataset for the entire month of February would have given a clearer picture of the
busiest areas.

A more refined graph with the frequency of each pickup location. The green region is the busiest area; as mentioned
before, the thick green region in the middle is the busiest area in the city.

2) Total amount paid and distance:

The graph shows the relationship between the total amount paid and the distance covered.
The green region shows the highest count (how often a given combination of amount paid and distance
travelled occurred); short-distance trips are the most frequent.
Similarly, each color band indicates the frequency of a combination of amount paid and distance travelled.

NUMBER OF TRIPS MADE ON EACH WEEKDAY:

The graph shows the number of trips made on each day of the week; every point is the total number of trips, in lakhs
(1 lakh = 100,000), made on that day.
Friday and Saturday record the highest numbers of trips, whereas Sunday and Monday record the lowest;
the remaining days lie in between.

Hour-wise analysis for every weekday:

The graph shows the number of trips made in each hour of each day of the week, and how the number
of trips varies across days and times.
From Monday to Friday there is a steep increase in the number of trips from 5 AM to 8 AM,
then a gradual decrease up to 10 AM, after which trip numbers keep fluctuating around a roughly constant
level until 3 PM.

From 4 PM to 6 PM the trip numbers increase again.

On Saturday and Sunday the pattern is totally different. From 12 AM to 5 AM trip numbers decrease,
then they increase from 5 AM to 1 PM, and from 1 PM to 4 PM they remain constant.
On Friday and Saturday the trip numbers stay high late at night, up to 11:59 PM, whereas
Sunday behaves quite differently.
The peak hours taken from this graph are 5 AM to 7:59 AM and 4 PM to 6:59 PM.
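The counts behind an hour-by-weekday plot can be sketched in Python as below. The timestamp format is assumed from the attribute description (mm-dd-yyyy hh24:mm:ss) and the function name is hypothetical; the original analysis used R packages such as lubridate.

```python
from collections import Counter
from datetime import datetime

TIME_FORMAT = "%m-%d-%Y %H:%M:%S"  # assumed from the attribute description
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def trips_by_weekday_hour(pickup_times, fmt=TIME_FORMAT):
    """Count trips per (weekday name, hour of day) from pickup timestamps."""
    counts = Counter()
    for text in pickup_times:
        moment = datetime.strptime(text, fmt)
        # datetime.weekday() returns 0 for Monday through 6 for Sunday.
        counts[(WEEKDAYS[moment.weekday()], moment.hour)] += 1
    return counts
```

Summing the counts over hours 5 to 7, or 16 to 18, for one weekday gives that day's total for the morning or evening peak window identified above.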

Zooming in on the morning peak hours (5 AM to 7:59 AM):

The graph shows the number of trips made on each weekday during the morning peak hours, where each point
is the number of trips in that hour (minutes 0-59). The time period is 5 AM to 7:59 AM.
Saturday and Sunday mornings show far fewer trips than the rest
of the days.

Analysis of the percentage of occupied and unoccupied cabs on the road from 12:00 AM to 11:59 AM:

The graph shows the percentage of occupied and unoccupied cabs at different times
from 12:00 AM to 11:59 AM.
Every point on the red line is the percentage of occupied cabs, and every point on the blue line is the
percentage of unoccupied cabs, in that hour (minutes 0-59).
On Saturday from 3 AM to 3:59 AM, about 40% of cabs are occupied and about 60% are
unoccupied.
On Wednesday from 12 AM to 12:59 AM the occupied and unoccupied percentages are equal.

Analysis of the percentage of occupied and unoccupied cabs on the road from 12:00 PM to 11:59 PM:

The graph shows the percentage of occupied and unoccupied cabs at different times from 12:00 PM
to 11:59 PM.
Every point on the red line is the percentage of occupied cabs, and every point on the blue line is the
percentage of unoccupied cabs, in that hour (minutes 0-59).
On Saturday from 11 PM to 11:59 PM, the occupied percentage is ~79.6% and the unoccupied
percentage is ~19.4%.

On Sunday from 7 PM to 7:59 PM the occupied percentage is ~53% and the unoccupied percentage is ~47%.

More refined analysis on distance:

I am not aware of how the recorded distance is calculated. I first calculated the log distance in miles between
the pickup and dropoff locations (lat/lon), converted that numeric distance into factors, and indexed them.
I then divided the data by distance and payment method.
The x axis of this graph shows the mean tip by distance and payment method, and the y axis is the indexed
distance. Each point is the mean tip value for a particular distance and payment type.
Almost all tip payments are made through CRD and UNK. The value lies between 20 and 25 dollars for card. The
tip decreases to 20 as distance increases (higher index), for both CRD and UNK.

Tolls paid by distance category:

X axis: distance category.
Y axis: mean toll value paid.
For distance indexes up to 20 the toll paid is zero.
Above 20, the mean toll value increases, and tolls are paid with all
payment methods.
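Indexing trips by log distance and averaging a value per category and payment type can be sketched as follows. This Python sketch only illustrates the idea: the bin width, the column names and the equal-width binning of the log distance are assumptions, and the factor levels used in the original R analysis may differ.

```python
import math
from collections import defaultdict


def distance_index(miles, bin_width=0.25):
    """Index of the log-distance bin; None for missing or non-positive values."""
    if miles is None or miles <= 0:
        return None
    return math.floor(math.log(miles) / bin_width)


def mean_by_category(records, value_column, bin_width=0.25):
    """Mean of value_column per (distance index, payment type) pair."""
    sums = defaultdict(lambda: [0.0, 0])  # running [total, count] per group
    for record in records:
        index = distance_index(record["distance_miles"], bin_width)
        if index is None:
            continue
        cell = sums[(index, record["payment_type"])]
        cell[0] += record[value_column]
        cell[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}
```

Passing `"tip_amount"` or `"tolls_amount"` as `value_column` gives the mean tip or mean toll per distance category, one value per payment type.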

Relationship between fare without tolls and distance travelled:

X axis: distance travelled. Y axis: total fare excluding tolls.
The green region indicates the most frequent combination of distance and fare amount
(excluding tolls).
Most cab rides cover short distances; the other combinations occur less frequently.

Future work:

This analysis covers only the first week of February, and I used local disk as
the backend to produce these results. The same analysis could be run on Google
BigQuery to speed up processing.
The same code can be run on the entire month of data, given a more powerful
machine.
More interesting insights can be drawn by comparing the results from different
months.

References:
Data downloading and data understanding
http://publish.illinois.edu/dbwork/open-data/
http://www.andresmh.com/nyctaxitrips/
Previous work
http://hafen.github.io/taxi/#background
R packages
datadr, Trelliscope, geosphere, parallel, ggplot2, lubridate, hexbin, etc.

http://tessera.io/docs-datadr/#key_value_pairs
http://tessera.io/docs-trelliscope/