Big data on a single computer. That may sound crazy, but don't worry: I used a sample of a big data set in my analysis, and it is not all about machine learning. The document contains an analysis carried out on yellow medallion taxis in New York City, documented step by step.

Jul 31, 2016

© All Rights Reserved

Data Description:

Dataset 1: Trip data for February (12 datasets, one per month from January to December 2013)

Attributes description:

Medallion: a permit to operate a yellow taxi cab in New York City; it is effectively a (randomly assigned) car ID.

Hack license: a license to drive the vehicle; it is effectively a (randomly assigned) driver ID.

Vendor id: e.g., Verifone Transportation Systems (VTS), or Mobile Knowledge Systems Inc (CMT).

Dropoff longitude and dropoff latitude: GPS coordinates at the end of the trip.

Dataset 2: Fare data for February (12 datasets, one per month from January to December 2013)

Attributes description:

Medallion: a permit to operate a yellow taxi cab in New York City; it is effectively a (randomly assigned) car ID.

Hack license: a license to drive the vehicle; it is effectively a (randomly assigned) driver ID.

Vendor id: e.g., Verifone Transportation Systems (VTS), or Mobile Knowledge Systems Inc (CMT).

Pickup datetime: start time of the trip, mm-dd-yyyy hh24:mm:ss EDT.

Fare amount: the meter fare, in USD; it should include the Newark surcharge.

Surcharge: Extra fees, such as rush hour and overnight surcharges, in USD.

Tolls amount: total price paid for tolls, summed across all tolls for the trip, in USD.

Total amount: all charges that are presented to the passenger at time of fare payment (includes tip for non-cash trips), in USD.

1. Each trip dataset is about 2.2 GB and each fare dataset is about 700 MB to 1 GB. Analyzing both the trip and fare datasets for a full month on a single system with 4 GB of RAM was too slow, so I took a sample: the first week of February (nearly 700 MB after merging trip and fare data, ~36 lakh records).

2. The trip time in secs column is ambiguous: it is sometimes recorded in seconds and other times in minutes.

3. The pickup and dropoff lat/lon coordinates are recorded as 0.0 in some records.

4. I don't know how the trip distance is calculated, so I used the geolocation coordinates to find the straight-line distance.
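The straight-line distance in item 4 can be sketched as follows. This is a minimal Python illustration, not the code used in the analysis. It assumes that "straight line" means great-circle distance and uses the haversine formula; the function name and the Earth-radius constant are assumptions.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an assumed constant


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    # Haversine term: squared half-chord length between the two points.
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))
```

A street route is normally longer than this distance, so the result is best read as a lower bound on the distance actually driven.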

APPROACH:

The entire analysis is done using a local disk backend plus some in-memory computations.

Pickup and dropoff lat/lon locations outside the New York City region are set to NA. I didn't remove the NA records, as most of the other important information in them would be lost.
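A small Python sketch of this masking step, under stated assumptions: the helper name is hypothetical, `None` stands in for NA, and the city region is approximated by a rectangular box. The south, west and east limits are the corner coordinates quoted in this document; the north limit is a placeholder chosen only for illustration.

```python
def mask_outside(lat, lon, south, north, west, east):
    """Return (lat, lon) if the point lies inside the box, else (None, None)."""
    if lat is None or lon is None:
        return (None, None)
    inside = south <= lat <= north and west <= lon <= east
    return (lat, lon) if inside else (None, None)


# South, west and east are taken from the corner coordinates in this document.
SOUTH, WEST, EAST = 40.477399, -74.259090, -73.700272
NORTH = 41.0  # hypothetical placeholder, not from the source

# The record is kept; only its coordinates are blanked out.
trip = {"medallion": "A1", "pickup": (0.0, 0.0), "fare": 7.5}
trip["pickup"] = mask_outside(*trip["pickup"], SOUTH, NORTH, WEST, EAST)
```

Blanking the coordinates rather than dropping the row keeps the fare, tip and time fields available for the analyses that do not need a location.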

The trip time is calculated by subtracting the pickup datetime from the dropoff datetime.
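As a rough Python illustration of this subtraction (not the original code), the sketch below parses both timestamps and returns the difference in seconds. The format string is an assumption based on the mm-dd-yyyy layout given in the attribute description; the actual files may use a different layout.

```python
from datetime import datetime

FMT = "%m-%d-%Y %H:%M:%S"  # assumed layout: mm-dd-yyyy hh:mm:ss (24-hour clock)


def trip_seconds(pickup, dropoff, fmt=FMT):
    """Trip duration in seconds, computed as dropoff minus pickup."""
    start = datetime.strptime(pickup, fmt)
    end = datetime.strptime(dropoff, fmt)
    return (end - start).total_seconds()
```

Recomputing the duration this way avoids the seconds-versus-minutes ambiguity of the recorded trip time column.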

Final dataset:

3,590,754 rows with 21 attributes (first week of February).

Quantile plots for a few of the attributes: single-variable analysis.

Total medallions: 13306

The graph above shows that most of the medallions were not used, while some were used more than 500 times in the week (~71 times a day).

X axis: Each medallion.

Y axis: medallion frequency.

Total number of hacks: 29,836. Plotting the number of trips made by each hack for the top 5,000 and bottom 5,000 gives the graph above. It shows that most of the hacks (drivers) were not utilized, whereas a few drove over 400 trips in the week (~57 trips per day), with some averaging over 7 trips a day.
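The per-medallion and per-hack counts behind these plots amount to a frequency count per ID, and the per-day figures follow from dividing by 7 (500 / 7 ≈ 71, 400 / 7 ≈ 57). The Python sketch below is only an illustration; the function name and sample IDs are hypothetical.

```python
from collections import Counter


def trips_per_id(ids, days=7):
    """Map each ID to (trips in the period, average trips per day)."""
    counts = Counter(ids)
    return {key: (n, n / days) for key, n in counts.items()}


# Hypothetical hack licenses, one entry per trip record.
sample_ids = ["h1"] * 14 + ["h2"] * 7 + ["h3"]
usage = trips_per_id(sample_ids)
```

Sorting the resulting counts gives the values that a quantile plot would display.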

It looks like ~70% of cab rides have a single passenger; rides with zero passengers were also reported.

14% of cab rides have two passengers.

4% of cab rides have 3 passengers.

2% of cab rides have 4 passengers.

7% of cab rides have 5 passengers.

3% of cab rides have 6 passengers.

Most cab rides took between 5 and 10 minutes.

The maximum cab ride duration is ~38 minutes.

I'm not sure how trip distance is calculated, but almost 90% of cab rides are 6 miles or less.

Only 8% of cab rides covered between 6 and 11 miles.

2% of cab rides covered between 11 and 18 miles.

The left graph shows an outlier at 480 dollars; eliminating it with the condition fare < 100 dollars gives the right-side plot.

From the right-side graph, the median cab fare is 10 dollars.

93% of cab rides cost less than ~25 dollars.

4% cost between ~25 and ~40 dollars.

3% cost between ~40 and ~56 dollars.
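The outlier filter and the summary figures above can be reproduced in outline as follows. This is a hedged Python sketch with made-up sample fares; the function name and thresholds are illustrative, and only the cutoff of 100 dollars comes from the text.

```python
from statistics import median


def summarize(values, cutoff, thresholds):
    """Drop values at or above `cutoff`, then return the median and the
    percentage of remaining values below each threshold."""
    kept = [v for v in values if v < cutoff]
    share = {t: 100.0 * sum(v < t for v in kept) / len(kept) for t in thresholds}
    return median(kept), share


# Hypothetical fares in USD, including one extreme value.
fares = [5.0, 10.0, 10.0, 20.0, 480.0]
fare_median, fare_share = summarize(fares, cutoff=100.0, thresholds=[25.0, 40.0])
```

The same helper applies to the tip amounts by passing a cutoff of 25.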

Removing the outlier from the left graph with the condition tip amount < 25 gives the right-side graph.

Payment type (levels: 5), head of the frequency table:

CRD: 1972931

CSH: 1606335

NOC: 7494

DIS: 2339

This gives us an interesting initial insight: credit card and cash are used in nearly equal proportion (roughly 55% and 45% of the records).

Tip amounts are zero for all payment types other than CRD and UNK.
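The proportions can be checked directly from the frequency table. The Python sketch below assumes that the two trailing counts in the table belong to NOC and DIS in that order, and it uses the 3,590,754-row total reported for the final dataset as the denominator.

```python
def shares(counts, total=None):
    """Percentage share of each level; `total` defaults to the sum of counts."""
    total = total if total is not None else sum(counts.values())
    return {level: 100.0 * n / total for level, n in counts.items()}


# Counts as listed in the frequency table (NOC/DIS assignment is an assumption).
payment_counts = {"CRD": 1972931, "CSH": 1606335, "NOC": 7494, "DIS": 2339}
pct = shares(payment_counts, total=3590754)
```

Card payments come out slightly ahead of cash, and the two together account for nearly all trips.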

1) Pickup longitude and Pickup latitude

Each point in the graph indicates a pickup location, given by its pickup latitude and longitude.

Pickup locations are used to map the city streets and analyze traffic mobility within them. This is how the latitude and longitude columns are used in my analysis.

Points represent pickup locations (lat/lon).

Each color space is the pickup frequency for a location in the city. The green color space seems to be interesting.
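One simple way to obtain a pickup frequency per location is to bin the coordinates into grid cells and count the points in each cell. The Python sketch below illustrates that idea only; the square grid, the cell size and the function name are assumptions, and the original plots may have used a different binning.

```python
import math
from collections import Counter


def bin_pickups(points, cell=0.01):
    """Count pickups per grid cell; `cell` is the cell size in degrees (assumed)."""
    counts = Counter()
    for lat, lon in points:
        if lat is None or lon is None:
            continue  # skip records whose coordinates were set to NA
        counts[(math.floor(lat / cell), math.floor(lon / cell))] += 1
    return counts


# Hypothetical pickup points; the first two fall in the same cell.
pickups = [(40.751, -73.991), (40.752, -73.992), (40.771, -73.955), (None, None)]
cell_counts = bin_pickups(pickups)
```

Coloring each cell by its count gives a frequency map in which the busiest areas stand out.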

... -74.259090), South East (lat = 40.477399, lon = -73.700272) location.

Each line is a street in Manhattan. The thick black space is the busiest area in Manhattan. Taking the full dataset for the entire month of February would have given a clearer picture of the busiest areas in the town.

... with the frequency of each pickup location. The green space is the busiest area. As mentioned before, the thick green region in the middle is the busiest area in the city.

2)

The graph represents the relationship between the total amount paid and the distance covered.

The green space in the graph shows the highest count (how many times a given combination of amount paid and distance travelled occurred); short-distance trips are the most frequent.

Similarly, each color space indicates the frequency of a combination of amount paid and distance travelled.

The graph shows the number of trips made on each day of the week, where every point represents the total trips, in lakhs, made on that particular day.

Friday and Saturday record a high number of trips, whereas Sunday and Monday record a low number; the rest of the days lie in between.
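Counting trips per day of the week comes down to parsing each pickup timestamp and tallying its weekday. The following Python sketch is illustrative only; the timestamp format is assumed from the attribute description and the sample timestamps are made up.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
FMT = "%m-%d-%Y %H:%M:%S"  # assumed timestamp layout


def trips_by_weekday(pickups, fmt=FMT):
    """Count trips per weekday name from pickup timestamps."""
    counts = Counter()
    for stamp in pickups:
        counts[DAY_NAMES[datetime.strptime(stamp, fmt).weekday()]] += 1
    return counts


# Hypothetical pickups: two on 1 February 2013, one on 3 February 2013.
sample = ["02-01-2013 08:15:00", "02-01-2013 23:40:00", "02-03-2013 10:05:00"]
weekday_counts = trips_by_weekday(sample)
```

Dividing each count by 100,000 expresses it in lakhs, the unit used on the graph.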

The graph represents the number of trips made at every hour on each day of the week, showing how trip counts vary across days and times of day.

From Monday to Friday there is a steep increase in the number of trips from 5 AM to 8 AM, then a gradual decrease up to 10 AM, after which trip numbers keep fluctuating around a steady level until 3 PM.

On Saturday and Sunday the pattern is totally different: trip numbers decrease from 12 AM to 5 AM, increase from 5 AM to 1 PM, and then remain constant from 1 PM to 4 PM.

On Friday and Saturday the trip numbers are still high late at night, up to 11:59 PM, whereas on Sunday the behavior is totally different.

The peak hours considered from this graph are 5 AM to 7:59 AM and 4 PM to 6:59 PM.
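The hourly breakdown and the peak-hour windows can be sketched in Python as follows. The function names are hypothetical and the timestamp format is assumed; only the two peak windows (5 AM to 7:59 AM and 4 PM to 6:59 PM) come from the text.

```python
from collections import Counter
from datetime import datetime

FMT = "%m-%d-%Y %H:%M:%S"  # assumed timestamp layout


def is_peak_hour(hour):
    """True for 5:00-7:59 and 16:00-18:59, the peak windows named in the text."""
    return 5 <= hour < 8 or 16 <= hour < 19


def trips_by_day_hour(pickups, fmt=FMT):
    """Count trips per (weekday index, hour); Monday is 0 and Sunday is 6."""
    counts = Counter()
    for stamp in pickups:
        t = datetime.strptime(stamp, fmt)
        counts[(t.weekday(), t.hour)] += 1
    return counts


# Hypothetical pickups on Friday, 1 February 2013.
hourly = trips_by_day_hour(["02-01-2013 07:15:00", "02-01-2013 07:50:00",
                            "02-01-2013 08:15:00"])
```

Filtering the keys with `is_peak_hour` selects the counts for the peak-hour plots.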

The graph shows the number of trips made on each day of the week during the morning peak hours, where each point represents the number of trips in that particular hour (minutes 0 to 59). The time period is 5 AM to 7:59 AM.

Saturday and Sunday mornings show a sharp decrease in the number of trips compared with the rest of the days.

Percentage of occupied and unoccupied cabs on the road from 12:00 AM to 11:59 AM.

The graph represents the percentage of occupied and unoccupied cabs at different times from 12:00 AM to 11:59 AM.

Every point on the red line represents the percentage of occupied cabs, and every point on the blue line the percentage of unoccupied cabs, in that particular hour (minutes 0 to 59).

On Saturday from 3 AM to 3:59 AM, ~40% of cabs are occupied and ~60% are unoccupied.

On Wednesday from 12 AM to 12:59 AM, the occupied and unoccupied percentages are the same.

Percentage of occupied and unoccupied cabs on the road from 12:00 PM to 11:59 PM.

The graph represents the percentage of occupied and unoccupied cabs at different times from 12:00 PM to 11:59 PM.

Every point on the red line represents the percentage of occupied cabs, and every point on the blue line the percentage of unoccupied cabs, in that particular hour (minutes 0 to 59).

On Saturday from 11:00 PM to 11:59 PM, ~79.6% of cabs are occupied and ~19.4% are unoccupied.

On Sunday from 7 PM to 7:59 PM, the occupied percentage is ~53% and the unoccupied percentage is ~47%.

unaware of it.

First I calculated the log distance in miles between the pickup and dropoff points (lat/lon), converted that numeric distance into factors, and indexed them. I then divided the data by distance and payment method.

The x axis of this graph shows the mean tip by distance and payment method, and the y axis shows the indexed distance. Each point gives the mean tip value for a particular distance and payment type.

Almost all tip payments are made through CRD and UNK. The value lies between 20 and 25 dollars for card. The tip decreases to 20 as distance increases (higher index), for both CRD and UNK.
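The grouping described above can be sketched in Python under several assumptions: the log distance is cut into equal-width bins, the bins are indexed from 1 in increasing order, and the mean tip is taken per (index, payment type). The bin width, the function name and the sample records are all hypothetical, and the original R code may have defined its factors differently.

```python
import math
from collections import defaultdict


def mean_tip_by_bin(trips, width=0.25):
    """trips: iterable of (distance_miles, payment_type, tip).
    Returns {(distance index, payment_type): mean tip}."""
    sums = defaultdict(lambda: [0.0, 0])  # (bin, payment) -> [tip total, count]
    for dist, pay, tip in trips:
        if dist is None or dist <= 0:
            continue  # the log is undefined for missing or zero distances
        b = math.floor(math.log(dist) / width)
        entry = sums[(b, pay)]
        entry[0] += tip
        entry[1] += 1
    # Index the occupied bins 1, 2, 3, ... in order of increasing distance.
    index = {b: i + 1 for i, b in enumerate(sorted({b for b, _ in sums}))}
    return {(index[b], pay): total / n for (b, pay), (total, n) in sums.items()}


# Hypothetical records: (distance in miles, payment type, tip in USD).
records = [(1.0, "CRD", 2.0), (1.1, "CRD", 4.0), (1.0, "CSH", 0.0), (10.0, "CRD", 5.0)]
tips = mean_tip_by_bin(records)
```

Replacing the tip with the toll amount in each record gives the matching summary for tolls.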

Y axis: mean toll value paid.

For distances indexed up to 20, the toll paid is zero. Above index 20 the mean toll value increases, and tolls are paid with all payment methods.

X axis: distance travelled. Y axis: total fare excluding the toll value.

The green space indicates the highest frequency of occurrence of a combination of distance and fare amount (excluding tolls).

Short-distance trips are the most frequent compared with the other frequency counts.

Future work:

This analysis covers only the first week of February, and I used local disk as my backend to produce these results. The same analysis could be run on Google BigQuery cloud storage to speed up the process.

The same code can be run on the entire month of data, given a machine with more processing power.

More interesting insights can be drawn by comparing the results from different months.

References:

Data downloading and data understanding

http://publish.illinois.edu/dbwork/open-data/

http://www.andresmh.com/nyctaxitrips/

Previous work

http://hafen.github.io/taxi/#background

R packages

http://tessera.io/docs-datadr/#key_value_pairs

http://tessera.io/docs-trelliscope/
