You know your data is big – you found Hadoop. What implications must you consider when working at this scale? This lecture addresses common challenges and general best practices for scaling with your data.
Check http://www.cloudera.com/hadoop-training-basic for training videos.
• Max data per computer: 12 TB
• Data processed by Google every month: 400 PB … in 2007
• Average job size: 180 GB
• Time to read 180 GB sequentially off a single drive: 45 minutes (sanity-checked below)
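A quick check on that last figure (the drive throughput here is an assumed 2007-era value, not from the lecture): at roughly 68 MB/s of sustained sequential read speed,

180 GB ≈ 180,000 MB ÷ 68 MB/s ≈ 2,650 s ≈ 44 minutes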
Sharing is Slow
• Grid computing: not new – MPI, PVM, Condor…
• Grid focus: distribute the workload
  – NetApp filer or other SAN drives many compute nodes
• Modern focus: distribute the data
  – Reading 100 GB off a single filer would leave nodes starved (a rough worked example follows) – just store data locally
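To see why the filer starves its clients, assume (illustratively; these numbers are not from the lecture) a filer with a 10 Gbit/s uplink, i.e. about 1.25 GB/s of total bandwidth. Shipping 100 GB through it takes at least ~80 seconds no matter how many nodes are waiting, and the shared pipe only gets more congested as nodes are added. If the same 100 GB is instead pre-split across 100 local disks, each node reads its 1 GB share at ~68 MB/s in about 15 seconds, all in parallel.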
Sharing is Tricky
• Exchanging data requires synchronization
  – Deadlock becomes a problem (see the sketch below)
• Finite bandwidth is available
  – Distributed systems can “drown themselves”
  – Failovers can cause cascading failure
• Temporal dependencies are complicated
  – Difficult to reason about partial restarts
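The deadlock hazard is the same one familiar from multithreaded code, just spread across machines. A minimal single-process Java sketch of the classic lock-ordering deadlock (illustrative only; not from the lecture):

public class DeadlockSketch {
    private static final Object lockA = new Object();
    private static final Object lockB = new Object();

    public static void main(String[] args) {
        // Thread 1 takes A then B; thread 2 takes B then A.
        new Thread(() -> {
            synchronized (lockA) {
                pause();                   // widen the race window
                synchronized (lockB) { }   // blocks forever if thread 2 holds B
            }
        }).start();
        new Thread(() -> {
            synchronized (lockB) {
                pause();
                synchronized (lockA) { }   // blocks forever if thread 1 holds A
            }
        }).start();
    }

    private static void pause() {
        try { Thread.sleep(100); } catch (InterruptedException ignored) { }
    }
}

With this timing, each thread ends up holding the lock the other needs and neither can proceed; distributed systems add the same risk to every cross-node data exchange.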
Reliability Demands
• Scalability
  – Adding load to the system should cause a graceful decline in performance, not outright failure
  – Increasing resources should support a proportional increase in load capacity
A Radical Way Out…
• Nodes talk to each other as little as possible – maybe never
  – “Shared nothing” architecture
• Programmers should not be allowed to communicate between nodes explicitly
• Data is spread across the machines in advance; computation happens where the data is stored.
Motivations for MapReduce
• Data processing: > 1 TB
• Massively parallel (hundreds or thousands of CPUs)
• Must be easy to use
  – High-level applications written in MapReduce (see the word-count sketch below)
  – Programmers don’t worry about socket(), etc.
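To give a flavor of that high-level style, here is a minimal word count in Hadoop’s Java MapReduce API (a standard introductory example, not code from this lecture):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every word in its input split.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reducer: sums the per-word counts produced by all mappers.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        context.write(word, new IntWritable(sum));
    }
}

Nothing here opens a socket or spawns a thread; splitting the input, shuffling intermediate pairs, and retrying failures are all the framework’s job.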
Locality
• Master program divvies up tasks based on the location of data: it tries to run map tasks on the same machine as the physical file data, or at least on the same rack (sketched below)
• Map task inputs are divided into 64-128 MB blocks: the same size as filesystem chunks
  – Process components of a single file in parallel
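The scheduler’s preference order can be pictured roughly as follows (a simplified sketch of the idea, not Hadoop’s actual scheduler; all names are made up):

import java.util.List;
import java.util.Map;

public class LocalityScheduler {
    /**
     * Pick a worker for a map task whose input block is replicated on
     * replicaHosts. Assumes every host appears in the rackOf map.
     */
    static String assign(List<String> idleWorkers,
                         List<String> replicaHosts,
                         Map<String, String> rackOf) {
        // 1. Node-local: a worker that already stores a replica of the block.
        for (String w : idleWorkers) {
            if (replicaHosts.contains(w)) return w;
        }
        // 2. Rack-local: a worker in the same rack as some replica.
        for (String w : idleWorkers) {
            for (String h : replicaHosts) {
                if (rackOf.get(w).equals(rackOf.get(h))) return w;
            }
        }
        // 3. Otherwise any idle worker; the block must cross racks.
        return idleWorkers.isEmpty() ? null : idleWorkers.get(0);
    }
}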
Fault Tolerance
• Tasks designed for independence
• Master detects worker failures
• Master re-executes tasks that fail while in progress (see the sketch below)
• Restarting one task does not require communication with other tasks
• Data is replicated to increase availability and durability
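A toy version of the master’s side of this, in Java (the heartbeat mechanism and timeout are assumptions for illustration): workers ping periodically, and when one goes silent its in-progress task is simply put back on the queue.

import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

public class MasterSketch {
    static final long HEARTBEAT_TIMEOUT_MS = 10_000;  // assumed value

    final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();
    final Map<String, String> runningTaskOf = new ConcurrentHashMap<>();
    final Queue<String> pendingTasks = new ConcurrentLinkedQueue<>();

    /** Called whenever a worker pings the master. */
    void onHeartbeat(String worker) {
        lastHeartbeat.put(worker, System.currentTimeMillis());
    }

    /** Run periodically: declare silent workers dead, requeue their tasks. */
    void reapDeadWorkers() {
        long now = System.currentTimeMillis();
        for (String worker : lastHeartbeat.keySet()) {
            if (now - lastHeartbeat.get(worker) > HEARTBEAT_TIMEOUT_MS) {
                lastHeartbeat.remove(worker);
                String task = runningTaskOf.remove(worker);
                // Tasks are independent, so requeueing is all that is needed;
                // no other task has to be notified or restarted.
                if (task != null) pendingTasks.add(task);
            }
        }
    }
}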
Optimizations
• No reduce can start until map is complete:
  – A single slow disk controller can rate-limit the whole process
• Master redundantly executes “slow-moving” map tasks; uses results of the first copy to finish (see the configuration sketch below)
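In Hadoop this shows up as “speculative execution”, which can be toggled per job. A sketch of enabling it (the property names below are from later Hadoop 2.x+ releases, which may differ from the version this lecture targeted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculativeConfig {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // Launch backup copies of straggling map tasks;
        // whichever copy finishes first wins, the other is killed.
        conf.setBoolean("mapreduce.map.speculative", true);
        // Reduces can be speculated too, but backups are costlier there.
        conf.setBoolean("mapreduce.reduce.speculative", false);
        return Job.getInstance(conf, "speculative-demo");
    }
}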
Conclusions
• Computing with big datasets is a fundamentally different challenge than doing “big compute” over a small dataset
• New ways of thinking about problems are needed
  – New tools provide the means to capture this
  – MapReduce, HDFS, etc. can help