
Large scale data processing pipelines at trivago: a use case
2016-11-15, Sevilla, Spain
Clemens Valiente
Clemens Valiente
Lead Big Data Engineer
trivago Düsseldorf

Originally a mathematician
Studied at Uni Erlangen
At trivago for 5 years

Email: clemens.valiente@trivago.com
de.linkedin.com/in/clemensvaliente
Data driven PR and External Communication

Price information collected from the various booking websites and shown to our visitors also gives us a thorough overview of trends and the development of hotel prices. This knowledge is then used by our Content Marketing & Communication department (CMC) to write stories and articles.

The past: Data pipeline 2010-2015

(Diagram: Java Software Engineering → Business Intelligence → CMC)
The past: Data pipeline 2010-2015
Facts & Figures

Price dimensions
- Around one million hotels
- 250 booking websites
- Travellers search for up to 180 days in advance
- Data collected over five years

Restrictions
- Only single night stays
- Only prices from European visitors
- Prices cached up to 30 minutes
- One price per hotel, website and arrival date per day written to BI
- Insert ignore: the first price per key wins

Size of data
- We collected a total of 56 billion prices in those five years
- Towards the end of this pipeline, in early 2015, the average was around 100 million prices per day
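The "insert ignore" rule means a later price for the same (hotel, website, arrival date, collection day) key never overwrites an earlier one. A minimal sketch of that first-write-wins semantics in Java; the key and class names are hypothetical, for illustration only:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical key: one price per hotel, website and arrival date per collection day.
record PriceKey(long hotelId, int websiteId, String arrivalDate, String collectionDay) {}

public class FirstPriceWins {
    private final Map<PriceKey, Double> prices = new ConcurrentHashMap<>();

    /** Keeps the first price seen for a key, mirroring INSERT IGNORE semantics. */
    public void record(PriceKey key, double price) {
        prices.putIfAbsent(key, price); // later prices for the same key are ignored
    }
}
```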

Refactoring the pipeline: Requirements

- Scales with an arbitrary amount of data (future proof)
- Reliable and resilient
- Low performance impact on the Java backend
- Long-term storage of raw input data
- Fast processing of filtered and aggregated data
- Open source
- We want to log everything (see the schema sketch below):
  - More prices: length of stay, room type, breakfast info, room category, domain
  - With more information: net & gross price, city tax, resort fee, affiliate fee, VAT
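For illustration, a price log record carrying the fields listed above could look roughly like this. This is a hypothetical Java sketch of such a message, not trivago's actual schema:

```java
import java.math.BigDecimal;
import java.time.LocalDate;

// Hypothetical shape of one logged price; field names are illustrative only.
record PriceLogEntry(
        long hotelId,
        int websiteId,           // booking website / domain
        LocalDate arrivalDate,
        int lengthOfStay,        // nights, no longer restricted to single-night stays
        String roomType,
        String roomCategory,
        boolean breakfastIncluded,
        BigDecimal netPrice,
        BigDecimal grossPrice,
        BigDecimal cityTax,
        BigDecimal resortFee,
        BigDecimal affiliateFee,
        BigDecimal vat) {}
```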

Present data pipeline 2016: ingestion

(Diagram: ingestion from San Francisco, Düsseldorf and Hong Kong)
Present data pipeline 2016: processing

(Diagram: processing pipeline from Camus to CMC)
Present data pipeline 2016: facts & figures

Cluster specifications
- 51 machines
- 1.7 PB disc space, 60% used
- 3.6 TB memory in YARN
- 1440 VCores (24-32 cores per machine)

Data size (price log)
- 2.6 trillion messages collected so far
- 7 billion messages/day
- 160 TB of data

Data processing
- Camus: 30 mappers writing data in 10 minute intervals
- First aggregation/filtering stage in Hive runs in 30 minutes, with 5 days of CPU time spent
- Impala queries across >100 GB of result tables are usually done within a few seconds
Present data pipeline 2016: results after one and a half years in production

- Very reliable, barely any downtime or service interruptions
- The Java team is very happy: less load on their system
- The BI team is very happy: more data, more resources to process it
- The CMC team is very happy:
  - Faster results
  - Better quality of results due to more data
  - More detailed results
=> Shorter research phase, more and better stories
=> Fewer requests & less workload for BI
Present data pipeline 2016: use cases & status quo

Uses for price information
- Monitoring price parity in the hotel market
- Anomaly and fraud detection
- Price feed for online marketing
- Display of price development and delivering price alerts to website visitors

Other data sources and usage
- Clicklog information from our website and mobile app
- Used for marketing performance analysis, product tests, invoice generation etc.

Status quo
- Our entire business runs on and through the Kafka/Hadoop pipeline
- Almost all departments rely on data, insights and metrics delivered by Hadoop
- Most of the company could not do their job without Hadoop data
Future data pipeline 2016/2017

(Diagram: message format moving from CSV to Protobuf/Avro; ingestion via Kafka Connect or Gobblin; Kylin/HBase; stream processing with Kafka Streams and streaming SQL; CMC)
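One candidate for the move away from CSV is Avro. A minimal sketch of serialising a price record with an inline Avro schema, assuming the standard Apache Avro Java library; the schema and field names here are illustrative, not trivago's actual message format:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroPriceExample {
    // Illustrative schema only; the real price log schema is not shown in the talk.
    private static final Schema SCHEMA = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Price\",\"fields\":["
            + "{\"name\":\"hotelId\",\"type\":\"long\"},"
            + "{\"name\":\"websiteId\",\"type\":\"int\"},"
            + "{\"name\":\"arrivalDate\",\"type\":\"string\"},"
            + "{\"name\":\"grossPrice\",\"type\":\"double\"}]}");

    public static byte[] serialize(long hotelId, int websiteId,
                                   String arrivalDate, double grossPrice) throws IOException {
        GenericRecord record = new GenericData.Record(SCHEMA);
        record.put("hotelId", hotelId);
        record.put("websiteId", websiteId);
        record.put("arrivalDate", arrivalDate);
        record.put("grossPrice", grossPrice);

        // Binary-encode the record; unlike CSV, consumers decode it against a schema
        // (typically shared via a schema registry), so fields are typed and evolvable.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(SCHEMA).write(record, encoder);
        encoder.flush();
        return out.toByteArray();
    }
}
```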

Future data pipeline 2016/2017

(Diagram: Kafka Streams instances with queryable local state serving CMC*)

* https://www.confluent.io/blog/unifying-stream-processing-and-interactive-queries-in-apache-kafka/
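A minimal sketch of the idea behind the diagram, assuming a recent Kafka Streams Java API: an aggregation is materialised into a local state store that can then be queried interactively. Topic, store and application names are made up, and the example assumes the message key is the hotel id:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class PriceCountApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "price-count-example"); // made-up app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        // Count messages per hotel id from a hypothetical "prices" topic and keep the
        // result in a local, queryable state store named "prices-per-hotel".
        builder.stream("prices", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey()
               .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("prices-per-hotel"));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Interactive query against the local state store (the "local state" box in the diagram);
        // in practice, wait until the instance reaches the RUNNING state before querying.
        ReadOnlyKeyValueStore<String, Long> store = streams.store(
                StoreQueryParameters.fromNameAndType("prices-per-hotel",
                        QueryableStoreTypes.keyValueStore()));
        System.out.println("Prices seen for hotel 12345: " + store.get("12345"));

        streams.close();
    }
}
```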
Key challenges and learnings

Mastering Hadoop
- Finding your log files
- Interpreting error messages correctly
- Understanding settings and how to use them to solve problems
- Store data in wide, denormalised Hive tables in Parquet format and nested data types

Using Hadoop
- Offer easy Hadoop access to users (Impala / Hive JDBC with visualisation tools; see the sketch below)
- Educate users on how to write good code; strict guidelines and code review
- Deployment process: Jenkins deploys a Git repository with Oozie definitions and Hive scripts to HDFS

Bad parts
- HUE (the standard GUI)
- Write Oozie workflows and coordinators in XML, not through the Hue interface
- Monitoring Impala
- Still some hard-to-find bugs in Hive & Impala
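As an example of the "easy access" point above, users can reach the cluster through the Hive or Impala JDBC drivers. A minimal sketch, assuming the Hive JDBC driver is on the classpath; host, port settings, table and column names are placeholders, not trivago's actual tables:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class ImpalaQueryExample {
    public static void main(String[] args) throws SQLException {
        // Impala speaks the HiveServer2 protocol; 21050 is its default HS2 port.
        String url = "jdbc:hive2://impala-host.example.com:21050/default;auth=noSasl";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             // Hypothetical wide, denormalised price table stored as Parquet.
             ResultSet rs = stmt.executeQuery(
                     "SELECT hotel_id, AVG(gross_price) AS avg_price "
                     + "FROM price_log WHERE arrival_date = '2016-12-24' "
                     + "GROUP BY hotel_id LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getLong("hotel_id") + "\t" + rs.getDouble("avg_price"));
            }
        }
    }
}
```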
Thank you! Now it is time for questions and comments!

Clemens Valiente
Lead Big Data Engineer
trivago Düsseldorf

Originally a mathematician
Studied at Uni Erlangen
At trivago for 5 years

Email: clemens.valiente@trivago.com
de.linkedin.com/in/clemensvaliente
Resources

Gobblin: https://github.com/linkedin/gobblin
Impala connector for dplyr: https://github.com/piersharding/dplyrimpaladb
Querying Kafka Streams' local state: https://www.confluent.io/blog/unifying-stream-processing-and-interactive-queries-in-apache-kafka/
Hive on Spark: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
Parquet: https://parquet.apache.org/documentation/latest/
ProtoBuf: https://developers.google.com/protocol-buffers/