
Large scale data processing pipelines at trivago: a use case
2016-11-15, Sevilla, Spain
Clemens Valiente
Clemens Valiente
Lead Big Data Engineer
trivago Düsseldorf

Originally a mathematician
Studied at Uni Erlangen
At trivago for 5 years

Email: clemens.valiente@trivago.com
de.linkedin.com/in/clemensvaliente
Data driven PR and External Communication

Price information collected from the various booking websites and shown to our visitors also gives us a thorough overview of trends and the development of hotel prices. This knowledge is then used by our Content Marketing & Communication department (CMC) to write stories and articles.

The past: Data pipeline 2010-2015

(Diagram: Java Software Engineering → Business Intelligence → CMC)
The past: Data pipeline 2010-2015
Facts & Figures

Price dimensions
- Around one million hotels
- 250 booking websites
- Travellers search for up to 180 days in advance
- Data collected over five years

Restrictions
- Only single night stays
- Only prices from European visitors
- Prices cached up to 30 minutes
- One price per hotel, website and arrival date per day written to BI
- Insert ignore: the first price per key wins

Size of data
- We collected a total of 56 billion prices in those five years
- Towards the end of this pipeline, in early 2015, the average was around 100 million prices per day
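The "insert ignore" rule means a later price for the same (hotel, website, arrival date, collection day) key never overwrites an earlier one. A minimal sketch of that first-write-wins semantics in Java; the key and class names are hypothetical, for illustration only:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical key: one price per hotel, website and arrival date per collection day.
record PriceKey(long hotelId, int websiteId, String arrivalDate, String collectionDay) {}

public class FirstPriceWins {
    private final Map<PriceKey, Double> prices = new ConcurrentHashMap<>();

    /** Keeps the first price seen for a key, mirroring INSERT IGNORE semantics. */
    public void record(PriceKey key, double price) {
        prices.putIfAbsent(key, price); // later prices for the same key are ignored
    }
}
```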

Refactoring the pipeline: Requirements

- Scales with an arbitrary amount of data (future proof)
- Reliable and resilient
- Low performance impact on the Java backend
- Long-term storage of raw input data
- Fast processing of filtered and aggregated data
- Open source
- We want to log everything (see the schema sketch below):
  - More prices: length of stay, room type, breakfast info, room category, domain
  - With more information: net & gross price, city tax, resort fee, affiliate fee, VAT
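For illustration, a price log record carrying the fields listed above could look roughly like this. This is a hypothetical Java sketch of such a message, not trivago's actual schema:

```java
import java.math.BigDecimal;
import java.time.LocalDate;

// Hypothetical shape of one logged price; field names are illustrative only.
record PriceLogEntry(
        long hotelId,
        int websiteId,           // booking website / domain
        LocalDate arrivalDate,
        int lengthOfStay,        // nights, no longer restricted to single-night stays
        String roomType,
        String roomCategory,
        boolean breakfastIncluded,
        BigDecimal netPrice,
        BigDecimal grossPrice,
        BigDecimal cityTax,
        BigDecimal resortFee,
        BigDecimal affiliateFee,
        BigDecimal vat) {}
```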

Present data pipeline 2016: ingestion

(Diagram: ingestion from San Francisco, Düsseldorf and Hong Kong)
Present data pipeline 2016: processing

(Diagram: processing pipeline from Camus to CMC)
Present data pipeline 2016: facts & figures

Cluster specifications
- 51 machines
- 1.7 PB disc space, 60% used
- 3.6 TB memory in YARN
- 1440 VCores (24-32 cores per machine)

Data size (price log)
- 2.6 trillion messages collected so far
- 7 billion messages/day
- 160 TB of data

Data processing
- Camus: 30 mappers writing data in 10 minute intervals
- First aggregation/filtering stage in Hive runs in 30 minutes, with 5 days of CPU time spent
- Impala queries across >100 GB of result tables are usually done within a few seconds
Present data pipeline 2016: results after one and a half years in production

- Very reliable, barely any downtime or service interruptions
- The Java team is very happy: less load on their system
- The BI team is very happy: more data, more resources to process it
- The CMC team is very happy:
  - Faster results
  - Better quality of results due to more data
  - More detailed results
=> Shorter research phase, more and better stories
=> Fewer requests & less workload for BI
Present data pipeline 2016: use cases & status quo

Uses for price information
- Monitoring price parity in the hotel market
- Anomaly and fraud detection
- Price feed for online marketing
- Display of price development and delivering price alerts to website visitors

Other data sources and usage
- Clicklog information from our website and mobile app
- Used for marketing performance analysis, product tests, invoice generation etc.

Status quo
- Our entire business runs on and through the Kafka/Hadoop pipeline
- Almost all departments rely on data, insights and metrics delivered by Hadoop
- Most of the company could not do their job without Hadoop data
Future data pipeline 2016/2017

(Diagram: message format moving from CSV to Protobuf/Avro; ingestion via Kafka Connect or Gobblin; Kylin/HBase; stream processing with Kafka Streams and streaming SQL; CMC)
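One candidate for the move away from CSV is Avro. A minimal sketch of serialising a price record with an inline Avro schema, assuming the standard Apache Avro Java library; the schema and field names here are illustrative, not trivago's actual message format:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroPriceExample {
    // Illustrative schema only; the real price log schema is not shown in the talk.
    private static final Schema SCHEMA = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Price\",\"fields\":["
            + "{\"name\":\"hotelId\",\"type\":\"long\"},"
            + "{\"name\":\"websiteId\",\"type\":\"int\"},"
            + "{\"name\":\"arrivalDate\",\"type\":\"string\"},"
            + "{\"name\":\"grossPrice\",\"type\":\"double\"}]}");

    public static byte[] serialize(long hotelId, int websiteId,
                                   String arrivalDate, double grossPrice) throws IOException {
        GenericRecord record = new GenericData.Record(SCHEMA);
        record.put("hotelId", hotelId);
        record.put("websiteId", websiteId);
        record.put("arrivalDate", arrivalDate);
        record.put("grossPrice", grossPrice);

        // Binary-encode the record; unlike CSV, consumers decode it against a schema
        // (typically shared via a schema registry), so fields are typed and evolvable.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(SCHEMA).write(record, encoder);
        encoder.flush();
        return out.toByteArray();
    }
}
```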

Future data pipeline 2016/2017

(Diagram: Kafka Streams instances with queryable local state serving CMC*)

* https://www.confluent.io/blog/unifying-stream-processing-and-interactive-queries-in-apache-kafka/
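A minimal sketch of the idea behind the diagram, assuming a recent Kafka Streams Java API: an aggregation is materialised into a local state store that can then be queried interactively. Topic, store and application names are made up, and the example assumes the message key is the hotel id:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class PriceCountApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "price-count-example"); // made-up app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        // Count messages per hotel id from a hypothetical "prices" topic and keep the
        // result in a local, queryable state store named "prices-per-hotel".
        builder.stream("prices", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey()
               .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("prices-per-hotel"));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Interactive query against the local state store (the "local state" box in the diagram);
        // in practice, wait until the instance reaches the RUNNING state before querying.
        ReadOnlyKeyValueStore<String, Long> store = streams.store(
                StoreQueryParameters.fromNameAndType("prices-per-hotel",
                        QueryableStoreTypes.keyValueStore()));
        System.out.println("Prices seen for hotel 12345: " + store.get("12345"));

        streams.close();
    }
}
```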
Key challenges and learnings

Mastering Hadoop
- Finding your log files
- Interpreting error messages correctly
- Understanding settings and how to use them to solve problems
- Store data in wide, denormalised Hive tables in Parquet format and nested data types

Using Hadoop
- Offer easy Hadoop access to users (Impala / Hive JDBC with visualisation tools; see the sketch below)
- Educate users on how to write good code; strict guidelines and code review
- Deployment process: Jenkins deploys a Git repository with Oozie definitions and Hive scripts to HDFS

Bad parts
- HUE (the standard GUI)
- Write Oozie workflows and coordinators in XML, not through the Hue interface
- Monitoring Impala
- Still some hard-to-find bugs in Hive & Impala
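As an example of the "easy access" point above, users can reach the cluster through the Hive or Impala JDBC drivers. A minimal sketch, assuming the Hive JDBC driver is on the classpath; host, port settings, table and column names are placeholders, not trivago's actual tables:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class ImpalaQueryExample {
    public static void main(String[] args) throws SQLException {
        // Impala speaks the HiveServer2 protocol; 21050 is its default HS2 port.
        String url = "jdbc:hive2://impala-host.example.com:21050/default;auth=noSasl";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             // Hypothetical wide, denormalised price table stored as Parquet.
             ResultSet rs = stmt.executeQuery(
                     "SELECT hotel_id, AVG(gross_price) AS avg_price "
                     + "FROM price_log WHERE arrival_date = '2016-12-24' "
                     + "GROUP BY hotel_id LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getLong("hotel_id") + "\t" + rs.getDouble("avg_price"));
            }
        }
    }
}
```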
Thank you! Now it is time for questions and comments!

Clemens Valiente
Lead Big Data Engineer
trivago Düsseldorf

Originally a mathematician
Studied at Uni Erlangen
At trivago for 5 years

Email: clemens.valiente@trivago.com
de.linkedin.com/in/clemensvaliente
Resources

Gobblin: https://github.com/linkedin/gobblin
Impala connector for dplyr: https://github.com/piersharding/dplyrimpaladb
Querying Kafka Streams' local state: https://www.confluent.io/blog/unifying-stream-processing-and-interactive-queries-in-apache-kafka/
Hive on Spark: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
Parquet: https://parquet.apache.org/documentation/latest/
ProtoBuf: https://developers.google.com/protocol-buffers/