processing pipelines
at trivago: a use case
2016-11-15, Sevilla, Spain
Clemens Valiente
Lead Big Data Engineer
trivago Düsseldorf
Originally a mathematician
Studied at Uni Erlangen
At trivago for 5 years
Email: clemens.valiente@trivago.com
de.linkedin.com/in/clemensvaliente
Data driven PR and External Communication
The past: Data pipeline 2010-2015
CMC
The past: Data pipeline 2010-2015
Facts & Figures
Price dimensions
- Around one million hotels
- 250 booking websites
- Travellers search for up to 180 days in advance
- Data collected over five years

Restrictions
- Only single night stays
- Only prices from European visitors
- Prices cached up to 30 minutes
- One price per hotel, website and arrival date per day written to BI
- Insert ignore: The first price per key wins

Size of data
- We collected a total of 56 billion prices in those five years
- Towards the end of this pipeline in early 2015 the average was around 100 million prices per day
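
The "insert ignore" rule and the size figures above can be illustrated with a minimal sketch. This is not the original pipeline code: the key layout (hotel, website, arrival date, collection day) is inferred from the restriction listed above, and the function name, IDs and prices are hypothetical.

```python
from datetime import date


def insert_ignore(store, key, price):
    """Keep the first price seen for a key; ignore any later price for it."""
    if key in store:
        return False  # key already present: the earlier price wins
    store[key] = price
    return True


store = {}
# Hypothetical key: (hotel_id, website_id, arrival_date, collection_day)
key = (4711, 12, date(2015, 3, 1), date(2015, 1, 15))
first_stored = insert_ignore(store, key, 89.0)   # stored
second_stored = insert_ignore(store, key, 95.0)  # ignored, 89.0 is kept

# Rough long-run average implied by 56 billion prices in five years,
# compared with the roughly 100 million per day reached in early 2015.
average_per_day = 56_000_000_000 / (5 * 365)
```

Under these assumptions the five-year average comes out at roughly 30.7 million prices per day, so the early-2015 rate was about three times the long-run average.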
The past: Data pipeline 2010-2015
CMC
Refactoring the pipeline: Requirements
Present data pipeline 2016 - ingestion
San Francisco
Düsseldorf
Hong Kong
Present data pipeline 2016 - processing
Camus
CMC
Present data pipeline 2016 - facts & figures
Present data pipeline 2016 - results after one and a half years in production
Very reliable, barely any downtime or service interruptions of the system
Java team is very happy: less load on their system
BI team is very happy: more data, more resources to process it
CMC team is very happy:
- Faster results
- Better quality of results due to more data
- More detailed results
=> Shorter research phase, more and better stories
=> Fewer requests and less workload for BI
Present data pipeline 2016 - use cases & status quo
Uses for price information
- Monitoring price parity in hotel market
- Anomaly and fraud detection
- Price feed for online marketing
- Display of price development and delivering price alerts to website visitors

Other data sources and usage
- Clicklog information from our website and mobile app
- Used for marketing performance analysis, product tests, invoice generation etc.

Status quo
- Our entire business runs on and through the Kafka / Hadoop pipeline
- Almost all departments rely on data, insights and metrics delivered by Hadoop
- Most of the company could not do their job without Hadoop data
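
As an illustration of the first use case, price parity monitoring can be sketched as a comparison of the prices that different booking websites offer for the same hotel and arrival date. This is an assumption about what such a check could look like, not trivago's implementation; the record layout, the 5 % tolerance and the sample values are hypothetical.

```python
from collections import defaultdict


def parity_violations(prices, tolerance=0.05):
    """Return stays whose relative price spread across websites exceeds tolerance.

    prices: iterable of (hotel_id, arrival_date, website, price) tuples.
    """
    by_stay = defaultdict(dict)
    for hotel_id, arrival_date, website, price in prices:
        by_stay[(hotel_id, arrival_date)][website] = price

    violations = {}
    for stay, offers in by_stay.items():
        lowest, highest = min(offers.values()), max(offers.values())
        if lowest > 0 and (highest - lowest) / lowest > tolerance:
            violations[stay] = (lowest, highest)
    return violations


# Hypothetical sample data: hotel 1 differs by 20 %, hotel 2 by 2.5 %.
sample = [
    (1, "2016-12-24", "site_a", 100.0),
    (1, "2016-12-24", "site_b", 120.0),
    (2, "2016-12-24", "site_a", 80.0),
    (2, "2016-12-24", "site_b", 82.0),
]
result = parity_violations(sample)
```

With this sample only hotel 1 is flagged, because its spread is above the assumed tolerance.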
Future data pipeline 2016/2017
Message format: CSV, Protobuf / Avro
Kafka Connect or Gobblin
Kylin / HBase
Stream processing: Kafka Streams, Streaming SQL
CMC
Future data pipeline 2016/2017
Message format: CSV, Protobuf / Avro
Stream processing: Kafka Streams, Streaming SQL
CMC
Future data pipeline 2016/2017
CMC
Streams
local state
* https://www.confluent.io/blog/unifying-stream-processing-and-interactive-queries-in-apache-kafka/
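
The idea behind the linked post is that a stream processor keeps its results in local state and serves queries from that state directly, instead of first exporting them to an external database. The following is a plain-Python stand-in for that idea, not the Kafka Streams API; the class, the record fields and the sample values are hypothetical.

```python
class LocalPriceState:
    """In-memory stand-in for a stream processor's local state store."""

    def __init__(self):
        self._latest = {}

    def process(self, record):
        # Update the state for every incoming record: the latest price per key wins.
        key = (record["hotel_id"], record["website"])
        self._latest[key] = record["price"]

    def get(self, hotel_id, website):
        # Interactive query: read straight from the local state.
        return self._latest.get((hotel_id, website))


state = LocalPriceState()
stream = [
    {"hotel_id": 1, "website": "site_a", "price": 100.0},
    {"hotel_id": 1, "website": "site_b", "price": 110.0},
    {"hotel_id": 1, "website": "site_a", "price": 95.0},
]
for record in stream:
    state.process(record)

latest = state.get(1, "site_a")
missing = state.get(2, "site_a")
```

In this sketch a consumer such as CMC would call `get` on the running application rather than wait for a batch export.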
Key challenges and learnings
Now it is time for questions and comments!
Originally a mathematician
Studied at Uni Erlangen
At trivago for 5 years
Email: clemens.valiente@trivago.com
de.linkedin.com/in/clemensvaliente
Resources
Gobblin: https://github.com/linkedin/gobblin
Impala connector for dplyr: https://github.com/piersharding/dplyrimpaladb
Querying Kafka Streams' local state: https://www.confluent.io/blog/unifying-stream-processing-and-interactive-queries-in-apache-kafka/
Hive on Spark: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
Parquet: https://parquet.apache.org/documentation/latest/
ProtoBuf: https://developers.google.com/protocol-buffers/