@shirshanka, @tusharis
Data Protection in a Digital World
PLAYING CATCH-UP WITH INNOVATION
GDPR
LinkedIn’s Vision
OUR VISION
metric scripts
Business facing
decision making
500M members
10M companies
10M jobs
11B endorsements
29K schools
The LinkedIn Privacy Paradox
MEMBER PRIVACY <> MEMBER DISCOVERY
Example
Member Experiences
production code
Member Data
System of Intelligence
Business Decisions
Data Democracy
ALL THE DATA, ALL THE TIME
At Rest
Scale: O(10) clusters, ~10K machines, ~100 PB
In Motion
Scale: O(10) clusters, ~2.3 trillion messages, ~450 TB
Data Integration
REST, JDBC, SFTP
Azure Blob, Data Lake Storage
At Rest
In Motion
Apache Gobblin: Simplifying Data Integration
SFTP, REST, JDBC
Azure Blob, Data Lake Storage, SFTP
@LinkedIn
Hundreds of TB per day
Thousands of datasets
~30 different source systems
80%+ of data ingest
Open source @ https://gobblin.apache.org/
Stream + Batch
Adopted by LinkedIn, Intel, PayPal, Apple, IBM, Swisscom, Prezi, AppLift, NerdWallet and many more…
Less Data
REQUIREMENTS
Engineering:
Need the ability to delete a specific subset of, or all of, the data associated with a specific LinkedIn member from all our data systems
Data Deletion
IMPLICATIONS FOR HADOOP
Source
Work Unit -> Task: Extract -> Convert -> Quality -> Write
Work Unit -> Task: Extract -> Convert -> Quality -> Write
Data Publish
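The pipeline stages above can be sketched in a few lines of Python. This is an illustrative toy, not Gobblin's actual (Java) API: `WorkUnit`, `run_task`, `run_job`, and `parse_id` are hypothetical names, and staging and publishing are modelled with plain dictionaries.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = dict


@dataclass
class WorkUnit:
    """A slice of the source handled by one task (hypothetical shape)."""
    name: str
    records: List[Record]


def run_task(unit: WorkUnit,
             convert: Callable[[Record], Record],
             is_valid: Callable[[Record], bool]) -> List[Record]:
    """Extract -> Convert -> Quality -> Write for a single work unit."""
    staged = []
    for raw in unit.records:          # Extract: read each raw record
        record = convert(raw)         # Convert: reshape the record
        if is_valid(record):          # Quality: drop records that fail the check
            staged.append(record)     # Write: append to a staging list
    return staged


def run_job(units: List[WorkUnit],
            convert: Callable[[Record], Record],
            is_valid: Callable[[Record], bool]) -> Dict[str, List[Record]]:
    """Run one task per work unit, then publish all staged output together."""
    staged = {unit.name: run_task(unit, convert, is_valid) for unit in units}
    return dict(staged)               # Publish: expose the staged output


def parse_id(raw: Record) -> Record:
    """Example converter: turn a string id into an int, or None if malformed."""
    text = raw["id"]
    return {"id": int(text) if text.isdigit() else None}


units = [
    WorkUnit("part-0", [{"id": "1"}, {"id": "x"}]),
    WorkUnit("part-1", [{"id": "2"}]),
]
published = run_job(units, parse_id, lambda r: r["id"] is not None)
```

Here the malformed record `{"id": "x"}` is dropped at the quality step, and nothing is visible to readers until the publish step runs.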
Gobblin: Extending for Purge
HDFS -> Work Unit -> Task: Extract -> Convert -> Quality -> Write -> Data Publish -> HDFS
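A minimal sketch of what a purge pass over file-based data might look like, assuming each file is rewritten whole when it contains a record for a member who requested deletion. The function names, the `memberId` key, and the in-memory `files` dictionary are assumptions made for illustration; they are not Gobblin's purge implementation.

```python
from typing import Dict, List, Set, Tuple

Record = dict


def purge_file(records: List[Record],
               deleted_member_ids: Set[int]) -> Tuple[List[Record], bool]:
    """Return the surviving records and whether the file had to be rewritten."""
    kept = [r for r in records if r["memberId"] not in deleted_member_ids]
    if len(kept) == len(records):
        return records, False         # no match: the file is left untouched
    return kept, True                 # at least one match: rewrite the file


def purge_dataset(files: Dict[str, List[Record]],
                  deleted_member_ids: Set[int]
                  ) -> Tuple[Dict[str, List[Record]], List[str]]:
    """Scan every file and rewrite only those that contain deleted members."""
    purged = {}
    rewritten = []
    for path, records in files.items():
        kept, changed = purge_file(records, deleted_member_ids)
        purged[path] = kept
        if changed:
            rewritten.append(path)
    return purged, rewritten


files = {
    "part-0": [{"memberId": 1}, {"memberId": 2}],
    "part-1": [{"memberId": 3}],
}
purged, rewritten = purge_dataset(files, deleted_member_ids={2})
```

Every file is scanned, but only `part-0` is rewritten. Removing a single record still means writing out all the other records in that file again.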
Gobblin: Data Lifecycle Management at Scale
STATUS AND CHALLENGES
Status
Number of datasets: many thousands
Amount of data scanned for purge: XXX TB/day
Challenges
Immutable Storage Formats + Right to Erasure = Unhappy Disks
“Widespread implementation will surely lead to innovation in these formats!”
The Data Paradox
DATA DEMOCRACY <> DATA PROTECTION
Where is dataset X?
How did it get created?
ML-driven refinements
The Data Paradox
DATA DEMOCRACY <> DATA PROTECTION
At Rest
In Motion
Challenge for Infrastructure Providers
HARD TO CHANGE ANYTHING UNDERNEATH!
Hard to move to better formats without breaking everyone or copying data twice
My Raw Data
Semantic Challenges
HARD TO CHANGE ANYTHING UPSTREAM!
Have to change data processing logic everywhere!
My Raw Data
We need “microservices” for Data
AN API TO MANAGE EVOLUTION
My Data API
My Raw Data
We built Dali to solve this
A DATA ACCESS LAYER FOR LINKEDIN
Logical FileSystem
Dali: Implementation Details in Context
Processing Engine (MR, Spark)
Data Catalog
Dali FileSystem
View Def + UDFs
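The idea of a logical view (a view definition plus UDFs) sitting between readers and raw data can be sketched as follows. This is a toy under stated assumptions: `ViewCatalog`, `load_profiles_v1`, and `full_name` are hypothetical names and do not reflect Dali's real interfaces.

```python
from typing import Callable, Dict, List, Tuple

Row = dict


class ViewCatalog:
    """Hypothetical catalog mapping a logical view name to its definition."""

    def __init__(self) -> None:
        self._views: Dict[str, Tuple[Callable[[], List[Row]],
                                     Callable[[Row], Row]]] = {}

    def register(self, name: str,
                 load_raw: Callable[[], List[Row]],
                 transform: Callable[[Row], Row]) -> None:
        """Bind a view name to a raw-data loader and a row-level function."""
        self._views[name] = (load_raw, transform)

    def read(self, name: str) -> List[Row]:
        """Readers ask for a view by name and never see the raw layout."""
        load_raw, transform = self._views[name]
        return [transform(row) for row in load_raw()]


def load_profiles_v1() -> List[Row]:
    """Stand-in for reading the raw dataset in its current physical layout."""
    return [{"member_id": 1, "fname": "Ada", "lname": "Lovelace"}]


def full_name(row: Row) -> Row:
    """Plays the role of a UDF that maps raw columns to the view's schema."""
    return {"memberId": row["member_id"],
            "fullName": f'{row["fname"]} {row["lname"]}'}


catalog = ViewCatalog()
catalog.register("profile_view", load_profiles_v1, full_name)
rows = catalog.read("profile_view")
```

If the raw layout or storage format changes, only the registered loader and function need to change; code that calls `catalog.read("profile_view")` stays the same.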
Different Types
Simple to Complex
Basic Restrictions
Access to dataset based on business need
Privacy by Default
Analysts shouldn’t get access to raw PII by default
Consent-based Access
Access to certain data elements is available only if the member has consented to that particular use case
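The three kinds of restriction above can be illustrated with a small access check. Everything here is an assumption made for the sketch: the `Caller` type, the `PII_FIELDS` and `CONSENT_GATED_FIELDS` tags, and the use-case names are hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass
from typing import Dict, Set

PII_FIELDS = {"firstName", "lastName"}     # assumed PII tags from metadata
CONSENT_GATED_FIELDS = {"positions"}       # assumed fields that need consent


@dataclass
class Caller:
    datasets: Set[str]        # datasets granted on the basis of business need
    use_case: str             # what the caller intends to do with the data
    pii_access: bool = False  # raw PII is hidden unless explicitly granted


def read_record(dataset: str, record: Dict, caller: Caller,
                member_consents: Set[str]) -> Dict:
    """Apply the three restriction types, from simple to complex."""
    # Basic restriction: no access to the dataset without a business need.
    if dataset not in caller.datasets:
        raise PermissionError(f"no access to {dataset}")
    visible = {}
    for name, value in record.items():
        if name in PII_FIELDS and not caller.pii_access:
            visible[name] = None      # privacy by default: mask raw PII
        elif (name in CONSENT_GATED_FIELDS
              and caller.use_case not in member_consents):
            visible[name] = None      # consent-based: member has not opted in
        else:
            visible[name] = value
    return visible


record = {"memberId": 7, "firstName": "Ada", "positions": ["Engineer"]}
analyst = Caller(datasets={"MemberProfile"}, use_case="ads")
without_consent = read_record("MemberProfile", record, analyst, {"recommendations"})
with_consent = read_record("MemberProfile", record, analyst, {"ads"})
```

In both reads the first name stays masked because the analyst has no PII grant; `positions` becomes visible only when the member's consents include the caller's use case.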
Solving for Compliant Access
STEP 1: DATA + METADATA
PROFILE DATA
Raw Dataset: MemberProfile
Metadata
Schema = {
  int memberId
  String firstName
  String lastName
  Position[] positions
  educationHistory[] educationHistory
  …
}
Privacy Preferences
STEP 2: A MEMBER’S PREFERENCES
Privacy Preferences
A BITMAP DATASET: ONE PER MEMBER
Member Privacy Preferences
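A per-member preference bitmap can be sketched with one bit per use case. The bit assignments, use-case names, and the `preferences` mapping below are hypothetical; the slides do not specify the actual encoding.

```python
# One bit per use case (hypothetical assignments).
ADS = 1 << 0
RECOMMENDATIONS = 1 << 1
RESEARCH = 1 << 2

# One bitmap per member, keyed by member id (illustrative data).
preferences = {
    101: ADS | RESEARCH,   # member 101 consented to ads and research
    102: 0,                # member 102 consented to nothing
}


def has_consented(member_id: int, use_case_bit: int) -> bool:
    """Check a single bit; members with no bitmap are treated as no consent."""
    bitmap = preferences.get(member_id, 0)
    return bool(bitmap & use_case_bit)


def grant(member_id: int, use_case_bit: int) -> None:
    """Set the bit for a use case when a member opts in."""
    preferences[member_id] = preferences.get(member_id, 0) | use_case_bit


def revoke(member_id: int, use_case_bit: int) -> None:
    """Clear the bit for a use case when a member opts out."""
    preferences[member_id] = preferences.get(member_id, 0) & ~use_case_bit
```

A bitmap keeps each member's preferences compact and makes the consent check a single bitwise AND, which matters when the check runs for every row of a large dataset.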
Solving for Compliant Access With Dali
Member’s Delete Requests
Use Case = Purge
Dali Reader Library
Metadata
Raw Dataset -> Purged Dataset
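A rough sketch of the flow in this diagram: a reader takes the raw dataset, its metadata, the set of delete requests, and a use case, and returns the purged rows. The function name `dali_read`, the `member_id_field` metadata key, and the `"purge"` use-case string are assumptions for illustration, not the real reader library's interface.

```python
from typing import Dict, List, Set

Row = dict


def dali_read(raw_rows: List[Row], metadata: Dict[str, str],
              use_case: str, delete_requests: Set[int]) -> List[Row]:
    """Return the rows a given use case is allowed to see."""
    # Metadata tells the reader which column identifies the member, so the
    # same logic works for any dataset that is annotated this way.
    key = metadata["member_id_field"]
    if use_case == "purge":
        # Drop every row that belongs to a member who asked for deletion.
        return [row for row in raw_rows if row[key] not in delete_requests]
    # Other use cases would apply their own policy; not covered in this sketch.
    raise NotImplementedError(f"unsupported use case: {use_case}")


raw_dataset = [
    {"memberId": 1, "firstName": "Ada"},
    {"memberId": 2, "firstName": "Grace"},
]
metadata = {"member_id_field": "memberId"}
purged_dataset = dali_read(raw_dataset, metadata, "purge", delete_requests={2})
```

Because the filtering is driven by metadata rather than dataset-specific code, the purge logic lives in one place instead of in every job that reads the data.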
The Data Paradox
DATA DEMOCRACY <> DATA PROTECTION
METADATA
WhereHows*