Tamingthecompliancebeast Stratanyc 171001162525 PDF

Taming the ever-evolving Compliance Beast:
Lessons learnt at LinkedIn
Shirshanka Das, Principal Staff Engineer, LinkedIn

Tushar Shanbhag, Head of Data Products, LinkedIn
Sept 28, 2017
@shirshanka, @tusharis
Data Protection in a Digital World
PLAYING CATCH-UP WITH INNOVATION
GDPR
LinkedIn’s Vision
OUR VISION
Create economic opportunity for every member of the global workforce
500M members, 10M companies, 10M jobs, 11B endorsements, 29K schools
The LinkedIn Privacy Paradox
MEMBER PRIVACY <> MEMBER DISCOVERY
“On one hand, the company has 500+ million members trusting the company to protect highly sensitive data. On the other hand, one only joins the largest professional network on the Internet because they want to be found!”
Kalinda Raina, Head of Global Privacy, LinkedIn
Members First is a Core Value for LinkedIn
MEMBER PRIVACY WHILE DELIVERING MEMBER VALUE
Example
Member value is proportional to knowledge

Member privacy is paramount for LinkedIn
We strive to maintain this fine balance
Well-connected: get relevance right.
Few connections: give them inventory.
Data Is the Lifeblood of LinkedIn
MEMBER EXPERIENCES + BUSINESS DECISIONS
Member Experiences
production code
Member Data
System of Intelligence
Business Decisions
Data Democracy
ALL THE DATA, ALL THE TIME
LinkedIn Data Science
We needed data democracy to deliver member value
I want to analyze as much data as possible so my models are accurate
I want to discover data that’s needed for my analysis as fast as possible
I want to access that data as quickly as possible for my analysis
Data Protection
STORE, PROCESS, DELETE, …
LinkedIn Members
Need to Ensure Member Privacy
I want my personal data to be stored only where needed and not propagated unnecessarily
I want my personal data to be deleted when I close my account or request deletion
I want my personal data to only be processed if essential and only if I consent
The Data Paradox
DATA DEMOCRACY <> DATA PROTECTION
More Data <> Less Data
Discover Data <> Discover Violations
Easy Access <> Restricted Access

LinkedIn’s Data Ecosystem
The Data Paradox
More Data <> Less Data

Data Hubs at LinkedIn
Scale
In Motion: O(10) clusters, ~2.3 Trillion messages, ~450 TB
At Rest: O(10) clusters, ~10K machines, ~100 PB
Data Integration
[Diagram: SFTP, JDBC, REST, Azure Blob, Data Lake Storage connected to the At Rest and In Motion data hubs]
Apache Gobblin: Simplifying Data Integration
[Diagram: SFTP, JDBC, REST, Azure Blob, Data Lake Storage]
Open source @ https://gobblin.apache.org/
Stream + Batch
Adopted by LinkedIn, Intel, PayPal, Apple, IBM, Swisscom, Prezi, AppLift, NerdWallet and many more…
@LinkedIn
Hundreds of TB per day
Thousands of datasets
~30 different source systems
80%+ of data ingest
Less Data
REQUIREMENTS
Legal: Right to Erasure or Right to be Forgotten
“Delete all my personal data without undue delay when it is no longer necessary / when consent has been withdrawn”
Engineering:
Need the ability to delete some specific subset or all data associated with a specific LinkedIn member from all our data systems
Data Deletion
IMPLICATIONS FOR HADOOP
Understand HDFS data: organization, formats, …
Cycle asynchronously, within an SLA, deleting records, without affecting running jobs
Quarantine exceptional records for manual triage
Can scale to processing hundreds of PB of data
Challenges
A lot of data, different formats

Gobblin: The Logical Pipeline
Source
Work Unit -> Task: Extract -> Convert -> Quality -> Write
Work Unit -> Task: Extract -> Convert -> Quality -> Write
Work Unit -> Task: Extract -> Convert -> Quality -> Write
Data Publish
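The logical pipeline on this slide (a source split into work units, each handled by a task that extracts, converts, quality-checks and writes, followed by a publish step) can be illustrated with a small sketch. This is not Gobblin's API: the class and function names below (WorkUnit, Task, publish) are hypothetical and only mirror the stage names on the slide.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class WorkUnit:
    """Hypothetical slice of a source, e.g. one file or partition."""
    name: str
    records: List[dict]


@dataclass
class Task:
    """Runs the per-record stages for a single work unit."""
    convert: Callable[[dict], dict]
    quality_check: Callable[[dict], bool]
    staged: List[dict] = field(default_factory=list)

    def run(self, unit: WorkUnit) -> List[dict]:
        for record in unit.records:            # Extract
            converted = self.convert(record)   # Convert
            if self.quality_check(converted):  # Quality
                self.staged.append(converted)  # Write (to staging)
        return self.staged


def publish(staged_outputs: Iterable[List[dict]]) -> List[dict]:
    """Publish: expose the staged output of all tasks together."""
    published: List[dict] = []
    for output in staged_outputs:
        published.extend(output)
    return published


units = [
    WorkUnit("part-0", [{"id": 1, "name": " Ann "}, {"id": None, "name": "x"}]),
    WorkUnit("part-1", [{"id": 2, "name": "Bob"}]),
]
outputs = [
    Task(
        convert=lambda r: {**r, "name": r["name"].strip()},
        quality_check=lambda r: r["id"] is not None,
    ).run(unit)
    for unit in units
]
result = publish(outputs)
```

Each work unit gets its own task, so tasks can run independently; only the publish step sees all outputs at once.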
Gobblin: Extending for Purge
Member’s Delete Requests
If needs purge then drop, else continue
HDFS -> Work Unit -> Task: Extract -> Convert -> Quality -> Write -> Data Publish -> HDFS
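The purge rule on this slide ("if needs purge then drop, else continue") can be sketched as a record filter, combined with the earlier requirement to quarantine exceptional records for manual triage. The function name purge_records, the memberId field and the quarantine rule (a record whose member id is missing or not an integer) are assumptions made for illustration, not the actual Gobblin purger.

```python
from typing import Iterable, List, Set, Tuple


def purge_records(
    records: Iterable[dict],
    delete_requests: Set[int],
    id_field: str = "memberId",
) -> Tuple[List[dict], List[dict]]:
    """Split records into (kept, quarantined); purged records are dropped."""
    kept: List[dict] = []
    quarantined: List[dict] = []
    for record in records:
        member_id = record.get(id_field)
        if not isinstance(member_id, int):
            # Cannot decide safely: set aside for manual triage.
            quarantined.append(record)
        elif member_id in delete_requests:
            # Needs purge: drop the record.
            continue
        else:
            kept.append(record)
    return kept, quarantined


records = [
    {"memberId": 1, "event": "view"},
    {"memberId": 2, "event": "click"},
    {"memberId": "??", "event": "view"},
]
kept, quarantined = purge_records(records, delete_requests={2})
```

Writing the kept records to a new location and publishing them afterwards is one way to avoid disturbing jobs that are still reading the old files.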
Gobblin: Data Lifecycle Management at Scale
STATUS AND CHALLENGES
Status
Number of datasets: many thousands
Amount of data scanned for purge: XXX TB/day
Challenges
Immutable Storage Formats + Right to Erasure = Unhappy Disks
“Widespread implementation will surely lead to innovation in these formats!”
The Data Paradox
More Data DATA LIFECYCLE MANAGEMENT Less Data

The Data Paradox

WhereHows
FIND DATA, NAVIGATE RELATIONSHIPS
Where is dataset X?
How did it get created?
Data Discovery
Metadata based Search Experience for Data Scientists
Usage : In production since 2014
Users : Data Scientists, Product Engineers
Use Cases: Discovery, Impact Analysis
Open source @ github.com/linkedin/wherehows

WhereHows
SEARCH SCREENSHOTS
WhereHows
LINEAGE SCREENSHOTS
Discovering Violations
ANSWERING HARDER QUESTIONS
Use Cases
More than just Discovery
Which datasets at LinkedIn contain PII or highly confidential data?
How many contain member-member messages?
How many of them are accessible by team X?
Have all datasets been purged within SLA?
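The questions on this slide become simple filters once each dataset carries compliance metadata. The sketch below is not the WhereHows data model: the DatasetMetadata fields, tag names, team names and the 30-day SLA are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import FrozenSet, List


@dataclass(frozen=True)
class DatasetMetadata:
    name: str
    tags: FrozenSet[str]         # e.g. "pii", "member_messages" (hypothetical)
    readable_by: FrozenSet[str]  # teams with access (hypothetical)
    last_purged: date


def with_tag(catalog: List[DatasetMetadata], tag: str) -> List[DatasetMetadata]:
    """Which datasets carry a given tag?"""
    return [d for d in catalog if tag in d.tags]


def accessible_by(datasets: List[DatasetMetadata], team: str) -> List[DatasetMetadata]:
    """How many of them are accessible by a given team?"""
    return [d for d in datasets if team in d.readable_by]


def purge_overdue(catalog: List[DatasetMetadata], today: date, sla_days: int) -> List[str]:
    """Which datasets have not been purged within the SLA?"""
    return [d.name for d in catalog if today - d.last_purged > timedelta(days=sla_days)]


catalog = [
    DatasetMetadata("profiles", frozenset({"pii"}),
                    frozenset({"team-x", "team-y"}), date(2017, 9, 20)),
    DatasetMetadata("inbox", frozenset({"pii", "member_messages"}),
                    frozenset({"team-y"}), date(2017, 8, 1)),
    DatasetMetadata("page_views", frozenset(),
                    frozenset({"team-x"}), date(2017, 9, 25)),
]

pii_datasets = with_tag(catalog, "pii")
pii_for_team_x = accessible_by(pii_datasets, "team-x")
overdue = purge_overdue(catalog, today=date(2017, 9, 28), sla_days=30)
```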

Discovering Violations
REQUIREMENTS
Metadata
Wide + Deep
Comprehensive coverage of data systems at LinkedIn
We have > 20 systems!
SQL, NoSQL, Indexes, Blob Stores, …
Deeper understanding of each dataset
Schema is not enough
Need to understand semantics
WhereHows Architecture @ 10,000 ft
A METADATA REFINERY APPROACH
ML driven refinements
The Data Paradox
Discover Data METADATA Discover Violations

The Data Paradox

Many Transformation Engines @ LinkedIn
FREEDOM OF EXPRESSION
At Rest
In Motion
Challenge for Infrastructure Providers
HARD TO CHANGE ANYTHING UNDERNEATH!
Native readers, dependencies on path, format hard-coded
(Pig scripts)
Hard to move to better formats without breaking everyone or copying data twice
My Raw Data
Semantic Challenges
HARD TO CHANGE ANYTHING UPSTREAM!
Data is unclean (bad data on certain dates)
Data models are in constant flux (split event into multiple)
Have to change data processing logic everywhere!
My Raw Data
We need “microservices” for Data
AN API TO MANAGE EVOLUTION
My Data API
My Raw Data
We built Dali to solve this
A DATA ACCESS LAYER FOR LINKEDIN
Abstract away underlying physical details to allow users to focus solely on the logical concerns
Logical Tables + Views
Logical FileSystem
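The idea of a data access layer can be shown with a toy catalog: readers ask for a dataset by logical name and get rows through a view, while the storage format behind it can change. This is a sketch under assumptions, not Dali's implementation; LogicalDataset, PARSERS, CATALOG and profile_view are hypothetical names, and inline strings stand in for files.

```python
import csv
import io
import json
from typing import Callable, Dict, List

# Physical details: one parser per storage format.
PARSERS: Dict[str, Callable[[str], List[dict]]] = {
    "json": lambda text: [json.loads(line) for line in text.splitlines() if line],
    "csv": lambda text: list(csv.DictReader(io.StringIO(text))),
}


class LogicalDataset:
    """Pairs a physical payload and format with a view that defines the logical rows."""

    def __init__(self, storage_format: str, payload: str, view: Callable[[dict], dict]):
        self.storage_format = storage_format
        self.payload = payload
        self.view = view

    def read(self) -> List[dict]:
        rows = PARSERS[self.storage_format](self.payload)
        return [self.view(row) for row in rows]


def profile_view(row: dict) -> dict:
    """Logical schema seen by consumers, independent of the storage format."""
    return {"memberId": int(row["memberId"]), "firstName": row["firstName"]}


CATALOG = {
    "member_profile": LogicalDataset("csv", "memberId,firstName\n1,Ann\n", profile_view)
}
before = CATALOG["member_profile"].read()

# The owner migrates the data to another format; consumers do not change.
CATALOG["member_profile"] = LogicalDataset(
    "json", '{"memberId": 1, "firstName": "Ann"}\n', profile_view
)
after = CATALOG["member_profile"].read()
```

Because consumers depend only on the logical name and the view's schema, the format change needs no edits to their processing logic.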
Dali: Implementation Details in Context
[Diagram: the Dali stack]
Dali CLI
Dataflow APIs (MR, Spark, Scalding) and Query Layers (Pig, Hive, Spark)
Dali Datasets (Tables+Views)
Processing Engine (MR, Spark)
Data Catalog and Dali FileSystem
View Def + UDFs (Git + Artifactory)
Dataset Owner, Data Source, Data Sink
Access Restrictions
REQUIREMENTS
Different Types: Simple to Complex
Basic Restrictions: Access to dataset based on business need
Privacy by Default: Analysts shouldn’t get access to raw PII by default
Consent-based Access: Access to certain data elements only available if member has consented for that particular use case
Solving for Compliant Access
STEP 1: DATA + METADATA
Meta Data
NAME : is_pii MEMBER_ID
MEMBER_ID : is_pii NAME
Raw Dataset: PROFILE DATA
MemberProfile Schema = {
int memberId
String firstName
String lastName
Position[] positions
educationHistory[] educationHistory
…
}
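One way to picture "data + metadata" is a per-field tag map kept next to the schema. The slide's metadata labels are only partly legible, so the mapping below is an assumption: it tags memberId with MEMBER_ID and the two name fields with NAME, and marks all three as is_pii. FIELD_TAGS and pii_fields are hypothetical names.

```python
from typing import Dict, List

# Hypothetical field-level metadata for the MemberProfile schema.
# The tag assignments are assumed for illustration.
FIELD_TAGS: Dict[str, List[str]] = {
    "memberId": ["is_pii", "MEMBER_ID"],
    "firstName": ["is_pii", "NAME"],
    "lastName": ["is_pii", "NAME"],
    "positions": [],
    "educationHistory": [],
}


def pii_fields(field_tags: Dict[str, List[str]]) -> List[str]:
    """Return the names of the fields tagged as PII, in sorted order."""
    return sorted(name for name, tags in field_tags.items() if "is_pii" in tags)


def fields_of_type(field_tags: Dict[str, List[str]], semantic_type: str) -> List[str]:
    """Return the fields carrying a given semantic type tag."""
    return sorted(name for name, tags in field_tags.items() if semantic_type in tags)


pii = pii_fields(FIELD_TAGS)
names = fields_of_type(FIELD_TAGS, "NAME")
```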
Privacy Preferences
STEP 2: A MEMBER’S PREFERENCES
Privacy Preferences
A BITMAP DATASET: ONE PER MEMBER
Member Privacy Preferences
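A bitmap per member can be modeled with one bit per consent setting. The sketch below assumes three made-up settings; the Consent flag names, the member ids and the has_consented helper are hypothetical and do not reflect LinkedIn's actual preference schema.

```python
from enum import IntFlag
from typing import Dict


class Consent(IntFlag):
    """One bit per privacy preference (hypothetical settings)."""
    ADS_TARGETING = 1
    RESEARCH = 2
    THIRD_PARTY_SHARING = 4


# One bitmap per member, keyed by member id.
preferences: Dict[int, Consent] = {
    101: Consent.ADS_TARGETING | Consent.RESEARCH,
    102: Consent(0),
}


def has_consented(member_id: int, use_case: Consent) -> bool:
    """Unknown members are treated as not having consented."""
    return bool(preferences.get(member_id, Consent(0)) & use_case)


research_101 = has_consented(101, Consent.RESEARCH)
sharing_101 = has_consented(101, Consent.THIRD_PARTY_SHARING)
research_unknown = has_consented(999, Consent.RESEARCH)
```

Defaulting unknown members to "no consent" keeps the check consistent with privacy by default.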
Solving for Compliant Access With Dali
Dali Reader responsibility:
Given: (Dataset, Metadata, UseCase)
Generate:
Dataset and Column-level transformations (obfuscate, null, …)
Auto-join with Member Privacy Preferences (filter out data elements that are not consented to)
[Diagram: Raw Dataset + Meta Data + Member Privacy Preferences -> Dali Reader Library (Use Case = X) -> Processing Logic]
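The reader responsibility described here (given a dataset, its metadata and a use case, apply column-level transformations and join with member preferences) can be sketched as one function. This is an illustration under assumptions, not the Dali reader library: read_compliant, POLICY, the "analytics" use case, the tag names and the hash-based obfuscation are all hypothetical.

```python
import hashlib
from typing import Callable, Dict, List, Optional, Set


def obfuscate(value: object) -> str:
    """Replace a value with a short, stable digest."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]


def to_null(value: object) -> None:
    return None


# Per use case: which transformation applies to which semantic tag (assumed policy).
POLICY: Dict[str, Dict[str, Callable[[object], Optional[str]]]] = {
    "analytics": {"MEMBER_ID": obfuscate, "NAME": to_null},
}


def read_compliant(
    rows: List[dict],
    column_tags: Dict[str, str],
    consent_required: Set[str],
    preferences: Dict[int, Set[str]],
    use_case: str,
) -> List[dict]:
    transforms = POLICY[use_case]
    result = []
    for row in rows:
        # Join with the member's preferences on the raw member id.
        consented = preferences.get(row["memberId"], set())
        clean = {}
        for column, value in row.items():
            if column in consent_required and use_case not in consented:
                continue  # filter out elements not consented to
            transform = transforms.get(column_tags.get(column, ""))
            clean[column] = transform(value) if transform else value
        result.append(clean)
    return result


rows = [
    {"memberId": 1, "firstName": "Ann", "positions": ["Engineer"]},
    {"memberId": 2, "firstName": "Bob", "positions": ["Manager"]},
]
compliant = read_compliant(
    rows,
    column_tags={"memberId": "MEMBER_ID", "firstName": "NAME"},
    consent_required={"positions"},
    preferences={1: {"analytics"}},
    use_case="analytics",
)
```

Centralizing this logic in the reader means an analyst's code never touches raw PII unless the policy for the use case allows it.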
Solving for Compliant Purging With Dali + Gobblin
[Diagram: Raw Dataset + Meta Data + Member Privacy Preferences + Member’s Delete Requests -> Dali Reader Library (Use Case = Purge) -> Gobblin Purger -> Purged Dataset]
The Data Paradox
Easy Access DATA ACCESS LAYER Restricted Access

The Data Paradox : Solved !

The Technology Blueprint
DATA DEMOCRACY + DATA PROTECTION
DATA ACCESS LAYER: Dali
DATA LIFECYCLE MANAGEMENT: Apache Gobblin*
METADATA: WhereHows*
* Open Source : We can collaborate on these together!

Privacy : Technology + Process
SUSTAINABILITY IS CRITICAL
Privacy By Design
Core company value, implemented by Technology & Process
Product : Security & Privacy Review
Data : Data Model Review
Legal : Regulation change -> Tech requirements
Company-wide : “Horizontal” Initiatives
Key Takeaways
THE BEAST IS REAL
Data Protection
Getting Stricter and more complex
Stricter regulations in a digital world
Increasingly more complex to implement
This is an accelerating global trend
Key Takeaways
THE BEAST CAN BE TAMED !
Learnings at LinkedIn
We’ve established a blueprint to sustainably address privacy
Privacy By Design : baked into technology stack & product development process
Standardization : To solve at scale, certain parts need to be centralized and standardized
Company-wide : Needs co-ordinated effort across various functions
The Data Paradox : Solved !

Thank You!
