
Taming the ever-evolving Compliance Beast:
Lessons learnt at LinkedIn

Shirshanka Das, Principal Staff Engineer, LinkedIn


Tushar Shanbhag, Head of Data Products, LinkedIn

Sept 28, 2017

@shirshanka, @tusharis
Data Protection in a Digital World
PLAYING CATCH-UP WITH INNOVATION

GDPR
LinkedIn’s Vision

Create economic opportunity for every member of the global workforce
500M members, 10M companies, 10M jobs, 11B endorsements, 29K schools
The LinkedIn Privacy Paradox
MEMBER PRIVACY <> MEMBER DISCOVERY

“On one hand, the company has 500+ million members trusting the company to protect highly sensitive data. On the other hand, one only joins the largest professional network on the Internet because they want to be found!”

Kalinda Raina, Head of Global Privacy, LinkedIn
Members First is a Core Value for LinkedIn
MEMBER PRIVACY WHILE DELIVERING MEMBER VALUE

Member value is proportional to knowledge
Member privacy is paramount for LinkedIn
We strive to maintain this fine balance

Example
Well-connected: Get relevance right.
Few connections: Give them inventory.
Data Is the Lifeblood of LinkedIn
MEMBER EXPERIENCES + BUSINESS DECISIONS

Member Experiences

Member Data
System of Intelligence

Business Decisions
Data Democracy
ALL THE DATA, ALL THE TIME

LinkedIn Data Science
We needed data democracy to deliver member value

I want to analyze as much data as possible so my models are accurate
I want to discover data that’s needed for my analysis as fast as possible
I want to access that data as quickly as possible for my analysis

Data Protection
STORE, PROCESS, DELETE,..

LinkedIn Members
Need to Ensure Member Privacy

I want my personal data to be stored only where needed and not propagated unnecessarily
I want my personal data to be deleted when I close my account or request deletion
I want my personal data to only be processed if essential and only if I consent
The Data Paradox
DATA DEMOCRACY <> DATA PROTECTION

More Data | Less Data
Discover Data | Discover Violations
Easy Access | Restricted Access


LinkedIn’s Data Ecosystem
Data Hubs at LinkedIn

In Motion
Scale: O(10) clusters, ~2.3 Trillion messages, ~450 TB

At Rest
Scale: O(10) clusters, ~10K machines, ~100 PB
Data Integration

External systems: SFTP, JDBC, REST, Azure Blob, Data Lake Storage
Data hubs: At Rest, In Motion
Apache Gobblin: Simplifying Data Integration
SFTP, JDBC, REST, Azure Blob, Data Lake Storage

@LinkedIn
Hundreds of TB per day
Thousands of datasets
~30 different source systems
80%+ of data ingest

Open source @ https://gobblin.apache.org/
Stream + Batch
Adopted by LinkedIn, Intel, PayPal, Apple, IBM, Swisscom, Prezi, AppLift, NerdWallet and many more…
Less Data
REQUIREMENTS

Legal: Right to Erasure or Right to be Forgotten

“Delete all my personal data without undue delay when it is no longer necessary / when consent has been withdrawn”

Engineering:

Need the ability to delete some specific subset or all data associated with a specific LinkedIn member from all our data systems
Data Deletion
IMPLICATIONS FOR HADOOP

Challenges
A lot of data, different formats

Understand HDFS data: organization, formats, …
Cycle asynchronously, within an SLA, deleting records, without affecting running jobs
Quarantine exceptional records for manual triage
Can scale to processing hundreds of PB of data


Gobblin: The Logical Pipeline
Source -> Work Units -> Tasks (Extract -> Convert -> Quality -> Write) -> Data Publish
Each Work Unit runs as its own Task; Data Publish runs once over the output of all Tasks.
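
The logical pipeline above can be pictured with a minimal Python sketch. This is not Gobblin's API (Gobblin is a Java framework); the function names `run_task` and `run_job` and the record layout are assumptions made purely for illustration. Each work unit is processed by one task that extracts, converts, quality-checks and writes records to staging, and publishing happens once at the end.

```python
from typing import Callable, Iterable, List

Record = dict


def run_task(work_unit: Iterable[Record],
             convert: Callable[[Record], Record],
             quality_ok: Callable[[Record], bool]) -> List[Record]:
    """Run one task: extract -> convert -> quality check -> write to staging."""
    staged = []
    for record in work_unit:             # extract
        converted = convert(record)      # convert
        if quality_ok(converted):        # quality check
            staged.append(converted)     # write (task-local staging)
    return staged


def run_job(work_units, convert, quality_ok) -> List[Record]:
    """Run each work unit as a task, then publish all staged output together."""
    staged_outputs = [run_task(wu, convert, quality_ok) for wu in work_units]
    return [record for out in staged_outputs for record in out]  # publish


# Hypothetical input: two work units produced by a source.
work_units = [
    [{"id": 1, "name": " ada "}, {"id": None, "name": "bad"}],
    [{"id": 2, "name": "grace"}],
]
result = run_job(
    work_units,
    convert=lambda r: {**r, "name": r["name"].strip()},
    quality_ok=lambda r: r["id"] is not None,
)
```

In this sketch the record with a missing `id` fails the quality check and never reaches the published output.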
Gobblin: Extending for Purge

HDFS -> Work Units -> Tasks (Extract -> Convert -> Quality -> Write) -> Data Publish -> HDFS
Purge check, fed by Member’s Delete Requests: if needs purge then drop, else continue
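
The purge extension can be sketched as a per-record decision inside the pipeline. The following is an illustrative Python sketch, not Gobblin code; the function name `purge_dataset`, the `memberId` field and the quarantine rule (records with no member id go to manual triage) are assumptions chosen to mirror the slides.

```python
def purge_dataset(records, delete_requests, id_field="memberId"):
    """Split records into kept and quarantined lists and count dropped records.

    delete_requests is the set of member ids whose data must be purged.
    """
    kept, quarantined, dropped = [], [], 0
    for record in records:
        if id_field not in record:
            # Exceptional record: cannot decide automatically, triage manually.
            quarantined.append(record)
        elif record[id_field] in delete_requests:
            # Needs purge: drop the record from the rewritten dataset.
            dropped += 1
        else:
            # Does not need purge: continue through the pipeline unchanged.
            kept.append(record)
    return kept, quarantined, dropped


# Hypothetical data for illustration.
records = [
    {"memberId": 1, "page": "a"},
    {"memberId": 42, "page": "b"},
    {"page": "c"},
]
kept, quarantined, dropped = purge_dataset(records, delete_requests={42})
```

Because the output is written as a new copy and published at the end, jobs reading the old copy are not disturbed while the purge runs.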
Gobblin: Data Lifecycle Management at Scale
STATUS AND CHALLENGES

Status
Number of datasets: many thousands
Amount of data scanned for purge: XXX TB/day
Challenges
Immutable Storage Formats + Right to Erasure = Unhappy Disks
“Widespread implementation will surely lead to innovation in these formats!”
The Data Paradox
DATA DEMOCRACY <> DATA PROTECTION

More Data | DATA LIFECYCLE MANAGEMENT | Less Data
Discover Data | Discover Violations
Easy Access | Restricted Access


LinkedIn’s Data Ecosystem
WhereHows
FIND DATA, NAVIGATE RELATIONSHIPS

Where is dataset X?
How did it get created?

Data Discovery
Metadata based Search Experience for Data Scientists

Usage: In production since 2014
Users: Data Scientists, Product Engineers
Use Cases: Discovery, Impact Analysis

Open source @ github.com/linkedin/wherehows


WhereHows
SEARCH SCREENSHOTS
WhereHows
LINEAGE SCREENSHOTS
Discovering Violations
ANSWERING HARDER QUESTIONS

Use Cases
More than just Discovery

Which datasets at LinkedIn contain PII or highly confidential data?
How many contain member-member messages?
How many of them are accessible by team X?
Have all datasets been purged within SLA?
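
Questions like these become simple queries once every dataset carries metadata. The Python sketch below illustrates the idea with a tiny in-memory catalog; it does not reflect the WhereHows schema or API, and every name in it (`DatasetMeta`, `has_pii`, `readers`, `last_purged`, the sample datasets) is hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Optional, Set


@dataclass
class DatasetMeta:
    name: str
    has_pii: bool
    readers: Set[str] = field(default_factory=set)  # teams with access
    last_purged: Optional[date] = None


# Hypothetical catalog entries.
catalog = [
    DatasetMeta("profiles", True, {"search", "ads"}, date(2017, 9, 20)),
    DatasetMeta("page_views", False, {"ads"}, date(2017, 8, 1)),
    DatasetMeta("messages", True, {"messaging"}, date(2017, 8, 15)),
]


def pii_datasets(catalog):
    """Which datasets contain PII?"""
    return [d.name for d in catalog if d.has_pii]


def pii_accessible_by(catalog, team):
    """Which PII datasets are accessible by a given team?"""
    return [d.name for d in catalog if d.has_pii and team in d.readers]


def purge_sla_violations(catalog, today, sla_days):
    """Which PII datasets have not been purged within the SLA window?"""
    cutoff = today - timedelta(days=sla_days)
    return [
        d.name for d in catalog
        if d.has_pii and (d.last_purged is None or d.last_purged < cutoff)
    ]


violations = purge_sla_violations(catalog, today=date(2017, 9, 28), sla_days=30)
```

The 30-day SLA used here is an arbitrary value for the example, not a figure from the talk.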


Discovering Violations
REQUIREMENTS

Metadata
Wide + Deep

Comprehensive coverage of data systems at LinkedIn
We have > 20 systems!
SQL, NoSQL, Indexes, Blob Stores, …

Deeper understanding of each dataset
Schema is not enough
Need to understand semantics
WhereHows Architecture @ 10,000 ft
A METADATA REFINERY APPROACH

ML driven refinements
The Data Paradox
DATA DEMOCRACY <> DATA PROTECTION

More Data | DATA LIFECYCLE MANAGEMENT | Less Data
Discover Data | METADATA | Discover Violations
Easy Access | Restricted Access


Many Transformation Engines @ LinkedIn
FREEDOM OF EXPRESSION

At Rest
In Motion
Challenge for Infrastructure Providers
HARD TO CHANGE ANYTHING UNDERNEATH!

Native readers, dependencies on path, format hard-coded
Hard to move to better formats without breaking everyone or copying data twice

My Raw Data (Pig scripts)
Semantic Challenges
HARD TO CHANGE ANYTHING UPSTREAM!

Data is unclean (bad data on certain dates)
Data models are in constant flux (split event into multiple)
Have to change data processing logic everywhere!

My Raw Data
We need “microservices” for Data
AN API TO MANAGE EVOLUTION

My Data API
My Raw Data
We built Dali to solve this
A DATA ACCESS LAYER FOR LINKEDIN

Abstract away underlying physical details to allow users to focus solely on the logical concerns

Logical Tables + Views

Logical FileSystem
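
The idea of a data access layer can be illustrated with a small Python sketch: consumers ask for a logical table name, and a catalog decides which physical path and format to read. This is not Dali's API; `CATALOG`, `STORAGE`, `READERS` and `read_table` are hypothetical names, and the in-memory "storage" stands in for a real file system.

```python
import csv
import io
import json

# Hypothetical physical storage: path -> raw file contents.
STORAGE = {
    "/data/v1/members.csv": "memberId,firstName\n1,Ada\n2,Grace\n",
    "/data/v2/members.json": (
        '[{"memberId": 1, "firstName": "Ada"},'
        ' {"memberId": 2, "firstName": "Grace"}]'
    ),
}


def read_members_csv(text):
    """Parse the CSV layout into the same row shape as the JSON layout."""
    return [
        {"memberId": int(row["memberId"]), "firstName": row["firstName"]}
        for row in csv.DictReader(io.StringIO(text))
    ]


# One reader per physical format; consumers never call these directly.
READERS = {"csv": read_members_csv, "json": json.loads}

# Logical name -> current physical location and format.
CATALOG = {"members": {"path": "/data/v1/members.csv", "format": "csv"}}


def read_table(logical_name):
    """Resolve a logical table to its physical data and return rows."""
    entry = CATALOG[logical_name]
    return READERS[entry["format"]](STORAGE[entry["path"]])


before = read_table("members")
# The infrastructure team migrates to a new format; consumer code is unchanged.
CATALOG["members"] = {"path": "/data/v2/members.json", "format": "json"}
after = read_table("members")
```

The consumer's call, `read_table("members")`, is identical before and after the migration, which is the property the hard-coded paths and formats on the earlier slides lack.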
Dali: Implementation Details in Context

Dali CLI
Dataflow APIs (MR, Spark, Scalding)
Query Layers (Pig, Hive, Spark)
Dali Datasets (Tables+Views)
Processing Engine (MR, Spark)
Data Catalog
Dali FileSystem
View Def + UDFs
Git + Artifactory
Dataset Owner
Data Source
Data Sink
Access Restrictions
REQUIREMENTS

Different Types
Simple to Complex

Basic Restrictions
Access to dataset based on business need
Privacy by Default
Analysts shouldn’t get access to raw PII by default
Consent-based Access
Access to certain data elements only available if member has consented for that particular use-case
Solving for Compliant Access
STEP 1: DATA + METADATA

Metadata
NAME : is_pii
MEMBER_ID : is_pii

Raw Dataset: MemberProfile (PROFILE DATA)
Schema = {
    int memberId
    String firstName
    String lastName
    Position[] positions
    educationHistory[] educationHistory
}
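
One way to picture Step 1 is a schema plus a separate metadata map that tags fields with a semantic type and a PII flag. The Python sketch below is an assumption-laden illustration: the `FIELD_METADATA` layout, the `semantic_type` key and the simplification of `positions` and `educationHistory` to lists of strings are not from the talk; only the field names and the `NAME`, `MEMBER_ID` and `is_pii` labels come from the slide.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class MemberProfile:
    memberId: int
    firstName: str
    lastName: str
    positions: List[str]          # simplified stand-in for Position[]
    educationHistory: List[str]   # simplified stand-in for the slide's array type


# Hypothetical field-level metadata, stored alongside the dataset.
FIELD_METADATA = {
    "memberId": {"semantic_type": "MEMBER_ID", "is_pii": True},
    "firstName": {"semantic_type": "NAME", "is_pii": True},
    "lastName": {"semantic_type": "NAME", "is_pii": True},
}


def pii_fields(metadata):
    """Return the names of fields tagged as PII, in sorted order."""
    return sorted(name for name, tags in metadata.items() if tags.get("is_pii"))


# Sanity check: metadata should only describe fields that exist in the schema.
schema_fields = {f.name for f in fields(MemberProfile)}
unknown = set(FIELD_METADATA) - schema_fields
```

Keeping the tags outside the schema lets the same annotations drive discovery, access control and purging without changing the data itself.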
Privacy Preferences
STEP 2: A MEMBER’S PREFERENCES
Privacy Preferences
A BITMAP DATASET: ONE PER MEMBER

Member Privacy Preferences
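
A bitmap of preferences per member can be sketched with Python's `enum.IntFlag`, where each bit stands for one use case. The use-case names and the `has_consented` helper are hypothetical; the talk does not list the actual bits.

```python
from enum import IntFlag


class UseCase(IntFlag):
    # Hypothetical use cases, one bit each.
    ADS_TARGETING = 1
    RESEARCH = 2
    RECOMMENDATIONS = 4


# One bitmap per member: member id -> consented use cases.
preferences = {
    1: UseCase.ADS_TARGETING | UseCase.RESEARCH,
    2: UseCase(0),  # consented to nothing
}


def has_consented(prefs, member_id, use_case):
    """True if the member's bitmap has the bit for this use case set."""
    # Privacy by default: a member with no entry is treated as not consenting.
    bits = prefs.get(member_id, UseCase(0))
    return bool(bits & use_case)
```

For example, `has_consented(preferences, 1, UseCase.RESEARCH)` is true, while any lookup for an unknown member is false.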
Solving for Compliant Access With Dali

Dali Reader responsibility:
Given:
(Dataset, Metadata, UseCase)
Generate:
Dataset and Column-level transformations (obfuscate, null, …)
Auto-join with Member Privacy Preferences (filter out data elements that are not consented to)

Diagram: Raw Dataset + Metadata + Member Privacy Preferences + Use Case = X -> Dali Reader Library -> Processing Logic
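
The reader's responsibility can be sketched as a function of (dataset, metadata, use case) that applies column-level transformations and joins against consent. This Python sketch is not the Dali reader; `POLICY`, `read_compliant`, the `analytics` use case and the truncated SHA-256 used for obfuscation are all assumptions, and it simplifies the slide's element-level filtering to dropping whole rows for members who have not consented.

```python
import hashlib

# Hypothetical column tags (semantic types) for the dataset.
METADATA = {"memberId": "MEMBER_ID", "firstName": "NAME", "lastName": "NAME"}

# Hypothetical policy: per use case, the action for each semantic type.
# Columns without a tag, or without an action, pass through unchanged.
POLICY = {"analytics": {"MEMBER_ID": "obfuscate", "NAME": "null"}}


def obfuscate(value):
    """Placeholder obfuscation; a real system would use a vetted scheme."""
    return hashlib.sha256(str(value).encode()).hexdigest()[:12]


def read_compliant(rows, metadata, consents, use_case):
    """Return only consented rows, with tagged columns transformed."""
    actions = POLICY[use_case]
    out = []
    for row in rows:
        # Auto-join with privacy preferences: skip members without consent.
        if use_case not in consents.get(row["memberId"], set()):
            continue
        clean = {}
        for col, value in row.items():
            action = actions.get(metadata.get(col))
            if action == "null":
                clean[col] = None
            elif action == "obfuscate":
                clean[col] = obfuscate(value)
            else:
                clean[col] = value
        out.append(clean)
    return out


rows = [
    {"memberId": 1, "firstName": "Ada", "lastName": "L", "industry": "Math"},
    {"memberId": 2, "firstName": "Grace", "lastName": "H", "industry": "Navy"},
]
consents = {1: {"analytics"}}  # member 2 has not consented
result = read_compliant(rows, METADATA, consents, "analytics")
```

Because the policy lives in the reader rather than in each job, processing logic written against the returned rows does not need to know which columns are sensitive.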
Solving for Compliant Purging With Dali + Gobblin

Diagram: Raw Dataset + Metadata + Member Privacy Preferences + Member’s Delete Requests + Use Case = Purge -> Dali Reader Library -> Gobblin Purger -> Purged Dataset
The Data Paradox
DATA DEMOCRACY <> DATA PROTECTION

More Data | DATA LIFECYCLE MANAGEMENT | Less Data
Discover Data | METADATA | Discover Violations
Easy Access | DATA ACCESS LAYER | Restricted Access


The Data Paradox: Solved!
DATA DEMOCRACY <> DATA PROTECTION

More Data | DATA LIFECYCLE MANAGEMENT | Less Data
Discover Data | METADATA | Discover Violations
Easy Access | DATA ACCESS LAYER | Restricted Access


The Technology Blueprint
DATA DEMOCRACY + DATA PROTECTION

DATA ACCESS LAYER: Dali
DATA LIFECYCLE MANAGEMENT: Apache Gobblin*
METADATA: WhereHows*

* Open Source : We can collaborate on these together!


Privacy : Technology + Process
SUSTAINABILITY IS CRITICAL

Privacy By Design
Core company value, implemented by Technology & Process

Product: Security & Privacy Review
Data: Data Model Review
Legal: Regulation change -> Tech requirements
Company-wide: “Horizontal” Initiatives
Key Takeaways
THE BEAST IS REAL

Data Protection
Getting stricter and more complex

Stricter regulations in a digital world
Increasingly complex to implement
This is an accelerating global trend
Key Takeaways
THE BEAST CAN BE TAMED!

Learnings at LinkedIn
We’ve established a blueprint to sustainably address privacy

Privacy By Design: baked into technology stack & product development process
Standardization: To solve at scale, certain parts need to be centralized and standardized
Company-wide: Needs co-ordinated effort across various functions
Thank You!
