Expected Process:: Profiling Requirement

Profiling requirement:
We need a set of Python APIs that capture descriptive statistics and data profiling metrics for a
given dataset and write the results to a MySQL table as JSON.
Four APIs are required in total.
API 1 - Calculate basic statistics:
INPUT: Spark SQL Query, Output Table Name, # of Columns Per Packet, Parallelism
OUTPUT: Job submission confirmation
Spark SQL Query: SQL query to be executed on Spark SQL
Output Table Name: Name of the MySQL table to be created with the output
# of Columns Per Packet: Number of columns for which descriptive statistics are calculated in one shot
Parallelism: Number of packets to run concurrently
Expected process:
This program should connect to the Spark SQL environment and execute the SQL query provided as input. The
result set should be passed to one of the scalable Python libraries (PySpark, Koalas, Spark MLlib, Pandas, etc.) to
calculate the basic descriptive statistics: Type, Count, Distinct Count, # NULLs, Min, Max, Median.
https://databricks.com/blog/2019/08/22/guest-blog-how-virgin-hyperloop-one-reduced-processing-time-from-hours-to-minutes-with-koalas.html
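One way to compute these statistics for a whole packet in a single pass is to wrap the input query in one aggregate Spark SQL statement. The sketch below only builds that statement as a string; the function name `build_stats_query`, the `numeric` parameter, and the `<column>__<metric>` alias convention are assumptions made for illustration, not part of the requirement. The median uses Spark SQL's `percentile_approx`, which is an approximation and is only emitted for columns declared numeric. The Type statistic is not covered here; in PySpark it could be read from the result set's schema (for example `DataFrame.dtypes`).

```python
from __future__ import annotations


def build_stats_query(source_query: str, columns: list[str],
                      numeric: set[str] | None = None) -> str:
    """Build one Spark SQL statement that aggregates every column of a packet.

    `numeric` (hypothetical parameter) names the columns for which an
    approximate median should be computed.
    """
    numeric = numeric or set()
    select_items = []
    for col in columns:
        quoted = f"`{col}`"  # backticks quote identifiers in Spark SQL
        select_items += [
            f"COUNT({quoted}) AS `{col}__count`",
            f"COUNT(DISTINCT {quoted}) AS `{col}__distinct_count`",
            f"SUM(CASE WHEN {quoted} IS NULL THEN 1 ELSE 0 END) AS `{col}__nulls`",
            f"MIN({quoted}) AS `{col}__min`",
            f"MAX({quoted}) AS `{col}__max`",
        ]
        if col in numeric:
            # percentile_approx(col, 0.5) returns an approximate median
            select_items.append(
                f"percentile_approx({quoted}, 0.5) AS `{col}__median`"
            )
    # The user-supplied query becomes a subquery, so it runs unchanged
    return "SELECT " + ", ".join(select_items) + f" FROM ({source_query}) AS src"
```

The resulting string could be passed to `spark.sql(...)`; the single output row would then be split per column to form each field's JSON object.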
Output Table Structure:

CURRENT_PACKET  TOTAL_PACKETS  FIELD  JSON         STATUS       OVERALLSTATUS
1               5              COL1   JSON_OBJECT  Completed    In progress
….              ….             ….     ….           ….           In progress
….              ….             ….     ….           ….           In progress
3               5              COL31  JSON_OBJECT  In progress  In progress

CURRENT_PACKET – Packet number to which the specified column belongs
TOTAL_PACKETS – Total number of packets (total # of fields / # of columns per packet, rounded up)
FIELD – Column name from the dataset
JSON – Descriptive statistics result as a JSON array
STATUS – Status of the descriptive statistics calculation for the specified column
OVERALLSTATUS – Overall status of the descriptive statistics calculation for the entire dataset
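The packet arithmetic can be illustrated with a short sketch. It assumes that columns are assigned to packets in dataset order and that the division is rounded up so a final partial packet still counts; `plan_packets` is a hypothetical helper name, and the rows are plain dictionaries standing in for the MySQL rows.

```python
import math


def plan_packets(columns, columns_per_packet):
    """Return one initial output-table row per column (hypothetical helper)."""
    if columns_per_packet < 1:
        raise ValueError("columns_per_packet must be at least 1")
    # Round up so a trailing partial packet is counted
    total_packets = math.ceil(len(columns) / columns_per_packet)
    rows = []
    for index, field in enumerate(columns):
        rows.append({
            "CURRENT_PACKET": index // columns_per_packet + 1,  # 1-based
            "TOTAL_PACKETS": total_packets,
            "FIELD": field,
            "JSON": None,                  # filled in once the packet finishes
            "STATUS": "In progress",
            "OVERALLSTATUS": "In progress",
        })
    return rows
```

For example, 7 columns with 3 columns per packet give 3 packets, holding 3, 3, and 1 columns respectively.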
This process should create the table and insert an entry for every column, as shown in the table above.
The descriptive statistics process should run in parallel, up to the specified parallelism.
As soon as the statistics calculation is completed for a packet, the JSON object should be updated
with the statistics, together with the STATUS column. Once all packets are processed, the overall status
column should be updated to "Completed".
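The update flow described above might be sketched as follows. This is an illustration under assumptions, not the required implementation: the rows are in-memory dictionaries rather than MySQL rows, a thread pool stands in for whatever execution mechanism is chosen, and `run_packets` and `compute_packet` are hypothetical names. `compute_packet` is expected to take a list of field names and return a dictionary mapping each field to its statistics.

```python
import json
import threading
from concurrent.futures import ThreadPoolExecutor


def run_packets(rows, compute_packet, parallelism):
    """Process packets concurrently and update row statuses as they finish."""
    lock = threading.Lock()

    # Group the output rows by packet number
    packets = {}
    for row in rows:
        packets.setdefault(row["CURRENT_PACKET"], []).append(row)

    def work(packet_rows):
        fields = [r["FIELD"] for r in packet_rows]
        stats = compute_packet(fields)  # {field: statistics dict}
        with lock:  # one writer at a time, as a database update would be
            for r in packet_rows:
                r["JSON"] = json.dumps(stats[r["FIELD"]])
                r["STATUS"] = "Completed"

    # max_workers caps the number of packets running at the same time
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        futures = [pool.submit(work, p) for p in packets.values()]
        for future in futures:
            future.result()  # re-raises any error from a worker

    # Only after every packet has finished is the overall status flipped
    for r in rows:
        r["OVERALLSTATUS"] = "Completed"
    return rows
```

In a real deployment each status change would be an `UPDATE` against the output table, so that the status API can report partial progress while packets are still running.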
API 2 - Return the descriptive statistics process status and, if completed, the result:
INPUT: Output Table Name (MySQL)
OUTPUT: OVERALL_RUN_STATUS, RESULT
Output Table Name: Name of the MySQL table created in API 1
OVERALL_RUN_STATUS: Overall completion status of the descriptive statistics process started by API 1
RESULT: If the overall run status is completed, return the JSON object for each field. If the overall run status is
in progress, return the JSON objects for the fields whose descriptive statistics are completed.
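The status lookup could look roughly like the sketch below. It uses Python's built-in `sqlite3` with an in-memory database purely as a stand-in for MySQL so that the example is self-contained; the function name `get_status` and the shape of the returned dictionary are assumptions. The overall status is read from the OVERALLSTATUS column, which API 1 is responsible for maintaining.

```python
import json
import sqlite3


def get_status(conn, table_name):
    """Return the overall run status plus the JSON of every completed field."""
    # Table names cannot be bound as query parameters, so validate first
    if not table_name.isidentifier():
        raise ValueError(f"invalid table name: {table_name!r}")
    cursor = conn.execute(
        'SELECT FIELD, "JSON", STATUS, OVERALLSTATUS FROM ' + table_name
    )
    rows = cursor.fetchall()
    # Every row carries the same overall status; default if the table is empty
    overall = rows[0][3] if rows else "In progress"
    result = {
        field: json.loads(payload)
        for field, payload, status, _ in rows
        if status == "Completed" and payload is not None
    }
    return {"OVERALL_RUN_STATUS": overall, "RESULT": result}
```

While the run is still in progress, RESULT contains only the fields whose packets have finished, which matches the partial-result behaviour described above.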
API 3 - Calculate detailed profiling metrics:
INPUT: Spark SQL Query, Output Table Name, # of Columns Per Packet, Parallelism
OUTPUT: Job submission confirmation
Spark SQL Query: SQL query to be executed on the Spark environment
Output Table Name: Name of the MySQL table to be created with the output
# of Columns Per Packet: Number of columns for which profiling metrics are calculated in one shot
Parallelism: Number of packets to run concurrently
Expected process:
This program should connect to the Spark SQL environment and execute the SQL query provided as
input. The result set should be passed to one of the scalable Python libraries (Koalas, Spark MLlib, etc.) to
calculate detailed data profiling metrics as shown in the link below (refer to the Overview, Variables,
Correlations, and Missing Values sections):
https://blog.usejournal.com/pandas-profiling-to-boost-exploratory-data-analysis-8e718238bcd1
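To make the kind of metrics concrete, the sketch below computes a few of them in plain Python: missing-value counts and percentages, distinct counts, the most frequent values, and a Pearson correlation between two numeric columns. It is a small-data stand-in, not the scalable Koalas or Spark MLlib implementation the requirement asks for, and the names `profile_column` and `pearson` and the chosen metric keys are assumptions.

```python
import math
from collections import Counter


def profile_column(values, top_n=3):
    """Per-column metrics in the spirit of the Variables / Missing Values sections."""
    total = len(values)
    present = [v for v in values if v is not None]
    missing = total - len(present)
    counts = Counter(present)
    return {
        "count": total,
        "missing": missing,
        "missing_pct": round(100 * missing / total, 2) if total else 0.0,
        "distinct": len(counts),
        "top_values": counts.most_common(top_n),  # (value, frequency) pairs
    }


def pearson(xs, ys):
    """Pearson correlation over rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined for a constant column
    return cov / math.sqrt(var_x * var_y)
```

Note that correlations involve pairs of columns, so a design that profiles columns in independent packets would need to decide how cross-packet pairs are handled; the requirement does not specify this.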
Output Table Structure:

CURRENT_PACKET  TOTAL_PACKETS  FIELD  JSON         STATUS       OVERALLSTATUS
….              ….             ….     ….           ….           In progress
….              ….             ….     ….           ….           In progress
3               5              COL31  JSON_OBJECT  In progress  In progress

CURRENT_PACKET – Packet number to which the specified column belongs
TOTAL_PACKETS – Total number of packets (total # of fields / # of columns per packet, rounded up)
FIELD – Column name from the dataset
JSON – Detailed profiling metrics result as a JSON array
STATUS – Status of the detailed profiling metrics calculation for the specified column
OVERALLSTATUS – Overall status of the detailed profiling metrics calculation for the entire dataset
This process should create the table and insert an entry for every column, as shown in the table above. The
profiling metrics process should run in parallel, up to the specified parallelism.
As soon as the profiling metrics calculation is completed for a packet, the JSON object should be updated
with the metrics, together with the STATUS column. Once all packets are processed, the overall status column
should be updated to "Completed".
API 4 - Return the detailed profiling metrics process status and, if completed, the result:
INPUT: Output Table Name (MySQL)
OUTPUT: OVERALL_RUN_STATUS, RESULT
Output Table Name: Name of the MySQL table created in API 3
OVERALL_RUN_STATUS: Overall completion status of the detailed data profiling metrics process started by API 3
RESULT: If the overall run status is completed, return the JSON object for each field. If the overall run status is
in progress, return the JSON objects for the fields whose detailed data profiling metrics are completed.
