
Profiling requirement:

We need a set of Python APIs that capture the descriptive statistics and data profiling metrics for a
given dataset and write the result to a MySQL table in the form of JSON.

Four APIs are needed in total.

API 1 - Calculate basic statistics:

INPUT: SPARK SQL Query, Output Table Name, # of Columns Per Packet, Parallelism
OUTPUT: Job submission confirmation

SPARK SQL Query: SQL query to be executed on Spark SQL
Output Table Name: Name of the MySQL table to be created to hold the output
# of Columns Per Packet: Number of columns for which descriptive statistics are calculated in one shot
Parallelism: # of packets to run concurrently
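
Because the only output of API 1 is a job submission confirmation, the call presumably returns before the
statistics are ready. A minimal sketch of that pattern follows. The function name, the job_id field, the
"Submitted" status and the sample query and table name are all hypothetical, and the runner argument stands in
for the real Spark job:

```python
import threading
import uuid


def submit_basic_stats_job(sql_query, output_table, columns_per_packet, parallelism, runner):
    """Validate the inputs, start the job in the background and return a confirmation."""
    if columns_per_packet < 1 or parallelism < 1:
        raise ValueError("columns_per_packet and parallelism must be >= 1")
    job_id = uuid.uuid4().hex  # hypothetical identifier, not required by the requirement
    worker = threading.Thread(
        target=runner,
        args=(sql_query, output_table, columns_per_packet, parallelism),
        daemon=True,
    )
    worker.start()
    # The caller gets this back immediately; progress is read later from the output table.
    return {"job_id": job_id, "output_table": output_table, "status": "Submitted"}


# Usage with a stand-in runner that only signals that it was started.
done = threading.Event()
confirmation = submit_basic_stats_job(
    "SELECT * FROM sales", "sales_profile", 20, 4, lambda *args: done.set()
)
```

The caller then polls the output table (API 2) rather than waiting on this call.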

Expected process:
This program should connect to the Spark SQL environment and execute the SQL query provided as input. The
result set should be passed to one of the scalable Python libraries (PySpark, Koalas, Spark MLlib, Pandas etc.) to
calculate the basic descriptive statistics: Type, Count, Distinct Count, # NULLs, Min, Max, Median.
https://databricks.com/blog/2019/08/22/guest-blog-how-virgin-hyperloop-one-reduced-processing-time-from-hours-to-minutes-with-koalas.html
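
To pin down what the per-column JSON might contain, here is a small pure-Python stand-in. It is not the
scalable implementation the requirement asks for (that would use PySpark or Koalas); it only illustrates the
listed statistics. The key names are assumptions, and "count" is taken here to be the total row count
including NULLs, which the requirement does not specify:

```python
import json
import statistics


def describe_column(values):
    """Return the basic statistics for one column as a JSON-serialisable dict."""
    non_null = [v for v in values if v is not None]
    stats = {
        "type": type(non_null[0]).__name__ if non_null else None,
        "count": len(values),  # assumption: total rows, NULLs included
        "distinct_count": len(set(non_null)),
        "null_count": len(values) - len(non_null),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "median": None,
    }
    # In this sketch the median is only computed for numeric columns.
    if non_null and all(isinstance(v, (int, float)) for v in non_null):
        stats["median"] = statistics.median(non_null)
    return stats


result = describe_column([3, 1, None, 4, 1])
print(json.dumps(result))
```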

Output Table Structure:

CURRENT_PACKET  TOTAL_PACKETS  FIELD  JSON         STATUS       OVERALLSTATUS
1               5              COL1   JSON_OBJECT  Completed    In progress
1               5              COL2   JSON_OBJECT  Completed    In progress
1               5              COL3   JSON_OBJECT  Completed    In progress
1               5              COL4   JSON_OBJECT  Completed    In progress
...             ...            ...    ...          ...          In progress
2               5              COL21  JSON_OBJECT  Completed    In progress
2               5              COL22  JSON_OBJECT  Completed    In progress
2               5              COL23  JSON_OBJECT  Completed    In progress
2               5              COL24  JSON_OBJECT  Completed    In progress
...             ...            ...    ...          ...          In progress
3               5              COL31  JSON_OBJECT  In progress  In progress

CURRENT_PACKET – Packet number to which the column is assigned
TOTAL_PACKETS – Total number of packets (total # of fields / # of columns per packet, rounded up)
FIELD – Column name from the dataset
JSON – Descriptive statistics result as a JSON array
STATUS – Status of the descriptive statistics calculation for the specified column
OVERALLSTATUS – Overall status of the descriptive statistics calculation for the entire dataset

This process should create the table and insert an entry for every column, as shown in the table above.
The descriptive statistics calculation should run in parallel, with the number of concurrent packets set by the
Parallelism input.
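
The packet assignment described above can be sketched as follows. The function name is hypothetical, and the
initial "In progress" value for rows that have not started yet is an assumption, since the requirement shows
only "Completed" and "In progress":

```python
import math


def build_packet_rows(columns, columns_per_packet):
    """Assign each column to a packet and return the initial rows for the output table."""
    # Round up so that a final, partially filled packet is still counted.
    total_packets = math.ceil(len(columns) / columns_per_packet)
    rows = []
    for index, column in enumerate(columns):
        rows.append({
            "CURRENT_PACKET": index // columns_per_packet + 1,
            "TOTAL_PACKETS": total_packets,
            "FIELD": column,
            "JSON": None,  # filled in once the packet has been processed
            "STATUS": "In progress",
            "OVERALLSTATUS": "In progress",
        })
    return rows


# Seven columns at three per packet give packets of 3, 3 and 1 columns.
rows = build_packet_rows([f"COL{i}" for i in range(1, 8)], columns_per_packet=3)
```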

As soon as the statistics calculation completes for a packet, the JSON column for each of its fields should be
updated with the statistics, along with the STATUS column. Once all packets have been processed, the
OVERALLSTATUS column should be updated to “Completed”.
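
One way to get this update-as-you-go behaviour is to run the packets in a thread pool and write each result
from the main thread as it arrives. The sketch below uses an in-memory SQLite database purely as a stand-in
for MySQL so that it runs anywhere. The table name "profile", the function name and the dummy compute
function are assumptions, and a real implementation should validate the table name before putting it into SQL:

```python
import json
import sqlite3
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_packets(conn, table, packets, compute, parallelism):
    """Run compute(columns) for each packet concurrently and record results as they finish."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        futures = [pool.submit(compute, cols) for cols in packets.values()]
        for future in as_completed(futures):
            # compute returns {column_name: stats_dict} for one packet.
            for field, stats in future.result().items():
                conn.execute(
                    f'UPDATE {table} SET "JSON" = ?, STATUS = ? WHERE FIELD = ?',
                    (json.dumps(stats), "Completed", field),
                )
            conn.commit()
    # Every packet has finished, so the dataset-level status can be closed out.
    conn.execute(f"UPDATE {table} SET OVERALLSTATUS = 'Completed'")
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE profile (CURRENT_PACKET INT, TOTAL_PACKETS INT, FIELD TEXT, '
    '"JSON" TEXT, STATUS TEXT, OVERALLSTATUS TEXT)'
)
packets = {1: ["COL1", "COL2"], 2: ["COL3"]}
for number, cols in packets.items():
    for col in cols:
        conn.execute(
            "INSERT INTO profile VALUES (?, ?, ?, NULL, 'In progress', 'In progress')",
            (number, len(packets), col),
        )

# Dummy compute function standing in for the real statistics calculation.
run_packets(conn, "profile", packets, lambda cols: {c: {"count": 0} for c in cols}, parallelism=2)
```

Keeping all database writes on the main thread avoids sharing one connection across worker threads.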

API 2 – Return the descriptive statistics process status, if completed – return result

INPUT: Output Table Name (MySQL)


OUTPUT: OVERALL_RUN_STATUS, RESULT

Output Table Name: MySQL table name created in API 1
OVERALL_RUN_STATUS: Overall completion status of the descriptive statistics process started by API 1
RESULT: If the overall run status is completed, return the JSON object for each field. If the overall run status is
in progress, return the JSON object for the fields whose descriptive statistics are completed.
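
A minimal sketch of this status lookup is shown below, again with in-memory SQLite standing in for MySQL.
The function name, the table name "profile" and the sample rows are hypothetical. The same logic would serve
a table produced by the detailed profiling process:

```python
import json
import sqlite3


def get_profile_status(conn, table):
    """Return (overall_run_status, result); result maps each completed field to its JSON."""
    rows = conn.execute(
        f'SELECT FIELD, "JSON", STATUS, OVERALLSTATUS FROM {table}'
    ).fetchall()
    if not rows:
        raise ValueError(f"no rows found in {table}")
    overall = "Completed" if all(r[3] == "Completed" for r in rows) else "In progress"
    # Only fields whose own STATUS is Completed have a usable JSON payload.
    result = {
        field: json.loads(payload)
        for field, payload, status, _ in rows
        if status == "Completed"
    }
    return overall, result


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE profile (FIELD TEXT, "JSON" TEXT, STATUS TEXT, OVERALLSTATUS TEXT)')
conn.executemany(
    "INSERT INTO profile VALUES (?, ?, ?, ?)",
    [
        ("COL1", '{"count": 10}', "Completed", "In progress"),
        ("COL2", None, "In progress", "In progress"),
    ],
)
overall, result = get_profile_status(conn, "profile")
```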

API 3 - Calculate Detailed Profiling Metrics:

INPUT: SPARK SQL Query, Output Table Name, # of Columns Per Packet, Parallelism
OUTPUT: Job submission confirmation

SPARK SQL Query: SQL query to be executed on the Spark environment
Output Table Name: Name of the MySQL table to be created to hold the output
# of Columns Per Packet: Number of columns for which profiling metrics are calculated in one shot
Parallelism: # of processes to run concurrently

Expected process:
This program should connect to the Spark SQL environment and execute the SQL query provided as
input. The result set should be passed to one of the scalable Python libraries (Koalas, Spark MLlib etc.) to
calculate detailed data profiling metrics as shown at the link below (refer to the Overview, Variables,
Correlations and Missing Values sections):

https://blog.usejournal.com/pandas-profiling-to-boost-exploratory-data-analysis-8e718238bcd1

Output Table Structure:

CURRENT_PACKET  TOTAL_PACKETS  FIELD  JSON         STATUS       OVERALLSTATUS
1               5              COL1   JSON_OBJECT  Completed    In progress
1               5              COL2   JSON_OBJECT  Completed    In progress
1               5              COL3   JSON_OBJECT  Completed    In progress
1               5              COL4   JSON_OBJECT  Completed    In progress
...             ...            ...    ...          ...          In progress
2               5              COL21  JSON_OBJECT  Completed    In progress
2               5              COL22  JSON_OBJECT  Completed    In progress
2               5              COL23  JSON_OBJECT  Completed    In progress
2               5              COL24  JSON_OBJECT  Completed    In progress
...             ...            ...    ...          ...          In progress
3               5              COL31  JSON_OBJECT  In progress  In progress

CURRENT_PACKET – Packet number to which the column is assigned
TOTAL_PACKETS – Total number of packets (total # of fields / # of columns per packet, rounded up)
FIELD – Column name from the dataset
JSON – Detailed profiling metrics result as a JSON array
STATUS – Status of the detailed profiling metrics calculation for the specified column
OVERALLSTATUS – Overall status of the detailed profiling metrics calculation for the entire dataset

This process should create the table and insert an entry for every column, as shown in the table above. The
profiling metrics calculation should run in parallel, with the number of concurrent processes set by the
Parallelism input.

As soon as the profiling metrics calculation completes for a packet, the JSON column for each of its fields
should be updated with the metrics, along with the STATUS column. Once all packets have been processed, the
OVERALLSTATUS column should be updated to “Completed”.
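
Two of the detailed metrics named in the expected process, missing values and correlations, can be
illustrated in plain Python. This is only a stand-in to show what would be computed: the function and key
names are assumptions, Pearson is just one possible correlation measure, and a real implementation would use
one of the scalable libraries named above:

```python
import math


def missing_summary(column):
    """Count and fraction of missing (None) values in one column."""
    missing = sum(1 for v in column if v is None)
    return {"n_missing": missing, "p_missing": missing / len(column) if column else 0.0}


def pearson(xs, ys):
    """Pearson correlation over the rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined for a constant column
    return cov / math.sqrt(var_x * var_y)


summary = missing_summary([1, None, 3, None])
corr = pearson([1, 2, 3, None], [2, 4, 6, 8])
```

Note that a correlation involves a pair of columns, which may fall into different packets, so the design has
to decide where such cross-column metrics are computed and stored. The requirement does not address this.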

API 4 – Return the detailed profiling metrics process status, if completed – return result

INPUT: Output Table Name


OUTPUT: OVERALL_RUN_STATUS, RESULT

Output Table Name: MySQL table name created in API 3
OVERALL_RUN_STATUS: Overall completion status of the detailed data profiling metrics process started by API 3
RESULT: If the overall run status is completed, return the JSON object for each field. If the overall run status is
in progress, return the JSON object for the fields whose detailed data profiling metrics are completed.
