We need a set of Python APIs that capture descriptive statistics and data profiling metrics for a
given dataset and write the results to a MySQL table as JSON.
INPUT: SPARK SQL Query, Output Table Name, # of Columns Per Packet, Parallelism
OUTPUT: Job submission confirmation
Expected process:
This program should connect to the Spark SQL environment and execute the SQL query provided as input. The
result set should be passed to one of the scalable Python libraries (PySpark, Koalas, Spark MLlib, Pandas, etc.) to
calculate the basic descriptive statistics: Type, Count, Distinct Count, # NULLs, Min, Max, Median.
https://databricks.com/blog/2019/08/22/guest-blog-how-virgin-hyperloop-one-reduced-processing-time-from-hours-to-minutes-with-koalas.html
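The statistics listed above can be illustrated with a small pure-Python sketch. It is not the intended implementation, which would delegate to PySpark or Koalas. It only shows what one column's JSON object might contain. The function name `describe_column`, the JSON key names, the use of `None` for SQL NULL, and the use of Python type names in place of the column's SQL type are all assumptions.

```python
import json
from statistics import median


def describe_column(values):
    """Compute basic statistics for one column (hypothetical helper).

    `values` is a plain list; None stands in for SQL NULL.
    """
    non_null = [v for v in values if v is not None]

    # Assumption: report the Python type name in place of the SQL column type.
    type_names = sorted({type(v).__name__ for v in non_null})
    if not type_names:
        col_type = "unknown"
    elif len(type_names) == 1:
        col_type = type_names[0]
    else:
        col_type = "mixed"

    # Median is only meaningful for numeric data (bool is excluded on purpose).
    is_numeric = bool(non_null) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    )
    # Min/Max need values that can be compared with each other.
    comparable = is_numeric or len(type_names) == 1

    return {
        "type": col_type,
        "count": len(non_null),  # mirrors SQL COUNT(col): NULLs are not counted
        "distinct_count": len(set(non_null)),
        "null_count": len(values) - len(non_null),
        "min": min(non_null) if comparable else None,
        "max": max(non_null) if comparable else None,
        "median": median(non_null) if is_numeric else None,
    }


if __name__ == "__main__":
    # The resulting dict serializes directly to the JSON stored per field.
    print(json.dumps(describe_column([3, 1, None, 2, 3])))
```

The returned dictionary contains only JSON-serializable values, so it can be stored as-is in the JSON column of the output table.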
As and when the statistics calculation is completed for each packet, the JSON object should be updated
with statistics along with the STATUS column. Once all packets are processed, overall status column
should be updated with status “Completed”.
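The packet bookkeeping described above might look like the following in-memory sketch. The function names, the dictionary keys, and the choice of "In progress" as the initial status of every row are assumptions. A real implementation would apply the same transitions to the MySQL table.

```python
import json


def split_into_packets(columns, columns_per_packet):
    """Group column names into packets of at most `columns_per_packet`."""
    if columns_per_packet < 1:
        raise ValueError("columns_per_packet must be at least 1")
    return [
        columns[i:i + columns_per_packet]
        for i in range(0, len(columns), columns_per_packet)
    ]


def build_tracking_rows(packets):
    """Create one tracking row per column, all initially 'In progress'."""
    total = len(packets)
    return [
        {
            "current_packet": number,
            "total_packets": total,
            "field": column,
            "json": None,
            "status": "In progress",
            "overall_status": "In progress",
        }
        for number, packet in enumerate(packets, start=1)
        for column in packet
    ]


def complete_packet(rows, packet_number, stats_by_field):
    """Store the statistics for one packet and refresh the overall status."""
    for row in rows:
        if row["current_packet"] == packet_number:
            row["json"] = json.dumps(stats_by_field[row["field"]])
            row["status"] = "Completed"
    # The overall status flips only after every row has completed.
    if all(row["status"] == "Completed" for row in rows):
        for row in rows:
            row["overall_status"] = "Completed"
```

For example, three columns with two columns per packet yield two packets. The overall status stays "In progress" after the first packet finishes and changes to "Completed" only after the second.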
API 2 – Return the descriptive statistics process status, if completed – return result
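A status call of this kind could be sketched as below. The function name `get_job_status`, the shape of the returned dictionary, and the representation of table rows as dictionaries are assumptions for illustration. In practice the rows would be read from the MySQL output table.

```python
import json


def get_job_status(rows):
    """Return the job status, plus the per-field results once all rows are done.

    `rows` is a list of dicts with the keys "field", "json" and "status"
    (a hypothetical in-memory view of the output table).
    """
    total = len(rows)
    done = sum(1 for row in rows if row["status"] == "Completed")

    if total and done == total:
        # Every field is finished: decode the stored JSON and return it.
        return {
            "status": "Completed",
            "result": {row["field"]: json.loads(row["json"]) for row in rows},
        }

    # Otherwise report progress only; no partial result is returned.
    return {"status": "In progress", "completed_fields": done, "total_fields": total}
```

Returning a progress count while the job is running is a design choice of this sketch. The requirement only asks for the status and, once completed, the result.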
INPUT: SPARK SQL Query, Output Table Name, # of Columns Per Packet, Parallelism
OUTPUT: Job submission confirmation
Expected process:
This program should connect to the Spark SQL environment and execute the SQL query provided as
input. The result set should be passed to one of the scalable Python libraries (Koalas, Spark MLlib, etc.) to
calculate detailed data profiling metrics as shown in the link below (refer to the Overview, Variables,
Correlations, and Missing Values sections):
https://blog.usejournal.com/pandas-profiling-to-boost-exploratory-data-analysis-8e718238bcd1
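Two of the profiling metrics named above, missing values and correlations, can be illustrated with a short pure-Python sketch. The requirement does not say which correlation measure is intended, so Pearson correlation is assumed here. The function names are hypothetical, and `None` stands in for SQL NULL. A scalable implementation would compute these metrics in Spark rather than on Python lists.

```python
from math import sqrt


def missing_ratio(values):
    """Fraction of entries that are missing (None stands in for NULL)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def pearson(xs, ys):
    """Pearson correlation over rows where both values are present.

    Returns None when fewer than two complete pairs exist or when
    either column is constant (the correlation is undefined).
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return None

    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n

    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)
```

Dropping rows where either value is missing before correlating is one possible policy. Whether that matches the intended profiling report is an open question for the specification.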
CURRENT_PACKET  TOTAL_PACKETS  FIELD  JSON         STATUS       OVERALLSTATUS
1               5              COL1   JSON_OBJECT  Completed    In progress
1               5              COL2   JSON_OBJECT  Completed    In progress
1               5              COL3   JSON_OBJECT  Completed    In progress
1               5              COL4   JSON_OBJECT  Completed    In progress
….              ….             ….     ….           ….           In progress
2               5              COL21  JSON_OBJECT  Completed    In progress
2               5              COL22  JSON_OBJECT  Completed    In progress
2               5              COL23  JSON_OBJECT  Completed    In progress
2               5              COL24  JSON_OBJECT  Completed    In progress
….              ….             ….     ….           ….           In progress
3               5              COL31  JSON_OBJECT  In progress  In progress
This process should create a table and insert entries for all the columns, as shown in the table above. The
descriptive statistics process should run in parallel, at the specified degree of parallelism.
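Creating and seeding the output table could be sketched as follows. To keep the example runnable without a database server, Python's built-in `sqlite3` module stands in for MySQL. The column types, the initial "In progress" status, and the function name `create_tracking_table` are assumptions. A MySQL version would use a MySQL driver and could declare the JSON column with MySQL's native JSON type.

```python
import sqlite3


def create_tracking_table(conn, table_name, columns, columns_per_packet):
    """Create the output table and insert one row per column to be profiled.

    Returns the total number of packets.
    """
    # The table name is interpolated into SQL, so accept plain identifiers only.
    if not table_name.isidentifier():
        raise ValueError("table_name must be a plain identifier")
    if columns_per_packet < 1:
        raise ValueError("columns_per_packet must be at least 1")

    packets = [
        columns[i:i + columns_per_packet]
        for i in range(0, len(columns), columns_per_packet)
    ]
    total = len(packets)

    conn.execute(
        f"CREATE TABLE {table_name} ("
        "CURRENT_PACKET INTEGER, TOTAL_PACKETS INTEGER, FIELD TEXT, "
        '"JSON" TEXT, STATUS TEXT, OVERALLSTATUS TEXT)'
    )
    rows = [
        (number, total, column, None, "In progress", "In progress")
        for number, packet in enumerate(packets, start=1)
        for column in packet
    ]
    conn.executemany(f"INSERT INTO {table_name} VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return total


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    create_tracking_table(connection, "profile_out", ["COL1", "COL2", "COL3"], 2)
    for row in connection.execute("SELECT * FROM profile_out"):
        print(row)
```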
As and when the profiling metrics calculation is completed for each packet, the JSON object should be updated
with the metrics along with the STATUS column. Once all packets are processed, the overall status column should be
updated with status “Completed”.
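Running packets in parallel and recording each packet's status as it finishes might be sketched with a thread pool, as below. The function name `run_packets`, the `profile_packet` callable, and the in-memory `state` dictionary are assumptions. A real job would write the same updates to the MySQL table and would more likely rely on Spark for the heavy computation.

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_packets(packets, profile_packet, parallelism):
    """Profile packets concurrently and track their status.

    packets:        list of lists of column names
    profile_packet: callable taking a list of columns and returning
                    a dict of {column: metrics} (hypothetical)
    parallelism:    maximum number of packets processed at once
    """
    state = {
        "packets": {
            number: {"status": "In progress", "json": None}
            for number in range(1, len(packets) + 1)
        },
        "overall_status": "In progress",
    }

    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        futures = {
            pool.submit(profile_packet, columns): number
            for number, columns in enumerate(packets, start=1)
        }
        # Update each packet as soon as it finishes, in completion order.
        for future in as_completed(futures):
            number = futures[future]
            metrics = future.result()  # re-raises any error from the worker
            state["packets"][number] = {
                "status": "Completed",
                "json": json.dumps(metrics),
            }

    # Reached only if every packet finished without raising.
    state["overall_status"] = "Completed"
    return state
```

Status updates happen in the calling thread, inside the `as_completed` loop, so no lock is needed around `state`. If any packet raises, the exception propagates and the overall status is never set to "Completed".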
API 4 – Return the descriptive statistics process status, if completed – return result