
Vipul Sinha

Email :

Sr. Big Data/Hadoop Developer


8+ years of professional experience involving project development, implementation, deployment and maintenance using Java/J2EE and Big Data related technologies.
Hadoop Developer with 4+ years of working experience in designing and
implementing complete end-to-end Hadoop based data analytical solutions using
HDFS, MapReduce, Spark, Yarn, Kafka, PIG, HIVE, Sqoop, Storm, Flume, Oozie,
Impala, HBase etc.
Good experience in creating data ingestion pipelines, data transformations, data
management, data governance and real time streaming at an enterprise level.
Experience in importing and exporting different formats of data into HDFS, HBASE
from different RDBMS databases and vice versa using Sqoop.
Exposure to the Cloudera development environment and cluster management using Cloudera Manager.
Experience in analyzing data using HiveQL, Pig Latin, Hbase, Mongo and custom
MapReduce programs in Java.
Experience in extending Hive and Pig core functionality by writing custom UDFs
using Java.
Developed analytical components using Spark and Spark Streaming.
Background with traditional databases such as Oracle, SQL Server, MySQL.
Good knowledge and Hands-on experience in storing, processing unstructured data
using NOSQL databases like HBase and MongoDB.
Good knowledge in distributed coordination system ZooKeeper and experience
with Data Warehousing and ETL.
Hands on experience in setting up workflow using Apache Oozie workflow engine
for managing and scheduling Hadoop jobs.
Experience in creating complex SQL queries and SQL tuning, and writing PL/SQL blocks such as stored procedures, functions, cursors, indexes, triggers and packages.
Good knowledge of database connectivity (JDBC) for databases like Oracle, DB2,
SQL Server, MySQL, NoSQL, MS Access.
Profound experience in creating real time data streaming solutions using Apache
Spark/Spark Streaming, Kafka.
Worked on a prototype Apache Spark Streaming project and converted our existing Java Storm topology.
Proficient in visualizing data using Tableau, QlikView, MicroStrategy and MS Excel.
Experience in developing ETL scripts for data acquisition and transformation using
Informatica and Talend.
Used Maven extensively for building MapReduce jar files and deployed them to Amazon Web Services (AWS) on EC2 virtual servers in the cloud; experience integrating build scripts with continuous integration systems like Jenkins.
Experienced in Java Application Development, Client/Server Applications,
Internet/Intranet based applications using Core Java, J2EE patterns, Spring,

Hibernate, Struts, JMS, Web Services (SOAP/REST), Oracle, SQL Server and other
relational databases.
Experience writing Shell scripts in Linux OS and integrating them with other applications.
Experienced in using agile methodologies including extreme programming, SCRUM
and Test Driven Development (TDD).
Experienced in creating and analyzing Software Requirement Specifications (SRS) and Functional Specification Documents (FSD). Strong knowledge of the Software Development Life Cycle (SDLC).
Devoted to professionalism; highly organized; able to work under strict deadline schedules with attention to detail; possess excellent written and verbal communication skills.

Technical Expertise:
Hadoop/Big Data: HDFS, MapReduce, Spark, Yarn, Kafka, Pig, Hive, Sqoop, Storm, Flume, Oozie, Impala, Hue, Zookeeper
Programming Languages: C, Java, PL/SQL, Pig Latin, Python, HiveQL, Scala
Java/J2EE & Web Technologies: AngularJS, SOAP, AJAX, JavaScript, jQuery
Methodologies: Agile/Scrum, UML, Rational Unified Process and Waterfall
Development Tools: Eclipse, NetBeans, SVN, GIT, Ant, Maven, SOAP UI, JMX Explorer, XMLSpy, QC, QTP, JIRA, SQL Developer, TOAD
NoSQL Technologies: Cassandra, MongoDB, HBase
Frameworks: Struts, Hibernate and Spring MVC
Script Languages: Unix Shell Script, Perl
Distributed Platforms: Hortonworks, Cloudera, MapR
Databases: Oracle 11G/12C, MySQL, SQL Server, Teradata, IBM DB2
Operating Systems: Windows XP/Vista/7/8/10, Unix, Linux
Software Packages: MS Office 2007/2010/2016
Web App. Servers: WebLogic, WebSphere, Apache Tomcat
Reporting Tools: Tableau, QlikView, MicroStrategy and MS Excel
Web Technologies: HTML, XML, CSS, JavaScript, jQuery, AJAX
Version Control: SVN, GIT



Feb 2015 to

American Family Insurance is one of the largest insurance companies in the United States; the focus is on optimizing processes for insurance policies. This project aims to create insightful data needed for population-focus analysis. The complete data relevant to policy holders is collected and aggregated into a rich record under a single key, and the aggregated data is then analyzed to provide insights for predicting auto claims, projecting growth in premiums, and creating relevant procedures to be performed on insurance policies and claims.

Coordinated with business customers to gather business requirements and interacted with technical peers to derive technical requirements.
Developed MapReduce, Pig and Hive scripts to cleanse, validate and transform data.
Implemented MapReduce programs to handle semi-structured and unstructured data such as XML, JSON and Avro data files, and sequence files for log files.
Developed Sqoop jobs to import and store massive volumes of data in HDFS.
Designed and developed PIG data transformation scripts to work against
unstructured data from various data points and created a base line.
Experienced in implementing Spark RDD transformations, actions to implement
business analysis.
Designed Data Quality Framework to perform schema validation and data profiling
on Spark (pySpark).
Experience in creating real time data streaming solutions using Apache
Spark/Spark Streaming, Kafka.
Leveraged spark (pySpark) to manipulate unstructured data and apply text mining
on user's table utilization data.
Worked on creating and optimizing Hive scripts for data analysts based on their requirements.
Created Hive UDFs to encapsulate complex and reusable logic for the end users.
Developed predictive analytics using Apache Spark Scala APIs.
Experienced in migrating HiveQL into Impala to minimize query response time.
Designed an agent-based computational framework based on Scala, Breeze to
scale computations for many simultaneous users in real-time.
Orchestrated Oozie workflow engine to run multiple Hive and Pig jobs.
Experienced with different compression techniques such as LZO, Snappy, Bzip2 and Gzip to save space and optimize data transfer over the network, using the Avro, Parquet and ORC file formats.
Configured, deployed and maintained multi-node Dev and Test Kafka clusters.
Implemented data ingestion systems by creating Kafka brokers, Java producers, consumers and custom encoders.
Implemented Partitioning, Dynamic Partitions and Bucketing in Hive for efficient
data access.
Developed Spark code using Scala and Spark SQL/Streaming for faster testing and processing of data.
Developed a data pipeline using Kafka and Storm to store data in HDFS.
Developed some utility helper classes to get data from HBase tables.
Good experience in troubleshooting performance issues and tuning Hadoop cluster.
Knowledge of Spark Core, Streaming, DataFrames and SQL, MLlib and GraphX.
Implemented caching of Spark transformations and actions to reuse intermediate results.

Extracted files from Cassandra through Sqoop and placed in HDFS and processed.
Used maven to build and deploy the Jars for MapReduce, Pig and Hive UDFs.
Developed workflows in Oozie.
Extensively used the Hue browser for interacting with Hadoop components.
Monitored workload, job performance and capacity planning using Cloudera Manager.
Worked on Amazon Web Services.
Cluster coordination services through Zookeeper.
Involved in agile methodologies, daily scrum meetings and sprint planning.
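The Spark RDD transformations and actions described above follow a common pattern: narrow transformations followed by a keyed aggregation. A minimal pure-Python sketch of that flow, with hypothetical (state, premium) records standing in for the real policy data and `reduce_by_key` imitating Spark's `reduceByKey`:

```python
from collections import defaultdict
from functools import reduce

# Hypothetical (state, premium) records standing in for the policy data.
records = [("WI", 1200.0), ("NJ", 800.0), ("WI", 300.0), ("NJ", 200.0), ("OH", 500.0)]

def reduce_by_key(pairs, fn):
    """Group pairs by key and fold each group's values, like RDD.reduceByKey."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(fn, values) for key, values in groups.items()}

# "Transformations" (lazy in Spark, eager here): filter out small premiums,
# then total the remaining premiums per state.
filtered = [(state, amt) for state, amt in records if amt >= 300.0]
totals = reduce_by_key(filtered, lambda a, b: a + b)
print(totals)  # {'WI': 1500.0, 'NJ': 800.0, 'OH': 500.0}
```

In real Spark the filter is lazy and nothing runs until an action (here, collecting the totals) forces the pipeline; caching the filtered set, as mentioned above, avoids recomputing it for each subsequent action.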

Environment: Linux (CentOS, RedHat), UNIX Shell, Pig, Hive, MapReduce, YARN, Spark 1.4.1, Eclipse, Core Java, JDK 1.7, Oozie workflows, AWS, S3, EMR, ETL, Cloudera, Talend, HBase, Sqoop, Scala, Kafka, Go-lang, Python, Cassandra, Maven, Hortonworks

CHASE BANK, Jersey City, NJ

Jan 2014 to Jan

Chase provides innovative payment, travel and expense management solutions for
individuals and businesses of all sizes. Purpose of the project is to create Enterprise Data
Hub so that various business units can use the data from Hadoop to do Data Analytics.
This application is used to provide analytics to clients based on their social media data
from different sources like Facebook, Twitter, blogs and websites. Internally this is used
for Customer churn analysis, Sentiment analysis and Customer experience analytics.


Wrote MapReduce code that takes a customer-related flat file as input and parses the data to extract meaningful (domain-specific) information for further analysis.
Extensively worked on combiners, partitioning and the distributed cache to improve the performance of MapReduce jobs.
Experience using SequenceFile, ORC and Avro file formats.
Uploaded and processed more than 30 terabytes of data from various structured
and unstructured sources into HDFS.
Created Hive External tables with partitioning to store the processed data from
Map Reduce.
Implemented Hive optimized joins to gather data from different sources and run
ad-hoc queries on top of them.
Experience in creating real time data streaming solutions using Apache Spark/Spark Streaming, Kafka.
Used Pig to do data transformations, event joins, filter and some pre-aggregations
before storing the data into HDFS.
Developed Pig Latin scripts to extract the data from the web server output files to
load into HDFS.

Developed workflow in Oozie to automate the tasks of loading the data into HDFS
and pre-processing with Pig.
Optimized Pig jobs by using different compression techniques and performance tuning.
Worked with Cassandra and utilized NoSQL for non-relational data storage and retrieval.
Imported and exported data between relational data stores, MongoDB and HDFS using Sqoop.
Performed various performance optimizations like using distributed cache for small
datasets, Partition and Bucketing in hive and Map Side Joins.
Wrote Hive Generic UDF's to perform business logic operations at record level.
Used Flume to collect, aggregate, and store the web log data from different
sources like web servers, mobile devices and pushed to HDFS.
Involved in troubleshooting errors in Shell, Hive and Map Reduce.
Worked on debugging, performance tuning of Hive & Pig Jobs.
Responsible for importing and exporting data from HDFS to MySQL database and
vice-versa using Sqoop.
Involved in admin related issues of HBase and other NoSQL databases.
Monitored and debugged Hadoop jobs/applications running in production.
Implemented a 100-node CDH4 Hadoop cluster on Red Hat Linux using Cloudera Manager.
Queried indexed data for analytics using Apache Solr.
Exported the analyzed data to the relational databases using Sqoop for Tableau
visualization and to generate reports for the BI team.
Automated workflow using Shell Scripts.
Responsible for managing and reviewing Hadoop log files.
Used ZooKeeper for enabling synchronization across the cluster.
Performed both major and minor upgrades to the existing cluster, and commissioned and decommissioned nodes to balance load across the cluster.
Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
Good working experience with Agile/Scrum methodologies, including daily scrum calls with the client to discuss project analysis specs and development aspects.
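The combiner work described above rests on one idea: pre-aggregating on the map side so far fewer (key, value) pairs cross the shuffle. A toy pure-Python word count (hypothetical input lines; real Hadoop combiners run inside the framework) showing where the combiner sits between map and reduce:

```python
from collections import defaultdict

def mapper(line):
    # Emit (word, 1) for every token, as a MapReduce mapper would.
    return [(w.lower(), 1) for w in line.split()]

def combiner(pairs):
    # Map-side pre-aggregation: collapses repeated keys in one split
    # before the shuffle, which is what shrinks network traffic.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())

def reducer(all_pairs):
    # Final aggregation across all splits after the simulated shuffle.
    counts = defaultdict(int)
    for word, n in all_pairs:
        counts[word] += n
    return dict(counts)

lines = ["churn churn analysis", "sentiment analysis"]
map_out = [combiner(mapper(line)) for line in lines]      # per-split combine
shuffled = [pair for split in map_out for pair in split]  # simulated shuffle
result = reducer(shuffled)
print(result)  # {'churn': 2, 'analysis': 2, 'sentiment': 1}
```

Note the combiner must be associative and commutative (as summation is) so that applying it zero, one, or many times per split gives the same final reducer output.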

Environment: Hadoop, Java 1.7, UNIX, Shell Scripting, HDFS, Talend, HBase, AWS, NoSQL, MapReduce, YARN, Hive, Pig, Oracle, MongoDB, Kafka, ETL, Go-lang, Zookeeper, Sqoop.

MetLife, Somerset NJ

Sep 2012 to Dec

Used Hadoop to take advantage of available data to make better decisions that
significantly enhanced organizational success.


Developed multiple MapReduce jobs in Pig and Hive for data cleaning and preprocessing.
Extensively involved in Design phase and delivered Design documents.
Imported and exported data into HDFS and Hive using Sqoop.
Wrote MapReduce jobs to standardize and clean the data and calculate metrics.
Created custom MapReduce programs to analyze data and used Pig Latin to clean unwanted data.
Wrote Hive jobs to parse logs and structure them in tabular format to facilitate effective querying of the log data.
Involved in creating Hive tables, loading them with data and writing Hive queries that run internally as MapReduce jobs.
Used Hive to analyze the partitioned and bucketed data and compute various
metrics for reporting.
Experienced in managing and reviewing the Hadoop log files.
Loaded and transformed large sets of structured and semi-structured data.
Installed the Oozie workflow engine to run multiple Hive and Pig jobs independently, based on time and data availability.
Used Zookeeper for providing coordinating services to the cluster.
Involved in Unit testing and delivered Unit test plans and results documents.
Exported data from HDFS environment into RDBMS using Sqoop for report
generation and visualization purpose.
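The partitioned and bucketed Hive tables analyzed above rely on a simple rule: a row lands in bucket hash(clustered-by column) mod the bucket count, so equal keys always share a bucket. The sketch below uses a simplified character-sum hash (not Hive's actual hash function), so bucket numbers here will not match a real Hive table:

```python
# Bucketing sketch: Hive places each row in hash(key) % NUM_BUCKETS.
NUM_BUCKETS = 4

def bucket_for(key: str) -> int:
    # Deterministic toy hash: sum of character codes, mod bucket count.
    # (Hive uses its own hash; this is illustration only.)
    return sum(ord(c) for c in key) % NUM_BUCKETS

rows = ["cust-001", "cust-002", "cust-003", "cust-004"]
buckets = {}
for row_key in rows:
    buckets.setdefault(bucket_for(row_key), []).append(row_key)

# Equal keys always hash to the same bucket, which is what makes
# bucket map joins and table sampling possible.
assert bucket_for("cust-001") == bucket_for("cust-001")
```

Because both sides of a join bucketed on the same column with the same bucket count align bucket-for-bucket, Hive can join them without a full shuffle.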

Environment: Apache Hadoop, MapReduce, Talend, HDFS, AWS, Hive, Pig, ETL, Sqoop, HBase, Zookeeper, UNIX shell scripting, Eclipse.
PNC Bank (Columbus, OH)
Aug 2012





Involved in business requirements gathering, design, development and unit testing of the Bill Pay Account Accelerator (BPAA) & Alphanumeric ID projects.
Involved in maintenance & development of related web sites like PNC Virtual Wallet, Wealth Management & Mutual Funds.
Responsible for developing use cases, class and sequence diagram for the modules
using UML and Rational Rose.
Involved in preparation of docs like Functional Specification document and
Deployment Instruction documents.
Set up the deployment environment on WebSphere 6.1. Developed system preference UI screens using JSP 2.0 and HTML.
Used JavaScript for client-side validations.
Coded and unit tested according to client standards; provided production support and quickly resolved issues until Integration Test passed.
Fix defects as needed during the QA phase, support QA testing, troubleshoot
defects and identify the source of defects.

Used JMS for point-to-point asynchronous messaging for high-volume banking transactions.
Involved in preparing unit and system test cases and testing the module in three phases: unit, system and regression testing.
Involved in writing shell scripts, Ant scripts for Unix OS for application deployments
on production region.
Developed core banking business components as Web Services for the enterprise-wide SOA architecture strategy.
Used Rational Clear Case as source control management system.
Implemented SOA architecture with web services using SOAP, WSDL and UDDI.
Involved in deployments in all environments: Dev, Test, UAT and Prod.
Involved in designing the Credit Card Service layer on the mainframe with MQ Series and WBI.
Provide XML based messaging service to front-end applications.
Extensively used IBM RAD 7.1 IDE for building, testing, and deploying applications.
Worked with Single Sign-On (SSO) using SAML for retrieving data from third party
applications like Yodlee.

Environment: Java (JDK 1.5), J2EE, WebSphere 6.1, IBM RAD 7.5, Rational ClearCase 7.0, XML, JAXP, XSL, XSLT, XML Schema (XSD), WSDL 2.0, SAML 2.0, ETL, AJAX 1.0, Web Services, SOA, JSP 2.2, CSS, Servlets, JProfiler, Struts 2.0, Spring, Rational HATS, JavaScript, JCF, HTML, IBM DB2, JMS, AXIS 2, Swing, MQ, open-source technologies (Ant, Log4j and JUnit), Oracle 10g, UNIX.
ICICI Bank, Hyderabad,
Sep 2011



Project Name: Online Banking System

The Online Billing System (OBS) project provides enhancements to the existing online banking application. OBS provides many online features such as account information, balance display, transaction history and other bank transactions. OBS displays monthly statements for up to the last 12 months in selected formats such as PDF and HTML, available for download. OBS also collects payment information and handles online order processing and payment processing.

Assisted in designing and programming for the system, including development of the Process Flow Diagram, Entity Relationship Diagram, Data Flow Diagram and Database Design.
Involved in the Transactions, Login and Reporting modules and customized report generation using controllers; tested and debugged the whole project for proper functionality and documented the modules developed.
Designed front-end components using JSF.
Involved in developing Java APIs, which communicates with the Java Beans.

Implemented MVC architecture using Java, custom tag libraries and JSTL.
Involved in development of POJO classes and writing Hibernate query language
(HQL) queries.
Implemented MVC architecture and DAO design pattern for maximum abstraction
of the application and code reusability.
Created Stored Procedures using SQL/PL-SQL for data modification.
Used XML and XSL for data presentation, report generation and customer feedback. Used Java Beans to automate the generation of dynamic reports.
Developed JUnit test cases for regression testing and integrated with ANT build.
Implemented Logging framework using Log4J.
Involved in code review and documentation review of technical artifacts.
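The DAO design pattern mentioned above keeps callers on an abstract interface so the storage layer can be swapped without touching business code. The project itself used Java with Hibernate; the hedged sketch below uses Python only for brevity, and all names are hypothetical:

```python
from abc import ABC, abstractmethod

# DAO pattern sketch: controllers depend on the abstract interface, not on
# the storage technology, so the backing store can change independently.
class AccountDao(ABC):
    @abstractmethod
    def find_balance(self, account_id: str) -> float: ...

class InMemoryAccountDao(AccountDao):
    """Stand-in for a Hibernate-backed DAO; data is hypothetical."""
    def __init__(self):
        self._rows = {"ACC-1": 2500.0}

    def find_balance(self, account_id: str) -> float:
        return self._rows[account_id]

def show_balance(dao: AccountDao, account_id: str) -> str:
    # Controller/service code only ever sees the AccountDao interface.
    return f"Balance: {dao.find_balance(account_id):.2f}"

print(show_balance(InMemoryAccountDao(), "ACC-1"))  # Balance: 2500.00
```

Swapping `InMemoryAccountDao` for a database-backed implementation requires no change to `show_balance`, which is the abstraction and reuse benefit the bullet describes.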

Environment: J2EE/Java, JSP, Servlets, JSF, Hibernate, Spring, JavaBeans, XML, XSL,
HTML, DHTML, JavaScript, CVS, JDBC, Log4J, Oracle 9i, IBM WebSphere Application Server