
SQL to Hadoop and back again, Part 1: Basic data interchange techniques

Martin C. Brown (mc@mcslp.com), Director of Documentation
24 September 2013

In this series of articles, we'll look at a range of different methods for integration between Apache Hadoop and traditional SQL databases, including simple data exchange methods, live data sharing and exchange between the two systems, and the use of SQL-based layers on top of Apache Hadoop, including HBase and Hive, to act as the method of integration. Here in Part 1, we examine some of the basic architectural aspects of exchanging information and the basic techniques for performing data interchange. View more content in this series

Big data and SQL


"Big data" is a term that has been used regularly now for almost a decade. And it, along with technologies such as NoSQL, are seen as the replacements for the long successful RDBMS solutions that use SQL. Today, DB2, Oracle, Microsoft SQL Server MySQL, and PostgreSQL dominate the SQL space and still make up a considerable proportion of the overall market. Big data and the database systems and services that go along with it have become additional cogs in the gears of modern systems. But how do you integrate your existing SQL-based data stores with Hadoop so you can take advantage of the different technologies when you need them? Let's examine the basic architectural aspects of exchanging information and basic techniques for performing data interchange.

Data and querying considerations


The most important consideration when exchanging information between SQL and Hadoop is the data format of the information. The format should be driven entirely from the perspective of the information and the reason it is being exported. Simply exporting your data and then importing it into Hadoop doesn't solve any problems. You need to know exactly what you are importing, why you are importing it, and what you expect to get out of the process. Before we look at the specifics of why you are exchanging the data to begin with, first consider the nature of the data exchange. Is it one-way? Or is it two-way?

One-way data exchange, either from SQL to Hadoop or from Hadoop to SQL, is practical in situations where the data is being transported to take advantage of the query functionality of the destination, and the source is not the companion database solution. For example, pure textual data, or the raw results of a computational or analysis program, might be stored in Hadoop, processed with MapReduce, and then stored in SQL (see Figure 1).

Figure 1. Hadoop to SQL translations

The reverse, where information is extracted from SQL into Hadoop, is less common, but it can be used to process SQL-based systems that hold a lot of textual content, such as blogs, forums, CRM, and other systems (see Figure 2).

Figure 2. SQL to Hadoop translations

Two-way data exchange is more common and provides the best of both worlds in terms of the data exchange and data processing (see Figure 3).

Figure 3. Bidirectional translations

Although there are many examples, the most common is to take large, linear datasets and text datasets from the SQL database and convert them into summarized information that can be processed by a Hadoop cluster. The summarized information can then be imported back into your SQL store. This is particularly useful where the large dataset would take too long to process within an SQL query. An example would be a large corpus of review scores or word/term counts.

IBM InfoSphere BigInsights


InfoSphere BigInsights makes integrating Hadoop and SQL databases much simpler, since it provides the necessary tools and mechanics to export and import data between different databases. Using InfoSphere BigInsights, you can define database sources, views, queries, and other selection criteria, and then automatically convert that selection into a variety of formats before importing the collection directly into Hadoop (see Resources for more information).


For example, you can create a query that extracts the data and populates a JSON array with the record data. Once exported, a job can be created to process and crunch the data before either displaying it or exporting the processed data back into DB2.

Download InfoSphere BigInsights Quick Start Edition, a complimentary, downloadable version of InfoSphere BigInsights. Using Quick Start Edition, you can try out the features that IBM has built to extend the value of open source Hadoop, like Big SQL, text analytics, and BigSheets.

As a rule, there are three primary reasons for interfacing between SQL and Hadoop:

1. Exporting for storage. Hadoop provides a practical solution for storing large amounts of infrequently used data in a format that can be queried, processed, and extracted. For example, usage logs, access logs, and error information are all practical candidates for insertion into a Hadoop cluster to take advantage of the HDFS architecture. A secondary feature of this type of export is that the information can later be processed or parsed and converted into a format that can be used again.

2. Exporting for analysis. Two common cases are exporting for reimport into SQL and exporting the analysis output to be used directly in your application (analyzing and storing the result as JSON, for example). Hadoop provides the advantage here by allowing for distributed, large-scale processing of information rather than the single-host table processing provided in SQL. With the analysis route, the original information is generally kept, and the analysis process is used to provide summary or statistical bases that work alongside the original data.

3. Exporting for processing. Processing-based exports are designed to take the original raw source information, process and reduce or simplify it, then store that information back to replace the original data. This type of exchange is most commonly used where the source information has been captured but the raw original information is no longer required. For example, logging data of various forms can easily be resolved into a simpler structure, either by looking for specific event types or by summarizing the data to counts of specific errors or occurrences. The raw data is often not required here; reducing that data through Hadoop and loading the summary stats back saves time and makes the content easier to query.

With these basic principles in mind, let's look at the techniques for a basic data exchange between SQL and Hadoop.

Exporting data from SQL


When exporting from SQL, the biggest consideration is the format of the information that you generate. Because Hadoop is not a tabular database, it makes sense to choose a flexible format for data that will be processed in Hadoop. One option is the CSV format if you want to work with pure tabular information, but you can also use raw text with suitable separators or identifiers. For complex structures, it can make more sense to output information in a structure that allows for easy separation and distribution. An example would be generating record data as JSON and exporting blocks of data, for example 10,000 records per file. Using a flexible encapsulation format like JSON solves many of the data interchange headaches.


Using a standard dump or query export


Most SQL databases and interfaces have a method for exporting data in specific formats. For example, within MySQL you can create a CSV file using the command line, as shown in Listing 1.

Listing 1. Creating a CSV file using the command line


SELECT title, subtitle, servings, description
  INTO OUTFILE 'result.csv'
  FIELDS TERMINATED BY ','
  FROM recipes t;

In DB2, a similar solution exists (see Listing 2).

Listing 2. Creating a CSV file in DB2


EXPORT TO result.csv OF DEL MODIFIED BY NOCHARDEL SELECT title, subtitle, servings, description FROM recipes;

The resulting file can be loaded straight into Hadoop through HDFS. Generating the same output with a simple script in Perl, Python, or Ruby is as straightforward.
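As an illustration of that scripted route, the short Python sketch below produces the same CSV output from a client-side script. It is not part of the original toolchain: the mysql-connector-python driver, the connection details, and the output file name are all assumptions here, so adapt them to your own schema and environment.

# export_recipes.py: a minimal sketch of a client-side CSV export.
# Assumes the mysql-connector-python package and the recipes table used
# in the listings above; host, user, and password are placeholders.
import csv

import mysql.connector

conn = mysql.connector.connect(
    host="192.168.0.240",      # placeholder: your database host
    user="cheffy",             # placeholder: your database user
    password="secret",         # placeholder: your password
    database="cheffy",
)

cursor = conn.cursor()
cursor.execute("SELECT title, subtitle, servings, description FROM recipes")

with open("result.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for row in cursor:
        writer.writerow(row)   # one CSV line per record

cursor.close()
conn.close()

Run the script wherever the MySQL client libraries are available, then copy the resulting file into HDFS in the same way as shown in Listing 5.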

Writing a custom program


Depending upon the dataset, using a custom application to export the data may be more practical. This is true particularly with structured data where the information you want to output is based on the content of multiple tables and structures. In general, the easiest method is to take your structured data, agree on an output format or structure (so it can be parsed within Hadoop), and then dump that information out. For example, when processing recipe data to look for common themes and threads, you can use the internal tool to load the recipe record, include the ingredients, description, method, and other data, then use the constructed recipe object to output the information for processing in Hadoop, storing each recipe as a JSON object (see Listing 3).

Listing 3. Exporting complex data


use JSON;
use Foodware;
use Foodware::Public;
use Foodware::Recipe;

my $fw = Foodware->new();
my $recipes = $fw->{_dbh}->get_generic_multi('recipe','recipeid', { active => 1});
my $js = new JSON;

foreach my $recipeid (keys %{$recipes}) {
    my $recipe = new Foodware::Recipe($fw,$recipeid,{ measgroup => 'Metric',
                                                      tempgroup => 'C',});

    my $id = $recipe->{title};
    $id =~ s/[ ',\(\)]//g;

    my $record = {
        _id => $id,
        title => $recipe->{title},
        subtitle => $recipe->{subtitle},
        servings => $recipe->{servings},
        cooktime => $recipe->{metadata_bytag}->{totalcooktime},
        preptime => $recipe->{metadata_bytag}->{totalpreptime},
        totaltime => $recipe->{metadata_bytag}->{totaltime},
        keywords => [keys %{$recipe->{keywordbytext}} ],
        method => $recipe->{method},
        ingredients => $recipe->{ingredients},
        comments => $recipe->{comments},
    };

    foreach my $ingred (@{$recipe->{ingredients}}) {
        push(@{$record->{ingredients}},
             {
                 meastext => $ingred->{'measuretext'},
                 ingredient => $ingred->{'ingredonly'},
                 ingredtext => $ingred->{'ingredtext'},
             });
    }

    print to_json($record),"\n";
}

The data is exported to a file that contains the recipe data (see Listing 4).

Listing 4. File containing the recipe data


{ "_id" : "WarmpotatotunaandCheshiresalad", "comments" : null, "preptime" : "20", "servings" : "4", "keywords" : [ "diet@wheat-free", "diet@peanut-free", "diet@corn-free", "diet@citrus-free", "meal type@salads", "diet@shellfish-free", "main ingredient@fish", "diet@demi-veg", "convenience@add bread for complete meal", "diet@gluten-free" ], "subtitle" : "A change from cold salads...", "totaltime" : "35", "cooktime" : "15", "ingredients" : [ { "scaled_fromqty" : 100, "_error_ingredid" : 1, ... } ] }

The result can be loaded directly into HDFS and processed by a suitable MapReduce job to extract the information required. One benefit of this structured approach is that it enables you to perform any required preprocessing on the output, including structuring the information in a format you can use within your Hadoop MapReduce infrastructure.

The phrase "importing into Hadoop" really means you simply need to copy the information into HDFS for it to be available (see Listing 5).

Listing 5. Copying the information into HDFS


$ hdfs dfs -mkdir recipes
$ hdfs dfs -copyFromLocal recipes.json recipes

Once the files are copied in, they can be used by your Hadoop MapReduce jobs as required. For better flexibility within HDFS, the output can be chunked into multiple files, and those files can be loaded. Depending upon your use case and processing requirements, extracting the data into individual files (one per notional record) may be more efficient for the distributed processing.
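To make the MapReduce step concrete, here is a minimal Hadoop Streaming sketch in Python that counts keyword occurrences across the exported recipe records. It is not part of the original article's code: it assumes one JSON record per input line (as produced by the export in Listing 3) and hypothetical script names mapper.py and reducer.py, so treat it as an outline rather than a finished job.

#!/usr/bin/env python
# mapper.py: a minimal Hadoop Streaming mapper sketch.
# Assumes one JSON recipe record per input line, as produced by Listing 3.
import json
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        record = json.loads(line)
    except ValueError:
        continue                      # skip malformed records
    for keyword in record.get("keywords") or []:
        print("%s\t1" % keyword)      # emit each keyword with a count of 1

The matching reducer simply sums the counts for each keyword:

#!/usr/bin/env python
# reducer.py: sums the counts for each keyword emitted by the mapper.
import sys

current_key = None
current_count = 0

for line in sys.stdin:
    line = line.rstrip("\n")
    if not line:
        continue
    key, _, value = line.partition("\t")
    if key == current_key:
        current_count += int(value)
    else:
        if current_key is not None:
            print("%s\t%d" % (current_key, current_count))
        current_key = key
        current_count = int(value)

if current_key is not None:
    print("%s\t%d" % (current_key, current_count))

A job like this would typically be submitted with the hadoop jar command for your distribution's streaming JAR, passing -files to ship the scripts along with -input recipes, -output recipe_keywords, -mapper mapper.py, and -reducer reducer.py; the exact JAR path and options vary between Hadoop releases, so check your distribution's documentation.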

Using Sqoop to move data


Sqoop is an additional tool for Hadoop that connects to an existing database using a JDBC driver and imports tables or databases from the source JDBC connection directly into HDFS. For the vast majority of imports where raw data from the SQL tables is being imported into Hadoop verbatim without processing, Sqoop offers the simplest and most efficient process for moving the data. For example, all of the tables within a single database can be loaded using Listing 6.

Listing 6. Loading all tables within a single database


$ sqoop import-all-tables --connect jdbc:mysql://192.168.0.240/cheffy --username=cheffy

For those drivers that support it, use the --direct option to directly read the data and then write it into HDFS. The process is much faster, as it requires no intervening files. When loading data in this way, directories are created within HDFS according to the table names. For example, within the recipe data set is the access log information in the access_log table, and the imported data is written into text files within the access_log directory (see Listing 7).

Listing 7. Viewing imported data from Sqoop


$ hdfs dfs -ls access_log
Found 6 items
-rw-r--r--   3 cloudera cloudera        0 2013-08-15 09:37 access_log/_SUCCESS
drwxr-xr-x   - cloudera cloudera        0 2013-08-15 09:37 access_log/_logs
-rw-r--r--   3 cloudera cloudera 36313694 2013-08-15 09:37 access_log/part-m-00000
-rw-r--r--   3 cloudera cloudera 36442312 2013-08-15 09:37 access_log/part-m-00001
-rw-r--r--   3 cloudera cloudera 36797470 2013-08-15 09:37 access_log/part-m-00002
-rw-r--r--   3 cloudera cloudera 36321038 2013-08-15 09:37 access_log/part-m-00003

By default, the files are split into approximately 30MB blocks, and the data is separated by commas (see Listing 8).

Listing 8. CSV converted Sqoop table data


1,1,1135322067,09890012-11583713-542922105,recipeview,779
2,1,1135322405,09890012-11583713-542922105,recipeview,288
3,89,1135327750,26458011-11487731-455118105,search-ingredient,
4,89,1135327750,26458011-11487731-455118105,ingredient,pork
5,89,1135327750,26458011-11487731-455118105,ingredient,cheese
6,89,1135327765,26458011-11487731-455118105,recipeview,1421


To select individual tables, use the code in Listing 9.

Listing 9. Selecting individual tables


$ sqoop import --connect jdbc:mysql://192.168.0.240/cheffy --username=cheffy --table access_log

And to select individual columns from that table, use the code in Listing 10.

Listing 10. Selecting individual columns


$ sqoop import --connect jdbc:mysql://192.168.0.240/cheffy --username=cheffy --table access_log --columns id,userid,operation

Rather than individually selecting tables and columns, a more practical approach is to use a query to specify the information to output. When using this method, you must use the $CONDITIONS variable in your statement and specify the column to use when dividing up the data into individual packets using the --split-by option as shown in Listing 11.

Listing 11. Specifying information to output


$ sqoop import --connect jdbc:mysql://192.168.0.240/cheffy --username=cheffy --query 'SELECT recipeid,recipe,description FROM recipe WHERE $CONDITIONS' --split-by recipeid

Sqoop, however, offers only limited ability to format and construct the information. For complex data, the export and load functions of a custom tool may provide better functionality.

Extracting data from Hadoop


When bringing processed data back from Hadoop, you work with the files output by your Hadoop job. As with exporting, you should ensure that your Hadoop job outputs the information in a format that you can read back effectively.

Importing to SQL
Using CSV is simple and straightforward, but for more complex structures, you might want to consider the JSON route again because it makes the entire conversion and translation process so easy. Getting the information out requires using the HDFS tools to copy your output files back to a local filesystem where you can perform a load, for example: $ hdfs dfs -copyToLocal processed_logs/*. Once you have the files, you can load the information using whatever method suits the source information and structure.
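As a rough sketch of that final load step, the following Python fragment reads the copied part files and inserts the summarized rows into SQL. It assumes comma-separated operation,count records, a summary_logs table like the one created in the next section, and the same hypothetical connection details used earlier, so treat it as an outline rather than a finished loader.

# load_summary.py: a minimal sketch of loading processed CSV output into SQL.
# Assumes "operation,count" lines copied out of HDFS with -copyToLocal and a
# summary_logs table as described below; connection details are placeholders.
import csv
import glob

import mysql.connector

conn = mysql.connector.connect(
    host="192.168.0.240",   # placeholder connection details
    user="cheffy",
    password="secret",
    database="cheffy",
)
cursor = conn.cursor()

# Adjust the glob to wherever the part files were copied locally.
for path in glob.glob("processed_logs/part-*"):
    with open(path, newline="") as infile:
        for operation, count in csv.reader(infile):
            cursor.execute(
                "INSERT INTO summary_logs (operation, count) VALUES (%s, %s)",
                (operation, int(count)),
            )

conn.commit()
cursor.close()
conn.close()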

Exporting from Sqoop


As with the import process, Sqoop provides a simplified method for translating information from your Hadoop job back into an SQL table. When outputting the resulting information from Sqoop, use the CSV format for the easiest export. Then, to import the information, you need to create a suitable table to accept the processed logs. For example, from our access logs, the Hadoop output has mapped the data into summaries of the number of operations, so it's necessary to first create a suitable table: CREATE TABLE summary_logs (operation CHAR(80), count INT). Then the information can be imported directly from Hadoop into your SQL table (see Listing 12).

Listing 12. Exporting from Hadoop into SQL


$ sqoop export --connect jdbc:mysql://192.168.0.240/cheffy --username=root --export-dir processed_log --table processed_log
13/08/15 10:04:34 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
13/08/15 10:04:34 INFO tool.CodeGenTool: Beginning code generation
13/08/15 10:04:35 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `access_log` AS t LIMIT 1
13/08/15 10:04:35 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `access_log` AS t LIMIT 1
13/08/15 10:04:35 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
13/08/15 10:04:35 INFO orm.CompilationManager: Found hadoop core jar at: /usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-core.jar
Note: /tmp/sqoop-cloudera/compile/8034e8d9feb8c1b0f69a52fede8d1da7/access_log.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
13/08/15 10:04:37 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/8034e8d9feb8c1b0f69a52fede8d1da7/access_log.jar
13/08/15 10:04:37 INFO mapreduce.ExportJobBase: Beginning export of access_log
13/08/15 10:04:39 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/08/15 10:04:39 INFO input.FileInputFormat: Total input paths to process : 4
13/08/15 10:04:39 INFO input.FileInputFormat: Total input paths to process : 4
13/08/15 10:04:39 INFO mapred.JobClient: Running job: job_201308150649_0006
13/08/15 10:04:40 INFO mapred.JobClient: map 0% reduce 0%
13/08/15 10:04:57 INFO mapred.JobClient: map 2% reduce 0%
...
13/08/15 10:08:06 INFO mapred.JobClient: CPU time spent (ms)=27470
13/08/15 10:08:06 INFO mapred.JobClient: Physical memory (bytes) snapshot=317607936
13/08/15 10:08:06 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2076659712
13/08/15 10:08:06 INFO mapred.JobClient: Total committed heap usage (bytes)=188350464
13/08/15 10:08:06 INFO mapreduce.ExportJobBase: Transferred 139.1333 MB in 207.5656 seconds (686.3975 KB/sec)
13/08/15 10:08:06 INFO mapreduce.ExportJobBase: Exported 2401906 records.

The process is complete. Even at the summarized level, we are looking at 2.4 million records of simplified data from a content store about 600 times that size. With the imported information, we can now run some simple, quick queries against the data. For example, this summary of the key activities takes about 5 seconds (see Figure 4).


Figure 4. Summary operations

On the full data set, the process took almost an hour. Similarly, a query on the top search terms took less than a second, compared to over 3 minutes, a time savings that makes it possible to include a query on the homepage (see Figure 5).

Figure 5. Summary ingredient search

These are simplified examples of using Hadoop for external reduction processing, but they effectively demonstrate the advantage of the external interface.

Conclusions
Getting information that is based on SQL data into and out of Hadoop is not complicated, provided you know the data, its format, and how you want the information processed and represented. The actual conversion, exporting, processing, and importing is surprisingly straightforward.

The solutions in this article have looked at direct, entire-dataset dumps of information that can be exported, processed, and imported between the two systems. The process can be SQL to Hadoop, Hadoop to SQL, or SQL to Hadoop to SQL. In fact, the entire sequence can be scripted or automated, but that's a topic for a future article in this series. In Part 2, we look at more advanced examples of performing this translation and movement of content by using one of the SQL layers that sits on top of HDFS. We'll also lay the foundation for providing a full live transmission of data for processing and storage.


Resources
Learn "Analyzing social media and structured data with InfoSphere BigInsights" teaches you the basics of using BigSheets to analyze social media and structured data collected through sample applications provided with BigInsights. Read "Understanding InfoSphere BigInsights" to learn more about the InfoSphere BigInsights architecture and underlying technologies. Watch the Big Data: Frequently Asked Questions for IBM InfoSphere BigInsights video to listen to Cindy Saracco discuss some of the frequently asked questions about IBM's Big Data platform and InfoSphere BigInsights. Watch Cindy Saracco demonstrate portions of the scenario described in this article in Big Data -- Analyzing Social Media for Watson. Check out "Exploring your InfoSphere BigInsights cluster and sample applications" to learn more about the InfoSphere BigInsights web console. Visit the BigInsights Technical Enablement wiki for links to technical materials, demos, training courses, news items, and more. Learn about the IBM Watson research project. Take this free course from Big Data University on Hadoop Reporting and Analysis (login required). Learn how to build your own Hadoop/big data reports over relevant Hadoop technologies, such as HBase, Hive, etc., and get guidance on how to choose between various reporting techniques, including direct batch reports, live exploration, and indirect batch analysis. Learn the basics of Hadoop with this free Hadoop Fundamentals course from Big Data University (log-in required). Learn about the Hadoop architecture, HDFS, MapReduce, Pig, Hive, JAQL, Flume, and many other related Hadoop technologies. Practice with hands-on labs on a Hadoop cluster on the Cloud, with the supplied VMWare image, or install locally. Explore free courses from Big Data University on topics ranging from Hadoop Fundamentals and Text Analytics Essentials to SQL Access for Hadoop and real-time stream computing. Create your own Hadoop cluster on the IBM SmartCloud Enterprise with this free course from Big Data University (log-in required). Order a copy of Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data for details on two of IBM's key big data technologies. Visit the Apache Hadoop Project and check out the Apache Hadoop Distributed File System. Learn about the HadoopDB Project. Read the Hadoop MapReduce tutorial at Apache.org. Using MapReduce and load balancing on the cloud (Kirpal A. Venkatesh et al., developerWorks, July 2010): Learn how to implement the Hadoop MapReduce framework in a cloud environment and how to use virtual load balancing to improve the performance of both a single- and multiple-node system. For information on installing Hadoop using CDH4, see CDH4 Installation - Cloudera Support. Big Data Glossary By Pete Warden, O'Reilly Media, ISBN: 1449314597, 2011. Hadoop: The Definitive Guide by Tom White, O'Reilly Media, ISBN: 1449389732, 2010.

"HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads" explores the feasibility of building a hybrid system that takes the best features from both technologies. "SQL/MapReduce: A practical approach to self-describing, polymorphic, and parallelizable user-defined functions" describes the motivation for this new approach to UDFs, as well as the implementation within AsterData Systems' nCluster database. Check out "MapReduce and parallel DBMSes: friends or foes?" "A Survey of Large-Scale Data Management Approaches in Cloud Environments" gives a comprehensive survey of numerous approaches and mechanisms of deploying data-intensive applications in the cloud which are gaining a lot of momentum in research and industrial communities. Learn more about big data in the developerWorks big data content area. Find technical documentation, how-to articles, education, downloads, product information, and more. Find resources to help you get started with InfoSphere BigInsights, IBM's Hadoop-based offering that extends the value of open source Hadoop with features like Big SQL, text analytics, and BigSheets. Follow these self-paced tutorials (PDF) to learn how to manage your big data environment, import data for analysis, analyze data with BigSheets, develop your first big data application, develop Big SQL queries to analyze big data, and create an extractor to derive insights from text documents with InfoSphere BigInsights. Find resources to help you get started with InfoSphere Streams, IBM's high-performance computing platform that enables user-developed applications to rapidly ingest, analyze, and correlate information as it arrives from thousands of real-time sources. Stay current with developerWorks technical events and webcasts. Follow developerWorks on Twitter. Get products and technologies Hadoop 0.20.1 is available from Apache.org. Download Hadoop MapReduce. Get Hadoop HDFS. Download InfoSphere BigInsights Quick Start Edition, available as a native software installation or as a VMware image. Download InfoSphere Streams, available as a native software installation or as a VMware image. Use InfoSphere Streams on IBM SmartCloud Enterprise. Build your next development project with IBM trial software, available for download directly from developerWorks. Discuss Ask questions and get answers in the InfoSphere BigInsights forum. Ask questions and get answers in the InfoSphere Streams forum. Check out the developerWorks blogs and get involved in the developerWorks community. IBM big data and analytics on Facebook.

About the author


Martin C. Brown

A professional writer for over 15 years, Martin (MC) Brown is the author of, and contributor to, more than 26 books covering an array of topics, including the recently published Getting Started with CouchDB. His expertise spans myriad development languages and platforms: Perl, Python, Java, JavaScript, Basic, Pascal, Modula-2, C, C++, Rebol, Gawk, Shellscript, Windows, Solaris, Linux, BeOS, Microsoft WP, Mac OS, and more. He currently works as the director of documentation for Continuent.

Copyright IBM Corporation 2013 (www.ibm.com/legal/copytrade.shtml)
Trademarks (www.ibm.com/developerworks/ibm/trademarks/)

