In this series of articles, we'll look at a range of different methods for integration between Apache Hadoop and traditional SQL databases, including simple data exchange methods, live data sharing and exchange between the two systems, and the use of SQL-based layers on top of Apache Hadoop, including HBase and Hive, to act as the method of integration. Here in Part 1, we examine some of the basic architectural aspects of exchanging information and the basic techniques for performing data interchange.
developerWorks
ibm.com/developerWorks/
One-way data exchange, from SQL to Hadoop or from Hadoop to SQL, is practical in situations where the data is being transported to take advantage of the query functionality, and the source is not the companion database solution. For example, pure textual data, or the raw results of a computational or analysis program, might be stored in Hadoop, processed with MapReduce, and the results stored in SQL (see Figure 1).
The reverse, where information is extracted from SQL into Hadoop, is less common, but it can be used to process SQL-based content that contains a lot of textual material, such as blogs, forums, CRM, and other systems (see Figure 2).
Two-way data exchange is more common and provides the best of both worlds in terms of the data exchange and data processing (see Figure 3).
Although there are many examples, the most common is to take large, linear datasets and text datasets from the SQL database and convert them into summarized information that can be processed by a Hadoop cluster. The summarized information can then be imported back into your SQL store. This is particularly useful where the large dataset would take too long to process within an SQL query. An example would be a large corpus of review scores or word/term counts.
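The word/term-count case mentioned above can be sketched in plain Python. The review texts and the (term, count) output shape here are invented for illustration; they are not the article's actual dataset, only the kind of reduction a MapReduce job would perform at scale:

```python
from collections import Counter
import re

def summarize_terms(reviews):
    """Reduce a corpus of review texts to (term, count) rows, sorted by
    frequency -- a summary that is cheap to re-import into SQL."""
    counts = Counter()
    for text in reviews:
        # Lowercase and split on non-word characters to normalize terms.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return sorted(counts.items(), key=lambda kv: -kv[1])

rows = summarize_terms([
    "Great recipe, great flavour",
    "The flavour was a bit flat",
])
print(rows[0])  # ('great', 2)
```

In a real deployment, the counting would run as the reduce phase of a distributed job; the point is that only the small summary table, not the full corpus, travels back into SQL.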
For example, you can create a query that extracts the data and populates a JSON array with the record data. Once exported, a job can be created to process and crunch the data before either displaying it, or exporting the processed data back into DB2.

Download InfoSphere BigInsights Quick Start Edition, a complimentary, downloadable version of InfoSphere BigInsights. Using Quick Start Edition, you can try out the features that IBM has built to extend the value of open source Hadoop, like Big SQL, text analytics, and BigSheets.
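As a sketch of that export step, here is a minimal Python version using an in-memory SQLite database as a stand-in for DB2. The recipe table, its columns, and the sample row are assumptions for illustration, not the article's real schema; the output format (one JSON document per line) is what matters:

```python
import io
import json
import sqlite3

# Assumed schema for illustration only; the article's DB2 schema differs.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE recipe (recipeid INTEGER, title TEXT, servings INTEGER)")
db.execute("INSERT INTO recipe VALUES (1, 'Pot roast', 6)")

def export_json_lines(conn, out):
    """Write one JSON document per row -- a line-oriented format that
    HDFS-resident MapReduce jobs can consume record by record."""
    cur = conn.execute("SELECT recipeid, title, servings FROM recipe")
    cols = [d[0] for d in cur.description]
    for row in cur:
        out.write(json.dumps(dict(zip(cols, row))) + "\n")

buf = io.StringIO()
export_json_lines(db, buf)
print(buf.getvalue())  # {"recipeid": 1, "title": "Pot roast", "servings": 6}
```

The same pattern applies unchanged to any driver that yields rows plus column names; only the connection line would change for DB2 or MySQL.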
As a rule, there are three primary reasons for interfacing between SQL and Hadoop:

1. Exporting for storage. Hadoop provides a practical solution for storing large amounts of infrequently used data in a format that can be queried, processed, and extracted. For example, usage logs, access logs, and error information are all practical candidates for insertion into a Hadoop cluster, taking advantage of the HDFS architecture. A secondary benefit of this type of export is that the information can later be processed or parsed and converted into a format that can be used again.

2. Exporting for analysis. Two common cases are exporting for reimport into SQL and exporting the analysis output for direct use in your application (analyzing and storing the result as JSON, for example). Hadoop provides the advantage here by allowing distributed, large-scale processing of information rather than the single-host, single-table processing provided by SQL. With the analysis route, the original information is generally kept, and the analysis process provides summaries or statistics that work alongside the original data.

3. Exporting for processing. Processing-based exports take the original raw source information, process and reduce or simplify it, and then store that information back to replace the original data. This type of exchange is most common where the source information has been captured but the raw original is no longer required. For example, logging data of various forms can easily be resolved into a simpler structure, either by looking for specific event types or by summarizing the data into counts of specific errors or occurrences. The raw data is often not required here. Reducing that data through Hadoop and loading the summary statistics back saves time and makes the content easier to query.

With these basic principles in mind, let's look at the techniques for a basic data exchange between SQL and Hadoop.
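The log-reduction case in reason 3 can be sketched in a few lines of Python. The log format and status codes below are hypothetical; the shape of the reduction (filter to the events you care about, then count) is the part that generalizes to a Hadoop job:

```python
from collections import Counter

# Hypothetical access-log lines: date, time, level, status, path.
log_lines = [
    "2013-03-01 12:00:01 ERROR 500 /checkout",
    "2013-03-01 12:00:04 INFO 200 /home",
    "2013-03-01 12:00:09 ERROR 404 /recipe/99",
    "2013-03-01 12:01:13 ERROR 500 /checkout",
]

def reduce_errors(lines):
    """Keep only ERROR events and reduce them to status -> count pairs;
    this reduced form is what replaces the raw log in the SQL store."""
    statuses = (line.split()[3] for line in lines if " ERROR " in line)
    return Counter(statuses)

print(reduce_errors(log_lines))  # Counter({'500': 2, '404': 1})
```

At scale, the filter becomes the map phase and the count the reduce phase, but the transformation itself is identical.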
SQL to Hadoop and back again, Part 1: Basic data interchange techniques
The resulting file can be loaded straight into Hadoop through HDFS. Generating the same output with a simple script in Perl, Python, or Ruby is as straightforward.
my $fw = Foodware->new();
my $recipes = $fw->{_dbh}->get_generic_multi('recipe', 'recipeid', { active => 1 });
my $js = new JSON;

foreach my $recipeid (keys %{$recipes}) {
    my $recipe = new Foodware::Recipe($fw, $recipeid,
                                      { measgroup => 'Metric',
                                        tempgroup => 'C', });

    my $id = $recipe->{title};
    $id =~ s/[ ',\(\)]//g;

    my $record = {
        _id         => $id,
        title       => $recipe->{title},
        subtitle    => $recipe->{subtitle},
        servings    => $recipe->{servings},
        cooktime    => $recipe->{metadata_bytag}->{totalcooktime},
        preptime    => $recipe->{metadata_bytag}->{totalpreptime},
        totaltime   => $recipe->{metadata_bytag}->{totaltime},
        keywords    => [ keys %{$recipe->{keywordbytext}} ],
        method      => $recipe->{method},
        ingredients => $recipe->{ingredients},
        comments    => $recipe->{comments},
    };

    foreach my $ingred (@{$recipe->{ingredients}}) {
        push(@{$record->{ingredients}},
             {
                 meastext   => $ingred->{'measuretext'},
                 ingredient => $ingred->{'ingredonly'},
                 ingredtext => $ingred->{'ingredtext'},
             });
    }

    print to_json($record), "\n";
}
The data is exported to a file that contains the recipe data (see Listing 4).
The result can be loaded directly into HDFS and processed by a suitable MapReduce job to extract the information required. One benefit of this structured approach is that it enables you to perform any required preprocessing on the output, including structuring the information in a format you can use within your Hadoop MapReduce infrastructure.
The phrase "importing into Hadoop" really means you simply need to copy the information into HDFS for it to be available (see Listing 5).
Once the files are copied in, they can be used by your Hadoop MapReduce jobs as required. For better flexibility within HDFS, the output can be chunked into multiple files, and those files can be loaded. Depending upon your use case and processing requirements, extracting the data into individual files (one per notional record) may be more efficient for the distributed processing.
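One way to do that chunking is sketched below in Python. The chunk size, file-naming scheme, and JSON-lines records are all choices made for illustration; any scheme that spreads records across multiple files serves the same purpose for distributed processing:

```python
import os
import tempfile

def chunk_records(records, outdir, per_file=2):
    """Spread serialized records across several files so that HDFS and
    MapReduce can operate on the chunks in parallel."""
    paths = []
    for i in range(0, len(records), per_file):
        # Zero-padded sequence numbers keep the chunks in a stable order.
        path = os.path.join(outdir, "chunk-%05d.json" % (i // per_file))
        with open(path, "w") as fh:
            fh.write("\n".join(records[i:i + per_file]) + "\n")
        paths.append(path)
    return paths

outdir = tempfile.mkdtemp()
paths = chunk_records(['{"id": %d}' % n for n in range(5)], outdir)
print(len(paths))  # 3 files for 5 records at 2 per file
```

In practice you would size the chunks closer to the HDFS block size rather than two records, so each mapper gets a full block of work.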
For those drivers that support it, use the --direct option to read the data directly and write it into HDFS. The process is much faster because it requires no intervening files. When loading data this way, directories are created within HDFS according to the table names. For example, the recipe data set keeps access log information in the access_log table, so the imported data is written into text files within the access_log directory (see Listing 7).
By default, the files are split into approximately 30MB blocks, and the data is separated by commas (see Listing 8).
And to select individual columns from that table, use the code in Listing 10.
Rather than individually selecting tables and columns, a more practical approach is to use a query to specify the information to output. When using this method, you must include the $CONDITIONS variable in your statement, and you must specify the column used to divide the data into individual packets with the --split-by option, as shown in Listing 11.
One limitation of Sqoop, however, is that it provides only limited ability to format and construct the information. For complex data, the export and load functions of a custom tool may provide better functionality.
Importing to SQL
Using CSV is simple and straightforward, but for more complex structures, you might want to consider the JSON route again because it makes the entire conversion and translation process so easy. Getting the information out requires the HDFS tool to copy your output files back to a filesystem where you can perform the load (for example, $ hdfs dfs -copyToLocal processed_logs/*). Once you have the files, you can load the information using whatever method suits the source information and structure.
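A small Python sketch of that load-preparation step is shown below. The tab-separated operation/count lines stand in for typical reducer part-file output; the actual delimiter and fields depend on how your job writes its results, so treat them as assumptions:

```python
import csv
import io

# Hypothetical reducer output: operation<TAB>count, one pair per line,
# as a typical summary job might write to its part-0000N files.
part_file = "search\t404889\nview\t1231444\nedit\t58202\n"

def part_to_rows(text):
    """Convert part-file lines into (operation, count) tuples, ready for
    a batched INSERT or a LOAD DATA style bulk import."""
    return [(op, int(count))
            for op, count in (line.split("\t") for line in text.splitlines())]

rows = part_to_rows(part_file)

# Re-emit as CSV if the target database's bulk loader prefers it.
out = io.StringIO()
csv.writer(out).writerows(rows)
print(rows[0])  # ('search', 404889)
```

Converting the counts to integers at this stage catches malformed lines before they ever reach the database.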
For example, from our access logs, the Hadoop output has mapped the data into summaries of the number of operations, so it is necessary to first create a suitable table: CREATE TABLE summary_logs (operation CHAR(80), count INT). Then the information can be imported directly from Hadoop into your SQL table (see Listing 12).
The process is complete. Even at the summarized level, we are looking at 2.4 million records of simplified data, from a content store about 600 times that size. With the imported information, we can now run some simple, fast queries on the data. For example, this summary of the key activities takes about 5 seconds (see Figure 4).
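The create-and-import step can be sketched end to end in Python. SQLite stands in for the article's SQL store here, and the sample counts are invented; the summary_logs schema mirrors the CREATE TABLE statement above:

```python
import sqlite3

# SQLite as a stand-in for the production SQL store.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE summary_logs (operation CHAR(80), count INT)")

# Invented summary rows, as produced by the Hadoop reduction step.
summary = [("search", 404889), ("view", 1231444), ("edit", 58202)]

# executemany gives a single batched insert for the whole summary set.
db.executemany("INSERT INTO summary_logs VALUES (?, ?)", summary)
db.commit()

total = db.execute("SELECT SUM(count) FROM summary_logs").fetchone()[0]
print(total)  # 1694535
```

Because only the summary rows are inserted, the import stays small and fast even when the raw source data ran to hundreds of millions of records.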
On the full data set, the process took almost an hour. Similarly, a query on the top search terms took less than a second, compared to over 3 minutes, a time savings that makes it possible to include a query on the homepage (see Figure 5).
These are simplified examples of using Hadoop for external reduction processing, but they effectively demonstrate the advantage of the external interface.
Conclusions
Getting SQL-based data into and out of Hadoop is not complicated, provided you know the data, its format, and how you want the information internally processed and represented. The actual conversion, exporting, processing, and importing are surprisingly straightforward.

The solutions in this article have looked at direct, entire-dataset dumps of information that can be exported, processed, and imported: SQL to Hadoop, Hadoop to SQL, or SQL to Hadoop to SQL. In fact, the entire sequence can be scripted or automated, but that's a topic for a future article in this series.

In Part 2, we look at more advanced examples of performing this translation and movement of content by using one of the SQL layers that sits on top of HDFS. We'll also lay the foundation for providing a full live transmission of data for processing and storage.