A French consulting company called Manapps has released an ETL benchmark report. The report compares Talend, Pentaho Kettle, DataStage and Informatica. Two of those four vendors will be pleased with the results.
ETL Benchmark Favours DataStage and Talend
Vincent McBurney, Dec 9, 2008 | Comments (7)

A French consulting company called Manapps has released an ETL benchmark report that compares Talend, Pentaho Kettle, DataStage and Informatica, and two of those four vendors will be pleased with the results.

*** March 2009 Update: I was contacted by ManApps informing me that the benchmark report was a draft and not the final report. There is a final report that shows more favourable results for Informatica and I will update this blog post when I have some spare time. Here is the statement from ManApps as translated from French by Google Translate:

"We publish on our website version of the document "Benchmark ETL" whose original version was wrongly found work published on various sites or blogs outside our company. This version significantly modifies certain measures have been taken based on more advanced technical parameters regarding Power Center Solution Informatica, we did not have in the original version. The findings have been modified accordingly. Note that this document remains a working document. It has no goal of marketing. We want to provide publishers covered by the study may be required to take any action they deem useful, and if necessary publish the results accordingly."

*** End of update

I am going to write a post on what I think of the objectives of this benchmark, but for now here are the results and an analysis of each test.

Manapps is part of OmegaHighTech, a company with over 3,500 employees around the world. Amongst other things Manapps does Business Intelligence consulting and data warehouse implementations. The benchmark is released under a Creative Commons license:

You are free:
to Share: to copy, distribute, display, and perform the work
to Remix: to make derivative works

I hear Coldplay have already taken the words to use in their next song.
I don't know how they distributed the PDF. I found it in a blog post by Marc Russell: ETL Benchmark by Manapps. I've copied the graphs into this blog post with some comments, and cropped off the top of the graphs for readability and because I don't think the really high scores are reflective of the products; they show poor ETL design.

Event Number 1 - Sequential Files

The first test was reading a sequential file and writing out to a sequential file. Anyone who knows DataStage can guess the result: DataStage Server Edition will be awesome and DataStage PX not so great:

Benchmark 1 Results: ETL Sequential File Processing

ETL Benchmark Favours DataStage and Talend http://it.toolbox.com/blogs/infosphere/etl-benchmark-favours-datastag... 1 of 12 21/01/2014 10:37

DataStage Server Edition LOVES sequential files. It's been optimised over 15 years of releases, not just for reading and writing sequential data but for memory caching and row buffering in the middle. Look at the 5 million row result: less than a third of the time of the nearest competition. This is one of the reasons why DataStage Server Edition customers who have a lot of low to mid range data sizes and share data in sequential files are sticking with Server Edition.

DataStage PX tolerates sequential files: it imports the data and converts it to parallel format, and then exports it again back to sequential format. Not only that, but because they chose 2 nodes DataStage PX had to partition and then unpartition the damn data. Let me show you how much this sucks.
Let's say you have a wheelbarrow with two sacks of flour in it and you've got to deliver it to the king of Sparta before sunset or he'll kick you into a bottomless pit. With most of these ETL tools you pick up the wheelbarrow and run like hell to the king. With DataStage PX you pick up the wheelbarrow, you wheel it over to two wheelbarrows and put a sack of flour in each, and then clone yourself and the two of you push both those wheelbarrows to the palace, where you swap them back to one wheelbarrow and give it to the king. He kicks you and your doppelganger down the bottomless pit and throws the wheelbarrows after you. If you have 100 flour bags your parallel wheelbarrows are great, but with two flour bags it's a waste of time. A single job parameter that tells this job to run in sequential mode could have made it as much as 50% faster.

Informatica: youch! They are like Forrest Gump before the leg braces came off. They finished the first race after DataStage Server finished the third race. Informatica was the only ETL tool in this test that used three stages to do this job instead of two. They did a file input and then a file delimiter definition: a painfully slow row-by-row delimiter definition. I'm no expert, but someone tell me this was a dumb job design. Eventually Informatica picked up speed and in the 20 million row test it came second.

Test 2 - MySQL

This test only compared Talend and Pentaho writing to MySQL. In this test they had two versions of Talend: TOS 2.4.1 and TOS 2.4.1 extended insert. This tells me that Manapps did a bit more research on Talend than on Pentaho Kettle. By this stage I'm thinking this test is rigged. Fun, but rigged.

Test 3 - Read a Database

In test 3 Manapps has a test to read from an Oracle database table and write out to a sequential file.
Test Result 3: ETL Reading a Database

DataStage PX did surprisingly well. It had three stages: database, modify, sequential file. It repartitioned twice (needlessly) so it could be improved, but because it has the Oracle Enterprise stage it kicked butt on the database read. DataStage Server did not do so well, and I would have loved to tweak the array size and transaction size properties to read the database data in one chunk. Might have got a huge performance boost. Informatica: well, I've chopped off the top of the graph to make it readable, but again they had the extra stage and that added 40 seconds to each job run time. Surely there is a way to avoid that 40 second lag.

Test 4 - Database Bulk Load

This test was notable as the only one that Pentaho Kettle won. At least in the minuscule volume category. The editor must have been asleep at the wheel when he let this one through. They did worse as the volumes went up.
Test Result 4: Feeding Oracle Bulk Load

It was another test that Server Edition won comfortably, because this is essentially the same as test 1: it's all about the speed of creating a sequential file. After you've got the file, all five ETL tools call the exact same Oracle bulk loader. Test 1 and Test 4: identical.

Test 5 - Same as Test 1 with a Transform in it

Finally, finally! An ETL test that has Transform in it (you know, the T in ETL?). All the tests up to now were bullshit; this is the first true ETL test and it took Manapps five tests to get here:

Test Result 5: ETL Transformer

This was the first time the parallel partitioning of DataStage PX may have helped rather than hindered. With two nodes doing those transform functions, the higher the volumes the more it wins. We finally see Informatica do well, coming second in the 20 million row test. What I would give to see a 1,000,000,000 row test. Once again Informatica had that initial 35 second handicap, but take that out and it performed really well.

Test 6 - ELT

This was a silly test, as Manapps didn't have DataStage ELT and didn't know how to use Informatica ELT. They tried another way to force those products to use ELT but got it wrong. The one thing I will say about this test is that Talend seems to have a good GUI for simple ELT:

Benchmark Job: Talend ELT

This job lets you define an Oracle connection, aggregate the data and save it to another Oracle table. It ran in under 2 seconds for up to 1 million rows, so I assume it pushes it all down to the database. This is a good ETL way to write an aggregation: it runs just like a database group by command but it's got an open and easy to read data lineage. This is just the type of thing I would expect from DataStage and Informatica ELT if you can get it running.
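To make the push-down idea concrete, here is a minimal sketch in Python, with an in-memory SQLite database standing in for Oracle. The table and column names (sales, sales_by_region, region, amount) and the sample rows are invented for illustration; none of them come from the Manapps jobs. The ELT version sends a single INSERT ... SELECT ... GROUP BY to the database, so no rows leave it. The ETL version pulls every row out, aggregates in the client process and writes the totals back.

```python
import sqlite3
from collections import defaultdict

def make_db():
    # In-memory SQLite stands in for the Oracle instance; all names are hypothetical.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    con.execute("CREATE TABLE sales_by_region (region TEXT, total INTEGER)")
    rows = [("north", 10), ("south", 5), ("north", 7), ("east", 3), ("south", 1)]
    con.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return con

def aggregate_elt(con):
    # ELT: one statement; the database does the group-by and the write itself.
    con.execute(
        "INSERT INTO sales_by_region (region, total) "
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    )

def aggregate_etl(con):
    # ETL: extract every row, aggregate in the client, then load the totals.
    totals = defaultdict(int)
    for region, amount in con.execute("SELECT region, amount FROM sales"):
        totals[region] += amount
    con.executemany("INSERT INTO sales_by_region VALUES (?, ?)", totals.items())

def result(con):
    # Read the target table back in a fixed order so the two runs can be compared.
    return con.execute(
        "SELECT region, total FROM sales_by_region ORDER BY region"
    ).fetchall()

elt_db, etl_db = make_db(), make_db()
aggregate_elt(elt_db)
aggregate_etl(etl_db)
```

Both routes leave the same rows in the target table; the only difference is where the work happens. With the ELT route the tool's job is reduced to generating the SQL, which is presumably why the run time barely moves with volume.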
What Manapps did in this test that is kind of sneaky is have the source and target table in the same database, which kind of defeats the purpose of using an ETL tool. A true ELT scenario involves different source and target databases. The other four ETL tools could have done the exact same thing with a user-defined SQL select, though the data lineage would not have been as good. The tester made the comment:

"Only Talend Open Studio permits to use an ELT mod. Informatica got the Push Down Optimization, but I didn't find this feature on the tool."

You've got to buy the add-on! It's not free with the tool! They are not as charitable as Talend.

Test 7 - more ELT

This is the most interesting test in the benchmark because it shows how ETL engines process faster than ELT by throwing grunt, memory and hardware at the transform part of the job. This test compares a pure ELT command from Talend with traditional ETL from DataStage, and ETL wins:

Benchmark Job: Talend complex ELT
Benchmark Job: DataStage Join and Filter

The first diagram is the clever Talend ELT interface that leaves the data on the database and performs some mapping on it. The second diagram is traditional ETL: DataStage reads the data and then transforms, joins, transforms, filters and writes it out. It looks like it's doing a lot more work, but don't let those Modify stages deceive you: they are almost zero overhead, and since there are no sequential files this job is in its element and comes out fastest:

Benchmark Result 7: Join and Filter

Talend has just one processing engine: the database. DataStage has two: the database and the ETL server. The higher the volume, the faster DataStage PX will go. Manapps only tested up to 1,000,000 rows in this test despite testing higher volumes in other tests. I would have liked a 20,000,000 row test for this one.
It would be so, so very much faster with a tiny bit of tuning. You see this little DataStage symbol: [sort icon]. That's pain time. That's data being sorted and repartitioned: swapping flour bags between wheelbarrows. If you sort the data in the source database stages and remove the sorts from the job, this baby runs a lot faster. Because DataStage PX jobs push everything through a parallel engine you become adept at sorts and partitions, and Manapps would have worked this out with some scenario testing.

Test 8 - Sort

A very interesting test result that shows how friggen fast the DataStage PX sort is. When Applied Parallel Technologies (who became Torrent Systems and then IBM DataStage PX) wrote a parallel flow based processing engine in 1993, one of the first functions they wrote was sort: it was an obvious candidate for running faster in parallel mode. Fifteen years later and it flies:

Benchmark Job: DataStage PX Sort

It's a simple DataStage job design: read from one file and write to another. The properties of the second file insist on the data being sorted, so you can see the little yellow sort symbol that tells you the data on that link is being sorted. This test had two sequential files, the Achilles heel of parallel processing, but it was kind of like a sprint relay with Father Christmas handing the baton to Usain Bolt, who handed it to Roseanne Barr. The sort in the middle made up the time, with the result that DataStage PX was miles ahead:

Benchmark Result 8: Sort Speeds

My own benchmark tests showed DataStage PX was many times faster at sort and aggregation than DataStage Server Edition even before you added any parallel nodes. It's got very well written processing components. A 7 minute sort in Server Edition took 12 seconds on one node in DataStage PX.
This is the reason why Co-Sort and Syncsort (the sequential file sorting specialists) are welcome at DataStage Server sites and not DataStage PX sites. DataStage PX does not need any help with sorts. Talend used GNU sort external to the tool, and lost badly on the very high volume sort. Maybe there is a better sort script out there. Looks like DataStage Server Edition fell over on 20 million rows: not a huge surprise. If you are sorting big data volumes you need to upgrade to PX! You'll get a huge discount at the moment thanks to the credit crisis; they are desperate for any extra licensing.

Test 9 - ETL Aggregation

Test 9 is similar to test 8 but it's aggregation instead of sort. One of the few tests Informatica won, edging out DataStage in the 20 million category despite that initial 35-40 second flow start:
Benchmark Result 9: ETL Aggregation

Run, Informatica, run! The trend line for Informatica is impressive: not much increase in time from 100,000 to 5,000,000 rows. If they could break free of those leg braces earlier they would be winning all categories. The test developer made a mistake with the DataStage PX job in this test and left it with two sorts instead of one:

Benchmark Job: DataStage PX Aggregation

They used the job from test 8 that had an enforced sort in it instead of creating a new job or using the job from test 1. The aggregator will add a sort: it needs sorted data in order to aggregate. The output sequential file is also asking for a sort (left over from test 8), possibly in a different order to the aggregator, so this job is combining tests 8 and 9 into one and DataStage PX is still coming first or second in most results. It could have been 10-20% faster without that second sort.

Test 10 - Lookups

Sigh, this is where the benchmark really gets loopy. Mork and Mindy loopy. This job is what you expect someone to build if they have only been using DataStage for a couple of hours:

Benchmark Job: DataStage PX Join

It's a mess. These sorts and partitions are killers: [sort and partition icons]. Lots of flour bag sorting and swapping between wheelbarrows. This job would be as much as 80% faster if you replaced that join with a lookup. The lookup stage does not need any sorting to work and 9 times out of 10 it will be faster than a join. By default I use a Lookup stage, and I need something to go seriously wrong with the job before I switch to a Join. This job design doesn't cut it for benchmarking. Talend does best with a small lookup volume, DataStage PX does okay and Informatica is astoundingly bad.

Benchmark Result 10: ETL Lookup

I'm no Informatica expert but the job design looked kind of crazy:
Benchmark Job: Informatica Lookup

Can someone tell me what is wrong with it? Informatica lookups shouldn't be this slow. The one time you do want to use a DataStage PX Join stage instead of a Lookup is when you have massive amounts of lookup data. In this benchmark there was a set of tests with 5,000,000 rows of lookup data, and we finally got to see a Join stage that was worthwhile:

Benchmark Result 10: High Volume ETL Lookup

This test has high volume input rows AND high volume lookup rows. We have reached a volume of data that justifies a Join stage, where data is sorted before the comparison of rows is performed, and you can see the scalability of DataStage PX on 20 million input rows joined to 5 million lookup rows. This test gives you an idea of what would happen with a job with many stages: join, lookup, transform and sort. DataStage PX would be further in front as the volumes go up, and if you added more CPUs the difference would be even more obvious.

Test 11 - Lookups with Rejects

Test 11 is similar to test 10 except that when you cannot join you produce a reject. This has me so frustrated I want to take Manapps out and beat them with a rake; this test would have been so much better with a DataStage PX lookup stage. It can do the lookup and reject in one step, so much faster than the join stage that does it in two steps with extra sorts. By this stage of the benchmark the Informatica job is looking like the route home that a drunk driver takes to avoid the police:

Benchmark Job: Informatica Drink Driving

What the hell? DataStage PX in the hands of a drunk driver still manages to crash into second place on high volumes, but I'm afraid the testers did not know enough about lookups to do it justice. Informatica fared much worse in the hands of a novice, and I wait with bated breath to hear what was wrong with these job designs.
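The difference between the two designs is easier to see in code. What follows is a rough Python sketch of the general techniques, not of DataStage itself; the function names and the sample data are made up for illustration. The lookup version loads the reference data into an in-memory dictionary and makes one pass over the input, sending unmatched rows to a reject list in the same step. The join version has to sort both inputs on the key before it can merge them, which is the extra work those sort symbols stand for. The dictionary approach assumes the reference data fits in memory; when it does not, the sort-merge approach is the one that keeps scaling.

```python
def lookup_with_reject(rows, reference):
    """Hash lookup: one pass over the input, no sorting; unmatched rows are rejected."""
    ref = dict(reference)              # reference data held in memory
    matched, rejects = [], []
    for key, value in rows:
        if key in ref:
            matched.append((key, value, ref[key]))
        else:
            rejects.append((key, value))
    return matched, rejects

def sort_merge_join(rows, reference):
    """Sort-merge join: both inputs are sorted on the key, then merged in step."""
    left = sorted(rows, key=lambda r: r[0])
    right = sorted(reference, key=lambda r: r[0])   # assumes unique reference keys
    matched, rejects = [], []
    i = 0
    for key, value in left:
        # Advance the reference side until it catches up with the input key.
        while i < len(right) and right[i][0] < key:
            i += 1
        if i < len(right) and right[i][0] == key:
            matched.append((key, value, right[i][1]))
        else:
            rejects.append((key, value))
    return matched, rejects

# Hypothetical sample data: (customer_id, amount) and (customer_id, name).
orders = [(3, 30.0), (1, 10.0), (9, 90.0), (1, 15.0)]
customers = [(1, "Ann"), (2, "Bob"), (3, "Cy")]

lookup_matched, lookup_rejects = lookup_with_reject(orders, customers)
join_matched, join_rejects = sort_merge_join(orders, customers)
```

Both functions find the same matches and the same reject (customer 9 has no reference row). The lookup keeps the input order and never sorts anything, while the join returns rows in key order after sorting both sides.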
Conclusion

Thanks to Manapps for the benchmark, but I would like to see the sequential file tests run with DataStage on one node and the lookup tests with a lookup stage. Hey, isn't that a coincidence. Lookup test, lookup stage. Who would have thought a lookup stage would work for a lookup test?

Talend does a lot of its work in memory (like DataStage PX) but this starts to come apart at the seams when the volumes go up. DataStage PX handles this by caching and buffering. It would be interesting to see a benchmark going into the 100s of millions of rows, or 50 columns or more, to see what each tool does under real stress: the type of processing that is common for telcos, banks and insurers.

These tests do show that when you are down in the smaller volumes the open source ETL tools are an option, and I would prefer them to manual coding, but in the higher volumes give me a premium tool any day. Even a novice can get good results.

Disclaimer: The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way. Vincent McBurney is an IBM Information Champion for Information Integration.

7 Comments

Werner Daehn, Dec 9, 2008
Would love to run that benchmark myself. In case you ever get the source files and database tables let me know. Personally I don't like the test either. How many GB of data is moved via flat files vs. from source to target database?
I guess the majority is database-to-database, hence the file tests are nice and simple but do not help much, as the parsing of the files can be overly expensive, more expensive than the transformations, if there were any. The other thing I am surprised by is that there is a difference between the vendors. I would have thought that for these copy operations with a lookup in the middle and such, the performance bottleneck would be the disk I/O. So I would immediately have guessed that the flows are not correctly designed with each tool. Especially in an ELT case, where the engine has almost nothing to do compared to the database, the difference should be zero, shouldn't it? But the most surprising statement was actually yours about the Oracle bulk load: "it's all about the speed of creating a sequential file. After you've got the file all five ETL tools call the exact same Oracle bulk loader." I know Informatica supports the Oracle API bulk loader, so no need to write any file. Doesn't DataStage as well?
-Werner

Johannes Almiala, Dec 10, 2008
I'm probably going to comment more later, but now a quick one for Test 11. One thing I would have done differently with Informatica is that I would have used a single Router transformation instead of four filters. A router does in one pass the same as the four filters do in four passes, plus you catch the rows that don't match any of the filter conditions. Also, there is no visibility on how the lookup has been configured; it could easily be a bottleneck. Generally, the default amount of memory allocated to transformation caches in Informatica PowerCenter sessions is 5% (or 512 MB, whichever is smaller) of the maximum available. If that hasn't been changed and the lookup source file is large, this test will basically measure random disk reading speed on the server platform.

Vincent McBurney, Dec 10, 2008
My Oracle bulk loader days go back to DataStage Server Edition and about Oracle 8!
That version wrote out a text dat file under the covers in the Oracle bulk loader data format and passed the file to the Oracle bulk load program. DataStage PX has a much newer Oracle Enterprise stage compatible with the newer versions of Oracle, but I don't know what it does under the covers. The bulk load test would be interesting if the source was a database table, so you could take sequential file parsing out of the equation, and then bump the data volume up to 20 million rows.

Dec 14, 2008
It is interesting that the version of DataStage used in the benchmarking is two major releases behind the current version 8.1, and that there is little mention that DataStage is the hands-down winner in linear scaling of parallel jobs to available hardware by simple changes to a configuration file. The fact that DataStage can scale seamlessly beyond any other vendor in this test, and that management of that scalability is least costly in terms of hardware, installation and IT resources, is overlooked. As mentioned, it doesn't make a lot of sense to run any job in a parallel process when the data volumes and transformative actions are minimal, but once the volumes increase or transformations expand beyond simple data mapping, the parallel engine underlying the Information Server platform begins to easily outperform the other vendors in the test. In addition, no mention is made of the integrated platform Information Server brings to the table, as most of the vendors in the test recognize that data integration is much more than ETL. Granted, my opinions are biased and all should evaluate these results from their own perspective.
The only point here is that taking a simple scenario or two does not give the reader an accurate view of the products or capabilities, as each vendor can demonstrate where the benchmarked test deviates from their best practices for each product.

USER_1963953, Apr 1, 2010
This benchmark has strange results. I just did a mapping with Informatica PowerCenter 8.6.1 that calculates 2 ranks on 34 million rows, joins them with over 64 million rows, then aggregates them, and it takes 130 secs, consuming on average 5 POWER5+ CPUs at 1900 MHz. I repeated the same with a larger volume of data, 80M+ rows for the 2 ranks and 1 billion for the outer join and the aggregation, and it takes 450 secs to execute, with the same CPU consumption. Anyway, benchmarking ETL is a very difficult task because the results depend too much on the skill of the developer and their knowledge of the architecture of the product. BTW: Informatica lookups are slower than joiners, at least in the 8.6.1 release; we will see in Informatica 9.

Younes Siebel, Oct 7, 2010
I think that Talend, when there is not a huge amount of information to deal with, can simply be the best. But the thing that makes it more interesting is that it costs $0.00, while DataStage Server is more than $80,000.00!

naresh ketepalli, Aug 2, 2011
Can anyone tell me the architecture and features of Talend?