Web Application Bottlenecks Exposed
Problems and How to Nail Them!
Dan Downing, VP Testing Services
MENTORA
Atlanta, Boston, DC, San Jose
404.250.6515
www.mentora.com
Copyright Mentora 2001

2 Objectives
Identify common website performance bottlenecks:
- Source (what component they occur on)
- Symptom (how you know there's a problem)
- Causes (what creates the problem)
- Measurements (how to nail it)
- Cures (how to make it go away)
Illustrate with examples of B2C, B2B, B2E cases
Audience: Performance Engineer, Load Testing Expert, with intermediate experience

3 Terms & Concepts
- Application Performance Testing: A repeatable methodology for volume-simulation of real-world applications in a customer's environment to yield performance results that can be implemented to deliver efficient utilization of computing resources.
- Scalability: The demonstrated ability (or lack thereof) of a system (or component) to yield the same response time of a business process irrespective of the magnitude of the load applied to the system.
- Bottleneck: A hardware component, process, or piece of software in the system-under-test that is causing performance degradation and low scalability under load.
- Resource Utilization: The quantification of a shared computing resource being consumed by an application process or component.
- Symptom: The outwardly visible but unquantifiable effect of a performance bottleneck.
- Cause: The specific and measurable factor yielding one or more symptoms.
- Cure: The specific action applied to the Cause that will measurably improve the visible symptom.
- Measurement: A numeric value of a performance-affecting factor that can be quantified by a monitoring tool and related to a specific component of the system-under-test.

4 Symptoms
"It's Too Slow"
- As perceived from slow browser response by functional testers
- As measured by poor scalability during first low-load test
- As experienced (too late!) as low productivity by real production users
"It's broken"
- Page never returns after button press
- Web server errors (404, 500)
- Application error messages in application logs
Symptoms are usually very unspecific!

5 3-Tier Environment
- Network: Firewall, load balancer, routers, network interface cards, cabling between all components
- Web Server Tier: One or more (usually many) low-capacity computers that receive, route, and display results of HTTP requests from visitors' browsers
- Application Server Tier: One or more (often 2) medium-to-high-capacity computers that receive, apply business logic to, and return to the web server the results of the HTTP request
- Database Server Tier: One or more (usually one with redundant stand-by) high-capacity computers that operate database software and access the database (often on large disk arrays) to service user data requests
[Diagram: Web Server (Sun E220), App Server (Sun E420), DB Server (Sun E4500, Oracle)]

6 Performance Bottleneck Sources
Tiers: Network, Web Server, App Server, DB Server
What in your experience* do you find as the relative distribution of bottlenecks?

How often?   Ntwk   Web Srvr
>30%         30%    16%
21-30%       16%    12%
11-20%       25%    40%
<10%         27%    29%

How often?   DB Srvr   App Srvr
>60%         9%        7%
41-60%       29%       21%
21-40%       32%       48%
11-20%       21%       11%
<10%         7%        11%

* Poll results of 56 Mercury Conference '01 attendees of intermediate to advanced experience.

7 Performance Bottleneck Sources
In my experience, it's the application! (~80% of the time)

Tier         My estimate   Most-voted poll range (share of votes)
Network      8%            >30% (30%)
Web Server   12%           11-20% (40%)
App Server   35%           21-40% (48%)
DB Server    45%           21-40% (32%)

- The % distribution is a SWAG based on experience testing dozens of apps
- Most of the application code resides on the App Server and DB Server tiers

8 Database (Simple) Anatomy
[Diagram: App Server (e.g. Sun 420) with a DB Connection Pool sends SQL to, and receives data from, the DB Server (e.g. Sun 4500, quad CPU, 2 GB memory). DB Server components shown: Client Comm Buffer, Query Parser, Query Optimizer, Query Plan Storage, Query Executor, Metadata Cache, Write Buffer, Shared Memory, Data Cache. Storage shown: Disk Array (e.g. Sun A10000) holding Data, Log, BI.]

9 Key DB Server Measurements
Measurement: Impact/Range
- Server CPU: Shows raw horsepower consumption on the server; should average 70-80%; else add CPUs!
- Server Memory: Memory available should stay constant and average below 70-80%; else add memory
- Server Page Faults/s: Should be low and constant; otherwise the server is doing virtual-memory disk I/O, which indicates insufficient memory allocated to DB processes
- Cache Hit Ratio: Should be high (90-95% range); else the data cache is sized too low and there is too much physical I/O
- Deadlocks: Should be zero at target loads; if not, indicates a transaction model design problem
- Table scan blocks/sec: Should be low for normal transactions (can be high for reporting functions); else indicates that indexes are missing or poorly designed
- Parse-to-execute ratio: Should be low (<20%); else could indicate an under-sized query cache, old/no optimizer statistics, or a flawed query model in an app server function
- Transactions/second: A general indicator of DB load handling; should be compared run-to-run
- SQL*Net bytes rcvd/sent from/to client: A measure of the data-intensiveness of queries; read bytes should be <50% of sent bytes, else indicates that complex application queries should become stored procedures
- Open cursors: A measure of the number of open client queries; should be low, or could be an indicator of an inefficient query model
- Physical reads/writes: Correlates with cache-hit ratio; should decrease run-to-run as the cache is tuned
- Server I/O: Should be balanced across all drives; else indicates a DB hot spot on large, high-access tables, which need to be striped across multiple drives; should average 20% below the disk I/O saturation level
- DB Memory: Should be ~80% of available user memory on the server, and should average <75%; else, add!

10 DB Server Causes & Cures
Cause: Inefficient SQL statement
  Measurement: Slow page (>10 sec) that ties to a specific function, and thus to an SQL query; high DB CPU or I/O
  Cure: Analyze query plan, optimize query
Cause: Inefficient SQL query model
  Measurement: Many slow pages; high "bytes recvd" by DB server; low DB CPU; or: many slow queries
  Cure: Convert client SQL to stored procedures, or optimize slow queries
Cause: Inefficient DB configuration
  Measurement: Low correlation between DB and server resource utilization; unbalanced I/O
  Cure: Reconfigure DB (add memory, write processes, threads, ...)
Cause: Overuse of row-at-a-time processing
  Measurement: High open cursors; high bytes sent from client
  Cure: Tune query prepares in app server / code
Cause: Missing/ineffective indexes
  Measurement: High table scan blocks; slow function
  Cure: Find/add/fix table indexes
Cause: Query plan cache too small
  Measurement: High parsed-to-executed queries ratio
  Cure: Raise size of query plan cache
Cause: Inefficient concurrency model
  Measurement: High blocked transactions, high table locks
  Cure: Review/fix transaction logic; modify DB locking strategy
Cause: Data cache too small
  Measurement: Low cache-hit ratio, high physical reads
  Cure: Increase cache size
Cause: Out-of-date statistics
  Measurement: High table scan blocks; many slow functions
  Cure: Rerun optimizer statistics
Cause: Deadlocks
  Measurement: Non-zero deadlocks / errors in error log
  Cure: Fix application transaction code
Cause: Other
  Measurement: Inefficient access method; too many DB connections; small comm buffers; ...
  Cure: Pinpoint and correct!

11 Database Server Causes
- Inefficient SQL statement: 24%
- Inefficient SQL query model: 17%
- Inefficient DB configuration: 14%
- High row-at-a-time logic: 12%
- Missing indexes: 9%
- Inefficient concurrency model: 7%
- Query cache too small: 7%
- Data cache too small: 5%
- Other: 5%
~60% of the time it's bad SQL or bad indexes!
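The DB server thresholds above (cache hit ratio in the 90-95% range, parse-to-execute ratio under 20%, zero deadlocks) can be turned into a small automated check to run after each load test. The following is a minimal sketch only: the `DbSample` fields, the helper names, and the cache-hit formula (share of logical reads that needed no physical read) are assumptions made for illustration, not the output format of LoadRunner or any other monitoring tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DbSample:
    """Counters collected over one test run; all field names are hypothetical."""
    logical_reads: int   # block reads requested by queries
    physical_reads: int  # block reads that had to go to disk
    parses: int          # queries parsed
    executes: int        # queries executed
    deadlocks: int       # deadlocks observed during the run


def cache_hit_ratio(s: DbSample) -> float:
    # Assumed formula: fraction of logical reads served without disk I/O.
    if s.logical_reads == 0:
        return 1.0
    return 1.0 - s.physical_reads / s.logical_reads


def parse_to_execute_ratio(s: DbSample) -> float:
    if s.executes == 0:
        return 0.0
    return s.parses / s.executes


def diagnose(s: DbSample) -> List[str]:
    """Map out-of-range measurements to candidate causes and cures."""
    findings = []
    if cache_hit_ratio(s) < 0.90:          # should be in the 90-95% range
        findings.append("Data cache too small: increase cache size")
    if parse_to_execute_ratio(s) >= 0.20:  # should be low (<20%)
        findings.append("Query plan cache too small or stale optimizer statistics")
    if s.deadlocks > 0:                    # should be zero at target load
        findings.append("Deadlocks: fix application transaction code")
    return findings


sample = DbSample(logical_reads=10_000, physical_reads=2_500,
                  parses=300, executes=1_000, deadlocks=0)
for finding in diagnose(sample):
    print(finding)
```

A check like this only narrows the list of suspects; the measurements still have to be compared run-to-run and confirmed by the DBA.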
12 Example: B2B Supply Chain Management
Environment: Web Server Sun E220 (Apache), App Server Sun E420 (WebLogic), DB Server Sun E420 (Oracle)
Symptom:
- Transactions that return list data run very slowly; they don't scale
Measurement (using LR Oracle Monitor):
- High table scan blocks
- Low index fast full scans
Cure:
- Add additional indexes
- Design indexes so queries can be resolved with index table columns without accessing the base table
- Enable fast scan Oracle parameter

13 LR Oracle Monitor
- Table scan blocks average = 12
- Index fast full scans = 0

14 App Server (Simple) Anatomy
[Diagram: Client Requests arrive via the Web Server, which gets html pages back from the App Server (usually two; e.g. Sun 420, dual CPU, 1 GB memory); the App Server exchanges SQL and Data with the DB Server. App Server components shown: Connection Mgr, Presentation Manager, Presentation Logic, Business Logic, Object Cache, Security Mgr, Transaction Mgr, DB Conn. Mgr, Messaging Mgr, Communic. Mgr.]

15 Key App Server Measurements
Measurement: Impact/Range
- Server CPU: Shows raw horsepower consumption on the server; should average 70-80%; else add CPUs!
- Server Page Faults/s: Should be low and constant; otherwise the server is doing virtual-memory disk I/O, which indicates insufficient memory allocated to App Server processes
- Cache Hit Ratios: Should be in the high 90% range; else data/object caches are sized too low and there is too much physical I/O
- App Server memory: Memory should rise as active sessions grow, shrink in each garbage collection cycle, and stabilize at target load at 70% average; else possible memory leak, or add memory
- Active/Total Sessions: Should rise as load increases, stabilize at target load, and approximate the vendor target per instance; else, decrease inactive session keep-alive time
- SSL transactions/sec: Should be a relatively low ratio vs. non-secure transactions (<15%?); else, eating up CPU and bandwidth
- Requests/second: A general indicator of app server load as evidenced by web server request volume; should be compared run-to-run and should track with the load applied
- Active/Total DB Pool Connections: Active sessions should rise with load and stabilize at less than Total; if it does not stabilize, indicates insufficient processing power to keep up with the DB; if it maxes out, too few connections
- Server Memory: Memory should track App Server memory and stabilize at target load at 70% average; else possible memory leak, or add memory
- Application log: Should contain few or no error messages and few warnings; else indicates application problems
- Load balancing: Should see all app server instances doing a similar amount of work; else indicates a load balancing problem

16 App Server Metrics & Cures
Cause: Memory leak
  Measurement: Memory utilization rises steadily, doesn't recover
  Cure: Find and fix memory-faulty application code
Cause: Inefficient garbage collection
  Measurement: Spikes in transaction times
  Cure: Tune app server load balancing
Cause: Sub-optimal session model
  Measurement: Steadily rising active sessions
  Cure: Tune session keep-alive setting
Cause: Poorly configured App Server
  Measurement: Low correlation between App and HW resource utilization; overall poor performance
  Cure: Validate proper JVM-to-app server match; increase data & object caches; add HW memory
Cause: Insufficient hardware resources
  Measurement: High CPU, memory, I/O utilization
  Cure: Add CPUs, memory; decrease number of App Server instances
Cause: Poorly configured DB connection pool
  Measurement: Steadily rising active connections, high CPU utilization
  Cure: Raise DB connections; lower number of App Server instances
Cause: Inefficiently coded transaction
  Measurement: Slow specific business function
  Cure: Pinpoint & diagnose longest-running business processes
Cause: Inefficient security model
  Measurement: High calls on port 7002
  Cure: Review/relax app security
Cause: Inefficient object access method
  Measurement: Slow object creation
  Cure: Change object access method
Cause: Other
  Measurement: Low OS resources; erratic transaction performance
  Cure: Pinpoint and correct!

17 App Server Causes
- Memory leak: 15%
- Inefficient garbage collection: 12%
- Sub-optimal session model: 12%
- Poorly configured App Server: 12%
- Insufficient hardware resources: 10%
- Poorly configured DB connection pool: 9%
- Inefficiently coded transaction: 11%
- Inefficient DB access architecture: 4%
- Inefficient object access method: 5%
- Other: 10%
60% of the time: object caching, SQL, DB connection pool; 20% of the time: inefficient application server

18 Example: B2C Large Retail Web Store
Environment: Web Server Sun E420 (Apache), App Server Sun E420 (ATG Dynamo), DB Server Sun E4500 (Oracle)
Symptom:
- App server memory leak
Measurement:
- Steadily increasing, non-recovering memory usage in Dynamo console
- Memory exhausted and app server dies over 8-hour run
Solution:
- Test individual functions
- Isolate errant function not releasing memory
- Fix code!
- Re-test to validate fix (longevity test)

19 Web Server Metrics & Cures
Cause: Insufficient HW capacity
  Measurement: High CPU, memory, I/O; timeout errors
  Cure: Add CPUs, memory; add web servers; distribute content; add specialized servers (images, streaming media)
Cause: Poorly configured server
  Measurement: High I/O, high memory utilization, low throughput
  Cure: Tune web server configuration
Cause: Unbalanced load across servers
  Measurement: Uneven utilization across web servers
  Cure: Review/revise load balancing policies
Cause: High SSL transactions
  Measurement: Memory utilization >70%, low throughput; high port 443 calls
  Cure: Review/relax secure transaction model
Cause: Other
  Measurement: Low OS resource utilization, overall poor throughput
  Cure: Diagnose App, DB servers
Cause: Inefficient transaction design
  Measurement: High IP connections per active session
  Cure: Reduce keep-alive time; correct transaction design
Cause: Broken links
  Measurement: Broken link errors
  Cure: Diagnose / fix application
Cause: Security too tight
  Measurement: High firewall-to-web server traffic
  Cure: Direct firewall and user traffic to different ports

20 Web Server Causes
- Security too tight: 8%
- Broken links: 8%
- Inefficient transaction design: 11%
- Other: 12%
- High SSL transactions: 13%
- Unbalanced load across servers: 15%
- Poorly configured server: 15%
- Insufficient HW capacity: 18%
Major contributor: secure transactions; often: load balancing; sometimes: high-resource specialized functions (external links, email, chat)

21 Example: B2E Collaborating Communities
Environment: Cisco Load Director, Web/App Server Dell 1550 (IIS/Visual Basic), DB Server Dell 2450 (SQL Server)
Symptom:
- Slow overall performance
- DB server low activity
Measurement:
- Web/App server resources maxed out
- Non-scalable transaction times
Solution:
- Short-term: Move Chat function to dedicated server
- Long-term: Re-architect system in Java, separate Web and App tiers, introduce dedicated server for chat and email functions

22 Network Metrics & Cures
Cause: Load balancing ineffective
  Measurement: Uneven load at web servers
  Cure: Revise load balancing policy
Cause: Insufficient overall bandwidth
  Measurement: Low, maxed throughput; high collision rate
  Cure: Get hoster to raise bandwidth ceiling; increase system bandwidth; add NICs for failover functions
Cause: Security too tight
  Measurement: High traffic between firewall & servers
  Cure: Loosen security policies; redesign application security
Cause: Poorly configured/insufficient network interface cards
  Measurement: Low throughput between servers
  Cure: Tune NIC buffers; add 2nd NIC for failover heartbeat
Cause: Other
  Measurement: ???
  Cure: ???
Cause: Poor network architecture
  Measurement: High latency values in network delay monitor; low throughput
  Cure: Review/tune configuration of NICs, routers, other devices

23 Network Causes
- Load balancing ineffective: 22%
- Insufficient overall bandwidth: 13%
- Security too tight: 15%
- Poorly configured/insufficient NICs: 10%
- Other: 20%
- Poor network architecture: 20%
No single major cause; often the problem is load balancing, security, or network architecture.

24 Example: B2C On-line Printing Services
Environment: Cisco Load Director, Web Server Sun E420, App Server Sun E420, DB Server Sun E4500 (Oracle)
Symptom:
- Low transaction performance scalability under load
- High latency across load balancer
Measurement:
- Unbalanced load on web server tier
Solution:
- Replace load balancer (bad hardware)
- Change load balancer policies from IP-based to server-load based

25 Monitoring Tools
- LoadRunner
  - Transaction performance monitor
  - Server resource monitor
  - Oracle, SQL Server, selected app server monitors
  - Network delay monitor
- Database performance monitoring tools
  - Quest Oracle Instance Monitor, Embarcadero, BMC DB Patrol
- App Server System Console (from app server vendor)
- Java object monitoring tools
  - JProbe, Performasure (Sitraka)
- Network Analyzer (aka network sniffer)
- Operating system utilities
  - Unix: top, sar, vmstat, iostat
  - 2000/NT: Perfmon

26 Tool Example: WebLogic Console

27 Lessons Learned
1. 80% of the time it is the application or system software, not the infrastructure!
2. Make friends with your app server, DB server, and hardware monitoring tools!
3. Application architect, DBA, and App Server experts are indispensable and must be involved during load tests!
4. Arrive armed with the Top 10 Things to check for each component!
5. Identify the measurements you need to be able to make
6. A Systems Engineer with networking, firewall, and load balancer expertise is very handy!

28 Questions?
ddowning@mentora.com
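
The memory-leak signature from the App Server slides (memory that rises steadily and does not recover after garbage collection) can be screened for automatically during a longevity test. This is a rough heuristic sketched under stated assumptions: memory is sampled at regular intervals as a percentage, each local low point is taken to be a post-garbage-collection reading, and the function names and the `min_cycles` parameter are hypothetical rather than part of any app server console.

```python
from typing import List


def local_minima(samples: List[float]) -> List[float]:
    """Return the low points of a memory trace, assumed here to be
    the readings taken just after a garbage-collection cycle."""
    lows = []
    for i in range(1, len(samples) - 1):
        if samples[i] <= samples[i - 1] and samples[i] < samples[i + 1]:
            lows.append(samples[i])
    return lows


def looks_like_leak(samples: List[float], min_cycles: int = 3) -> bool:
    """Flag a trace whose post-collection low points keep rising.

    Healthy memory use shrinks back to a similar floor after each
    collection; a leak shows up as a floor that climbs every cycle.
    """
    lows = local_minima(samples)
    if len(lows) < min_cycles:
        return False  # not enough cycles observed to judge
    return all(later > earlier for earlier, later in zip(lows, lows[1:]))


# Memory utilization (%) sampled during two hypothetical runs.
leaking = [40, 55, 45, 60, 50, 65, 55, 70]
healthy = [40, 55, 42, 56, 41, 57, 42, 55]

print(looks_like_leak(leaking))
print(looks_like_leak(healthy))
```

A flagged run is only a prompt to test individual functions and isolate the one that is not releasing memory; the heuristic cannot tell a leak from a cache that is still warming up.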