
A TECHNICAL LOOK INSIDE SYBASE'S ADAPTIVE SERVER ENTERPRISE 15.0
X: UNDERSTANDING TASK MANAGEMENT AND SCHEDULING IN THE ASE KERNEL

September, 2006 - Version 1.0

Written by: Peter F. Thawley
Senior Director / Architect, Technology Evangelism
Sybase ITSG Engineering
peter.thawley@sybase.com

A key concern of IT organizations is predicting their platform's capacity. As a business grows, its systems must scale to support additional users, applications of varying types, and increasing data volumes. To fully understand Adaptive Server Enterprise's (ASE) capacity in a specific environment, one must understand, and maximize, the efficiency with which the various applications' users consume the resources ASE uses to provide its services. This paper introduces the methods ASE uses to efficiently manage the many users requesting database services. These methods, commonly referred to as task management, control how Adaptive Server shares its CPU time across different users as well as how system services such as database locking, disk I/O, and network I/O interact with these mechanisms.


Table of Contents

Overview
ASE Design Principles
    ASE's Virtual Server Architecture
Data Structures to Manage Tasks
    User Tasks
    Run Queues
    Sleep Queues
    Pending Disk I/O Queues
    Pending Network I/O Queues
    Lock Chains
Knowing When to Do What
    Keeping Time by Counting Clock Ticks
    To Yield or Not to Yield... that is the question
Conclusion


Table of Figures

Figure 1  Important Data Structures Used in ASE's Task Management
Figure 2  Data Structures Representing User Tasks
Figure 3  Task Priorities in ASE
Figure 4  Partitioning CPU Capacity into Engine Groups
Figure 5  Task Selection Using Engine Groups
Figure 6  ASE's Lock Chains
Figure 7  ASE's Lock Compatibility Matrix


OVERVIEW
Relational databases share one inherent similarity with operating systems: each is required to respond to the requests of hundreds, if not thousands, of simultaneous users. This requirement constantly challenges systems (i.e., the hardware, operating system, and RDBMS) to efficiently and equitably share the resources needed to provide these services. Different RDBMS vendors respond to this challenge differently.

At one extreme, Oracle's predominantly process-based architecture defers this challenge to the operating system and hardware by representing users as separate instances of the Oracle kernel. While certainly simple, this approach forces a general-purpose operating system to manage resources such as memory and CPU scheduling on behalf of specialized database processing, and it results in significant resource consumption to provide these services.

At the other extreme, Sybase took the time to build a database that could be as efficient as possible with resources such as memory and CPU by building a multi-threaded kernel. Since operating systems of the late '80s and early '90s did not provide threads, Sybase built its own threads package to minimize both memory and CPU consumption. This implies ASE has complete responsibility for nearly all aspects of multi-user database services, such as sharing the finite amount of CPU time allotted to it by the operating system across many database users. Therefore, how the ASE kernel chooses to share its resources impacts both application performance and the system's total capacity. An inequitable allocation of computing resources can lead to great performance for some users at the expense of others!

One of the first steps in understanding a system's capacity is to understand how user tasks consume key system resources and which system resources constrain performance. To do this, one needs to understand the context switch behavior of user tasks, that is, how and why user tasks start and stop execution. This step, vital to the final tuning and capacity planning of a system, allows you to understand the relationship between the components of the system and the standard performance metrics of throughput (i.e., transactions per second) and response time. This paper introduces the concepts and algorithms of task management used in ASE, so let's begin by reviewing the design principles under which ASE has been built.


ASE DESIGN PRINCIPLES


One of ASE's principal and ongoing design goals since its first release has been efficient utilization of hardware and operating system resources. For this reason, and because most operating systems did not yet implement operating-system-level threading, Sybase designed a database kernel that implemented its own threading model. This provides customers with a database having a very low memory footprint, since user tasks are represented within ASE as internal data structures rather than as the more memory-intensive operating system processes. Consequently, the database kernel performs many of the jobs typically found in the operating system, such as scheduling and dispatching the execution of user tasks.

ASE's Virtual Server Architecture


In version 4.8, when the product was known as Sybase SQL Server, Symmetric Multi-Processor (SMP) support was added to ASE. Known as the Virtual Server Architecture (VSA), this design built upon Sybase's efficient resource utilization by introducing the concept of Database Engines. Database Engines are instances of ASE which each service user requests and act upon common user databases and internal structures such as data caches and lock chains. The engines are fundamentally identical since they each perform the same types of user and system tasks, such as searching the data cache, issuing disk I/O read and write requests, and requesting and releasing locks. This approach was chosen because it offered a fully symmetric approach to database processing. To better understand these concepts, let's begin by looking at the most important data structures within ASE.


DATA STRUCTURES TO MANAGE TASKS


Databases are complex pieces of software requiring extensive use of internal data structures to provide the foundation for reliable database services. While a complete discussion of all of SQL Server's internal data structures is clearly beyond the scope of this paper, there are a number of important data structures which must be understood to better comprehend how task management is accomplished. Please refer to Figure 1 below for a graphical picture of these structures as we review each of them in detail.
Figure 1  Important Data Structures Used in ASE's Task Management

User Tasks
User tasks executing within ASE are represented by data structures (depicted as yellow triangles in Figure 1 above). Users connecting to the database do not use any operating system resources other than the network socket and file descriptor used to communicate. Instead, when an application connects to ASE, several internal data structures are allocated within the shared memory region the database server obtained when it initially started up. Some of these structures are connection-oriented, such as the well-known Process Status Structure (PSS), which contains static information about a particular user. Other structures are command-oriented. For example, when an application sends a command to ASE, the executable query plan is physically stored in an internal data structure. It is the representation of user tasks as these various internal data structures that makes ASE's task management model so lightweight in both memory and context switch overhead when compared to other methods.
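To make these structures concrete, here is a minimal C sketch of what a connection-oriented structure like the PSS and a command-oriented structure might contain. The field names, flag values, and layout are purely illustrative assumptions for this paper's discussion, not ASE's actual definitions; the examples in later sections reuse this same sketch.

    #include <stddef.h>

    /* Illustrative sketch only -- names and layout are assumptions,
     * not ASE's actual internal definitions. */

    /* status flag bits (illustrative) */
    #define TASK_RUNNABLE  0x01
    #define TASK_SLEEPING  0x02

    typedef struct pss {
        int          spid;           /* server process id for this connection */
        unsigned int status;         /* runnable, sleeping, etc.              */
        int          priority;       /* execution class priority (0 = high)   */
        int          exec_ticks;     /* remaining time-slice clock ticks      */
        unsigned int engine_bitmap;  /* engines this task may run on          */
        int          net_engine;     /* engine owning this task's network I/O */
        struct pss  *next;           /* link for run/sleep queue chaining     */
    } PSS;

    typedef struct cmd {
        PSS    *owner;       /* connection that issued the command */
        void   *query_plan;  /* compiled, executable query plan    */
        size_t  plan_bytes;  /* size of the plan in shared memory  */
    } CMD;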


Figure 2 Data Structures Representing User Tasks

Figure 2 above shows the two basic data structures used to represent user tasks within ASE. With this model, task management is essentially an exercise in moving the kernel's task data structures between an engine (i.e., the OS process where the commands are physically executed) and one of two other structures, the Run Queues and the Sleep Queue, which keep track of user tasks waiting for the appropriate resource.

Run Queues
User tasks simply waiting for their turn to begin or resume execution on an engine are stored in a data structure known as a Run Queue. It is important to understand that tasks on a Run Queue all have a "runnable" status. You can find out which tasks are on a Run Queue by querying the sysprocesses table or running the sp_who system procedure and looking for tasks with this status. Run Queues are implemented as FIFO-based (i.e., First In, First Out) linked lists of PSS task structures.

As part of the ASE 11.5 release, Sybase added support for user task priority through the notion of Execution Classes. Currently, ASE uses eight different execution priorities to service the different types of user and system services it must perform. As shown below in Figure 3, user tasks generally fall into one of three priorities (High, Medium, and Low), with most system services (except the Housekeeper) being scheduled less frequently but at a higher priority. Each priority is implemented as a separate Run Queue to minimize synchronization contention on SMP systems and consequently speed task selection.
Figure 3 Task Priorities in ASE
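As a rough illustration of the FIFO behavior just described, the following sketch (building on the illustrative PSS above) keeps one linked-list queue per priority. The queue count matches the eight priorities mentioned above, but the names and the omission of the SMP spinlocks are simplifications.

    #define NUM_PRIORITIES 8    /* one FIFO queue per execution priority */

    typedef struct run_queue {
        PSS *head;              /* next task to dispatch (oldest arrival) */
        PSS *tail;              /* newest arrival                         */
    } RUN_QUEUE;

    static RUN_QUEUE run_queues[NUM_PRIORITIES];

    /* Enqueue a runnable task at the tail of its priority's queue (FIFO). */
    void runq_put(PSS *task)
    {
        RUN_QUEUE *q = &run_queues[task->priority];
        task->next = NULL;
        if (q->tail)
            q->tail->next = task;
        else
            q->head = task;
        q->tail = task;
    }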


Any time an engine is idle, it simply iterates through the different Run Queues to find the highest priority task, which will be at the top of the first non-empty Run Queue. In the case of a user task, the engine then begins (or continues) executing the steps outlined in the task's query plan. In the example in Figure 1, we see that task #9 is at the top of the higher priority Run Queue and consequently will be the next task to execute. By default, there is no affinity between a task and the engine on which it executes, so whichever engine happens to become idle first will be the engine that grabs task #9.

Another capability provided by the execution class feature which also affects scheduling and task management is the notion of Engine Groups. This provides a method by which CPU capacity can be partitioned into distinct groups and bound to application-specific services in order to more effectively manage CPU resources and help applications predictably meet their service level agreements to the lines of business. Figure 4 below depicts a simple configuration where two Engine Groups are used to separate OLTP from DSS applications.
Figure 4 Partitioning CPU Capacity into Engine Groups

As noted above, however, the use of Engine Groups affects the scheduler's decision about which task to run. When an engine tries to run a task, it must quickly check that the task is bound to an engine group that includes that engine. To do so, each engine verifies that the task is allowed to run on it by ANDing its own bitmap with one, contained in the PSS, specific to the Engine Group with which the task is currently associated. As shown in Figure 5 below (and in the sketch that follows), this may mean an engine must examine a few tasks before selecting one it can run.
Figure 5 Task Selection Using Engine Groups
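Continuing the illustrative structures above, the engine's selection loop might look like the following: scan the Run Queues from highest to lowest priority and, within each queue, AND the task's Engine Group bitmap against the engine's own bitmap, skipping tasks that are not allowed to run here. Queue synchronization is again omitted; this is a sketch of the decision logic, not ASE's actual code.

    /* Sketch: pick the next runnable task for this engine.
     * engine_mask has a single bit set identifying this engine. */
    PSS *pick_next_task(unsigned int engine_mask)
    {
        for (int pri = 0; pri < NUM_PRIORITIES; pri++) {   /* 0 = highest */
            PSS *prev = NULL;
            for (PSS *t = run_queues[pri].head; t; prev = t, t = t->next) {
                /* Engine Group check: may this task run on this engine? */
                if (t->engine_bitmap & engine_mask) {
                    /* Unlink t from the FIFO and hand it to the engine. */
                    if (prev)
                        prev->next = t->next;
                    else
                        run_queues[pri].head = t->next;
                    if (run_queues[pri].tail == t)
                        run_queues[pri].tail = prev;
                    return t;
                }
            }
        }
        return NULL;   /* nothing eligible: the engine stays idle */
    }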


Sleep Queues
During normal database processing, it is common for a task to require a resource that is not immediately available, such as a page from disk or a database lock on a specific row. In situations such as these, where the time needed to obtain the resource is either non-deterministic (e.g., a lock whose release depends on another user's transaction) or too long (e.g., a physical disk I/O taking 8 or more milliseconds), it would degrade the throughput of a multi-user system to force an engine to wait idly until that resource became available. Consequently, in order to sustain multi-user throughput, ASE generally utilizes asynchronous techniques when resources are unavailable.

The cornerstone of these asynchronous techniques is a data structure known as the Sleep Queue. The Sleep Queue is essentially a hash table of user task structures. Whenever a task can't obtain a required resource, it is put to sleep on the Sleep Queue by hashing the user task structure on its SPID value. The task will only be woken up upon obtaining the resource for which it is sleeping. Typically, obtaining that resource is dependent on some other event. For example, a task sleeping on a page lock will be woken only when that lock has been granted to it. Once woken, the task is placed at the bottom of the Run Queue to wait its turn to resume execution on an engine.

This technique of putting tasks to sleep on unavailable resources obviously requires the capability to recognize when the resource becomes available so that the appropriate task can be woken. This is achieved through a few additional data structures, depending on the type of resource (i.e., disk I/O, network I/O, lock, etc.), described in the sections that follow.
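A minimal sketch of the Sleep Queue as a hash table keyed on SPID, reusing the illustrative PSS and runq_put from earlier. The bucket count is an assumption, and the real kernel also records what event each task is sleeping on; only the put-to-sleep and wake-up mechanics are shown.

    #define SLEEPQ_BUCKETS 256    /* illustrative bucket count */

    static PSS *sleep_queue[SLEEPQ_BUCKETS];

    /* Put a task to sleep: hash on SPID and chain it into the bucket. */
    void sleepq_sleep(PSS *task)
    {
        int bucket = task->spid % SLEEPQ_BUCKETS;
        task->status = TASK_SLEEPING;
        task->next = sleep_queue[bucket];
        sleep_queue[bucket] = task;
    }

    /* Wake a task: unhash it and place it at the bottom of its Run Queue. */
    void sleepq_wake(int spid)
    {
        int bucket = spid % SLEEPQ_BUCKETS;
        for (PSS **pp = &sleep_queue[bucket]; *pp; pp = &(*pp)->next) {
            if ((*pp)->spid == spid) {
                PSS *task = *pp;
                *pp = task->next;             /* remove from sleep queue */
                task->status = TASK_RUNNABLE;
                runq_put(task);               /* tail of its run queue   */
                return;
            }
        }
    }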

Pending Disk I/O Queues


Physical disk I/O is relatively expensive in computer terms, with even today's fastest magnetic disks taking 6-8 milliseconds to access data. Consider all the instructions a CPU can execute in that amount of time! Consequently, in order to maximize system throughput, ASE uses asynchronous I/O whenever possible. In this case, when a task needs to do a physical I/O, the engine on which the task is running first issues the I/O request to the operating system and then puts the task to sleep to wait for the data to be returned. Since the asynchronous I/O completion event is processed at some future point in time, we need a mechanism to match the I/O being returned by the OS with the task that initiated it so that the right task can be woken! This is achieved through Pending Disk I/O Queues.

In order to match completed asynchronous I/Os to the task which initiated them, the ASE kernel uses a structure called a Disk I/O structure. As the engine prepares to initiate a physical disk I/O, it first obtains and saves some information about the I/O it is about to request. For example, the task's SPID, the device, the logical and physical address, and the number of bytes to read or write are all saved in this structure. The Disk I/O structure is then linked into a list of Pending Disk I/Os. At this point, the engine issues the physical request to the OS and puts the task to sleep until the I/O is returned to ASE. When the OS returns the data to ASE at some point in the future, the corresponding Disk I/O structure is retrieved from the Pending Disk I/O Queue so that the appropriate task can be woken by moving it off the Sleep Queue onto the Run Queue. We'll explore how and when this I/O completion processing occurs later in this paper. Network I/O behaves very similarly.
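Continuing the sketch, the pending-disk-I/O bookkeeping might look like the following. Here os_async_read stands in for whatever asynchronous read facility the platform provides; it and the structure's fields are assumptions following the description above, not actual ASE or OS calls.

    typedef struct disk_io {
        int             spid;     /* task that initiated the I/O        */
        int             device;   /* database device id                 */
        long            lpage;    /* logical page address               */
        long            paddr;    /* physical address on the device     */
        size_t          nbytes;   /* bytes to read or write             */
        struct disk_io *next;     /* link on the Pending Disk I/O Queue */
    } DISK_IO;

    extern void os_async_read(DISK_IO *io);  /* assumed async-read wrapper */

    static DISK_IO *pending_disk_io;         /* head of the pending queue */

    /* Issue an async read on behalf of a task, then put the task to sleep.
     * The caller has already filled in the device/address/size fields. */
    void start_disk_read(PSS *task, DISK_IO *io)
    {
        io->spid = task->spid;
        io->next = pending_disk_io;     /* link onto the pending queue */
        pending_disk_io = io;
        os_async_read(io);
        sleepq_sleep(task);             /* wait for completion         */
    }

    /* Called when the OS signals completion: find the matching entry,
     * unlink it, and wake the initiating task. */
    void disk_io_complete(DISK_IO *done)
    {
        for (DISK_IO **pp = &pending_disk_io; *pp; pp = &(*pp)->next) {
            if (*pp == done) {
                *pp = done->next;           /* remove from pending queue */
                sleepq_wake(done->spid);    /* back onto the run queue   */
                return;
            }
        }
    }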

Pending Network I/O Queues


Network connections using standard transport-layer protocols such as TCP are the means by which client applications and ASE communicate. This connection is initially established at login time between the client application and an ASE Listener Service. In version 11, a feature called Multiple Network Engines distributed the networking responsibility across all engines, improving performance (both throughput and response time) and scalability by migrating each connection to the least busy engine at login time. Since there is no affinity between tasks and engines, a task will likely be executing on a different engine than the one assigned networking responsibility for it! Therefore, much like disk I/O above, ASE uses Pending Network I/O Queues to manage tasks requiring network services.

When a task needs to send results back to the client application, it first obtains and saves information about the network I/O in a structure called a Network I/O structure. For example, the task's SPID and a pointer to the TDS (i.e., Tabular Data Stream) buffer containing the data to send across the network are saved in this structure. The Network I/O structure is then linked onto a Pending Network I/O Queue and the task is put to sleep waiting for its Network Engine to actually perform the network send. When it is time for this task's Network Engine to send all its accumulated network I/O (this will be covered in detail in a subsequent section), the Network I/O structure is retrieved, the network send is physically requested of the OS, and finally, the task is woken by moving it off the Sleep Queue onto the bottom of the Run Queue. As an optimization, if the engine executing a task happens to also be that task's Network Engine, the engine sends the TDS packet immediately and the task continues to execute rather than being put to sleep, as sketched below.
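A sketch of that send path, including the same-engine optimization. Here os_net_send is an assumed stand-in for the platform's send call, and the queue handling is reduced to its essentials.

    typedef struct net_io {
        int            spid;      /* task sending the results              */
        void          *tds_buf;   /* TDS buffer holding the result packet  */
        size_t         nbytes;    /* bytes to send                         */
        struct net_io *next;      /* link on the Pending Network I/O Queue */
    } NET_IO;

    extern void os_net_send(void *buf, size_t nbytes);  /* assumed wrapper */

    static NET_IO *pending_net_io;

    /* Send results to the client from the engine currently running `task`. */
    void send_results(PSS *task, NET_IO *io, int this_engine)
    {
        if (task->net_engine == this_engine) {
            /* Optimization: we ARE the task's Network Engine --
             * send immediately and keep running; no sleep needed. */
            os_net_send(io->tds_buf, io->nbytes);
            return;
        }
        /* Otherwise queue the I/O for the owning Network Engine and sleep. */
        io->spid = task->spid;
        io->next = pending_net_io;
        pending_net_io = io;
        sleepq_sleep(task);   /* woken after the Network Engine does the send */
    }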

Lock Chains
One of the more obvious uses of asynchronous techniques in database processing is in the area of concurrency (i.e., database locking). Since lock duration is non-deterministic from ASE's perspective, it is clearly not in the system's best interest for a task to sit idly on an engine until the requested lock becomes available. Therefore, when a task requests a lock on an object that is unavailable because another task already holds it, the task is put to sleep until that lock is granted to it. As in the previous cases, ASE needs a mechanism to wake the appropriate task when the lock becomes available. This mechanism, however, is more complex than the others, since multiple tasks may need to be woken for the same lock (e.g., for a SHARE lock on the same page or row). To complicate matters, these tasks must be woken in the order in which the requests were made!
To achieve this, ASE uses a collection of structures called Lock Chains, implemented as hash tables containing two-dimensional linked lists of two different structures: Semawait structures and Lock Request structures. Please refer to Figure 6 below for a graphical picture.
Figure 6  ASE's Lock Chains

Lock Request structures are allocated by each user task requesting a lock and are used both to match the correct user tasks to wake and to record the order in which tasks should be woken when the lock becomes available. Information such as the user's SPID is stored in this structure. The Semawait structures are used to link multiple Lock Requests (i.e., for different users) waiting for compatible locks on the same object or page.

When a user requests a lock, the task obtains a Lock Request structure from a pool of available structures for the engine on which it is executing. If this engine's pool is empty, structures are moved from the server's global pool. Once obtained, information is saved in this structure such as the user's SPID, the object id of the object being locked, the type of lock (e.g., shared, exclusive, etc.), the granularity (e.g., row, page, table, etc.), and, if applicable, the page number.

The engine now determines if this request can be granted immediately. This is done by hashing the unique id of the requested object (row, page, or table) to determine whether there is a Semawait structure for this object. If no Semawait structure is found, no one is currently holding a lock on this object. Therefore, the lock is granted immediately by creating a Semawait structure and linking it into the hash table so subsequent users requesting a lock on that object can see it. Finally, the Lock Request structure is linked to the newly created Semawait and the user continues execution on the engine.

If a Semawait structure was found in the above search, one or more users already hold a lock on this object or page. At this point ASE must determine whether this is a compatible lock request or not. Compatibility is determined by whether the two locks can co-exist with each other. For example, multiple SHARE locks co-exist since multiple users can lock the same page at the same time. UPDATE locks, on the other hand, have somewhat mixed behavior. Figure 7 outlines the lock compatibility matrix used by ASE to make this decision.

Figure 7  ASE's Lock Compatibility Matrix

Lock requested    Held: SHARE       Held: UPDATE      Held: EXCLUSIVE
SHARE             Compatible        Compatible        Not Compatible
UPDATE            Compatible        Not Compatible    Not Compatible
EXCLUSIVE         Not Compatible    Not Compatible    Not Compatible

If the locks are compatible, the lock is granted by linking this user's Lock Request structure onto the Semawait structure, and the user continues execution on the engine. If the locks are not compatible, the kernel finds the first Semawait structure that is compatible and links the Lock Request structure to it. Since another user (whose Lock Request structures are linked to the first Semawait) currently holds that lock, the requesting task is put to sleep to wait for the lock by placing it onto the Sleep Queue, as sketched below.
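A simplified sketch of this grant-or-sleep decision. The hash lookup, Semawait creation, and request linking are assumed helpers (semawait_lookup, semawait_create, link_request), and the chain-of-Semawaits detail is collapsed into a single structure per object; only the compatibility-driven decision from Figure 7 is shown.

    typedef enum { LOCK_SHARE, LOCK_UPDATE, LOCK_EXCLUSIVE } LOCK_TYPE;

    /* Figure 7's matrix: compat[held][requested] */
    static const int compat[3][3] = {
        /*                SHARE  UPDATE  EXCLUSIVE */
        /* SHARE     */ {   1,     1,      0 },
        /* UPDATE    */ {   1,     0,      0 },
        /* EXCLUSIVE */ {   0,     0,      0 },
    };

    typedef struct lock_request {
        int                  spid;
        LOCK_TYPE            type;
        struct lock_request *next;    /* FIFO order of requests */
    } LOCK_REQUEST;

    typedef struct semawait {
        long             object_id;   /* hashed key: row/page/table id */
        LOCK_TYPE        held_type;   /* strongest lock held here      */
        LOCK_REQUEST    *requests;    /* granted + waiting requests    */
        struct semawait *next;        /* hash-bucket chain             */
    } SEMAWAIT;

    extern SEMAWAIT *semawait_lookup(long object_id);              /* assumed */
    extern SEMAWAIT *semawait_create(long object_id, LOCK_TYPE t); /* assumed */
    extern void link_request(SEMAWAIT *sw, LOCK_REQUEST *r);       /* assumed:
                                                     appends, preserving FIFO */

    /* Request a lock: grant immediately, grant compatibly, or sleep. */
    void lock_object(PSS *task, long object_id, LOCK_TYPE type, LOCK_REQUEST *req)
    {
        req->spid = task->spid;
        req->type = type;
        SEMAWAIT *sw = semawait_lookup(object_id);
        if (sw == NULL) {
            sw = semawait_create(object_id, type);  /* no holder: grant now */
            link_request(sw, req);
            return;                                 /* keep running         */
        }
        link_request(sw, req);                      /* record FIFO position */
        if (!compat[sw->held_type][type])
            sleepq_sleep(task);                     /* wait for the grant   */
    }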


KNOWING WHEN TO DO WHAT


As a multi-threaded database that provides its own system services on behalf of user tasks, ASE must keep track of time for many reasons. Predominantly, we need a mechanism to decide when to perform periodic system activities, such as processing completed asynchronous disk and network I/Os on a regular basis. In addition, ASE uses a non-preemptive scheduler to retain complete control over when a task schedules off an engine, ensuring tasks don't go to sleep holding critical shared resources such as latches or spinlocks. This places an additional requirement on the kernel: ensuring tasks don't run too long. Unlike business applications, databases can't just glance at a watch to find out what time it is; high-performance systems software like databases relies on relative time intervals to figure out when to do something.

Keeping Time by Counting Clock Ticks


The configuration parameter "sql server clock tick length" is how ASE keeps time within each engine. It defines a time interval, expressed in microseconds, at which the operating system periodically interrupts the engine to let it know that a complete time interval has elapsed. Platforms use an optimized mechanism, such as signals, delivered at a frequency matching this time interval. The signal handler for each engine is responsible for suspending the task currently running on that engine, performing some "run-time accounting housework", and then resuming execution of the suspended task. Each engine, being a separate process under the OS, sets up, receives, and handles its own interrupt, which is important since each engine does its own scheduling. Obviously, this mechanism provides a relatively coarse-grained, but highly efficient, way to keep track of time. It is therefore an obvious design choice for deciding when certain system tasks need to execute. However, our non-preemptive scheduler introduces a few wrinkles in order to make sure the moment chosen to physically run these system services is in fact the best one.
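As a generic illustration of the mechanism, a per-engine clock tick might be armed along the following lines. This is a plain POSIX interval-timer sketch, not ASE's platform-specific code.

    #include <signal.h>
    #include <sys/time.h>

    static volatile sig_atomic_t clock_ticks;   /* ticks seen by this engine */

    /* Invoked every "sql server clock tick length" microseconds. */
    static void tick_handler(int sig)
    {
        (void)sig;
        clock_ticks++;    /* the real handler also does run-time accounting */
    }

    /* Arm the per-engine timer; tick_usec mirrors the config parameter. */
    void start_engine_clock(long tick_usec)
    {
        struct sigaction sa;
        sa.sa_handler = tick_handler;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = SA_RESTART;       /* don't abort interrupted syscalls */
        sigaction(SIGALRM, &sa, NULL);

        struct itimerval it;
        it.it_interval.tv_sec  = tick_usec / 1000000;
        it.it_interval.tv_usec = tick_usec % 1000000;
        it.it_value = it.it_interval;   /* first tick after one interval */
        setitimer(ITIMER_REAL, &it, NULL);
    }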

To Yield or Not to Yield... that is the question


As noted above, a non-preemptive design implies that the code knows best when it is a good idea for one task to relinquish control of an engine in order to provide equitable service to large numbers of users. Although most tasks block relatively frequently on some resource, such as a disk or network I/O, causing them to be scheduled off an engine, our coarse-grained timing method can make it challenging for CPU-intensive workloads to yield the CPU often enough to ensure reasonable sharing of CPU resources. Like operating systems, ASE has the notion of a time quantum. The "time slice" parameter defines the maximum amount of time a task gets before it becomes a candidate to voluntarily yield the engine. Although one configures time slice in milliseconds, the ASE kernel actually converts it to clock ticks.


Each time a task is taken from the Run Queue by an engine and starts executing, its private execution time counter is set to "time slice" number of clock ticks and the task's execution begins. The task continues to execute on that engine until either:

- it requests a physical I/O (either disk or network);
- it blocks waiting for a lock or some other shared resource;
- it exceeds its time slice and the task's execution path hits a yield point in the code; or
- it exceeds the maximum time allowed on an engine without yielding, as governed by the "cpu grace time" parameter.

As you can see, tasks that do a lot of physical disk or network I/Os, or that often block on locks or other resources, spend significantly less than a single time slice executing on an engine. So the only real question left is: when a very CPU-intensive request occurs, how does ASE determine how long it has been running and when it should yield the engine (CPU) to another task?

Each time an engine receives the interrupt from the OS telling it that a "clock tick" time period has expired, the engine suspends the current task and decrements the task's private execution counter by 1 (again, in units of clock ticks). If the task's private execution counter is less than zero, the engine sets a bit in the task's private status structure that marks it as available to yield; all long code and loop paths contain checks to see whether it is time for the task to yield! The engine then performs the chores mentioned earlier and, once complete, continues executing the same task.

The check for an execution counter < 0 may seem puzzling, since the counter was initialized to 1 when the task began executing on the engine (with the default configuration, a time slice equals exactly one clock tick). Remember, though, that the clock tick interrupt is the only way we keep time. A task could have begun executing 75% of the way through a clock tick interval; if we marked it yield-able at the first interrupt, the task would have gotten only 25% of a time slice. Since we want the task to get at least a full time slice, we mark it yield-able when the counter falls below zero (i.e., two interrupts have been processed in this case), so that we know the task got at least one full time slice. Since, on average, most tasks block on I/O, locks, etc. earlier than a full time slice, this is rarely a problem.

Occasionally, a CPU-bound task will begin execution on an engine and, due to its nature or, more often, system calls that don't return, continue to execute longer than its time slice. For these rare occasions, we use the "cpu grace time" parameter to prevent the task from executing forever. The "cpu grace time" parameter is defined in units of clock ticks. During interrupt handling, if the engine recognizes that the task's private execution counter is equal to -(cpu grace time + time slice in ticks), the task is assumed to be in an infinite loop and is terminated with the timeslice error. The number reported in this error (e.g., -201) is actually the counter value, i.e., the negative of the total number of clock tick periods the task consumed. The sketch below puts these pieces together.
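A hedged sketch of the per-tick accounting and the in-code yield check, reusing the illustrative PSS from earlier. TASK_YIELD_PLEASE, timeslice_error, and engine_schedule are names invented for this sketch, not ASE's.

    #define TASK_YIELD_PLEASE 0x04   /* illustrative status bit */

    /* Configuration in clock ticks; the values shown reflect the documented
     * defaults (a 100 ms time slice equals one 100,000 us tick; 200 ticks
     * of grace). */
    static int time_slice_ticks = 1;
    static int cpu_grace_ticks  = 200;

    extern void timeslice_error(PSS *task);  /* assumed: kills runaway task */
    extern void engine_schedule(void);       /* assumed: pick the next task */

    /* Called from the engine's tick handler for the currently running task. */
    void account_tick(PSS *task)
    {
        task->exec_ticks--;                      /* one more tick consumed */
        if (task->exec_ticks < 0)
            task->status |= TASK_YIELD_PLEASE;   /* mark yield-able        */
        if (task->exec_ticks == -(cpu_grace_ticks + time_slice_ticks))
            timeslice_error(task);               /* e.g., -201 by default  */
    }

    /* Sprinkled through long code paths and loops: yield if asked to. */
    void yield_if_needed(PSS *task)
    {
        if (task->status & TASK_YIELD_PLEASE) {
            task->status &= ~TASK_YIELD_PLEASE;
            task->exec_ticks = time_slice_ticks;  /* fresh quantum next time */
            runq_put(task);                       /* back of its run queue   */
            engine_schedule();                    /* run someone else        */
        }
    }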
One has to be very careful about changing these parameters. Changing "sql server clock tick length", for example, is generally not advised; it is like opening up your Sun server and plugging in a 3.0GHz crystal because you want it to run as fast as an AMD Opteron chip. Some parameters have dependencies that could leave the server mis-configured if you don't understand the relationships. For example, since "cpu grace time" is in units of clock ticks, if you halve "sql server clock tick length" (say, from 100,000 to 50,000 microseconds) but leave "cpu grace time" the same, you've just halved the wall-clock duration of the grace period. You would have to double "cpu grace time" to 400 in this example to maintain the default wall-clock grace period of 20 seconds.
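The relationship is simple enough to capture in one helper, shown here only to make the units explicit.

    /* Wall-clock grace period = tick length (us) * grace ticks.
     * Defaults: 100000 us * 200 ticks = 20,000,000 us = 20 seconds.
     * Halving the tick to 50000 us requires 400 ticks to keep 20 s. */
    double grace_seconds(long tick_usec, int grace_ticks)
    {
        return (tick_usec * (double)grace_ticks) / 1e6;
    }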

CONCLUSION
Although the SQL that developers write and the physical data model the DBA builds are clearly the two most dominant factors in a system's performance and capacity, some business requirements dictate a detailed understanding of database processing in order to fine-tune systems. For example, tuning a system for real-time performance, where query response times often can't exceed 10-20 milliseconds, requires significantly different approaches than tuning a system doing large, complex query processing. As businesses continue to strive to improve the resource efficiency of their hardware and software systems, DBAs are increasingly being asked to reach a little farther into their bag of tricks. It is with these considerations in mind that understanding task management and scheduling in the ASE kernel becomes vital to making the informed decisions necessary to squeeze every last bit of performance and capacity from your systems.

© 2006 Sybase, Inc. All Rights Reserved.
