
System Interconnect Network & Topologies

Regular vs. Irregular

A regular network topology is defined in terms of some sort of regular graph structure (such as rings, meshes, hypercubes, etc.); an irregular topology isn't. We tend to talk more about regular topologies, since it's possible to analyze them reasonably; even discovering the topology of an irregular network can be a challenge. Regular topologies are used in applications such as parallel processors and small LANs; irregular topologies are used in larger LANs and in internets (the Internet is extraordinarily irregular, and there are even tools out there for attempting to discover the topology of the internet near your site!).

Static vs. Dynamic

There are two basic ways to construct a network: we can use the processors themselves as the routing nodes, or we can let the processors and memory sit ``outside'' the network and have specialized switching nodes transfer the messages. The former is a static network; the latter is a dynamic network (the idea is that with a static network you can only send a message to your neighbors, while in a dynamic network you can drop a message with routing information into the network and the network can get it anywhere). As it turns out, we can easily find examples of both dynamic and static networks for nearly any topology we care to come up with.

Circuit Switching vs. Packet Switching

Second, there are two basic ways to set up the communication paths in a network: we can put each packet on the net with either routing information or just information about its destination, or else we can set up all the switches once and let all the packets follow the path that was established. The first way is called packet switching; the second is circuit switching. Historically the phone system used circuit switching; now just about everything uses packet switching. Note: while just about everything today is packet-switched, the way it is normally presented to the user is through virtual circuits.

Source-based routing vs. others

We also have a choice of sending a packet along with just the destination address, leaving the network to figure out how to get the data to its destination, or actually specifying the full route in the header. The old Usenet bang paths specifying email addresses were an example of source-based routing. Today, we almost never see source-based routing; we always let the network links do the routing (note: the term ``source-based'' routing has been recycled in recent years to refer to making routing decisions based on the source of a packet. This is a completely different and unrelated use of the term, and is in fact used in an environment that is not source-based as we are using it here).

Store-and-forward vs. Wormhole routing

Normally, we think of data being shipped through a network a packet at a time: we send the packet to the first intermediate node, then on to the second, and so forth. This is called store-and-forward routing. An alternative which has become popular recently is wormhole routing. Remember that ordinarily, a packet contains a header with routing information, followed by a payload containing the actual data, probably followed by a checksum or something to guarantee integrity. This implies that once the header has arrived at a node, it's possible to make routing decisions and pass it along immediately, rather than waiting for the entire packet to arrive first. This is called wormhole routing, in analogy with a worm crawling through a wormhole. Wormhole routing dramatically reduces latency, but creates new possibilities for deadlock.

Blocking vs. non-blocking vs. rearrangeable

In a dynamic network, a question that arises is whether it's possible to realize any permutation of the sources and destinations. If not, it's a blocking network; if so, it's non-blocking. One last possibility, somewhat counterintuitive to my mind, is a network in which there is more than one possible path from a source to a destination, and you have to find the right one to take to avoid blocking. This is called a rearrangeable network, because you can rearrange your paths to fix blocking.
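To get a feel for why wormhole routing cuts latency so much, here is a small back-of-the-envelope sketch in C. It is only an illustration, not something from these notes: the numbers (an 8-hop path, a packet of 128 link-sized pieces, one time unit per link crossing) are made up, and the model ignores contention and switching overhead.

    #include <stdio.h>

    /* Rough latency comparison of store-and-forward and wormhole routing.
     * All numbers below are assumed example values.                       */
    int main(void) {
        int hops = 8;            /* links on the path (assumed)            */
        int packet_pieces = 128; /* packet length in link-sized pieces     */

        /* Store-and-forward: every node waits for the whole packet,
         * so the packet crosses each link in sequence.                    */
        int store_and_forward = hops * packet_pieces;       /* 1024 units */

        /* Wormhole: the header sets up the path one hop at a time and the
         * rest of the packet streams along behind it in a pipeline.       */
        int wormhole = hops + packet_pieces - 1;             /* 135 units  */

        printf("store-and-forward: %d units, wormhole: %d units\n",
               store_and_forward, wormhole);
        return 0;
    }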

Topologies

All of these factors (latency, cost, and bandwidth) are usually expressed in big-O notation, to maintain technological independence. So the latency is measured by the number of hops required from source to destination, and the cost is measured by the required number of switches or network interfaces. There are two ways to measure bandwidth: we can measure the aggregate bandwidth (how many hosts can simultaneously send messages) as a performance measure, or we can measure the bisection bandwidth (how many links have to break before the network is cut into two halves) as a fault-tolerance measure. I've always liked aggregate bandwidth precisely because it measures performance; I have to admit, though, that it's maybe not as interesting as bisection bandwidth, since just about every network topology has an aggregate bandwidth of O(n).

We can put some bounds on just how good any of the three parameters can get, and use them as gold standards: obviously, we can't do better than constant latency. Since we have n processors, and each of them needs an interface to the network, our cost can't get better than O(n). And even if we came up with a network that could support better than O(n) bandwidth, the processors can't provide data faster than that, so O(n) is our best bandwidth. There are standard examples of networks that each meet two of the three criteria well:

As we can see, with a bus every node is connected to a single wire. If we assume the time to communicate on that wire is negligible, then the distance from any node to any other node is the same: one "hop." The bus is also as cheap as we can get: every node has exactly one network interface. Unfortunately, we can only send one message on the bus at a time.

The ring quite directly trades latency for aggregate bandwidth. The cost is the same as for the bus, as each node has exactly one transmitter and one receiver. The bandwidth is much better than for the bus, since every node can simultaneously be sending a message to its neighbor. Unfortunately, the latency becomes very bad, since we typically have to send a message across several intermediate nodes for it to reach its destination.

Finally, the completely connected network optimizes both bandwidth and latency, but is very expensive. The distance between any two nodes is exactly one (since they're directly connected), and every node can be sending a message simultaneously. Unfortunately, every node needs a network transceiver connecting it to every other node.

We can summarize the costs of these three networks in the following table. The table shows what an ideal (but not realizable) network's behavior would be for each of the parameters, and compares these three real networks to it. In all cases, N is the number of nodes in the network.

    Topology               Aggregate BW   Bisection BW   Latency   Cost     Best at
    ideal                  O(N)           O(N)           O(1)      O(N)     all, of course!
    bus                    O(1)           O(1)           O(1)      O(N)     latency, cost
    ring                   O(N)           O(1)           O(N)      O(N)     bandwidth, cost
    completely connected   O(N)           O(N)           O(1)      O(N^2)   bandwidth, latency

The ring and the crossbar both provide good examples of how we can have both static and dynamic examples of topologies. You can either implement a ring in which each processor is directly connected to the next, or one in which there is a ring of switches, and a processor sends a message by dropping it into the switches. Likewise, a crossbar is a dynamic network; the static equivalent is a completely connected graph of all the processors (the wires in the static network take the place of the switches in the dynamic one).

One last point to make on comparing these topologies, and deciding between them, is that the relative importance of the parameters is changing over time. Not long ago, the cost of a crossbar network of any reasonable size was completely prohibitive; today, a 32-node crossbar isn't unreasonable at all. Similarly, the latency of a mesh was far too slow; using wormhole routing has made it into a contender.

Hypercubes

Now, let's try to get a compromise between these. So, instead of picking two, we'll pick part of all three. There is a whole family of networks that turn out to be topologically equivalent to hypercubes of various sizes. We can start by looking at a static hypercube network: put several interface cards on each node, and connect them directly (this is what the Caltech Cosmic Cube did, remember).

We can see the recursive construction of a hypercube in the following sequence of images. The algorithm for constructing a hypercube of dimension n is:

    if (n == 0)
        draw a node
    else
        draw two hypercubes of dimension n-1
        put arcs between corresponding nodes of the two hypercubes

In the figure, the new arcs connecting the two subcubes in each step are shown with darker lines. This also gives us a scheme for building up a unique address for each node in an n-dimensional hypercube: use the first bit to decide which of the two sub-hypercubes the node is in, and use the remaining n-1 bits as an address within that sub-hypercube. The following figure shows how this scheme labels the nodes in a 3-cube:

The 3-cube has eight nodes, which this figure labels in binary from 000 to 111. Notice that the least significant bit (the 1's place) is used for the left-right dimension; the next bit (the 2's place) is the up-down dimension, and the most significant bit (the 4's place) is forward-back. We can get from any node to any other node in at most log N hops (unless there is contention), so the latency is O(log N). There are N processors, each with log2 N interfaces, so the cost is O(N log N), and all the processors can use their links simultaneously, so our aggregate bandwidth is O(N). The bisection bandwidth is O(N): splitting the cube into its two (n-1)-dimensional subcubes cuts N/2 links.
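To make the addressing scheme concrete, here is a small C sketch (my own illustration, not part of the original notes): in a hypothetical 3-cube, each node's neighbors are the addresses that differ from it in exactly one bit, and the number of hops between two nodes is just the number of bits in which their addresses differ.

    #include <stdio.h>

    #define DIM 3               /* dimension of the hypercube (assumed)   */
    #define NODES (1 << DIM)    /* 2^DIM nodes                            */

    /* Hops between two hypercube nodes: the number of address bits in
     * which they differ (the Hamming distance).                          */
    static int hops(unsigned a, unsigned b) {
        unsigned diff = a ^ b;
        int count = 0;
        while (diff) {
            count += diff & 1u;
            diff >>= 1;
        }
        return count;
    }

    int main(void) {
        /* Each node is adjacent to the DIM nodes whose addresses differ
         * from it in exactly one bit position.                           */
        for (unsigned node = 0; node < NODES; node++) {
            printf("node %u neighbors:", node);
            for (int d = 0; d < DIM; d++)
                printf(" %u", node ^ (1u << d));
            printf("\n");
        }
        /* 110 and 011 (6 and 3 in decimal) differ in two bits: two hops. */
        printf("hops(6, 3) = %d\n", hops(6u, 3u));
        return 0;
    }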

The translation from a static hypercube to a dynamic one is a step that tends to cause a lot of heartburn, so let's take a look at how we can do it. First, let's impose some arbitrary restrictions on how we transport data: on any given cycle, we will only use the links along one dimension. So we'll go front/back (if needed) on the first cycle, up/down on the second, and left/right on the third. This means we only process the most significant address bit on the first cycle, the second bit on the second cycle, and so on, until we can swap the least significant bit on the last cycle. We can draw a flow diagram showing how this goes on each cycle, like this:

The idea is that this shows how a packet is routed between the nodes in each time unit. The dark line shows an example, the route from node 110 to node 011. On time unit 1, node 110 transfers the data to node 010. On time unit 2, the data stays put. On time unit 3, node 010 transfers the data to node 011. Now we can take the last step of the communication, and replace the direct communication between the nodes with a switch box:

We can do the same thing for the other levels, but this will require rerouting some wires to make the switch boxes fit:

If we look carefully at the spaghetti nest of wires leading to each switch stage, we can see that it performs a perfect shuffle -- so we have a shuffle-exchange network. This is a blocking network -- we can't simultaneously route 2->3 and 6->1, for instance. If the switch boxes are also capable of broadcasting (selecting one of their inputs and sending a message on both outputs), this is an "Omega" network. We can also go back to a static shuffle-exchange network, like this:

In the static shuffle-exchange, the curved arcs take the same role as the shuffles between the switch boxes in the Omega, and the straight arcs take the same role as the switch boxes. This gives us, once again, an O(log n) latency and an O(n) aggregate bandwidth; the bisection bandwidth is always 4 (so it's O(1)), and the nodes always have 3 links, so the cost is O(n). Note that most pictures of the static shuffle-exchange show links from 000 around back to itself, and similarly for 111. One other interesting property of the perfect shuffle: it does the same thing as left-rotating the address of the node by one bit.

Multidimensional Meshes and Toroids

A mesh is a generalization of the hypercube, in which we have more than two nodes along a dimension. The most popular meshes are 2- and 3-dimensional meshes; here's a picture of a 2-d mesh:

Meshes have O(n) cost, O(sqrt(n)) bisection bandwidth, O(n) aggregate bandwidth, and O(sqrt(n)) latency. This latency was regarded as unacceptable when store-and-forward was the order of the day, but they have become quite popular as wormhole routing has become more common. If we wrap the ends around, we have a toroid instead of a mesh; a two-dimensional toroid is normally drawn so it looks a lot like a donut but I'm not going to try to render it like that!

A ring is actually a one-dimensional toroid.
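To tie the O(sqrt(n)) mesh latency to something concrete, here is a small C sketch (my own illustration, using an assumed 8 x 8 network) that counts hops under simple dimension-order routing on a 2-d mesh and on the corresponding toroid, where the wrapped-around ends let us go the short way in each dimension.

    #include <stdio.h>
    #include <stdlib.h>

    #define SIDE 8   /* an 8 x 8 network of 64 nodes (assumed example size) */

    /* Hops between (x1,y1) and (x2,y2) on a 2-d mesh: walk each dimension
     * in turn, so the distance is the sum of the per-dimension distances.  */
    static int mesh_hops(int x1, int y1, int x2, int y2) {
        return abs(x1 - x2) + abs(y1 - y2);
    }

    /* On a 2-d toroid the ends wrap around, so in each dimension we may
     * take the shorter way around the ring.                               */
    static int torus_hops(int x1, int y1, int x2, int y2) {
        int dx = abs(x1 - x2), dy = abs(y1 - y2);
        if (dx > SIDE - dx) dx = SIDE - dx;
        if (dy > SIDE - dy) dy = SIDE - dy;
        return dx + dy;
    }

    int main(void) {
        /* Opposite corners: worst case on the mesh, much shorter on the toroid. */
        printf("mesh  (0,0)->(7,7): %d hops\n", mesh_hops(0, 0, 7, 7));  /* 14 */
        printf("torus (0,0)->(7,7): %d hops\n", torus_hops(0, 0, 7, 7)); /*  2 */
        return 0;
    }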

Multiprocessor systems:
In a multiprocessing system, all CPUs may be equal, or some may be reserved for special purposes. A combination of hardware and operating-system software design considerations determines the symmetry (or lack thereof) in a given system. For example, hardware or software considerations may require that only one CPU respond to all hardware interrupts, whereas all other work in the system may be distributed equally among CPUs; or execution of kernel-mode code may be restricted to only one processor (either a specific processor, or only one processor at a time), whereas user-mode code may be executed in any combination of processors. Multiprocessing systems are often easier to design if such restrictions are imposed, but they tend to be less efficient than systems in which all CPUs are utilized. Systems that treat all CPUs equally are called symmetric multiprocessing (SMP) systems. In systems where all CPUs are not equal, system resources may be divided in a number of ways, including asymmetric multiprocessing (ASMP), non-uniform memory access (NUMA) multiprocessing, and clustered multiprocessing (qq.v.).

Instruction and data streams

In multiprocessing, the processors can be used to execute a single sequence of instructions in multiple contexts (single-instruction, multiple-data or SIMD, often used in vector processing), multiple sequences of instructions in a single context (multiple-instruction, single-data or MISD, used for redundancy in fail-safe systems and sometimes applied to describe pipelined processors or hyper-threading), or multiple sequences of instructions in multiple contexts (multiple-instruction, multiple-data or MIMD).

Processor coupling

Tightly-coupled multiprocessor systems contain multiple CPUs that are connected at the bus level. These CPUs may have access to a central shared memory (SMP or UMA), or may participate in a memory hierarchy with both local and shared memory (NUMA). The IBM p690 Regatta is an example of a high-end SMP system. Intel Xeon processors dominated the multiprocessor market for business PCs and were the only x86 option until the release of AMD's Opteron range of processors in 2004. Both ranges of processors had their own onboard cache but provided access to shared memory; the Xeon processors via a common pipe and the Opteron processors via independent pathways to the system RAM. Chip multiprocessing, also known as multi-core computing, involves more than one processor placed on a single chip and can be thought of as the most extreme form of tightly-coupled multiprocessing. Mainframe systems with multiple processors are often tightly-coupled.

Loosely-coupled multiprocessor systems (often referred to as clusters) are based on multiple standalone single- or dual-processor commodity computers interconnected via a high-speed communication system (Gigabit Ethernet is common). A Linux Beowulf cluster is an example of a loosely-coupled system. Tightly-coupled systems perform better and are physically smaller than loosely-coupled systems, but have historically required greater initial investments and may depreciate rapidly; nodes in a loosely-coupled system are usually inexpensive commodity computers and can be recycled as independent machines upon retirement from the cluster. Power consumption is also a consideration. Tightly-coupled systems tend to be much more energy efficient than clusters.
This is because considerable economy can be realized by designing components to work together from the beginning in tightly-coupled systems, whereas loosely-coupled systems use components that were not necessarily intended specifically for use in such systems.

Flynn's Classification

SISD multiprocessing

In a single instruction stream, single data stream computer, one processor sequentially processes instructions; each instruction processes one data item. One example is the "von Neumann" architecture with RISC.

SIMD multiprocessing

In a single instruction stream, multiple data stream computer, one processor handles a stream of instructions, each one of which can perform calculations in parallel on multiple data locations. SIMD multiprocessing is well suited to parallel or vector processing, in which a very large set of data can be divided into parts that are individually subjected to identical but independent operations. A single instruction
stream directs the operation of multiple processing units to perform the same manipulations simultaneously on potentially large amounts of data. For certain types of computing applications, this type of architecture can produce enormous increases in performance, in terms of the elapsed time required to complete a given task. However, a drawback to this architecture is that a large part of the system falls idle when programs or system tasks are executed that cannot be divided into units that can be processed in parallel. Additionally, programs must be carefully and specially written to take maximum advantage of the architecture, and often special optimizing compilers designed to produce code specifically for this environment must be used. Some compilers in this category provide special constructs or extensions to allow programmers to directly specify operations to be performed in parallel (e.g., DO FOR ALL statements in the version of FORTRAN used on the ILLIAC IV, which was a SIMD multiprocessing supercomputer). SIMD multiprocessing finds wide use in certain domains such as computer simulation, but is of little use in general-purpose desktop and business computing environments.

MISD multiprocessing

MISD multiprocessing offers mainly the advantage of redundancy, since multiple processing units perform the same tasks on the same data, reducing the chances of incorrect results if one of the units fails. MISD architectures may involve comparisons between processing units to detect failures. Apart from the redundant and fail-safe character of this type of multiprocessing, it has few advantages, and it is very expensive. It does not improve performance. It can be implemented in a way that is transparent to software. It is used in array processors and is implemented in fault-tolerant machines.

MIMD multiprocessing

MIMD multiprocessing architecture is suitable for a wide variety of tasks in which completely independent and parallel execution of instructions touching different sets of data can be put to productive use. For this reason, and because it is easy to implement, MIMD predominates in multiprocessing. Processing is divided into multiple threads, each with its own hardware processor state, within a single software-defined process or within multiple processes. Insofar as a system has multiple threads awaiting dispatch (either system or user threads), this architecture makes good use of hardware resources. MIMD does raise issues of deadlock and resource contention, however, since threads may collide in their access to resources in an unpredictable way that is difficult to manage efficiently. MIMD requires special coding in the operating system of a computer but does not require application changes unless the programs themselves use multiple threads (MIMD is transparent to single-threaded programs under most operating systems, if the programs do not voluntarily relinquish control to the OS). Both system and user software may need to use software constructs such as semaphores (also called locks or gates) to prevent one thread from interfering with another if they should happen to cross paths in referencing the same data. This gating or locking process increases code complexity, lowers performance, and greatly increases the amount of testing required, although not usually enough to negate the advantages of multiprocessing.
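As a concrete (and entirely illustrative) sketch of the locking just described, the following C program uses POSIX threads: two threads increment a shared counter, and a mutex (one common form of the semaphores/locks mentioned above) keeps them from interfering with each other. Compile with -pthread.

    #include <stdio.h>
    #include <pthread.h>

    /* Shared data touched by both threads, protected by a mutex (a lock). */
    static long counter = 0;
    static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&counter_lock);    /* enter the critical section */
            counter++;                            /* safe: only one thread here */
            pthread_mutex_unlock(&counter_lock);  /* leave the critical section */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;

        /* Two threads in one process, sharing the same counter variable. */
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        /* Without the lock the result would usually come out below 2000000
         * because of exactly the kind of race condition described above.   */
        printf("counter = %ld\n", counter);
        return 0;
    }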
Similar conflicts can arise at the hardware level between processors (cache contention and corruption, for example), and must usually be resolved in hardware, or with a combination of software and hardware (e.g., cache-clear instructions).

Thread:

In computer science, a thread of execution is the smallest unit of processing that can be scheduled by an operating system. It generally results from a fork of a computer program into two or more concurrently running tasks. The implementation of threads and processes differs from one operating system to another, but in most cases, a thread is contained inside a process. Multiple threads can exist within the same process and share resources such as memory, while different processes do not share these resources. In particular, the threads of a process share the latter's instructions (its code) and its context (the values that its variables reference at any given moment). To give an analogy, multiple threads in a process are like multiple cooks reading off the same cook book and following its instructions, not necessarily from the same page. On a single processor, multithreading generally occurs by time-division multiplexing (as in multitasking): the processor switches between different threads. This context switching generally happens frequently enough that the user perceives the threads or tasks as running at the same time. On a multiprocessor or multi-core system,
the threads or tasks will actually run at the same time, with each processor or core running a particular thread or task. Many modern operating systems directly support both time-sliced and multiprocessor threading with a process scheduler. The kernel of an operating system allows programmers to manipulate threads via the system call interface. Threads implemented this way are sometimes called kernel threads; a lightweight process (LWP) is a specific type of kernel thread that shares the same state and information.

Multithreading

Multithreading as a widespread programming and execution model allows multiple threads to exist within the context of a single process. These threads share the process's resources but are able to execute independently. The threaded programming model provides developers with a useful abstraction of concurrent execution. However, perhaps the most interesting application of the technology is when it is applied to a single process to enable parallel execution on a multiprocessor system. This advantage of a multithreaded program allows it to operate faster on computer systems that have multiple CPUs, CPUs with multiple cores, or across a cluster of machines, because the threads of the program naturally lend themselves to truly concurrent execution. In such a case, the programmer needs to be careful to avoid race conditions and other non-intuitive behaviors. In order for data to be correctly manipulated, threads will often need to rendezvous in time in order to process the data in the correct order. Threads may also require mutually-exclusive operations (often implemented using semaphores) in order to prevent common data from being simultaneously modified, or read while in the process of being modified. Careless use of such primitives can lead to deadlocks.

Another use of multithreading, applicable even for single-CPU systems, is the ability for an application to remain responsive to input. In a single-threaded program, if the main execution thread blocks on a long-running task, the entire application can appear to freeze. By moving such long-running tasks to a worker thread that runs concurrently with the main execution thread, it is possible for the application to remain responsive to user input while executing tasks in the background. On the other hand, in most cases multithreading is not the only way to keep a program responsive; non-blocking I/O can be used to achieve the same result.

Operating systems schedule threads in one of two ways:

1. Preemptive multithreading is generally considered the superior approach, as it allows the operating system to determine when a context switch should occur. The disadvantage of preemptive multithreading is that the system may make a context switch at an inappropriate time, causing priority inversion or other negative effects which may be avoided by cooperative multithreading.

2. Cooperative multithreading, on the other hand, relies on the threads themselves to relinquish control once they are at a stopping point. This can create problems if a thread is waiting for a resource to become available.

Until the late 1990s, desktop computers did not have much support for multithreading, because switching between threads was generally already quicker than full process context switches. Starting in the 1990s, Linus Torvalds included system-level multithreading support in the Linux kernel, improving overall system performance compared with other kernels.
Processors in embedded systems, which have higher requirements for real-time behavior, might support multithreading by decreasing the thread-switch time, perhaps by allocating a dedicated register file for each thread instead of saving/restoring a common register file. In the late 1990s, the idea of executing instructions from multiple threads simultaneously, known as simultaneous multithreading, reached desktops with Intel's Pentium 4 processor, under the name hyper-threading. It was dropped from the Intel Core and Core 2 architectures, but was later reinstated in the Core i7 architecture. Critics of multithreading contend that increasing the use of threads has significant drawbacks:

"Although threads seem to be a small step from sequential computation, in fact, they represent a huge step. They discard the most essential and appealing properties of sequential computation: understandability, predictability, and determinism. Threads, as a model of computation, are wildly nondeterministic, and the job of the programmer becomes one of pruning that nondeterminism." -- "The Problem with Threads", Edward A. Lee, UC Berkeley, 2006

Distributed Memory:
Distributed memory refers to a multiple-processor computer system in which each processor has its own private memory. Computational tasks can only operate on local data, and if remote data is required, the computational task must communicate with one or more remote processors. In contrast, a shared-memory multiprocessor offers a single memory space used by all processors. Processors do not have to be aware of where data resides, except that there may be performance penalties, and that race conditions are to be avoided.

Distributed System:

The word distributed in terms such as "distributed system", "distributed programming", and "distributed algorithm" originally referred to computer networks where individual computers were physically distributed within some geographical area. The terms are nowadays used in a much wider sense, even referring to autonomous processes that run on the same physical computer and interact with each other by message passing. While there is no single definition of a distributed system, the following defining properties are commonly used:

There are several autonomous computational entities, each of which has its own local memory.
The entities communicate with each other by message passing.

Here, the computational entities are called computers or nodes. A distributed system may have a common goal, such as solving a large computational problem. Alternatively, each computer may have its own user with individual needs, and the purpose of the distributed system is to coordinate the use of shared resources or provide communication services to the users. Other typical properties of distributed systems include the following:

The system has to tolerate failures in individual computers.
The structure of the system (network topology, network latency, number of computers) is not known in advance, the system may consist of different kinds of computers and network links, and the system may change during the execution of a distributed program.
Each computer has only a limited, incomplete view of the system. Each computer may know only one part of the input.

(a), (b) A distributed system. (c) A parallel system.

Parallel and distributed computing

Distributed systems are networked computers operating with their own processors. The terms "concurrent computing", "parallel computing", and "distributed computing" have a lot of overlap, and no clear distinction exists between them. The same system may be characterised both as "parallel" and "distributed"; the processors in a typical distributed system run concurrently in parallel. Parallel computing may be seen as a particularly tightly-coupled form of distributed computing, and distributed computing may be seen as a loosely-coupled form of parallel computing. Nevertheless, it is possible to roughly classify concurrent systems as "parallel" or "distributed" using the following criteria:

In parallel computing, all processors have access to a shared memory. Shared memory can be used to exchange information between processors.
In distributed computing, each processor has its own private memory (distributed memory). Information is exchanged by passing messages between the processors.

The figure above illustrates the difference between distributed and parallel systems. Figure (a) is a schematic view of a typical distributed system; as usual, the system is represented as a network topology in which each node is a computer and each line connecting the nodes is a communication link. Figure (b) shows the same distributed system in more detail: each computer has its own local memory, and information can be exchanged only by passing messages from one node to another by using the available communication links. Figure (c) shows a parallel system in which each processor has direct access to a shared memory.

The situation is further complicated by the traditional uses of the terms parallel and distributed algorithm, which do not quite match the above definitions of parallel and distributed systems. Nevertheless, as a rule of thumb, high-performance parallel computation in a shared-memory multiprocessor uses parallel algorithms, while the coordination of a large-scale distributed system uses distributed algorithms.

Distributed Shared Memory (DSM), in computer architecture, is a form of memory architecture where the (physically separate) memories can be addressed as one (logically shared) address space. Here, the term shared does not mean that there is a single centralized memory; shared essentially means that the address space is shared (the same physical address on two processors refers to the same location in memory). Alternatively, in computer science it is known as distributed global address space (DGAS), a concept that refers to a wide class of software and hardware implementations, in which each node of a cluster has access to shared memory in addition to each node's non-shared private memory.

Software DSM systems can be implemented in an operating system, or as a programming library. Software DSM systems implemented in the operating system can be thought of as extensions of the underlying virtual memory architecture. Such systems are transparent to the developer, which means that the underlying distributed memory is completely hidden from the users. In contrast, software DSM systems implemented at the library or language level are not transparent, and developers usually have to program differently. However, these systems offer a more portable approach to DSM system implementation. Software DSM systems also have the flexibility to organize the shared memory region in different ways. The page-based approach organizes shared memory into pages of fixed size. In contrast, the object-based approach organizes the shared memory region as an abstract space for storing shareable objects of variable sizes. Another commonly seen implementation uses a tuple space, in which the unit of sharing is a tuple.

Shared memory architecture may involve separating memory into shared parts distributed amongst nodes and main memory, or distributing all memory between nodes. A coherence protocol, chosen in accordance with a consistency model, maintains memory coherence.

Amdahl's law

The speedup of a program using multiple processors in parallel computing is limited by the sequential fraction of the program. For example, if 95% of the program can be parallelized, the theoretical maximum speedup using parallel computing would be 20, as shown in the diagram, no matter how many processors are used.

Amdahl's law, also known as Amdahl's argument, is named after computer architect Gene Amdahl, and is used to find the maximum expected improvement to an overall system when only part of the system is improved. It is often used in parallel computing to predict the theoretical maximum speedup using multiple processors. The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program. For example, if a program needs 20 hours using a single processor core, and a particular portion of 1 hour cannot be parallelized, while the remaining portion of 19 hours (95%) can be parallelized, then regardless of how many processors we devote to a parallelized execution of this program, the minimum execution time cannot be less than that critical 1 hour. Hence the speedup is limited to at most 20, as the diagram illustrates.

Description

Amdahl's law is a model for the relationship between the expected speedup of parallelized implementations of an algorithm relative to the serial algorithm, under the assumption that the problem size remains the same when parallelized. For example, if for a given problem size a parallelized implementation of an algorithm can run 12% of the algorithm's operations arbitrarily quickly (while the remaining 88% of the operations are not parallelizable), Amdahl's law states that the maximum speedup of the parallelized version is 1/(1 - 0.12) = 1.136 times as fast as the non-parallelized implementation.

More technically, the law is concerned with the speedup achievable from an improvement to a computation that affects a proportion P of that computation, where the improvement has a speedup of S. (For example, if an improvement can speed up 30% of the computation, P will be 0.3; if the improvement makes the portion affected twice as fast, S will be 2.) Amdahl's law states that the overall speedup of applying the improvement will be

    speedup = 1 / ((1 - P) + P/S)

To see how this formula was derived, assume that the running time of the old computation was 1, for some unit of time. The running time of the new computation will be the length of time the unimproved fraction takes (which is 1 - P), plus the length of time the improved fraction takes. The length of time for the improved part of the computation is the length of the improved part's former running time divided by the speedup, making the length of time of the improved part P/S. The final speedup is computed by dividing the old running time by the new running time, which is what the above formula does.

Here's another example. We are given a task which is split up into four parts: P1 = 11%, P2 = 18%, P3 = 23%, P4 = 48%, which add up to 100%. Then we say P1 is not sped up, so S1 = 1 or 100%, P2 is sped up 5x, so S2 = 500%, P3 is sped up 20x, so S3 = 2000%, and P4 is sped up 1.6x, so S4 = 160%. By using the formula P1/S1 + P2/S2 + P3/S3 + P4/S4, we find the new running time is

    0.11/1 + 0.18/5 + 0.23/20 + 0.48/1.6 = 0.4575

or a little less than half the original running time, which we know is 1. Therefore the overall speed boost is 1 / 0.4575 = 2.186, or a little more than double the original speed, using the formula (P1/S1 + P2/S2 + P3/S3 + P4/S4)^(-1). Notice how the 20x and 5x speedups don't have much effect on the overall speed boost and running time when 11% is not sped up at all, and 48% is sped up only by 1.6x.

Parallelization

In the case of parallelization, Amdahl's law states that if P is the proportion of a program that can be made parallel (i.e., benefit from parallelization), and (1 - P) is the proportion that cannot be parallelized (remains serial), then the maximum speedup that can be achieved by using N processors is

    S(N) = 1 / ((1 - P) + P/N)

In the limit, as N tends to infinity, the maximum speedup tends to 1 / (1 - P). In practice, the performance-to-price ratio falls rapidly as N is increased once there is even a small component of (1 - P). As an example, if P is 90%, then (1 - P) is 10%, and the problem can be sped up by a maximum factor of 10, no matter how large the value of N used. For this reason, parallel computing is only useful for either small numbers of processors, or problems with very high values of P: so-called embarrassingly parallel problems. A great part of the craft of parallel programming consists of attempting to reduce the component (1 - P) to the smallest possible value. P can be estimated by using the measured speedup SU on a specific number of processors NP using

    P_estimated = (1/SU - 1) / (1/NP - 1)

P estimated in this way can then be used in Amdahl's law to predict speedup for a different number of processors.

Relation to law of diminishing returns

Amdahl's law is often conflated with the law of diminishing returns, whereas only a special case of applying Amdahl's law demonstrates a law of diminishing returns. If one picks optimally (in terms of the achieved speedup) what to improve, then one will see monotonically decreasing improvements as one improves. If, however, one picks non-optimally, after improving a sub-optimal component and moving on to improve a more optimal component, one can see an increase in return. Note that it is often rational to improve a system in an order that is "non-optimal" in this sense, given that some improvements are more difficult or consume more development time than others.

Amdahl's law does represent the law of diminishing returns if you are considering what sort of return you get by adding more processors to a machine, if you are running a fixed-size computation that will use all available processors to their capacity. Each new processor you add to the system will add less usable power than the previous one. Each time you double the number of processors the speedup ratio will diminish, as the total throughput heads toward the limit of 1 / (1 - P). This analysis neglects other potential bottlenecks such as memory bandwidth and I/O bandwidth, if they do not scale with the number of processors; however, taking into account such bottlenecks would tend to further demonstrate the diminishing returns of only adding processors.
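The arithmetic in the worked example above is easy to check mechanically. The following small C program (an illustration, not part of the original article) recomputes the four-part example and then evaluates the parallel form of the law for P = 0.90 and a few processor counts.

    #include <stdio.h>

    int main(void) {
        /* The four-part example: proportions and per-part speedups from
         * the text above.                                                */
        double p[4] = {0.11, 0.18, 0.23, 0.48};   /* P1..P4 */
        double s[4] = {1.0, 5.0, 20.0, 1.6};      /* S1..S4 */

        double new_time = 0.0;
        for (int i = 0; i < 4; i++)
            new_time += p[i] / s[i];              /* sum of P_i / S_i */

        printf("new running time = %.4f (old time = 1)\n", new_time);  /* 0.4575 */
        printf("overall speedup  = %.3f\n", 1.0 / new_time);           /* 2.186  */

        /* The parallel form of the law: fraction P parallelizable,
         * N processors.                                                  */
        double P = 0.90;
        for (int N = 2; N <= 1024; N *= 4)
            printf("P=0.90, N=%4d processors: speedup = %.2f\n",
                   N, 1.0 / ((1.0 - P) + P / N));
        /* As N grows, the speedup creeps toward, but never reaches,
         * 1/(1-P) = 10.                                                  */
        return 0;
    }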

Speedup in a sequential program

Assume that a task has two independent parts, A and B. B takes roughly 25% of the time of the whole computation. By working very hard, one may be able to make this part 5 times faster, but this only reduces the time for the whole computation by a little. In contrast, one may need to perform less work to make part A twice as fast. This will make the computation much faster than by optimizing part B, even though B's speedup is greater by ratio (5 versus 2).

The maximum speedup in an improved sequential program, where some part was sped up p times, is limited by the inequality

    maximum speedup <= 1 / (f + (1 - f)/p)

where f (0 < f < 1) is the fraction of time (before the improvement) spent in the part that was not improved. For example (see picture on right):

If part B is made five times faster (p = 5), tA = 3, tB = 1 and f = tA / (tA + tB) = 0.75, then

    maximum speedup <= 1 / (0.75 + 0.25/5) = 1.25

If part A is made to run twice as fast (p = 2), tB = 1, tA = 3 and f = tB / (tA + tB) = 0.25, then

    maximum speedup <= 1 / (0.25 + 0.75/2) = 1.6

Therefore, making A twice as fast is better than making B five times faster. The percentage improvement in speed can be calculated as

    percentage improvement = 100 * (1 - 1/speedup)

Improving part A by a factor of two will increase overall program speed by a factor of 1.6, which makes it 37.5% faster than the original computation. However, improving part B by a factor of five, which presumably requires more effort, will only achieve an overall speedup factor of 1.25, which makes it 20% faster.

Cache:

A cache is a component that transparently stores data so that future requests for that data can be served faster. The data that is stored within a cache might be values that have been computed earlier or duplicates of original values that are stored elsewhere. If requested data is contained in the cache (a cache hit), the request can be served by simply reading the cache, which is comparatively faster. Otherwise (a cache miss), the data has to be recomputed or fetched from its original storage location, which is comparatively slower. Hence, the more requests that can be served from the cache, the faster the overall system performance.

To be cost-efficient and to enable efficient use of data, caches are relatively small. Nevertheless, caches have proven themselves in many areas of computing, because access patterns in typical computer applications have locality of reference. References exhibit temporal locality if data is requested again that has recently been requested already. References exhibit spatial locality if data is requested that is physically stored close to data that has been requested already.

Diagram of a CPU memory cache

Operation

Hardware implements a cache as a block of memory for temporary storage of data likely to be used again. CPUs and hard drives frequently use a cache, as do web browsers and web servers. A cache is made up of a pool of entries. Each entry has a datum (a piece of data), which is a copy of the same datum in some backing store. Each entry also has a tag, which specifies the identity of the datum in the backing store of which the entry is a copy.

When the cache client (a CPU, web browser, or operating system) needs to access a datum presumed to exist in the backing store, it first checks the cache. If an entry can be found with a tag matching that of the desired datum, the datum in the entry is used instead. This situation is known as a cache hit. So, for example, a web browser program might check its local cache on disk to see if it has a local copy of the contents of a web page at a particular URL. In this example, the URL is the tag, and the contents of the web page is the datum. The percentage of accesses that result in cache hits is known as the hit rate or hit ratio of the cache.

The alternative situation, when the cache is consulted and found not to contain a datum with the desired tag, has become known as a cache miss. The previously uncached datum fetched from the backing store during miss handling is usually copied into the cache, ready for the next access. During a cache miss, the CPU usually ejects some other entry in order to make room for the previously uncached datum. The heuristic used to select the entry to eject is known as the replacement policy. One popular replacement policy, "least recently used" (LRU), replaces the least recently used entry (see cache algorithm). More efficient caches compute use frequency against the size of the stored contents, as well as the latencies and throughputs for both the cache and the backing store. This works well for larger amounts of data, longer latencies, and slower throughputs, such as experienced with a hard drive and the Internet, but is not efficient for use with a CPU cache.
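As an illustration of tags, hits, and misses (my own sketch, not something from the original text), here is a minimal direct-mapped software cache in C; the cache size and the backing_store function are made up for the example.

    #include <stdio.h>
    #include <stdbool.h>

    #define NUM_LINES 8   /* tiny cache: 8 entries, direct-mapped (assumed) */

    struct line {
        bool     valid;   /* does this entry hold anything yet?        */
        unsigned tag;     /* identifies which address the datum is for */
        int      datum;   /* the cached copy itself                    */
    };

    static struct line cache[NUM_LINES];

    /* Stand-in for the slow backing store (memory, disk, a server, ...). */
    static int backing_store(unsigned address) {
        return (int)(address * 10);   /* made-up contents */
    }

    /* Look an address up in the cache; fetch it on a miss. */
    static int cached_read(unsigned address) {
        unsigned index = address % NUM_LINES;     /* which entry to check */
        unsigned tag   = address / NUM_LINES;     /* which address it is  */

        if (cache[index].valid && cache[index].tag == tag) {
            printf("address %u: hit\n", address);
            return cache[index].datum;            /* cache hit: fast path */
        }
        printf("address %u: miss\n", address);
        cache[index].valid = true;                /* miss: fetch a copy,         */
        cache[index].tag   = tag;                 /* evicting whatever was there */
        cache[index].datum = backing_store(address);
        return cache[index].datum;
    }

    int main(void) {
        cached_read(3);    /* miss */
        cached_read(3);    /* hit  */
        cached_read(11);   /* miss: maps to the same entry as 3, so it evicts it */
        cached_read(3);    /* miss again */
        return 0;
    }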

Writing Policies

A Write-Through cache with No-Write Allocation. A Write-Back cache with Write Allocation.

When a system writes a datum to the cache, it must at some point write that datum to the backing store as well. The timing of this write is controlled by what is known as the write policy. In a write-through cache, every write to the cache causes a synchronous write to the backing store. Alternatively, in a write-back (or write-behind) cache, writes are not immediately mirrored to the store. Instead, the cache tracks which of its locations have been written over and marks these locations as dirty. The data in these locations are written back to the backing store when those data are evicted from the cache, an effect referred to as a lazy write. For this reason, a read miss in a write-back cache (which requires a block to be replaced by another) will often require two memory accesses to service: one to retrieve the needed datum, and one to write replaced data from the cache to the store. Other policies may also trigger data write-back. The client may make many changes to a datum in the cache, and then explicitly notify the cache to write back the datum.

No-write allocation (a.k.a. write-no-allocate) is a cache policy which caches only processor reads, i.e., on a write miss:

The datum is written directly to memory.
The datum at the missed-write location is not added to the cache.

This avoids the need for write-back or write-through when the old value of the datum was absent from the cache prior to the write.

Entities other than the cache may change the data in the backing store, in which case the copy in the cache may become out-of-date or stale. Alternatively, when the client updates the data in the cache, copies of those data in other caches will become stale. Communication protocols between the cache managers which keep the data consistent are known as coherency protocols.

Applications

CPU cache

Small memories on or close to the CPU can operate faster than the much larger main memory. Most CPUs since the 1980s have used one or more caches, and modern high-end embedded, desktop, and server microprocessors may have as many as half a dozen, each specialized for a specific function. Examples of caches with a specific function are the D-cache and I-cache (data cache and instruction cache).

Disk cache

While CPU caches are generally managed entirely by hardware, a variety of software manages other caches. The page cache in main memory, which is an example of disk cache, is managed by the operating system kernel.

While the hard drive's hardware disk buffer is sometimes misleadingly referred to as a "disk cache", its main functions are write sequencing and read prefetching. Repeated cache hits are relatively rare, due to the small size of the buffer in comparison to the drive's capacity. However, high-end disk controllers often have their own on-board cache of hard disk data blocks. Finally, a fast local hard disk can also cache information held on even slower data storage devices, such as remote servers (web cache) or local tape drives or optical jukeboxes. Such a scheme is the main concept of hierarchical storage management.

Other caches

The BIND DNS daemon caches a mapping of domain names to IP addresses, as does a resolver library. Write-through operation is common when operating over unreliable networks (like an Ethernet LAN), because of the enormous complexity of the coherency protocol required between multiple write-back caches when communication is unreliable. For instance, web page caches and client-side network file system caches (like those in NFS or SMB) are typically read-only or write-through specifically to keep the network protocol simple and reliable. Search engines also frequently make web pages they have indexed available from their cache. For example, Google provides a "Cached" link next to each search result. This can prove useful when web pages from a web server are temporarily or permanently inaccessible. Another type of caching is storing computed results that will likely be needed again, or memoization. ccache, a program that caches the output of compilation to speed up the second compilation, exemplifies this type. Database caching can substantially improve the throughput of database applications, for example in the processing of indexes, data dictionaries, and frequently used subsets of data. Distributed caching uses caches spread across different networked hosts, for example Corelli.

The difference between buffer and cache

The terms "buffer" and "cache" are not mutually exclusive and the functions are frequently combined; however, there is a difference in intent. A buffer is a temporary memory location that is traditionally used because CPU instructions cannot directly address data stored in peripheral devices. Thus, addressable memory is used as an intermediate stage. Additionally, such a buffer may be feasible when a large block of data is assembled or disassembled (as required by a storage device), or when data may be delivered in a different order than that in which it is produced. Also, a whole buffer of data is usually transferred sequentially (for example to hard disk), so buffering itself sometimes increases transfer performance or reduces the variation or jitter of the transfer's latency, as opposed to caching, where the intent is to reduce the latency. These benefits are present even if the buffered data are written to the buffer once and read from the buffer once.

A cache also increases transfer performance. A part of the increase similarly comes from the possibility that multiple small transfers will combine into one large block. But the main performance gain occurs because there is a good chance that the same datum will be read from cache multiple times, or that written data will soon be read. A cache's sole purpose is to reduce accesses to the underlying slower storage. A cache is also usually an abstraction layer that is designed to be invisible from the perspective of neighboring layers.
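Before moving on to coherence, here is a hedged sketch in C of the write-back policy described under "Writing Policies" above, continuing the tiny direct-mapped cache from the earlier example. Everything here (the sizes, the memory array, the write-allocate choice) is made up for illustration.

    #include <stdio.h>
    #include <stdbool.h>

    #define NUM_LINES 8
    #define MEM_WORDS 64

    static int memory[MEM_WORDS];          /* the backing store            */

    static struct {
        bool     valid;
        bool     dirty;                    /* written since it was loaded? */
        unsigned tag;
        int      datum;
    } cache[NUM_LINES];

    /* Evict one entry: with write-back, a dirty entry must be flushed to
     * the backing store before it is replaced (the "lazy write").         */
    static void evict(unsigned index) {
        if (cache[index].valid && cache[index].dirty) {
            unsigned address = cache[index].tag * NUM_LINES + index;
            memory[address] = cache[index].datum;
            printf("write back address %u\n", address);
        }
    }

    /* Write-back write (with write allocation): update only the cached
     * copy and mark it dirty.  A write-through cache would instead also
     * update memory[] right here.                                         */
    static void cached_write(unsigned address, int value) {
        unsigned index = address % NUM_LINES;
        unsigned tag   = address / NUM_LINES;

        if (!(cache[index].valid && cache[index].tag == tag)) {
            evict(index);                  /* replace whatever is there    */
            cache[index].valid = true;
            cache[index].tag   = tag;
        }
        cache[index].datum = value;
        cache[index].dirty = true;         /* memory is now stale          */
    }

    int main(void) {
        cached_write(3, 111);    /* stays in the cache only                 */
        cached_write(11, 222);   /* maps to the same line: 3 is written back */
        printf("memory[3] = %d\n", memory[3]);   /* now 111 */
        return 0;
    }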
Cache coherence

In computing, cache coherence (also cache coherency) refers to the consistency of data stored in local caches of a shared resource.

Multiple Caches of Shared Resource

When clients in a system maintain caches of a common memory resource, problems may arise with inconsistent data. This is particularly true of CPUs in a multiprocessing system. Referring to the "Multiple Caches of Shared Resource" figure, if the top client has a copy of a memory block from a previous read and the bottom client changes that memory block, the top client could be left with an invalid cache of memory without any notification of the change. Cache coherence is intended to manage such conflicts and maintain consistency between cache and memory.

Definition: Coherence defines the behavior of reads and writes to the same memory location. The coherence of caches is obtained if the following conditions are met:

1. A read made by a processor P to a location X that follows a write by the same processor P to X, with no writes of X by another processor occurring between the write and the read instructions made by P, must always return the value written by P. This condition is related to program order preservation, and must be achieved even in uniprocessor architectures.

2. A read made by a processor P1 to location X that follows a write by another processor P2 to X must return the written value made by P2 if no other writes to X made by any processor occur between the two accesses. This condition defines the concept of a coherent view of memory. If processors can read the same old value after the write made by P2, we can say that the memory is incoherent.

3. Writes to the same location must be sequenced. In other words, if location X received two different values A and B, in this order, from any two processors, the processors can never read location X as B and then read it as A. The location X must be seen with values A and B in that order.

These conditions are defined supposing that the read and write operations are made instantaneously. However, this doesn't happen in computer hardware, given memory latency and other aspects of the architecture. A write by processor P1 may not be seen by a read from processor P2 if the read is made within a very small time after the write has been made. The memory consistency model defines when a written value must be seen by a following read instruction made by the other processors.

Cache coherence mechanisms

Directory-based coherence: In a directory-based system, the data being shared is placed in a common directory that maintains the coherence between caches. The directory acts as a filter through which the processor must ask permission to load an entry from the primary memory to its cache. When an entry is changed, the directory either updates or invalidates the other caches with that entry.

Snooping is the process where the individual caches monitor address lines for accesses to memory locations that they have cached. When a write operation is observed to a location that a cache has a copy of, the cache controller invalidates its own copy of the snooped memory location.

Snarfing is where a cache controller watches both address and data in an attempt to update its own copy of a memory location when a second master modifies a location in main memory. When a write operation is observed to a location that a cache has a copy of, the cache controller updates its own copy of the snarfed memory location with the new data.

Distributed shared memory systems mimic these mechanisms in an attempt to maintain consistency between blocks of memory in loosely coupled systems.

The two most common types of coherence that are typically studied are snooping and directory-based, each having its own benefits and drawbacks. Snooping protocols tend to be faster, if enough bandwidth is available, since all transactions are a request/response seen by all processors. The drawback is that snooping isn't scalable. Every request must be broadcast to all nodes in a system, meaning that as the system gets larger, the size of the (logical or physical) bus and the bandwidth it provides must grow. Directories, on the other hand, tend to have longer latencies (with a 3-hop request/forward/respond) but use much less bandwidth, since messages are point-to-point and not broadcast. For this reason, many of the larger systems (>64 processors) use this type of cache coherence.

Memory coherence

Memory coherence is an issue that affects the design of computer systems in which two or more processors or cores share a common area of memory. In a uniprocessor system (where, in today's terms, there exists only one core), there is only one processing element doing all the work and therefore only one processing element that can read or write from/to a given memory location. As a result, when a value is changed, all subsequent read operations of the corresponding memory location will see the updated value, even if it is cached.

Conversely, in multiprocessor (or multicore) systems, there are two or more processing elements working at the same time, and so it is possible that they simultaneously access the same memory location. Provided none of them changes the data in this location, they can share it indefinitely and cache it as they please. But as soon as one updates the location, the others might work on an out-of-date copy that, e.g., resides in their local cache. Consequently, some scheme is required to notify all the processing elements of changes to shared values; such a scheme is known as a "memory coherence protocol", and if such a protocol is employed the system is said to have a "coherent memory". The exact nature and meaning of the memory coherency is determined by the consistency model that the coherence protocol implements. In order to write correct concurrent programs, programmers must be aware of the exact consistency model that is employed by their systems.

Message Passing Technique:

Message passing is the paradigm of communication where messages are sent from a sender to one or more recipients. Forms of messages include (remote) method invocation, signals, and data packets. When designing a message passing system, several choices are made:

Whether messages are transferred reliably
Whether messages are guaranteed to be delivered in order
Whether messages are passed one-to-one, one-to-many (unicasting or multicast), or many-to-one (client-server)
Whether communication is synchronous or asynchronous

Prominent theoretical foundations of concurrent computation, such as the Actor model and the process calculi, are based on message passing. Implementations of concurrent systems that use message passing can either have message passing as an integral part of the language, or as a series of library calls from the language. Examples of the former include many distributed object systems. Examples of the latter include microkernel operating systems, which pass messages between one kernel and one or more server blocks, and the Message Passing Interface used in high-performance computing.

Message passing systems and models

Distributed object and remote method invocation systems like ONC RPC, Corba, Java RMI, DCOM, SOAP, .NET Remoting, CTOS, QNX Neutrino RTOS, OpenBinder, D-Bus and similar are message passing systems. Message passing systems have been called "shared nothing" systems because the message passing abstraction hides underlying state changes that may be used in the implementation of sending messages. Message passing model based programming languages typically define messaging as the (usually asynchronous) sending (usually by copy) of a data item to a communication endpoint (Actor, process, thread, socket, etc.). Such messaging is used in Web Services by SOAP. This concept is the higher-level version of a datagram, except that messages can be larger than a packet and can optionally be made reliable, durable, secure, and/or transacted. Messages are also commonly used in the same sense as a means of interprocess communication; the other common technique being streams or pipes, in which data are sent as a sequence of elementary data items instead (the higher-level version of a virtual circuit).

Synchronous versus asynchronous message passing

Synchronous message passing systems require the sender and receiver to wait for each other to transfer the message. That is, the sender will not continue until the receiver has received the message. Synchronous communication has two advantages. The first advantage is that reasoning about the program can be simplified in that there is a synchronisation point between sender and receiver on message transfer. The second advantage is that no buffering is required. The message can always be stored on the receiving side, because the sender will not continue until the receiver is ready.

Asynchronous message passing systems deliver a message from sender to receiver, without waiting for the receiver to be ready. The advantage of asynchronous communication is that the sender and receiver can overlap their computation because they do not wait for each other. Synchronous communication can be built on top of asynchronous communication by ensuring that the sender always waits for an acknowledgement message from the receiver before continuing. The buffer required in asynchronous communication can cause problems when it is full. A decision has to be made whether to block the sender or whether to discard future messages.
If the sender is blocked, it may lead to an unexpected deadlock. If messages are dropped, then communication is no longer reliable.
Message passing versus calling
Message passing should be contrasted with the alternative communication method for passing information between programs: the call. In a traditional call, arguments are passed to the "callee" (the receiver), typically in one or more general-purpose registers or in a parameter list containing the addresses of each of the arguments. This form of communication differs from message passing in at least three crucial areas:
total memory usage
transfer time
locality

In message passing, each of the arguments has to have sufficient available extra memory for copying the existing argument into a portion of the new message. This applies irrespective of the size of the original arguments, so if one of the arguments is (say) an HTML string of 31,000 octets describing a web page, it has to be copied in its entirety (and perhaps even transmitted) to the receiving program (if not a local program). By contrast, for the call method, only an address of, say, 4 or 8 bytes needs to be passed for each argument and may even be passed in a general-purpose register, requiring zero additional storage and zero "transfer time". This of course is not possible for distributed systems, since an (absolute) address in the caller's address space is normally meaningless to the remote program (however, a relative address might in fact be usable if the callee had an exact copy of at least some of the caller's memory in advance). Web browsers and web servers are examples of processes that communicate by message passing. A URL is an example of a way of referencing resources that does not depend on exposing the internals of a process. A subroutine call or method invocation will not exit until the invoked computation has terminated. Asynchronous message passing, by contrast, can result in a response arriving a significant time after the request message was sent. A message handler will, in general, process messages from more than one sender. This means its state can change for reasons unrelated to the behaviour of a single sender or client process. This is in contrast to the typical behaviour of an object upon which methods are being invoked: the latter is expected to remain in the same state between method invocations. (In other words, the message handler behaves analogously to a volatile object.)
Message passing and locks
Message passing can be used as a way of controlling access to resources in a concurrent or asynchronous system. One of the main alternatives is mutual exclusion or locking. Examples of resources include shared memory, a disk file or region thereof, and a database table or set of rows. In locking, a resource is essentially shared, and processes wishing to access it (or a sector of it) must first obtain a lock. Once the lock is acquired, other processes are blocked out, ensuring that corruption from simultaneous writes does not occur; the lock is then released. With the message-passing solution, it is assumed that the resource is not exposed, and all changes to it are made by an associated process, so that the resource is encapsulated. Processes wishing to access the resource send a request message to the handler. If the resource (or subsection) is available, the handler makes the requested change as an atomic event; that is, conflicting requests are not acted on until the first request has been completed. If the resource is not available, the request is generally queued. The sending programme may or may not wait until the request has been completed.
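As a rough illustration of that message-passing alternative, here is a small sketch in C, assuming POSIX pipes and pthreads; the one-byte request codes ('+' and 'q') are invented for the example. The counter is owned by a single handler thread, and the other threads change it only by sending request messages, so no lock on the counter itself is needed.
/*
 * The counter is "encapsulated": only the handler thread touches it, and
 * clients ask for changes by writing one-byte request messages into a pipe.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static int request_fd[2];   /* [0]: handler reads requests, [1]: clients write */

static void *handler(void *arg)
{
    long counter = 0;       /* the encapsulated resource */
    char msg;
    (void)arg;
    while (read(request_fd[0], &msg, 1) == 1 && msg != 'q')
        if (msg == '+')
            counter++;      /* requests are applied strictly one at a time */
    printf("final counter value: %ld\n", counter);
    return NULL;
}

static void *client(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000; i++) {
        char msg = '+';
        write(request_fd[1], &msg, 1);   /* send an "increment" request */
    }
    return NULL;
}

int main(void)
{
    pthread_t h, c1, c2;
    char quit = 'q';

    pipe(request_fd);
    pthread_create(&h, NULL, handler, NULL);
    pthread_create(&c1, NULL, client, NULL);
    pthread_create(&c2, NULL, client, NULL);
    pthread_join(c1, NULL);
    pthread_join(c2, NULL);
    write(request_fd[1], &quit, 1);      /* tell the handler to finish */
    pthread_join(h, NULL);
    return 0;
}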

Superscalar

Simple superscalar pipeline. By fetching and dispatching two instructions at a time, a maximum of two instructionspercyclecanbecompleted.

Processor board of a CRAY T3e supercomputer with four superscalar Alpha 21164 processors
A superscalar CPU architecture implements a form of parallelism called instruction-level parallelism within a single processor. It therefore allows faster CPU throughput than would otherwise be possible at a given clock rate. A superscalar processor executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor. Each functional unit is not a separate CPU core but an execution resource within a single CPU, such as an arithmetic logic unit, a bit shifter, or a multiplier. In Flynn's taxonomy, a single-core superscalar processor is classified as an SISD processor (Single Instruction stream, Single Data stream), since it still executes one instruction stream over scalar data. While a superscalar CPU is typically also pipelined, pipelining and superscalar architecture are considered different performance enhancement techniques. The superscalar technique is traditionally associated with several identifying characteristics (within a given CPU core):
Instructions are issued from a sequential instruction stream
CPU hardware dynamically checks for data dependencies between instructions at run time (versus software checking at compile time)
The CPU accepts multiple instructions per clock cycle
History
Seymour Cray's CDC 6600 from 1965 is often mentioned as the first superscalar design. The Intel i960CA (1988) and the AMD 29000-series 29050 (1990) microprocessors were the first commercial single-chip superscalar microprocessors. RISC CPUs like these were the first microprocessors to use the superscalar concept, because the RISC design results in a simple core, thereby allowing the inclusion of multiple functional units (such as ALUs) on a single CPU in the constrained design rules of the time (this was why RISC designs were faster than CISC designs through the 1980s and into the 1990s). Except for CPUs used in low-power applications, embedded systems, and battery-powered devices, essentially all general-purpose CPUs developed since about 1998 are superscalar. The P5 Pentium was the first superscalar x86 processor; the Nx586, P6 Pentium Pro and AMD K5 were among the first designs which decode x86 instructions asynchronously into dynamic microcode-like micro-op sequences prior to actual execution on a superscalar microarchitecture; this opened up for dynamic scheduling

of buffered partial instructions and enabled more parallelism to be extracted compared to the more rigid methods used in the simpler P5 Pentium; it also simplified speculative execution and allowed higher clock frequencies compared to designs such as the advanced Cyrix 6x86.
From scalar to superscalar
The simplest processors are scalar processors. Each instruction executed by a scalar processor typically manipulates one or two data items at a time. By contrast, each instruction executed by a vector processor operates simultaneously on many data items. An analogy is the difference between scalar and vector arithmetic. A superscalar processor is sort of a mixture of the two. Each instruction processes one data item, but there are multiple redundant functional units within each CPU, so multiple instructions can be processing separate data items concurrently. Superscalar CPU design emphasizes improving the instruction dispatcher accuracy, and allowing it to keep the multiple functional units in use at all times. This has become increasingly important as the number of units has increased. While early superscalar CPUs would have two ALUs and a single FPU, a modern design such as the PowerPC 970 includes four ALUs, two FPUs, and two SIMD units. If the dispatcher is ineffective at keeping all of these units fed with instructions, the performance of the system will suffer. A superscalar processor usually sustains an execution rate in excess of one instruction per machine cycle. But merely processing multiple instructions concurrently does not make an architecture superscalar, since pipelined, multiprocessor or multi-core architectures also achieve that, but with different methods. In a superscalar CPU the dispatcher reads instructions from memory and decides which ones can be run in parallel, dispatching them to redundant functional units contained inside a single CPU. Therefore a superscalar processor can be envisioned as having multiple parallel pipelines, each of which is processing instructions simultaneously from a single instruction thread.
Limitations
Available performance improvement from superscalar techniques is limited by three key areas:
1. The degree of intrinsic parallelism in the instruction stream, i.e. the limited amount of instruction-level parallelism.
2. The complexity and time cost of the dispatcher and associated dependency checking logic.
3. The branch instruction processing.
Existing binary executable programs have varying degrees of intrinsic parallelism. In some cases instructions are not dependent on each other and can be executed simultaneously. In other cases they are inter-dependent: one instruction impacts either resources or results of the other. The instructions a = b + c; d = e + f can be run in parallel because none of the results depend on other calculations. However, the instructions a = b + c; b = e + f might not be runnable in parallel, depending on the order in which the instructions complete while they move through the units. When the number of simultaneously issued instructions increases, the cost of dependency checking increases extremely rapidly. This is exacerbated by the need to check dependencies at run time and at the CPU's clock rate. This cost includes additional logic gates required to implement the checks, and time delays through those gates. Research shows the gate cost in some cases may be n·k gates, and the delay cost k²·log n, where n is the number of instructions in the processor's instruction set, and k is the number of simultaneously dispatched instructions.
In mathematics, this is called a combinatoric problem involving permutations. Even though the instruction stream may contain no inter-instruction dependencies, a superscalar CPU must nonetheless check for that possibility, since there is no assurance otherwise and failure to detect a dependency would produce incorrect results. No matter how advanced the semiconductor process or how fast the switching speed, this places a practical limit on how many instructions can be simultaneously dispatched. While process advances will allow ever greater numbers of functional units (e.g., ALUs), the burden of checking instruction dependencies grows so rapidly that the achievable superscalar dispatch limit is fairly small, likely on the order of five to six simultaneously dispatched instructions. However, even given infinitely fast dependency checking logic on an otherwise conventional superscalar CPU, if the instruction stream itself has many dependencies, this would also limit the possible speedup. Thus the degree of intrinsic parallelism in the code stream forms a second limitation.

Alternatives Collectively, these limits drive investigation into alternative architectural changes such as Very Long Instruction Word (VLIW), Explicitly Parallel Instruction Computing (EPIC), simultaneous multithreading (SMT), and multi-core processors. With VLIW, the burdensome task of dependency checking by hardware logic at run time is removed and delegated to the compiler. Explicitly Parallel Instruction Computing (EPIC) is like VLIW, with extra cache prefetching instructions. Simultaneous multithreading, often abbreviated as SMT, is a technique for improving the overall efficiency of superscalar CPUs. SMT permits multiple independent threads of execution to better utilize the resources provided by modern processor architectures. Superscalar processors differ from multi-core processors in that the redundant functional units are not entire processors. A single processor is composed of finer-grained functional units such as the ALU, integer multiplier, integer shifter, floating point unit, etc. There may be multiple versions of each functional unit to enable execution of many instructions in parallel. This differs from a multi-core processor that concurrently processes instructions from multiple threads, one thread per core. It also differs from a pipelined CPU, where the multiple instructions can concurrently be in various stages of execution, assembly-line fashion. The various alternative techniques are not mutually exclusivethey can be (and frequently are) combined in a single processor. Thus a multicore CPU is possible where each core is an independent processor containing multiple parallel pipelines, each pipeline being superscalar. Some processors also include vector capability.

Vector processor
A vector processor, or array processor, is a central processing unit (CPU) that implements an instruction set containing instructions that operate on one-dimensional arrays of data called vectors. This is in contrast to a scalar processor, whose instructions operate on single data items. The vast majority of CPUs are scalar. Vector processors first appeared in the 1970s, and formed the basis of most supercomputers through the 1980s and into the 1990s. Improvements in scalar processors, particularly microprocessors, resulted in the decline of traditional vector processors in supercomputers, and the appearance of vector processing techniques in mass market CPUs around the early 1990s. Today, most commodity CPUs implement architectures that feature instructions for some vector processing on multiple (vectorized) data sets, typically known as SIMD (Single Instruction, Multiple Data). Common examples include MMX, SSE, and AltiVec. Vector processing techniques are also found in video game console hardware and graphics accelerators. In 2000, IBM, Toshiba and Sony collaborated to create the Cell processor, consisting of one scalar processor and eight vector processors, which found use in the Sony PlayStation 3 among other applications. Other CPU designs may include some multiple instructions for vector processing on multiple (vectorised) data sets, typically known as MIMD (Multiple Instruction, Multiple Data). Such designs are usually dedicated to a particular application and not commonly marketed for general purpose computing.
History
Vector processing development began in the early 1960s at Westinghouse in their Solomon project. Solomon's goal was to dramatically increase math performance by using a large number of simple math co-processors under the control of a single master CPU. The CPU fed a single common instruction to all of the arithmetic logic units (ALUs), one per "cycle", but with a different data point for each one to work on. This allowed the Solomon machine to apply a single algorithm to a large data set, fed in the form of an array. In 1962, Westinghouse cancelled the project, but the effort was re-started at the University of Illinois as the ILLIAC IV. Their version of the design originally called for a 1 GFLOPS machine with 256 ALUs, but, when it was finally delivered in 1972, it had only 64 ALUs and could reach only 100 to 150 MFLOPS. Nevertheless, it showed that the basic concept was sound, and, when used on data-intensive applications, such as computational fluid dynamics, the "failed" ILLIAC was the fastest machine in the world. The ILLIAC approach of using separate ALUs for each data element is not common to later designs, and is often referred to under a separate category, massively parallel computing. The first successful implementation of vector processing appears to be the Control Data Corporation STAR-100 and the Texas Instruments Advanced Scientific Computer (ASC). The basic ASC (i.e., "one pipe") ALU used a pipeline architecture that supported both scalar and vector computations, with peak performance reaching approximately 20 MFLOPS, readily achieved when processing long vectors. Expanded ALU configurations supported "two pipes" or "four pipes" with a corresponding 2X or 4X performance gain. Memory bandwidth was sufficient to support these expanded modes. The STAR was otherwise slower than CDC's own supercomputers like the CDC 7600, but at data-related tasks they could keep up while being much smaller and less expensive.
However the machine also took considerable time decoding the vector instructions and getting ready to run the process, so it required very specific data sets to work on before it actually sped anything up. The vector technique was first fully exploited in the famous Cray-1. Instead of leaving the data in memory like the STAR and ASC, the Cray design had eight "vector registers," which held sixty-four 64-bit words each. The vector instructions were applied between registers, which is much faster than talking to main memory. The Cray design used pipeline parallelism to implement vector instructions rather than multiple ALUs. In addition the design had completely separate pipelines for different instructions, for example, addition/subtraction was implemented in different hardware than multiplication. This allowed a batch of vector instructions themselves to be pipelined, a technique they called vector chaining. The Cray-1 normally had a performance of about 80 MFLOPS, but with up to three chains running it could peak at 240 MFLOPS a respectable number even as of 2002. Other examples followed. Control Data Corporation tried to re-enter the high-end market again with its ETA-10 machine, but it sold poorly and they took that as an opportunity to leave the supercomputing field entirely. In the early and mid-1980s Japanese companies (Fujitsu, Hitachi and Nippon Electric Corporation (NEC) introduced register-based vector machines similar to the Cray-1, typically being slightly faster and much smaller. Oregon-based Floating Point Systems (FPS) built add-on array processors for minicomputers, later

building their own minisupercomputers. However Cray continued to be the performance leader, continually beating the competition with a series of machines that led to the Cray-2, Cray X-MP and Cray Y-MP. Since then, the supercomputer market has focused much more on massively parallel processing rather than better implementations of vector processors. However, recognising the benefits of vector processing, IBM developed the Virtual Vector Architecture for use in supercomputers coupling several scalar processors to act as a vector processor. Vector processing techniques have since been added to almost all modern CPU designs, although they are typically referred to as SIMD. In these implementations, the vector unit runs beside the main scalar CPU, and is fed data from programs that know it is there.
Description
In general terms, CPUs are able to manipulate one or two pieces of data at a time. For instance, many CPUs have an instruction that essentially says "add A to B and put the result in C". The data for A, B and C could be, in theory at least, encoded directly into the instruction. However, things are rarely that simple. In general the data is rarely sent in raw form, and is instead "pointed to" by passing in an address to a memory location that holds the data. Decoding this address and getting the data out of the memory takes some time. As CPU speeds have increased, this memory latency has historically become a large impediment to performance; see Memory wall. In order to reduce the amount of time this takes, most modern CPUs use a technique known as instruction pipelining in which the instructions pass through several sub-units in turn. The first sub-unit reads the address and decodes it, the next "fetches" the values at those addresses, and the next does the math itself. With pipelining the "trick" is to start decoding the next instruction even before the first has left the CPU, in the fashion of an assembly line, so the address decoder is constantly in use. Any particular instruction takes the same amount of time to complete, a time known as the latency, but the CPU can process an entire batch of operations much faster than if it did so one at a time.
Vector processors take this concept one step further. Instead of pipelining just the instructions, they also pipeline the data itself. They are fed instructions that say not just to add A to B, but to add all of the numbers "from here to here" to all of the numbers "from there to there". Instead of constantly having to decode instructions and then fetch the data needed to complete them, the processor reads a single instruction from memory, and "knows" that the next address will be one larger than the last. This allows for significant savings in decoding time.
To illustrate what a difference this can make, consider the simple task of adding two groups of 10 numbers together. In a normal programming language you would write a "loop" that picked up each of the pairs of numbers in turn, and then added them. To the CPU, this would look something like this:
execute this loop 10 times
  read the next instruction and decode it
  fetch this number
  fetch that number
  add them
  put the result here
end loop
But to a vector processor, this task looks considerably different:
read instruction and decode it
fetch these 10 numbers
fetch those 10 numbers
add them
put the results here
There are several savings inherent in this approach. For one, only two address translations are needed. Depending on the architecture, this can represent a significant savings by itself.
Another saving is fetching and decoding the instruction itself, which has to be done only one time instead of ten. The code itself is also smaller, which can lead to more efficient memory use. But more than that, a vector processor may have multiple functional units adding those numbers in parallel. The checking of dependencies between those numbers is not required as a vector instruction specifies multiple

independent operations. This simplifies the control logic required, and can improve performance by avoiding stalls. As mentioned earlier, the Cray implementations took this a step further, allowing several different types of operations to be carried out at the same time. Consider code that adds two numbers and then multiplies by a third; in the Cray, these would all be fetched at once, and both added and multiplied in a single operation. Using the pseudocode above, the Cray did:
read instruction and decode it
fetch these 10 numbers
fetch those 10 numbers
fetch another 10 numbers
add and multiply them
put the results here
The math operations thus completed far faster overall, the limiting factor being the time required to fetch the data from memory. Not all problems can be attacked with this sort of solution. Adding these sorts of instructions necessarily adds complexity to the core CPU. That complexity typically makes other instructions run slower, i.e., whenever it is not adding up many numbers in a row. The more complex instructions also add to the complexity of the decoders, which might slow down the decoding of the more common instructions such as normal adding. In fact, vector processors work best only when there are large amounts of data to be worked on. For this reason, these sorts of CPUs were found primarily in supercomputers, as the supercomputers themselves were, in general, found in places such as weather prediction centres and physics labs, where huge amounts of data are "crunched".
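For a modern, SIMD-flavoured version of the same idea, here is a short sketch comparing an ordinary scalar loop with one using the SSE intrinsics found on x86 CPUs (assumes an x86/x86-64 compiler; build with -msse if needed). It only illustrates the "several data items per instruction" principle, not any particular vector machine.
/*
 * add_scalar(): one addition per loop iteration.
 * add_sse():    four float additions per instruction using SSE.
 */
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */

static void add_scalar(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

static void add_sse(const float *a, const float *b, float *c, int n)
{
    int i;
    for (i = 0; i + 4 <= n; i += 4) {            /* four lanes at a time */
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
    for (; i < n; i++)                           /* scalar tail */
        c[i] = a[i] + b[i];
}

int main(void)
{
    float a[10], b[10], c[10];
    for (int i = 0; i < 10; i++) { a[i] = (float)i; b[i] = 10.0f * i; }

    add_scalar(a, b, c, 10);
    printf("scalar: c[9] = %g\n", c[9]);
    add_sse(a, b, c, 10);
    printf("sse:    c[9] = %g\n", c[9]);
    return 0;
}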

Virtual memory

Virtual memory combines active RAM and inactive memory in disk form into a large range of contiguous addresses.
In computing, virtual memory is a memory management technique developed for multitasking kernels. This technique virtualizes a computer architecture's various forms of computer data storage (such as random-access memory and disk storage), allowing a program to be designed as though there is only one kind of memory, "virtual" memory, which behaves like directly addressable read/write memory (RAM). Most modern operating systems that support virtual memory also run each process in its own dedicated address space, allowing a program to be designed as though it has sole access to the virtual memory. However, some older operating systems (such as OS/VS1 and OS/VS2 SVS) and even modern ones (such as IBM i) are single address space operating systems that run all processes in a single address space composed of virtualized memory. Systems that employ virtual memory:
use hardware memory more efficiently than systems without virtual memory
make the programming of applications easier by:
o hiding fragmentation,
o delegating to the kernel the burden of managing the memory hierarchy (there is no need for the program to handle overlays explicitly),
o and, when each process is run in its own dedicated address space, by obviating the need to relocate program code or to access memory with relative addressing.
Memory virtualization is a generalization of the concept of virtual memory. Virtual memory is an integral part of a computer architecture; all implementations (excluding emulators and virtual machines) require hardware support, typically in the form of a memory management unit built into the CPU. Consequently, older operating systems, such as those for the mainframes of the 1960s, and those for personal computers of the early to mid 1980s, such as DOS, generally have no virtual memory functionality, though notable exceptions for mainframes of the 1960s include the Atlas Supervisor for the Atlas, the MCP for the Burroughs B5000, TSS/360 and CP/CMS for the IBM System/360 Model 67, Multics for the GE 645, and the Time Sharing Operating System for the RCA Spectra 70/46; notable exceptions for personal computer operating systems of the 1980s include the operating system for the Apple Lisa.

Embedded systems and other special-purpose computer systems that require very fast and/or very consistent response times may opt not to use virtual memory due to decreased determinism; virtual memory systems trigger unpredictable interrupts that may produce unwanted "jitter" during I/O operations. This is because embedded hardware costs are often kept low by implementing all such operations with software (a technique called bit-banging) rather than with dedicated hardware. In any case, embedded systems usually have little use for complicated memory hierarchies. History In the 1940s and 1950s, all larger programs had to contain logic for managing primary and secondary storage, such as overlaying. Virtual memory was therefore introduced not only to extend primary memory, but to make such an extension as easy as possible for programmers to use. 2 To allow for multiprogramming and multitasking, many early systems divided memory between multiple programs without virtual memory, such as early models of the PDP-10 via registers. Paging was first developed at the University of Manchester as a way to extend the Atlas Computer's working memory by combining its 16 thousand words of primary core memory with an additional 96 thousand words of secondary drum memory. The first Atlas was commissioned in 1962 but working prototypes of paging had been developed by 1959. 2 (p2) 3 4 In 1961, the Burroughs Corporation independently released the first commercial computer with virtual memory, the B5000, with segmentation rather than paging. 5 6 Before virtual memory could be implemented in mainstream operating systems, many problems had to be addressed. Dynamic address translation required expensive and difficult to build specialized hardware; initial implementations slowed down slightly access to memory. 2 There were worries that new system-wide algorithms utilizing secondary storage would be less effective than previously used application-specific algorithms. By 1969, the debate over virtual memory for commercial computers was over; 2 an IBM research team led by David Sayre showed that their virtual memory overlay system consistently worked better than the best manually controlled systems. citation needed The first minicomputer to introduce virtual memory was the Norwegian NORD-1; during the 1970s, other minicomputers implemented virtual memory, notably VAX models running VMS. Virtual memory was introduced to the x86 architecture with the protected mode of the Intel 80286 processor, but its segment swapping technique scaled poorly to larger segment sizes. The Intel 80386 introduced paging support underneath the existing segmentation layer, enabling the page fault exception to chain with other exceptions without double fault. However, loading segment descriptors was an expensive operation, causing operating system designers to rely strictly on paging rather than a combination of paging and segmentation. Paged virtual memory Nearly all implementations of virtual memory divide a virtual address space into pages, blocks of contiguous virtual memory addresses. Pages are usually at least 4 kibibytes in size; systems with large virtual address ranges or amounts of real memory generally use larger page sizes. Page tables Page tables are used to translate the virtual addresses seen by the application into physical addresses used by the hardware to process instructions; such hardware that handles this specific translation is often known as the memory management unit. 
Each entry in the page table holds a flag indicating whether the corresponding page is in real memory or not. If it is in real memory, the page table entry will contain the real memory address at which the page is stored. When a reference is made to a page by the hardware, if the page table entry for the page indicates that it is not currently in real memory, the hardware raises a page fault exception, invoking the paging supervisor component of the operating system. Systems can have one page table for the whole system, separate page tables for each application and segment, a tree of page tables for large segments or some combination of these. If there is only one page table, different applications running at the same time use different parts of a single range of virtual addresses. If there are multiple page or segment tables, there are multiple virtual address spaces and concurrent applications with separate page tables redirect to different real addresses. Paging supervisor This part of the operating system creates and manages page tables. If the hardware raises a page fault exception, the paging supervisor accesses secondary storage, returns the page that has the virtual address that resulted in

the page fault, updates the page tables to reflect the physical location of the virtual address and tells the translation mechanism to restart the request. When all physical memory is already in use, the paging supervisor must free a page in primary storage to hold the swapped-in page. The supervisor uses one of a variety of page replacement algorithms such as least recently used to determine which page to free. Pinned pages Operating systems have memory areas that are pinned (never swapped to secondary storage). For example, interrupt mechanisms rely on an array of pointers to their handlers, such as I/O completion and page fault. If the pages containing these pointers or the code that they invoke were pageable, interrupt-handling would become far more complex and time-consuming, particularly in the case of page fault interrupts. Hence, some part of the page table structures is not pageable. Some pages may be pinned for short periods of time, others may be pinned for long periods of time, and still others may need to be permanently pinned. For example: Thepagingsupervisorcodeanddriversforsecondarystoragedevicesonwhichpagesresidemustbe permanently pinned, as otherwise paging wouldn't even work because the necessary code wouldn't be available. Timingdependentcomponentsmaybepinnedtoavoidvariablepagingdelays. Databuffersthatareaccesseddirectlyby,say,peripheraldevicesthatusedirectmemoryaccessorI/O channelsmustresideinpinnedpageswhiletheI/Ooperationisinprogressbecausesuchdevicesandthe buses to which they are attached expect to find data buffers located at physical memory addresses; regardlessofwhetherthebushasamemorymanagementunitforI/O,transferscannotbestoppedifa pagefaultoccursandthenrestartedwhenthepagefaulthasbeenprocessed. In IBM's operating systems for System/370 and successor systems, the term is "fixed", and pages may be longterm fixed, or may be short-term fixed. Control structures are often long-term fixed (measured in wall-clock time, i.e., time measured in seconds, rather than time measured in less than one second intervals) whereas I/O buffers are usually short-term fixed (usually measured in significantly less than wall-clock time, possibly for a few milliseconds). Indeed, the OS has a special facility for "fast fixing" these short-term fixed data buffers (fixing which is performed without resorting to a time-consuming Supervisor Call instruction). Additionally, the OS has yet another facility for converting an application from being long-term fixed to being fixed for an indefinite period, possibly for days, months or even years (however, this facility implicitly requires that the application firstly be swapped-out, possibly from preferred-memory, or a mixture of preferred- and nonpreferred memory, and secondly be swapped-in to non-preferred memory where it resides for the duration, however long that might be; this facility utilizes a documented Supervisor Call instruction). Virtualrealoperation In OS/VS1 and similar OSes, some parts of systems memory are managed in virtual-real mode, where every virtual address corresponds to a real address, specifically interrupt mechanisms, paging supervisor and tables in older systems, and application programs using non-standard I/O management. For example, IBM's z/OS has 3 modes (virtual-virtual, virtual-real and virtual-fixed). 
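The following toy fragment sketches the translation and page-fault check described above. The structures and sizes (a single-level, 16-entry table with 4 KB pages) are invented for illustration; real systems use multi-level tables that are walked by the MMU in hardware, with the paging supervisor invoked only on a fault.
/* Toy single-level page table: translate a virtual address or "fault". */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096u                /* 4 KB pages, 12-bit offset      */
#define NUM_PAGES 16u                  /* tiny 16-page virtual space     */

typedef struct {
    unsigned present : 1;              /* is the page in real memory?    */
    unsigned frame   : 20;             /* physical frame number if so    */
} pte_t;

static pte_t page_table[NUM_PAGES] = {
    [0] = { 1, 7 },                    /* virtual page 0 -> frame 7      */
    [1] = { 1, 3 },                    /* virtual page 1 -> frame 3      */
    /* all other pages are not present */
};

static int translate(uint32_t vaddr, uint32_t *paddr)
{
    uint32_t page   = vaddr / PAGE_SIZE;
    uint32_t offset = vaddr % PAGE_SIZE;

    if (page >= NUM_PAGES || !page_table[page].present) {
        /* here the hardware would raise a page fault exception and the
         * paging supervisor would bring the page in from secondary storage */
        printf("page fault at virtual address 0x%x (page %u)\n", vaddr, page);
        return -1;
    }
    *paddr = page_table[page].frame * PAGE_SIZE + offset;
    return 0;
}

int main(void)
{
    uint32_t p;
    if (translate(0x1234, &p) == 0)    /* page 1, offset 0x234           */
        printf("virtual 0x1234 -> physical 0x%x\n", p);
    translate(0x5000, &p);             /* page 5: not present -> fault   */
    return 0;
}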
Thrashing
When paging is used, a problem called "thrashing" can occur, in which the computer spends an excessive amount of time swapping pages to and from a backing store, hence slowing down useful work. Adding real memory is the simplest response, but improving application design, scheduling, and memory usage can help.
Segmented virtual memory
Some systems, such as the Burroughs B5500, do not use paging; instead, they use segmentation, dividing virtual address spaces into variable-length segments. A virtual address here consists of a segment number and an offset within the segment. The Intel 80286 supports a similar segmentation scheme as an option, but it is little used. Segmentation and paging can be combined by dividing each segment into pages; systems with this memory structure, such as Multics and the IBM System/38, are usually paging-predominant, with segmentation providing memory protection. In the Intel 80386 and later IA-32 processors, the segments reside in a 32-bit linear, paged address space. Segments can be moved in and out of that space; pages there can "page" in and out of main memory, providing

two levels of virtual memory; few if any operating systems do so, instead using only paging. Early non-hardware-assisted x86 virtualization solutions combined paging and segmentation because x86 paging offers only two protection domains whereas a VMM / guest OS / guest applications stack needs three. The difference between paging and segmentation systems is not only about memory division; segmentation is visible to user processes, as part of memory model semantics. Hence, instead of memory that looks like a single large vector, it is structured into multiple spaces. This difference has important consequences; a segment is not a page with variable length or a simple way to lengthen the address space. Segmentation that can provide a single-level memory model, in which there is no differentiation between process memory and file system, consists of only a list of segments (files) mapped into the process's potential address space. This is not the same as the mechanisms provided by calls such as mmap and Win32's MapViewOfFile, because inter-file pointers do not work when mapping files into semi-arbitrary places. In Multics, a file (or a segment from a multi-segment file) is mapped into a segment in the address space, so files are always mapped at a segment boundary. A file's linkage section can contain pointers for which an attempt to load the pointer into a register or make an indirect reference through it causes a trap. The unresolved pointer contains an indication of the name of the segment to which the pointer refers and an offset within the segment; the handler for the trap maps the segment into the address space, puts the segment number into the pointer, changes the tag field in the pointer so that it no longer causes a trap, and returns to the code where the trap occurred, re-executing the instruction that caused the trap. This eliminates the need for a linker completely and works when different processes map the same file into different places in their private address spaces.
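Since mmap comes up in the comparison above, here is a minimal POSIX example of mapping a file into the process's address space and reading it through ordinary pointers; the particular file name is just a convenient example.
/* Map a file read-only into the address space and print its contents. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/etc/hostname", O_RDONLY);   /* any readable file works */
    struct stat st;

    if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0)
        return 1;

    /* Map the whole file, read-only and private to this process. */
    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED)
        return 1;

    /* The file's bytes are now ordinary memory; the kernel pages them in
     * on demand, like any other part of the virtual address space. */
    fwrite(data, 1, st.st_size, stdout);

    munmap(data, st.st_size);
    close(fd);
    return 0;
}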

Page (computer memory)
A page, memory page, or virtual page is a fixed-length contiguous block of virtual memory that is the smallest unit of data for the following:
memory allocation performed by the operating system for a program; and
transfer between main memory and any other auxiliary store, such as a hard disk drive.
Virtual memory allows a page that does not currently reside in main memory to be addressed and used. If a program tries to access a location in such a page, an exception called a page fault is generated. The hardware or operating system is notified and loads the required page from the auxiliary store automatically. A program addressing the memory has no knowledge of a page fault or a process following it. Thus a program can address more (virtual) RAM than physically exists in the computer. A transfer of pages between main memory and an auxiliary store, such as a hard disk drive, is referred to as paging or swapping.
Page size tradeoff
Page size is usually determined by processor architecture. Traditionally, pages in a system had uniform size, for example 4096 bytes. However, processor designs often allow two or more, sometimes simultaneous, page sizes due to the benefits and penalties. There are several points that can factor into choosing the best page size.
Page size versus page table size
A system with a smaller page size uses more pages, requiring a page table that occupies more space. For example, if a 2^32-byte virtual address space is mapped to 4 KB (2^12-byte) pages, the number of virtual pages is 2^20 (= 2^32 / 2^12). However, if the page size is increased to 32 KB (2^15 bytes), only 2^17 pages are required.
Page size versus TLB usage
Since every access to memory must be mapped from a virtual to a physical address, reading the page table every time can be quite costly. Therefore, a very fast kind of cache, the Translation Lookaside Buffer (TLB), is often used. The TLB is typically of limited size, and when it cannot satisfy a given request (a TLB miss) the page tables must be searched manually (either in hardware or software, depending on the architecture) for the correct mapping. Larger page sizes mean that a TLB cache of the same size can keep track of larger amounts of memory, which avoids costly TLB misses.
Internal fragmentation of pages
Rarely do processes require the use of an exact number of pages. As a result, the last page will likely be only partially full, wasting some amount of memory. Larger page sizes clearly increase the potential for wasted memory this way, as more potentially unused portions of memory are loaded into main memory. Smaller page sizes ensure a closer match to the actual amount of memory required in an allocation. As an example, assume the page size is 1024 KB. If a process allocates 1025 KB, two pages must be used, resulting in 1023 KB of unused space (where one page fully consumes 1024 KB and the other only 1 KB).
Page size versus disk access
When transferring from disk, much of the delay is caused by seek time, the time it takes to correctly position the read/write heads above the disk platters. Because of this, large sequential transfers are more efficient than several smaller transfers. Transferring the same amount of data from disk to memory often requires less time with larger pages than with smaller pages.
Determining the page size in a program
Most operating systems allow programs to discover the page size at runtime. This allows programs to use memory more efficiently by aligning allocations to this size and reducing overall internal fragmentation of pages.
Unix and POSIX-based operating systems
Unix and POSIX-based systems may use the system function sysconf(), as illustrated in the following example written in the C programming language.
#include <stdio.h>
#include <unistd.h>   /* sysconf(3) */

int main(void)
{
    printf("The page size for this system is %ld bytes.\n",
           sysconf(_SC_PAGESIZE));   /* _SC_PAGE_SIZE is OK too. */
    return 0;
}
In many Unix systems the command-line utility getconf can be used. For example, getconf PAGESIZE will return the page size in bytes.
Windows-based operating systems
Win32-based operating systems, such as Windows 9x and NT, may use the system function GetSystemInfo() from kernel32.dll.
#include <stdio.h>
#include <windows.h>

int main(void)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    printf("The page size for this system is %u bytes.\n",
           si.dwPageSize);
    return 0;
}
Huge pages
Huge page size depends on the processor architecture, its type and operating (addressing) mode. The operating system configuration selects one of the sizes statically defined by the architecture. Note that not all processors implement all of the defined huge/large page sizes.
Architecture    Page Size    Huge Page Size                                  Large Page Size
i386            4 KB         4 MB (2 MB in PAE mode)                         1 GB
ia64            4 KB         4 K, 8 K, 64 K, 256 K, 1 M, 4 M, 16 M, 256 M
ppc64           4 KB         16 MB
Info from: http://wiki.debian.org/Hugepages (TODO: supplement with information from processor manufacturers' documentation)
Some instruction set architectures can support multiple page sizes, including pages significantly larger than the standard page size. Starting with the Pentium Pro, x86 processors support 4 MB pages (called Page Size Extension), or 2 MB pages if using PAE, in addition to their standard 4 kB pages; newer x86-64 processors, such as AMD's newer AMD64 processors and Intel's Westmere processors, can use 1 GB pages in long mode. IA-64 supports as many as eight different page sizes, from 4 kB up to 256 MB, and some other architectures have similar features. This support for huge pages (known as superpages in FreeBSD, and large pages in Microsoft Windows terminology) allows for "the best of both worlds", reducing the pressure on the TLB cache (sometimes increasing speed by as much as 15%, depending on the application and the allocation size) for large allocations while still keeping memory usage at a reasonable level for small allocations. Huge pages, despite being available in the processors used in most contemporary personal computers, are not in common use except in large servers and computational clusters. Commonly, their use requires elevated privileges, cooperation from the application making the large allocation (usually setting a flag to ask the operating system for huge pages), or manual administrator configuration; operating systems commonly, sometimes by design, cannot page them out to disk. Linux has supported huge pages on several architectures since the 2.6 series via the hugetlbfs filesystem, and without hugetlbfs since 2.6.38. Windows Server 2003 (SP1 and newer), Windows Vista and Windows Server 2008 support huge pages under the name of large pages. Windows 2000 and Windows XP

support large pages internally 5 , but are not exposed to applications. Solaris beginning with version 9 supports large pages on SPARC and x86. 6 7 FreeBSD 7.2-RELEASE features superpages. 8 Note that until recently in Linux, applications needed to be modified in order to use huge pages. The 2.6.38 kernel introduced support for transparent use of huge pages. 9 On FreeBSD and Solaris, applications take advantage of huge pages automatically, without the need for modification. citation needed Memory segmentation Memory segmentation is the division of computer memory into segments or sections. Segments or sections are also used in object files of compiled programs when they are linked together into a program image, or the image is loaded into memory. In a computer system using segmentation, a reference to a memory location includes a value that identifies a segment and an offset within that segment. Different segments may be created for different program modules, or for different classes of memory usage such as code and data segments. Certain segments may even be shared between programs. 1 Hardware implementation Memory segmentation is one of the most common ways to achieve memory protection; another common one is paging, and both methods can be combined. The size of memory segments is generally not fixed and is usually less than any limitation imposed by the computer, otherwise segmentation could be treated the same as paging. Also segmentation is generally more visible than paging because the programmer or compiler has to define the segments. 1 A segment has a set of permissions, and a length, associated with it. A process is only allowed to make a reference into a segment if the type of reference is allowed by the permissions, and the offset within the segment is within the range specified by the length of the segment. Otherwise, a hardware exception such as a segmentation fault is raised. As well as its set of permissions and length, a segment also has associated with it information indicating where the segment is located in memory. It may also have a flag indicating whether the segment is present in main memory or not; if a segment is accessed that is not present in main memory, an exception is raised, and the operating system will read the segment into memory from secondary storage. The information indicating where the segment is located in memory might be the address of the first location in the segment, or might be the address of a page table for the segment, if the segmentation is implemented with paging. In the first case, if a reference to a location within a segment is made, the offset within the segment will be added to address of the first location in the segment to give the address in memory of the referred-to item; in the second case, the offset of the segment is translated to a memory address using the page table. When a segment does not have a page table associated with it, the address of the first location in the segment is usually an address in main memory; in those situations, no paging is done. In the Intel 80386 and later, that address can either be an address in main memory, if paging is not enabled, or an address in a paged address space, if paging is enabled. A memory management unit (MMU) is responsible for translating a segment and offset within that segment into a memory address, and for performing checks to make sure the translation can be done and that the reference to that segment and offset is permitted. 
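A toy model of those checks, with an invented two-entry segment table, might look like the sketch below; it only illustrates the base/limit/permission logic described above, not any real MMU or descriptor format.
/* Toy segment table: translate (segment, offset) or report a fault. */
#include <stdint.h>
#include <stdio.h>

#define PERM_R 1u
#define PERM_W 2u

typedef struct {
    uint32_t base;     /* where the segment starts in memory            */
    uint32_t limit;    /* length of the segment in bytes                */
    uint32_t perms;    /* allowed access types                          */
} segment_t;

static segment_t segtab[] = {
    { 0x10000, 0x2000, PERM_R | PERM_W },   /* segment 0: data, 8 KB    */
    { 0x40000, 0x1000, PERM_R },            /* segment 1: code, 4 KB    */
};

static int seg_translate(unsigned seg, uint32_t off, uint32_t want, uint32_t *addr)
{
    if (seg >= sizeof segtab / sizeof segtab[0] ||
        off >= segtab[seg].limit ||
        (segtab[seg].perms & want) != want) {
        printf("segmentation fault: seg %u offset 0x%x\n", seg, off);
        return -1;                          /* hardware would raise an exception */
    }
    *addr = segtab[seg].base + off;         /* base + offset translation */
    return 0;
}

int main(void)
{
    uint32_t a;
    if (seg_translate(0, 0x0100, PERM_W, &a) == 0)  /* in range, writable     */
        printf("seg 0 + 0x100 -> 0x%x\n", a);
    seg_translate(1, 0x0010, PERM_W, &a);           /* code segment: no write */
    seg_translate(0, 0x9999, PERM_R, &a);           /* offset past the limit  */
    return 0;
}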
x86 architecture
Main article: x86 memory segmentation
The x86 memory segmentation used by early x86 processors, beginning with the Intel 8086, does not provide any protection. Any program running on these processors can access any segment with no restrictions. A segment is only identified by its starting location; there is no length checking. Segmentation in the Intel 80286 and later provides protection: with the introduction of the 80286, Intel retroactively named the sole operating mode of the previous x86 CPU models "real mode" and introduced a new "protected mode" with protection features. For backward compatibility, all x86 CPUs start in "real mode" with no memory protection, fixed 64 KiB segments, and only 20-bit (1024 KiB) addressing, and an 80286 or later processor must be switched into another mode by software in order to use its full address space and advanced MMU features.

Memory Consistency Models
These notes describe some of the important memory consistency models which have been considered in recent years. The basic point is going to be that trying to implement our intuitive notion of what it means for memory to be consistent is really hard and terribly expensive, and isn't necessary to get a properly written parallel program to run correctly. So we're going to produce a series of weaker definitions that will be easier to implement, but will still allow us to write a parallel program that runs predictably.
Notation
In describing the behavior of these memory models, we are only interested in the shared memory behavior - not anything else related to the programs. We aren't interested in control flow within the programs, data manipulations within the programs, or behavior related to local (in the sense of non-shared) variables. There is a standard notation for this, which we'll be using in what follows. In the notation, there will be a line for each processor in the system, and time proceeds from left to right. Each shared-memory operation performed will appear on the processor's line. The two main operations are Read and Write, which are expressed as W(var)value, which means "write value to shared variable var", and R(var)value, which means "read shared variable var, obtaining value." So, for instance, W(x)1 means "write a 1 to x" and R(y)3 means "read y, and get the value 3." More operations (especially synchronization operations) will be defined as we go on. For simplicity, variables are assumed to be initialized to 0. An important thing to notice about this is that a single high-level language statement (like x = x + 1;) will typically appear as several memory operations. If x previously had a value of 0, then that statement becomes (in the absence of any other processors)
P1: R(x)0  W(x)1
----------------
On a RISC-style processor, it's likely that C statement would have turned into three instructions: a load, an add, and a store. Of those three instructions, two affect memory and are shown in the diagram. On a CISC-style processor, the statement would probably have turned into a single, in-memory add instruction. Even so, the processor would have executed the instruction by reading memory, doing the addition, and then writing memory, so it would still appear as two memory operations. Notice that the actual memory operations performed could equally well have been performed by some completely different high-level language code; maybe an if-then-else statement that checked and then set a flag. If I ask for memory operations and there is anything in your answer that looks like a transformation or something of the data, then something is wrong!
Strict Consistency
The intuitive notion of memory consistency is the strict consistency model. In the strict model, any read to a memory location X returns the value stored by the most recent write operation to X. If we have a bunch of processors, with no caches, talking to memory through a bus, then we will have strict consistency. The point here is the precise serialization of all memory accesses. We can give an example of what is, and what is not, strict consistency and also show an example of the notation for operations in the memory system. As we said before, we assume that all variables have a value of 0 before we begin. An example of a scenario that would be valid under the strict consistency model is the following:

P1: W(x)1
-----------------------
P2:        R(x)1  R(x)1
This says, ``processor P1 writes a value of 1 to variable x; at some later time processor P2 reads x and obtains a value of 1. Then it reads it again and gets the same value.''
Here's another scenario which would be valid under strict consistency:
P1:        W(x)1
------------------------------
P2: R(x)0         R(x)1
This time, P2 got a little ahead of P1; its first read of x got a value of 0, while its second read got the 1 that was written by P1. Notice that these two scenarios could be obtained in two runs of the same program on the same processors.
Here's a scenario which would not be valid under strict consistency:
P1: W(x)1
-----------------------
P2:        R(x)0  R(x)1
In this scenario, the new value of x had not been propagated to P2 yet when it did its first read, but it did reach it eventually. I've also seen this model called atomic consistency.
Sequential Consistency
Sequential consistency is a slightly weaker model than strict consistency. It was defined by Lamport as the result of any execution is the same as if the reads and writes occurred in some order, and the operations of each individual processor appear in this sequence in the order specified by its program. In essence, any ordering that could have been produced by a strict ordering regardless of processor speeds is valid under sequential consistency. The idea is that by expanding from the sets of reads and writes that actually happened to the sets that could have happened, we can reason more effectively about the program (since we can ask the far more useful question, "could the program have broken?"). We can reason about the program itself, with less interference from the details of the hardware on which it is running. It's probably fair to say that if we have a computer system that really uses strict consistency, we'll want to reason about it using sequential consistency. The third scenario above would be valid under sequential consistency. Here's another scenario that would be valid under sequential consistency:
P1: W(x)1
-----------------------
P2:        R(x)1  R(x)2
-----------------------
P3:        R(x)1  R(x)2
-----------------------
P4: W(x)2
This one is valid under sequential consistency because the following alternate interleaving would have been valid under strict consistency:

P1: W(x)1
-----------------------------
P2:        R(x)1        R(x)2
-----------------------------
P3:        R(x)1        R(x)2
-----------------------------
P4:               W(x)2
Here's a scenario that would not be valid under sequential consistency:
P1: W(x)1
-----------------------
P2: R(x)1  R(x)2
-----------------------
P3: R(x)2  R(x)1
-----------------------
P4: W(x)2
Oddly enough, the precise definition, as given by Lamport, doesn't even require that ordinary notions of causality be maintained; it's possible to see the result of a write before the write itself takes place, as in:
P1:        W(x)1
-----------------------
P2: R(x)1
This is valid because there is a different ordering which, in strict consistency, would yield P2 reading x as having a value of 1. This isn't a flaw in the model; if your program can indeed violate causality like this, you're missing some synchronization operations in your program. Note that we haven't talked about synchronization operations yet; we will soon.
Cache Coherence
Most authors treat cache coherence as being virtually synonymous with sequential consistency; it is perhaps surprising that it isn't. Sequential consistency requires a globally (i.e. across all memory locations) consistent view of memory operations; cache coherence only requires a locally (i.e. per-location) consistent view. Here's an example of a scenario that would be valid under cache coherence but not sequential consistency:
P1: W(x)1  W(y)2
-----------------------
P2: R(x)0  R(x)2  R(x)1  R(y)0  R(y)1
-----------------------
P3: R(y)0  R(y)1  R(x)0  R(x)1
-----------------------
P4: W(x)2  W(y)1
P2 and P3 both saw P1's write to x as occurring after P4's (and in fact P3 never saw P4's write to x at all), and saw P4's write to y as occurring after P1's (this time, neither saw P1's write as occurring at all). But P2 saw P4's write to y as occurring after P1's write to x, while P3 saw P1's write to x occurring after P4's write to y. This couldn't happen with a snoopy-cache based scheme. But it certainly could with a directory-based scheme.
Do We Really Need Such a Strong Model?
Consider the following situation in a shared memory multiprocessor: processes running on two processors each change the value of a shared variable x, like this:
P1              P2
x = x + 1;      x = x + 2;
What happens? Without any additional information, there are four different orders in which the two processes can execute these statements, resulting in three different results:

P1 executes first: x will get a new value of 3.
P2 executes first: x will get a new value of 3.
P1 and P2 both read the data; P1 writes the modified version before P2 does: x will get a new value of 2.
P1 and P2 both read the data; P2 writes the modified version before P1 does: x will get a new value of 1.

We can characterize a program like this pretty easily and concisely: it's got a bug. With a bit more precision, we can say it has a data race: there is a variable modified by more than one process in such a way that the results depend on who gets there first. For this program to behave reliably, we have to have locks guaranteeing that one of the processes performs its entire operation before the other one starts.

So... given that we have a data race, and the program's behavior is going to be unpredictable anyway, does it really matter whether all the processors see all the changes in the same order? Attempting to achieve strict or sequential consistency might be regarded as trying to support the semantics of buggy programs -- since the result of the program is random anyway, why should we care whether it results in the right random value? But it gets worse, as we consider in the next sections...

Optimizations and Consistency

Even if the program contains no bugs as written, compilers don't actually support sequential consistency in general (compilers generally aren't aware of the existence of other processors, let alone a consistency model. We can argue that perhaps this argues a need for languages with parallel semantics, but as long as programmers are going to use C and Java for parallel programs we're going to have to support them). Most languages support a semantics in which program order is maintained for each memory location, but not across memory locations; this gives compilers the freedom to reorder code. So, for instance, if a program writes two variables x and y, and they do not depend on each other, the compiler is free to write these two values to memory in either order without affecting the correctness of the program. In a parallel environment, however, it is quite likely that a process running on some other processor does depend on the order in which x and y were written.

Two-process mutual exclusion gives a good example of this. Remember that the code for process i to enter the critical section is given by

    flag[i] = true;
    turn = 1 - i;
    while (flag[1-i] && (turn == 1 - i))
        ;
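For reference, here is a minimal sketch of the complete two-process protocol (Peterson's algorithm) as plain C; the function names and the enclosing structure are my own framing rather than anything fixed by the notes, and the code assumes the compiler and hardware actually perform these accesses in program order -- exactly the assumption questioned next.

    /* Sketch of two-process mutual exclusion (Peterson's algorithm).
       Process ids are 0 and 1. */
    #include <stdbool.h>

    bool flag[2] = { false, false };  /* flag[i]: process i wants to enter */
    int  turn    = 0;                 /* which process defers              */

    void enter_region(int i)          /* i is 0 or 1 */
    {
        flag[i] = true;               /* announce interest           */
        turn = 1 - i;                 /* give priority to the other  */
        while (flag[1 - i] && turn == 1 - i)
            ;                         /* busy-wait                   */
    }

    void leave_region(int i)
    {
        flag[i] = false;              /* leave the critical section  */
    }

The discussion that follows is about what happens when the compiler or the hardware does not preserve the order of the two writes in enter_region.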

If the compiler decides (for whatever reason) to reverse the order of the writes to flag[i] and turn, this is perfectly correct code in a single-process environment but broken in a multiprocessing environment (and, of course, that's the situation that matters). Worse, since processors support out-of-order execution, there's no guarantee that the program, as executed, will perform its memory accesses in the order specified by the machine code! Worse still, as processors and caches get ever more tightly coupled, and as machines use more and more aggressive instruction reordering, these sorts of optimizations can end up happening in hardware with little or no control (it's very easy to imagine a machine finishing the update to turn while it's still setting flag[i] up above, since accessing flag[i] involves access to an array).

This is a little bit of a red herring, since we can require that our compiler perform accesses of shared memory in the order specified by the program (the volatile keyword specifies this). In the case of Intel processors, we can also force some ordering on memory accesses by using the lock prefix on instructions. But notice that what we are doing by adding these keywords and prefixes is establishing places in the code where we care about the precise ordering, and places where we do not. The following memory models expand on this idea.

Processor Consistency

This model is also called PRAM (an acronym for Pipelined Random Access Memory, not the Parallel Random Access Machine model from computability theory) consistency. It is defined as follows:

Writes done by a single processor are received by all other processors in the order in which they were issued, but writes from different processors may be seen in a different order by different processors.

The basic idea of processor consistency is to better reflect the reality of networks in which the latency between different nodes can be different. The last scenario in the sequential consistency section, which wasn't valid for sequential consistency, would be valid for processor consistency. Here's how it could come about, in a machine in which the processors are connected by something more complex than a bus:

1. The processors are connected in a linear array: P1 -- P2 -- P3 -- P4.
2. On the first cycle, P1 and P4 write their values and propagate them.
3. On the second cycle, the value from P1 has reached P2, and the value from P4 has reached P3. They read the values, seeing 1 and 2 respectively.
4. On the third cycle, the values have made it two hops. So now P2 sees 2 and P3 sees 1.

So you can see we meet the "hard" part of the definition (the part requiring writes from a single processor to be seen in order) somewhat vacuously: P1 and P4 only make one write each, so P2 and P3 end up seeing P1's writes, and P4's writes, in order. But the point of the example is the counterintuitive part of the definition: they don't see the writes from P1 and from P4 as being in the same order.

Here's a scenario which would not be valid for processor consistency:

P1: W(x)1 W(x)2
-----------------------------
P2: R(x)2 R(x)1

P2 has seen the writes from P1 in an order different than they were issued. It turns out that the two-process mutual exclusion code above is broken under processor consistency. One final note on processor consistency and PRAM consistency is that some authors make processor consistency slightly stronger than PRAM by requiring it to be both PRAM consistent and cache coherent.

Synchronization Accesses vs. Ordinary Accesses

A correctly written shared-memory parallel program will use mutual exclusion to guard access to shared variables. In the first buggy example above, we can guarantee deterministic behavior by adding a barrier to the code, which we'll denote as S for reasons that will become apparent later:

P1:  x = x + 1;  S;
P2:  S;  x = x + 2;

In general, in a correct parallel program we obtain exclusive access to a set of shared variables, manipulate them any way we want, and then relinquish access, distributing the new values to the rest of the system. The other processors don't need to see any of the intermediate values; they only need to see the final values. With this in mind, we can look at the different types of memory accesses more carefully. Here's a figure that shows a classification of shared memory accesses (Gharachorloo). The various types of memory accesses are defined as follows:

Shared Access
Actually, we can have shared access to variables vs. private access. But the questions we're considering are only relevant for shared accesses, so that's all we're showing.

Competing vs. Non-Competing
If we have two accesses from different processors, and at least one is a write, they are competing accesses. They are considered competing accesses because the result depends on which access occurs first (if there are two accesses, but they're both reads, it doesn't matter which is first).

Synchronizing vs. Non-Synchronizing
Ordinary competing accesses, such as variable accesses, are non-synchronizing accesses. Accesses used in synchronizing the processes are (of course) synchronizing accesses.

Acquire vs. Release
Finally, we can divide synchronization accesses into accesses that acquire locks, and accesses that release locks.

Remember that synchronization accesses should be much less common than other competing accesses (if you're spending all your time performing synchronization accesses there's something seriously wrong with your program!). So we can further weaken the memory models we use by treating sync accesses differently from other accesses.

Weak Consistency

Weak consistency results if we only consider competing accesses as being divided into synchronizing and non-synchronizing accesses, and require the following properties:

1. Accesses to synchronization variables are sequentially consistent.
2. No access to a synchronization variable is allowed to be performed until all previous writes have completed everywhere.
3. No data access (read or write) is allowed to be performed until all previous accesses to synchronization variables have been performed.

Here's a valid scenario under weak consistency, which shows its real strength:

P1: W(x)1 W(x)2 S
-----------------------------
P2: R(x)0 R(x)2 S R(x)2
-----------------------------
P3: R(x)1 S R(x)2

In other words, there is no requirement that a processor broadcast the changed values of variables at all until the synchronization accesses take place. In a distributed system based on a network instead of a bus, this can dramatically reduce the amount of communication needed. (Notice that nobody would deliberately write a program that behaved like this in practice; you'd never want to read variables that somebody else is updating. The only reads would be after the S. I've mentioned in lecture that there are a few parallel algorithms, such as relaxation algorithms, that don't require normal notions of memory consistency. These algorithms wouldn't work in a weakly consistent system that really deferred all data communications until sync points.)

Release Consistency

Having a single synchronization access type requires that, when a synchronization occurs, we globally update memory -- our local changes need to be propagated to all the other processors with copies of the shared variable, and we need to obtain their changes. Release consistency considers locks on areas of memory, and propagates only the locked memory as needed. It's defined as follows:

1. Before an ordinary access to a shared variable is performed, all previous acquires done by the process must have completed successfully.
2. Before a release is allowed to be performed, all previous reads and writes done by the process must have completed.
3. The acquire and release accesses must be sequentially consistent.

One Last Point

It should be pretty clear that a sync access is a pretty heavyweight operation, since it requires globally synchronizing memory. But the strength of these memory models is that the cost of one of these sync operations isn't any worse than the cost that every single memory access pays in a sequentially consistent system.
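One concrete way to see the acquire/release distinction is in C11's atomic operations, which expose exactly this classification. The sketch below is my own illustration, not something taken from the notes; the variable names and the producer/consumer framing are assumptions made for the example.

    /* Sketch: release/acquire ordering with C11 atomics.
       The writer publishes data with an ordinary store, then performs a
       release store on a flag; the reader performs an acquire load on the
       flag before touching the data.  Only the synchronization accesses
       (on `ready`) need to be ordered globally. */
    #include <stdatomic.h>
    #include <stdbool.h>

    int data;                      /* ordinary shared access   */
    atomic_bool ready = false;     /* synchronization variable */

    void producer(void)
    {
        data = 42;                                          /* ordinary write   */
        atomic_store_explicit(&ready, true,
                              memory_order_release);        /* "release" access */
    }

    int consumer(void)
    {
        while (!atomic_load_explicit(&ready,
                                     memory_order_acquire)) /* "acquire" access */
            ;                                               /* spin until published */
        return data;                                        /* guaranteed to see 42 */
    }

This matches the definition above: ordering is enforced only at the acquire and the release, so the ordinary write to data can be buffered, combined, or delayed until the release happens.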

Simple Pipeline

The first way to go about speeding up a uniprocessor is to use pipelining. We do this by identifying the stages in the execution of an instruction (note that if we have simple instructions, this is easier to do -- one of the motivations of RISC). In DLX, we can divide instruction execution into

1. Instruction Fetch
2. PC Increment
3. Instruction Decode
4. Register Fetch
5. Second Register Fetch (if needed)
6. Arithmetic
7. Memory read/write (if needed)
8. Register writeback

Now, how can we get these steps to operate as quickly as possible? An early observation is that the instruction set is oriented toward instructions that require two source registers and produce one result. It looks really appropriate to give the register file two read ports, so the Register Fetch step can be done in a single cycle. A second observation is that if we want to fetch two registers, the register numbers will always be in the same place in the instruction. In fact, if we want, we can go ahead and read the registers before we even know what instruction is to be performed, and throw away the data if we don't turn out to need it!

This is a point about hardware that tends to be very difficult for computer scientists to grasp: you want to think long and hard about adding extra hardware to the processor. In general, you only want to add it if you're going to use it a LOT. If some feature of your instruction set requires you to add hardware to implement the feature, consider carefully whether the feature is really needed. But once you've added the hardware, using it is free. So it doesn't cost you anything to go ahead and read the registers, whether you need the data or not. So the Instruction Decode and Register Fetch steps can be combined.

Our third observation is that we can actually increment the PC while doing other work. We can send the PC contents to the ALU at the same time we are sending them to the instruction memory, and load the new value into the PC at the same time the instruction is coming out of the instruction memory.

These observations, taken together, lead to pipelining as a low-cost means of improving performance. By identifying the steps in the instruction execution that use different parts of the processor, we can have multiple instructions in execution at a time, in different parts of the processor. The standard pipeline for a Harvard architecture computer uses five stages:

1. Instruction Fetch
2. Instruction decode / register read
3. ALU op
4. Memory op
5. Register writeback

The standard pipeline for a Princeton architecture computer only uses four stages; there is no memory op stage (which causes problems we'll get to in just a moment).
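As a rough back-of-the-envelope sketch of why this wins (my own illustration, not a calculation from the notes): if an instruction takes k cycles unpipelined, a k-stage pipeline with no stalls finishes n instructions in k + (n - 1) cycles instead of n*k, so the speedup approaches k for long instruction streams.

    /* Sketch: ideal speedup of a k-stage pipeline over an unpipelined
       processor, assuming equal stage times and no stalls or hazards. */
    #include <stdio.h>

    int main(void)
    {
        int  k = 5;                      /* pipeline stages (the Harvard pipeline above) */
        long n = 1000;                   /* instructions executed                        */

        long unpipelined = n * k;        /* each instruction takes k cycles              */
        long pipelined   = k + (n - 1);  /* fill the pipe, then one result per cycle     */

        printf("unpipelined: %ld cycles\n", unpipelined);
        printf("pipelined:   %ld cycles\n", pipelined);
        printf("speedup:     %.2f (limit is %d)\n",
               (double)unpipelined / pipelined, k);
        return 0;
    }

For n = 1000 and k = 5 this prints a speedup of about 4.98, just shy of the five-fold limit.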

Pipeline (computing)

(Figure: instruction scheduling on an Intel Pentium 4.)

In computing, a pipeline is a set of data processing elements connected in series, so that the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in time-sliced fashion; in that case, some amount of buffer storage is often inserted between elements. Computer-related pipelines include:

Instruction pipelines, such as the classic RISC pipeline, which are used in processors to allow overlapping execution of multiple instructions with the same circuitry. The circuitry is usually divided up into stages, including instruction decoding, arithmetic, and register fetching stages, wherein each stage processes one instruction at a time.

Graphics pipelines, found in most graphics cards, which consist of multiple arithmetic units, or complete CPUs, that implement the various stages of common rendering operations (perspective projection, window clipping, color and light calculation, rendering, etc.).

Software pipelines, where commands can be written so that the output of one operation is automatically used as the input to the next, following operation. The Unix command pipe is a classic example of this concept, although other operating systems support pipes as well.

Concept and motivation

Pipelining is a natural concept in everyday life, e.g. on an assembly line. Consider the assembly of a car: assume that certain steps in the assembly line are to install the engine, install the hood, and install the wheels (in that order, with arbitrary interstitial steps). A car on the assembly line can have only one of the three steps done at once. After the car has its engine installed, it moves on to having its hood installed, leaving the engine installation facilities available for the next car. The first car then moves on to wheel installation, the second car to hood installation, and a third car begins to have its engine installed. If engine installation takes 20 minutes, hood installation takes 5 minutes, and wheel installation takes 10 minutes, then finishing all three cars when only one car can be assembled at a time would take 105 minutes. On the other hand, using the assembly line, the total time to complete all three is 75 minutes. At this point, additional cars will come off the assembly line at 20 minute increments.

Costs and drawbacks

As the assembly line example shows, pipelining doesn't decrease the time for a single datum to be processed; it only increases the throughput of the system when processing a stream of data. Deep pipelining increases latency -- the time required for a signal to propagate through a full pipe. A pipelined system typically requires more resources (circuit elements, processing units, computer memory, etc.) than one that executes one batch at a time, because its stages cannot reuse the resources of a previous stage. Moreover, pipelining may increase the time it takes for an instruction to finish.

Design considerations

One key aspect of pipeline design is balancing pipeline stages. Using the assembly line example, we could have greater time savings if both the engine and wheel installations took only 15 minutes. Although the system latency would still be 35 minutes, we would be able to output a new car every 15 minutes. In other words, a pipelined process outputs finished items at a rate determined by its slowest part. (Note that if the time taken to add the engine could not be reduced below 20 minutes, it would not make any difference to the stable output rate if all other components increased their production time to 20 minutes.)
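The 105-minute and 75-minute figures follow from a simple rule: sequential time is the number of items times the sum of the stage times, while pipelined time is one full pass through the stages plus one slowest-stage interval for each additional item (assuming adequate buffering between stages). Here's a quick sketch of that arithmetic, using the numbers from the example above:

    /* Sketch: assembly-line timing from the car example above.
       Stages: engine 20 min, hood 5 min, wheels 10 min; 3 cars. */
    #include <stdio.h>

    int main(void)
    {
        int stage[] = { 20, 5, 10 };
        int nstages = 3, cars = 3;
        int sum = 0, slowest = 0;

        for (int i = 0; i < nstages; i++) {
            sum += stage[i];
            if (stage[i] > slowest)
                slowest = stage[i];
        }

        int sequential = cars * sum;                 /* one car at a time          */
        int pipelined  = sum + (cars - 1) * slowest; /* overlap at bottleneck rate */

        printf("sequential: %d minutes\n", sequential);               /* 105 */
        printf("pipelined:  %d minutes\n", pipelined);                /*  75 */
        printf("steady-state output: one car every %d minutes\n", slowest);
        return 0;
    }

The slowest stage (the 20-minute engine installation) is what sets the steady-state output rate, which is exactly the balancing point made above.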
Another design consideration is the provision of adequate buffering between the pipeline stages, especially when the processing times are irregular, or when data items may be created or destroyed along the pipeline.

Implementations

Buffered, Synchronous pipelines

Conventional microprocessors are synchronous circuits that use buffered, synchronous pipelines. In these pipelines, "pipeline registers" are inserted in between pipeline stages, and are clocked synchronously. The time between each clock signal is set to be greater than the longest delay between pipeline stages, so that when the registers are clocked, the data that is written to them is the final result of the previous stage.

Buffered, Asynchronous pipelines

Asynchronous pipelines are used in asynchronous circuits, and have their pipeline registers clocked asynchronously. Generally speaking, they use a request/acknowledge system, wherein each stage can detect when it is "finished". When a stage is finished and the next stage has sent it a "request" signal, the stage sends an "acknowledge" signal to the next stage, and a "request" signal to the previous stage. When a stage receives an "acknowledge" signal, it clocks its input registers, thus reading in the data from the previous stage. The AMULET microprocessor is an example of a microprocessor that uses buffered, asynchronous pipelines.

Unbuffered pipelines

Unbuffered pipelines, called "wave pipelines", do not have registers in between pipeline stages. Instead, the delays in the pipeline are "balanced" so that, for each stage, the difference between the first stabilized output data and the last is minimized. Thus, data flows in "waves" through the pipeline, and each wave is kept as short (synchronous) as possible. The maximum rate that data can be fed into a wave pipeline is determined by the maximum difference in delay between the first piece of data coming out of the pipe and the last piece of data, for any given wave. If data is fed in faster than this, it is possible for waves of data to interfere with each other.
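Returning to the buffered, synchronous case for a small worked example (my own sketch; the stage delays and register overhead are made-up placeholder numbers): the clock period must cover the slowest stage plus the pipeline-register overhead, and that period fixes the maximum clock frequency.

    /* Sketch: minimum clock period of a buffered, synchronous pipeline.
       The clock must be slower than the longest stage delay plus the
       overhead of clocking the pipeline registers. */
    #include <stdio.h>

    int main(void)
    {
        double stage_delay_ns[] = { 0.8, 1.2, 1.0, 0.9, 0.7 };  /* hypothetical stage delays   */
        double register_overhead_ns = 0.1;                      /* hypothetical latch overhead */
        int nstages = 5;

        double slowest = 0.0;
        for (int i = 0; i < nstages; i++)
            if (stage_delay_ns[i] > slowest)
                slowest = stage_delay_ns[i];

        double period = slowest + register_overhead_ns;   /* ns per clock */
        printf("minimum clock period: %.2f ns\n", period);
        printf("maximum frequency:    %.2f GHz\n", 1.0 / period);
        return 0;
    }

This is also why splitting a slow stage into two shorter ones (a deeper pipeline) allows a faster clock, a point that comes up again below.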

Instruction pipeline

(Figure: basic five-stage pipeline in a RISC machine -- IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back. In the fourth clock cycle (the green column), the earliest instruction is in the MEM stage, and the latest instruction has not yet entered the pipeline.)

An instruction pipeline is a technique used in the design of computers and other digital electronic devices to increase their instruction throughput (the number of instructions that can be executed in a unit of time). The fundamental idea is to split the processing of a computer instruction into a series of independent steps, with storage at the end of each step. This allows the computer's control circuitry to issue instructions at the processing rate of the slowest step, which is much faster than the time needed to perform all steps at once. The term pipeline refers to the fact that each step is carrying data at once (like water), and each step is connected to the next (like the links of a pipe).

The origin of pipelining is thought to be either the ILLIAC II project or the IBM Stretch project, though a simple version was used earlier in the Z1 in 1939 and the Z3 in 1941. The IBM Stretch project proposed the terms "Fetch, Decode, and Execute" that became common usage.

Most modern CPUs are driven by a clock. The CPU consists internally of logic and registers (flip-flops). When the clock signal arrives, the flip-flops take their new values and the logic then requires a period of time to decode the new values. Then the next clock pulse arrives and the flip-flops again take their new values, and so on. By breaking the logic into smaller pieces and inserting flip-flops between the pieces of logic, the delay before the logic gives valid outputs is reduced. In this way the clock period can be reduced. For example, the classic RISC pipeline is broken into five stages with a set of flip-flops between each pair of stages:

1. Instruction fetch
2. Instruction decode and register fetch
3. Execute
4. Memory access
5. Register write back

When a programmer (or compiler) writes assembly code, they make the assumption that each instruction is executed before execution of the subsequent instruction begins. This assumption is invalidated by pipelining. When this causes a program to behave incorrectly, the situation is known as a hazard. Various techniques for resolving hazards, such as forwarding and stalling, exist.

A non-pipelined architecture is inefficient because some CPU components (modules) are idle while another module is active during the instruction cycle. Pipelining does not completely cancel out idle time in a CPU, but making those modules work in parallel improves program execution significantly. Processors with pipelining are organized internally into stages which can semi-independently work on separate jobs. Each stage is organized and linked into a 'chain' so each stage's output is fed to another stage until the job is done. This organization of the processor allows overall processing time to be significantly reduced.

A deeper pipeline means that there are more stages in the pipeline, and therefore fewer logic gates in each stage. This generally means that the processor's frequency can be increased as the cycle time is lowered, because fewer components in each stage means a shorter propagation delay per stage. Unfortunately, not all instructions are independent. In a simple pipeline, completing an instruction may require 5 stages.
To operate at full performance, this pipeline will need to run four subsequent independent instructions while the first is completing. If four instructions that do not depend on the output of the first instruction are not available, the pipeline control logic must insert a stall or wasted clock cycle into the pipeline until the dependency is resolved. Fortunately, techniques such as forwarding can significantly reduce the cases where stalling is required. While pipelining can in theory increase performance over an unpipelined core by a factor of the number of stages (assuming the clock frequency also scales with the number of stages), in reality most code does not allow for ideal execution.

Advantages and disadvantages

Pipelining does not help in all cases. There are several possible disadvantages. An instruction pipeline is said to be fully pipelined if it can accept a new instruction every clock cycle. A pipeline that is not fully pipelined has wait cycles that delay the progress of the pipeline.

Advantages of Pipelining:

1. The cycle time of the processor is reduced, thus increasing the instruction issue rate in most cases.
2. Some combinational circuits such as adders or multipliers can be made faster by adding more circuitry. If pipelining is used instead, it can save circuitry vs. a more complex combinational circuit.

Disadvantages of Pipelining:

1. A non-pipelined processor executes only a single instruction at a time. This prevents branch delays (in effect, every branch is delayed) and problems with serial instructions being executed concurrently. Consequently the design is simpler and cheaper to manufacture.
2. The instruction latency in a non-pipelined processor is slightly lower than in a pipelined equivalent. This is because extra flip-flops must be added to the data path of a pipelined processor.
3. A non-pipelined processor will have a stable instruction bandwidth. The performance of a pipelined processor is much harder to predict and may vary more widely between different programs.

Examples

Generic pipeline

(Figure: generic 4-stage pipeline; the colored boxes represent instructions independent of each other.)

The figure shows a generic pipeline with four stages:

1. Fetch
2. Decode
3. Execute
4. Writeback (for lw and sw, memory is accessed after the execute stage)

The top gray box is the list of instructions waiting to be executed; the bottom gray box is the list of instructions that have been completed; and the middle white box is the pipeline. Execution proceeds as follows:

Time 0: Four instructions are awaiting execution.
Time 1: The green instruction is fetched from memory.
Time 2: The green instruction is decoded; the purple instruction is fetched from memory.
Time 3: The green instruction is executed (the actual operation is performed); the purple instruction is decoded; the blue instruction is fetched.
Time 4: The green instruction's results are written back to the register file or memory; the purple instruction is executed; the blue instruction is decoded; the red instruction is fetched.
Time 5: The green instruction is completed; the purple instruction is written back; the blue instruction is executed; the red instruction is decoded.
Time 6: The purple instruction is completed; the blue instruction is written back; the red instruction is executed.
Time 7: The blue instruction is completed; the red instruction is written back.
Time 8: The red instruction is completed.
Time 9: All instructions are executed.
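The in-pipeline part of that schedule can be generated mechanically: instruction i simply trails instruction i-1 by one cycle, occupying stage (cycle - 1 - i) each clock. The sketch below (my own, including the output format) prints the stage activity for times 1 through 7; the "completed" rows in the table above just record an instruction leaving the pipe the cycle after its writeback.

    /* Sketch: cycle-by-cycle schedule of 4 independent instructions
       flowing through a 4-stage pipeline (Fetch, Decode, Execute, Writeback). */
    #include <stdio.h>

    int main(void)
    {
        const char *instr[] = { "green", "purple", "blue", "red" };
        const char *stage[] = { "fetched", "decoded", "executed", "written back" };
        int ninstr = 4, nstages = 4;

        for (int cycle = 1; cycle <= ninstr + nstages - 1; cycle++) {
            printf("Time %d:", cycle);
            for (int i = 0; i < ninstr; i++) {
                int s = cycle - 1 - i;          /* stage index for instruction i */
                if (s >= 0 && s < nstages)
                    printf("  %s %s;", instr[i], stage[s]);
            }
            printf("\n");
        }
        return 0;
    }

With 4 instructions and 4 stages the last writeback happens at time 7, which is the 7-clock-tick figure used in the bubble discussion below.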

Bubble

(Figure: a bubble in cycle 3 delays execution.)

When a "hiccup" in execution occurs, a "bubble" is created in the pipeline in which nothing useful happens. In cycle 2, the fetching of the purple instruction is delayed, and the decoding stage in cycle 3 now contains a bubble. Everything "behind" the purple instruction is delayed as well, but everything "ahead" of the purple instruction continues with execution. Clearly, when compared to the execution above, the bubble yields a total execution time of 8 clock ticks instead of 7. A bubble is like a stall in which nothing useful happens in the fetch, decode, execute, and writeback stages; it can be implemented with a NOP (no-operation) instruction.

Example 1

A typical instruction to add two numbers might be ADD A, B, C, which adds the values found in memory locations A and B, and then puts the result in memory location C. In a pipelined processor the pipeline controller would break this into a series of tasks similar to:

LOAD R1, A
LOAD R2, B
ADD R3, R1, R2
STORE C, R3
LOAD next instruction

The locations 'R1', 'R2' and 'R3' are registers in the CPU. The values stored in the memory locations labeled 'A' and 'B' are loaded (copied) into the R1 and R2 registers, then added, and the result (which is in register R3) is stored in a memory location labeled 'C'. In this example the pipeline is three stages long: load, execute, and store. Each of the steps is called a pipeline stage.

On a non-pipelined processor, only one stage can be working at a time, so the entire instruction has to complete before the next instruction can begin. On a pipelined processor, all of the stages can be working at once on different instructions. So when this instruction is at the execute stage, a second instruction will be at the decode stage and a third instruction will be at the fetch stage.

Pipelining doesn't reduce the time it takes to complete an instruction; it increases the number of instructions that can be processed at once and reduces the delay between completed instructions. The more pipeline stages a processor has, the more instructions it can be working on at once and the less of a delay there is between completed instructions. Most microprocessors manufactured today use at least two stages of pipeline. (The Atmel AVR and the PIC microcontroller each have a 2-stage pipeline.) Intel Pentium 4 processors have 20-stage pipelines.

Example 2

To better visualize the concept, we can look at a theoretical 3-stage pipeline:

Stage    Description
Load     Read instruction from memory
Execute  Execute instruction
Store    Store result in memory and/or registers

and a pseudo-code assembly listing to be executed:

LOAD A, #40    ; load 40 into A
MOVE B, A      ; copy A into B
ADD B, #20     ; add 20 to B
STORE 0x300, B ; store B into memory cell 0x300

This is how it would be executed:

Clock 1:
  Load: LOAD | Execute: - | Store: -
The LOAD instruction is fetched from memory.

Clock 2:
  Load: MOVE | Execute: LOAD | Store: -
The LOAD instruction is executed, while the MOVE instruction is fetched from memory.

Clock 3:
  Load: ADD | Execute: MOVE | Store: LOAD
The LOAD instruction is in the Store stage, where its result (the number 40) will be stored in the register A. In the meantime, the MOVE instruction is being executed. Since it must move the contents of A into B, it must wait for the LOAD instruction to finish.

Clock 4:
  Load: STORE | Execute: ADD | Store: MOVE
The STORE instruction is loaded, while the MOVE instruction is finishing off and the ADD is calculating.

And so on. Note that, sometimes, an instruction will depend on the result of another one (like our MOVE example). When more than one instruction references a particular location for an operand, either reading it (as an input) or writing it (as an output), executing those instructions in an order different from the original program order can lead to hazards (mentioned above). There are several established techniques for either preventing hazards from occurring, or working around them if they do.

Complications

Many designs include pipelines as long as 7, 10 and even 20 stages (as in the Intel Pentium 4). The later "Prescott" and "Cedar Mill" Pentium 4 cores (and their Pentium D derivatives) had a 31-stage pipeline, the longest in mainstream consumer computing. The Xelerated X10q Network Processor has a pipeline more than a thousand stages long. The downside of a long pipeline is that when a program branches, the processor cannot know where to fetch the next instruction from and must wait until the branch instruction finishes, leaving the pipeline behind it empty. In the extreme case, the performance of a pipelined processor could theoretically approach that of an un-pipelined processor, or even be slightly worse if all but one of the pipeline stages are idle and a small overhead is present between stages.

Branch prediction attempts to alleviate this problem by guessing whether the branch will be taken or not and speculatively executing the code path that it predicts will be taken. When its predictions are correct, branch prediction avoids the penalty associated with branching. However, branch prediction itself can end up exacerbating the problem if branches are predicted poorly, as the incorrect code path which has begun execution must be flushed from the pipeline before execution resumes at the correct location.

In certain applications, such as supercomputing, programs are specially written to branch rarely, and so very long pipelines can speed up computation by reducing cycle time. If branching happens constantly, reordering branches such that the instructions more likely to be needed are placed into the pipeline can significantly reduce the speed losses associated with having to flush failed branches. Programs such as gcov can be used to examine how often particular branches are actually executed, using a technique known as coverage analysis; however, such analysis is often a last resort for optimization.

Self-modifying programs: because of the instruction pipeline, code that the processor loads will not immediately execute. Due to this, updates to the code very near the current location of execution may not take effect, because they are already loaded into the Prefetch Input Queue. Instruction caches make this phenomenon even worse. This is only relevant to self-modifying programs.
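To put a rough number on the branch-penalty argument (my own back-of-the-envelope sketch; the frequencies and the penalty are illustrative assumptions, not measurements): the average cycles per instruction grow by the branch frequency times the misprediction rate times the flush penalty, and the flush penalty grows with pipeline depth.

    /* Sketch: effect of branch mispredictions on average CPI.
       All numbers here are illustrative assumptions. */
    #include <stdio.h>

    int main(void)
    {
        double base_cpi      = 1.0;   /* ideal pipelined CPI                     */
        double branch_freq   = 0.20;  /* fraction of instructions that branch    */
        double mispredict    = 0.10;  /* fraction of branches predicted wrongly  */
        int    flush_penalty = 20;    /* cycles lost per flush (~pipeline depth) */

        double cpi = base_cpi + branch_freq * mispredict * flush_penalty;
        printf("effective CPI: %.2f (vs. ideal %.2f)\n", cpi, base_cpi);
        printf("slowdown:      %.0f%%\n", (cpi / base_cpi - 1.0) * 100.0);
        return 0;
    }

With these particular numbers the effective CPI is 1.4, a 40% slowdown -- and doubling the pipeline depth roughly doubles the penalty term, which is why very deep pipelines lean so heavily on good branch prediction.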
Mathematical pipelines: mathematical or arithmetic pipelines are different from instruction pipelines, in that when mathematically processing large arrays or vectors, a particular mathematical operation, such as a multiply, is repeated many thousands of times. In this environment, an instruction need only kick off an event whereby the arithmetic logic unit (which is itself pipelined) takes over and begins its series of calculations. Most of these circuits can be found today in math processors and in the math processing sections of CPUs like the Intel Pentium line.
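A minimal sketch of the kind of workload an arithmetic pipeline is built for (the function and array names are my own illustration): each loop iteration kicks off one multiply, and a pipelined multiplier can accept a new pair of operands every cycle even though each individual multiply takes several cycles to complete.

    /* Sketch: the repeated-multiply workload that arithmetic pipelines target.
       A pipelined floating-point multiplier overlaps the multi-cycle latency
       of each multiply by starting a new one every cycle. */
    #include <stddef.h>

    void vector_scale(double *dst, const double *src, double k, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i] * k;   /* thousands of independent multiplies */
    }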
