
Predicting Cache Needs and Cache Sensitivity for Applications in Cloud Computing on CMP Servers with Configurable Caches

Jacob Machina
School of Computer Science, University of Windsor, Windsor, Canada
machina@uwindsor.ca
Abstract: QoS criteria in cloud computing require guarantees about application runtimes, even if CMP servers are shared among multiple parallel or serial applications. The performance of computation-intensive applications depends significantly on memory performance, and especially on cache performance. Recent trends are toward configurable caches that can dynamically partition the cache among cores. Proper cache partitioning should then consider the applications' different cache needs and their sensitivity to insufficient cache space. We present a simple, yet effective and therefore practically feasible, black-box model that describes application performance as a function of allocated cache size and needs only three descriptive parameters. These parameters can therefore be learned from very few sample points. We demonstrate with the SPEC benchmarks that the model adequately describes application behavior and that curve fitting achieves very high accuracy, with a mean relative error of 2.8% and a maximum relative error of 17%.

Keywords: cloud computing; configurable caches; multi-core CPUs; CMPs; performance modelling; QoS; SPEC benchmarks

Angela Sodan
School of Computer Science, University of Windsor, Windsor, Canada
acsodan@uwindsor.ca

I. INTRODUCTION

Cloud computing with computation-intensive parallel or serial applications, shared servers, and QoS guarantees requires that the effects of varying resource contention on application runtime can be well predicted. With proper prediction, the resource allocation can be correspondingly controlled. Since multi-core processors (CMPs) are becoming the dominant CPU design and application runtime depends to a large extent on cache performance, caches are an important resource to allocate properly among applications. Most current CMPs either share the L2 cache (beneficial if the threads running on the different cores share data) or statically partition the L2 cache (private caches avoid cache contention, which can lead to poor and hard-to-predict performance) [13]. Recent trends are toward dynamically partitionable / configurable caches [1][8][15], with a first simplified approach provided by the Intel Pentium Dual Core [18]. Configurable CPUs typically provide multiple (also partitionable) memory controllers, which prevent contention for memory access as long as sufficient memory bandwidth is available.

Configurable CPUs are the most promising trend as they support performance isolation, i.e. predictable quality of service, which will become especially important with the trend to exploit multiple cores for running multiple virtual machines. However, different applications have different cache needs (cache working sets), and their performance degrades to a different extent if not allocated sufficient cache space, i.e. they are differently sensitive to insufficient space [3]. Thus, to make optimal decisions about how to configure the cache among multiple applications, it is important to know how the performance of an application depends on the allocated cache size. For example, with equal partitioning, Application A may run 1.5 times slower and Application B 1.2 times slower, though both A and B may be slowed down by a factor of 1.3 if 65% of the cache is allocated to A and 35% to B. Under QoS criteria, Application A may need at least 70% of the cache to experience a maximum slowdown of 1.2. A proper performance model can support policies for system-wide cache partitioning as described in [4][8][15] under different optimization criteria such as QoS, throughput, aggregate IPC, or inter-process fairness.

Compiler tools that would analyze the program and predict runtimes under different cache allocations are extremely complex and not practically feasible. Thus, our goal was to develop a model which (a) is black-box, so as to be practically applicable, and (b) uses few parameters, which permits learning with only few samples. As the main contribution of this paper, we present a simple model that has a semantic meaning, captures application behavior and CPU architecture, and is black-box with only a few descriptive parameters. The model describes application performance as a weighted composition of the performance of the application's constituent access patterns. The three application-dependent parameters can easily be learned (by curve fitting) from very few samples, i.e. different cache-share allocations.

In a practical setting, such test points can be obtained for all frequently running or longer running programs by changing the cache allocation and collecting the measurements. We demonstrate with synthetic and SPEC benchmarks that the model adequately captures the typical behavior of applications and that it can accurately predict the overall performance curve, with an average relative error of 2.8% and a maximum relative error of 17%. Subsequently, we can predict (a) at which cache allocation the working set of an application exceeds the available cache, and (b) how sensitive the application is to insufficient space.

In the following, we first discuss related work in Section II, and memory and application architecture in Section III. Section IV motivates the model, and Section V describes it. Section VI pertains to curve fitting, with results for the SPEC benchmarks. Section VII finalizes the paper with a summary.

II. RELATED WORK

Previous research in predicting application performance focused on simulation-based prediction [5][12][16]. The instrumentation overhead imposed to collect data from executing code and the complex hardware modeling required are a disincentive to applying this approach to working machines for scheduling purposes. Though simulation can generate excellent predictions, it requires a large number of computations for each prediction, making it suited primarily to system-evaluation and code-optimization tasks. Simple memory models as in [2] are more efficient than full system models while remaining effective at predicting runtime, but they require program traces and substantial offline processing. Research in the direction of dynamic scheduling for multithreaded processors is presented in [6], with the model relying only on hardware counters. This model was generated without the benefit of any application or architectural (domain) knowledge, but rather as a weighted sum of all possible cross terms of the hardware counters. As it describes only the system used to generate it and the applications tested, it is not suited to making decisions under varying memory hierarchies and different application mixes.

A common finding among recent research is the dominance of memory-access time for the performance of sequential code and its subsequent relevance in performance prediction [2][5][6][7][16]. Analysis of hardware counters in [6] found contention for the cache to have four times the impact of contention on execution units in SMT processors. Though bandwidth between the L2 cache and main memory was shown to be a potential bottleneck when executing synthetic benchmarks, the only real application which showed significant bus usage was the ART benchmark from SPEC2000 [8]. The majority of research on memory contention considered contention for cache space to be the main concern.

III. MEMORY AND APPLICATION ARCHITECTURE

In modern processor designs, the bottleneck for the majority of computation is not the availability of execution units but rather the memory hierarchy's ability to supply the pipelines and resolve data dependencies. For computational code, i.e. code with relatively little disk or network access, the total time spent accessing the memory hierarchy is very close to the total running time, due to the high percentage of memory accesses in typical programs and the increasing gap between processor and memory speeds. Of the many levels of memory in a modern computer, the L1 caches are generally dedicated per core and thus cannot be a source of inter-core contention, and main memory is cheap and plentiful enough that contention over main-memory space is low in most cases. Though current multicore processors have either a shared L2 cache or a statically partitioned cache, recent research proposes dynamically partitionable L2 caches as the most promising direction [1]. Dynamically partitionable caches permit changing the size of cache available to each core of the processor on the fly. This can be accomplished by reserving a fraction of each associative set in the cache per core, giving n+1 possible partition sizes for an n-way associative cache. Processors featuring a dynamically partitionable cache would have the benefits of both shared and dedicated caches, including support for performance isolation.

A. Application Memory-Access Patterns

The time spent accessing memory for an application executing on a given machine depends on both the size and the latency of each level of the memory hierarchy, as well as on what proportion of the memory accesses fall to each level of the hierarchy [9]. The latency of accessing each level of the cache depends on the nature of the access being performed [2]. That research showed it was sufficient to classify memory accesses along two dimensions: either read or write, and either sequential, strided, or random access. Sequential accesses are those to successive elements within a cache line, strided accesses are to elements where successive accesses have a common memory offset, and random accesses have no predictable pattern (a small illustrative sketch follows at the end of this subsection). Sequential accesses can benefit from both spatial locality and temporal locality in the cache, with temporal locality becoming relatively more important for strided accesses because fewer accesses fall onto the same cache line. Random accesses benefit principally from temporal locality. However, due to the regular nature of the accesses, both sequential and strided accesses can be prefetched on most current processors, reducing the latency of repeated cache misses.

Temporal locality means that applications tend to confine memory accesses to certain subsets of the overall data in certain time periods. This subset of the data is called the working set [11]. Cache misses can be kept low if the current working set can be kept in the cache, as investigated in [7] for the SPEC benchmarks. If the working set does not entirely fit into the cache, the benefit of temporal locality for sequential or strided access depends on the fraction of the working set fitting into the cache, as accesses to uncached elements will overwrite cached elements.
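As a concrete illustration (ours, not part of the original paper), the following C sketch shows the three access patterns over one buffer. The buffer size, element type, and function names are arbitrary assumptions, and the random-chase buffer is assumed to be pre-initialized with a pseudo-random permutation cycle.

#include <stddef.h>
#include <stdint.h>

#define N (1u << 22)             /* 4M elements (32 MB); size is an arbitrary choice */

static uint64_t data[N];

/* Sequential: successive elements, maximal spatial locality. */
uint64_t scan_sequential(void) {
    uint64_t sum = 0;
    for (size_t i = 0; i < N; i++)
        sum += data[i];
    return sum;
}

/* Strided: a constant offset between successive accesses; fewer hits per
   cache line, but still regular enough for the prefetcher. */
uint64_t scan_strided(size_t stride) {
    uint64_t sum = 0;
    for (size_t start = 0; start < stride; start++)
        for (size_t i = start; i < N; i += stride)
            sum += data[i];
    return sum;
}

/* Random: pointer chasing through a pseudo-random cycle stored in the
   buffer itself; no pattern a prefetcher can exploit, temporal locality only. */
uint64_t chase_random(size_t steps) {
    uint64_t idx = 0;            /* data[] must hold a random cycle over 0..N-1 */
    for (size_t k = 0; k < steps; k++)
        idx = data[idx];
    return idx;
}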

IV. MOTIVATION FOR THE MODEL

As mentioned in the introduction, we aim to create a simple black-box model with relatively few parameters which can be used to make predictions regarding application runtime under varying cache allocations. As the model is meant to aid scheduler decisions, it must be inexpensive to obtain supporting data, to fit the model, and to calculate the expected runtime of an application on a machine once the model has been created. This implies that very few sample points with different cache allocations should be necessary, and that we want to rely entirely on wall-clock time measurements rather than code instrumentation. To make such a model possible, it needs to have semantic meaning, i.e. capture the most essential architecture and application features. For our model and the presented experiments, we use slowdown as a normalized measure of performance, defined as

Sl = T_t(S_c) / T_t(S_f)          (1)

where T_t(S_f) is the runtime with full cache allocation S_f, and T_t(S_c) is the runtime with partial cache allocation S_c. All predictions made by the model are slowdown predictions, which, given T_t(S_f), can be converted to runtime predictions.

A. Experimental Setup

To motivate and verify our model, we use the widely applied SPEC CPU2000 floating point and integer benchmark suite [14]. The suite comprises a collection of kernel benchmarks which approximate the typical workload of a server. The data size used for all of our tests is the larger, reference data set. The individual benchmarks in the SPEC suite are varied in their temporal locality, spatial locality, and data intensiveness, as classified in [7]. Our primary data set is drawn from experiments on a PC featuring an Intel Core2 Duo CPU with a 4MB, 16-way associative L2 cache and 3GB of main memory, running Ubuntu 8.04 LTS (Linux). Some initial testing was conducted on an Intel Core2 Duo CPU with a 2MB, 8-way shared L2 cache and 2GB of main memory, running Ubuntu 7.10 (Linux).

Given that the cache of the Intel Core2 Duo is not partitionable but shared, we employed the complement testing approach developed in our earlier work [19]. The approach uses a synthetic test program which occupies a specifiable amount of the cache by accessing an appropriate volume of data in a stride-1 pattern, frequently refreshing its ownership of each cache line. Considering that the Intel Core2 Duo uses LRU as its cache-replacement policy, this makes it likely that the tool's data remains in the cache. The test tool therefore restricts the amount of cache available to the application under test. The tool was started before the application under test and ran beyond the application's runtime; the priority of both processes was also raised to the highest level to reduce the impact of potential background load. A sketch of such a complement tool is given below.
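The following minimal C sketch of a cache-occupying complement tool is our reconstruction, not the authors' code; the 64-byte cache line, the buffer handling, and the command-line interface are assumptions.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Usage: ./occupy <kilobytes>  -- run alongside the application under test. */
int main(int argc, char **argv) {
    size_t kb = (argc > 1) ? strtoul(argv[1], NULL, 10) : 1024;
    size_t n = kb * 1024;
    volatile char *buf = malloc(n);
    if (!buf) return 1;
    memset((void *)buf, 1, n);                  /* initial fill of the buffer */
    for (;;)                                    /* run until killed */
        for (size_t i = 0; i < n; i += 64)      /* one touch per 64-byte line */
            buf[i]++;                           /* refresh LRU ownership */
}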

Figure 1. Slowdown for SPEC benchmarks under varying cache allocation. Labels appear in the same order as the corresponding lines at right. Dotted lines have no semantic meaning and are used to enhance legibility.

B. Cache-Related Performance of SPEC Benchmarks

We have run the SPEC benchmarks with varying cache allocations to obtain insight into the benchmarks' cache-related performance. The results, calculated as slowdowns vs. full-cache allocation, are shown in Figure 1. Given the 16-way associativity of the cache on our primary test machine, we collected 17 data points for each benchmark in the set, from full cache access to no cache access. We can easily identify two classes of slowdown curves: the first class are those curves which exhibit Sl ≥ 1.4 with no cache allocation (termed high-variance curves), and the second class are those curves with Sl < 1.4 for no cache allocation (termed low-variance curves). From the individual curves in Figure 4 and Figure 5, we observe a qualitative difference between the two classes. Accordingly, we classify MCF, ART, EQUAKE, VPR, GALGEL, BZIP2, APPLU, LUCAS, FACEREC, SWIM, AMMP, PARSER, VORTEX, WUPWISE, GAP, FMA3D, MGRID, and TWOLF as high-variance, and GCC, APSI, PERLBMK, CRAFTY, MESA, GZIP, EON, and SIXTRACK as low-variance.

V. MODEL DERIVATION

To validate our hypothesis that sequential and random accesses mainly contribute to the observed slowdown curves, we created synthetic benchmarks. The first synthetic benchmark accesses data sequentially with a given stride, reading all elements a given number of times. The second synthetic benchmark performs pointer chasing on a pseudo-random linked list, with the linked list accessed repeatedly. Runtimes were measured at varying cache allocations on the machine with an 8-way associative, 2MB cache (resulting in 9 data points). The results in Figure 2a) and b) show an approximately linear curve for random access and a curve with two flat sections and one transition section for strided access. We hypothesize that the shapes of the slowdown curves of the SPEC benchmarks are due to a summation of the performance degradation of their sequential and random access portions. Thus, we created a third synthetic benchmark as the composition of the former two benchmarks. The results are shown in Figure 2c). Indeed, the resulting curves closely resemble the shape of the higher-variance SPEC benchmark curves. Therefore, we define a corresponding model in the following section.

Figure 2. Two example slowdown curves for each synthetic benchmark under varying parameters: a) top: random access pattern; b) middle: sequential access pattern; c) bottom: mixed access pattern.

A. The Composite Cache Model

To model the behavior observed with our synthetic benchmarks, we assume that the time to randomly access a number of memory locations is inversely proportional to the fraction of the set which resides in the cache. If the randomly accessed working set fits entirely into the cache, we calculate the access time as the number of elements accessed multiplied by the L2 access time. If only part of the set fits into the cache, that fraction is accessed at the L2 rate, while the rest is accessed at the higher main-memory rate. We simplify the model by ignoring the fill phase of the cache, when no elements would yet be cached. Our model predicts the time for random memory access as follows (for the variable meanings, see TABLE I):

T_r(S_c) = (S_r - S_c) * N_mr + S_c * N_l2     if S_r > S_c
T_r(S_c) = S_r * N_l2                          otherwise          (2)

TABLE I. VARIABLE DEFINITIONS

T_r(S_c)   Time for random access, given S_c
T_s(S_c)   Time for sequential access, given S_c
S_r        Size of the random-access working set
S_s        Size of the sequential-access working set
S_c        Size of the current cache allocation
N_mr       Access time (cycles) for an L2 miss on a random element (85 cycles)
N_ms       Access time (cycles) for an L2 miss on a sequential element (31 cycles)
N_l2       Access time (cycles) for an L2 hit (14 cycles)
A          Cache size / cache associativity
F_r        Fraction of memory accesses to the random block

As mentioned in Section III.A, sequential and strided accesses can benefit from spatial locality if the stride size and the data-element sizes are relatively small, but they principally benefit from temporal locality. We do not model spatial locality, as adding another parameter to describe the benefit derived from accessing elements within the same cache line would increase the complexity of the model significantly. More importantly, the curves and predictions which can be generated by such a model are identical to those which can be generated by the model without it, as discussed later.

To calculate the time for sequential access: for small working sets which fit entirely into the cache, the whole block is accessed at L2 rates, and for large working sets which force replacement of their own cache lines, the entire block is accessed at the streaming main-memory rate. For working sets which only partially replace themselves, we calculate the fraction remaining in the cache between sequential accesses; the portion remaining in the cache is accessed at L2 rates, with the remainder accessed at streaming main-memory rates. Thus, our model predicts the time for sequential and strided memory access as follows:

T_s(S_c) = S_s * N_ms                                            if S_s > S_c + A
T_s(S_c) = S_s * ((S_s - S_c) / A) * N_ms
         + S_s * (1 - (S_s - S_c) / A) * N_l2                    if S_s > S_c
T_s(S_c) = S_s * N_l2                                            otherwise          (3)

We create the composite model for the application behavior as a whole by taking a weighted sum of Functions (2) and (3) and introducing a third parameter describing the relative frequency of accesses to the random-access block. In this way, we obtain:

T_t(S_c) = F_r * T_r(S_c) + (1 - F_r) * T_s(S_c)          (4)

In (2) to (4), the parameters N_l2, N_ms, N_mr, and A are machine-dependent and can be measured by any memory-benchmarking tool. We used RightMark [10] to find average access times for each level of the memory hierarchy, with the values for our test machine indicated in TABLE I. The parameter S_c is machine-dependent and configurable, representing the size of the cache partition accessible by a core. The parameters S_r, S_s, and F_r need to be learned by supplying sample points as input to the model and using a model-fitting / machine-learning approach.

To simulate the reduced slowdown due to spatial locality on sequential and strided accesses, we may adjust the parameter F_r. For lower values of F_r, we see less slowdown after the transition phase. This also requires a change in the parameter S_r, so that it no longer acts as a reasonable predictor of the size of the randomly accessed working set. This is an acceptable simplification, as we aim to produce slowdown estimations and not an analysis of the working-set sizes of applications, and it supports our goal of reducing the number of parameters to be learned.
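To make the model concrete, the following C transcription of Equations (1) to (4) may serve as a reference sketch. It is ours, not the authors' code; the cycle constants are the TABLE I values for our primary test machine, and all sizes (S_r, S_s, S_c, A) are assumed to be expressed in the same unit, e.g. cache lines.

/* Machine constants from TABLE I (measured for our primary test machine). */
#define N_L2 14.0     /* cycles per L2 hit */
#define N_MR 85.0     /* cycles per L2 miss on a random element */
#define N_MS 31.0     /* cycles per L2 miss on a sequential element */

/* Equation (2): time for random access under cache allocation s_c. */
double t_random(double s_r, double s_c) {
    if (s_r > s_c)
        return (s_r - s_c) * N_MR + s_c * N_L2;
    return s_r * N_L2;                 /* whole working set cached */
}

/* Equation (3): time for sequential/strided access; a = cache size / associativity. */
double t_seq(double s_s, double s_c, double a) {
    if (s_s > s_c + a)
        return s_s * N_MS;             /* full self-replacement: streaming rate */
    if (s_s > s_c)                     /* partial self-replacement */
        return s_s * ((s_s - s_c) / a) * N_MS
             + s_s * (1.0 - (s_s - s_c) / a) * N_L2;
    return s_s * N_L2;                 /* fits entirely into the cache */
}

/* Equation (4): weighted composition of the random and sequential blocks. */
double t_total(double f_r, double s_r, double s_s, double s_c, double a) {
    return f_r * t_random(s_r, s_c) + (1.0 - f_r) * t_seq(s_s, s_c, a);
}

/* Equation (1): slowdown of allocation s_c relative to full allocation s_f. */
double slowdown(double f_r, double s_r, double s_s,
                double s_c, double s_f, double a) {
    return t_total(f_r, s_r, s_s, s_c, a) / t_total(f_r, s_r, s_s, s_f, a);
}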

VI. PARAMETER-LEARNING FOR THE CACHE MODEL

Given that there are three parameters to be fitted, normally four sample points are needed to perform curve fitting. Informing the fitting method with more sample points is likely to increase accuracy. However, in some cases we can fit the model with fewer points: if the observed range spans most of the cache and shows very little difference between the sample points, the results of four-point fitting can be approximated by using only two sample points and performing linear interpolation.

A. Machine Learning

In the machine-learning phase, the application-specific parameters are fitted to the data. For our model, there are three parameters which need to be learned. We used the GAlib [17] genetic-algorithm library to find parameter sets for which the sum of squared errors was minimized. The parameter F_r was constrained to lie within the range (0,1), while S_s and S_r were constrained to be positive. A sketch of this fitting step follows.
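As an illustration of this step, here is a hedged stand-in in C: where the paper uses GAlib's genetic algorithm, the sketch below uses plain random search under the same constraints, minimizing the sum of squared errors. It reuses slowdown() from the model sketch in Section V, and the parameter search ranges (up to 4x the full cache for the working-set sizes) are our assumption.

#include <stdlib.h>

typedef struct { double s_c; double sl; } sample_t;   /* allocation, measured slowdown */

/* Sum of squared errors of the model against the measured sample points. */
static double sse(double f_r, double s_r, double s_s, double s_f, double a,
                  const sample_t *pts, int n) {
    double err = 0.0;
    for (int i = 0; i < n; i++) {
        double d = slowdown(f_r, s_r, s_s, pts[i].s_c, s_f, a) - pts[i].sl;
        err += d * d;
    }
    return err;
}

/* Random search over the constrained parameter space; a GA (as in the paper)
   or a proper least-squares routine would converge faster. */
void fit(const sample_t *pts, int n, double s_f, double a,
         double *f_r, double *s_r, double *s_s) {
    double best = -1.0;
    for (int trial = 0; trial < 100000; trial++) {
        double fr = (rand() + 1.0) / (RAND_MAX + 2.0);          /* strictly in (0,1) */
        double sr = ((double)rand() / RAND_MAX) * 4.0 * s_f;    /* nonnegative */
        double ss = ((double)rand() / RAND_MAX) * 4.0 * s_f;
        double e = sse(fr, sr, ss, s_f, a, pts, n);
        if (best < 0.0 || e < best) {
            best = e; *f_r = fr; *s_r = sr; *s_s = ss;
        }
    }
}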
B. Automated Sample-Point Selection

When using the model to make slowdown predictions on a working system, we cannot hand-select which sample points to test for any given application, both because we will not know the shape of the slowdown curve until after we collect the sample points and because an automatic approach is required for any large-scale system. However, we assume that we can select the cache allocation for a few runs of the application. We therefore need an algorithm to intelligently select which sample points to test, while minimizing the uncertainty of the fitted model.

if (Sample_Points.Length == 0)
    return Max_Cache
if (Sample_Points.Length == 1)
    if (Sample_Points[0].Cache_Alloc > (Min_Cache + Max_Cache)/2)
        return Min_Cache
    else
        return Max_Cache
else
    max_slope = -1
    for (i = 0 .. Sample_Points.Length-2)
        s = (Sample_Points[i+1].Slowdown - Sample_Points[i].Slowdown) /
            (Sample_Points[i+1].Cache_Alloc - Sample_Points[i].Cache_Alloc)
        d = Sample_Points[i+1].Cache_Alloc - Sample_Points[i].Cache_Alloc
        if (s > max_slope && d > 2*cache_gran)
            max_slope = s
            best_target = (Sample_Points[i+1].Cache_Alloc +
                           Sample_Points[i].Cache_Alloc) / 2
    return best_target

Figure 3. Algorithm for selecting the next sample point. Sample_Points is a sorted list of already chosen sample points, Min_Cache and Max_Cache describe the limits of partition sizes, and cache_gran is the granularity of cache partitioning.
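For completeness, a runnable C rendering of the Figure 3 pseudocode follows (our transcription, not the authors' code). We compare slope magnitudes, reading "steepest" as absolute slope, since slowdown decreases as the allocation grows.

#include <math.h>

typedef struct { double cache_alloc; double slowdown; } point_t;

/* Returns the next cache allocation to measure. pts must be sorted by
   ascending cache_alloc; gran is the partitioning granularity. */
double next_sample(const point_t *pts, int n,
                   double min_cache, double max_cache, double gran) {
    if (n == 0)
        return max_cache;
    if (n == 1)   /* probe the opposite end of the spectrum */
        return (pts[0].cache_alloc > (min_cache + max_cache) / 2.0)
               ? min_cache : max_cache;

    double max_slope = -1.0;
    double best = (min_cache + max_cache) / 2.0;    /* fallback */
    for (int i = 0; i < n - 1; i++) {
        double d = pts[i + 1].cache_alloc - pts[i].cache_alloc;
        /* slope magnitude of this section of the slowdown curve */
        double s = fabs((pts[i + 1].slowdown - pts[i].slowdown) / d);
        if (s > max_slope && d > 2.0 * gran) {      /* section still divisible */
            max_slope = s;
            best = (pts[i + 1].cache_alloc + pts[i].cache_alloc) / 2.0;
        }
    }
    return best;
}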

The idea of our algorithm is to first select two points at or near opposite ends of the spectrum to determine the class (low variance or high variance) of the application.

With some similarity to binary search, each successive point is then chosen as the midpoint of the steepest unknown section of the curve. However, we will not explore sections of the curve with only one unknown point. The justification for this selection process is that the transition phase generally lies on or between the two points which have the steepest slope between them. The detailed algorithm is shown in Figure 3.

C. Fitting Results

Using the algorithm described in Section VI.B and the fitting method described in Section VI.A, all of the SPEC benchmarks were modeled, with high-variance curves generated using four, five, and six sample points, and low-variance curves using four and two sample points. See Figure 4 and Figure 5 for the measured slowdown curves of the SPEC benchmarks as well as our model predictions. Results for fitting with five sample points are omitted from the figures as they are similar to the results with four sample points.

Using four sample points, the model predictions for the high-variance curves show a mean relative error of 4.0% and a maximum relative error of 28% against the measured data. Five sample points resulted in a mean relative error of 3.7% and a maximum relative error of 28%. Accuracy increased more significantly when moving from five to six sample points, resulting in a mean error of 3.4% and a maximum error of only 17%.

Using four points, the largest relative errors appear in fitting ART, EQUAKE, and MCF, principally in predicting slowdown in the section of cache from 4096KB to 2048KB. This error was minimized with an increase in the data provided to the model.

Using two sample points and linear interpolation, the predictions for the low-variance curves show a mean relative error of 2.2% and a maximum relative error of 10% against the measured data. Using four sample points resulted in a mean relative error of 1.4% and a maximum relative error of 8.1%. Thus, the prediction results obtained are very accurate. Predictions using four points for high-variance curves and two sample points for low-variance curves, though predicting a little less well, are generally accurate enough for making scheduling decisions. The entire suite modeled in this way shows 3.4% mean and 28% maximum error; using six and four sample points, respectively, results in 2.8% mean and 17% maximum error.

We also ran our benchmark suite on the second test machine with an 8-way associative, 2MB cache to collect slowdowns for the SPEC benchmarks. This data was used to inform the fitting of models using four sample points, resulting in an average error of 2.1% and a maximum error of 24%, though that error was exceptional: the maximum error excluding that point was 14%. Predicting slowdown for this sample set is an easier problem, as there are relatively fewer unknown data points. Thus, detailed results were only shown for the more difficult case of the 16-way associative data set.

Figure 4. Measured and modeled slowdown curves for selected SPEC benchmarks. Actual data shown as black line, model using fewer sample points shown in red (dark grey) with sample points marked as boxes, model using more sample points shown in blue (light grey) with sample points marked as dots.

Figure 5. Measured and modeled slowdown curves for the remaining SPEC benchmarks, excluding CRAFTY, EON, GZIP, and SIXTRACK as they closely resemble other fits in the low-variance class. Measured data shown as black line, model using fewer sample points shown in red (dark grey) with sample points marked as boxes, model using more sample points shown in blue (light grey) with sample points marked as dots. The bottom row of graphs shows low-variance curves with the model generated using fewer sample points.

To verify the value of calculating the total time as a function of both the sequential and random slowdown, we simplified the model to incorporate only the sequential slowdown and a scaling factor. Fitting this model to the SPEC benchmark suite using four points resulted in a mean relative error of 8.9% and maximum relative error of 65% (composite model: 3.4% and 28%). Using six sample points resulted in a mean relative error of 7.7% and a maximum relative error of 58% (composite model: 2.8% and 17%). Thus, our presented composite model provides greatly increased accuracy over the simpler model.

We also conducted tests comparing our model to piecewise linear interpolation, where simple linear interpolation was performed between adjacent sample points. Though the simpler prediction scheme had a slightly better average error (2.3% for four points and 1.7% for six points), it had greater maximum errors (31% for four points and 22% for six points). To test the extrapolative power of our model and of the simpler scheme, we removed the three highest and lowest sample points from the data set and re-fit using five sample points. In this test, the mean errors were similar, though the maximum errors were 41% for our model and 59% for the simpler scheme.

Thus, the simple prediction scheme failed to extrapolate well, with large errors in prediction beyond the known region.

VII. CONCLUSION

We have presented a simple cache model with a semantic meaning based on typical application access patterns. The model uses three parameters that need to be learned from sample points, i.e. runtimes with different cache allocations. Our composite cache model has strong semantic meaning tied to its parameters and derivation, while providing three times the accuracy of a simpler model which attempts to capture only the slowdown associated with sequential accesses. We developed an algorithm which automatically selects suitable sample points, and we fitted the model using a genetic-algorithm learning approach. In most cases, four to six sample points provided excellent accuracy (2.8% mean relative error, 17% maximum relative error) in matching the actual curves with our predictions when applying the approach to the SPEC benchmarks. These results compare very well to the 7% average and 15% maximum errors in prediction using the white-box model in [12], which simulates the target architecture, and to the 9.3% average error in prediction using the trace-driven memory model in [16]. With a much simpler black-box model, we were able to obtain better average accuracy than the more complex white-box models.

ACKNOWLEDGEMENTS

This research was partially supported by an NSERC Discovery Grant. Thanks to our department for access to the 4MB-cache machine, and to Arash Deshmeh for his mentorship on machine learning.

REFERENCES
[1] N. Aggarwal, P. Ranganathan, N.P. Jouppi, and J.E. Smith, "Isolation in Commodity Multicore Processors," IEEE Computer, June 2007, pp. 49-59.
[2] I. Chihaia and T. Gross, "Effectiveness of Simple Memory Models for Performance Prediction," Proc. IEEE Internat. Symp. on Performance Analysis of Systems and Software (ISPASS), Austin, USA, March 2004, pp. 98-105.
[3] A. El-Moursy, R. Garg, D. Albonesi, and S. Dwarkadas, "Compatible Phase-Coscheduling on a CMP of Multi-Threaded Processors," Proc. IPDPS, April 2006.
[4] L.R. Hsu, S.K. Reinhardt, R. Iyer, and S. Makineni, "Communist, Utilitarian, and Capitalist Cache Policies on CMPs: Caches as a Shared Resource," Proc. PACT '06, ACM, September 2006.
[5] M. Laurenzano, B. Simon, A. Snavely, and M. Gunn, "Low Cost Trace-Driven Memory Simulation Using SimPoint," SIGARCH Computer Architecture News 33(5), Dec. 2005, pp. 81-86.
[6] T. Moseley, A. Shye, V.J. Reddi, M. Iyer, D. Fay, D. Hodgdon, J.L. Kihm, A. Settle, D. Grunwald, and D.A. Connors, "Dynamic Run-Time Architecture Techniques for Enabling Continuous Optimization," Proc. 2nd Conf. on Computing Frontiers (CF), ACM, Ischia, Italy, May 2005.
[7] R. Murphy, A. Rodrigues, P. Kogge, and K. Underwood, "The Implications of Working Set Analysis on Supercomputing Memory Hierarchy Design," Proc. 19th Ann. Internat. Conf. on Supercomputing (ICS), Cambridge, MA, June 2005.
[8] K.J. Nesbit, J. Laudon, and J.E. Smith, "Virtual Private Caches," SIGARCH Comput. Archit. News 35(2), June 2007, pp. 57-68.
[9] D.A. Patterson and J.L. Hennessy, Computer Organization & Design, 2nd edition, Morgan Kaufmann, San Francisco, 1998.
[10] RightMark Memory Analyzer, www.rightmark.org, retrieved September 2008.
[11] E. Rothberg, J.P. Singh, and A. Gupta, "Working Sets, Cache Sizes, and Node Granularity Issues for Large-Scale Multiprocessors," SIGARCH Computer Architecture News 21(2), May 1993, pp. 14-26.
[12] A. Snavely, L. Carrington, N. Wolter, J. Labarta, R. Badia, and A. Purkayastha, "A Framework for Performance Modeling and Prediction," Proc. 2002 ACM/IEEE Conf. on Supercomputing, IEEE Computer Society Press, Los Alamitos, CA, pp. 1-17.
[13] A.C. Sodan, A. Deshmeh, B. Esbaugh, and J. Machina, "Thread-Level Parallelism in Modern CPUs," Technical Report 08-022, University of Windsor, Computer Science, June 2008 (submitted to journal).
[14] SPEC, www.spec.org, retrieved September 2008.
[15] G.E. Suh, S. Devadas, and L. Rudolph, "Analytical Cache Models With Applications to Cache Partitioning," Proc. 15th Internat. Conf. on Supercomputing (ICS '01), ACM, New York, NY, pp. 1-12.
[16] M.M. Tikir, L. Carrington, E. Strohmaier, and A. Snavely, "A Genetic Algorithms Approach to Modeling the Performance of Memory-Bound Computations," Proc. ACM/IEEE Conf. on Supercomputing (SC), Reno, NV, Nov. 2007.
[17] M. Wall, GAlib Genetic Algorithm Library, lancet.mit.edu/ga, version 2.4.7, retrieved November 2008.
[18] O. Wechsler, "Inside Intel Core Microarchitecture: Setting New Standards for Energy-Efficient Performance," Technology@Intel Magazine, March 2006.
[19] X. Zeng, J. Shi, X. Cao, and A.C. Sodan, "Grid Scheduling With ATOP-Grid Under Time Sharing," Proc. CoreGRID Workshop on Grid Middleware, Springer, Dresden, June 2007, pp. 3-18.
