
A good reference book is Real-Time Collision Detection by Christer Ericson, Morgan Kaufmann, 2005. See chapter 13: Optimization.

Both the 360 and PS3 documentation provide extensive descriptions of the inner workings of their CPUs, plus optimization guidelines. For Intel platforms, and for the optimization of C++ constructs, a very good reference is Agner Fog's manuals: http://www.agner.org/optimize/

We don't even have the full processing power available for our 33 milliseconds: the console OS will reclaim some of it from time to time (e.g. to handle a background download), so we have to leave some slack... "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." (Knuth, Donald. Structured Programming with go to Statements, ACM Computing Surveys, Vol. 6, No. 4, Dec. 1974, p. 268.) Note that Knuth attributed the premature optimization statement to Hoare (http://en.wikiquote.org/wiki/C._A._R._Hoare). See Mature Optimization by Mick West, Game Developer Magazine, January 2006: http://cowboyprogramming.com/2007/01/04/mature-optimization-2/

Do not prematurely pessimize: write code avoiding common performance pitfalls...

Aaaaaaaargh!

Code and screenshot from the "lattice" 256-byte intro...

Runtime dependencies: communication, ownership. E.g. the game pushes data to rendering; rendering does not talk back to the game. Compile-time dependencies: types, libraries. E.g. the rendering data is made of game-defined types, so rendering statically depends on the game. More general systems = more complex, more code = harder to change. We should move towards hot-swappable services: http://en.wikipedia.org/wiki/Service-oriented_architecture Even better: live-coding.

http://en.wikipedia.org/wiki/Big_O_notation Some algorithms may be better for large inputs but worse for the input sizes we are using in practice! It happens all the time...

E.g. vector versus map or hash table, insertion sort versus merge sort or quicksort. Also, cache efficiency is a very common issue that makes log(n) algorithms slower than linear ones on small input sizes.
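A minimal sketch of that point, with hypothetical sizes: for small inputs a linear scan over a contiguous vector often beats a node-based map lookup despite the worse big-O, mostly because of cache behaviour.

```cpp
// Sketch: for small N, a linear scan over a contiguous std::vector is often
// faster than a std::map lookup, despite O(n) vs O(log n), because the vector
// stays in one or two cache lines while the map chases pointers.
#include <algorithm>
#include <cstdio>
#include <map>
#include <vector>

int main()
{
    std::vector<std::pair<int, float>> flat;   // contiguous key/value pairs
    std::map<int, float> tree;                 // node-based red-black tree

    for (int i = 0; i < 16; ++i)               // small input size, as in practice
    {
        flat.push_back({i, i * 0.5f});
        tree[i] = i * 0.5f;
    }

    // O(n) scan, but it touches a single small block of memory.
    auto it = std::find_if(flat.begin(), flat.end(),
                           [](const auto& kv) { return kv.first == 11; });

    // O(log n), but each node is a separate heap allocation -> potentially one cache miss per step.
    auto it2 = tree.find(11);

    std::printf("%f %f\n", it->second, it2->second);
    return 0;
}
```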
http://en.wikipedia.org/wiki/Skip_list http://en.wikipedia.org/wiki/Bloom_filter http://en.wikipedia.org/wiki/Radix_tree http://en.wikipedia.org/wiki/Trie http://en.wikipedia.org/wiki/Hash_array_mapped_trie http://en.wikipedia.org/wiki/R-tree http://en.wikipedia.org/wiki/Disjoint-set_data_structure http://en.wikipedia.org/wiki/Kd-tree http://en.wikipedia.org/wiki/Heap_(data_structure) http://en.wikipedia.org/wiki/Treap http://en.wikipedia.org/wiki/Finger_trees http://en.wikipedia.org/wiki/Dancing_Links

Multiple hardware threads per core: CPUs like to have multiple independent instruction paths, so if they are stalled on an instruction in one path, they can use the other one to keep themselves busy... Stalls are usually caused by memory accesses. All cores of the 360 and the PS3 PPU have two hardware threads (so we have six hardware threads to use on the 360, and two PPU threads plus six SPU ones on the PS3).

See Coding For Multiple Cores on Xbox 360 and Microsoft Windows in your Xbox 360 SDK documentation! Design subsystems being aware of thread safety. Minimize shared _mutable_ data. Const-correctness helps. http://en.wikipedia.org/wiki/Thread-safety

Some further reads: Stream processing (Stream Processing in General-Purpose Processors: http://www.cs.utexas.edu/users/skeckler/wild04/Paper14.pdf), Futures in AliceML: http://www.ps.uni-saarland.de/alice/manual/futures.html, Erlang: http://en.wikipedia.org/wiki/Erlang_(programming_language), How the GPU works: http://c0de517e.blogspot.com/2008/04/gpu-part-1.html, Fibers: http://en.wikipedia.org/wiki/Fiber_(computer_science), Map/Reduce (http://en.wikipedia.org/wiki/MapReduce) (http://en.wikipedia.org/wiki/Map_(higher-order_function)), .NET Parallel FX PLINQ implementation (http://en.wikipedia.org/wiki/Task_Parallel_Library), OpenMP (http://en.wikipedia.org/wiki/OpenMP)

The GPU is another unit that executes in parallel and depends on the render thread. Usually the render thread prepares data for the next frame while the GPU is executing the previous one, much like the simulation thread prepares data for the render thread and pushes it into a buffer.

Most of our system libraries are not thread safe; thread safety should be ensured when using them, in our high-level implementation classes. This is done to maximize performance and not end up with locks everywhere (as the synchronized Java containers do, for example, with the catch that some JIT virtual machines can automatically elide locks if needed). For very simple data structures it's possible to write thread-safe versions without locks (http://en.wikipedia.org/wiki/Lock-free_and_wait-free_algorithms), but lock-free programming is a nightmare, avoid it. Purely functional (persistent) data structures (http://books.google.ca/books?id=SxPzSTcTalAC) can be of some usefulness too.
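A minimal sketch of the double-buffered handoff described above, with hypothetical types (RenderItem, DoubleBuffer) and assuming the simulation and render threads rendezvous once per frame; the render thread only ever sees a const view, so shared mutable data is limited to the swap at the sync point.

```cpp
// Sketch (hypothetical types): double-buffering the simulation -> render handoff.
// Assumes both threads rendezvous once per frame (e.g. on a frame-end barrier),
// and swap() is called at that sync point while neither thread touches the buffers.
#include <vector>

struct RenderItem { float transform[16]; int meshId; };

class DoubleBuffer
{
public:
    // Simulation thread writes the frame being prepared.
    std::vector<RenderItem>&       write()       { return m_buffers[m_writeIndex]; }
    // Render thread reads the previously completed frame (const: no shared mutable data).
    const std::vector<RenderItem>& read()  const { return m_buffers[m_writeIndex ^ 1]; }

    // Called at the per-frame sync point only.
    void swap() { m_writeIndex ^= 1; }

private:
    std::vector<RenderItem> m_buffers[2];
    int m_writeIndex = 0;
};
```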

Slides from: bps10.idav.ucdavis.edu/talks/04lefohn_ParallelProgrammingGraphics_BPS_SIGGRAPH2010.pdf

10

http://www.catonmat.net/blog/mit-introduction-to-algorithms-part-fourteen

The king of our modern CPU problems: CPUs and GPUs are becoming faster at a higher pace than memory is! This is also a limit to multithreading performance, as we have to fetch the data to be processed! L1 cache hits (accesses to data that is in the L1 cache) have their cost hidden by pipeline latency (by the execution of other instructions between the loading of the memory into a register and the actual use of that register). L1 cache misses that hit in L2 cost more than 40 cycles, but they can be partially hidden by the execution of the instructions of the other hardware thread in the same core...

Beware of cache behaviour when multithreading: nearby data read and written by two different cores leads to bad performance (false sharing: the L1 caches of the cores will be invalidated each time the other core writes data), while if the same happens on two hardware threads of the same core executing the same code, then the L1 data and code caches will be used optimally (see Caches and Multithreading in your Xbox 360 SDK documentation). Refer to the Xbox SDK paper: Xbox 360 CPU Caches.

Some numbers: on the 360 the cache lines (the minimum amount of data transferred to the cache on a cache miss) are 128 bytes wide, L2 is 1 MB, L1 is 32+32 KB. The L1 cache is write-through (all stores will always, after the store-gathering buffer, go to update the L2 cache) and non-allocating (stores do not fill L1 cache lines if the address is not already there). Stores and loads go to queues; stores are further organized in store-gathering buffers to reorder scattered stores into linear ones before going to the caches. There's no predictive prefetcher logic on the 360 and just basic predictive prefetching on the PS3 (way different from the x86 world; in general the PS3 and 360 CPUs have a lot of raw power and less logic, they're made for experienced programmers, not to improve random code by clever rescheduling and out-of-order execution, and this seems to be the direction of the future anyway).

Another important concept is cache SETS. On the 360 Xenon CPU the L2 cache is 8-way associative: it means that the 1 MB / 128 bytes = 8192 cache lines are organized into 8192/8 = 1024 sets. Caching of a memory address goes into a given set using the formula set = (memory_address / line_size) % number_of_sets. So in our case the set number is (memory_address / 128) % 1024, which means that two addresses that are number_of_sets * line_size = 128 KB apart (the critical stride) fall into the same set. There is space for only 8 cache lines in each set, so if we have a loop where we read consecutively from 9 addresses, each one 128 KB apart, we will have a cache miss on each iteration of the loop, even if the cache is not full! That issue is more serious on the L1 cache, which is 32 KB for data and thus has 256 lines arranged 4-way, leading to 64 sets: the critical stride there is only 8 KB! That can sometimes cause problems, e.g. in a two-dimensional array with rows of 8 KB size, accessing 5 elements of a column always causes a cache miss.
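A minimal sketch of that critical-stride issue (assuming the 360-like numbers above: 128-byte lines, 4-way L1, 8 KB critical stride); the point is the access pattern, not exact timings.

```cpp
// Sketch: walking a column of a 2D array whose rows are exactly one critical
// stride (8 KB here) apart makes every access map to the same L1 cache set.
// With only 4 ways per set, the column walk causes conflict misses on every
// iteration even though the cache as a whole is mostly empty.
#include <cstddef>
#include <cstdint>
#include <vector>

int main()
{
    const std::size_t rowBytes = 8 * 1024;             // one critical stride
    const std::size_t rows     = 1024;
    std::vector<std::uint8_t> image(rows * rowBytes);

    // Column walk: consecutive accesses are 8 KB apart -> same cache set, conflict misses.
    std::uint64_t sumColumn = 0;
    for (std::size_t y = 0; y < rows; ++y)
        sumColumn += image[y * rowBytes + 0];

    // Row walk: consecutive accesses are 1 byte apart -> each 128-byte line is reused 128 times.
    std::uint64_t sumRow = 0;
    for (std::size_t x = 0; x < rowBytes; ++x)
        sumRow += image[0 * rowBytes + x];

    return int(sumColumn + sumRow) & 1;                // keep the compiler from removing the loops
}
```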

11

12

Cache-oblivious linearization of trees can be performed via the van Emde Boas layout (see http://en.wikipedia.org/wiki/Van_Emde_Boas_tree)

13

Within-vector operations are not common on SWAR (SIMD within a register) architectures, as commonly found on CPUs. They are possible on GPU SIMD processors. To be efficiently loaded and stored, SIMD data should be 16-byte aligned (16 bytes = 4 floats).
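A minimal sketch of the alignment requirement using standard C++11 alignas (the actual intrinsics and vector load/store instructions are platform-specific and not shown here):

```cpp
// Sketch: keeping SIMD-friendly data 16-byte aligned. alignas(16) guarantees
// that the four floats can be loaded/stored as one aligned 16-byte vector.
struct alignas(16) Vec4
{
    float x, y, z, w;
};

static_assert(sizeof(Vec4)  == 16, "four floats, one SIMD register");
static_assert(alignof(Vec4) == 16, "aligned vector loads/stores are possible");

// An array of Vec4 stays aligned element by element, so a loop over it can use
// aligned SIMD loads without per-element checks.
Vec4 positions[256];
```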

14

15

16

Moore's law: it was originally about transistor count, and processors roughly managed to respect it. But CPUs are also only matching it in performance, which is odd, as performance should have grown faster than transistor count alone, driven by the transistor count AND rising CPU frequencies. GPUs are following Moore's law in transistor count but beating it (as should be expected) when it comes to performance, though only on heavily data-parallel tasks where all the code runs in parallel (Amdahl's law is the limiting factor there).

What's on the die (PC processors...):
8086...386 -- Mostly processing power: logic units.
486...Pentium II -- Processing power and caches: a bit of cache, FPUs, multiple pipelines.
Pentium III...Pentium 4 -- Caches and scheduling logic: heavy instruction decode/reorder units, branch prediction, cache prediction, longer pipelines.
Core 2...i7 -- Multicore + big caches.
Future -- Back to pure processing power, ALUs on most of the die (and cache). Manycore, small decode stages (in-order, shared between units) and caches (shared between units), wide hardware and logical SIMD, lower power/flops ratio (GPUs, Cell...). Manycore (GPU) integrated with multicore (CPU), sharing a cache level or a direct bus interconnection (single die or fast paths between units: Xenon/Xenos, PS3 PPU/SPU...).

Past: http://www.tayloredge.com/museum/processor/processorhistory.html http://www.cpu-world.com/CPUs/index.html http://www.chip-architect.com/news/2003_04_20_Looking_at_Intels_Prescott_part2.html http://www.thg.ru/cpu/19990809/onepage.html http://www.cs.clemson.edu/~mark/330/p6.html

Future: www.gpgpu.org/static/s2007/slides/02-gpu-architecture-overview-s07.pdf s09.idav.ucdavis.edu/talks/02_kayvonf_gpuArchTalk09.pdf bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf http://www.research.ibm.com/cell/ http://en.wikipedia.org/wiki/Cell_(microprocessor) http://www.anandtech.com/show/4083/the-sandy-bridge-review-intel-core-i5-2600k-i5-2500k-and-core-i3-2100-tested http://sites.amd.com/us/fusion/apu/Pages/fusion.aspx

17

18

19

The prediction is easy, again, we already do that on GPUs, even if shader languages are very constrained in terms of communication. CUDA and OpenCL are more general but on the other hand expose too much of the underlying hardware. We still have to improve our tools, but it will happen.

20

How much latency? On the 360 GPU, from the start of a shader (task) to the end (the write into the framebuffer) there are roughly 1000 GPU cycles of latency.

21

Just a few examples! There are many fast sequential sorts (e.g. radix and the other distribution sorts), many are even faster if the sequence to sort has certain properties (e.g. uniform: Flash sort; almost sorted: Smooth sort) or if certain behaviours are desirable (e.g. cache efficient: Funnel sort; few writes: Cycle sort; extracts the LIS: Patience sort; online: Library sort), and most of them can be parallelized (not only the merge sort). Also, hybrids are often useful (e.g. radix sort plus parallel merge). www.cse.ohio-state.edu/~kerwin/MPIGpu.pdf theory.csail.mit.edu/classes/6.895/fall03/projects/final/youn.ppt http://www.inf.fh-flensburg.de/lang/algorithmen/sortieren/algoen.htm http://elliottback.com/wp/sorting-in-linear-time/ http://en.wikipedia.org/wiki/Sorting_algorithm http://scholar.google.ca/scholar?hl=en&lr=&q=related:W94guGpq0ZIJ:scholar.google.com/&um=1&ie=UTF8&ei=hiWvTeuCMamw0QGStaCiCw&sa=X&oi=science_links&ct=slrelated&resnum=1&ved=0CCQQzwIwAA
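As one concrete instance of the distribution sorts mentioned above, a minimal LSD radix sort sketch over 32-bit keys (one byte per pass, not tuned), just to show why it runs in linear time:

```cpp
// Sketch: least-significant-digit radix sort on unsigned 32-bit keys,
// one byte (256 buckets) per pass, four passes total -> O(n) overall.
#include <cstddef>
#include <cstdint>
#include <vector>

void radixSort(std::vector<std::uint32_t>& keys)
{
    std::vector<std::uint32_t> temp(keys.size());
    for (int shift = 0; shift < 32; shift += 8)
    {
        std::size_t count[256] = {};
        for (std::uint32_t k : keys)                   // histogram of the current byte
            ++count[(k >> shift) & 0xFF];

        std::size_t offset = 0;                        // prefix sum -> bucket start offsets
        for (std::size_t i = 0; i < 256; ++i)
        {
            std::size_t c = count[i];
            count[i] = offset;
            offset += c;
        }

        for (std::uint32_t k : keys)                   // stable scatter into buckets
            temp[count[(k >> shift) & 0xFF]++] = k;

        keys.swap(temp);                               // four passes: result ends up in keys
    }
}
```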

22

The language examples are only a sample of what we can or could use for games. OpenCL and Intel SPMD are data-parallel programming languages (stream oriented? Not really, yet). OCaml, Haskell and C# support functional programming (lambdas, closures). They also support data-parallel tasks (Data Parallel Haskell, Parallel FX) and coroutines (C#: only in the Mono runtime).

Go, Lua and Stackless Python are examples of languages implementing coroutines/continuations (fibers, cooperative threading)
http://c0de517e.blogspot.com/2011/04/2011-current-and-future-programming.html

23

24

Caching is another technique that can be useful to improve data locality. If access to a big data array is random but coherent in time, we can copy the last n accessed items into a buffer that holds them near each other in memory. Then, next time, we can check the buffer first: if it still contains the data we need, we avoid performing random accesses and their cache misses.
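A minimal sketch of such a software cache, with hypothetical sizes and a simple round-robin replacement policy:

```cpp
// Sketch (hypothetical sizes/types): a tiny software cache in front of a large,
// randomly indexed array. If accesses are coherent in time, most lookups hit
// the small contiguous cache instead of missing in the big array.
#include <cstddef>
#include <vector>

template <typename T, std::size_t N = 8>
class SmallCache
{
public:
    explicit SmallCache(const std::vector<T>& backing) : m_backing(backing) {}

    const T& get(std::size_t index)
    {
        for (std::size_t i = 0; i < m_count; ++i)      // check the recently used entries first
            if (m_indices[i] == index)
                return m_values[i];

        const T& value = m_backing[index];             // miss: one random access into the big array
        const std::size_t slot = m_next;               // replace entries round-robin
        m_indices[slot] = index;
        m_values[slot]  = value;
        m_next = (m_next + 1) % N;
        if (m_count < N) ++m_count;
        return m_values[slot];
    }

private:
    const std::vector<T>& m_backing;
    std::size_t m_indices[N] = {};
    T           m_values[N]  = {};
    std::size_t m_count = 0;
    std::size_t m_next  = 0;
};
```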

25

26

This is applicable to n-dimensional arrays. The NxNx... blocks arrangement is cache-aware (it has to be tuned for a specific cache size); the space-filling curve approach is cache-oblivious (it works optimally, within a constant factor, for any cache size).

In the end, most of the time, good design for performance is equal to good design, as the main thing we require in order to tune the code is ease of changing it. That's way different from bad, premature optimization, which usually locks the code into a given shape. The main difference between generic design best practices and performance best practices is that by being aware of some hardware details early on, it's possible to nail a more optimal design from the start.

See: http://www.multi.fi/~mbc/sources/fatmap.txt http://en.wikipedia.org/wiki/Space-filling_curve http://my.safaribooksonline.com/0201914654/ch14lev1sec6
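One possible implementation of the space-filling-curve layout mentioned above is Morton (Z-order) indexing; a minimal sketch for the 2D case:

```cpp
// Sketch: Morton (Z-order) indexing for a 2D array. Interleaving the bits of
// x and y keeps elements that are close in 2D close in memory, so neighborhood
// accesses stay cache-friendly regardless of the cache size (cache-oblivious).
#include <cstdint>

// Spread the lower 16 bits of v so there is a zero bit between each of them.
static std::uint32_t part1By1(std::uint32_t v)
{
    v &= 0x0000FFFF;
    v = (v | (v << 8)) & 0x00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

// Linear index of element (x, y) in a Morton-ordered array.
static std::uint32_t mortonIndex(std::uint32_t x, std::uint32_t y)
{
    return (part1By1(y) << 1) | part1By1(x);
}

// Usage: store the array as data[mortonIndex(x, y)] instead of data[y * width + x].
```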

27

28

http://macton.smugmug.com/gallery/8936708_T6zQX/1/593426709_ZX4pZ#593426709_ZX4pZ

29

http://macton.smugmug.com/gallery/8936708_T6zQX/1/593426709_ZX4pZ#593426709_ZX4pZ

30

31

The 80/20 rule works only if the code was written in a proper way... If you write code without any awareness of its performance you won't find any significant hotspot to optimize, everything will be bad, especially in huge projects like ours!

Trivial functions should be inlined, otherwise the compiler can't perform a huge number of optimizations, as it can't know the implementation of a given function until link time (that can be avoided with bulk builds, or by enabling link-time or whole-program optimizations). Forcing complex functions to be inlined can lead to increased code size and thus decreased code cache efficiency. It should be done only in inner loops, probably unrolling them too, but only when tuning, and using a profiler to find the right inner loops to optimize!

The templates versus virtual function calls issue is nasty (dynamic versus static branching); it's a design decision that's hard to make early on, without profiling. Using sized integers can be useful to save space and thus improve cache usage, but this is something that can be done later, after profiling, without a big impact on the code design (if the proper getters and setters were created). The only thing that is worth doing early on is the use of bitfields to store multiple boolean values, as the standard bool type takes quite some space. Usually static and global variables are slower to access than class members (which COULD live on the heap), which are slower than local variables (which live on the stack; stack data is most probably in our caches).

32
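For the bitfield point in the note above, a minimal sketch (the flag names are hypothetical):

```cpp
// Sketch: packing boolean flags into a bitfield instead of separate bools.
// Four separate bools typically take four bytes; the bitfield below fits in one.
#include <cstdint>

struct EntityFlags
{
    std::uint8_t visible     : 1;
    std::uint8_t active      : 1;
    std::uint8_t castsShadow : 1;
    std::uint8_t isStatic    : 1;
    // four bits still free for future flags
};

static_assert(sizeof(EntityFlags) == 1, "four flags packed into a single byte");
```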

See, in the Xbox 360 SDK documentation, the paper: Xbox 360 CPU: Best Practices.

33

Design your DATA first!

34
