
TRANSGAMING INC.

WHITE PAPER: SWIFTSHADER TECHNOLOGY JANUARY 29, 2013

SwiftShader:

Why the Future of 3D Graphics is in Software

For some time now, it has been clear that there is strong momentum for convergence between CPU and GPU technologies. Initially, each technology supported radically different kinds of processing, but over time GPUs have evolved to support more general-purpose use while CPUs have evolved to include advanced vector processing and multiple execution cores. In 2013, even more key GPU features will make their way into mainstream CPUs. At TransGaming, we believe that this convergence will continue to the point where typical systems have only one type of processing unit, with large numbers of cores and very wide vector execution units available for high-performance parallel execution. In this kind of environment, all graphics processing will ultimately take place in software.

Through our SwiftShader software GPU toolkit, TransGaming has long been a world leader in software-based rendering, with widespread adoption of our technology by some of the world's top technology companies, and a US patent on key techniques issued in late 2012. This whitepaper explores the past, present, and future of software rendering, and why TransGaming expects that the technology behind SwiftShader will be a critical component of future graphics systems.

SwiftShader Today
In 2005, TransGaming launched SwiftShader, for the first time providing a software-only implementation of a commonly used graphics API (Microsoft Direct3D), including shader support, at performance levels fast enough for real-time interactive use. Since then, SwiftShader has found an important niche in the graphics market as a fallback solution, ensuring that our customers' software will still run even in cases where the available hardware or graphics drivers are inadequate, out of date, or unstable.


This fallback case is a critical one for software that needs to run no matter what system an end user has in place. TransGaming has licensed SwiftShader to companies such as Adobe for use with Flash as a fallback for the Stage3D API, and to Google to implement the WebGL API within Chrome and Native Client. Beyond this, SwiftShader has found customers in markets as diverse as medical imaging and the defense industry. All of these customers require a solution that will put the right pixels on the screen 100% of the time.

Another important area where SwiftShader is being used today is in cloud computing and virtualization systems. Servers in data centers that include GPU capabilities are currently substantially more expensive than normal servers. Using SwiftShader and software rendering thus allows substantial savings and flexibility for developers with server-oriented applications that require some degree of graphics capability.

A key part of the reason that SwiftShader is useful as a fallback option in situations where a hardware GPU is not available or not reliable is that it is capable of achieving performance that approaches that of dedicated hardware. With a 2010-era quad-core CPU, SwiftShader scores 620 points in the popular 3DMark06 DirectX 9 benchmark; this is higher than the scores of many previous-generation integrated GPUs.

Figure 1: SwiftShader running 3DMark06

Software Rendering's Future Advantages


While today's software rendering results are good enough for some applications, current-generation integrated GPUs still have a substantial performance advantage. Why, then, does TransGaming believe that software rendering will have a more important role in the future, beyond a reliable fallback? The answers are straightforward.

As CPUs continue to increase their parallel processing performance, they become adequate for a wider range of graphics applications, thus saving the cost of additional, unnecessary hardware. Hardware manufacturers can then focus resources on optimizing and improving hardware with a single architecture, and thus avoid the costs of melding separate CPU and GPU architectures in a system.

As graphics drivers and APIs get more complex and diverse, the issues of driver correctness and stability become ever more important. In today's world, software developers must test their applications on an almost infinite variety of different GPUs, drivers, and OS revisions. With a pure software approach, these problems all go away. There are no feature variations to worry about, other than performance, and developers can always ship applications with a fully stable graphics library, knowing that it will work as expected no matter what. Software rendering thus saves time and money for all participants in the platform and ecosystem during development, testing, and maintenance.

Beyond cost savings, software rendering has numerous additional advantages. For example, graphics algorithms that today use a combination of CPU and GPU processing must split the workload in a suboptimal way, and developers must deal with the complexity of handling different bottlenecks to ensure that each pipeline remains balanced. Software rendering also simplifies optimization and debugging by using a single architecture, allowing the use of well-established CPU-side profilers and debuggers. A simpler, uniform memory model also liberates developers from having to deal with multiple memory pools and inconsistent data access characteristics, creating additional freedom to explore new graphics algorithms.

Most importantly, however, software rendering allows unlimited new capabilities to be used at any time. New graphics API releases can always be compatible with existing hardware, and developers can add new functionality at any layer of their graphics stack. The only limits become those of the developers' imagination. All of this, however, can only become true if software rendering can close the performance gap. At TransGaming, we believe that this is very achievable, and that upcoming hardware advances will prove it out. To understand why requires a deeper dive into the technical side of SwiftShader.

SwiftShader: The State of the Art in Software Rendering


This section highlights some of the key technologies that differentiate SwiftShader from other renderers, and illustrates how the challenges posed by software rendering can be overcome.

One of the seemingly major advantages of dedicated 3D graphics hardware is that it can switch between different operations at no significant cost. This is particularly relevant to real-time 3D graphics because every stage of the graphics pipeline depends on a certain state that determines which calculations are performed. For instance, take the alpha blending stage used to combine pixels from a new graphics operation with previously drawn pixels. This stage can use several different blending functions, each of which takes various input arguments. Handling this kind of work is a challenge for traditional software approaches that use conditional statements to perform different operations. The resulting CPU code ends up containing more test and branch instructions than arithmetic instructions, resulting in slower performance compared to code that has been specialized for just a single combination of states, or to hardware with separate logic for each blending function. A naïve software solution that includes pre-built specialized routines for every combination of blending states is not feasible, because the combinatorial explosion would result in excessive binary code size.

The practical solution that SwiftShader uses for this type of problem is to compile only the routines for state combinations that are needed at run-time. In other words, SwiftShader waits for the application to issue a drawing command and then compiles specialized routines which perform just those operations required by the states that are active at the time the drawing command is issued. The generated routines are then cached to avoid redundant recompilation. The end result is that SwiftShader can support all the graphics operations supported by traditional GPUs, with no render-state-dependent branching code in the processing routines. This elimination of branching code also comes with secondary benefits such as much improved register reuse. This technique of dynamic code generation with specialization has proven to be invaluable in making software rendering a viable choice for many applications today, and it naturally extends to the run-time compilation of current types of programmable shaders. Most importantly, it opens up huge opportunities for future techniques.
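To make the idea concrete, the following is a minimal C++ sketch of state-keyed routine caching. It is an illustration only, not SwiftShader's actual code: the BlendFunc, generateRoutine, and getRoutine names are hypothetical, and a real implementation would emit JIT-compiled machine code for the active state rather than selecting a prewritten lambda.

    // Minimal sketch of state-keyed routine caching (hypothetical names; not
    // SwiftShader's actual code). The branch on render state happens once, at
    // routine-generation time, instead of once per pixel in the inner loop.
    #include <cstdint>
    #include <functional>
    #include <unordered_map>
    #include <vector>

    enum class BlendFunc : std::uint32_t { Disabled, Add, Multiply };

    using BlendRoutine = std::function<void(std::vector<float> &dst,
                                            const std::vector<float> &src)>;

    // Stand-in for run-time code generation.
    BlendRoutine generateRoutine(BlendFunc state) {
        switch (state) {
        case BlendFunc::Add:
            return [](std::vector<float> &dst, const std::vector<float> &src) {
                for (std::size_t i = 0; i < dst.size(); i++) dst[i] += src[i];
            };
        case BlendFunc::Multiply:
            return [](std::vector<float> &dst, const std::vector<float> &src) {
                for (std::size_t i = 0; i < dst.size(); i++) dst[i] *= src[i];
            };
        default:  // blending disabled: simply overwrite the destination
            return [](std::vector<float> &dst, const std::vector<float> &src) {
                for (std::size_t i = 0; i < dst.size(); i++) dst[i] = src[i];
            };
        }
    }

    // Each state combination is generated at most once; later draw calls with
    // the same active state reuse the cached routine.
    BlendRoutine getRoutine(BlendFunc state) {
        static std::unordered_map<std::uint32_t, BlendRoutine> cache;
        auto key = static_cast<std::uint32_t>(state);
        auto it = cache.find(key);
        if (it != cache.end()) return it->second;
        return cache[key] = generateRoutine(state);
    }

The essential property is that the routine returned by getRoutine contains no per-pixel branching on render state, mirroring the specialization described above.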

In addition to using dynamic code generation, SwiftShader also achieves some of its performance through the use of CPU SIMD instructions. SwiftShader pioneered the implementation of Shader Model 3.0 software rendering by using SIMD instructions to process multiple elements, such as pixels and vertices, in parallel. By contrast, the classic way of using these instructions is to execute vector operations for only a single element. For example, other software renderers might implement a 3-component dot product using the following sequence of Intel x86 SSE2 instructions:
    mulps   xmm0, xmm1
    movhlps xmm1, xmm0
    addps   xmm0, xmm1
    pshufd  xmm1, xmm0, 1
    addss   xmm0, xmm1

Note that this sequence requires five instructions to compute a single 3-component dot product, a common operation in 3D lighting calculations. This is no faster than using scalar instructions, and thus many legacy software renderers did not obtain an appreciable speedup from the use of vector instructions. SwiftShader instead uses them to compute multiple dot products in parallel:
    mulps xmm0, xmm3
    mulps xmm1, xmm4
    mulps xmm2, xmm5
    addps xmm0, xmm1
    addps xmm0, xmm2

The number of instructions is the same, but this sequence computes four dot products at once. Each vector register component contains a scalar variable (which itself can be a logical vector component) from one of four pixels or vertices. Although this is straightforward for an operation like a dot product, the challenge that TransGaming has solved with SwiftShader is to efficiently transform all data into and out of this format, while still supporting branch operations.

Earlier versions of SwiftShader made use of our in-house developed dynamic code generator, SwiftAsm, which used a direct x86 assembly representation of the code to be generated. This offered excellent low-level control, but at the cost of dealing with different sets of SIMD extensions and of determining code dependencies based on complex state interactions. We've since taken things to the next level by abstracting the operations into a high-level, shader-like language which integrates directly into C++. This layer, which we call Reactor, outputs an intermediate representation that can then be optimized and translated into binary code using a full compiler back-end. We chose to use the well-known LLVM framework due to its excellent support of SIMD instructions and straightforward use for run-time code generation. The combination of Reactor and LLVM forms a versatile tool for all dynamic code generation needs, exploiting the power of SIMD instructions while abstracting away the complexities. A simple example of how Reactor is used in the implementation of the cross product (crs) shader instruction illustrates this well:

    void ShaderCore::crs(Vector4f &dst, Vector4f &src0, Vector4f &src1)
    {
        dst.x = src0.y * src1.z - src0.z * src1.y;
        dst.y = src0.z * src1.x - src0.x * src1.z;
        dst.z = src0.x * src1.y - src0.y * src1.x;
    }

This looks exactly like the calculation to perform a cross product in C++, but the magic is in the use of Reactor's C++ template system. The Reactor Vector4f data type is defined with overloaded arithmetic operators that generate the required instructions for SIMD processing in the output code.
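To illustrate the general idea, here is a heavily simplified sketch of how operator overloading can record instructions instead of computing values. This is my illustration of the technique, not TransGaming's actual Reactor implementation: the IRBuilder and Float4 names are hypothetical, and the string-based instruction stream stands in for real LLVM IR.

    #include <string>
    #include <vector>

    // Records instructions instead of executing arithmetic immediately.
    struct IRBuilder {
        std::vector<std::string> instructions;  // stand-in for real IR
        int nextId = 0;
        int emit(const std::string &op, int a, int b) {
            int result = nextId++;
            instructions.push_back("%" + std::to_string(result) + " = " + op +
                                   " %" + std::to_string(a) +
                                   ", %" + std::to_string(b));
            return result;
        }
    };

    IRBuilder builder;  // one global builder, for brevity in this sketch

    // A four-wide value whose arithmetic emits SIMD IR rather than doing
    // the math immediately.
    struct Float4 {
        int id;  // identifier of the value in the recorded instruction stream
        Float4() : id(builder.nextId++) {}
        explicit Float4(int existing) : id(existing) {}
    };

    Float4 operator*(const Float4 &a, const Float4 &b) {
        return Float4(builder.emit("fmul <4 x float>", a.id, b.id));
    }
    Float4 operator-(const Float4 &a, const Float4 &b) {
        return Float4(builder.emit("fsub <4 x float>", a.id, b.id));
    }

    struct Vector4f { Float4 x, y, z, w; };

    // Written like ordinary C++, but executing it records the instruction
    // sequence for a cross product, ready to be compiled by a JIT back-end.
    void crs(Vector4f &dst, const Vector4f &src0, const Vector4f &src1) {
        dst.x = src0.y * src1.z - src0.z * src1.y;
        dst.y = src0.z * src1.x - src0.x * src1.z;
        dst.z = src0.x * src1.y - src0.y * src1.x;
    }

In the real system, the recorded intermediate representation is handed to LLVM, which performs the optimization and binary code generation described above.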

In addition to eliminating branches and making effective use of SIMD instructions, SwiftShader also achieves substantial speedups through the use of multi-core processing. While this may at first seem obvious and relatively straightforward, it poses both challenges and opportunities. Graphics workloads can be split into concurrently executable tasks in many different ways. One can choose between sort-first or sort-last rendering, or a hybrid approach. Each task can also be divided into more tasks through data parallelism, function parallelism, and/or instruction parallelism. Subdividing tasks and scheduling them onto cores and threads comes with some overhead, and these typically become fixed processes once a specific approach is chosen; a simple example of one such policy is sketched below.

TransGaming has identified opportunities to minimize this overhead, and in some cases even exceed the theoretical speedup of using multiple cores, by combining dynamic code generation with the choice of subdivision/scheduling policy. Information about the processing routines, obtained during run-time code generation, can be used during task subdivision and scheduling, while information about the subdivision/scheduling can be used during run-time code generation. We believe this advantage to be unique to software rendering, because only CPU cores are versatile enough to do dynamic code generation, intelligent task subdivision/scheduling, and high-throughput data processing. Further information about these techniques can be found in TransGaming's US Patent No. 8,284,206, "General purpose software parallel task engine." While the patent was granted in late 2012, the original provisional patent was filed in early 2006, well before other modern software rendering efforts such as Intel's Larrabee became public.
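As a concrete illustration of one simple subdivision policy, the sketch below splits the framebuffer into horizontal bands and rasterizes each band on its own thread. This is an assumption-laden example of mine (the DrawCall and rasterizeBand names are placeholders), not SwiftShader's actual scheduler, which as described above is co-designed with the run-time code generator.

    #include <algorithm>
    #include <functional>
    #include <thread>
    #include <vector>

    struct DrawCall { /* triangle and state data elided for this sketch */ };

    // Placeholder worker: rasterize 'draw' into scanlines [yMin, yMax).
    void rasterizeBand(const DrawCall &draw, int yMin, int yMax) {
        (void)draw; (void)yMin; (void)yMax;  // per-scanline work elided
    }

    void drawParallel(const DrawCall &draw, int framebufferHeight) {
        unsigned workers = std::max(1u, std::thread::hardware_concurrency());
        int band = (framebufferHeight + int(workers) - 1) / int(workers);

        std::vector<std::thread> threads;
        for (unsigned i = 0; i < workers; i++) {
            int yMin = int(i) * band;
            int yMax = std::min(framebufferHeight, yMin + band);
            if (yMin >= yMax) break;
            // Each thread owns a disjoint range of scanlines, so no two
            // threads write the same pixel and no framebuffer locking is
            // needed.
            threads.emplace_back(rasterizeBand, std::cref(draw), yMin, yMax);
        }
        for (auto &t : threads) t.join();
    }

This particular split is sort-first in flavor; a sort-last split would instead have each core render into its own buffer and merge the results, and as noted above the preferred policy can itself be informed by what the generated routines look like.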

Convergence and Trends


The previous sections of this whitepaper show some of the substantial advantages of software rendering, and demonstrate that the technology to use the full computing power of a modern CPU as efficiently as possible is already here. In order to fully understand the coming convergence between CPU and GPU, however, we must also consider the evolution of the GPU side of the equation. First, we must understand what makes modern GPUs exceptionally fast parallel computation engines, and what the limits of growth may be for the approaches used to provide that speed.

Modern GPUs have two critical features that enable the majority of their performance: they provide a large number of heavily pipelined parallel computation units, and they drive many execution threads through these units simultaneously. This allows them to hide the long latencies that frequently occur when executing operations such as texture fetching. While one thread is waiting for a texture fetch result, another thread occupies the computation units. Context switches are therefore designed to be very efficient on a GPU. Keeping many threads active simultaneously requires a large number of registers to be available: the more registers a given instruction sequence requires, the fewer threads can be run simultaneously.
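To see how register pressure limits this latency hiding, consider a purely illustrative register budget (the figures below are hypothetical, chosen only to show the trade-off). With a 65,536-entry register file per compute unit,

\[
\text{resident threads} \;\le\; \frac{\text{register file size}}{\text{registers per thread}},
\qquad
\frac{65{,}536}{32} = 2{,}048
\quad\text{but}\quad
\frac{65{,}536}{128} = 512,
\]

so a shader that needs four times as many registers can keep only a quarter as many threads in flight to cover memory latency.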

The lowest organizational level of computation on a GPU is known as a warp on NVIDIA GPUs and a wavefront on AMD GPUs. This is similar to the SIMD width in a CPU vector unit. Current-generation GPU hardware typically uses 1024-bit or 512-bit wide SIMD units, compared to the 256-bit wide SIMD units used by current-generation CPUs. The wide SIMD approach also has some important limitations. One is that control statements within a given instruction sequence cause divergence, which requires evaluating multiple code paths. With a wider SIMD width, this divergence becomes more common, eliminating some of the execution parallelism. Another limiting factor for graphics processing is that pixels are processed in rectangular tiles, so rendering triangles regularly leaves some lanes unused. A further limitation is the number of registers available: larger register files lower computational density, so GPU manufacturers must balance that against the stalls caused by running out of storage for covering RAM access latencies.

By contrast, CPUs are optimized for low-latency operation. On a CPU core, a significant amount of logic is devoted to scheduling that allows many functional units to be used simultaneously through out-of-order execution. Branch prediction units and shorter SIMD widths reduce the penalties for branch-heavy code, and more die space is devoted to caches and memory-management functionality. CPUs typically support running at significantly higher clock frequencies as well.

CPUs are now evolving to support increased parallelism at the SIMD width level, as well as with additional execution units available to simultaneous threads and larger numbers of CPU cores per die. Intel's Haswell chips, available later this year, will include three 256-bit wide SIMD units per core, two of which are capable of a fused multiply-add operation. This arrangement will process up to 32 floating-point operations per cycle; with four cores on a mid-range version of this architecture, this provides about 450 raw GFLOPS at 3.5 GHz (the arithmetic behind this figure is written out at the end of this section). Intel's AVX2 instruction set offers room to increase the SIMD width to 1024 bits, which would put raw CPU GFLOPS at levels similar to the highest-end GPUs currently available.

At the same time, GPUs are becoming more and more like CPUs, adding more advanced memory-management features such as virtual memory and the corresponding MMU complexity that is required. GPU instruction scheduling is becoming more complex as well, with out-of-order features such as register scoreboarding, and ILP-extraction features such as superscalar execution. Furthermore, GPU-vendor-sponsored research suggests that running fewer threads simultaneously might lead to better performance in many cases [1].

[1] Better Performance at Lower Occupancy: http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
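The roughly 450 GFLOPS figure quoted above for a four-core Haswell follows directly from the per-core arithmetic (assuming the two FMA-capable 256-bit units supply the 32 floating-point operations per cycle, with 8 single-precision lanes per unit and 2 operations per fused multiply-add):

\[
2 \times 8 \times 2 = 32 \ \frac{\text{FLOP}}{\text{cycle} \cdot \text{core}},
\qquad
32 \times 4\ \text{cores} \times 3.5\ \text{GHz} = 448\ \text{GFLOPS} \approx 450\ \text{GFLOPS}.
\]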

Die-level Integration and Bandwidth


One of the trends that shows the clearest indication of convergence between CPUs and GPUs is the increasingly common die-level integration of current-generation, differentiated CPU and GPU units. This trend has become more and more important with the rise of mobile devices, which require both graphics and CPU performance in a single low-power chip. Most desktop chips sold today also include an on-die GPU. The very existence of this important market shows the value of CPU / GPU convergence. While today the market is served by chips that integrate separate units on the same die, the potential advantages of a fully unified chip are clear: hardware manufacturers would be able to manufacture simpler macro-level designs with computation cores that could be used for either general-purpose or graphics workloads as needs arise in the system.

While one of the traditional hallmarks of strength with GPU computing has been the use of high-bandwidth dedicated memory, this advantage becomes moot in environments where the GPU must share memory with the CPU. There is no question that the availability of high-bandwidth memory will continue to be a strength of discrete GPU computing, but there are ways in which both integrated CPU / GPU packages and potential future unified chips can offset this distinction. One approach to providing increased performance for these chips is on-package memory. This approach has already proven useful in Microsoft's Xbox 360, which includes a 10 MB eDRAM framebuffer. Intel's Haswell integrated GPU will optionally include 128 MB of high-performance memory for this purpose as well. AMD's next-generation Kaveri Fusion architecture is designed to fully share memory and address space between its CPU and GPU components. Clearly, any future unified-architecture chip will not suffer from bandwidth limitations any more than existing integrated designs might.

    Device                                      Speed (MHz)  Area (mm²)  GFLOPS   Power (W)  GFLOPS/Watt  GFLOPS/mm²
    Discrete GPUs
      NVidia Kepler GK104 (GTX 680)             1006         294         3090.4   195        15.85        10.5
      NVidia Kepler GK107 (GTX 650)             928          118         812.5    64         12.7         6.88
      ATI Radeon HD 7870 XT                     975          365         2995.2   185        16.19        8.21
    Integrated GPUs
      NVidia Tegra 4 GPU Only (est.)            520          ~30         74.8     ~3.8       ~19.68       ~2.49
      Intel Ivy Bridge HD4000 GPU Only (est.)   1150         ~57         294.4    14.8       19.89        ~5.16
    CPU Hardware
      Intel Haswell Quad-Core, no GPU (est.)    3100         ~96         ~396.8   ~65        ~6.2         ~4.13
      Intel Haswell Single Core (est.)          3100         ~24         ~99.2    ~16        ~6.2         ~4.13
      Intel Haswell ULX Single Core (est.)      1500         ~24         ~48      ~4         ~12.0        ~2.0

Table 1: GFLOPS per Watt and GFLOPS per mm²
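As a check on how the derived columns in Table 1 are computed, take the GTX 680 row:

\[
\frac{3090.4\ \text{GFLOPS}}{195\ \text{W}} \approx 15.85\ \text{GFLOPS/W},
\qquad
\frac{3090.4\ \text{GFLOPS}}{294\ \text{mm}^2} \approx 10.5\ \text{GFLOPS/mm}^2.
\]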

Computational Efficiency
One argument often raised in favor of GPU computing is that GPUs have greater computational efficiency than CPUs.



While this is true today, the advantage is much smaller than one might think, and there is every reason to believe that it will disappear with future CPU designs as the convergence trends described above continue. Table 1 summarizes the performance per unit area and the performance per Watt of discrete GPUs, integrated GPUs, and Intel's Haswell CPU. Estimates are drawn from the websites in the footnote below [2], with the following additional assumptions:

- Tegra 4 GPU area is estimated as 37.5% of overall SoC area.
- Tegra 4 GPU TDP is estimated as 50% of overall SoC TDP, based on reported battery size and battery-life estimates for the NVidia Shield device (38 Watt-hour battery, ~5 hour battery life).
- Ivy Bridge HD 4000 GPU size is estimated as 31% of the overall die, based on visual estimates from die photographs.
- Ivy Bridge HD 4000 GPU power consumption is taken from AnandTech measurements.
- Haswell core size is estimated as 13% of the overall die, based on visual estimates of die photographs.
- Haswell single-core power at 3.1 GHz is estimated based on data from the Tom's Hardware article cited in the footnote.
- Haswell ULX power at 1.5 GHz is estimated as 50% of the 10 Watt TDP reported in the AnandTech article cited in the footnote.

While the data above clearly shows GPUs as more efficient in raw GFLOPS performance, the result is hardly overwhelming. Given the potential for scaling raw GFLOPS on a CPU-style architecture at relatively low power cost by providing wider SIMD vectors, there is no reason to imagine that computational efficiency is a bar to future unified architectures.

[2] Data for Table 1 was compiled from:
http://en.wikipedia.org/wiki/GeForce_600_Series
http://www.zdnet.com/nvidia-claims-tegra-4-gpu-will-outperform-theipad-4s-a6x-7000009888/
http://www.anandtech.com/show/6550/more-details-on-nvidias-tegra4-i500-5th-core-is-a15-28nm-hpm-ue-category-3-lte
http://i1247.photobucket.com/albums/gg628/mrob27/HSW-4c-GT2-rev2_zps3212a12b.png
http://www.extremetech.com/computing/144778-atom-vs-cortex-a15-vskrait-vs-tegra-3-which-mobile-cpu-is-the-most-power-efficient
http://en.wikipedia.org/wiki/Comparison_of_AMD_graphics_processing_units#Southern_Islands_.28HD_7xxx.29_series
http://www.nordichardware.com/CPU-Chipset/intel-core-i7-3770k-ivybridge-and-the-3d-transistor-is-here/New-graphics-the-biggest-newsin-Ivy-Bridge.html
http://www.anandtech.com/show/5878/mobile-ivy-bridge-hd-4000-investigation-realtime-igpu-clocks-on-ulv-vs-quadcore
http://www.tomshardware.com/gallery/haswell-665x269,0101-364392-02-3-1-jpg-.html
http://www.anandtech.com/show/6655/intel-brings-core-down-to-7wintroduces-a-new-power-rating-to-get-there-yseries-skus-demystified
http://www.chip-architect.com/news/2012_04_19_Ivy_Bridges_GPU_225_times_Sandys.html

Transitioning to Unified Computing

As we have seen above, there are strong trends towards convergence between the GPU and the CPU, and no obvious obstacles to unified computing. Are there any other important factors required to complete a transition to fully unified architectures?

One caveat in the above compute-density comparisons is that so far we've only looked at the programmable hardware. The relative amount of fixed-function GPU hardware has been shrinking, but it is worth considering whether anything changes if we implement these functions in software. It is a common misconception that replacing fixed-function hardware would require significant additional programmable hardware in order to achieve the same peak throughput. To illustrate why this is incorrect, we'll focus on the texture units, which remain the most prominent fixed-function logic on today's GPUs.

These texture units are, at any given time, either a bottleneck or underutilized. GPU manufacturers try to prevent them from being a bottleneck by having more texture units than the average shader needs (bandwidth and area permitting). As a result, these additional texture units are, more often than not, idle. By contrast, with unified hardware, the amount of additional programmable hardware required would be based on the average utilization. We have confirmed this experimentally by collecting statistics through the use of run-time performance counters in SwiftShader: the average TEX:ALU ratio in observed shaders is lower than the ratio of TEX:ALU hardware in contemporary GPUs. Software rendering on unified hardware thus has the potential to outperform dedicated hardware, by having more programmable logic available that does not suffer from underutilization. Even in the case where the GPU's texture units are a bottleneck, software rendering on a unified CPU may outperform it for simple sampling operations with high cache locality.



The second factor that makes it feasible to replace fixed-function texture samplers with programmable hardware is that texture sampling is by nature pipelined, consisting of several logical stages, most of which are configurable: address generation, mipmap level-of-detail (LOD) determination, texel gather, and filtering. Different functional units inside a CPU core can perform work at each of these stages. For example, texel gathering can be performed by a load/store unit while the SIMD floating-point ALUs are completing filtering on a previous sample. Furthermore, SwiftShader's dynamic code generation can completely eliminate the LOD determination stage when mipmapping is inactive, either because it is disabled explicitly or because the texture has only one level. Similarly, filtering can range from none at all, to trilinear anisotropic filtering, and beyond. Modern GPU hardware provides support for one trilinearly filtered sample per clock cycle, implementing anisotropic filtering using multiple cycles. This means that during anisotropic filtering, some of the other stages are idle. To implement more advanced filtering, a shader program is required, i.e. software. Likewise, on the CPU we can generate specialized routines for 1D, 2D and 3D texture addressing. All this leads to the general observation that as usage diversifies, bottlenecks and underutilization can be addressed by new forms of programmability and unification.

This brings us to a third argument: performing texture operations on unified hardware enables global optimizations. We have observed that many shaders sample multiple textures using the same or similar texture coordinates. This enables SwiftShader's run-time compiler back-end to eliminate common sub-expressions. For instance, if a shader uses a regular grid of sample locations to implement a higher-order filter, fewer addresses have to be computed than if each of the samples computed both coordinates independently.

Unifying the CPU and GPU also means that some portions of the GPU's fixed-function hardware could be added to the CPU's architecture as new instructions, and these could then be used for new purposes as well. One major example is the addition of gather support to commodity CPUs, which will become available with Intel's Haswell CPU later this year. This will speed up software rendering considerably, by transforming serial texel fetches into a parallel operation (a minimal sketch of the idea appears below). But the gather instruction can also speed up a multitude of other graphics operations, such as vertex attribute fetches, table lookups for transcendental functions, arbitrarily indexed constant buffer accesses, and more. The possible uses go far beyond graphics. The decoupling of the gather operation from filtering also enables the optimization of texture sampling operations that require no filtering. In recent years the use of graphics algorithms requiring non-texture-related memory lookups has increased, spawning the addition of a generic gather intrinsic in shader languages. Besides gather, several other generic instructions could be added to the CPU to ensure efficient texture sampling in software. For example, to efficiently pack and unpack small data fields, a vector version of the bit manipulation instructions (BMI1/2) could be added.

Note that our analysis above has covered the decoupling and unification of every major stage in texture sampling. This approach would eliminate both bottlenecks and underutilization, and would also enable new optimizations. It may seem like a large number of instructions would still be required to implement texture sampling, but it is important to keep in mind that on a unified architecture each core can execute any operation. When GPU hardware first became programmable and spent multiple cycles per pixel, vendors were still able to improve performance by adding additional arithmetic units running in parallel. Likewise, breaking up texture sampling into simpler operations allows the work to be spread over more CPU functional units and cores.

A similar analysis can be made for fixed-function raster output (ROP) operations. In fact, for GPU hardware that supports OpenGL's GL_EXT_shader_framebuffer_fetch extension [3], colour blending is already performed by the shader units. At the time of writing this is only supported by mobile GPUs, a fact that also illustrates that replacing dedicated hardware with programmable hardware doesn't have to be detrimental to performance or power consumption. The ROP units are typically also responsible for anti-aliasing (AA). Interestingly, this functionality was broken in some versions of AMD's R600 architecture [4], but this did not prevent them from launching the product, as the drivers were able to implement anti-aliasing using the shader units. Note that compute density has since increased significantly, so not having dedicated AA hardware would now have an even lower impact. Moreover, replacing dedicated hardware with more general-purpose compute units allows the ROP unit die area to be used for many other purposes.
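To make the gather example above concrete, here is a minimal AVX2 sketch (my illustration, not SwiftShader code) that fetches eight texels from a single-channel floating-point texture with one gather instruction in place of eight serial scalar loads. The gatherTexels name and the texture layout are assumptions for illustration, and the code requires an AVX2-capable compiler target.

    #include <immintrin.h>

    // texture: a width x height array of floats; x and y hold eight integer
    // texel coordinates, one per SIMD lane.
    __m256 gatherTexels(const float *texture, int width, __m256i x, __m256i y) {
        // index = y * width + x, computed for all eight lanes at once
        __m256i index = _mm256_add_epi32(
            _mm256_mullo_epi32(y, _mm256_set1_epi32(width)), x);

        // One gather replaces eight dependent scalar loads; scale = 4 bytes.
        return _mm256_i32gather_ps(texture, index, 4);
    }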
[3] An OpenGL ES extension: http://www.khronos.org/registry/gles/extensions/EXT/EXT_shader_framebuffer_fetch.txt
[4] Reported here: http://www.theinquirer.net/inquirer/news/1046479/ati-r600-managepixels-clock


Finally, there has recently been a great deal of successful research into screen-space AA algorithms, which do not require any dedicated hardware [5].

This brings us to a more general discussion of dedicated versus programmable hardware. For decades, computing power has increased at a faster rate than memory bandwidth. It's easy to see why this will remain a universal truth as long as Moore's Law holds: with every halving of the semiconductor feature size, four times more logic can fit in the same area, but the chip's perimeter can only fit two times more wires. Furthermore, the pin count of a chip does not scale linearly with semiconductor technology. Hitting the inevitable memory wall has been staved off several times by adding more metal layers, by building a hierarchy of caches, and by increasing the effective frequency per pin. Going forward, other techniques will become essential to keep scaling the effective bandwidth. An extensive study [6] points out the best candidates, but probably the most interesting result is what won't work well: shrinking the core sizes. This is the least effective approach because when the majority of the die space is occupied by storage and communication structures to feed the execution logic, smaller execution logic only leads to a marginal increase in compute density. In essence, this is an argument against simple programmable cores and fixed-function logic in the long term. This same force is one of the drivers that has enabled graphics hardware to become programmable thus far, and it shows that even more programmability can be achieved in the future at a low cost. Eventually there won't be a significant advantage in using small GPU-like cores, and every core can instead have a more versatile CPU-like architecture.
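The scaling argument above can be made explicit. To first order, halving the feature size scales the two quantities differently:

\[
\text{on-die logic} \;\propto\; \frac{\text{die area}}{(\text{feature size})^2} \;\Rightarrow\; 4\times,
\qquad
\text{edge wiring} \;\propto\; \frac{\text{die perimeter}}{\text{feature size}} \;\Rightarrow\; 2\times,
\]

which is why compute throughput outpaces off-chip bandwidth with each process generation.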

Conclusion

TransGaming believes that an eventual convergence of CPU and GPU computing into a revolutionary unified architecture is inevitable. This merger will give developers and end users the best of both worlds: highly parallel programming environments that interface easily with scalar code, full confidence that end users will always see the right pixels on the screen, regardless of drivers or operating systems, and the limitless potential for innovation that comes with software-based approaches [7].

Some of this convergence is already apparent with upcoming new hardware such as Intel's Haswell processor. AMD's heterogeneous computing architecture is another proof point on this roadmap, integrating CPU and GPU elements into a single processor, controlled through dynamically generated code. Non-graphics domains are also seeing benefits from similar dynamic, software-based approaches; for example, NVidia's Tegra 4 processor includes a fully software-controlled radio.

TransGaming's SwiftShader technology offers a pioneering approach to software rendering, backed by powerful IP. SwiftShader is uniquely suited to providing TransGaming's customers with the ability to navigate the transition from today's mixed hardware through to future architectures that we can only speculate about. SwiftShader's dynamic code generation approach allows TransGaming to implement commonly used graphics APIs such as Direct3D 9 and OpenGL ES 2.0 on a variety of different contemporary systems, while paving the way towards a future where a graphics library is simply a set of building blocks that developers make use of on a piece-by-piece basis. Many challenges remain to be overcome for this vision of unified computing to become a reality. TransGaming aims to play a key role in meeting these challenges and in helping to deliver on the resulting innovations.
[5] See:
http://iryoku.com/aacourse/downloads/Filtering-Approaches-for-Real-TimeAnti-Aliasing.pdf
http://software.intel.com/en-us/articles/mlaa-efficiently-moving-antialiasing-from-the-gpu-to-the-cpu
[6] Rogers, et al.: http://www.ece.ncsu.edu/arpers/Papers/isca09-bwwall.pdf

[7] Some interesting ideas well suited to pure software approaches can be found here: http://graphics.cs.williams.edu/archive/SweeneyHPG2009/TimHPG2009.pdf



Copyright Notice
This document is © 2013 TransGaming Inc. All rights reserved.

Trademark Notice
SwiftShader, SwiftAsm, Reactor, and the SwiftShader logo are trademarks of TransGaming, Inc. in Canada and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

Disclaimer and Limitation of Liability
ALL DESIGN SPECIFICATIONS, DRAWINGS, PROGRAMS, SAMPLES, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS". TRANSGAMING MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. Information furnished is believed to be accurate and reliable. TransGaming, Inc. assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent or patent rights of TransGaming Inc. Specifications mentioned in this document are subject to change without notice. SwiftShader and Reactor technologies are not authorized for use in devices or systems without express written approval or license from TransGaming, Inc.

For More Information: http://transgaming.com/swiftshader

About TransGaming Inc.


TransGaming Inc. (TSX-V: TNG) is the global leader in developing and delivering platform-defining social video game experiences to consumers around the world. From engineering essential technologies for the world's leading companies, to engaging audiences with truly immersive interactive experiences, TransGaming fuels disruptive innovation across the entire spectrum of consumer technology. TransGaming's core businesses span the digital distribution of games for Smart TVs, next-generation set-top boxes, and the connected living room, as well as technology licensing for cross-platform game enablement, software 3D graphics rendering, and parallel computing. Website: http://transgaming.com

