Chapter 30
The GeForce 6 Series GPU Architecture
Emmett Kilgariff, NVIDIA Corporation
Randima Fernando, NVIDIA Corporation
    The previous chapter described how GPU architecture has changed as a result of compu-
    tational and communications trends in microprocessing. This chapter describes the archi-
    tecture of the GeForce 6 Series GPUs from NVIDIA, which owe their formidable
    computational power to their ability to take advantage of these trends. Most notably, we
    focus on the GeForce 6800 (NVIDIA’s flagship GPU at the time of writing, shown in
    Figure 30-1), which delivers hundreds of gigaflops of single-precision floating-point com-
    putation, as compared to approximately 12 gigaflops for current high-end CPUs. In this
    chapter—and throughout the book—references to GeForce 6 Series GPUs should be read
    to include the latest Quadro FX GPUs supporting Shader Model 3.0, which provide a
    superset of the functionality offered by the GeForce 6 Series. We start with a general
    overview of where the GPU fits into the overall computer system, and then we describe
    the architecture along with details of specific features and performance characteristics.
    30.1 How the GPU Fits into the Overall Computer System
    The CPU in a modern computer system communicates with the GPU through a graph-
    ics connector such as a PCI Express or AGP slot on the motherboard. Because the
    graphics connector is responsible for transferring all command, texture, and vertex data
    from the CPU to the GPU, the bus technology has evolved alongside GPUs over the
    past few years. The original AGP slot ran at 66 MHz and was 32 bits wide, giving a
    transfer rate of 264 MB/sec. AGP 2×, 4×, and 8× followed, each doubling the available
    bandwidth, until finally the PCI Express standard was introduced in 2004, with a maxi-
    mum theoretical bandwidth of 4 GB/sec simultaneously available to and from the GPU.
    (Your mileage may vary; currently available motherboard chipsets fall somewhat below
    this limit—around 3.2 GB/sec or less.) 
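    The quoted transfer rates follow directly from clock rate and bus width. The short Python sketch below (illustrative only; the function name and the simplification of each AGP generation as extra transfers per clock are ours) reproduces the arithmetic:

```python
def bus_bandwidth_mb_per_sec(clock_mhz, width_bits, transfers_per_clock=1):
    """Peak transfer rate in MB/sec: bytes per transfer x transfers per second."""
    return clock_mhz * (width_bits / 8) * transfers_per_clock

# Original AGP: 66 MHz, 32 bits wide -> 264 MB/sec, as quoted in the text.
agp_1x = bus_bandwidth_mb_per_sec(66, 32)       # 264.0
# Treating each later generation as doubling the transfers per clock:
agp_8x = bus_bandwidth_mb_per_sec(66, 32, 8)    # 2112.0, roughly the 2.1 GB/sec of AGP 8x
```

    The same arithmetic applied to a 16-lane PCI Express link yields the 4 GB/sec-per-direction figure above.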
    It is important to note the vast differences between the GPU’s memory interface band-
    width and bandwidth in other parts of the system, as shown in Table 30-1.
    Table 30-1. Available Memory Bandwidth in Different Parts of the Computer System

    Component                                        Bandwidth
    GPU Memory Interface                             35 GB/sec
    PCI Express Bus (×16)                            8 GB/sec
    CPU Memory Interface (800 MHz Front-Side Bus)    6.4 GB/sec
    Figure 30-1. The GeForce 6800 Microprocessor
    
    Table 30-1 reiterates some of the points made in the preceding chapter: there is a vast
    amount of bandwidth available internally on the GPU. Algorithms that run on the
    GPU can therefore take advantage of this bandwidth to achieve dramatic performance
    improvements. 
    30.2 Overall System Architecture
    The next two subsections go into detail about the architecture of the GeForce 6 Series
    GPUs. Section 30.2.1 describes the architecture in terms of its graphics capabilities.
    Section 30.2.2 describes the architecture with respect to the general computational capa-
    bilities that it provides. See Figure 30-2 for an illustration of the system architecture.
    30.2.1 Functional Block Diagram for Graphics Operations
    Figure 30-3 illustrates the major blocks in the GeForce 6 Series architecture. In this
    section, we take a trip through the graphics pipeline, starting with input arriving from
    the CPU and finishing with pixels being drawn to the frame buffer. 
    Figure 30-2. The Overall System Architecture of a PC
    
    First, commands, textures, and vertex data are received from the host CPU through
    shared buffers in system memory or local frame-buffer memory. A command stream is
    written by the CPU, which initializes and modifies state, sends rendering commands, and
    references the texture and vertex data. Commands are parsed, and a vertex fetch unit is
    used to read the vertices referenced by the rendering commands. The commands, vertices,
    and state changes flow downstream, where they are used by subsequent pipeline stages.
    Figure 30-3. A Block Diagram of the GeForce 6 Series Architecture

    The vertex processors (sometimes called “vertex shaders”), shown in Figure 30-4, allow
    a program to be applied to each vertex in the object, performing transformations,
    skinning, and any other per-vertex operation the user specifies. For the first time, a
    GPU—the GeForce 6 Series—allows vertex programs to fetch texture data. All operations
    are done in 32-bit floating-point (fp32) precision per component. The GeForce 6
    Series architecture supports scalable vertex-processing horsepower, allowing the same
    architecture to service multiple price/performance points. In other words, high-end
    models may have six vertex units, while low-end models may have two.
    Because vertex processors can perform texture accesses, the vertex engines are connected
    to the texture cache, which is shared with the fragment processors. In addition, there is
    a vertex cache that stores vertex data both before and after the vertex processor, reduc-
    ing fetch and computation requirements. This means that if a vertex index occurs twice
    in a draw call (for example, in a triangle strip), the entire vertex program doesn’t have to
    be rerun for the second instance of the vertex—the cached result is used instead. 
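    The effect of this post-transform vertex cache can be seen in a toy simulation. The cache size and the simple FIFO replacement policy below are illustrative assumptions, not the GeForce 6's actual parameters:

```python
from collections import OrderedDict

def count_vertex_shader_runs(indices, cache_size=16):
    """Count vertex-program executions for an indexed draw call,
    reusing cached results for indices seen recently."""
    cache = OrderedDict()   # vertex index -> shaded result (FIFO order)
    runs = 0
    for idx in indices:
        if idx in cache:
            continue        # cache hit: reuse result, no shader run
        runs += 1
        cache[idx] = True
        if len(cache) > cache_size:
            cache.popitem(last=False)   # evict the oldest entry
    return runs

# A 4-triangle strip over 6 vertices references 12 indices, but only
# 6 are unique -- the cache cuts the shader runs in half.
strip = [0, 1, 2, 1, 3, 2, 2, 3, 4, 3, 5, 4]
print(count_vertex_shader_runs(strip))   # 6 instead of 12
```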
    Vertices are then grouped into primitives, which are points, lines, or triangles. The
    Cull/Clip/Setup blocks perform per-primitive operations, removing primitives that
    aren’t visible at all, clipping primitives that intersect the view frustum, and performing
    edge and plane equation setup on the data in preparation for rasterization.
    Figure 30-4. The GeForce 6 Series Vertex Processor
    The rasterization block calculates which pixels (or samples, if multisampling is enabled)
    are covered by each primitive, and it uses the z-cull block to quickly discard pixels (or
    samples) that are occluded by objects with a nearer depth value. Think of a fragment as
    a “candidate pixel”: that is, it will pass through the fragment processor and several tests,
    and if it gets through all of them, it will end up carrying depth and color information
    to a pixel on the frame buffer (or render target).
    Figure 30-5 illustrates the fragment processor (sometimes called a “pixel shader”) and
    texel pipeline. The texture and fragment-processing units operate in concert to apply a
    shader program to each fragment independently. The GeForce 6 Series architecture
    supports a scalable amount of fragment-processing horsepower. Another popular way to
    say this is that GPUs in the GeForce 6 Series can have a varying number of fragment
    pipelines (or “pixel pipelines”). As with the vertex processor, texture data is cached
    on-chip to reduce bandwidth requirements and improve performance.
    The texture and fragment-processing unit operates on squares of four pixels (called
    quads) at a time, allowing for direct computation of derivatives for calculating texture
    level of detail. Furthermore, the fragment processor works on groups of hundreds of
    pixels at a time in single-instruction, multiple-data (SIMD) fashion (with each fragment
    processor engine working on one fragment concurrently), hiding the latency of texture
    fetch from the computational performance of the fragment processor.
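    Why quads make derivatives cheap can be seen in a small sketch: finite differences between neighboring fragments in a 2×2 quad approximate the texture-coordinate derivatives that drive mipmap level-of-detail selection. This is a simplified form of the standard LOD computation, not the hardware's exact formula:

```python
import math

def quad_lod(uv, tex_size):
    """uv: a 2x2 quad of (u, v) coordinates in [0, 1];
    tex_size: texture width/height in texels. Returns the mip LOD."""
    (u00, v00), (u10, v10) = uv[0]   # top row of the quad
    (u01, v01), _          = uv[1]   # bottom row of the quad
    # Finite differences across the quad, scaled to texel units:
    dudx, dvdx = (u10 - u00) * tex_size, (v10 - v00) * tex_size
    dudy, dvdy = (u01 - u00) * tex_size, (v01 - v00) * tex_size
    # Footprint size in texels; its log2 is the mip level.
    rho = max(math.hypot(dudx, dvdx), math.hypot(dudy, dvdy))
    return max(0.0, math.log2(rho))

# One texel per pixel -> LOD 0; four texels per pixel -> LOD 2.
near = [[(0.0, 0.0), (1/256, 0.0)], [(0.0, 1/256), (1/256, 1/256)]]
far  = [[(0.0, 0.0), (4/256, 0.0)], [(0.0, 4/256), (4/256, 4/256)]]
print(quad_lod(near, 256), quad_lod(far, 256))   # 0.0 2.0
```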
    Figure 30-5. The GeForce 6 Series Fragment Processor and Texel Pipeline
    The fragment processor uses the texture unit to fetch data from memory, optionally filter-
    ing the data before returning it to the fragment processor. The texture unit supports many
    source data formats (see Section 30.3.3, “Supported Data Storage Formats”). Data can be
    filtered using bilinear, trilinear, or anisotropic filtering. All data is returned to the fragment
    processor in fp32 or fp16 format. A texture can be viewed as a 2D or 3D array of data that
    can be read by the texture unit at arbitrary locations and filtered to reconstruct a continu-
    ous function. The GeForce 6 Series supports filtering of fp16 textures in hardware.
    The fragment processor has two fp32 shader units per pipeline, and fragments are
    routed through both shader units and the branch processor before recirculating through
    the entire pipeline to execute the next series of instructions. This rerouting happens once
    for each core clock cycle. Furthermore, the first fp32 shader can be used for perspective
    correction of texture coordinates when needed (by dividing by w), or for general-purpose
    multiply operations. In general, it is possible to perform eight or more math operations
    in the pixel shader during each clock cycle, or four math operations if a texture fetch
    occurs in the first shader unit.
    On the final pass through the pixel shader pipeline, the fog unit can be used to blend
    fog in fixed-point precision with no performance penalty. Fog blending happens often
    in conventional graphics applications and uses the following function:
    out = FogColor * fogFraction + SrcColor * (1 - fogFraction)
    This function can be made fast and small using fixed-precision math, but in general
    IEEE floating point, it requires two full multiply-adds to do effectively. Because fixed
    point is efficient and sufficient for fog, it exists in a separate small unit at the end of the
    shader. This is a good example of the trade-offs in providing flexible programmable
    hardware while still offering maximum performance for legacy applications.
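    As a rough illustration of why fixed point suffices here, the fog blend above can be computed entirely in 8-bit integer arithmetic. The rounding scheme below is a common convention, not necessarily the one the hardware uses:

```python
def fog_blend_fixed(fog_color, src_color, fog_fraction_u8):
    """Fog blend in 8-bit fixed point. Colors are (r, g, b) tuples in
    0..255; fog_fraction_u8 is fogFraction scaled to 0..255."""
    inv = 255 - fog_fraction_u8
    # (a*f + b*(255-f) + 127) // 255 rounds the weighted sum back to 8 bits
    return tuple((fc * fog_fraction_u8 + sc * inv + 127) // 255
                 for fc, sc in zip(fog_color, src_color))

# Half-strength grey fog over a red fragment:
print(fog_blend_fixed((128, 128, 128), (255, 0, 0), 128))   # (191, 64, 64)
```

    The same result in IEEE floating point would require two full multiply-adds per channel, which is why keeping a small fixed-point unit for this legacy operation is a win.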
    Fragments leave the fragment-processing unit in the order that they are rasterized and
    are sent to the z-compare and blend units, which perform depth testing (z comparison
    and update), stencil operations, alpha blending, and the final color write to the target
    surface (an off-screen render target or the frame buffer).
    The memory system is partitioned into up to four independent memory partitions,
    each with its own dynamic random-access memories (DRAMs). GPUs use standard
    DRAM modules rather than custom RAM technologies to take advantage of market
    economies and thereby reduce cost. Having smaller, independent memory partitions
    allows the memory subsystem to operate efficiently regardless of whether large or small
    blocks of data are transferred. All rendered surfaces are stored in the DRAMs, while
    textures and input data can be stored in the DRAMs or in system memory. The four
    independent memory partitions give the GPU a wide (256 bits), flexible memory sub-
    system, allowing for streaming of relatively small (32-byte) memory accesses at near the
    35 GB/sec physical limit.
    30.2.2 Functional Block Diagram for Non-Graphics Operations
    As graphics hardware becomes more and more programmable, applications unrelated to
    the standard polygon pipeline (as described in the preceding section) are starting to
    present themselves as candidates for execution on GPUs.
    Figure 30-6 shows a simplified view of the GeForce 6 Series architecture when used as a
    graphics pipeline. It contains a programmable vertex engine, a programmable fragment
    engine, a texture load/filter engine, and a depth-compare/blending data write engine.
    In this alternative view, a GPU can be seen as a large amount of programmable floating-
    point horsepower and memory bandwidth that can be exploited for compute-intensive
    applications completely unrelated to computer graphics.
    Figure 30-7 shows another way to view the GeForce 6 Series architecture. When used for
    non-graphics applications, it can be viewed as two programmable blocks that run serially:
    the vertex processor and the fragment processor, both with support for fp32 operands and
    intermediate values. Both use the texture unit as a random-access data fetch unit and access
    data at a phenomenal 35 GB/sec (550 MHz DDR memory clock × 256 bits per clock
    cycle × 2 transfers per clock cycle). In addition, both the vertex and the fragment processor
    are highly computationally capable. (Performance details follow in Section 30.4.)
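    The 35 GB/sec figure works out directly from those memory parameters (using decimal gigabytes):

```python
# Peak memory bandwidth = clock rate x bus width in bytes x transfers per clock.
clock_hz   = 550e6   # DDR memory clock
width_bits = 256     # combined width of the memory interface
ddr        = 2       # transfers per clock (double data rate)

bandwidth_gb_per_sec = clock_hz * (width_bits / 8) * ddr / 1e9
print(bandwidth_gb_per_sec)   # 35.2 -- the "35 GB/sec" quoted in the text
```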
    Figure 30-6. The GeForce 6 Series Architecture Viewed as a Graphics Pipeline
    The vertex processor operates on data, passing it either directly to the fragment processor
    or through the rasterizer, which expands the data into interpolated values. At this point, each
    triangle (or point) from the vertex processor has become one or more fragments.
    Before a fragment reaches the fragment processor, the z-cull unit compares the pixel’s
    depth with the values that already exist in the depth buffer. If the pixel’s depth is
    greater, the pixel will not be visible, and there is no point shading that fragment, so the
    fragment processor isn’t even executed. (This optimization happens only if it’s clear that
    the fragment processor isn’t going to modify the fragment’s depth.) Thinking in a
    general-purpose sense, this early culling feature makes it possible to quickly decide to
    skip work on specific fragments based on a scalar test. Chapter 34 of this book, “GPU
    Flow-Control Idioms,” explains how to take advantage of this feature to efficiently
    predicate work for general-purpose computations. 
    After the fragment processor runs on a potential pixel (still a “fragment” because it has
    not yet reached the frame buffer), the fragment must pass a number of tests in order to
    move farther down the pipeline. (There may also be more than one fragment that
    comes out of the fragment processor if multiple render targets [MRTs] are being used.
    Up to four MRTs can be used to write out large amounts of data—up to 16 scalar
    floating-point values at a time, for example—plus depth.)
    First, the scissor test rejects the fragment if it lies outside a specified subrectangle of the
    frame buffer. Although the popular graphics APIs define scissoring at this location in the
    pipeline, it is more efficient to perform the scissor test in the rasterizer. Scissoring in x and
    y actually happens in the rasterizer, before fragment processing, and z scissoring happens
    during z-cull. This avoids all fragment-processor work on scissored (rejected) pixels. Scissoring
    is rarely useful for general-purpose computation because general-purpose programmers
    typically draw rectangles to perform computations in the first place.

    Figure 30-7. The GeForce 6 Series Architecture for Non-Graphics Applications
    Next, the fragment’s depth is compared with the depth in the frame buffer. If the depth
    test passes, the fragment moves on in the pipeline. Optionally, the depth value in the
    frame buffer can be replaced at this stage.
    After this, the fragment can optionally test and modify what is known as the stencil
    buffer, which stores an integer value per pixel. The stencil buffer was originally
    intended to allow programmers to mask off certain pixels (for example, to restrict draw-
    ing to a cockpit’s windshield), but it has found other uses as a way to count values by
    incrementing or decrementing the existing value. This feature is used for stencil shadow
    volumes, for example.
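    The counting behavior can be sketched in a few lines; the +1/−1 events stand in for front-face and back-face stencil operations at one pixel, and the 8-bit wraparound matches a typical stencil buffer:

```python
def stencil_count(events):
    """events: +1 (increment, e.g. front face) or -1 (decrement, e.g.
    back face) stencil operations hitting one pixel, in order."""
    count = 0
    for e in events:
        count = (count + e) % 256   # an 8-bit stencil value wraps around
    return count

# A pixel inside a shadow volume sees its front face but not its back face:
print(stencil_count([+1]))               # 1 -> in shadow
# A pixel behind the volume sees both faces of each volume in front of it:
print(stencil_count([+1, +1, -1, -1]))   # 0 -> lit
```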
    If the fragment passes the depth and stencil tests, it can then optionally modify the
    contents of the frame buffer using the blend function. A blend function can be
    described as
    out = src * srcOp + dst * dstOp
    where src is the fragment color flowing down the pipeline; dst is the color value
    in the frame buffer; and srcOp and dstOp can be specified to be constants, source
    color components, or destination color components. Full blend functionality is supported
    for all pixel formats up to fp16 × 4. However, fp32 frame buffers don’t support
    blending—only updating the buffer is allowed.
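    The blend equation above can be sketched as a small interpreter over pluggable factor functions. The factor names echo common graphics-API conventions and are illustrative, not tied to any specific API:

```python
def blend(src, dst, src_op, dst_op):
    """src/dst are (r, g, b, a) colors in 0..1; src_op/dst_op map
    (src, dst) to a per-channel scalar factor, as in out = src*srcOp + dst*dstOp."""
    fs, fd = src_op(src, dst), dst_op(src, dst)
    return tuple(s * fs + d * fd for s, d in zip(src, dst))

# A few factor functions in the spirit of graphics-API blend enumerants:
ONE           = lambda src, dst: 1.0
SRC_ALPHA     = lambda src, dst: src[3]
INV_SRC_ALPHA = lambda src, dst: 1.0 - src[3]

# Classic "over" compositing: src*alpha + dst*(1 - alpha).
result = blend((1.0, 0.0, 0.0, 0.5),   # half-transparent red fragment
               (0.0, 0.0, 1.0, 1.0),   # opaque blue frame-buffer value
               SRC_ALPHA, INV_SRC_ALPHA)
print(result)   # (0.5, 0.0, 0.5, 0.75)
```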
    Finally, a feature called occlusion query makes it possible to quickly determine if any of
    the fragments that would be rendered in a particular computation would cause results
    to be written to the frame buffer. (Recall that fragments that do not pass the z-test don’t
    have any effect on the values in the frame buffer.) Traditionally, the occlusion query test
    is used to allow graphics applications to avoid making draw calls for occluded objects,
    but it is useful for GPGPU applications as well. For instance, if the depth test is used to
    determine which outputs need to be updated in a sparse array, updating depth can be
    used to indicate when a given output has converged and no further work is needed. In
    this case, occlusion query can be used to tell when all output calculations are done. See
    Chapter 34 of this book, “GPU Flow-Control Idioms,” for further information about
    this idea.
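    The convergence idiom just described can be mimicked on the CPU: treat the depth test as a per-element "still converging" predicate, and the occlusion query as a count of surviving fragments. This is a pure-Python stand-in for the GPU mechanism, with made-up function names:

```python
def iterate_until_converged(values, step, tol=1e-6, max_passes=100):
    """Repeatedly apply step() to each element, updating only those that
    are still changing; stop when the 'occlusion query' count hits zero.
    Returns the number of passes taken."""
    for n in range(max_passes):
        survivors = 0                   # what the occlusion query would report
        for i, v in enumerate(values):
            new = step(v)
            if abs(new - v) > tol:      # the "depth test": still converging
                values[i] = new
                survivors += 1
        if survivors == 0:              # query returned zero: all outputs done
            return n + 1
    return max_passes

# Newton iteration for sqrt(2), run on several starting points at once:
vals = [1.0, 2.0, 10.0]
passes = iterate_until_converged(vals, lambda x: 0.5 * (x + 2.0 / x))
print(passes, vals)   # all values converge to ~1.41421
```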
    Excerpted from GPU Gems 2
    Copyright 2005 by NVIDIA Corporation  
    						