
Nvidia Geforce 6 Series Manual


    							
    30.3 GPU Features
    This section covers both fixed-function features and Shader Model 3.0 support
    (described in detail later) in GeForce 6 Series GPUs. As we describe the various pieces, we
    focus on the many new features that are meant to make applications shine (in terms of
    both visual quality and performance) on GeForce 6 Series GPUs.
    30.3.1 Fixed-Function Features
    Geometry Instancing
    With Shader Model 3.0, the capability for sending multiple batches of geometry with
    one Direct3D call has been added, greatly reducing driver overhead in these cases. The
    hardware feature that enables instancing is vertex stream frequency—the ability to read
    vertex attributes at a frequency less than once every output vertex, or to loop over a
    subset of vertices multiple times. Instancing is most useful when the same object is
    drawn multiple times with different positions, for example, when rendering an army of
    soldiers or a field of grass.
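One way to picture vertex stream frequency is as two nested loops over two attribute streams, one advancing per vertex and one advancing per instance. The sketch below is a CPU-side model in Python, not Direct3D code; `fetch_instanced` and its parameters are hypothetical names chosen for illustration.

```python
def fetch_instanced(mesh_vertices, instance_offsets):
    """Emulate one instanced draw: replay the mesh once per instance.

    mesh_vertices    -- per-vertex positions, read once per output vertex
    instance_offsets -- per-instance positions, read once per instance
    """
    output = []
    for ox, oy, oz in instance_offsets:      # advances once per instance
        for x, y, z in mesh_vertices:        # same vertices looped over again
            output.append((x + ox, y + oy, z + oz))
    return output


# One triangle placed at three positions by a single "draw call".
triangle = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
offsets = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
soldiers = fetch_instanced(triangle, offsets)
```

The application submits the mesh and the offsets once; the repetition happens in the attribute fetch rather than in repeated API calls, which is where the driver overhead is saved.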
    Early Culling/Clipping
    GeForce 6 Series GPUs are able to cull nonvisible primitives before shading at a high
    rate and clip partially visible primitives at full speed. Previous NVIDIA products would
    cull nonvisible primitives at primitive-setup rates, and clip all partially visible primitives
    at full speed.
    Rasterization
    Like previous NVIDIA products, GeForce 6 Series GPUs are capable of rendering the
    following objects:
    ●Point sprites
    ●Aliased and antialiased lines
    ●Aliased and antialiased triangles
    Multisample antialiasing is also supported, allowing accurate antialiased polygon
    rendering. Multisample antialiasing supports all rasterization primitives. Multisampling is
    supported in previous NVIDIA products, though the 4× multisample pattern was
    improved for GeForce 6 Series GPUs.
    
    430_gems2_ch30_new.qxp  1/31/2005  6:57 PM  Page 481
    Excerpted from GPU Gems 2
    Copyright 2005 by NVIDIA Corporation  
    						
    							
    Z-Cull
    NVIDIA GPUs since GeForce3 have technology, called z-cull, that allows hidden-surface
    removal at speeds much faster than conventional rendering. The GeForce 6 Series
    z-cull unit is the third generation of this technology, which has increased efficiency for
    a wider range of cases. Also, in cases where stencil is not being updated, early stencil
    reject can be employed to discard fragments early when the stencil test (based on an
    equals comparison) fails.
    Occlusion Query
    Occlusion query is the ability to collect statistics on how many fragments passed or
    failed the depth test and to report the result back to the host CPU. Occlusion query
    can be used either while rendering objects or with color and z-write masks turned off,
    returning depth test status for the objects that would have been rendered, without
    modifying the contents of the frame buffer. This feature has been available since the
    GeForce3 was introduced.
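The query can be modeled as a read-only pass over the depth buffer. The following is a minimal sketch, assuming a less-than depth test and a far-plane clear value of 1.0; `occlusion_query` is a hypothetical helper, not a graphics API call.

```python
def occlusion_query(depth_buffer, fragments):
    """Count fragments that would pass a less-than depth test.

    depth_buffer -- dict mapping (x, y) to the stored depth
    fragments    -- iterable of (x, y, z) fragments from a proxy object
    The depth buffer is only read, mirroring a query issued with color
    and z writes masked off.
    """
    passed = 0
    for x, y, z in fragments:
        if z < depth_buffer.get((x, y), 1.0):   # 1.0 = cleared far plane (assumed)
            passed += 1
    return passed


depth = {(0, 0): 0.5, (1, 0): 0.2}
bounding_box = [(0, 0, 0.4), (1, 0, 0.6), (2, 0, 0.9)]
visible = occlusion_query(depth, bounding_box)
```

An application could draw a cheap bounding volume this way and skip the full object when the returned count is zero.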
    Texturing
    Like previous GPUs, GeForce 6 Series GPUs support bilinear, trilinear, and anisotropic
    filtering on 2D and cube-map textures of various formats. Three-dimensional textures
    support bilinear, trilinear, and quad-linear filtering, with and without mipmapping.
    Here are the new texturing features on GeForce 6 Series GPUs:
    ●Support for all texture types (2D, cube map, 3D) with fp16×2, fp16×4, fp32×1,
    fp32×2, and fp32×4 formats
    ●Support for all filtering modes on fp16×2 and fp16×4 texture formats
    ●Extended support for non-power-of-two textures to match support for power-of-two
    textures, specifically:
    – Mipmapping
    – Wrapping and clamping
    – Cube map and 3D textures
    Shadow Buffer Support
    NVIDIA GPUs support shadow buffering directly. The application first renders the
    scene from the light source into a separate z-buffer. Then during the lighting phase, it
    fetches the shadow buffer as a projective texture and performs z-compares of the
    shadow buffer data against a value corresponding to the distance from the light. If the
    distance passes the test, it’s in light; if not, it’s in shadow. NVIDIA GPUs have dedicated
    transistors to perform four z-compares per pixel (on four neighboring z-values)
    per clock, and to perform bilinear filtering of the pass/fail data. This more advanced
    variation of percentage-closer filtering saves many shader instructions compared to
    GPUs that don’t have direct shadow buffer support.
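The compare-then-filter order described above can be sketched in software. This is an illustrative model only, assuming clamp-to-edge addressing and texel centers at integer coordinates; `shadow_lookup` is a hypothetical name.

```python
import math


def shadow_lookup(shadow_map, u, v, light_depth):
    """Bilinearly filter the pass/fail results of four depth compares.

    shadow_map  -- 2D list of depths rendered from the light
    u, v        -- texel-space coordinates (texel centers at integers)
    light_depth -- the fragment's distance from the light
    Returns the lit fraction in [0, 1].
    """
    x0, y0 = math.floor(u), math.floor(v)
    fx, fy = u - x0, v - y0
    h, w = len(shadow_map), len(shadow_map[0])

    def lit(x, y):
        # Clamp to the edge, then compare: 1.0 if the fragment is not occluded.
        x = min(max(x, 0), w - 1)
        y = min(max(y, 0), h - 1)
        return 1.0 if light_depth <= shadow_map[y][x] else 0.0

    # Filter the comparison results, not the depths themselves.
    top = lit(x0, y0) * (1 - fx) + lit(x0 + 1, y0) * fx
    bottom = lit(x0, y0 + 1) * (1 - fx) + lit(x0 + 1, y0 + 1) * fx
    return top * (1 - fy) + bottom * fy


shadow_map = [[0.3, 0.9], [0.3, 0.9]]
penumbra = shadow_lookup(shadow_map, 0.5, 0.5, 0.5)
```

Halfway between an occluding column and a non-occluding column the result is a partial shadow value rather than a hard 0 or 1. In a shader without direct shadow buffer support, the four compares and the three interpolations would each cost instructions.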
    High-Dynamic-Range Blending Using fp16 Surfaces, Texture Filtering,
    and Blending
    GeForce 6 Series GPUs allow for fp16×4 (four components, each represented by a
    16-bit float) filtered textures in the pixel shaders; they also allow performing all
    alpha-blending operations on fp16×4 filtered surfaces. This permits intermediate rendered
    buffers at a much higher precision and range, enabling high-dynamic-range rendering,
    motion blur, and many other effects. In addition, it is possible to specify a separate
    blending function for color and alpha values. (The lowest-end member of the GeForce
    6 Series family, the GeForce 6200 TC, does not support floating-point blending or
    floating-point texture filtering because of its lower memory bandwidth, as well as to
    save area on the chip.)
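To see why a floating-point render target matters for blending, the sketch below accumulates the same light contributions into an 8-bit fixed-point channel and into an fp16 channel. It assumes IEEE half precision (Python's `struct` format `"e"`) is an adequate stand-in for the hardware fp16 format at these values; the function names are hypothetical.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_unorm8(x):
    """Clamp to [0, 1] and quantize as an 8-bit fixed-point channel would."""
    return round(min(max(x, 0.0), 1.0) * 255) / 255


def additive_blend(dst, src, store):
    """Additive blend (dst + src) followed by a write in the target format."""
    return store(dst + src)


# Accumulate three bright light contributions into one channel.
contributions = [0.75, 0.75, 0.75]
ldr = hdr = 0.0
for c in contributions:
    ldr = additive_blend(ldr, c, to_unorm8)   # saturates at 1.0
    hdr = additive_blend(hdr, c, to_fp16)     # keeps values above 1.0
```

The 8-bit channel saturates after the second contribution, while the fp16 channel retains the full sum for a later tone-mapping pass.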
    30.3.2 Shader Model 3.0 Programming Model
    Along with the fixed-function features listed previously, the capabilities of the vertex
    and the fragment processors have been enhanced in GeForce 6 Series GPUs. With
    Shader Model 3.0, the programming models for vertex and fragment processors are
    converging: both support fp32 precision, texture lookups, and the same instruction set.
    Specifically, here are the new features that have been added.
    Vertex Processor
    ●Increased instruction count. The total instruction count is now 512 static instructions
    and 65,536 dynamic instructions. The static instruction count represents the number
    of instructions in a program as it is compiled. The dynamic instruction count represents
    the number of instructions actually executed. In practice, the dynamic count can
    be much higher than the static count due to looping and subroutine calls.
    ●More temporary registers. Up to 32 four-wide temporary registers can be used in a
    vertex program.
    ●Support for instancing. This enhancement was described earlier.
    ●Dynamic flow control. Branching and looping are now part of the shader model. On
    the GeForce 6 Series vertex engine, branching and looping have a minimal overhead of
    just two cycles. Also, each vertex can take its own branches without being grouped in
    the way pixel shader branches are. So as branches diverge, the GeForce 6 Series vertex
    processor still operates efficiently.
    ●Vertex texturing. Textures can now be fetched in a vertex program, although only
    nearest-neighbor filtering is supported in hardware. More advanced filters can of
    course be implemented in the vertex program. Up to four unique textures can be
    accessed in a vertex program, although each texture can be accessed multiple times.
    Vertex textures generate latency for fetching data, unlike true constant reads. Therefore,
    the best way to use vertex textures is to do a texture fetch and follow it with
    arithmetic operations to hide the latency before using the result of the texture fetch.
    Each vertex engine is capable of simultaneously performing a four-wide SIMD MAD
    (multiply-add) instruction and a scalar special function per clock cycle. Special function
    instructions include:
    ●Exponential functions: EXP, EXPP, LIT, LOG, LOGP
    ●Reciprocal instructions: RCP, RSQ
    ●Trigonometric functions: SIN, COS
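As the vertex texturing item notes, filtering beyond nearest-neighbor has to be written in the vertex program itself. A minimal sketch of that idea, modeled in Python rather than shader code and assuming clamp-to-edge addressing (`fetch_nearest` and `fetch_bilinear` are hypothetical names):

```python
import math


def fetch_nearest(texture, x, y):
    """Point-sampled fetch with clamp-to-edge addressing (the hardware path)."""
    h, w = len(texture), len(texture[0])
    return texture[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]


def fetch_bilinear(texture, u, v):
    """Bilinear filter built from four nearest-neighbor fetches."""
    x0, y0 = math.floor(u), math.floor(v)
    fx, fy = u - x0, v - y0
    # Issue all four fetches up front so that, on real hardware, other
    # arithmetic could be scheduled while the fetches are in flight.
    t00 = fetch_nearest(texture, x0, y0)
    t10 = fetch_nearest(texture, x0 + 1, y0)
    t01 = fetch_nearest(texture, x0, y0 + 1)
    t11 = fetch_nearest(texture, x0 + 1, y0 + 1)
    top = t00 + (t10 - t00) * fx
    bottom = t01 + (t11 - t01) * fx
    return top + (bottom - top) * fy


# Example: sampling a displacement (height) map between texel centers.
heightmap = [[0.0, 1.0], [2.0, 3.0]]
height = fetch_bilinear(heightmap, 0.5, 0.5)
```

All four fetches hit the same texture, so this stays within the limit of four unique textures per vertex program.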
    Fragment Processor
    ●Increased instruction count. The total instruction count is now 65,535 static
    instructions and 65,535 dynamic instructions. There are limitations on how long the
    operating system will wait while the shader finishes working, so a long shader program
    working on a full screen of pixels may time out. This makes it important to
    carefully consider the shader length and number of fragments rendered in one draw
    call. In practice, the number of instructions exposed by the driver tends to be smaller,
    because the number of instructions can expand as code is translated from Direct3D
    pixel shaders or OpenGL fragment programs to native hardware instructions.
    ●Multiple render targets. The fragment processor can output to up to four separate
    color buffers, along with a depth value. All four separate color buffers must be the
    same format and size. MRTs can be particularly useful when operating on scalar data,
    because up to 16 scalar values can be written out in a single pass by the fragment
    processor. Sample uses of MRTs include particle physics, where positions and velocities
    are computed simultaneously, and similar GPGPU algorithms. Deferred shading
    is another technique that computes and stores multiple four-component floating-
    point values simultaneously: it computes all material properties and stores them in
    separate textures. So, for example, the surface normal and the diffuse and specular
    material properties could be written to textures, and the textures could all be used in
    subsequent passes when lighting the scene with multiple lights. This is illustrated in
    Figure 30-8.
    ●Dynamic flow control (branching). Shader Model 3.0 supports conditional branching
    and looping, allowing for more flexible shader programs.
    ●Indexing of attributes. With Shader Model 3.0, an index register can be used to
    select which attributes to process, allowing for loops to perform the same operation
    on many different inputs.
    ●Up to ten full-function attributes. Shader Model 3.0 supports ten full-function
    attributes/texture coordinates, instead of Shader Model 2.0’s eight full-function
    attributes plus specular color and diffuse color. All ten Shader Model 3.0 attributes are
    interpolated at full fp32 precision, whereas Shader Model 2.0’s diffuse and specular
    color were interpolated at only 8-bit integer precision.
    ●Centroid sampling. Shader Model 3.0 allows a per-attribute selection of center
    sampling or centroid sampling. Centroid sampling returns a value inside the covered
    portion of the fragment, instead of at the center, and when used with multisampling, it
    can remove some artifacts associated with sampling outside the polygon (for example,
    when calculating diffuse or specular color using texture coordinates, or when using
    texture atlases).
    ●Support for fp32 and fp16 internal precision. Fragment programs can support full
    fp32-precision computations and intermediate storage or partial-precision fp16
    computations and intermediate storage.
    Figure 30-8. How MRTs Work
    MRTs make it possible for a fragment program to return four four-wide color values plus a depth value.
    
    ●3:1 and 2:2 coissue. Each four-component-wide vector unit is capable of executing
    two independent instructions in parallel, as shown in Figure 30-9: either one three-wide
    operation on RGB and a separate operation on alpha, or one two-wide operation
    on red-green and a separate two-wide operation on blue-alpha. This gives the
    compiler more opportunity to pack scalar computations into vectors, thereby doing
    more work in a shorter time.
    ●Dual issue. Dual issue is similar to coissue, except that the two independent instructions
    can be executed on different parts of the shader pipeline. This makes the
    pipeline easier to schedule and, therefore, more efficient. See Figure 30-10.
    
    Figure 30-9. How Coissue Works
    Two separate operations can concurrently execute on different parts of a four-wide register.
    Figure 30-10. How Dual Issue Works
    Independent instructions can be executed on independent units in the computational pipeline.
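The multiple-render-target item in the list above mentions particle physics as a use case. A rough model of that pattern: one fragment program invocation returns several four-wide outputs, one per color buffer. The function and the simple Euler integrator are illustrative assumptions, not code from the chapter.

```python
def particle_step(position, velocity, dt, gravity=(0.0, -9.8, 0.0)):
    """One 'fragment program' invocation writing two render targets.

    Returns a tuple of four-wide outputs, one per color buffer; only two
    of the four available targets are used here.
    """
    new_velocity = tuple(v + g * dt for v, g in zip(velocity, gravity))
    new_position = tuple(p + v * dt for p, v in zip(position, new_velocity))
    target0 = new_position + (1.0,)   # xyz position, w unused
    target1 = new_velocity + (0.0,)   # xyz velocity, w unused
    return target0, target1


outputs = particle_step((0.0, 10.0, 0.0), (1.0, 0.0, 0.0), dt=0.5)
```

Without MRTs the same update would need two passes, each recomputing the shared intermediate values before writing a single buffer.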
    Fragment Processor Performance
    The GeForce 6 Series fragment processor architecture has the following performance
    characteristics:
    ●Each pipeline is capable of performing a four-wide, coissue-able multiply-add (MAD)
    or four-term dot product (DP4), plus a four-wide, coissue-able and dual-issuable
    multiply instruction per clock in series, as shown in Figure 30-11. In addition, a
    multifunction unit that performs complex operations can replace the alpha channel
    MAD operation. Operations are performed at full speed on both fp32 and fp16 data,
    although storage and bandwidth limitations can favor fp16 performance sometimes.
    In practice, it is sometimes possible to execute eight math operations and a texture
    lookup in a single cycle.
    ●Dedicated fp16 normalization hardware exists, making it possible to normalize a
    vector at fp16 precision in parallel with the multiplies and MADs just described.
    ●An independent reciprocal operation can be performed in parallel with the multiply,
    MAD, and fp16 normalization described previously.
    ●Because the GeForce 6800 has 16 fragment-processing pipelines, the overall available
    performance of the system is given by these values multiplied by 16 and then by the
    clock rate.
    ●There is some overhead to flow-control operations, as defined in Table 30-2.
    Figure 30-11. Shader Units and Capabilities in the Fragment Processor
    
    Table 30-2. Overhead Incurred When Executing Flow-Control Operations in Fragment Programs
    Instruction | Cost (Cycles)
    If/endif | 4
    If/else/endif | 6
    Call | 2
    Ret | 2
    Loop/endloop | 4
    Furthermore, branching in the fragment processor is affected by the level of divergence
    of the branches. Because the fragment processor operates on hundreds of pixels per
    instruction, if a branch is taken by some fragments and not others, all fragments
    execute both branches, but each fragment writes to registers only on the branch it is
    supposed to take. For low-frequency and mid-frequency branch changes, this effect is
    hidden, although it can become a limiter as the branch frequency increases.
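The cost of divergence can be illustrated with a simplified lockstep model. It assumes the if/else/endif overhead from Table 30-2 is simply added to the instruction cycles of whichever branches the group must step through; real scheduling is more involved, and `run_group` is a hypothetical name.

```python
def run_group(fragments, condition, then_cost, else_cost, overhead=6):
    """Cycles for one SIMD group executing an if/else/endif.

    Every fragment in the group steps through a branch if at least one
    fragment takes it; register writes are masked for the others.
    overhead defaults to the if/else/endif cost from Table 30-2.
    """
    taken = [condition(f) for f in fragments]
    cycles = overhead
    if any(taken):          # someone needs the 'then' side
        cycles += then_cost
    if not all(taken):      # someone needs the 'else' side
        cycles += else_cost
    return cycles


def in_shadow(depth):
    return depth > 0.5


coherent = run_group([0.9, 0.8, 0.7], in_shadow, then_cost=20, else_cost=5)
divergent = run_group([0.9, 0.1, 0.7], in_shadow, then_cost=20, else_cost=5)
```

Under this model a coherent group pays for one side only, while a single disagreeing fragment makes the whole group pay for both sides.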
    30.3.3 Supported Data Storage Formats
    Table 30-3 summarizes the data formats supported by the graphics pipeline.
    30.4 Performance
    The GeForce 6800 Ultra is the flagship product of the GeForce 6 Series family at the
    time of writing. Its performance is summarized as follows:
    ●425 MHz internal graphics clock
    ●550 MHz memory clock
    ●600 million vertices/second
    ●6.4 billion texels/second
    ●12.8 billion pixels/second, rendering z/stencil-only (useful for shadow volumes and
    shadow buffers)
    ●6 four-wide fp32 vector MADs per clock cycle in the vertex shader, plus one scalar
    multifunction operation (a complex math operation, such as a sine or reciprocal square root)
    ●16 four-wide fp32 vector MADs per clock cycle in the fragment processor, plus 16
    four-wide fp32 multiplies per clock cycle
    ●64 pixels per clock cycle early z-cull (reject rate)
    As you can see, there’s plenty of programmable floating-point horsepower in the vertex
    and fragment processors that can be exploited for computationally demanding problems.
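The per-clock figures above can be turned into peak arithmetic rates with a little multiplication. The sketch below assumes the common convention that one MAD counts as two floating-point operations, and it ignores the scalar multifunction units and texture work; the helper name is hypothetical.

```python
CLOCK_HZ = 425e6          # internal graphics clock from the list above
VECTOR_WIDTH = 4          # each operation is four-wide


def gflops(units_per_clock, flops_per_lane):
    """Peak rate in GFLOPS for four-wide units issuing every clock."""
    return units_per_clock * VECTOR_WIDTH * flops_per_lane * CLOCK_HZ / 1e9


vertex_mad = gflops(6, 2)       # MAD = multiply + add = 2 operations (assumed)
fragment_mad = gflops(16, 2)
fragment_mul = gflops(16, 1)
fragment_total = fragment_mad + fragment_mul
```

Under these assumptions the vertex MADs come to about 20.4 GFLOPS and the fragment MADs plus multiplies to about 81.6 GFLOPS. These are upper bounds on vector arithmetic, not measured throughput.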
    Table 30-3. Data Storage Formats Supported by GeForce 6 Series GPUs
    Format | Description of Data in Memory | Vertex Texture Support | Fragment Texture Support | Render Target Support
    B8 | One 8-bit fixed-point number | ✗ | ✓ | ✓
    A1R5G5B5 | A 1-bit value and three 5-bit unsigned fixed-point numbers | ✗ | ✓ | ✓
    A4R4G4B4 | Four 4-bit unsigned fixed-point numbers | ✗ | ✓ | ✗
    R5G6B5 | 5-bit, 6-bit, and 5-bit fixed-point numbers | ✗ | ✓ | ✓
    A8R8G8B8 | Four 8-bit fixed-point numbers | ✗ | ✓ | ✓
    DXT1 | Compressed 4×4 pixels into 8 bytes | ✗ | ✓ | ✗
    DXT2,3,4,5 | Compressed 4×4 pixels into 16 bytes | ✗ | ✓ | ✗
    G8B8 | Two 8-bit fixed-point numbers | ✗ | ✓ | ✓
    B8R8_G8R8 | Compressed as YVYU; two pixels in 32 bits | ✗ | ✓ | ✗
    R8B8_R8G8 | Compressed as VYUY; two pixels in 32 bits | ✗ | ✓ | ✗
    R6G5B5 | 6-bit, 5-bit, and 5-bit unsigned fixed-point numbers | ✗ | ✓ | ✗
    DEPTH24_D8 | A 24-bit unsigned fixed-point number and 8 bits of garbage | ✗ | ✓ | ✓
    DEPTH24_D8_FLOAT | A 24-bit unsigned float and 8 bits of garbage | ✗ | ✓ | ✓
    DEPTH16 | A 16-bit unsigned fixed-point number | ✗ | ✓ | ✓
    DEPTH16_FLOAT | A 16-bit unsigned float | ✗ | ✓ | ✓
    X16 | A 16-bit fixed-point number | ✗ | ✓ | ✗
    Y16_X16 | Two 16-bit fixed-point numbers | ✗ | ✓ | ✗
    R5G5B5A1 | Three unsigned 5-bit fixed-point numbers and a 1-bit value | ✗ | ✓ | ✓
    HILO8 | Two unsigned 16-bit values compressed into two 8-bit values | ✗ | ✓ | ✗
    HILO_S8 | Two signed 16-bit values compressed into two 8-bit values | ✗ | ✓ | ✗
    W16_Z16_Y16_X16 FLOAT | Four fp16 values | ✗ | ✓ | ✓
    W32_Z32_Y32_X32 FLOAT | Four fp32 values | ✓ (unfiltered) | ✓ (unfiltered) | ✓
    X32_FLOAT | One 32-bit floating-point number | ✓ (unfiltered) | ✓ (unfiltered) | ✓
    D1R5G5B5 | 1 bit of garbage and three unsigned 5-bit fixed-point numbers | ✗ | ✓ | ✓
    D8R8G8B8 | 8 bits of garbage and three unsigned 8-bit fixed-point numbers | ✗ | ✓ | ✓
    Y16_X16 FLOAT | Two 16-bit floating-point numbers | ✗ | ✓ | ✗
    ✓ = Yes ✗ = No
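To make the fixed-point descriptions in Table 30-3 concrete, here is a small sketch of packing and unpacking the R5G6B5 format. The bit order (red in the most significant bits) is an assumption based on the usual Direct3D convention for this format name; the table itself does not specify it, and the function names are hypothetical.

```python
def _quantize(value, bits):
    """Clamp a float to [0, 1] and scale it to an unsigned integer of 'bits' bits."""
    levels = (1 << bits) - 1
    return round(min(max(value, 0.0), 1.0) * levels)


def pack_r5g6b5(r, g, b):
    """Quantize three [0, 1] floats into one 16-bit R5G6B5 word."""
    # Assumed layout: rrrrrggg gggbbbbb (red in the high bits).
    return (_quantize(r, 5) << 11) | (_quantize(g, 6) << 5) | _quantize(b, 5)


def unpack_r5g6b5(word):
    """Recover approximate [0, 1] floats from a 16-bit R5G6B5 word."""
    r5 = (word >> 11) & 0x1F
    g6 = (word >> 5) & 0x3F
    b5 = word & 0x1F
    return r5 / 31, g6 / 63, b5 / 31


pure_red = pack_r5g6b5(1.0, 0.0, 0.0)
white = unpack_r5g6b5(0xFFFF)
```

Green gets the extra bit, so it has 64 levels where red and blue have 32.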
    
    30.5 Achieving Optimal Performance
    While graphics hardware is becoming more and more programmable, there are still
    some tricks to ensuring that you exploit the hardware fully to get the most performance.
    This section lists some common techniques that you may find helpful. A more
    detailed discussion of performance advice is available in the NVIDIA GPU Programming
    Guide, which is freely available in several languages from the NVIDIA Developer
    Web site (http://developer.nvidia.com/object/gpu_programming_guide.html).
    30.5.1 Use Z-Culling Aggressively
    Z-cull avoids work that won’t contribute to the final result. It’s better to determine early
    on that a computation doesn’t matter and save doing the work. In graphics, this can be
    done by rendering the z-values for all objects first, before shading. For general-purpose
    computation, the z-cull unit can be used to select which parts of the computation are
    still active, culling computational threads that have already resolved. See Section 34.2.3
    of Chapter 34, “GPU Flow-Control Idioms,” for more details on this idea.
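The benefit of laying down depth first can be estimated by counting how many fragments reach the shading stage with and without a z-only pre-pass. This is a simplified per-pixel model, assuming a less-than test in the single-pass case and an equal-depth test in the second pass of the pre-pass case; `count_shaded` is a hypothetical helper.

```python
def count_shaded(fragments, depth_prepass):
    """Count how many fragments reach the (expensive) shading stage.

    fragments     -- list of (pixel, z) in submission order
    depth_prepass -- if True, lay down final depths first, then shade only
                     fragments whose depth matches the stored value
    """
    if depth_prepass:
        nearest = {}
        for pixel, z in fragments:            # pass 1: z only, no shading
            nearest[pixel] = min(z, nearest.get(pixel, float("inf")))
        # Pass 2: everything behind the stored depth is culled before shading.
        return sum(1 for pixel, z in fragments if z == nearest[pixel])

    depth, shaded = {}, 0
    for pixel, z in fragments:                # single pass: shade if nearest so far
        if z < depth.get(pixel, float("inf")):
            depth[pixel] = z
            shaded += 1
    return shaded


# Worst case for a single pass: surfaces arrive back to front.
frags = [((0, 0), 0.9), ((0, 0), 0.5), ((0, 0), 0.1), ((1, 0), 0.3)]
single_pass = count_shaded(frags, depth_prepass=False)
with_prepass = count_shaded(frags, depth_prepass=True)
```

In the back-to-front case every overlapping surface is shaded once in the single pass, whereas the pre-pass version shades one fragment per pixel.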
    30.5.2 Exploit Texture Math When Loading Data
    The texture unit filters data before returning it to the fragment processor, thus reducing the
    total data needed by the shader. The texture unit’s bilinear filtering can frequently be used
    to reduce the total work done by the shader if it’s performing more sophisticated shading.
    Often, large filter kernels can be dissected into groups of bilinear footprints, which are
    scaled and accumulated to build the large kernel. A few caveats apply here, most notably
    that all filter coefficients must be positive for bilinear footprint assembly to work
    properly. (See Chapter 20, “Fast Third-Order Texture Filtering,” for more information
    about this technique.)
    Similarly, the filtering support given by shadow buffering can be used to offload
    work from the fragment processor: the hardware performs the depth compares and then filters the results.
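The bilinear-footprint idea is easiest to see in one dimension: two adjacent weighted taps can be replaced by a single linearly filtered fetch at a shifted coordinate, scaled by the sum of the weights. The sketch below is a software model with hypothetical function names, assuming texel i is centered at coordinate i.

```python
def linear_fetch(texels, x):
    """Linearly filtered 1D fetch; texel i is centered at coordinate i."""
    i = int(x)
    frac = x - i
    if frac == 0.0:
        return texels[i]
    return texels[i] * (1.0 - frac) + texels[i + 1] * frac


def two_tap_direct(texels, i, w0, w1):
    """Reference: two point samples, each weighted in the shader."""
    return w0 * texels[i] + w1 * texels[i + 1]


def two_tap_filtered(texels, i, w0, w1):
    """Same result from one filtered fetch, scaled by the combined weight."""
    total = w0 + w1                      # requires w0 > 0 and w1 > 0
    return total * linear_fetch(texels, i + w1 / total)


row = [1.0, 3.0, 2.0, 5.0]
direct = two_tap_direct(row, 1, 0.25, 0.75)
filtered = two_tap_filtered(row, 1, 0.25, 0.75)
```

The positivity caveat follows from the offset `w1 / total`: if the weights have opposite signs, the offset falls outside [0, 1] and the fetch no longer lands between the two intended texels.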
    30.5.3 Use Branching in Fragment Programs Judiciously
    Because the fragment processor is a SIMD machine operating on many fragments at a
    time, if some fragments in a given group take one branch and other fragments in that
    group take another branch, the fragment processor needs to take both branches. Also,
    there is a six-cycle overhead for if-else-endif control structures. These two effects can
    reduce the performance of branching programs if not considered carefully. Branching
    can be very beneficial, as long as the work avoided outweighs the cost of branching.
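The trade-off in the last sentence can be put into a rough break-even check. The model assumes work is skipped only in groups where every fragment skips it, and uses the six-cycle if/else/endif overhead as the default; both the model and the name `branch_pays_off` are illustrative assumptions.

```python
def branch_pays_off(expensive_cycles, skip_fraction, overhead=6):
    """True if guarding work with a branch lowers the average cost per group.

    expensive_cycles -- cycles of work the branch could skip
    skip_fraction    -- fraction of SIMD groups in which every fragment skips
    overhead         -- flow-control cost in cycles paid by every group
    """
    without_branch = expensive_cycles
    with_branch = overhead + (1.0 - skip_fraction) * expensive_cycles
    return with_branch < without_branch


long_shader = branch_pays_off(expensive_cycles=100, skip_fraction=0.5)
short_shader = branch_pays_off(expensive_cycles=10, skip_fraction=0.5)
never_skips = branch_pays_off(expensive_cycles=100, skip_fraction=0.0)
```

In this model the branch wins only when `skip_fraction * expensive_cycles` exceeds the overhead, so short guarded blocks or highly divergent conditions are better left unbranched.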
    Chapter 30 The GeForce 6  Series GPU Architecture
    