June 2nd, 2025

D3D12 Cooperative Vector


Welcome to the preview release for Cooperative Vector support in D3D12.  This exposes powerful new hardware acceleration for vector and matrix operations, enabling developers to efficiently drive neural rendering techniques directly from individual shader threads in real-time graphics pipelines.

In research and in industry, machine-learning-based approaches have made their way into the mainstream, replacing or augmenting traditional techniques. In graphics, neural-network-based rendering methods are gaining popularity over traditional methods of image reconstruction, texture compression, material shading, etc. Simultaneously, the increasing use of GPUs for general-purpose ML/DL means that GPU vendors continue to add specialized hardware to accelerate neural network computations, such as dedicated units for matrix operations.

Parent blog for all other features in this release.


Motivation

Suppose we have a typical shader for lighting computation. This can be thousands of lines of code, looping over light sources and evaluating complex materials. We want a way to replace these computations in individual shader threads with a neural network, with no other change to the rendering pipeline. The requested inference operation needs to be understood at a high level by the driver so it can be mapped to dedicated hardware acceleration.


Feature Overview

Shader Model 6.9 adds a series of HLSL and DXIL features, building on each other, around vectors and matrix-vector operations available to shader threads.  The result is that high level linear algebra operations can be directly consumed by drivers compiling shaders to GPUs with knowledge about how to take advantage of underlying hardware capability.

Here is a summary of the features with links to the specs for each:

HLSL Long Vector The ability to load, store, and perform elementwise operations on HLSL vectors longer than four elements. In addition to the linked spec, check out this overview: HLSL Native and Long Vectors Blog
DXIL Vectors The ability for vectors to appear in DXIL instead of being scalarized.
HLSL Vector-Matrix Operations Builds on Long Vectors above, allowing matrix-vector ops in HLSL to lower to the DXIL ops introduced in Cooperative Vector below.
Cooperative Vector DXIL operations for vector-matrix operations that can be accelerated by the underlying hardware.

The “Cooperative” in Cooperative Vector refers to an implementation detail of the hardware acceleration, where individual vector-matrix multiply requests submitted by threads in a wave are combined into a matrix-matrix operation accelerated collectively for the wave.  This name doesn’t appear in HLSL code itself, just vector types and operations like vector-matrix multiplication as shown in the examples below.

Support for matrix-matrix operations is planned for the future.
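To make the "cooperative" idea concrete, here is a small C++ sketch (purely illustrative, not D3D code; function names are mine): each thread requests its own matrix-vector product against a shared weight matrix, and the wave's products are mathematically equivalent to one matrix-matrix product over the threads' vectors gathered as columns.

```cpp
#include <vector>

// y = A * x, with A an M x K matrix stored row-major.
// Models the matrix-vector multiply a single shader thread requests.
std::vector<float> MatVec(const std::vector<float>& A,
                          const std::vector<float>& x, int M, int K) {
    std::vector<float> y(M, 0.0f);
    for (int r = 0; r < M; ++r)
        for (int c = 0; c < K; ++c)
            y[r] += A[r * K + c] * x[c];
    return y;
}

// Y = A * X, where each of X's T columns is one thread's input vector.
// Models the single matrix-matrix operation hardware can run for a wave;
// column t of the result is exactly thread t's MatVec result.
std::vector<float> WaveMatVec(const std::vector<float>& A,
                              const std::vector<std::vector<float>>& X,
                              int M, int K) {
    const int T = static_cast<int>(X.size());
    std::vector<float> Y(M * T, 0.0f);  // row-major M x T
    for (int t = 0; t < T; ++t)
        for (int r = 0; r < M; ++r)
            for (int c = 0; c < K; ++c)
                Y[r * T + t] += A[r * K + c] * X[t][c];
    return Y;
}
```

This batching is why the acceleration is per-wave even though each HLSL thread only ever sees its own vector.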


Code Example

// Byte Address Buffers used to store vectors/matrices

ByteAddressBuffer InVectors; 
RWByteAddressBuffer OutVectors;
ByteAddressBuffer InMatrices;
RWByteAddressBuffer OutMatrices;

// System header containing Cooperative Vector types, enums, and functions.
#include <dx/linalg.h>

// Such elements are all under the linalg namespace.
using namespace dx::linalg;

// Hand-wavey utility function to generate the input vector for Mul and MulAdd.
template<typename T, uint N> vector<T,N> GenerateVector(...);

[numthreads(8,1,1)]
[shader("compute")]
void main() {

  // Matrix Vector Multiply Mul() Example

  // Matrix and vector to be multiplied together
  uint MatOffset = 0;
  uint VecOffset = 0;
  MatrixRef<DATA_TYPE_FLOAT32, 8, 6, MATRIX_LAYOUT_ROW_MAJOR> MulMatrix = {
    InMatrices, MatOffset, /*stride*/6 * sizeof(float)};
  MatOffset += 8 * 6 * sizeof(float);
  
  vector<float, 6> MulVector = GenerateVector<float, 6>(...);

  vector<float, 8> MulRes = Mul<float>(MulMatrix,
                    MakeInterpretedVector<DATA_TYPE_FLOAT32>(MulVector));
                    MakeInterpretedVector<DATA_TYPE_FLOAT32>(MulVector));

  // Matrix Vector Multiply and Add Bias Vector MulAdd() Example
  MatrixRef<DATA_TYPE_FLOAT8_E4M3, 32, 4, MATRIX_LAYOUT_MUL_OPTIMAL> MulAddMatrix = {
    InMatrices, MatOffset, /*stride*/0};
  MatOffset += 32 * 4;
  
  half4 MulAddVector = GenerateVector<half, 4>(...);
  VectorRef<DATA_TYPE_FLOAT8_E4M3> BiasVector = {InVectors, VecOffset};
  VecOffset += 32;

  vector<half, 32> MulAddRes = MulAdd<half>(MulAddMatrix,
                                             MakeInterpretedVector<DATA_TYPE_FLOAT8_E4M3>(MulAddVector),
                                             BiasVector);
                                             MakeInterpretedVector<DATA_TYPE_FLOAT8_E4M3>(MulAddVector),
                                             BiasVector);

  // Vector Vector Outer Product OuterProductAccumulate() Example
  vector<uint8_t4_packed, 128> LeftVector = InVectors.Load< vector<uint8_t4_packed, 128> >(VecOffset);
  VecOffset += sizeof(LeftVector);
  vector<uint8_t4_packed, 64> RightVector = InVectors.Load< vector<uint8_t4_packed, 64> >(VecOffset);
  VecOffset += sizeof(RightVector);
  
  // Storage for matrix produced by the outer product of above vectors.
  RWMatrixRef<DATA_TYPE_UINT8, 128, 64, MATRIX_LAYOUT_OUTER_PRODUCT_OPTIMAL>
  OuterMatrix = {OutMatrices, /*offset*/0, /*stride*/0};
  
  OuterProductAccumulate(LeftVector, RightVector, OuterMatrix);

  // Vector Accumulating Addition VectorAccumulate() Example
  vector<uint, 73> AccVector = InVectors.Load< vector<uint, 73> >(VecOffset);
  VecOffset += sizeof(AccVector);
  
  VectorAccumulate(AccVector, OutVectors, 0 /*offset*/);
  
}
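For reference, the intrinsics above compute standard linear-algebra operations: Mul is y = A·x, MulAdd is y = A·x + b, OuterProductAccumulate is M += a·bᵀ, and VectorAccumulate is acc += v. A hedged C++ sketch of that math (helper names are mine, not the HLSL API; conversions between storage formats like FP8 are omitted):

```cpp
#include <cstddef>
#include <vector>

// y = A*x + b: the math MulAdd() requests, with A row-major M x K.
// Mul() is the same computation with b == 0.
std::vector<float> MulAddRef(const std::vector<float>& A,
                             const std::vector<float>& x,
                             const std::vector<float>& b, int M, int K) {
    std::vector<float> y(b);
    for (int r = 0; r < M; ++r)
        for (int c = 0; c < K; ++c)
            y[r] += A[r * K + c] * x[c];
    return y;
}

// M += a * b^T: the math OuterProductAccumulate() requests;
// M is |a| x |b|, row-major.
void OuterProductAccumulateRef(std::vector<float>& M,
                               const std::vector<float>& a,
                               const std::vector<float>& b) {
    for (std::size_t r = 0; r < a.size(); ++r)
        for (std::size_t c = 0; c < b.size(); ++c)
            M[r * b.size() + c] += a[r] * b[c];
}

// acc += v: the elementwise math VectorAccumulate() requests.
void VectorAccumulateRef(std::vector<float>& acc,
                         const std::vector<float>& v) {
    for (std::size_t i = 0; i < v.size(); ++i) acc[i] += v[i];
}
```

The outer-product and vector accumulation ops are the building blocks for training-style backpropagation passes, which is why they accumulate into memory rather than returning a value to the thread.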

Data preparation

There are a couple of D3D methods for converting weight and bias matrix data between layouts, which are described by this enum:

enum D3D12_LINEAR_ALGEBRA_MATRIX_LAYOUT {
    D3D12_LINEAR_ALGEBRA_MATRIX_LAYOUT_ROW_MAJOR,
    D3D12_LINEAR_ALGEBRA_MATRIX_LAYOUT_COLUMN_MAJOR,
    D3D12_LINEAR_ALGEBRA_MATRIX_LAYOUT_MUL_OPTIMAL,
    D3D12_LINEAR_ALGEBRA_MATRIX_LAYOUT_OUTER_PRODUCT_OPTIMAL
};

For instance, D3D12_LINEAR_ALGEBRA_MATRIX_LAYOUT_MUL_OPTIMAL is a device-specific layout for optimal use with the Cooperative Vector Matrix-Vector intrinsics such as the MulAdd in the code example above.

See ID3D12DevicePreview::GetLinearAlgebraMatrixConversionDestinationInfo() and ID3D12CommandListPreview::ConvertLinearAlgebraMatrix() in the Cooperative Vector spec here.
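The two *_OPTIMAL layouts are opaque and device-specific, which is exactly why those conversion APIs exist: only the driver can produce them. The two transparent layouts, by contrast, have a simple relationship, sketched here with an illustrative helper of my own (not a D3D12 API):

```cpp
#include <vector>

// Convert an M x N float matrix from row-major to column-major storage.
// Covers only the transparent layouts; the *_OPTIMAL layouts are opaque
// and must be produced via ConvertLinearAlgebraMatrix().
// (Illustrative helper, not part of D3D12.)
std::vector<float> RowToColMajor(const std::vector<float>& src, int M, int N) {
    std::vector<float> dst(M * N);
    for (int r = 0; r < M; ++r)
        for (int c = 0; c < N; ++c)
            dst[c * M + r] = src[r * N + c];
    return dst;
}
```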


Get running

Long Vector and Cooperative Vector are part of Shader Model 6.9, currently in preview. This requires:

  • AgilitySDK 1.717.0-preview available here.
  • Preview Shader Model 6.9 support in DXC available here.

Device support:

  • NVIDIA: Accelerated on all NVIDIA GeForce RTX™ GPUs; access the driver here (requires an NVIDIA Developer Program account).
  • Intel: Available for Intel® Arc™ B-Series Graphics and Intel® Core™ Ultra Processors (Series 2) with the Intel® Arc™ Graphics developer preview driver found here.
  • AMD: Driver support for Cooperative Vector will be made available during summer 2025.
  • WARP: Available on the latest WARP software rasterizer preview, available here.

Checking for support

To enable the cooperative vector preview with the AgilitySDK from above, turn on the relevant experimental features in code before creating a D3D12 device:

UUID Features[] = { D3D12ExperimentalShaderModels, D3D12CooperativeVectorExperiment };
ThrowIfFailed(D3D12EnableExperimentalFeatures(_countof(Features), Features, nullptr, nullptr));

Once a device is created, check cooperative vector support.

D3D12_FEATURE_DATA_D3D12_OPTIONS_EXPERIMENTAL FeatureDataTier = {};
ThrowIfFailed(pDevice->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS_EXPERIMENTAL, 
                                              &FeatureDataTier, 
                                              sizeof(FeatureDataTier)));
if(FeatureDataTier.CooperativeVectorTier >= D3D12_COOPERATIVE_VECTOR_TIER_1_0)
{
    // Have Tier 1 cooperative vector support (there's also a Tier 1.1 for training operations)
}

The tier applies specifically to the set of features in the Cooperative Vector spec: a specific set of vector intrinsics as well as D3D12 APIs for asking drivers to perform matrix data conversions.

There are additional cooperative vector specific capability queries around what combinations of data formats are supported, described in the spec here.

Long vector is supported on any device that reports Shader Model 6.9 support.


PIX

As usual, this release comes with day-one PIX support. Please read the PIX blog post for more information.


Content from GPU Vendors


NVIDIA

NVIDIA and Microsoft’s collaboration brings cooperative vectors to HLSL and DirectX, unlocking GeForce RTX Tensor Cores for developers and driving a new era of real-time realism and performance in PC gaming.

– NVIDIA

An earlier NVIDIA blog shows how to get started with RTX Kit neural rendering technologies.  In addition to detailed demos/samples for other APIs, there is a great overview of the Cooperative Vector feature here including performance considerations.

Their latest blog highlights how NVIDIA has updated the RTX Neural Shaders SDK with a DirectX path for Cooperative Vectors, using Slang converted to HLSL for shader authoring.


Intel

Intel is excited to make our efficient AI hardware accessible through Microsoft DirectX’s Cooperative Vectors.  We strongly believe that neural graphics is the right way to generate visuals in the future.  We are committed to providing the best performance and developer experience across our discrete and integrated Intel Arc GPUs.

– Anton Kaplanyan, VP of Graphics Research at Intel

An earlier Intel blog shows the use of Cooperative Vectors to accelerate Neural Block Texture Compression. They exploit coherency across arbitrary numbers of texture channels to achieve compression ratios up to five times those of traditional Block Compression. Using Cooperative Vectors, this approach runs 10x faster on Intel Arc (B-Series) GPUs.

Check out Intel’s new announcement showcasing this technique using DirectX Cooperative Vector, available on GitHub here. The code runs on both Intel and NVIDIA GPUs (i.e., the devices with preview drivers so far).

From that repo, here’s a link directly to some HLSL code using Cooperative Vector.

[Image: T-rex and textures]


Category
DirectX

Author

Amar Patel
Engineer
Greg Roth
Dev Lead
