Efficient Differentiable Hardware Rasterization for 3D Gaussian Splatting
Abstract
Recent works demonstrate the advantages of hardware rasterization for 3D Gaussian Splatting (3DGS) in forward-pass rendering through fast GPU-optimized graphics and a fixed memory footprint. However, extending these benefits to backward-pass gradient computation remains challenging due to graphics pipeline constraints. We present a differentiable hardware rasterizer for 3DGS that overcomes the memory and performance limitations of tile-based software rasterization. Our solution employs programmable blending for per-pixel gradient computation combined with a hybrid gradient reduction strategy (quad-level + subgroup) in fragment shaders, achieving over 10× faster backward rasterization versus naive atomic operations and a 3× speedup over the canonical tile-based rasterizer. Systematic evaluation reveals 16-bit render targets (float16 and unorm16) as the optimal accuracy-efficiency trade-off, achieving higher gradient accuracy among mixed-precision rendering formats with execution speeds second only to unorm8, while float32 textures incur severe forward-pass performance degradation due to suboptimal hardware optimizations. Our method with float16 formats demonstrates a 3.07× acceleration of full pipeline execution (forward + backward passes) on an RTX 4080 GPU with the MipNeRF360 dataset, outperforming the baseline tile-based renderer while preserving hardware rasterization's memory efficiency advantages, incurring merely 2.67% of the memory overhead required for splat sorting operations. This work presents a unified differentiable hardware rasterization method that simultaneously optimizes runtime and memory usage for 3DGS, making it particularly suitable for resource-constrained devices with limited memory capacity.
Index Terms:
3D Gaussian Splatting, Hardware Rasterization, Differentiable Rendering

I Introduction
3D Gaussian Splatting (3DGS) Kerbl et al. (2023) has established itself as an effective differentiable scene representation, demonstrating remarkable capabilities in multiview 3D reconstruction, novel view synthesis, and real-time rendering.
The differentiable renderer constitutes the core component of 3DGS-based scene reconstruction. It computes gradients of Gaussian splat parameters with respect to photometric loss functions, enabling iterative parameter optimization. Current implementations primarily rely on tile-based software rasterization, which operates through three key stages: (1) identifying tiles intersected by each Gaussian splat, (2) sorting splat-tile pairs which requires dynamic memory allocation due to the unpredictable number of intersections, and (3) executing two distinct processing phases: a forward pass for color computation via alpha blending and a backward pass for gradient propagation.
For the forward pass of the differentiable 3DGS rendering process, recent work has demonstrated substantial performance improvements over tile-based methods through hardware rasterization pipelines Bulo et al. (2025); Kwok (2023); Xu (2024). The hardware-accelerated approach executes through three stages: (1) depth-sorting all splats, (2) rendering each splat as a quad bounded by its projected 2D covariance Zwicker et al. (2002), and (3) computing per-pixel transparency in fragment shaders based on the 2D Gaussian distribution, with hardware blending handling color composition. Crucially, this pipeline eliminates the dynamic memory management requirements of tile-based methods, replacing variable-size splat-tile buffers with fixed-capacity buffers scaled to the total splat count, thereby ensuring memory efficiency.
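To make the fragment-shader stage concrete, the following minimal GLSL sketch evaluates per-pixel transparency from the 2D Gaussian and emits a premultiplied color for hardware blending. The interpolants vOffset and vColorOpacity and the exact blend-factor setup are our assumptions, not details of the cited implementations:

```glsl
#version 460

// Hypothetical interpolants: vOffset is the fragment's offset from the splat
// center, pre-scaled in the vertex shader so that dot(vOffset, vOffset) gives
// the squared Mahalanobis distance under the projected 2D covariance.
layout(location = 0) in vec2 vOffset;
layout(location = 1) in vec4 vColorOpacity; // rgb = splat color, a = opacity

layout(location = 0) out vec4 outColor;

void main() {
    float rho   = dot(vOffset, vOffset);               // squared Mahalanobis distance
    float alpha = vColorOpacity.a * exp(-0.5 * rho);   // 2D Gaussian falloff
    if (alpha < 1.0 / 255.0) discard;                  // splat boundary culling
    // Premultiplied output; fixed-function blend factors chosen to match the
    // traversal order (e.g., ONE_MINUS_DST_ALPHA / ONE for front-to-back)
    // realize the alpha compositing of depth-sorted splats.
    outColor = vec4(vColorOpacity.rgb * alpha, alpha);
}
```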
Although hardware rasterization has demonstrated significant performance and memory efficiency advantages for the forward pass in 3DGS rendering, extending these benefits to the backward pass presents a fundamental challenge: computing per-pixel splat gradients within functionality-constrained fragment shaders of the GPU graphics pipeline. Meanwhile, efficiently accumulating these gradients per-splat remains an additional hurdle. Both challenges are unresolved in current implementations.
This paper thus introduces a novel differentiable 3DGS hardware rasterizer that overcomes these challenges. Our first contribution is a mathematically rigorous backward gradient propagation method that fully conforms to hardware rasterization limitations, achieved through the GPU programmable blending technique to maintain the pixel-local data required for gradient computation.
Our second contribution is an efficient splat-wise gradient reduction strategy in fragment shaders that combines quad-level (2×2 pixel) and subgroup operations. This approach achieves 10× faster backward rasterization compared to baseline atomicAdd implementations and demonstrates a 3× speedup over the original tile-based backward rasterization Kerbl et al. (2023).
To mitigate float32 render targets' performance overhead in the forward pass, we evaluate reduced-precision alternatives (float16, unorm16, unorm8) and further develop mixed-precision rendering techniques. Our systematic analysis identifies float16 and unorm16 as the optimal formats, achieving relatively minor gradient computation errors (RMSE and MRE) while delivering near-peak throughput, surpassed only by our unorm8 configuration in raw speed.
On an RTX 4080 GPU with the MipNeRF360 dataset Barron et al. (2022), our float16-based implementation achieves 3.07× faster end-to-end performance compared to the tile-based rasterizer proposed by Kerbl et al. (2023). By adopting hardware rasterization with fixed-capacity buffers, we reduce memory consumption for splat sorting to just 2.67% of tile-based approaches, simultaneously improving both computational speed and memory efficiency.
II Background and Related Work
II-A 3D Gaussian Splatting
II-A1 Scene Parameterization
Kerbl et al. (2023) proposed 3D Gaussian Splatting (3DGS) as a differentiable scene representation that models 3D geometry using anisotropic 3D Gaussians. Each Gaussian primitive is parameterized by: (1) a center position $\mu$, (2) opacity $o$, (3) view-dependent color encoded through spherical harmonics (SHs), and (4) a covariance matrix $\Sigma$ constructed from the learnable rotation parameters $R$ and scaling parameters $S$ as $\Sigma = R S S^T R^T$. The influence of a Gaussian (i.e., its density) at a point $x$ follows:
$$G(x) = e^{-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)} \tag{1}$$
Unlike implicit volumetric representations (e.g., NeRF Mildenhall et al. (2021)), 3DGS enables real-time rendering via explicit splat rasterization while preserving differentiability.
II-A2 Differentiable Rendering
Following the formulation in Kerbl et al. (2023), 3DGS rendering first projects 3D Gaussians to screen space. Given a Gaussian splat with 3D mean $\mu$ and covariance $\Sigma$, its 2D projection is computed using the view matrix $W$ and the projection Jacobian $J$:

$$\Sigma' = J W \Sigma W^T J^T,$$

yielding the 2D Gaussian density function $G'(x')$, where $x'$ denotes 2D screen-space coordinates and all parameters are projected to screen space.
Depth-sorted splats (indexed $i = 1, \dots, N$ from front to back) are alpha-composited to compute the final pixel color $C$. Let $c_i$ denote the view-dependent color of splat $i$, and $\alpha_i(x') = o_i\,G'_i(x')$ (we drop $x'$ for simplicity) the transparency at pixel $x'$. The blending equations are:
$$C = \sum_{i=1}^{N} c_i\,\alpha_i\,T_i, \qquad T_i = \prod_{j=1}^{i-1} (1 - \alpha_j) \tag{2}$$
where $T_i$ represents accumulated transmittance. To enable gradient-based optimization, the renderer computes derivatives of the loss with respect to splat parameters. The key gradients $\partial C / \partial c_i$ and $\partial C / \partial \alpha_i$ propagate to spatial parameters ($\mu$, $\Sigma$) via chain-rule differentiation. From Equation 2, we can derive:
$$\frac{\partial C}{\partial c_i} = \alpha_i\,T_i \tag{3}$$

$$\frac{\partial C}{\partial \alpha_i} = c_i\,T_i - \frac{1}{1 - \alpha_i} \sum_{j=i+1}^{N} c_j\,\alpha_j\,T_j \tag{4}$$
The gradient computation supports both front-to-back ($i = 1 \to N$) and back-to-front ($i = N \to 1$) processing of depth-sorted splats.
Although Gaussian splats have infinite theoretical support (since $G(x) > 0$ for all $x$), practical implementations discard marginal contributions with $\alpha_i < 1/255$, a threshold aligned with 8-bit color precision. This early culling applies to both color blending and gradient computation.
II-A3 Tile-based Rasterization
The differentiable tile-based rasterizer proposed by Kerbl et al. (2023) partitions the screen into $16 \times 16$-pixel tiles. For each splat, intersecting tiles are computed according to the splat boundary, and buffers are dynamically allocated to store splat-tile pairs. These pairs are sorted by tile ID and depth to generate per-tile depth-ordered splat lists. During the forward pass, pixel colors are computed in front-to-back order as in Equation 2.
In the backward pass, splats are processed in reversed depth order. Initial gradient computation focuses on per-pixel $\partial L / \partial c_i$ and $\partial L / \partial \alpha_i$, followed by propagation to intermediate derivatives such as those with respect to the 2D means, 2D covariances, and opacities. These gradients are then aggregated per-splat via atomic operations. High-dimensional parameter gradients (3D positions, SH coefficients, etc.) are processed in a dedicated subsequent pass to maintain performance.
II-A4 T-Culling
Kerbl et al. (2023) also implements an optimization termed T-Culling, which early-terminates front-to-back splat traversal in the forward pass when the accumulated transparency product $T_i$ drops below $10^{-4}$, with its results stored and reused to accelerate the backward pass as well. Leveraging the monotonic decay of $T_i$ with depth order, this heuristic skips deeper splats whose contributions to pixel color and gradients become negligible. This optimization has been widely adopted in subsequent tile-based rasterizers such as gsplat Ye et al. (2024). Though straightforward for tile-based methods, T-Culling is incompatible with hardware 3DGS rasterization when employing graphics-pipeline fixed-function blending, as the GPU's fixed-function blending intrinsically lacks support for such early termination.
II-B Improving 3DGS Rendering Performance
Recent advances in accelerating 3DGS rendering span multiple directions, including compressing and pruning 3DGS models Fan et al. (2024); Girish et al. (2024); Fang and Wang (2024); Lee et al. (2024b); Niemeyer et al. (2024) and designing specialized hardware architectures for 3DGS rendering Feng et al. (2024b); Lin et al. (2025).
Parallel efforts focus on optimizing tile-based rasterizers. Works like GSCore Lee et al. (2024a) and FlashGS Feng et al. (2024a) enhance splat-tile intersection computation through precise geometric algorithms, significantly reducing noneffective splat-tile pairs compared to the naive AABB-tile approach. This optimization improves both forward- and backward-pass performance while reducing memory consumption. GSCore further adopts coarse-grained sorting and subtile skipping to optimize mobile deployment. Balanced3DGS Gui et al. (2024), on the other hand, resolves GPU workload imbalance in tile-based methods caused by non-uniform splat distributions through dynamic tile-to-thread-group task allocation and intra-group splat-wise parallelism. DISTWAR Durvasula et al. (2023, 2025) targets backward-pass efficiency by minimizing atomic operation stalls. Its warp-level gradient pre-accumulation strategy aggregates per-thread gradients within a warp before committing to global memory via atomicAdd, achieving substantial speedups in gradient computation.
Despite their performance gains, these tile-based methods still suffer from inherent scalability issues: the need for dynamic memory management and a worst-case memory requirement of $O(N \cdot T)$ splat-tile pairs (for $N$ splats and $T$ tiles) that becomes prohibitive at scale.
In contrast, 3DGS hardware rasterizers operate under fixed memory budgets. These approaches project splats as 2D quads through eigendecomposition of the 2D covariance $\Sigma'$, leveraging hardware interpolation and fixed-function blending to efficiently compute transparencies and alpha-composite colors, respectively. Notable implementations include the WebGL-based renderer Kwok (2023), OpenGL-powered frameworks Xu (2024) (reporting speedups over unoptimized tile-based rasterizers despite hardware blending limitations precluding T-Culling), and the Unity engine integration Aras-P (2023). These implementations lack frustum culling, a standard optimization in production-grade renderers, leaving measurable performance gains untapped. More critically, none implement backward passes, preventing their application to differentiable rendering tasks.
The core challenge for differentiable hardware rasterization lies in fragment shaders' inability to perform the depth-ordered splat traversal needed for per-pixel gradient computation. Existing attempts to address this challenge rely on approximations that compromise mathematical fidelity relative to exact tile-based gradient computation. The depth peeling solution proposed by Xu et al. (2024) renders sequential depth layers to approximate full splat traversal, at the cost of omitting gradients from occluded splats beyond the layer limit. Kheradmand et al. (2025) circumvents ordered traversal through Monte Carlo gradient estimation based on stochastic transparency Enderton et al. (2010), achieving a hardware-compatible implementation at the cost of noisy gradients.
III Method
III-A Gradient Computation via Programmable Blending
In this section, we address the fundamental barrier to differentiable rasterization: fragment shaders’ inability to access overlapping splat sequences. Our approach exploits programmable blending to enable correct per-pixel gradient computation.
III-A1 Programmable Blending
Programmable Blending is a hardware rasterization technique that overcomes the predefined constraints of traditional fixed-function blending, enabling custom color mixing logic to be implemented in shader code. This capability, supported through graphics API extensions on modern GPUs, provides two essential features: (1) race-condition-free operations on per-pixel resources in fragment shaders, and (2) execution order guarantees which match the rasterization sequence for resource access.
Tile-based architecture GPUs prevalent in mobile platforms natively support programmable blending Arm Holdings plc (2025); Apple Inc. (2024). These GPUs guarantee that fragment shaders within a tile execute in strict rasterization order, requiring only framebuffer synchronization to make programmable blending available. For example, within the Vulkan graphics API Bailey (2019), this capability is formally specified through the VK_EXT_shader_tile_image and VK_EXT_rasterization_order_attachment_access extensions, which permit safe read-modify-write operations on the same attachment in fragment shaders while eliminating explicit memory barrier requirements.
Desktop GPU programs can implement programmable blending via the VK_EXT_fragment_shader_interlock Vulkan extension, which defines a critical section through beginInvocationInterlock() and endInvocationInterlock() calls, enabling safe access to pixel-local data structures in fragment shaders. Execution order within these critical sections can further be configured as rasterization-ordered. However, unlike tile-based mobile GPUs that guarantee rasterization-order execution, most desktop GPUs employ out-of-order fragment shader scheduling for improved hardware utilization. This architectural approach requires fragment shader interlocks to enforce synchronization via pipeline stalls or execution order constraints, incurring measurable performance overhead.
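As an illustration, the critical section on desktop GPUs can be structured as in the following minimal GLSL sketch (the rgba16f state-image format and binding are assumptions):

```glsl
#version 460
#extension GL_ARB_fragment_shader_interlock : require

// Request rasterization-ordered, per-pixel critical sections.
layout(pixel_interlock_ordered) in;

// Pixel-local state kept in a storage image (format/binding are assumptions).
layout(binding = 0, rgba16f) uniform coherent image2D perPixelState;

void main() {
    ivec2 p = ivec2(gl_FragCoord.xy);
    beginInvocationInterlockARB();              // enter the critical section
    vec4 state = imageLoad(perPixelState, p);   // race-free read ...
    // ... custom blending / per-pixel bookkeeping on `state` goes here ...
    imageStore(perPixelState, p, state);        // ... and ordered write-back
    endInvocationInterlockARB();                // leave the critical section
}
```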
III-A2 Front-to-Back Gradient Computation
Programmable blending enables persistent per-pixel state management, making per-pixel gradient computation within fragment shaders feasible. While both front-to-back ($i = 1 \to N$) and back-to-front ($i = N \to 1$) traversal orders are valid for gradient computation, we empirically find that front-to-back traversal better aligns with T-Culling optimizations. This stems from hardware blending's inability to apply T-Culling in the forward pass, constraining the optimization exclusively to the backward pass. A back-to-front traversal in the backward pass would require dividing out culled terms to recover the values needed by T-Culling, leading to extra memory accesses. Front-to-back traversal avoids this by directly computing transmittance through incremental updates.
The front-to-back gradient computation proceeds as follows. First, both Equation 3 and Equation 4 depend on the following recursively updated transmittance:

$$T_{i+1} = T_i\,(1 - \alpha_i), \qquad T_1 = 1 \tag{5}$$
Additionally, the summation term in Equation 4 requires recursive computation. We define an auxiliary sequence $A_i$, the alpha-blended color accumulated through splat $i$:

$$A_i = A_{i-1} + c_i\,\alpha_i\,T_i, \qquad A_0 = 0 \tag{6}$$

so that $A_i = \sum_{j=1}^{i} c_j\,\alpha_j\,T_j$ and the tail sum in Equation 4 becomes $C - A_i$.
Substituting Equation 6 into Equation 4 yields the reformed gradient expression:

$$\frac{\partial C}{\partial \alpha_i} = c_i\,T_i - \frac{C - A_i}{1 - \alpha_i} \tag{7}$$
Using these relations, we compute $\partial C / \partial c_i$ in Equation 3 and $\partial C / \partial \alpha_i$ in Equation 7 through front-to-back recurrence of $T_i$ and $A_i$, serving as the foundation for deriving all other gradients.
III-A3 Algorithm
Our backward pass draws splats in front-to-back order, and Procedure 1 illustrates our gradient computation approach within the fragment shader, enabled by programmable blending through VK_EXT_fragment_shader_interlock (targeting desktop GPUs). During the rasterization-ordered critical section enforced by the interlocks, we maintain and retrieve the values of $T_i$ and $A_i$ by reading/writing a dedicated state texture. The shader then loads the final pixel color $C$ and its incoming loss gradient from read-only textures, and computes $\partial C / \partial c_i$ and $\partial C / \partial \alpha_i$ using the analytical derivatives of Equations 3 and 7. As in Kerbl et al. (2023), these gradients are subsequently propagated to intermediate splat parameters (e.g., 2D positions, colors) via the chain rule.
Additionally, we maintain a boolean cullingFlag to track fragment rejection status, which combines two culling criteria: splat boundary culling ($\alpha_i < 1/255$) and T-Culling ($T_i < 10^{-4}$). Fragments marked by cullingFlag skip all gradient computations, reducing both texture accesses and arithmetic operations.
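The following GLSL sketch illustrates the interlocked portion of Procedure 1 under our assumed state layout ($A_i$ packed in the rgb channels and $T_i$ in the alpha channel of the state texture); stateImage, colorTex, gradTex, color, and alpha are illustrative names, not our exact implementation:

```glsl
// Inside the rasterization-ordered critical section of the backward fragment
// shader. Assumed state layout: S.rgb = A_i (Equation 6), S.a = T_i (Equation 5).
// `color`/`alpha` are this fragment's splat color and transparency.
ivec2 p    = ivec2(gl_FragCoord.xy);
vec4  S    = imageLoad(stateImage, p);        // T_i and A_{i-1} before this splat
vec3  C    = texelFetch(colorTex, p, 0).rgb;  // final forward-pass pixel color
vec3  dLdC = texelFetch(gradTex, p, 0).rgb;   // incoming loss gradient dL/dC

bool cullingFlag = (alpha < 1.0 / 255.0)      // splat boundary culling
                || (S.a < 1e-4);              // T-Culling
if (!cullingFlag) {
    float T_i  = S.a;
    vec3  A_i  = S.rgb + color * alpha * T_i;               // Equation 6 update
    vec3  dLdc = dLdC * (alpha * T_i);                      // via Equation 3
    vec3  dCda = color * T_i - (C - A_i) / (1.0 - alpha);   // Equation 7
    float dLda = dot(dLdC, dCda);
    imageStore(stateImage, p, vec4(A_i, T_i * (1.0 - alpha))); // Equation 5
    // dLdc / dLda are then chain-ruled to splat parameters and reduced
    // per splat (Section III-B).
}
```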
III-B Hybrid Gradient Reduction
With per-pixel splat parameter gradients computed, complete backpropagation can then be achieved through splat-wise accumulation. As direct use of atomicAdd operations incurs substantial performance overhead due to global memory contention, we introduce a hybrid gradient reduction strategy that leverages fragment shader parallelism through quad-local and subgroup-level reductions to address this limitation.
III-B1 Fragment Shader Subgroups and Quads
Prior to detailing our method, we briefly clarify the architectural foundations of subgroups and quads in modern GPUs. Modern GPU architectures employ two key execution models for fragment shaders:
Subgroups (warps/wavefronts): minimal schedulable thread blocks (32/64 threads on NVIDIA/AMD GPUs, respectively) supporting intra-group data exchange via operations like subgroupShuffle.
Quads: 2×2 fragment blocks enabling derivative calculations, with helper invocations (detectable via gl_HelperInvocation) handling primitive boundary conditions.
Fragment shaders rely exclusively on these units for thread communication due to the lack of shared memory, constraining gradient aggregation to intra-quad and intra-subgroup operations.
III-B2 Quad Reduction
To mitigate performance bottlenecks from frequent atomicAdd operations, we propose aggregating gradients computed across multiple fragment shader threads using quad and subgroup intrinsics, followed by sparse atomicAdd calls from selected threads to update global memory buffers.
A straightforward approach involves quad reduction: since hardware rasterization processes splat primitives at quad granularity, all threads within a quad inherently belong to the same splat. By leveraging quad intrinsics (e.g., Vulkan's subgroupQuadSwapVertical) to accumulate gradients within the quad, we perform partial sums and delegate a single thread per quad to execute atomicAdd. This strategy reduces atomicAdd calls to 28.7% of the baseline approach and achieves a 3.05× speedup (see Table I).
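A minimal GLSL sketch of this quad reduction, assuming a float-atomics-capable gradient buffer (gradValue, splatID, and splatGrads are illustrative names):

```glsl
#extension GL_KHR_shader_subgroup_basic : require
#extension GL_KHR_shader_subgroup_quad  : require
#extension GL_EXT_shader_atomic_float   : require // float atomicAdd on buffers

// Sum one scalar gradient across the 2x2 quad; every lane ends up holding
// the quad total, and a single lane commits it to global memory.
float g = gradValue;                    // per-fragment gradient (illustrative)
g += subgroupQuadSwapHorizontal(g);     // sum horizontal neighbor pairs
g += subgroupQuadSwapVertical(g);       // sum the two rows: quad total
// Quad lanes occupy four consecutive subgroup invocation IDs, so the lane
// with quad-local index 0 performs the single atomicAdd for the quad.
if ((gl_SubgroupInvocationID & 3u) == 0u)
    atomicAdd(splatGrads[splatID], g);
```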
III-B3 Splat-Subgroup Cohesion and Subgroup Reduction
To further reduce \seqsplitatomicAdd invocations, we explore subgroup gradient reduction. While there exists no guarantee that fragments within a subgroup originate from the same splat, we propose the Splat-Subgroup Cohesion Hypothesis: GPUs exhibit a statistical tendency to schedule fragments from the same splat (rendered as two adjacent triangle primitives) within a single subgroup, leveraging spatial locality for memory access efficiency during rasterization. Although not universally guaranteed, this hypothesis facilitates effective subgroup-level gradient reduction in most practical scenarios.
We validate this hypothesis through preliminary experiments on an RTX 4080 GPU. Qualitative analysis of splat-subgroup coherency (Figure 1) shows that splat-incoherent fragments (processed in subgroups containing fragments from multiple splats) mainly occur with small splats on screen, while larger splats predominantly produce splat-coherent fragments. This indicates high splat-coherency under typical rendering conditions. Quantitative measurements (Figure 2) further confirm these findings.
[Figure 1: Qualitative visualization of splat-subgroup coherency; splat-incoherent fragments (subgroups spanning multiple splats) are highlighted in red and concentrate on small on-screen splats.]

[Figure 2: Quantitative splat-subgroup cohesion rates measured across MipNeRF360 scenes, camera viewpoints, and rendering resolutions.]
These results confirm that our Splat-Subgroup Cohesion Hypothesis holds robustly on RTX 4080 GPUs. While architecture-specific, the observed pixel locality patterns suggest potential applicability to other modern GPUs, requiring further cross-platform validation.
Building on this hypothesis, our subgroup reduction pipeline first assesses splat cohesion using Vulkan's subgroupAllEqual intrinsic to verify uniform splat IDs across active threads. For cohesive subgroups, gradients are aggregated via subgroupAdd, with a single atomicAdd operation performed by a selected active lane.
Standalone subgroup reduction reduces atomicAdd invocations to 6% of baseline levels, and combining it with quad reduction further decreases the ratio to approximately 5%.
III-B4 Balancing Threshold in Subgroup Reduction
While minimizing atomicAdd invocations improves performance, excessive aggregation via subgroup reduction risks overloading streaming multiprocessors (SMs) with computation, creating an imbalance between SM workloads and L2 atomic unit (ROP) utilization that incurs performance penalties Durvasula et al. (2023, 2025). Inspired by Durvasula et al. (2023, 2025), we introduce a balancing threshold and restrict subgroup reduction to cases where the number of active threads requiring gradient aggregation exceeds it. The optimal balancing threshold is hardware- and scene-dependent and needs to be selected through performance profiling.
III-B5 Algorithm
Our hybrid gradient reduction method, combining quad-based and subgroup-based reduction strategies, is detailed in Procedure 2. This approach employs Vulkan GLSL intrinsics for subgroup and quad operations. The algorithm begins by filtering out noneffective threads through a reduceFlag, which combines the cullingFlag (from Procedure 1) and the hardware-generated gl_HelperInvocation. The reduceFlag excludes helper invocations (which cannot perform memory writes or framebuffer updates) and fragments previously discarded by splat boundary culling or T-Culling. A bitmask reduceMask, constructed via subgroupBallot, identifies active threads eligible for gradient aggregation.
The method then dynamically selects between subgroup and quad operations based on runtime conditions. It first evaluates subgroup reduction feasibility by checking splat ID coherence across active threads and verifying that the valid fragment count, measured via subgroupBallotBitCount, exceeds the balancing threshold. When these criteria are met, gradients are aggregated across the subgroup. Otherwise, the method falls back to quad reduction, aggregating gradients across 2×2 pixel blocks and restricting atomic updates to quad-local threads by zeroing out reduceMask bits for threads outside the current quad. The quad reduction fallback acts as a safety net for cases where subgroup coherence cannot be guaranteed, preserving performance across diverse scene complexities.
Finally, the least-significant thread in the current quad or subgroup reduction scope (identified via subgroupBallotFindLSB) performs a single atomicAdd, accumulating the reduced gradients to global memory.
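The following GLSL sketch outlines Procedure 2 under these conventions; the threshold value, function name, and buffer name are placeholders rather than our exact implementation:

```glsl
#extension GL_KHR_shader_subgroup_ballot     : require
#extension GL_KHR_shader_subgroup_arithmetic : require
#extension GL_KHR_shader_subgroup_quad       : require
#extension GL_EXT_shader_atomic_float        : require

// Balancing threshold (hardware- and scene-dependent; value is a placeholder).
const uint BALANCE_THRESHOLD = 16u;

void reduceAndCommit(uint splatID, float grad, bool cullingFlag) {
    // 1. Exclude helper invocations and culled fragments.
    bool  reduceFlag = !cullingFlag && !gl_HelperInvocation;
    uvec4 reduceMask = subgroupBallot(reduceFlag);
    float g = reduceFlag ? grad : 0.0;

    // 2. Subgroup reduction: all active lanes share one splat ID and enough
    //    lanes participate to justify the extra ALU work.
    if (subgroupAllEqual(splatID) &&
        subgroupBallotBitCount(reduceMask) > BALANCE_THRESHOLD) {
        g = subgroupAdd(g);
    } else {
        // 3. Quad fallback: quad lanes always belong to the same splat.
        g += subgroupQuadSwapHorizontal(g);
        g += subgroupQuadSwapVertical(g);
        // Zero out reduceMask bits outside this thread's quad (four
        // consecutive invocation IDs) so the LSB pick below is quad-local.
        uint  base = gl_SubgroupInvocationID & ~3u;
        uvec4 quadBits = uvec4(0u);
        quadBits[base / 32u] = 0xFu << (base % 32u);
        reduceMask &= quadBits;
    }
    // 4. The least-significant active lane in scope commits once.
    if (reduceFlag &&
        gl_SubgroupInvocationID == subgroupBallotFindLSB(reduceMask))
        atomicAdd(splatGrads[splatID], g); // `splatGrads` is illustrative
}
```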
This dual-strategy approach is derived through rigorous profiling across varying 3DGS workloads, with empirical validation of its efficacy provided in Table I.
III-C Mixed Precision Rendering
Besides backward pass optimizations, forward pass performance also determines our rasterizer's overall efficiency. Our experiments identify the render target texture format as the primary performance determinant of the 3DGS forward pass. Notably, fixed-function blending with float32 proves particularly inefficient, even underperforming fragment interlock approaches. In contrast, reduced-precision formats (float16, unorm16, unorm8) unlock the superior performance of GPU rasterization and blending, with unorm8 delivering the fastest rendering speeds, while float16 and unorm16 achieve comparable runtime efficiency (see Table II).
To balance performance and numerical consistency, we adopt a mixed-precision rendering strategy that unifies precision across both forward and backward passes: the render target (storing colors and $\alpha$'s) and the per-pixel state texture (in Procedure 1) adopt reduced-precision formats, while intermediate calculations preserve full float32 precision. This ensures numerically accurate gradient computation while eliminating precision-mismatch artifacts during backpropagation. In addition, when using unorm formats, color values must be saturated to $[0, 1]$ to prevent overflow.
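In shader terms, this amounts to keeping arithmetic in float32 and saturating only at the storage boundary, as in this hedged sketch (the helper functions and the UNORM_TARGET macro are illustrative):

```glsl
// Intermediate math stays in float32; only attachment storage is reduced
// precision. Helper names are illustrative, not from our implementation.
vec3  c     = evalSplatColor();   // full-precision color
float alpha = evalSplatAlpha();   // full-precision transparency
vec3  rgb   = c * alpha;          // premultiplied contribution
#ifdef UNORM_TARGET
// unorm render targets only represent [0,1]; saturate to prevent overflow.
rgb = clamp(rgb, 0.0, 1.0);
#endif
outColor = vec4(rgb, alpha);      // hardware converts to the bound format
```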
Systematic evaluation of gradient accuracy (Table IV) demonstrates that 16-bit formats (float16 and unorm16) introduce relatively minor errors, while unorm8 exhibits significantly larger deviations. Thus, we suggest prioritizing 16-bit formats as they deliver substantial speedup over full-precision formats while maintaining accuracy. Although unorm8 offers the fastest rendering speed, the substantially higher error levels may adversely affect 3DGS training stability.
IV Experiments
We evaluated our method on the MipNeRF360 dataset Barron et al. (2022) using pretrained 3D Gaussian Splatting models (30k iterations) from the original implementation Kerbl et al. (2023). 3DGS models are rendered at the dataset's native image resolutions by default. Numerical precision (across float32 and lower-precision formats) and backward pass performance are measured using uniformly distributed stochastic image gradients.
Our comparative benchmarks include:
3DGS baseline: The original tile-based 3DGS CUDA rasterizer Kerbl et al. (2023).
DISTWAR: Tile-based variant with an optimized backward pass for gradient accumulation, while preserving forward pass performance and memory footprint Durvasula et al. (2023, 2025).
3DGS float16: Modified baseline 3DGS rasterizer using float16 image formats and pixel operations, simulating our mixed-precision approach’s storage/computational loads.
All experiments run on an NVIDIA RTX 4080 Laptop GPU, with Vulkan and CUDA performance metrics rigorously measured using low-level timers (vkQueryPool for Vulkan, cudaEvent for CUDA) to ensure accuracy.
IV-A Implementation Details
Our implementation uses C++, GLSL, and Vulkan. We develop a custom version of the Onesweep sorting algorithm Adinets and Merrill (2022); Merrill and Garland (2016) in the forward pass, enabling dynamic-length splat sorting via non-blocking indirect dispatch to support GPU-driven frustum culling, maximizing throughput. A backward-pass optimization occurs after line 12 of Procedure 1, where we implement early termination (via a return statement) when all fragments in the current quad are rejected (determined by subgroupQuadAll(cullingFlag)), which reduces SM pressure while preserving compatibility with hybrid reduction's quad-level fallback, as sketched below.
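The early-out itself reduces to a single quad-uniform check, sketched here under the assumption that GL_EXT_shader_quad_control is available:

```glsl
// After line 12 of Procedure 1: if every fragment in the 2x2 quad is culled,
// the whole quad exits before any gradient math. subgroupQuadAll is provided
// by GL_EXT_shader_quad_control and keeps control flow quad-uniform.
if (subgroupQuadAll(cullingFlag)) return;
```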
IV-B Splat-Subgroup Cohesion Rate
Before evaluating performance metrics, we first present quantitative validation of the Splat-Subgroup Cohesion Hypothesis discussed in Section III-B3. We measure cohesion rates across varying scenes (from the MipNeRF360 dataset Barron et al. (2022)), camera viewpoints, and resolutions (splat-incoherent fragments are depicted in red in Figure 1), with the results visualized in Figure 2.
We observe that scenes with higher geometric complexity (e.g., bicycle, garden, stump) exhibit lower cohesion rates than simpler scenes (room, counter, kitchen) at equivalent resolutions, while cohesion rates scale monotonically with rendering resolution, rising steadily from the lowest tested resolution and ultimately approaching saturation at the highest.
As geometric complexity decreases and rendering resolution increases, individual splats cover more pixels in screen space, leading to the above correlation. The expanded splat coverage increases the likelihood of processing multiple pixels of a splat primitive within the same fragment shader subgroup, thereby statistically elevating cohesion rates. These measurable dependencies can thus validate our Splat-Subgroup Cohesion Hypothesis.
Notably, splat-incoherent fragments exhibit grid-aligned distributions, suggesting GPU-specific rasterization thread scheduling patterns.
IV-C Gradient Reduction Performance
We begin performance evaluation with gradient reduction methods, testing (1) naive atomicAdd, (2) quad reduction, (3) subgroup reduction, and (4) hybrid reduction on the MipNeRF360 dataset. Our analysis focuses on backward pass rasterization performance to validate hybrid reduction's efficacy (see Table I). For the subgroup and hybrid methods, we assess the balancing threshold's impact while monitoring two factors: the proportion of fragments using atomicAdd (atomic-add rate) and runtime performance.
| Method | Atomic-Add Rate | Rasterization Perf. | Speedup |
|---|---|---|---|
| Naive | 1.000 | 462.34 ms | 1× |
| Quad | 0.287 | 151.46 ms | 3.05× |
| Subgroup (no threshold) | 0.060 | 45.70 ms | 10.12× |
| Subgroup (with threshold) | 0.064 | 45.34 ms | 10.20× |
| Hybrid (no threshold) | 0.051 | 44.97 ms | 10.28× |
| Hybrid (with threshold) | 0.055 | 44.27 ms | 10.44× |
Results demonstrate that hybrid reduction achieves up to a 10.44× speedup in backward pass rasterization compared to naive atomicAdd, while outperforming standalone quad and subgroup reductions. Employing an optimal balancing threshold for the subgroup and hybrid methods further enhances performance despite slightly increasing the atomic-add rate.
IV-D Overall Performance
We conducted performance measurements on the MipNeRF360 dataset across various render target formats, comparing our method with tile-based methods (see Table II).
| Method | Fwd. Rast. | Speedup | Fwd. Overall | Speedup | Bwd. Rast. | Speedup | Bwd. Overall | Speedup | End-to-End | Speedup |
|---|---|---|---|---|---|---|---|---|---|---|
| 3DGS baseline | 16.74 ms | 1× | 36.06 ms | 1× | 141.3 ms | 1× | 143.7 ms | 1× | 179.7 ms | 1× |
| DISTWAR | — | — | — | — | 35.83 ms | 3.94× | 38.34 ms | 3.75× | 74.40 ms | 2.42× |
| 3DGS float16 | 17.08 ms | 0.98× | 36.19 ms | 1.00× | 140.3 ms | 1.01× | 142.7 ms | 1.01× | 178.9 ms | 1.00× |
| Ours (float32) | 62.03 ms | 0.27× | 63.86 ms | 0.56× | 48.67 ms | 2.90× | 51.04 ms | 2.82× | 114.9 ms | 1.56× |
| Ours (float16) | 10.07 ms | 1.66× | 11.91 ms | 3.03× | 44.27 ms | 3.19× | 46.65 ms | 3.08× | 58.56 ms | 3.07× |
| Ours (unorm16) | 10.06 ms | 1.66× | 11.89 ms | 3.03× | 45.05 ms | 3.14× | 47.44 ms | 3.03× | 59.32 ms | 3.03× |
| Ours (unorm8) | 6.29 ms | 2.66× | 8.13 ms | 4.44× | 42.77 ms | 3.30× | 45.12 ms | 3.18× | 53.25 ms | 3.37× |
Our method demonstrates superior backward pass and end-to-end performance over baseline 3DGS, achieving over a 2.9× speedup in backward rasterization. This validates the effectiveness of hybrid reduction, though fragment shader limitations constrain backward pass performance compared to DISTWAR.
When using low-precision render targets (float16, unorm16, unorm8), our approach attains a 3× or greater speedup in the forward pass compared to tile-based methods, thereby achieving better end-to-end performance than DISTWAR despite the backward pass disadvantage.
In contrast, tile-based methods show negligible benefits from the mixed-precision method (1.00× end-to-end speedup for float16), as they operate on image pixels using ALUs and registers, with no repeated global image reads or writes; float16 textures thus only marginally reduce memory bandwidth, and half-precision ALU operations for pixel calculations provide limited advantages.
Surprisingly, float32 render targets exhibit severely degraded forward rasterization performance (0.27× of tile-based methods), with backward passes outperforming forward ones. This reveals GPU optimization deficiencies for rasterization and blending on full-precision formats, directly motivating our mixed-precision design.
IV-E Memory-Efficiency
In addition to runtime performance, we comprehensively investigate memory behavior. Table III compares the memory footprint of our method and the tile-based approach (3DGS baseline) on the MipNeRF360 dataset, where the metrics represent the minimum GPU memory allocation required for processing the entire dataset. By leveraging hardware rasterization, our method inherently bypasses splat-tile pair sorting, achieving a 37.4× reduction in sorting-stage memory consumption and thus enabling over 4× total memory savings compared to tile-based methods without specialized optimizations. Adopting low-precision texture formats provides additional memory savings through compressed storage.
| Method | Sorting Mem. | Reduction | Overall Mem. | Reduction |
|---|---|---|---|---|
| 3DGS baseline | 3.93 GB | 1× | 5.09 GB | 1× |
| Ours (float32) | 105 MB | 37.4× | 1.23 GB | 4.14× |
| Ours (unorm8) | 105 MB | 37.4× | 1.02 GB | 4.99× |
| Format | RMSE | MRE (large magnitudes) | MRE (medium) | MRE (small) |
|---|---|---|---|---|
| float16 | 0.197 | 0.022 | 0.104 | 1.594 |
| unorm16 | 0.569 | 0.031 | 0.029 | 0.100 |
| unorm8 | 2.358 | 0.273 | 1.318 | 23.83 |
IV-F Mixed Precision Errors
Lastly, we analyze the gradient computation errors of three low-precision texture formats (float16, unorm16, unorm8) relative to the float32 baseline under mixed-precision rendering.
Results (Table IV) indicate that the 16-bit float/unorm formats exhibit superior precision retention, demonstrating lower root-mean-square errors (RMSE) and mean relative errors (MRE). Both formats achieve MRE below 0.04 for large gradient magnitudes, with RMSE values below 0.6. Notably, unorm16 exhibits higher RMSE (0.569) than float16 (0.197), while its MRE remains lower for small gradient magnitudes. This discrepancy suggests that while float16 maintains global accuracy, it is sensitive to small-magnitude gradients, whereas unorm16 demonstrates poorer global precision but better consistency across scales.
In contrast, unorm8 shows significantly degraded accuracy, with RMSE exceeding float16's by over an order of magnitude and MRE peaking at 23.83 in the smallest magnitude range. These findings confirm that float16 and unorm16 preserve gradient accuracy comparable to full-precision (float32) computation, whereas unorm8 introduces substantial errors that could negatively impact the optimization and training stability of 3DGS.
V Conclusion and Future Work
Our differentiable 3DGS hardware rasterization method enables per-pixel gradient computation through programmable blending, accelerated by a hybrid gradient reduction strategy for splat-wise gradient accumulation. To address GPU inefficiencies with float32 render targets, we employ 16-bit formats that preserve numerical fidelity while delivering significant performance gains, achieving over a 3.03× end-to-end speedup compared to the original tile-based 3DGS implementation. The hardware rasterization paradigm inherently reduces memory consumption by over 4.1× compared to tile-based approaches, enabling resource-constrained 3DGS training.
Future work includes extending this architecture to Gaussian Splatting variants like 2DGS Huang et al. (2024) and exploring mobile deployment for on-device training. While mobile GPUs natively support programmable blending, architectural differences affecting the hybrid reduction implementation require thorough validation, which is a key technical challenge for cross-platform adaptation.
References
- Zwicker et al. (2002) Matthias Zwicker, Hanspeter Pfister, Jeroen van Baar, and Markus Gross. 2002. EWA splatting. IEEE Transactions on Visualization and Computer Graphics 8, 3 (2002), 223–238.
- Adinets and Merrill (2022) Andy Adinets and Duane Merrill. 2022. Onesweep: A faster least significant digit radix sort for gpus. arXiv preprint arXiv:2206.01784 (2022).
- Apple Inc. (2024) Apple Inc. 2024. Metal feature set tables. Apple Inc. https://developer.apple.com/metal/Metal-Feature-Set-Tables.pdf Accessed: 2025-05-17.
- Aras-P (2023) Aras-P. 2023. Aras-P/Unitygaussiansplatting: Toy Gaussian splatting visualization in Unity. https://github.com/aras-p/UnityGaussianSplatting
- Arm Holdings plc (2025) Arm Holdings plc 2025. Arm GPU Best Practices Developer Guide - Multipass rendering. Arm Holdings plc. https://developer.arm.com/documentation/101897/0304/Fragment-shading/Multipass-rendering Accessed: 2025-05-17.
- Bailey (2019) Mike Bailey. 2019. Introduction to the Vulkan® computer graphics API. In SIGGRAPH Asia 2019 Courses. 1–155.
- Barron et al. (2022) Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. 2022. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 5470–5479.
- Bulo et al. (2025) Samuel Rota Bulo, Nemanja Bartolovic, Lorenzo Porzi, and Peter Kontschieder. 2025. Hardware-Rasterized Ray-Based Gaussian Splatting. arXiv preprint arXiv:2503.18682 (2025).
- Durvasula et al. (2025) Sankeerth Durvasula, Adrian Zhao, Fan Chen, Ruofan Liang, Pawan Kumar Sanjaya, Yushi Guan, Christina Giannoula, and Nandita Vijaykumar. 2025. ARC: Warp-level Adaptive Atomic Reduction in GPUs to Accelerate Differentiable Rendering. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1. 64–83.
- Durvasula et al. (2023) Sankeerth Durvasula, Adrian Zhao, Fan Chen, Ruofan Liang, Pawan Kumar Sanjaya, and Nandita Vijaykumar. 2023. Distwar: Fast differentiable rendering on raster-based rendering pipelines. arXiv preprint arXiv:2401.05345 (2023).
- Enderton et al. (2010) Eric Enderton, Erik Sintorn, Peter Shirley, and David Luebke. 2010. Stochastic transparency. In Proceedings of the 2010 ACM SIGGRAPH symposium on Interactive 3D Graphics and Games. 157–164.
- Fan et al. (2024) Zhiwen Fan, Kevin Wang, Kairun Wen, Zehao Zhu, Dejia Xu, Zhangyang Wang, et al. 2024. Lightgaussian: Unbounded 3d gaussian compression with 15x reduction and 200+ fps. Advances in neural information processing systems 37 (2024), 140138–140158.
- Fang and Wang (2024) Guangchi Fang and Bing Wang. 2024. Mini-splatting: Representing scenes with a constrained number of gaussians. In European Conference on Computer Vision. Springer, 165–181.
- Feng et al. (2024a) Guofeng Feng, Siyan Chen, Rong Fu, Zimu Liao, Yi Wang, Tao Liu, Zhilin Pei, Hengjie Li, Xingcheng Zhang, and Bo Dai. 2024a. Flashgs: Efficient 3d gaussian splatting for large-scale and high-resolution rendering. arXiv preprint arXiv:2408.07967 (2024).
- Feng et al. (2024b) Yu Feng, Weikai Lin, Zihan Liu, Jingwen Leng, Minyi Guo, Han Zhao, Xiaofeng Hou, Jieru Zhao, and Yuhao Zhu. 2024b. Potamoi: Accelerating neural rendering via a unified streaming architecture. ACM Transactions on Architecture and Code Optimization 21, 4 (2024), 1–25.
- Girish et al. (2024) Sharath Girish, Kamal Gupta, and Abhinav Shrivastava. 2024. Eagles: Efficient accelerated 3d gaussians with lightweight encodings. In European Conference on Computer Vision. Springer, 54–71.
- Gui et al. (2024) Hao Gui, Lin Hu, Rui Chen, Mingxiao Huang, Yuxin Yin, Jin Yang, Yong Wu, Chen Liu, Zhongxu Sun, Xueyang Zhang, et al. 2024. Balanced 3DGS: Gaussian-wise parallelism rendering with fine-grained tiling. arXiv preprint arXiv:2412.17378 (2024).
- Huang et al. (2024) Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2024. 2d gaussian splatting for geometrically accurate radiance fields. In ACM SIGGRAPH 2024 conference papers. 1–11.
- Kerbl et al. (2023) Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 2023. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 42, 4 (2023), 139–1.
- Kheradmand et al. (2025) Shakiba Kheradmand, Delio Vicini, George Kopanas, Dmitry Lagun, Kwang Moo Yi, Mark Matthews, and Andrea Tagliasacchi. 2025. StochasticSplats: Stochastic Rasterization for Sorting-Free 3D Gaussian Splatting. arXiv preprint arXiv:2503.24366 (2025).
- Knapitsch et al. (2017) Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. 2017. Tanks and Temples: Benchmarking Large-Scale Scene Reconstruction. ACM Transactions on Graphics 36, 4 (2017).
- Kwok (2023) Kevin Kwok. 2023. Splat: A WebGL Implementation of A Real-time Renderer for 3D Gaussian Splatting for Real-Time Radiance Field Rendering. https://github.com/antimatter15/splat.
- Lee et al. (2024a) Junseo Lee, Seokwon Lee, Jungi Lee, Junyong Park, and Jaewoong Sim. 2024a. Gscore: Efficient radiance field rendering via architectural support for 3d gaussian splatting. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3. 497–511.
- Lee et al. (2024b) Joo Chan Lee, Daniel Rho, Xiangyu Sun, Jong Hwan Ko, and Eunbyung Park. 2024b. Compact 3d gaussian representation for radiance field. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 21719–21728.
- Lin et al. (2025) Weikai Lin, Yu Feng, and Yuhao Zhu. 2025. MetaSapiens: Real-Time Neural Rendering with Efficiency-Aware Pruning and Accelerated Foveated Rendering. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1. 669–682.
- Merrill and Garland (2016) Duane Merrill and Michael Garland. 2016. Single-pass parallel prefix scan with decoupled look-back. NVIDIA, Tech. Rep. NVR-2016-002 (2016).
- Mildenhall et al. (2021) Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. 2021. Nerf: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 65, 1 (2021), 99–106.
- Niemeyer et al. (2024) Michael Niemeyer, Fabian Manhardt, Marie-Julie Rakotosaona, Michael Oechsle, Daniel Duckworth, Rama Gosula, Keisuke Tateno, John Bates, Dominik Kaeser, and Federico Tombari. 2024. Radsplat: Radiance field-informed gaussian splatting for robust real-time rendering with 900+ fps. arXiv preprint arXiv:2403.13806 (2024).
- Xu (2024) Zhen Xu. 2024. Fast Gaussian Rasterization. GitHub. https://github.com/dendenxu/fast-gaussian-rasterization
- Xu et al. (2024) Zhen Xu, Sida Peng, Haotong Lin, Guangzhao He, Jiaming Sun, Yujun Shen, Hujun Bao, and Xiaowei Zhou. 2024. 4k4d: Real-time 4d view synthesis at 4k resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 20029–20040.
- Ye et al. (2024) Vickie Ye, Ruilong Li, Justin Kerr, Matias Turkulainen, Brent Yi, Zhuoyang Pan, Otto Seiskari, Jianbo Ye, Jeffrey Hu, Matthew Tancik, and Angjoo Kanazawa. 2024. gsplat: An Open-Source Library for Gaussian Splatting. arXiv preprint arXiv:2409.06765 (2024). arXiv:2409.06765 [cs.CV] https://arxiv.org/abs/2409.06765