
Modern GPUs are equipped with large amounts of high-bandwidth memory, enabling them to support mini-batch sizes of up to tens of thousands of training samples. However, most existing optimizers struggle to perform effectively at such large batch sizes. As the batch size increases, gradient noise decreases due to averaging over many samples, limiting the ability of first-order methods to escape sharp or suboptimal minima and reach the global minimum. Meanwhile, second-order methods like natural gradient with Kronecker-Factored Approximate Curvature (KFAC) often require excessively high damping to remain stable at large batch sizes. This high damping effectively "washes out" the curvature information that gives these methods their advantage, reducing their performance to that of simple gradient descent. In this paper, we introduce Fisher-Orthogonal Projection (FOP), a novel technique that restores the effectiveness of second-order methods at very large batch sizes, enabling scalable training with improved generalization and faster convergence. FOP constructs a variance-aware update direction by leveraging gradients from two sub-batches, enhancing the average gradient with a component of the gradient difference that is orthogonal to the average under the Fisher metric. Through extensive benchmarks, we show that FOP accelerates convergence by 1.2–1.3× over KFAC and 1.5–1.7× over SGD/AdamW at the same moderate batch sizes, while at extreme scales it achieves up to a 7.5× speedup. Unlike other methods, FOP maintains small-batch accuracy when scaling to extremely large batch sizes. Moreover, it reduces Top-1 error by 2.3–3.3% on long-tailed CIFAR benchmarks, demonstrating robust generalization under severe class imbalance. Our lightweight, geometry-aware use of intra-batch variance makes natural-gradient optimization practical on modern data-centre GPUs.
FOP is open-source and pip-installable, and can be integrated into existing training code with a single line and no extra configuration.
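To make the construction above concrete, here is a minimal sketch of an FOP-style update for a single parameter vector, assuming a diagonal Fisher approximation. The function name `fop_update` and the weighting coefficient `coeff` are illustrative choices, not the paper's API; the published method operates with KFAC's Kronecker-factored Fisher rather than a plain diagonal.

```python
import numpy as np

def fop_update(g1, g2, fisher_diag, coeff=1.0):
    """Sketch of a Fisher-Orthogonal Projection (FOP)-style step.

    g1, g2      : gradients from two sub-batches (1-D arrays)
    fisher_diag : diagonal approximation of the Fisher matrix (hypothetical
                  stand-in for the Kronecker-factored curvature used by KFAC)
    coeff       : weight on the orthogonal variance component (illustrative)
    """
    g_avg = 0.5 * (g1 + g2)    # average gradient over the two sub-batches
    g_diff = 0.5 * (g1 - g2)   # intra-batch gradient difference (variance signal)

    # Fisher inner product <a, b>_F = a^T F b, with F diagonal here.
    def fdot(a, b):
        return float(a @ (fisher_diag * b))

    # Remove the component of g_diff parallel to g_avg under the Fisher
    # metric, keeping only the Fisher-orthogonal part of the difference.
    g_perp = g_diff - (fdot(g_diff, g_avg) / fdot(g_avg, g_avg)) * g_avg

    # Variance-aware direction: average gradient plus its Fisher-orthogonal
    # difference component.
    return g_avg + coeff * g_perp
```

By construction, the added component satisfies ⟨g_perp, g_avg⟩_F = 0, so it injects intra-batch variance information without fighting the average descent direction under the Fisher geometry.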

More information

Original publication
DOI: 10.1609/aaai.v40i29.39590
Type: Conference paper
Publication Date: 2026-01-01
Volume: 40
Pages: 24115–24123
Total pages: 8