This project is a research framework focused on second-order optimization dynamics and information geometry.
The core design of ARS2-Neo is based on a deep reconstruction of modern optimization algorithms, aimed at overcoming the limitations of first-order optimizers in ill-conditioned curvature landscapes.
Through Muon's Newton-Schulz iteration, ARS2-Neo approximately orthogonalizes each update matrix, pushing it toward a semi-orthogonal matrix (Stiefel manifold constraint). Orthogonalization equalizes the singular values of the update, which can be read as de-correlating the update directions in parameter space; we interpret this as reducing internal covariate shift and purifying the gradient information.
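
The orthogonalization step can be illustrated with the classical cubic Newton-Schulz iteration `X <- 1.5 X - 0.5 X X^T X`. This is a minimal sketch in dependency-free Python, not the repository's implementation: Muon itself uses a tuned quintic polynomial, and the function name `newton_schulz_orthogonalize`, the step count, and the toy matrix are assumptions chosen for illustration.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def newton_schulz_orthogonalize(g, steps=30):
    """Approximate the orthogonal polar factor of g (assumed full rank)."""
    # Scale by the Frobenius norm so every singular value lies in (0, 1],
    # which is inside the convergence region of the cubic iteration.
    fro = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        # X <- 1.5 X - 0.5 X X^T X pushes each singular value toward 1.
        x = [[1.5 * p - 0.5 * q for p, q in zip(rx, rc)] for rx, rc in zip(x, xxtx)]
    return x

grad = [[2.0, 0.0, 1.0], [1.0, 3.0, 0.0]]  # hypothetical 2x3 "gradient"
q = newton_schulz_orthogonalize(grad)
gram = matmul(q, transpose(q))  # approaches the 2x2 identity
```

For a wide matrix such as this one, the rows of `q` become orthonormal, so `gram` is numerically the identity while the row space of `grad` is preserved.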
For any orthogonal matrix R and diagonal matrix D, the lifted product RDR^T is a symmetric matrix whose eigenvalues are the diagonal entries of D and whose eigenvectors are the columns of R; it is full-rank exactly when every diagonal entry of D is nonzero. This identity holds regardless of where the curvature comes from. Adam is often interpreted as a diagonal preconditioner built from gradient second moments, and ARS2-Neo composes that diagonal scaling with a matrix-level orthogonalization step reminiscent of a polar-factor mixing. If the mixing basis R drifts slowly and stays correlated with a curvature eigenbasis while the diagonal D tracks the corresponding spectrum, then RDR^T can be viewed as a structured natural-gradient preconditioner in the original coordinates, which mirrors the intuition behind Amari (1998) and practical approximations such as K-FAC and Shampoo.
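
The spectral structure of the lifted product RDR^T can be checked numerically on a small case. The sketch below uses a 2x2 rotation for R and an arbitrary positive diagonal D; the angle, the spectrum, and the gradient vector are hypothetical values, and nothing here is taken from the ARS2-Neo code.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def apply(m, v):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

theta = math.pi / 6  # hypothetical rotation angle
R = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]
D = [[4.0, 0.0], [0.0, 0.25]]  # hypothetical curvature spectrum

# Lifted operator: diagonal scaling expressed in the original coordinates.
P = matmul(matmul(R, D), transpose(R))

# The first column of R is an eigenvector of P with eigenvalue D[0][0].
r0 = [R[0][0], R[1][0]]
scaled = apply(P, r0)

# Preconditioning a gradient uses the inverse, R D^{-1} R^T.
D_inv = [[1.0 / D[0][0], 0.0], [0.0, 1.0 / D[1][1]]]
P_inv = matmul(matmul(R, D_inv), transpose(R))
g = [1.0, 2.0]             # hypothetical gradient
step = apply(P_inv, g)     # preconditioned direction
recovered = apply(P, step) # applying P again returns g
```

Because D is diagonal, inverting the full operator only requires inverting the diagonal entries, which is what makes this form cheap when R is available.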
This is best phrased as an empirically testable hypothesis rather than a mathematical identity. When ARS2-Neo maintains the required alignment, the composite operator can resemble Natural Gradient Descent (NGD), and our WikiText-2 run (training loss ≈ 0.9 by 20 epochs) is consistent with strong preconditioned descent. If, instead, the orthogonalization merely reshapes singular values without grounding in curvature statistics, the lifted RDR^T loses its connection to the true Fisher/Hessian and the NGD analogy weakens.
While NGD can converge rapidly, it is prone to settling in "sharp minima" that generalize poorly (overfitting). To counter this, ARS2-Neo introduces Manifold-Aware SAM (Sharpness-Aware Minimization):
- Flatness Constraint: By searching for adversarial directions on the Riemannian manifold, the algorithm is guided towards broader basins in the loss landscape.
- MDL Correspondence: Under the Minimum Description Length (MDL) principle, flatter regions correspond to simpler model explanations, which tend to generalize better.
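
For orientation, the two-pass structure of a SAM step is sketched below on a toy quadratic. This is the standard Euclidean SAM step, not the manifold-aware variant described above, whose adversarial search is not specified here; the loss, `CURVATURE`, `sam_step`, and the default `lr` and `rho` values are all hypothetical.

```python
import math

CURVATURE = [10.0, 0.1]  # hypothetical ill-conditioned quadratic

def loss(w):
    return 0.5 * sum(c * x * x for c, x in zip(CURVATURE, w))

def grad(w):
    return [c * x for c, x in zip(CURVATURE, w)]

def sam_step(w, lr=0.05, rho=0.1):
    """One Euclidean SAM step: ascend to a nearby worst case, then descend."""
    g = grad(w)
    norm = math.sqrt(sum(x * x for x in g)) + 1e-12
    # Ascent: move a distance rho along the normalized gradient.
    eps = [rho * x / norm for x in g]
    w_adv = [x + e for x, e in zip(w, eps)]
    # Descent: apply the gradient measured at the perturbed point
    # to the original (unperturbed) weights.
    g_adv = grad(w_adv)
    w_new = [x - lr * gx for x, gx in zip(w, g_adv)]
    return w_new, eps

w = [1.0, 1.0]
w_next, eps = sam_step(w)
```

Each step costs two gradient evaluations, which is in line with the longer epoch times reported for the Sync variant in the tables.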
ARS2-Neo decomposes the optimization process into two independent operators:
- Statistical Operator (Energy): Uses the second-moment corrected momentum norm from AdamW to determine the update step size, serving as a proxy for the rate of free-energy descent.
- Structural Operator (Geometry): Keeps the update direction aligned with the manifold's geodesic through pre-whitening and orthogonal projection.
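
One possible reading of this decomposition is sketched below: the norm of an Adam-style whitened momentum sets the step size, and an orthogonalized copy of the same matrix sets the direction. This is an assumption-laden illustration, not the rule implemented in `optimizer/ars2_neo.py`; `decoupled_update`, the rescaling of the direction to unit Frobenius norm, and all numeric inputs are hypothetical.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def fro_norm(a):
    return math.sqrt(sum(v * v for row in a for v in row))

def orthogonalize(g, steps=30):
    """Cubic Newton-Schulz iteration (a stand-in for Muon's tuned polynomial)."""
    n = fro_norm(g)
    x = [[v / n for v in row] for row in g]
    for _ in range(steps):
        c = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * p - 0.5 * q for p, q in zip(r, s)] for r, s in zip(x, c)]
    return x

def decoupled_update(m_hat, v_hat, lr=1e-3, eps=1e-8):
    # Statistical operator (assumed form): Adam-style whitening of the
    # bias-corrected momentum; its Frobenius norm is the "energy".
    white = [[m / (math.sqrt(v) + eps) for m, v in zip(rm, rv)]
             for rm, rv in zip(m_hat, v_hat)]
    energy = fro_norm(white)
    # Structural operator (assumed form): orthogonalized direction,
    # rescaled so that only `energy` controls the step length.
    direction = orthogonalize(white)
    scale = lr * energy / fro_norm(direction)
    return [[-scale * d for d in row] for row in direction], energy

m_hat = [[0.2, -0.1], [0.05, 0.3]]    # hypothetical first moments
v_hat = [[0.04, 0.01], [0.01, 0.09]]  # hypothetical second moments
update, energy = decoupled_update(m_hat, v_hat)
```

Under these assumptions the update's length is exactly `lr * energy`, while its rows stay mutually orthogonal, so changing one operator leaves the other's contribution untouched.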
Experimental Setup: WikiText-2, Qwen3 (RoPE, 3-layer), Context 255. Aimed at probing optimization stability on ill-conditioned curvature manifolds.
| Optimizer | Best PPL | Final PPL | Best Eval Loss | Final Eval Loss | Final Train Loss | Avg Time | PPL Gap (Final - Best) |
|---|---|---|---|---|---|---|---|
| AdamW | 116.46 | 213.52 | 4.76 | 5.36 | 2.9740 | 314s | +97.06 |
| Muon | 111.35 | 475.65 | 4.71 | 6.16 | 2.2938 | 445s | +364.30 |
| ARS2-Neo (Base) | 96.10 | 3055.47 | 4.57 | 8.02 | 0.9123 | 425s | +2959.37 |
| ARS2-Neo (Sync) | 90.69 | 330.85 | 4.51 | 5.80 | 1.6100 | 784s | +240.16 |
| ARS2-Neo (AGA) | 93.23 | 414.83 | 4.54 | 6.03 | 1.5906 | 546s | +321.60 |
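
The columns of the table above can be cross-checked with a few lines of arithmetic. The sketch assumes that perplexity is the exponential of the mean evaluation loss in nats and that the PPL gap is final minus best perplexity; the table is consistent with both up to the rounding of the loss columns. The dictionary name `results` is hypothetical; the numbers are copied from the table.

```python
import math

# (best PPL, final PPL, best eval loss, final eval loss), copied from the table
results = {
    "AdamW": (116.46, 213.52, 4.76, 5.36),
    "Muon": (111.35, 475.65, 4.71, 6.16),
    "ARS2-Neo (Base)": (96.10, 3055.47, 4.57, 8.02),
    "ARS2-Neo (Sync)": (90.69, 330.85, 4.51, 5.80),
    "ARS2-Neo (AGA)": (93.23, 414.83, 4.54, 6.03),
}

gaps = {}
for name, (best_ppl, final_ppl, best_loss, final_loss) in results.items():
    # Perplexity implied by the (rounded) loss, for comparison with the table.
    implied_best = math.exp(best_loss)
    gaps[name] = round(final_ppl - best_ppl, 2)
    print(f"{name}: exp(best loss) = {implied_best:.1f}, PPL gap = +{gaps[name]:.2f}")
```

Because the loss columns are rounded to two decimals, the implied perplexities differ from the tabulated ones by up to about half a percent.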
Experimental Setup: CIFAR-10, ResNet-18, Batch Size 256.
| Optimizer | Best Acc | Final Acc | Final Train Loss | Best Eval Loss | Final Eval Loss | Avg Epoch Time | Gen Gap (Best - Final Acc, pp) |
|---|---|---|---|---|---|---|---|
| ARS2-Neo (Sync, ρ=0.1) | 95.87% | 95.73% | 0.0347 | 0.1500 | 0.1500 | 104s | +0.14 |
| ARS2-Neo (Base) | 95.58% | 95.52% | 0.0181 | 0.2400 | 0.2500 | 71s | +0.06 |
| ARS2-Neo (AGA, λ=2.0) | 94.10% | 94.09% | 0.1251 | 0.1800 | 0.1800 | 90s | +0.01 |
| AdamW | 94.60% | 94.47% | 0.0451 | 0.2500 | 0.2700 | 58s | +0.13 |
| Muon | 93.76% | 93.69% | 0.0267 | 0.2900 | 0.2900 | 75s* | +0.07 |
\* Muon's raw CIFAR-10 average epoch time includes a single outlier epoch of 35331s (~9.8 h); typical epochs take ~75s, which is the value reported in the table.
To verify the dynamic characteristics of the optimizer during generalization phase transitions, we compared the performance of various optimizers on a modular addition task (p=113, train_frac=0.3).
| Optimizer | Fitting (Ep) | Grokking (Ep) | Converge (Ep) | Best Eval Acc |
|---|---|---|---|---|
| AdamW | 113 | >600 | N/A | 15.65% |
| Muon | 22 | >347 | N/A | 36.83% |
| ARS2-Neo (Base) | 11 | 239 | 290 | 99.53% |
| ARS2-Neo (AGA) | 12 | 77 | 116 | 99.60% |
| ARS2C (AGA) | 13 | 93 | 137 | 99.06% |
| ARS2C (Scaler) (AGA) | 13 | 75 | 172 | 99.03% |
| ARS2D (Base) | 11 | 237 | 264 | 99.05% |
| ARS2D (AGA) | 12 | 60 | 112 | 99.00% |
Core Insight: Energy-Geometry Decoupling appears to avoid ineffective wandering in overfitting basins, instead traversing high-dimensional canyons directly to reach generalized solutions. ARS2D (AGA) achieves grokking in 60 epochs and convergence in 112 epochs, the fastest among all variants. Neither baseline groks within its recorded budget: AdamW shows no grokking after 600 epochs, and Muon none after 347.
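
For readers who want to reproduce the task setup, the modular addition dataset (p=113, train_frac=0.3) can be built as follows. This is a minimal sketch: the function name, the seed, and the uniform random split are assumptions, and the repository's own data pipeline may differ.

```python
import random

def make_modular_addition_split(p=113, train_frac=0.3, seed=0):
    """Enumerate every pair (a, b) with label (a + b) mod p and split it."""
    pairs = [((a, b), (a + b) % p) for a in range(p) for b in range(p)]
    rng = random.Random(seed)  # hypothetical seed, fixed for reproducibility
    rng.shuffle(pairs)
    n_train = int(train_frac * len(pairs))
    return pairs[:n_train], pairs[n_train:]

train, test = make_modular_addition_split()
print(len(train), len(test))
```

With p=113 there are 113 * 113 = 12769 examples in total, so only a minority of the addition table is seen during training and the rest measures generalization.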
```bash
# uv is recommended
uv sync
```

```bash
# Run WikiText-2 Sync Mode (Optimal Generalization)
# Note: the experiment directory is `exp/wikitext-2`, so use script path invocation.
python exp/wikitext-2/train.py --config config/lrp_wikitext2_ars2_neo_sync_10e.toml
# Run CIFAR-10 AGA Mode (Efficient Convergence)
python -m exp.cifar.train --config config/lrp_cifar10_ars2_neo_aga_20e.toml
```

- LRP/Main experiments: directories named `outputs/lrp_*` are the main comparative results used in claims.
- Verify/Smoke experiments: directories named `outputs/verify_*` are short sanity checks (often 1 epoch) and are not comparable to long-run LRP outcomes.
- `optimizer/`: Core optimizer implementations, including `ars2_neo.py`.
- `exp/`: Atomicized experiment scripts, decoupling data flow from model logic.
- `model/`: Standard research models including Qwen3 (RoPE) and ResNet.
- `config/`: TOML-based experiment configuration management.
```bibtex
@software{ARS2_Neo_2025,
  author = {Rui, L.},
  title = {ARS2-Neo: Gliding Directly Towards Global Optima Along Geodesics of the Loss Landscape},
  year = {2026},
  publisher = {GitHub},
  url = {https://github.com/dmf-archive/ARS}
}
```