ARS2C-AGA: Gliding Directly Towards Global Optima Along Geodesics of the Loss Landscape

This project is a research framework focused on second-order optimization dynamics and information geometry.

1. Theoretical Foundation: From Diagonal Fisher to Full-rank NGD

The core design of ARS2-Neo is based on a deep reconstruction of modern optimization algorithms, aimed at overcoming the limitations of first-order optimizers in ill-conditioned curvature landscapes.

1.1 Parameter De-correlation

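Through Muon's Newton-Schulz iteration, ARS2-Neo approximately orthogonalizes each update matrix, replacing it with a near-orthogonal matrix close to its polar factor (a point on the Stiefel manifold). Because this equalizes the singular values of the update, no single direction dominates the step, which we interpret as a form of de-correlation in parameter space. The further claims that this eliminates internal covariate shift and purifies the gradient information are working hypotheses rather than established results.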
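The orthogonalization step can be illustrated with a small NumPy sketch. It uses the classic cubic Newton-Schulz iteration, not the tuned polynomial that Muon ships, and the function name `newton_schulz_orthogonalize` is a hypothetical helper chosen for this example rather than something taken from `optimizer/`.

```python
import numpy as np

def newton_schulz_orthogonalize(G: np.ndarray, steps: int = 30) -> np.ndarray:
    """Approximate the polar factor U V^T of G with the cubic Newton-Schulz iteration."""
    # Scale by the Frobenius norm so every singular value lies in (0, 1],
    # which is inside the convergence region of the iteration.
    X = G / (np.linalg.norm(G) + 1e-12)
    for _ in range(steps):
        # Each step maps a singular value s to 1.5*s - 0.5*s**3, pushing it towards 1.
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 6))      # stand-in for a gradient or momentum matrix
O = newton_schulz_orthogonalize(G)

# Rows of O are approximately orthonormal, so O @ O.T is close to the identity.
row_gram = O @ O.T

# Reference value: the exact polar factor computed from an SVD.
U, _, Vt = np.linalg.svd(G, full_matrices=False)
polar = U @ Vt
print(np.abs(row_gram - np.eye(4)).max(), np.abs(O - polar).max())
```

The iteration only needs matrix multiplications, which is why it is attractive on accelerators compared with an explicit SVD.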

1.2 Full-rank Fisher Approximation and NGD

For any orthogonal matrix R and diagonal matrix D, the lifted product RDR^T is a symmetric, generally non-diagonal matrix with the same eigenvalues as D, so it is full-rank whenever D has no zero entries; this identity holds regardless of where the curvature comes from. Adam is often interpreted as a diagonal preconditioner built from gradient second moments, and ARS2-Neo composes that diagonal scaling with a matrix-level orthogonalization step reminiscent of a polar-factor mixing. If the mixing basis R drifts slowly and stays correlated with a curvature eigenbasis while the diagonal D tracks the corresponding spectrum, then RDR^T can be viewed as a structured natural-gradient preconditioner in the original coordinates, which mirrors the intuition behind Amari (1998) and practical approximations such as K-FAC and Shampoo.

This is best phrased as an empirically testable hypothesis rather than a mathematical identity: when ARS2-Neo maintains the required alignment, the composite operator can resemble Natural Gradient Descent (NGD), and our Wikitext-2 run (training loss ≈ 0.9 by 20 epochs) is consistent with strong preconditioned descent on the training objective. If, instead, the orthogonalization merely reshapes singular values without grounding in curvature statistics, the lifted RDR^T loses its connection to the true Fisher/Hessian and the NGD analogy weakens.
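The alignment condition can be made concrete with a toy NumPy check. This is a sketch under stated assumptions: the curvature matrix `F` is synthetic, and `R` and `d` come from its exact eigendecomposition, which a real optimizer would not have. It shows that diagonal scaling in an aligned rotated basis reproduces the natural-gradient direction F^{-1} g, while the same spectrum in a misaligned basis does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Hypothetical curvature matrix F: symmetric positive definite, built for illustration.
A = rng.standard_normal((n, n))
F = A @ A.T + n * np.eye(n)

# The eigendecomposition supplies an orthogonal basis R and a positive spectrum d.
d, R = np.linalg.eigh(F)

g = rng.standard_normal(n)           # stand-in gradient

# Diagonal preconditioning in rotated coordinates: rotate, rescale, rotate back.
step_lifted = R @ ((R.T @ g) / d)

# Natural-gradient direction F^{-1} g computed directly.
step_ngd = np.linalg.solve(F, g)

# A misaligned orthogonal basis Q keeps the same spectrum but gives a different step.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
step_misaligned = Q @ ((Q.T @ g) / d)

print(np.abs(step_lifted - step_ngd).max())
print(np.abs(step_misaligned - step_ngd).max())
```

The first difference is at rounding level and the second is not, which is the sense in which the NGD analogy depends on R tracking a curvature eigenbasis.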

1.3 Global Optima and MDL Principle

While NGD provides rapid convergence, it is prone to falling into "sharp minima" (overfitting). ARS2-Neo introduces Manifold-Aware SAM (Sharpness-Aware Minimization):

Flatness Constraint: By searching for adversarial directions on the Riemannian manifold, the algorithm is guided towards broader basins in the loss landscape.
MDL Correspondence: Under the Minimum Description Length (MDL) principle, flatter regions correspond to simpler model explanations, since the parameters can be specified with less precision, and such solutions are expected to generalize better.
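For reference, the two-step structure of sharpness-aware minimization can be sketched as follows. This is the standard Euclidean SAM update on a toy quadratic, not the manifold-aware variant used in ARS2-Neo; the loss, the learning rate, and the function name `sam_step` are assumptions made for the example.

```python
import numpy as np

def loss(w):
    # Toy quadratic with one sharp and one flat direction (illustrative only).
    return 0.5 * (10.0 * w[0] ** 2 + 0.1 * w[1] ** 2)

def grad(w):
    return np.array([10.0 * w[0], 0.1 * w[1]])

def sam_step(w, lr=0.05, rho=0.1):
    g = grad(w)
    # Ascent step: move to the approximate worst-case point inside an L2 ball of radius rho.
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # Descent step: apply the gradient evaluated at the perturbed point to the original weights.
    return w - lr * grad(w + eps)

w = np.array([1.0, 1.0])
initial = loss(w)
for _ in range(50):
    w = sam_step(w)
final = loss(w)
print(initial, final)
```

A manifold-aware version would replace the Euclidean norm and the additive perturbation with their Riemannian counterparts; that part is not shown here.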

2. Core Mechanism: Energy-Geometry Decoupling

ARS2-Neo decomposes the optimization process into two independent operators:

Statistical Operator (Energy): Uses the second-moment corrected momentum norm from AdamW to determine the update step size, serving as a proxy for the rate of free-energy descent.
Structural Operator (Geometry): Keeps the update direction close to a geodesic of the manifold through pre-whitening and orthogonal projection.
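One possible reading of the two operators, written as a NumPy sketch. Everything here is an assumption for illustration: the function `decoupled_update` is hypothetical, the polar factor is computed with an exact SVD instead of Newton-Schulz, and the way the energy rescales the direction is a guess at the intent rather than the code in `optimizer/ars2_neo.py`.

```python
import numpy as np

def decoupled_update(m_hat, v_hat, lr=1e-3, eps=1e-8):
    """Hypothetical sketch: step size from an Adam-style norm, direction from orthogonalization."""
    # Statistical operator (energy): norm of the second-moment corrected momentum.
    u = m_hat / (np.sqrt(v_hat) + eps)
    energy = np.linalg.norm(u)
    # Structural operator (geometry): orthogonal polar factor of the pre-whitened momentum.
    U, _, Vt = np.linalg.svd(u, full_matrices=False)
    direction = U @ Vt
    # Rescale so the update has Frobenius norm lr * energy while keeping the orthogonal direction.
    return -lr * energy * direction / np.linalg.norm(direction)

rng = np.random.default_rng(0)
m_hat = rng.standard_normal((3, 4))              # bias-corrected first moment (stand-in)
v_hat = rng.uniform(0.5, 2.0, size=(3, 4))       # bias-corrected second moment (stand-in)
delta = decoupled_update(m_hat, v_hat)

expected_norm = 1e-3 * np.linalg.norm(m_hat / (np.sqrt(v_hat) + 1e-8))
print(np.linalg.norm(delta), expected_norm)
```

In this reading the magnitude of the step depends only on the Adam-style statistics, and the shape of the step depends only on the orthogonalized matrix, so either part can be changed without touching the other.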

3. Key Experimental Results (LRP Verification)

3.1 Wikitext-2 Language Modeling

Experimental Setup: Qwen3 (RoPE, 3-layer), Context 255. Aimed at probing optimization stability on ill-conditioned curvature manifolds.

Optimizer	Best PPL	Final PPL	Best Eval Loss	Final Eval Loss	Final Train Loss	Avg Time	PPL Gap
AdamW	116.46	213.52	4.76	5.36	2.9740	314s	+97.06
Muon	111.35	475.65	4.71	6.16	2.2938	445s	+364.30
ARS2-Neo (Base)	96.10	3055.47	4.57	8.02	0.9123	425s	+2959.37
ARS2-Neo (Sync)	90.69	330.85	4.51	5.80	1.6100	784s	+240.16
ARS2-Neo (AGA)	93.23	414.83	4.54	6.03	1.5906	546s	+321.60
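The columns of this table are related by two simple identities, which the short check below makes explicit. It assumes the eval loss is a cross-entropy in nats, so that perplexity is exp(loss), and that the PPL Gap column is final PPL minus best PPL; both assumptions agree with the reported numbers up to rounding.

```python
import math

# Columns: optimizer, best PPL, final PPL, best eval loss, final eval loss, reported PPL gap.
ROWS = [
    ("AdamW", 116.46, 213.52, 4.76, 5.36, 97.06),
    ("Muon", 111.35, 475.65, 4.71, 6.16, 364.30),
    ("ARS2-Neo (Base)", 96.10, 3055.47, 4.57, 8.02, 2959.37),
    ("ARS2-Neo (Sync)", 90.69, 330.85, 4.51, 5.80, 240.16),
    ("ARS2-Neo (AGA)", 93.23, 414.83, 4.54, 6.03, 321.60),
]

def check_row(row, tol=0.01):
    name, best_ppl, final_ppl, best_loss, final_loss, gap = row
    # Perplexity is exp(loss); losses are rounded to 2 decimals, hence the 1% tolerance.
    ppl_ok = (abs(math.exp(best_loss) - best_ppl) / best_ppl < tol
              and abs(math.exp(final_loss) - final_ppl) / final_ppl < tol)
    # The gap column matches final PPL minus best PPL.
    gap_ok = abs((final_ppl - best_ppl) - gap) < 0.005
    return name, ppl_ok, gap_ok

results = [check_row(r) for r in ROWS]
for name, ppl_ok, gap_ok in results:
    print(f"{name:18s} ppl=exp(loss): {ppl_ok}  gap=final-best: {gap_ok}")
```

Read this way, the PPL Gap measures how far each run drifted from its best checkpoint by the end of training, which is largest for ARS2-Neo (Base), the run with the lowest final training loss.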

3.2 CIFAR-10 Visual Classification

Experimental Setup: ResNet-18, Batch Size 256.

Optimizer	Best Acc	Final Acc	Final Train Loss	Best Eval Loss	Final Eval Loss	Avg Epoch Time	Gen Gap
ARS2-Neo (Sync, ρ=0.1)	95.87%	95.73%	0.0347	0.1500	0.1500	104s	+0.14
ARS2-Neo (Base)	95.58%	95.52%	0.0181	0.2400	0.2500	71s	+0.06
ARS2-Neo (AGA, λ=2.0)	94.10%	94.09%	0.1251	0.1800	0.1800	90s	+0.01
AdamW	94.60%	94.47%	0.0451	0.2500	0.2700	58s	+0.13
Muon	93.76%	93.69%	0.0267	0.2900	0.2900	75s*	+0.07

* Muon's CIFAR-10 average epoch time is affected by a single outlier epoch of 35331s (~9.8 hours); typical epochs take ~75s, which is the value shown in the table.

3.3 Grokking Phenomenon Acceleration

To verify the dynamic characteristics of the optimizer during generalization phase transitions, we compared the performance of various optimizers on a modular addition task (p=113, train_frac=0.3).

Optimizer	Fitting (Ep)	Grokking (Ep)	Converge (Ep)	Best Eval Acc
AdamW	113	>600	N/A	15.65%
Muon	22	>347	N/A	36.83%
ARS2-Neo (Base)	11	239	290	99.53%
ARS2-Neo (AGA)	12	77	116	99.60%
ARS2C (AGA)	13	93	137	99.06%
ARS2C (Scaler) (AGA)	13	75	172	99.03%
ARS2D (Base)	11	237	264	99.05%
ARS2D (AGA)	12	60	112	99.00%

Core Insight: We attribute the faster grokking to Energy-Geometry Decoupling, which appears to reduce the time spent in overfitting basins before a generalizing solution is reached. ARS2D (AGA) groks at epoch 60 and converges at epoch 112, the fastest among all variants. AdamW does not grok within 600 epochs, and Muon does not grok within the 347 epochs recorded for its run.
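For readers unfamiliar with the task, the modular addition dataset with p=113 and train_frac=0.3 can be built as in the sketch below. The function name, the shuffling scheme, and the seed are assumptions; the actual experiment script may split the data differently.

```python
import random

def modular_addition_split(p=113, train_frac=0.3, seed=0):
    """Enumerate all pairs (a, b) in Z_p with label (a + b) mod p and split them at random."""
    pairs = [((a, b), (a + b) % p) for a in range(p) for b in range(p)]
    rng = random.Random(seed)        # fixed seed so the split is reproducible
    rng.shuffle(pairs)
    n_train = int(train_frac * len(pairs))
    return pairs[:n_train], pairs[n_train:]

train, test = modular_addition_split()
print(len(train), len(test))  # 3830 8939
```

With only 30% of the 12769 possible pairs seen in training, a model can memorize the training set long before it learns the underlying rule, which is the gap that the Fitting and Grokking columns measure.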

4. Quick Start

4.1 Installation

# uv is recommended
uv sync

4.2 Running Experiments

# Run WikiText-2 Sync Mode (Optimal Generalization)
# Note: the experiment directory is `exp/wikitext-2`, so use script path invocation.
python exp/wikitext-2/train.py --config config/lrp_wikitext2_ars2_neo_sync_10e.toml

# Run CIFAR-10 AGA Mode (Efficient Convergence)
python -m exp.cifar.train --config config/lrp_cifar10_ars2_neo_aga_20e.toml

4.3 Result Tiers and Interpretation

LRP/Main experiments: directories named outputs/lrp_* are the main comparative results used in claims.
Verify/Smoke experiments: directories named outputs/verify_* are short sanity checks (often 1 epoch) and are not comparable to long-run LRP outcomes.
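When post-processing results, the two tiers can be separated by directory prefix. The helper below is a hypothetical sketch and not part of the repository; the demo creates made-up run directories in a temporary folder so that it does not depend on an existing outputs/ tree.

```python
import tempfile
from pathlib import Path

def classify_runs(outputs_dir):
    """Group run directories under outputs_dir by their naming prefix."""
    tiers = {"lrp": [], "verify": [], "other": []}
    for path in sorted(Path(outputs_dir).iterdir()):
        if not path.is_dir():
            continue
        if path.name.startswith("lrp_"):
            tiers["lrp"].append(path.name)        # main comparative runs
        elif path.name.startswith("verify_"):
            tiers["verify"].append(path.name)     # short sanity checks
        else:
            tiers["other"].append(path.name)
    return tiers

# Demo on a temporary directory; the run names below are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("lrp_example_a", "verify_example_b", "scratch"):
        (Path(tmp) / name).mkdir()
    tiers = classify_runs(tmp)
print(tiers)
```

Only the entries under "lrp" should feed comparison tables; the "verify" entries are useful for confirming that a configuration runs at all.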

5. Framework Structure

optimizer/: Core optimizer implementations, including ars2_neo.py.
exp/: Atomicized experiment scripts, decoupling data flow from model logic.
model/: Standard research models including Qwen3 (RoPE) and ResNet.
config/: TOML-based experiment configuration management.

Citation

@software{ARS2_Neo_2025,
  author = {Rui, L.},
  title = {ARS2-Neo: Gliding Directly Towards Global Optima Along Geodesics of the Loss Landscape},
  year = {2026},
  publisher = {GitHub},
  url = {https://github.com/dmf-archive/ARS}
}
