Computing Fibonacci at Scale
A deep dive into ℤ[√5], FFT multiplication, and Apple Silicon. How algebraic insight, algorithmic sophistication, and systems optimization combine to compute F(10,000,000) in under one second.
Why O(log n) Matters
The naive iterative approach computes F(n) via n additions. Each addition of k-digit numbers costs O(k) operations. For F(10,000,000), that's ten million additions of numbers with 2+ million digits.
But there's a better way. Binet's closed-form formula reveals that F(n) = (φⁿ − ψⁿ)/√5, where φ and ψ are the roots of x² − x − 1. These roots live in the quadratic field ℚ(√5), and by working in its ring ℤ[√5] we can compute with them exactly.
Binary exponentiation in this ring gives us F(n) in roughly log₂(n) ≈ 23 squarings for n = 10,000,000, each requiring only 3 BigInt multiplies vs 8 for the matrix approach.
Working in ℤ[√5]
The Fibonacci sequence appears simple, but its true nature is algebraic. The golden ratio φ = (1+√5)/2 and its conjugate ψ = (1-√5)/2 are roots of the minimal polynomial x² - x - 1.
This polynomial's discriminant leads us to the ring ℤ[√5] = {a + b√5 | a, b ∈ ℤ}. The key insight: φ itself lives in a slight extension of this ring, but powers of 1 + √5 = 2φ remain in ℤ[√5]. The √5 coefficient, after dividing out the accumulated powers of two, gives F(n) directly.
```swift
// Ring ℤ[√5] multiplication:
// (a + b√5)(c + d√5) = (ac + 5bd) + (ad + bc)√5
func square(_ x: (BigInt, BigInt)) -> (BigInt, BigInt) {
    let (a, b) = x
    // Only 3 multiplications: a², b², ab
    let ab = a * b
    return (a*a + 5*b*b, 2*ab)  // (a² + 5b²) + 2ab√5
}
```

3 BigInt multiplications per squaring, 62% fewer than the 8 of the 2×2 matrix approach. Over log₂(n) iterations, this compounds dramatically.
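The same idea works in any language with arbitrary-precision integers. Here is a minimal Python sketch (illustrative, not the app's Swift implementation) that computes F(n) by square-and-multiply on 1 + √5, using the identity (1+√5)ⁿ = 2ⁿ⁻¹·(L(n) + F(n)√5):

```python
def ring_mul(x, y):
    # (a + b√5)(c + d√5) = (ac + 5bd) + (ad + bc)√5
    a, b = x
    c, d = y
    return (a * c + 5 * b * d, a * d + b * c)

def fib(n):
    # (1 + √5)^n = 2^(n-1) * (L(n) + F(n)√5), so the √5 coefficient,
    # shifted right by n-1 bits, is exactly F(n). Requires n >= 1.
    result = (1, 0)   # multiplicative identity: 1 + 0√5
    base = (1, 1)     # 1 + √5
    e = n
    while e:
        if e & 1:
            result = ring_mul(result, base)
        base = ring_mul(base, base)  # 3 big-int multiplies when specialized to squaring
        e >>= 1
    return result[1] >> (n - 1)
```

The generic `ring_mul` uses 4 multiplies; the squaring case in the Swift code above specializes it down to 3.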
Binet's Formula
The connection between the golden ratio and Fibonacci numbers
The Formula
Define φ = (1+√5)/2 ≈ 1.618 (golden ratio) and ψ = (1−√5)/2 ≈ −0.618. These are roots of x² − x − 1 = 0.
Since |ψ| < 1, the term ψⁿ shrinks exponentially. For n > 10, F(n) is simply the nearest integer to φⁿ/√5.
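The rounding claim is easy to check with high-precision arithmetic. A quick Python illustration (using `decimal` purely for demonstration; this is not how the app computes F(n)):

```python
from decimal import Decimal, getcontext

getcontext().prec = 50                 # plenty of precision for n up to ~50
sqrt5 = Decimal(5).sqrt()
phi = (1 + sqrt5) / 2                  # golden ratio

def fib_binet(n):
    # F(n) is the nearest integer to φ^n / √5, since |ψ^n/√5| < 1/2
    return int((phi ** n / sqrt5).to_integral_value())
```

This fails for huge n only because fixed-precision φⁿ loses digits, which is exactly why the exact ℤ[√5] arithmetic matters.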
The Key Insight
We can express φ² = 1 + φ = (3 + √5)/2, but the halves are awkward. Instead we track 2φ = 1 + √5, which is the element (1, 1) of ℤ[√5]. Repeated squaring gives (1+√5)^(2^k) = 2^(2^k−1)·(L(2^k) + F(2^k)√5), so the √5 coefficient, divided by the known power of two, yields F(2^k) directly.
(1+√5)¹ = 1 + √5 → (1, 1)
(1+√5)² = 6 + 2√5 → (6, 2) → 2/2¹ = F(2) = 1
(1+√5)⁴ = 56 + 24√5 → (56, 24) → 24/2³ = F(4) = 3
(1+√5)⁸ = 6016 + 2688√5 → (6016, 2688) → 2688/2⁷ = F(8) = 21

Continued Fractions & Extremal Structure
The Most Irrational Number
The golden ratio has the simplest continued fraction: φ = [1; 1, 1, 1, ...]. This makes φ the "most irrational" number: its rational approximations converge more slowly than those of any other irrational.
The convergents are exactly the ratios of consecutive Fibonacci numbers, the best possible rational approximations to φ with bounded denominator.
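This is easy to verify numerically. A small Python sketch (illustrative) that evaluates the truncated continued fraction [1; 1, 1, ...] with exact rationals:

```python
from fractions import Fraction

def golden_convergent(depth):
    # Evaluate [1; 1, 1, ...] truncated to `depth` terms, from the tail inward
    x = Fraction(1)
    for _ in range(depth - 1):
        x = 1 + 1 / x
    return x
```

The sixth convergent is 13/8 = F(7)/F(6), the tenth is 89/55 = F(11)/F(10): every convergent is a ratio of consecutive Fibonacci numbers.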
Lamé's Theorem (1844)
The Euclidean algorithm for GCD(a, b) takes at most 5·log₁₀(min(a,b)) steps. This bound is tight precisely when a and b are consecutive Fibonacci numbers.
Fibonacci numbers characterize worst-case GCD behavior. One of the oldest algorithms in mathematics meets number theory's golden sequence.
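A few lines of Python (illustrative) confirm that consecutive Fibonacci numbers drive the Euclidean algorithm through its maximum number of steps: each division peels off exactly one Fibonacci index, so GCD(F(k+1), F(k)) takes k − 1 steps.

```python
def euclid_steps(a, b):
    # Count the division steps of the Euclidean algorithm
    steps = 0
    while b:
        a, b = b, a % b
        steps += 1
    return steps
```

For example, GCD(13, 8) = GCD(F(7), F(6)) takes 5 steps, and GCD(6765, 4181) = GCD(F(20), F(19)) takes 18.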
Hyperbolic Geometry Connection
The modular group SL(2,ℤ), the group of 2×2 integer matrices with determinant 1, acts on the hyperbolic plane. Continued fractions correspond to geodesics, and φ's all-ones expansion traces the "straightest" infinite geodesic. Fibonacci numbers enumerate vertices along this extremal path, connecting elementary number theory to hyperbolic geometry.
Ring vs Matrix
The classic O(log n) approach uses matrix exponentiation with [[1,1],[1,0]]. Squaring a 2×2 matrix requires 8 multiplications and 4 additions.
Our ring approach: just 3 multiplications and 2 additions per squaring step. Over 20 iterations, that's 160 vs 60 BigInt multiplications.
When each multiplication costs milliseconds on million-digit numbers, this difference is huge.
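For comparison, the matrix version can be sketched in Python (illustrative); counting the multiplies in `mat_mul` makes the 8-vs-3 gap concrete:

```python
def mat_mul(m, n):
    # General 2x2 matrix product: 8 big-integer multiplications
    (a, b), (c, d) = m
    (e, f), (g, h) = n
    return ((a*e + b*g, a*f + b*h),
            (c*e + d*g, c*f + d*h))

def fib_matrix(n):
    # [[1,1],[1,0]]^n = [[F(n+1), F(n)], [F(n), F(n-1)]]
    result = ((1, 0), (0, 1))     # identity matrix
    base = ((1, 1), (1, 0))
    while n:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result[0][1]
```

Both approaches are O(log n) multiplications; the ring representation simply carries less redundancy per squaring (the matrix is symmetric and its entries satisfy F(n+1) = F(n) + F(n−1)).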
When Numbers Get Huge
Standard multiplication of n-digit numbers is O(n²). At 2 million digits, that's 4 trillion operations per multiply. The FFT transforms this into O(n log n) via the convolution theorem.
We represent BigInts as polynomials where coefficients are base-2¹⁵ digits. Polynomial multiplication is convolution: IFFT(FFT(a) ⊙ FFT(b)).
```swift
func fftMultiply(_ a: [Double], _ b: [Double]) -> [Double] {
    let n = nextPowerOfTwo(a.count + b.count)

    // Forward FFT via Accelerate
    var aFreq = fft(padded(a, to: n))
    var bFreq = fft(padded(b, to: n))

    // Pointwise multiply in frequency domain
    vDSP_zvmulD(&aFreq, 1, &bFreq, 1, &aFreq, 1, vDSP_Length(n/2), 1)

    // Inverse FFT + carry propagation
    return ifft(aFreq)
}
```

Each digit fits in 15 bits. Product of two digits: 32767² ≈ 1.07×10⁹. For FFT size n = 2²⁰, convolution values reach n × 32767² ≈ 1.1×10¹⁵, still within double precision's 2⁵³ ≈ 9×10¹⁵ exact integer range.
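The whole pipeline can be sketched end-to-end in pure Python (illustrative; a textbook recursive FFT rather than the Accelerate-backed Swift code), including the carry propagation step that the Swift snippet elides:

```python
import cmath

def fft(a, invert=False):
    # Recursive radix-2 FFT; `invert` flips the twiddle sign (unnormalized)
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = -1 if invert else 1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def to_digits(v, base=1 << 15):
    # Least-significant-first digit list
    digits = []
    while v:
        v, d = divmod(v, base)
        digits.append(d)
    return digits or [0]

def from_digits(digits, base=1 << 15):
    return sum(d * base**i for i, d in enumerate(digits))

def fft_multiply(x, y, base=1 << 15):
    # x, y: digit lists; returns the digit list of the product
    n = 1
    while n < len(x) + len(y):
        n *= 2
    fx = fft([complex(d) for d in x] + [0j] * (n - len(x)))
    fy = fft([complex(d) for d in y] + [0j] * (n - len(y)))
    conv = fft([a * b for a, b in zip(fx, fy)], invert=True)
    # Round to integers (dividing by n normalizes the inverse FFT),
    # then propagate carries back into base-2^15 digits
    digits, carry = [], 0
    for c in conv:
        carry, d = divmod(int(round(c.real / n)) + carry, base)
        digits.append(d)
    while carry:
        carry, d = divmod(carry, base)
        digits.append(d)
    return digits
```

The structure is identical to the Swift version: pad, transform, pointwise multiply, inverse transform, carry.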
The O(n²) Bottleneck
When asymptotic complexity hides in plain sight
The Problem
Our initial implementation achieved O(log n) ring exponentiations, yet performance collapsed for large n. Profiling revealed the culprit: converting between BigInt and FFT digit representations.
```swift
// Naive conversion: O(n²) total
while value > 0 {
    digits.append(value % base)  // O(n)
    value /= base                // O(n)
}
// Called O(n) times = O(n²)
```

Each division is O(n). Performed O(n) times = O(n²) total, dominating the theoretically O(n log n log log n) multiplication.
The Fix
The attaswift/BigInt library stores magnitudes as arrays of 64-bit words. Each word contains ~4.27 base-2¹⁵ digits (64/15 ≈ 4.27). We can extract them via shifts and masks in a single O(n) pass.
```swift
// O(n) bit extraction
for word in bigInt.words {
    // Extract 15-bit chunks via
    // bit shifts across word boundaries
    while bitsRemaining >= 15 {
        digits.append(extractBits(15))
    }
}
```

This "minor" change reduced F(1,000,000) from multiple seconds to 85 milliseconds. A 500× improvement.
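The word-to-digit extraction is easy to model in Python (illustrative; attaswift/BigInt's actual word storage differs only in mechanics). A small bit accumulator drains 15-bit digits as 64-bit words stream in, using only shifts and masks:

```python
DIGIT_BITS = 15
WORD_BITS = 64

def to_words(v):
    # Little-endian 64-bit words, mimicking a BigInt's internal storage
    words, mask = [], (1 << WORD_BITS) - 1
    while v:
        words.append(v & mask)
        v >>= WORD_BITS
    return words or [0]

def extract_digits(words):
    # Single O(n) pass: no bignum division anywhere
    digits, acc, acc_bits = [], 0, 0
    for w in words:
        acc |= w << acc_bits          # stack the word above the leftover bits
        acc_bits += WORD_BITS
        while acc_bits >= DIGIT_BITS:
            digits.append(acc & ((1 << DIGIT_BITS) - 1))
            acc >>= DIGIT_BITS
            acc_bits -= DIGIT_BITS
    if acc:
        digits.append(acc)            # leftover high bits
    return digits
```

The accumulator never exceeds 64 + 15 bits, so each step is constant-time machine arithmetic.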
Lesson: Asymptotic complexity analysis must account for all operations, not just the theoretically dominant ones. A hidden O(n²) in your conversion routine will kill O(n log n) multiplication.
Apple Accelerate Framework
Hardware-optimized signal processing on Apple Silicon
```swift
import Accelerate

class FFTMultiplier {
    private var fftSetup: FFTSetupD?

    func multiply(_ a: inout DSPDoubleSplitComplex,
                  _ b: inout DSPDoubleSplitComplex,
                  length: Int) {
        let log2n = vDSP_Length(log2(Double(length)))

        // Forward FFT using vDSP
        vDSP_fft_zipD(fftSetup!, &a, 1, log2n, FFTDirection(FFT_FORWARD))
        vDSP_fft_zipD(fftSetup!, &b, 1, log2n, FFTDirection(FFT_FORWARD))

        // Complex multiply in frequency domain
        vDSP_zvmulD(&a, 1, &b, 1, &a, 1, vDSP_Length(length/2), 1)

        // Inverse FFT
        vDSP_fft_zipD(fftSetup!, &a, 1, log2n, FFTDirection(FFT_INVERSE))
    }
}
```

vDSP SIMD
Vectorized operations process 4-8 doubles simultaneously using ARM NEON or AVX instructions.
Radix-2 FFT
Power-of-two sizes enable the most efficient butterfly operations with perfect cache utilization.
Base 2¹⁵
Digits in [0, 32767] prevent overflow in frequency domain even after squaring and adding carries.
Swift & App Store Optimization
Concurrency: Computation runs on a dedicated Swift Task while UI polls results at 30fps. No blocking, no jank, even when multiplying million-digit numbers.
Thermal Throttling: Sustained CPU load heats the chip. We monitor ProcessInfo.thermalState and gracefully reduce work before the system throttles us.
Memory Pooling: Reuse FFT buffers across computations. Allocation is expensive; pooling drops memory pressure by 90%.
FFT Threshold Selection
The Crossover
FFT has setup overhead. For small numbers, naive O(n²) is faster. We benchmark to find the sweet spot.
```swift
// Empirically tuned threshold
let FFT_THRESHOLD_BITS = 192

func multiply(_ a: BigInt, _ b: BigInt) -> BigInt {
    if a.bitWidth + b.bitWidth < FFT_THRESHOLD_BITS {
        return gradeSchoolMultiply(a, b)
    }
    return fftMultiply(a, b)
}
```

Cache Effects
L1/L2 cache size matters. Buffers that spill to L3 are 3-5× slower. We pad to power-of-two but stay cache-aware.
Caching & Amortization
The 3× constant factor improvement from proper resource management
FFT Setup Caching
vDSP FFT functions require a setup structure (vDSP_create_fftsetupD) containing precomputed twiddle factors, the complex roots of unity for butterfly operations.
```swift
// Cache setups by log₂(n)
var setupCache: [Int: FFTSetupD] = [:]
let lock = NSLock()

func getSetup(for log2n: Int) -> FFTSetupD {
    lock.lock()
    defer { lock.unlock() }
    if let cached = setupCache[log2n] {
        return cached
    }
    let setup = vDSP_create_fftsetupD(...)
    setupCache[log2n] = setup
    return setup
}
```

Once created, a 2²⁰ setup is reused for all subsequent FFTs of that size, amortizing the O(n) setup cost across hundreds of multiplications.
Buffer Pooling
Each FFT convolution needs six n-element Double arrays (real and imaginary parts for both inputs and output). For n = 2²⁰, that's 48MB per multiply.
The buffer pool maintains previously-allocated arrays for reuse, eliminating allocator thrashing and cache pollution from fresh allocations.
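The pool itself is just a free-list keyed by buffer size. A minimal Python sketch (illustrative; the real pool hands out Double arrays for vDSP):

```python
from collections import defaultdict

class BufferPool:
    """Free-list of reusable buffers, keyed by size."""

    def __init__(self):
        self._free = defaultdict(list)

    def acquire(self, size):
        free = self._free[size]
        # Reuse a previously released buffer when one is available;
        # only allocate when the pool is empty for this size
        return free.pop() if free else [0.0] * size

    def release(self, buf):
        self._free[len(buf)].append(buf)
```

Since FFT sizes are powers of two, the pool sees only a handful of distinct sizes, and steady-state multiplications allocate nothing at all.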
The Microarchitecture Advantage
Apple's P-cores are exceptionally wide. 8-wide decode (vs 4-6 on Intel/AMD), 630 instructions in flight, 192 KB L1 instruction cache. This width matters for FFT.
The FFT inner loop has no dependencies between iterations:
```swift
// Complex multiply per iteration
real[i] = ar * br - ai * bi
imag[i] = ar * bi + ai * br
```

4 multiplies, 2 adds, zero dependencies between iterations. A wide core executes multiple iterations simultaneously. Your FFT gets free parallelism without explicit threading.
Cache Sweet Spot
The M-series memory hierarchy is unusually large: 192KB L1 instruction cache, 128KB L1 data cache, and 24-48MB shared L2. Combined with 400+ GB/s unified memory bandwidth, the memory wall effectively disappears.
Our FFT working sets (262K complex doubles ≈ 4MB for F(3,000,000) computations) fit entirely in L2, eliminating main memory latency from the critical path.
L1: 192+128 KB, ~4 cycles. L2: 24-48 MB, ~12 cycles. Combined with 8-wide decode and 630-instruction ROB, FFT achieves near-peak throughput.
Filling the Execution Units
Apple's P-cores pack serious firepower: 4 integer ALUs, 4 FP/SIMD units (128-bit NEON each), and 3 load/store units.
The FFT butterfly operation fills all FP units perfectly. Each iteration needs 4 multiplies and 2 adds, exactly matching the hardware.
The AMX Coprocessor
Apple Silicon contains an undocumented matrix coprocessor called AMX (Apple Matrix eXtensions). It's not NEON, not the Neural Engine. It's a dedicated unit with 64×64 matrix registers.
When you call vDSP_fft_zipD, you're not just getting hand-tuned assembly. You're hitting AMX for the butterfly operations.
Same Swift code, zero changes. 2-4× throughput over NEON for dense linear algebra. The Accelerate framework handles dispatch automatically.
Swift's Hidden Optimizations
High-level code with low-level performance
ARC Traffic Analysis
The optimizer proves that most retain/release pairs are unnecessary when values don't escape or are uniquely referenced. Profiling our FFT hot path shows zero ARC traffic. All reference counting eliminated.
```swift
// Uniquely owned buffers
@inline(__always)
func fftButterfly(_ buf: inout [Double]) {
    // Compiler proves buf is unique
    // → No ARC overhead
}
```

Stack Promotion
Small BigInts get stack-promoted. For early iterations (n < 100), numbers fit in a few words and never touch the heap allocator.
Autovectorization
LLVM autovectorizes obvious loops. 128-bit NEON registers process 2 doubles per instruction.
```swift
for i in 0..<count {
    result[i] = a[i] * b[i]
}
// → simd_mul, 2 doubles/op
```

Why Unified Memory Changes Everything
CPU, GPU, and Neural Engine share the same memory with the same pointers
Future Optimization Path
Where Computing Is Heading
Industry Trajectory
AMD: APUs moving toward unified memory. ~100 GB/s today, Apple's trajectory by 2027.
Intel: Meteor Lake tiles CPU/GPU/NPU on same package with unified pools.
NVIDIA: Grace Hopper is ARM CPU + Hopper GPU with NVLink, hedging against PCIe tax.
Qualcomm: Snapdragon X Elite copies Apple's playbook directly.
What This Means
Serial algorithms hit IPC walls. You can't make a single thread faster, but you can make it wider.
Memory bandwidth trumps FLOPS. Most code is memory-bound. Unified architectures win by eliminating copies.
Hardware/software co-design wins. Apple controls Swift, LLVM, Accelerate, Metal, and the silicon. Nobody else has this.
Specialization proliferates. AMX, ANE, ProRes, Secure Enclave. General cores become orchestrators.
Theoretical Limits
The Double Precision Wall
Pushing beyond F(4,000,000) encounters floating-point precision limits. Convolution of two 300,000-element arrays produces values approaching 2⁵³, where double precision can no longer represent integers exactly.
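The wall itself is one line of arithmetic. In Python (illustrative), both the representability limit and the convolution headroom check look like this:

```python
# 2^53 is the last point where every integer is exactly representable
# as a double: 2^53 + 1 rounds back down to 2^53.
assert float(2**53) == float(2**53 + 1)
assert float(2**53 - 1) != float(2**53)

# Convolution headroom: FFT size n with base-2^15 digits must keep
# n * (2^15 - 1)^2 safely below 2^53 (plus margin for rounding error).
assert (1 << 20) * (2**15 - 1) ** 2 < 2**53
```

Past that margin, rounded inverse-FFT values can land on the wrong integer, silently corrupting digits.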
The NTT Path Forward
The Number Theoretic Transform (NTT) replaces complex exponentials with modular roots of unity, performing exact integer arithmetic modulo carefully chosen primes.
Using multiple primes with Chinese Remainder Theorem reconstruction enables arbitrary precision at the cost of implementation complexity.
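A single-prime NTT is short enough to sketch in Python (illustrative; uses the well-known prime 998244353 = 119·2²³ + 1 with primitive root 3, and base-256 digits so convolution values stay below the modulus — a production path would use larger digits with several primes and CRT):

```python
P = 998244353        # NTT-friendly prime: 119 * 2^23 + 1
G = 3                # primitive root mod P

def ntt(a, invert=False):
    # Recursive radix-2 transform over Z/PZ instead of the complex numbers
    n = len(a)
    if n == 1:
        return list(a)
    even = ntt(a[0::2], invert)
    odd = ntt(a[1::2], invert)
    w_n = pow(G, (P - 1) // n, P)         # principal n-th root of unity
    if invert:
        w_n = pow(w_n, P - 2, P)          # its modular inverse
    out, w = [0] * n, 1
    for k in range(n // 2):
        t = w * odd[k] % P
        out[k] = (even[k] + t) % P
        out[k + n // 2] = (even[k] - t) % P
        w = w * w_n % P
    return out

def to_base256(v):
    digits = []
    while v:
        v, d = divmod(v, 256)
        digits.append(d)
    return digits or [0]

def from_base256(digits):
    return sum(d * 256**i for i, d in enumerate(digits))

def ntt_multiply(x, y):
    # Exact convolution, valid while n * 255^2 < P
    n = 1
    while n < len(x) + len(y):
        n *= 2
    fx = ntt(x + [0] * (n - len(x)))
    fy = ntt(y + [0] * (n - len(y)))
    conv = ntt([a * b % P for a, b in zip(fx, fy)], invert=True)
    inv_n = pow(n, P - 2, P)              # normalize the inverse transform
    digits, carry = [], 0
    for v in conv:
        carry, d = divmod(v * inv_n % P + carry, 256)
        digits.append(d)
    while carry:
        carry, d = divmod(carry, 256)
        digits.append(d)
    return digits
```

Because every operation is exact modular arithmetic, there is no rounding error to manage — the trade is the extra primes and CRT reconstruction needed for large digit bases.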
For now, with conservative digit sizing, double-precision FFT handles everything up to F(10,000,000) comfortably. More than sufficient for this benchmark.
The Result
F(1,000,000), a 209,000-digit number, computes in 85 milliseconds. F(3,000,000) with 627,000 digits: under 800ms. F(10,000,000), over 2 million digits, completes in under one second.
The sequential benchmark, computing F(1), F(2), ..., F(n) until one computation exceeds one second, now reaches n ≈ 150,000+. Before fixing the O(n²) bottleneck, it topped out around 5,000-10,000.
The 500× speedup from a "minor" fix (replacing division-based digit conversion with bit extraction) underscores that asymptotic complexity analysis must account for all operations, not just the theoretically dominant ones.
Swift on Apple Silicon is a legitimate platform for high-performance numerical computing. Algebraic insight, algorithmic sophistication, systems optimization, and hardware integration combine to achieve performance competitive with C while maintaining Swift's safety guarantees and expressive syntax.
References & Further Reading
- Knuth, TAOCP Vol. 2: Seminumerical Algorithms
- Bach & Shallit, Algorithmic Number Theory
- Binet, J. (1843), original Fibonacci formula
- Apple Accelerate Framework
- GMP Library (reference)
- Schönhage–Strassen algorithm (1971)