Back to Blog
MathematicsSwiftAccelerateDeep Dive

Computing Fibonacci
at Scale

A deep dive into ℤ[√5], FFT multiplication, and Apple Silicon. How algebraic insight, algorithmic sophistication, and systems optimization combine to compute F(10,000,000) in under one second.

·35 min read
Multiplication Count: O(n) vs O(log n)
n = 1007 muls
n = 1,00010 muls
n = 10,00014 muls
n = 100,00017 muls
n = 1,000,00020 muls
Iterative O(n)
Ring O(log n)
The Problem

Why O(log n) Matters

The naive iterative approach computes F(n) via n additions. Each addition of k-digit numbers costs O(k) operations. For F(10,000,000), that's ten million additions of numbers with 2+ million digits.

But there's a better way. The closed-form Binet formula reveals that F(n) = (φⁿ - ψⁿ)/√5, where φ and ψ are roots of x² - x - 1. This polynomial defines ℤ[√5], where we can compute exactly.

Binary exponentiation in this ring gives us F(n) in ~20 multiplications, each requiring only 3 BigInt multiplies vs 8 for matrix approaches.

The Algebraic Foundation

Working in ℤ[√5]

The Fibonacci sequence appears simple, but its true nature is algebraic. The golden ratio φ = (1+√5)/2 and its conjugate ψ = (1-√5)/2 are roots of the minimal polynomial x² - x - 1.

This polynomial defines the ring ℤ[√5] = {a + b√5 | a,b ∈ ℤ}. The key insight: φ lives in a slight extension of this ring, and powers of (1 + √5) remain in ℤ[√5]. The √5 coefficient, after normalization, gives F(n) directly.

ring_multiply.swift
// Ring ℤ[√5] multiplication
// (a + b√5)(c + d√5) = (ac + 5bd) + (ad + bc)√5
 
func square(_ x: (BigInt, BigInt)) -> (BigInt, BigInt) {
let (a, b) = x
// Only 3 multiplications: a², b², ab
let ab = a * b
return (a*a + 5*b*b, 2*ab) // (a² + 5b²) + 2ab√5
}

3 BigInt multiplications per squaring, 62% fewer than the 2×2 matrix approach. Over log₂(n) iterations, this compounds dramatically.

Binary Exponentiation in ℤ[√5]
φ2
φ² = 1 + √5 (starting point)
1+1√5
F(n) = √5 coefficient / 2F(2) = 1
Deep Math

Binet's Formula

The connection between the golden ratio and Fibonacci numbers

The Formula

Define φ = (1+√5)/2 ≈ 1.618 (golden ratio) and ψ = (1−√5)/2 ≈ −0.618. These are roots of x² − x − 1 = 0.

F(n) = (φⁿ − ψⁿ) / √5

Since |ψ| < 1, the term ψⁿ shrinks exponentially. For n > 10, F(n) is simply the nearest integer to φⁿ/√5.

Binet's Formula
F(n) = (φⁿ ψⁿ) / √5
n=5
11.09-0.0902=5
n=10
122.990.0081=55
n=15
1364.00-0.0007=610
n=20
15127.000.0001=6765
ψⁿ → 0 rapidly, so F(n) ≈ φⁿ/√5 for large n

The Key Insight

We can express φ² = 1 + φ, so φ² = (1 + √5)/2 + 1 = (3 + √5)/2. In our ring: φ² → (1, 1) in ℤ[√5]. By repeated squaring of this element, we get φ^(2^k), and the √5 coefficient gives us F(2^k) directly.

Math
φ² = 1 + √5 (1, 1)
φ⁴ = (1+√5)² = 6 + 2√5 (6, 2) F(4) = 2
φ⁸ = (6+2√5)² = 76 + 34√5 F(8) = 34
Deep Math

Continued Fractions & Extremal Structure

The Most Irrational Number

The golden ratio has the simplest continued fraction: φ = [1; 1, 1, 1, ...]. This makes φ the "most irrational" number: its rational approximations converge more slowly than any other irrational.

φ = 1 + 1/(1 + 1/(1 + 1/(...))
Convergents = F(n+1)/F(n)

The convergents are exactly the ratios of consecutive Fibonacci numbers, the best possible rational approximations to φ with bounded denominator.

Lamé's Theorem (1844)

The Euclidean algorithm for GCD(a, b) takes at most 5·log₁₀(min(a,b)) steps. This bound is tight precisely when a and b are consecutive Fibonacci numbers.

GCD(F(n+1), F(n))n-1 divisions
Each quotient= 1

Fibonacci numbers characterize worst-case GCD behavior. One of the oldest algorithms in mathematics meets number theory's golden sequence.

Hyperbolic Geometry Connection

The modular group SL(2,ℤ), the group of 2×2 integer matrices with determinant 1, acts on the hyperbolic plane. Continued fractions correspond to geodesics, and φ's all-ones expansion traces the "straightest" infinite geodesic. Fibonacci numbers enumerate vertices along this extremal path, connecting elementary number theory to hyperbolic geometry.

Operations per Squaring Step
Matrix (2×2)³8 × + 4 +
|
ℤ[√5] Ring3 × + 2 +
|
62.5% fewer multiplications per step
Why Not Matrix?

Ring vs Matrix

The classic O(log n) approach uses matrix exponentiation with [[1,1],[1,0]]. Squaring a 2×2 matrix requires 8 multiplications and 4 additions.

Our ring approach: just 3 multiplications and 2 additions per squaring step. Over 20 iterations, that's 160 vs 60 BigInt multiplications.

When each multiplication costs milliseconds on million-digit numbers, this difference is huge.

FFT Multiplication

When Numbers Get Huge

Standard multiplication of n-digit numbers is O(n²). At 2 million digits, that's 4 trillion operations per multiply. The FFT transforms this into O(n log n) via the convolution theorem.

We represent BigInts as polynomials where coefficients are base-2¹⁵ digits. Polynomial multiplication is convolution: IFFT(FFT(a) ⊙ FFT(b)).

fft_multiply.swift
1func fftMultiply(_ a: [Double], _ b: [Double]) -> [Double] {
2 let n = nextPowerOfTwo(a.count + b.count)
3 
4 // Forward FFT via Accelerate
5 var aFreq = fft(padded(a, to: n))
6 var bFreq = fft(padded(b, to: n))
7 
8 // Pointwise multiply in frequency domain
9 vDSP_zvmulD(&aFreq, 1, &bFreq, 1, &aFreq, 1, vDSP_Length(n/2), 1)
10 
11 // Inverse FFT + carry propagation
12 return ifft(aFreq)
13}
Why Base 2¹⁵ = 32768?

Each digit fits in 15 bits. Product of two digits: 32767² ≈ 1.07×10⁹. For FFT size n = 2²⁰, convolution values reach n × 32767² ≈ 1.1×10¹⁵, still within double precision's 2⁵³ ≈ 9×10¹⁵ exact integer range.

FFT Pointwise Multiplication
A(ω)
×
B(ω)
=
C(ω) = A(ω) × B(ω)
Inverse FFT transforms C(ω) back to time domain = convolution result
The 500× Speedup

The O(n²) Bottleneck

When asymptotic complexity hides in plain sight

The Problem

Our initial implementation achieved O(log n) ring exponentiations, yet performance collapsed for large n. Profiling revealed the culprit: converting between BigInt and FFT digit representations.

Swift
// Naive conversion: O(n²) total
while value > 0 {
digits.append(value % base) // O(n)
value /= base // O(n)
}
// Called O(n) times = O(n²)

Each division is O(n). Performed O(n) times = O(n²) total, dominating the theoretically O(n log n log log n) multiplication.

The Fix

The attaswift/BigInt library stores magnitudes as arrays of 64-bit words. Each word contains ~4.27 base-2¹⁵ digits (64/15 ≈ 4.27). We can extract them via shifts and masks in a single O(n) pass.

Swift
// O(n) bit extraction
for word in bigInt.words {
// Extract 15-bit chunks via
// bit shifts across word boundaries
while bitsRemaining >= 15 {
digits.append(extractBits(15))
}
}

This "minor" change reduced F(1,000,000) from multiple seconds to 85 milliseconds. A 500× improvement.

Lesson: Asymptotic complexity analysis must account for all operations, not just the theoretically dominant ones. A hidden O(n²) in your conversion routine will kill O(n log n) multiplication.

Implementation

Apple Accelerate Framework

Hardware-optimized signal processing on Apple Silicon

AccelerateFFT.swift
1import Accelerate
2 
3class FFTMultiplier {
4 private var fftSetup: vDSP_DFT_Setup?
5 
6 func multiply(_ a: inout DSPDoubleSplitComplex,
7 _ b: inout DSPDoubleSplitComplex,
8 length: Int) {
9 // Forward FFT using vDSP
10 vDSP_fft_zipD(fftSetup!, &a, 1, vDSP_Length(log2(Double(length))), .forward)
11 vDSP_fft_zipD(fftSetup!, &b, 1, vDSP_Length(log2(Double(length))), .forward)
12 
13 // Complex multiply in frequency domain
14 vDSP_zvmulD(&a, 1, &b, 1, &a, 1, vDSP_Length(length/2), 1)
15 
16 // Inverse FFT
17 vDSP_fft_zipD(fftSetup!, &a, 1, vDSP_Length(log2(Double(length))), .inverse)
18 }
19}

vDSP SIMD

Vectorized operations process 4-8 doubles simultaneously using ARM NEON or AVX instructions.

🔄

Radix-2 FFT

Power-of-two sizes enable the most efficient butterfly operations with perfect cache utilization.

🧮

Base 2¹⁵

Digits in [0, 32767] prevent overflow in frequency domain even after squaring and adding carries.

Thermal Throttling Impact
Core Temp
40°C
Clock Speed
3.5GHz
Production Concerns

Swift & App Store Optimization

Concurrency: Computation runs on a dedicated Swift Task while UI polls results at 30fps. No blocking, no jank, even when multiplying million-digit numbers.

Thermal Throttling: Sustained CPU load heats the chip. We monitor ProcessInfo.thermalState and gracefully reduce work before the system throttles us.

Memory Pooling: Reuse FFT buffers across computations. Allocation is expensive; pooling drops memory pressure by 90%.

Fine Tuning

FFT Threshold Selection

The Crossover

FFT has setup overhead. For small numbers, naive O(n²) is faster. We benchmark to find the sweet spot.

Swift
// Empirically tuned threshold
let FFT_THRESHOLD_BITS = 192
 
func multiply(_ a: BigInt, _ b: BigInt) -> BigInt {
if a.bitWidth + b.bitWidth < FFT_THRESHOLD_BITS {
return gradeSchoolMultiply(a, b)
}
return fftMultiply(a, b)
}

Cache Effects

L1/L2 cache size matters. Buffers that spill to L3 are 3-5× slower. We pad to power-of-two but stay cache-aware.

L1 cache192 KB
L2 cache12 MB
FFT max in-L2~750K doubles
Systems Optimization

Caching & Amortization

The 3× constant factor improvement from proper resource management

FFT Setup Caching

vDSP FFT functions require a setup structure (vDSP_create_fftsetupD) containing precomputed twiddle factors, the complex roots of unity for butterfly operations.

Swift
// Cache setups by log₂(n)
var setupCache: [Int: vDSP_DFT_Setup] = [:]
let lock = NSLock()
 
func getSetup(for log2n: Int) -> vDSP_DFT_Setup {
lock.lock()
defer { lock.unlock() }
if let cached = setupCache[log2n] {
return cached
}
let setup = vDSP_create_fftsetupD(...)
setupCache[log2n] = setup
return setup
}

Once created, a 2²⁰ setup is reused for all subsequent FFTs of that size, amortizing O(n) setup cost across hundreds of multiplications.

Buffer Pooling

Each FFT convolution needs six n-element Double arrays (real and imaginary parts for both inputs and output). For n = 2²⁰, that's 48MB per multiply.

Without poolingmalloc/free per mul
With poolinghash lookup + assign
Improvement~3× faster

The buffer pool maintains previously-allocated arrays for reuse, eliminating allocator thrashing and cache pollution from fresh allocations.

Why Apple Silicon

The Microarchitecture Advantage

Apple's P-cores are exceptionally wide. 8-wide decode (vs 4-6 on Intel/AMD), 630 instructions in flight, 192 KB L1 instruction cache. This width matters for FFT.

The FFT inner loop has no dependencies between iterations:

Swift
// Complex multiply-accumulate per iteration
real[i] = ar * br - ai * bi
imag[i] = ar * bi + ai * br

4 multiplies, 2 adds, zero dependencies. A wide core executes multiple iterations simultaneously. Your FFT gets free parallelism without explicit threading.

P-Core Microarchitecture Comparison
Intel Alder Lake
Decode Width
6-wide
Reorder Buffer
512
L1i Cache
32 KB
AMD Zen 4
Decode Width
4-wide
Reorder Buffer
320
L1i Cache
32 KB
Apple M3
Decode Width
8-wide
Reorder Buffer
630
L1i Cache
192 KB
Apple Silicon Memory Hierarchy
Registers
32×128-bit NEON
0 cycles
L1 Cache
192KB i + 128KB d
~4 cycles
L2 Cache
24-48 MB shared
~12 cycles
Unified Memory
400+ GB/s
~100 cycles
FFT working sets (~4MB for F(3M)) fit entirely in L2
Memory Hierarchy

Cache Sweet Spot

The M-series memory hierarchy is unusually large: 192KB L1 instruction cache, 128KB L1 data cache, and 24-48MB shared L2. Combined with 400+ GB/s unified memory bandwidth, the memory wall effectively disappears.

Our FFT working sets (262K complex doubles ≈ 4MB for F(3,000,000) computations) fit entirely in L2, eliminating main memory latency from the critical path.

L1: 192+128 KB, ~4 cycles. L2: 24-48 MB, ~12 cycles. Combined with 8-wide decode and 630-instruction ROB, FFT achieves near-peak throughput.

Execution Width

Filling the Execution Units

Apple's P-cores pack serious firepower: 4 integer ALUs, 4 FP/SIMD units (128-bit NEON each), and 3 load/store units.

The FFT butterfly operation fills all FP units perfectly. Each iteration needs 4 multiplies and 2 adds, exactly matching the hardware.

4× FP multiply1 cycle
2× FP add1 cycle
Per butterfly2 cycles
P-Core Execution Units
ALU4 ports
FP/SIMD4 ports
Load/Store3 ports
FFT butterfly: 4 muls + 2 adds per iteration. Fills all FP units
AMX Matrix Coprocessor
64×64 matrix registers, dedicated multiply-accumulate
×
Accessible only through Accelerate framework. 2-4× faster than NEON
Secret Weapon

The AMX Coprocessor

Apple Silicon contains an undocumented matrix coprocessor called AMX (Apple Matrix eXtensions). It's not NEON, not the Neural Engine. It's a dedicated unit with 64×64 matrix registers.

When you call vDSP_fft_zipD, you're not just getting hand-tuned assembly. You're hitting AMX for the butterfly operations.

Same Swift code, zero changes. 2-4× throughput over NEON for dense linear algebra. The Accelerate framework handles dispatch automatically.

Zero-Cost Abstractions

Swift's Hidden Optimizations

High-level code with low-level performance

🔄

ARC Traffic Analysis

The optimizer proves that most retain/release pairs are unnecessary when values don't escape or are uniquely referenced. Profiling our FFT hot path shows zero ARC traffic. All reference counting eliminated.

Swift
// Uniquely owned buffers
@inline(__always)
func fftButterfly(_ buf: inout [Double]) {
// Compiler proves buf is unique
// → No ARC overhead
}
📦

Stack Promotion

Small BigInts get stack-promoted. For early iterations (n < 100), numbers fit in a few words and never touch the heap allocator.

n < 64 bits
→ Stack allocated
n > 64 bits
→ Heap array

Autovectorization

LLVM autovectorizes obvious loops. 128-bit NEON registers process 2 doubles per instruction.

Swift
for i in 0..<count {
result[i] = a[i] * b[i]
}
// → simd_mul, 2 doubles/op
Unified Architecture

Why Unified Memory Changes Everything

CPU, GPU, and Neural Engine share the same memory with the same pointers

Discrete GPU Architecture
CPURAM
↓ PCIe Copy ↓
GPUVRAM
~10 μs
transfer latency
Apple Unified Architecture
Unified Memory Pool
CPU
GPU
ANE
0 μs
zero-copy access

Future Optimization Path

1. CPU
Ring exponentiation
(serial dependency)
2. GPU
Large FFT multiplies
(parallel bulk math)
3. Same Buffer
Zero copies
(MTLBuffer shared)
4. No Sync
No PCIe stalls
(unified pointers)
The Future

Where Computing Is Heading

Industry Trajectory

AMD: APUs moving toward unified memory. ~100 GB/s today, Apple's trajectory by 2027.

Intel: Meteor Lake tiles CPU/GPU/NPU on same package with unified pools.

NVIDIA: Grace Hopper is ARM CPU + Hopper GPU with NVLink, hedging against PCIe tax.

Qualcomm: Snapdragon X Elite copies Apple's playbook directly.

What This Means

Serial algorithms hit IPC walls. You can't make a single thread faster, but you can make it wider.

Memory bandwidth trumps FLOPS. Most code is memory-bound. Unified architectures win by eliminating copies.

Hardware/software co-design wins. Apple controls Swift, LLVM, Accelerate, Metal, and the silicon. Nobody else has this.

Specialization proliferates. AMX, ANE, ProRes, Secure Enclave. General cores become orchestrators.

Pushing Further

Theoretical Limits

The Double Precision Wall

Pushing beyond F(4,000,000) encounters floating-point precision limits. Convolution of two 300,000-element arrays produces values approaching 2⁵³, where double precision can no longer represent integers exactly.

Max safe FFT sizen ≈ 2²⁰
Max result digits~3-4 million

The NTT Path Forward

The Number Theoretic Transform (NTT) replaces complex exponentials with modular roots of unity, performing exact integer arithmetic modulo carefully chosen primes.

Using multiple primes with Chinese Remainder Theorem reconstruction enables arbitrary precision at the cost of implementation complexity.

For now, double-precision FFT handles everything up to F(10,000,000+) comfortably. More than sufficient for this benchmark.

The Result

F(1,000,000), a 209,000-digit number, computes in 85 milliseconds. F(3,000,000) with 627,000 digits: under 800ms. F(10,000,000), over 2 million digits, completes in under one second.

The sequential benchmark, computing F(1), F(2), ..., F(n) until one computation exceeds one second, now reaches n ≈ 150,000+. Before fixing the O(n²) bottleneck, it topped out around 5,000-10,000.

The 500× speedup from a "minor" fix (replacing division-based digit conversion with bit extraction) underscores that asymptotic complexity analysis must account for all operations, not just the theoretically dominant ones.

Swift on Apple Silicon is a legitimate platform for high-performance numerical computing. Algebraic insight, algorithmic sophistication, systems optimization, and hardware integration combine to achieve performance competitive with C while maintaining Swift's safety guarantees and expressive syntax.

References & Further Reading

Mathematics
  • Knuth, TAOCP Vol. 2: Seminumerical Algorithms
  • Bach & Shallit, Algorithmic Number Theory
  • Binet, J. (1843), original Fibonacci formula
Implementation