MathematicsSwiftAccelerateDeep Dive

Computing Fibonacci
at Scale

A deep dive into ℤ[√5], FFT multiplication, and Apple Silicon. How algebraic insight, algorithmic sophistication, and systems optimization combine to compute F(10,000,000) in under one second.

January 2026·35 min read

Multiplication Count: O(n) vs O(log n)

n = 1007 muls

n = 1,00010 muls

n = 10,00014 muls

n = 100,00017 muls

n = 1,000,00020 muls

Iterative O(n)

Ring O(log n)

The Problem

Why O(log n) Matters

The naive iterative approach computes F(n) via n additions. Each addition of k-digit numbers costs O(k) operations. For F(10,000,000), that's ten million additions of numbers with 2+ million digits.

But there's a better way. The closed-form Binet formula reveals that F(n) = (φⁿ - ψⁿ)/√5, where φ and ψ are roots of x² - x - 1. This polynomial defines ℤ[√5], where we can compute exactly.

Binary exponentiation in this ring gives us F(n) in ~20 multiplications, each requiring only 3 BigInt multiplies vs 8 for matrix approaches.

The Algebraic Foundation

Working in ℤ[√5]

The Fibonacci sequence appears simple, but its true nature is algebraic. The golden ratio φ = (1+√5)/2 and its conjugate ψ = (1-√5)/2 are roots of the minimal polynomial x² - x - 1.

This polynomial defines the ring ℤ[√5] = {a + b√5 | a,b ∈ ℤ}. The key insight: φ lives in a slight extension of this ring, and powers of (1 + √5) remain in ℤ[√5]. The √5 coefficient, after normalization, gives F(n) directly.

ring_multiply.swift

// Ring ℤ[√5] multiplication
// (a + b√5)(c + d√5) = (ac + 5bd) + (ad + bc)√5
 
func square(_ x: (BigInt, BigInt)) -> (BigInt, BigInt) {
    let (a, b) = x
    // Only 3 multiplications: a², b², ab
    let ab = a * b
    return (a*a + 5*b*b, 2*ab)  // (a² + 5b²) + 2ab√5
}

3 BigInt multiplications per squaring, 62% fewer than the 2×2 matrix approach. Over log₂(n) iterations, this compounds dramatically.

Binary Exponentiation in ℤ[√5]

φ²

φ² = 1 + √5 (starting point)

1+1√5

F(n) = √5 coefficient / 2F(2) = 1

Deep Math

Binet's Formula

The connection between the golden ratio and Fibonacci numbers

The Formula

Define φ = (1+√5)/2 ≈ 1.618 (golden ratio) and ψ = (1−√5)/2 ≈ −0.618. These are roots of x² − x − 1 = 0.

F(n) = (φⁿ − ψⁿ) / √5

Since |ψ| < 1, the term ψⁿ shrinks exponentially. For n > 10, F(n) is simply the nearest integer to φⁿ/√5.

Binet's Formula

F(n) = (φⁿ − ψⁿ) / √5

n=5

11.09−-0.0902=5

n=10

122.99−0.0081=55

n=15

1364.00−-0.0007=610

n=20

15127.00−0.0001=6765

ψⁿ → 0 rapidly, so F(n) ≈ φⁿ/√5 for large n

The Key Insight

We can express φ² = 1 + φ, so φ² = (1 + √5)/2 + 1 = (3 + √5)/2. In our ring: φ² → (1, 1) in ℤ[√5]. By repeated squaring of this element, we get φ^(2^k), and the √5 coefficient gives us F(2^k) directly.

Math

φ² = 1 + √5  →  (1, 1)
φ⁴ = (1+√5)² = 6 + 2√5  →  (6, 2)  →  F(4) = 2
φ⁸ = (6+2√5)² = 76 + 34√5  →  F(8) = 34

Deep Math

Continued Fractions & Extremal Structure

The Most Irrational Number

The golden ratio has the simplest continued fraction: φ = [1; 1, 1, 1, ...]. This makes φ the "most irrational" number: its rational approximations converge more slowly than any other irrational.

φ = 1 + 1/(1 + 1/(1 + 1/(...))

Convergents = F(n+1)/F(n)

The convergents are exactly the ratios of consecutive Fibonacci numbers, the best possible rational approximations to φ with bounded denominator.

Lamé's Theorem (1844)

The Euclidean algorithm for GCD(a, b) takes at most 5·log₁₀(min(a,b)) steps. This bound is tight precisely when a and b are consecutive Fibonacci numbers.

GCD(F(n+1), F(n))n-1 divisions

Each quotient= 1

Fibonacci numbers characterize worst-case GCD behavior. One of the oldest algorithms in mathematics meets number theory's golden sequence.

Hyperbolic Geometry Connection

The modular group SL(2,ℤ), the group of 2×2 integer matrices with determinant 1, acts on the hyperbolic plane. Continued fractions correspond to geodesics, and φ's all-ones expansion traces the "straightest" infinite geodesic. Fibonacci numbers enumerate vertices along this extremal path, connecting elementary number theory to hyperbolic geometry.

Operations per Squaring Step

Matrix (2×2)³8 × + 4 +

ℤ[√5] Ring3 × + 2 +

62.5% fewer multiplications per step

Why Not Matrix?

Ring vs Matrix

The classic O(log n) approach uses matrix exponentiation with [[1,1],[1,0]]. Squaring a 2×2 matrix requires 8 multiplications and 4 additions.

Our ring approach: just 3 multiplications and 2 additions per squaring step. Over 20 iterations, that's 160 vs 60 BigInt multiplications.

When each multiplication costs milliseconds on million-digit numbers, this difference is huge.

FFT Multiplication

When Numbers Get Huge

Standard multiplication of n-digit numbers is O(n²). At 2 million digits, that's 4 trillion operations per multiply. The FFT transforms this into O(n log n) via the convolution theorem.

We represent BigInts as polynomials where coefficients are base-2¹⁵ digits. Polynomial multiplication is convolution: IFFT(FFT(a) ⊙ FFT(b)).

fft_multiply.swift

1func fftMultiply(_ a: [Double], _ b: [Double]) -> [Double] {
2    let n = nextPowerOfTwo(a.count + b.count)
3 
4    // Forward FFT via Accelerate
5    var aFreq = fft(padded(a, to: n))
6    var bFreq = fft(padded(b, to: n))
7 
8    // Pointwise multiply in frequency domain
9    vDSP_zvmulD(&aFreq, 1, &bFreq, 1, &aFreq, 1, vDSP_Length(n/2), 1)
10 
11    // Inverse FFT + carry propagation
12    return ifft(aFreq)
13}

Why Base 2¹⁵ = 32768?

Each digit fits in 15 bits. Product of two digits: 32767² ≈ 1.07×10⁹. For FFT size n = 2²⁰, convolution values reach n × 32767² ≈ 1.1×10¹⁵, still within double precision's 2⁵³ ≈ 9×10¹⁵ exact integer range.

FFT Pointwise Multiplication

A(ω)

B(ω)

C(ω) = A(ω) × B(ω)

Inverse FFT transforms C(ω) back to time domain = convolution result

The 500× Speedup

The O(n²) Bottleneck

When asymptotic complexity hides in plain sight

The Problem

Our initial implementation achieved O(log n) ring exponentiations, yet performance collapsed for large n. Profiling revealed the culprit: converting between BigInt and FFT digit representations.

Swift

// Naive conversion: O(n²) total
while value > 0 {
    digits.append(value % base)  // O(n)
    value /= base                // O(n)
}
// Called O(n) times = O(n²)

Each division is O(n). Performed O(n) times = O(n²) total, dominating the theoretically O(n log n log log n) multiplication.

The Fix

The attaswift/BigInt library stores magnitudes as arrays of 64-bit words. Each word contains ~4.27 base-2¹⁵ digits (64/15 ≈ 4.27). We can extract them via shifts and masks in a single O(n) pass.

Swift

// O(n) bit extraction
for word in bigInt.words {
    // Extract 15-bit chunks via
    // bit shifts across word boundaries
    while bitsRemaining >= 15 {
        digits.append(extractBits(15))
    }
}

This "minor" change reduced F(1,000,000) from multiple seconds to 85 milliseconds. A 500× improvement.

Lesson: Asymptotic complexity analysis must account for all operations, not just the theoretically dominant ones. A hidden O(n²) in your conversion routine will kill O(n log n) multiplication.

Implementation

Apple Accelerate Framework

Hardware-optimized signal processing on Apple Silicon

AccelerateFFT.swift

1import Accelerate
2 
3class FFTMultiplier {
4    private var fftSetup: vDSP_DFT_Setup?
5 
6    func multiply(_ a: inout DSPDoubleSplitComplex,
7                  _ b: inout DSPDoubleSplitComplex,
8                  length: Int) {
9        // Forward FFT using vDSP
10        vDSP_fft_zipD(fftSetup!, &a, 1, vDSP_Length(log2(Double(length))), .forward)
11        vDSP_fft_zipD(fftSetup!, &b, 1, vDSP_Length(log2(Double(length))), .forward)
12 
13        // Complex multiply in frequency domain
14        vDSP_zvmulD(&a, 1, &b, 1, &a, 1, vDSP_Length(length/2), 1)
15 
16        // Inverse FFT
17        vDSP_fft_zipD(fftSetup!, &a, 1, vDSP_Length(log2(Double(length))), .inverse)
18    }
19}

⚡

vDSP SIMD

Vectorized operations process 4-8 doubles simultaneously using ARM NEON or AVX instructions.

🔄

Radix-2 FFT

Power-of-two sizes enable the most efficient butterfly operations with perfect cache utilization.

🧮

Base 2¹⁵

Digits in [0, 32767] prevent overflow in frequency domain even after squaring and adding carries.

Thermal Throttling Impact

Core Temp

40°C

Clock Speed

3.5GHz

Production Concerns

Swift & App Store Optimization

Concurrency: Computation runs on a dedicated Swift Task while UI polls results at 30fps. No blocking, no jank, even when multiplying million-digit numbers.

Thermal Throttling: Sustained CPU load heats the chip. We monitor ProcessInfo.thermalState and gracefully reduce work before the system throttles us.

Memory Pooling: Reuse FFT buffers across computations. Allocation is expensive; pooling drops memory pressure by 90%.

Fine Tuning

FFT Threshold Selection

The Crossover

FFT has setup overhead. For small numbers, naive O(n²) is faster. We benchmark to find the sweet spot.

Swift

// Empirically tuned threshold
let FFT_THRESHOLD_BITS = 192
 
func multiply(_ a: BigInt, _ b: BigInt) -> BigInt {
    if a.bitWidth + b.bitWidth < FFT_THRESHOLD_BITS {
        return gradeSchoolMultiply(a, b)
    }
    return fftMultiply(a, b)
}

Cache Effects

L1/L2 cache size matters. Buffers that spill to L3 are 3-5× slower. We pad to power-of-two but stay cache-aware.

L1 cache192 KB

L2 cache12 MB

FFT max in-L2~750K doubles

Systems Optimization

Caching & Amortization

The 3× constant factor improvement from proper resource management

FFT Setup Caching

vDSP FFT functions require a setup structure (vDSP_create_fftsetupD) containing precomputed twiddle factors, the complex roots of unity for butterfly operations.

Swift

// Cache setups by log₂(n)
var setupCache: [Int: vDSP_DFT_Setup] = [:]
let lock = NSLock()
 
func getSetup(for log2n: Int) -> vDSP_DFT_Setup {
    lock.lock()
    defer { lock.unlock() }
    if let cached = setupCache[log2n] {
        return cached
    }
    let setup = vDSP_create_fftsetupD(...)
    setupCache[log2n] = setup
    return setup
}

Once created, a 2²⁰ setup is reused for all subsequent FFTs of that size, amortizing O(n) setup cost across hundreds of multiplications.

Buffer Pooling

Each FFT convolution needs six n-element Double arrays (real and imaginary parts for both inputs and output). For n = 2²⁰, that's 48MB per multiply.

Without poolingmalloc/free per mul

With poolinghash lookup + assign

Improvement~3× faster

The buffer pool maintains previously-allocated arrays for reuse, eliminating allocator thrashing and cache pollution from fresh allocations.

Why Apple Silicon

The Microarchitecture Advantage

Apple's P-cores are exceptionally wide. 8-wide decode (vs 4-6 on Intel/AMD), 630 instructions in flight, 192 KB L1 instruction cache. This width matters for FFT.

The FFT inner loop has no dependencies between iterations:

Swift

// Complex multiply-accumulate per iteration
real[i] = ar * br - ai * bi
imag[i] = ar * bi + ai * br

4 multiplies, 2 adds, zero dependencies. A wide core executes multiple iterations simultaneously. Your FFT gets free parallelism without explicit threading.

P-Core Microarchitecture Comparison

Intel Alder Lake

Decode Width

6-wide

Reorder Buffer

512

L1i Cache

32 KB

AMD Zen 4

Decode Width

4-wide

Reorder Buffer

320

L1i Cache

32 KB

Apple M3

Decode Width

8-wide

Reorder Buffer

630

L1i Cache

192 KB

Apple Silicon Memory Hierarchy

Registers

32×128-bit NEON

0 cycles

↓

L1 Cache

192KB i + 128KB d

~4 cycles

L2 Cache

24-48 MB shared

~12 cycles

Unified Memory

400+ GB/s

~100 cycles

FFT working sets (~4MB for F(3M)) fit entirely in L2

Memory Hierarchy

Cache Sweet Spot

The M-series memory hierarchy is unusually large: 192KB L1 instruction cache, 128KB L1 data cache, and 24-48MB shared L2. Combined with 400+ GB/s unified memory bandwidth, the memory wall effectively disappears.

Our FFT working sets (262K complex doubles ≈ 4MB for F(3,000,000) computations) fit entirely in L2, eliminating main memory latency from the critical path.

L1: 192+128 KB, ~4 cycles. L2: 24-48 MB, ~12 cycles. Combined with 8-wide decode and 630-instruction ROB, FFT achieves near-peak throughput.

Execution Width

Filling the Execution Units

Apple's P-cores pack serious firepower: 4 integer ALUs, 4 FP/SIMD units (128-bit NEON each), and 3 load/store units.

The FFT butterfly operation fills all FP units perfectly. Each iteration needs 4 multiplies and 2 adds, exactly matching the hardware.

4× FP multiply1 cycle

2× FP add1 cycle

Per butterfly2 cycles

P-Core Execution Units

ALU4 ports

FP/SIMD4 ports

Load/Store3 ports

FFT butterfly: 4 muls + 2 adds per iteration. Fills all FP units

AMX Matrix Coprocessor

64×64 matrix registers, dedicated multiply-accumulate

Accessible only through Accelerate framework. 2-4× faster than NEON

Secret Weapon

The AMX Coprocessor

Apple Silicon contains an undocumented matrix coprocessor called AMX (Apple Matrix eXtensions). It's not NEON, not the Neural Engine. It's a dedicated unit with 64×64 matrix registers.

When you call vDSP_fft_zipD, you're not just getting hand-tuned assembly. You're hitting AMX for the butterfly operations.

Same Swift code, zero changes. 2-4× throughput over NEON for dense linear algebra. The Accelerate framework handles dispatch automatically.

Zero-Cost Abstractions

Swift's Hidden Optimizations

High-level code with low-level performance

🔄

ARC Traffic Analysis

The optimizer proves that most retain/release pairs are unnecessary when values don't escape or are uniquely referenced. Profiling our FFT hot path shows zero ARC traffic. All reference counting eliminated.

Swift

// Uniquely owned buffers
@inline(__always)
func fftButterfly(_ buf: inout [Double]) {
    // Compiler proves buf is unique
    // → No ARC overhead
}

📦

Stack Promotion

Small BigInts get stack-promoted. For early iterations (n < 100), numbers fit in a few words and never touch the heap allocator.

n < 64 bits

→ Stack allocated

n > 64 bits

→ Heap array

⚡

Autovectorization

LLVM autovectorizes obvious loops. 128-bit NEON registers process 2 doubles per instruction.

Swift

for i in 0..<count {
    result[i] = a[i] * b[i]
}
// → simd_mul, 2 doubles/op

Unified Architecture

Why Unified Memory Changes Everything

CPU, GPU, and Neural Engine share the same memory with the same pointers

Discrete GPU Architecture

CPURAM

↓ PCIe Copy ↓

GPUVRAM

~10 μs

transfer latency

Apple Unified Architecture

Unified Memory Pool

CPU

GPU

ANE

0 μs

zero-copy access

Future Optimization Path

1. CPU

Ring exponentiation

(serial dependency)

2. GPU

Large FFT multiplies

(parallel bulk math)

3. Same Buffer

Zero copies

(MTLBuffer shared)

4. No Sync

No PCIe stalls

(unified pointers)

The Future

Where Computing Is Heading

Industry Trajectory

AMD: APUs moving toward unified memory. ~100 GB/s today, Apple's trajectory by 2027.

Intel: Meteor Lake tiles CPU/GPU/NPU on same package with unified pools.

NVIDIA: Grace Hopper is ARM CPU + Hopper GPU with NVLink, hedging against PCIe tax.

Qualcomm: Snapdragon X Elite copies Apple's playbook directly.

What This Means

Serial algorithms hit IPC walls. You can't make a single thread faster, but you can make it wider.

Memory bandwidth trumps FLOPS. Most code is memory-bound. Unified architectures win by eliminating copies.

Hardware/software co-design wins. Apple controls Swift, LLVM, Accelerate, Metal, and the silicon. Nobody else has this.

Specialization proliferates. AMX, ANE, ProRes, Secure Enclave. General cores become orchestrators.

Pushing Further

Theoretical Limits

The Double Precision Wall

Pushing beyond F(4,000,000) encounters floating-point precision limits. Convolution of two 300,000-element arrays produces values approaching 2⁵³, where double precision can no longer represent integers exactly.

Max safe FFT sizen ≈ 2²⁰

Max result digits~3-4 million

The NTT Path Forward

The Number Theoretic Transform (NTT) replaces complex exponentials with modular roots of unity, performing exact integer arithmetic modulo carefully chosen primes.

Using multiple primes with Chinese Remainder Theorem reconstruction enables arbitrary precision at the cost of implementation complexity.

For now, double-precision FFT handles everything up to F(10,000,000+) comfortably. More than sufficient for this benchmark.

The Result

F(1,000,000), a 209,000-digit number, computes in 85 milliseconds. F(3,000,000) with 627,000 digits: under 800ms. F(10,000,000), over 2 million digits, completes in under one second.

The sequential benchmark, computing F(1), F(2), ..., F(n) until one computation exceeds one second, now reaches n ≈ 150,000+. Before fixing the O(n²) bottleneck, it topped out around 5,000-10,000.

The 500× speedup from a "minor" fix (replacing division-based digit conversion with bit extraction) underscores that asymptotic complexity analysis must account for all operations, not just the theoretically dominant ones.

Swift on Apple Silicon is a legitimate platform for high-performance numerical computing. Algebraic insight, algorithmic sophistication, systems optimization, and hardware integration combine to achieve performance competitive with C while maintaining Swift's safety guarantees and expressive syntax.

See the Project→Discuss Your Challenge

References & Further Reading

Mathematics

Knuth, TAOCP Vol. 2: Seminumerical Algorithms
Bach & Shallit, Algorithmic Number Theory
Binet, J. (1843), original Fibonacci formula

Implementation

Apple Accelerate Framework
GMP Library (reference)
Schönhage–Strassen algorithm (1971)

Computing Fibonacciat Scale

Why O(log n) Matters

Working in ℤ[√5]

Binet's Formula

The Formula

The Key Insight

Continued Fractions & Extremal Structure

The Most Irrational Number

Lamé's Theorem (1844)

Hyperbolic Geometry Connection

Ring vs Matrix

When Numbers Get Huge

The O(n²) Bottleneck

The Problem

The Fix

Apple Accelerate Framework

vDSP SIMD

Radix-2 FFT

Base 2¹⁵

Swift & App Store Optimization

FFT Threshold Selection

The Crossover

Cache Effects

Caching & Amortization

FFT Setup Caching

Buffer Pooling

The Microarchitecture Advantage

Cache Sweet Spot

Filling the Execution Units

The AMX Coprocessor

Swift's Hidden Optimizations

ARC Traffic Analysis

Stack Promotion

Autovectorization

Why Unified Memory Changes Everything

Future Optimization Path

Where Computing Is Heading

Industry Trajectory

What This Means

Theoretical Limits

The Double Precision Wall

The NTT Path Forward

The Result

References & Further Reading

Computing Fibonacci
at Scale