by Marvin Taschenberger

The Optimization Paradox
When facing performance challenges in computational code, many developers instinctively reach for the nuclear option: “We need to rewrite this in a faster language.” The logic seems sound: compiled languages like Rust or C++ should always outperform interpreted languages like Python, and the promise of 10x or 100x speedups makes a complete rewrite seem worthwhile despite the considerable engineering time required.
In software development, we often face the classic tradeoff between development speed and execution speed. However, a third path is frequently overlooked: strategic optimization of specific bottlenecks. This approach can deliver dramatic performance improvements with a fraction of the engineering effort a complete rewrite would require.
To illustrate this principle, we will benchmark multiple implementations of a Monte Carlo simulation for option pricing—a computationally intensive algorithm common in financial applications. This example is a convenient showcase, but the principles apply across domains, from scientific computing to web services to machine learning pipelines.
The results came with a twist. In some cases, a few lines of vectorized Python code outperformed a complete rewrite in Rust. The PyPy JIT compiler showed different relative performance than CPython across implementation strategies. The most unexpected finding: in this specific context, the “fastest” language wasn’t the fastest approach.
This isn’t about declaring Python superior to Rust, or vectorization better than compilation. It’s about a fundamental principle that should guide all performance optimization work: understanding your specific bottlenecks matters more than following general performance dogma.
The Technical Challenge: Monte Carlo Option Pricing
Before diving into optimization strategies, let’s understand the computational problem we’re solving: option pricing using Monte Carlo simulation.
What are Options?
Options are financial contracts that give the holder the right (but not the obligation) to buy or sell an asset at a predetermined price (the strike price) before or at a specific date (expiration). Determining the fair price of these contracts involves modeling uncertainty and incorporating various market factors.
Monte Carlo Simulation: Perfect for Complex Pricing
Monte Carlo methods use repeated random sampling to obtain numerical results. For option pricing, we:
- Generate thousands or millions of possible price paths for the underlying asset
- Calculate the option’s payoff for each path
- Average these payoffs and discount them to present value
This approach is particularly valuable for complex options where closed-form solutions don’t exist. The accuracy of the price estimate improves with more simulations, following a relationship proportional to 1/√n, where n is the number of simulations.
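To make this concrete, here is a tiny illustration of the 1/√n scaling (the payoff standard deviation of 15 is a made-up figure for demonstration):

```python
import math

# The standard error of a Monte Carlo average is sigma / sqrt(n):
# quadrupling the path count halves the error.
sigma_payoff = 15.0  # hypothetical standard deviation of a single payoff
for n in (10_000, 40_000, 160_000):
    se = sigma_payoff / math.sqrt(n)
    print(f"n = {n:>7,}  ->  standard error ~ {se:.4f}")
```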
Our benchmark implements a European call option pricing model using the Black-Scholes-Merton framework. At its core, the algorithm draws a terminal asset price for each path, computes the call payoff, and discounts the average back to present value. The baseline is a straightforward pure-Python loop.
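A minimal pure-Python sketch of that baseline (the function and parameter names here are illustrative, not necessarily the benchmark's exact code):

```python
import math
import random

def price_european_call(s0, k, r, sigma, t, n_paths, seed=42):
    """Monte Carlo price of a European call under Black-Scholes-Merton."""
    rng = random.Random(seed)
    drift = (r - 0.5 * sigma ** 2) * t
    vol = sigma * math.sqrt(t)
    payoff_sum = 0.0
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)                # one standard normal draw per path
        s_t = s0 * math.exp(drift + vol * z)   # terminal asset price
        payoff_sum += max(s_t - k, 0.0)        # call payoff
    return math.exp(-r * t) * payoff_sum / n_paths  # discounted average

print(price_european_call(100.0, 100.0, 0.05, 0.2, 1.0, 100_000))
```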
This implementation is clear and readable but suffers from Python’s interpreted nature. Each loop iteration incurs interpreter overhead, and mathematical operations happen one at a time.
Python with NumPy: Vectorization
Next, we vectorize the algorithm using NumPy, which performs operations on entire arrays at once.
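A vectorized sketch of the same calculation (again with illustrative names):

```python
import numpy as np

def price_european_call_numpy(s0, k, r, sigma, t, n_paths, seed=42):
    """Vectorized Monte Carlo price of a European call."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_paths)               # all normal draws at once
    s_t = s0 * np.exp((r - 0.5 * sigma**2) * t + sigma * np.sqrt(t) * z)
    payoffs = np.maximum(s_t - k, 0.0)             # elementwise call payoff
    return float(np.exp(-r * t) * payoffs.mean())  # discounted average

print(price_european_call_numpy(100.0, 100.0, 0.05, 0.2, 1.0, 1_000_000))
```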

This approach takes advantage of NumPy’s underlying C implementation to perform computations on large arrays without Python loop overhead. It also benefits from cache locality by processing contiguous memory blocks.
Cython: Compiled Python
Cython offers a way to compile Python-like code to C, potentially eliminating interpreter overhead.
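A typed Cython sketch (illustrative, compiled separately with cythonize; the C standard library's rand() stands in here for whatever generator the benchmark actually uses):

```cython
# mc_option.pyx
cimport cython
from libc.math cimport exp, sqrt, log, cos, M_PI
from libc.stdlib cimport rand, RAND_MAX

@cython.cdivision(True)
def price_european_call(double s0, double k, double r, double sigma,
                        double t, int n_paths):
    cdef double drift = (r - 0.5 * sigma * sigma) * t
    cdef double vol = sigma * sqrt(t)
    cdef double payoff_sum = 0.0
    cdef double u1, u2, z, s_t
    cdef int i
    for i in range(n_paths):
        # Box-Muller transform: two uniforms -> one standard normal
        u1 = (rand() + 1.0) / (RAND_MAX + 2.0)
        u2 = (rand() + 1.0) / (RAND_MAX + 2.0)
        z = sqrt(-2.0 * log(u1)) * cos(2.0 * M_PI * u2)
        s_t = s0 * exp(drift + vol * z)
        if s_t > k:
            payoff_sum += s_t - k
    return exp(-r * t) * payoff_sum / n_paths
```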

Cython eliminates Python’s interpreter overhead by compiling to C code. It also allows static typing, which can lead to more efficient code generation.
Cython with NumPy: Hybrid Approach
We can combine Cython’s compilation with NumPy’s vectorization.
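A sketch of the hybrid: random number generation and array math stay in NumPy, while Cython adds static typing around them (illustrative names again):

```cython
# mc_option_np.pyx
import numpy as np
cimport numpy as cnp

def price_european_call(double s0, double k, double r, double sigma,
                        double t, int n_paths, int seed=42):
    rng = np.random.default_rng(seed)
    cdef cnp.ndarray[cnp.float64_t, ndim=1] z = rng.standard_normal(n_paths)
    # The heavy lifting still happens inside NumPy's C routines
    s_t = s0 * np.exp((r - 0.5 * sigma * sigma) * t + sigma * np.sqrt(t) * z)
    payoffs = np.maximum(s_t - k, 0.0)
    return float(np.exp(-r * t) * payoffs.mean())
```

Since the work is already delegated to NumPy here, the type annotations buy relatively little.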

This approach aims to get the best of both worlds: Cython’s compilation benefits and NumPy’s efficient array operations.
Rust: Native Systems Language
Finally, we implement the algorithm in Rust, a compiled systems language focused on performance and safety.
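A sketch of the Rust core (the benchmark presumably exposes it to Python through a binding layer such as PyO3; the tiny xorshift generator below is inlined only so the sketch needs no external crates):

```rust
use std::f64::consts::PI;

/// Minimal xorshift64 generator, standing in for a real RNG crate.
struct XorShift64 {
    state: u64,
}

impl XorShift64 {
    fn new(seed: u64) -> Self {
        Self { state: seed.max(1) } // xorshift must not start at zero
    }
    fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }
    /// Uniform draw in the open interval (0, 1).
    fn next_f64(&mut self) -> f64 {
        ((self.next_u64() >> 11) as f64 + 0.5) / (1u64 << 53) as f64
    }
}

pub fn price_european_call(
    s0: f64, k: f64, r: f64, sigma: f64, t: f64, n_paths: u64, seed: u64,
) -> f64 {
    let mut rng = XorShift64::new(seed);
    let drift = (r - 0.5 * sigma * sigma) * t;
    let vol = sigma * t.sqrt();
    let mut payoff_sum = 0.0;
    for _ in 0..n_paths {
        // Box-Muller transform: two uniforms -> one standard normal
        let u1 = rng.next_f64();
        let u2 = rng.next_f64();
        let z = (-2.0 * u1.ln()).sqrt() * (2.0 * PI * u2).cos();
        let s_t = s0 * (drift + vol * z).exp();
        payoff_sum += (s_t - k).max(0.0); // call payoff
    }
    (-r * t).exp() * payoff_sum / n_paths as f64 // discounted average
}
```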

Rust offers direct compilation to machine code, precise control over memory, and zero-cost abstractions. In theory, it should provide the highest performance of all these approaches.
PyPy: Alternative Python Implementation
In addition to these implementations, we also test the pure Python and NumPy implementations using PyPy, an alternative Python implementation with a Just-In-Time (JIT) compiler that can automatically optimize code at runtime. This doesn’t require any code adjustment; it just needs a simple switch of interpreters.
Results Analysis: The Surprising Performance Landscape
After benchmarking all implementations with one million simulation paths, the results challenged conventional wisdom. Here’s how each approach performed relative to pure Python on CPython (normalized as 1.0x):
- Pure Python (CPython): 1.0x
- Rust (called via FFI): 1.9x
- Pure Python (PyPy): ~7.6x
- Cython (basic): 8.6x
- NumPy (CPython and PyPy): ~13-14x
Several surprising patterns emerged:
The NumPy Advantage: Vectorization Wins
The most striking result is that the NumPy-based implementations (on both CPython and PyPy) perform best, achieving a staggering 13-14x speedup over pure Python on CPython. This demonstrates the power of vectorization for computation-heavy tasks.
Rust’s Underwhelming Performance (1.9x)
Another surprising result is that the Rust implementation achieved only a 1.9x speedup over pure Python, roughly a quarter of PyPy’s pure-Python speed and about one-seventh the speed of the NumPy implementations. This directly contradicts the common belief that rewriting performance-critical code in a systems language automatically yields dramatic improvements.
There are several possible explanations:
- Python-Rust Interface Overhead: The bottleneck may lie in the FFI (Foreign Function Interface) between Python and Rust, not in the Rust code itself
- Suboptimal Rust Implementation: The Rust code itself may not be fully tuned, for example lacking SIMD, multithreading, or a faster random number generator
- Data Transfer Costs: Moving inputs and results across the Python-Rust boundary adds overhead on every call
The PyPy Effect
PyPy’s JIT compilation provides about an 8x speedup over CPython for the pure Python implementation. However, its advantage was much smaller when using NumPy, suggesting that PyPy’s JIT is most effective when optimizing pure Python code rather than code that already delegates to compiled C libraries.
Cython’s Mixed Results
Cython Basic (8.6x) offers only a modest improvement over pure Python running on PyPy (7.6x), a small return given the added compilation step and type annotations.
Consistency and Variance
Looking at performance stability across multiple runs, the NumPy implementations not only run fastest but also show the most consistent results. The vectorized NumPy implementation has the lowest coefficient of variation, indicating more predictable performance, an essential factor in production environments.
What’s Happening Under the Hood?
To understand these results, we need to consider what’s happening at a lower level:
- Memory Access Patterns: NumPy operations process data in contiguous memory blocks, maximizing cache efficiency and enabling SIMD (Single Instruction, Multiple Data) operations at the CPU level.
- Python-to-Native Overhead: The Rust implementation likely suffered from the overhead of crossing the Python-Rust boundary. Even though the Rust code itself is efficient, the FFI (Foreign Function Interface) overhead negated much of its advantage.
- Specialized vs. General Optimization: NumPy’s numerical routines are highly optimized for scientific computing, benefiting from decades of performance tuning. By contrast, a general-purpose language like Rust provides no such domain-specific machinery out of the box, and this implementation was not tuned for this type of workload.
- JIT vs. AOT Compilation: PyPy’s just-in-time compilation can optimize based on actual runtime patterns, sometimes outperforming ahead-of-time compilation for specific workloads.
These results tell us something crucial about performance optimization: the fastest approach depends heavily on the specific nature of your bottlenecks. In this case, the bottleneck wasn’t Python’s interpreter overhead or GIL (which Rust would have addressed) but rather the way memory is accessed and processed during computation (which NumPy addresses brilliantly).
Conclusion: Strategic Optimization Beats Blind Rewrites
Our Monte Carlo option pricing case study offers a clear lesson: strategic optimization based on understanding specific bottlenecks yields far better results than following generic performance dogma.
The conventional wisdom that “Rust is faster than Python” wasn’t wrong in principle: Rust executes individual operations faster than Python. That advantage simply didn’t matter here, because the bottleneck was bulk array computation on large datasets, exactly the territory where NumPy’s vectorized operations excel.
The most effective approach to performance optimization isn’t about choosing the “fastest” language or framework in absolute terms. It’s about:
- Understanding your specific performance bottlenecks through measurement
- Selecting the right tool to address those particular bottlenecks
- Weighing the performance gains against implementation complexity
- Considering the entire system, including interface overhead
Note: The benchmarking code for this blog post is available on GitHub. Feel free to run the tests yourself and contribute to the discussion on optimization strategies.