Writing Fast Python Code: What You Need to Know

Youssef Hossam
16 min read · Feb 18, 2021
Photo by Alistair MacRobert on Unsplash

Having used Python almost every day for just over 2 years now, I've lately started to reflect on what I like and dislike most about Python. Don't get me wrong, I absolutely love Python and I think it's become one of the most powerful languages out there, but I really don't like how painfully slow it is.

To get around Python’s slow speed, it’s quite important to be familiar with the different approaches for speeding up Python code. So, in this article we’ll be discussing some of these approaches, why they work, when to use them, and how much speedup we should expect from each of them.

The article is going to progress from the very basic but extremely powerful concepts, like algorithmic complexity, to the more advanced techniques, like Python compilation and avoiding memory bound code, so feel free to skip the parts that you think you have a good grasp on.

We'll see how adding a simple tag over a function can speed it up by several orders of magnitude, and how small improvements in time complexity can bring down the execution time of a program from 2 weeks to just over 1 second. We're going to mainly focus on Python in a data science/engineering context, since Python is becoming kind of the go-to language for most data-related projects, but a lot of the concepts that we'll discuss here apply to normal software development as well.

A lot of Python's slow speed has to do with the fact that it's an interpreted language. In simple terms, this means that every time you run a Python program, there is another program (Python's interpreter) sitting there, translating your code to machine language (the only thing computers understand) line by line while it runs. Unfortunately, this interpretation process takes a non-trivial amount of processing, which adds a lot of unnecessary overhead and slows down program execution quite significantly. In fact, compared to a compiled language like C++, Python can be anywhere from 10x to 200x slower depending on the type of program.

Before digging more into what techniques we can use to optimize our Python code, let’s keep in our minds the rather famous quote by the great Donald Knuth:

“The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.”

So, we don't really need to care about performance all the time, but when it's needed, it can make a huge difference. A solid strategy is to start by writing some working code, make sure it works correctly, profile it to find the parts that need to be sped up (if any), optimize those parts, and then compare the optimized code's output to the original code's output to check for any bugs that may have been introduced by the optimization. However, after familiarizing yourself with some of the techniques below, and practicing them on some of the problems you solve, you'll eventually get better at writing efficient code from the first try. This can make you more productive, as you'll save the time you would've otherwise spent rewriting parts of the code.

Time Complexity Is More Important Than You Think

Number of operations versus input size n for each function

Many people think that having a good grasp of time complexity is only helpful for acing programming interviews. However, understanding time complexity is important for almost anyone who writes code for a living. When using very high-level constructs with a lot of abstraction, like pandas data frames for instance, it's easy to forget about time complexity, especially if you're used to calculating time complexity by counting how many nested for loops you have, instead of understanding what the code really does.

There are a lot of excellent resources that explain time complexity thoroughly, so I won't be explaining what time complexity is here. I just want to show an example of why time complexity matters, and how profound its impact can be.

Let's start with this example: imagine that we have a pandas data frame consisting of 2 columns, "CustomerId" and "TransactionDate". Every row represents a transaction by some customer, and we have 10K customers and 1 million transactions in total. We want to build a REST API that returns a list of transaction dates given some customer ID, so we decide that we need a dictionary that goes like {cust_id: [trans_date1, trans_date2, ..]}, that is, every customer is mapped to a list of all of its transaction dates. Here is one way to do this:
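A minimal sketch of what that could look like, assuming the data frame and column names described above (the function name is just illustrative):

import pandas as pd

def build_customer_dict_slow(df: pd.DataFrame) -> dict:
    # For every customer, scan the whole data frame to collect their transactions.
    cust_dict = {}
    for cust_id in df["CustomerId"].unique():
        # This filter touches all ~1M rows, and we repeat it for each of the ~10K customers.
        cust_dict[cust_id] = df.loc[df["CustomerId"] == cust_id, "TransactionDate"].tolist()
    return cust_dict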

We simply loop over all customer IDs, and for every customer ID, we get all the relevant transactions from the data frame and add the customer ID along with all of its transactions to the dictionary. Sounds like a good idea, right? Well, not exactly. The code above should run perfectly fine and will produce the correct output, but it's extremely inefficient. For every customer in our 10K customers, we make around 1 million operations to get all of their transactions, so we're repeating a lot of work. The total amount of work is on the order of 10K*1M = 10 billion operations. Let's look at an alternative implementation:
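Again as a rough sketch, under the same assumptions about the data frame:

from collections import defaultdict

import pandas as pd

def build_customer_dict_fast(df: pd.DataFrame) -> dict:
    # A single pass over all transactions: O(T) instead of O(N*T).
    cust_dict = defaultdict(list)
    for cust_id, trans_date in zip(df["CustomerId"], df["TransactionDate"]):
        cust_dict[cust_id].append(trans_date)
    return dict(cust_dict)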

We're now basically looping over every (customer, transaction) pair, and in every iteration we add the current transaction date to the current customer's list in the dictionary. This time, the number of operations is on the order of 1M (the same as the number of transactions). This is actually the best we can do in terms of time complexity, as we have to "touch" each row/transaction at least once. This is usually known as the "best conceivable runtime". The BCR in this case (which is achieved by our faster implementation) is O(T), where T is the number of transactions. The slower version of the algorithm has a time complexity of O(N*T), where N is the number of customers. Given that N=10000, we should expect our fast implementation to be faster by a staggering 10000x! None of the techniques we'll discuss below will ever get close to such a huge speedup. In fact, nothing comes close to the improvements that better time complexity can potentially offer. And in our example we only have 10K customers; if we had, say, 1M customers, the new code would be 1 million times faster! To put that in perspective, with a 1Mx speedup, if our fast implementation takes 1 second, the slower implementation will take almost 2 weeks to run!

Notice how changing the number of customers affected our speedup. This is because better time complexity improves how the algorithm "scales" as the input size changes. That's exactly what the Big O notation is used for: we want to measure how much slower our algorithm becomes when the input gets larger. All of the other techniques we'll mention are concerned with optimizing the constant factors that are dropped by the Big O notation, so their speedups don't change much when the input size changes, and their potential impact is typically much smaller than that of time complexity.

Besides having a good understanding of time complexity, we also need to have a good understanding of basic data structures (like arrays, hashmaps/hashsets), how they work, and the time complexity for different operations on them. If you come from a CS background, you’ve probably already studied all of these concepts back in college. If you come from a different background, it’d be very helpful to take a look at any of the available DS&Algo courses online. In both cases, it’s extremely important to have a solid understanding of basic DS and algorithm concepts, as they usually have the biggest impact on speed for sufficiently large input size.
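As a quick illustration of why this matters, a membership test behaves very differently on a list and on a set, even though the code looks almost identical (a small sketch with made-up data):

# Membership tests: O(n) on a list, O(1) on average on a set.
ids_list = list(range(1_000_000))
ids_set = set(ids_list)

print(999_999 in ids_list)  # scans the list element by element
print(999_999 in ids_set)   # a single hash lookup on average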

Another common misconception about time complexity is that it only matters when working with tons of data. This isn't necessarily the case. In the example above, the file we're using is just ~17MB, which is very little data by today's standards, but despite that, improving time complexity still made a huge difference. The larger the data, the more important these concepts are, but as you can see, they matter even when we're working with little data, and can ultimately make the difference between a working program and one that's almost unusable.

If you've managed to bring your code's time complexity down to its BCR and it still isn't fast enough, then chances are Python's slow speed is at play. But don't worry, we haven't run out of ideas yet. The next sections cover several approaches that let you write Python code that executes native machine code under the hood.

Vectorization Can Be a Game Changer

Scalar vs vectorized operation

The term "vectorization" is somewhat overloaded, as it's sometimes used to describe different things, but we'll use it here to describe a programming technique that mainly uses operations that apply at once to an entire set of values.

In most imperative programming languages, we tend to think of primitive data types as the smallest building block of data in our program. We can argue that these primitive data types are still made up of bytes and those bytes are made up of bits, but in most cases, we don't have to worry about these details. This is because the programming language we're using abstracts these details away from us, so that we can perform operations on primitive data types without thinking about their inner workings.

Vectorization essentially adds an extra abstraction layer that uses containers as its smallest building block. Whether the collection is an array, a pandas series or a data frame, the concept is the same: we want a single operation to be performed on all elements in the container at once (conceptually).

In order to process the elements of a container without vectorization, we usually need to iterate over the container and process each of its elements individually. So if, for example, we want to multiply each element in an array by 2, the natural way to do this in an imperative language is to simply iterate over the array and process each element:
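Something along these lines (a small sketch with a made-up array):

import numpy as np

arr = np.random.rand(1_000_000)

# Scalar approach: touch one element at a time in a Python-level loop.
for i in range(len(arr)):
    arr[i] = arr[i] * 2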

But because NumPy offers vectorized operations, we can actually replace the loop above with only:
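Using NumPy's element-wise operations on the same array:

# Vectorized: the multiplication is applied to the whole array at once.
arr = arr * 2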

Because of the extra abstraction layer, writing vectorized code requires changing your mindset to think of a collection as a primitive variable, not as a group of elements that you have to loop over. This means that, in most cases, vectorizing your code boils down to removing the loops.

If you haven’t written vectorized code before you may look at the example above and think, “hey! that’s super easy! I’ll just remove the for loop and the indexing and that’s it!”, but unfortunately, it’s not always that simple. Let’s look at a more realistic example.

Imagine that we’re building an ML model that we want to use in forecasting a time series. Let’s assume that the values we want to predict represent the volume of some liquid, and we have a data frame that has the columns [“ID”, ”Liquid”, ”Date”]. Each row represents the liquid value for some (“ID”,”Date”) pair. To feed the data to our model, we need to build a matrix where each row represents an example, and the columns represent the volume of liquid in each month. We know that the liquid goes through a build up phase where it keeps on increasing and then it declines. Think something like the curves below.

We'll only use the post-peak part of the curve in forecasting, so we want to drop the pre-peak months. Here is one way to do it using vectorized code:
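Here is an illustrative sketch of one vectorized approach along those lines (not the actual pipeline code), assuming "Liquid" holds floats and every ID has a value for every month:

import numpy as np
import pandas as pd

def build_post_peak_matrix(df: pd.DataFrame) -> np.ndarray:
    # Pivot so each row is one ID and each column is one month's liquid volume.
    values = df.pivot(index="ID", columns="Date", values="Liquid").to_numpy()
    n_rows, n_cols = values.shape

    # Find the peak month of every row, for all rows at once.
    peak_idx = values.argmax(axis=1)

    # Shift every row left so its peak lands in column 0, which drops the
    # pre-peak months; positions past the end of each series become NaN.
    col_idx = peak_idx[:, None] + np.arange(n_cols)[None, :]
    row_idx = np.arange(n_rows)[:, None]
    shifted = values[row_idx, np.minimum(col_idx, n_cols - 1)]
    shifted[col_idx >= n_cols] = np.nan
    return shifted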

A non-vectorized approach would probably involve iterating over each row and storing the values for each example, maybe in a dictionary that is later used to remove the pre-peak part and build the final 2D array. As you can see, vectorizing our algorithm required us to approach the problem with an entirely different mindset. We no longer think of the data frame as a collection of rows, but rather as a single entity that we apply a series of transformations to in order to build the final output.

So, how much faster will the code above be compared to an iterative approach? Well, the above code is in fact a real example from our data pipeline that replaced an already existing iterative approach that looped over all rows. The code above dropped the time taken for processing our oil production data from 4 hours to 15 seconds! That’s almost a 1000x speedup! Processing production data is an integral part in many of our projects, so speeding up production processing made a big difference in several parts of our system.

You may be thinking that perhaps the existing iterative code was badly written, but it was actually algorithmically optimal. A 1000x speedup is perhaps on the high end, but it's quite common for vectorized NumPy/Pandas code to run 2 to 3 orders of magnitude faster than Python loops. The vectorization speedup mainly comes down to 2 things:

  • Most of NumPy's backend is written in C (and Pandas uses NumPy arrays under the hood). Python can be several orders of magnitude slower than C, so we get the benefit of C's native execution speed when we rely on NumPy/Pandas built-in functions.
  • For most arithmetic operations, NumPy can utilize SIMD instructions to get speedups that usually range from 4x to 8x, depending on the CPU architecture.

Just like being familiar with the basic imperative language constructs (such as for, while, if, etc.) is absolutely necessary to write working code in those languages, writing vectorized code requires being familiar with the different constructs of the framework/language used to vectorize the code. Therefore, before starting to vectorize your code using NumPy and Pandas, make sure first that you're familiar with the basic constructs that both have to offer. If you don't feel like you have a solid understanding of the basic functionality (slice, sort, merge, pivot, drop duplicates, filter, etc.), or you struggled to understand the vectorized code above, then it's time to look for a good NumPy/Pandas tutorial.

NumExpr: When Vectorized Code Isn’t Good Enough

Performance of processors and memory as a function of time

Did you know that the expression "2*a+3*b", where a and b are NumPy arrays, will create 3 new NumPy arrays that are as large as a and b? This may not sound like a big deal, but as the arrays get larger, vectorized NumPy code can be quite slow (surprise). The reason is that accessing memory is a pretty expensive operation. Compared to primitive CPU operations like addition and multiplication, accessing the RAM can be close to 2 orders of magnitude slower. That is, 100 addition instructions can take as much time as a single memory read.

If, in our example above, a and b had a shape of (10⁶, 200), the temporary matrices created would never fit in cache memory and would have to be written to RAM, since the cache is usually just a few megabytes and an array of this size needs on the order of a gigabyte of memory. As we've mentioned earlier, RAM access is ~100x slower than primitive operations, so the expression above will spend more than 99% of its time in memory reads/writes. This is a textbook example of a memory-bound program.

The above holds true for most NumPy arithmetic expressions. We can even argue that in most cases, NumPy arithmetic expressions are memory bound. Why? Because the total number of arithmetic instructions is usually less than or equal to the number of memory accesses (for every arithmetic operation we have one or more memory accesses), and as we've already mentioned, memory access is much slower, so we end up with a program that spends most of its time on memory access.

To fix this problem, a very effective and easy solution is to rely on NumExpr to execute NumPy expressions. NumExpr avoids allocating large temporary arrays by processing the arrays in chunks of 4096 elements at a time, using a register machine. To use NumExpr for our expression above, we can simply do the following:
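A small runnable sketch (with smaller arrays than the (10⁶, 200) example above, just to keep it quick to try):

import numexpr as ne
import numpy as np

a = np.random.rand(10_000, 200)
b = np.random.rand(10_000, 200)

# NumExpr evaluates the whole expression chunk by chunk, without
# materializing the intermediate arrays 2*a and 3*b in RAM.
result = ne.evaluate("2*a + 3*b")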

We just pass our expression as a string to "numexpr.evaluate" and we're done! NumExpr supports most of NumPy's API, so it should be easy to transform an already existing NumPy expression into a NumExpr string. According to the NumExpr documentation, the speedups for NumExpr with respect to NumPy can be anywhere between 0.95x and 20x, depending on the CPU architecture and the array sizes. That's not a bad improvement for such a small change to the code! NumExpr is also a dependency of Pandas, so if you have Pandas installed then you're probably ready to go!

One caveat is that NumExpr can spend some time in compiling the expression before executing it. Thus, it’s usually better to use NumExpr with large arrays only. Otherwise, the performance gain could be overshadowed by the time taken to compile the expression.

Iterating Over a Data Frame is Generally a Bad Idea

Benchmarking different approaches for iterating over the values in a data frame

Sometimes iterating over our data is inevitable. This typically happens when we need to use a scalar function that takes some values from one or more columns in our data frame, so there isn't really much we can do to avoid looping over our data in this case. What we can do, however, is extract the columns we're interested in and loop over them simultaneously. Let's look at this example:
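A sketch of the row-based version, with placeholder column names and a toy scalar function:

import pandas as pd

# Toy data frame and scalar function, just for illustration.
df = pd.DataFrame({"a": range(100_000), "b": range(100_000)})

def some_scalar_func(x, y):
    return x + y

# Row-based iteration: iterrows builds a pandas Series object for every row.
results = [some_scalar_func(row["a"], row["b"]) for _, row in df.iterrows()]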

We’re simply iterating over the rows and passing the 2 relevant values to our function in every iteration. Let’s look at another way to do it:
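And the same thing, iterating over the extracted columns instead:

# Column-based iteration: pull the two columns out once and zip over their values.
results = [some_scalar_func(a, b) for a, b in zip(df["a"], df["b"])]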

Can you see the difference? We’re now iterating over each of the 2 columns simultaneously instead of iterating over the rows. I measured the time taken by each of the 2 approaches above, using some toy function that just adds the 2 values, and the second approach turned out to be ~100x faster!

As you can see in the benchmarking plot above, the approach we've used, which relies on a list comprehension, is usually around 2 orders of magnitude faster than "iterrows", which is very close to the difference we've seen in our toy example. Notice how "apply" is slightly faster than "iterrows", but it's still pretty slow. That's because "apply" executes a for loop over the rows, so it isn't that different from looping over the rows manually. The plot above is from the excellent and very comprehensive Stack Overflow answer below.

Numba: Same Code, 100x Faster

It’s perfectly normal to feel like the claim above is too good to be true, but amazingly enough, it’s entirely possible when using Numba to speed up your code by several orders of magnitude, without any changes to the code itself. Therefore, in most cases, using Numba is your best bet if you can’t vectorize your code, since you can use it to write iterative solutions that have very good performance.

Let's look at the example below. The function "multiply_by_ind" takes in a 2D array and iterates over each of its elements. If row*col is an even number, it multiplies the element by row; otherwise, it multiplies it by col.
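A sketch matching that description (the exact original implementation may differ slightly):

import numpy as np
from numba import njit

@njit
def multiply_by_ind(arr):
    # Plain nested Python loops; Numba compiles them to native machine code.
    rows, cols = arr.shape
    for row in range(rows):
        for col in range(cols):
            if (row * col) % 2 == 0:
                arr[row, col] *= row
            else:
                arr[row, col] *= col
    return arr

result = multiply_by_ind(np.random.rand(5000, 5000))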

Have you noticed that "@njit" tag? Well, that's where all the magic happens. I measured the time taken for the function to process a 5000x5000 matrix with and without the "@njit" tag, and the speedup when using it was well over 100x! So, how does this magical "@njit" tag work?

Remember when we said earlier that Python can be several orders of magnitude slower than compiled languages like C++? Numba solves this problem by simply compiling python code. The compilation happens “Just in time”, which means that the first time you call a Numba function, it will be compiled to native machine code on the fly and cached in memory. Whenever you call this function again, you’ll be calling a machine code function, so you’ll get a very similar performance to that offered by compiled languages.

After seeing the results above, it may be very tempting to think of Numba as some sort of silver bullet that will solve all your performance problems. You may even wonder why it isn't supported by Python out of the box, if we're writing the same code and getting 100x better performance. But unfortunately, Numba only supports a subset of the Python and NumPy APIs. This means that you can't just add the "@njit" tag to compile any Python function you have; you have to make sure that the code sticks to the subset of the language that's supported by Numba. A solid approach is to write your code as you normally would, and then use Numba to compile the functions that prove to be time consuming (if any).

There are other alternatives to Numba, such as Pythran and Cython, that can compile Python code and offer similar performance benefits. Each of these options has its drawbacks, but if you're looking for a quick speedup to some already existing code, then Numba is, in my opinion, the easiest to start with. The article below has a pretty good overview of the 3 techniques, along with some benchmarking results for each of them.

Final Thoughts

Code optimization is more of an art than a science. There's no clear recipe for what you should do to speed up your code, and there's absolutely no limit to the ideas you can come up with. However, if I had to recommend a systematic approach to optimizing Python code, it would probably be something similar to this pseudo Python code:
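Something along these lines (every helper name here is made up, purely to sketch the order in which I'd try the techniques above):

# Pseudo code only: these helper functions are illustrative, not a real API.
def optimize(code):
    if not fast_enough(code):
        code = improve_time_complexity(code)
    if not fast_enough(code):
        code = vectorize(code)              # NumPy / Pandas
    if not fast_enough(code) and is_memory_bound(code):
        code = use_numexpr(code)
    if not fast_enough(code):
        code = compile_with_numba(code)     # or Cython / Pythran
    return code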

That's just one way to think about using the various ideas we've discussed. Sticking to a pre-defined plan is generally not a very good idea when optimizing your code, but the pseudo code above can act as a general guideline. None of the ideas presented in this article is a one-size-fits-all solution, so it's essential to think about the pros and cons of each approach before picking the one that works best for the problem at hand, and last but not least, make sure to always avoid premature optimization.
