Today, I want to talk about something that honestly blew my programming mind. I was taking a course on Neural Networks and Deep Learning – the one by Andrew Ng, former head of Baidu AI Group and Google Brain.
And I stumbled upon a method I’d never encountered before for shortening your code and making it run 150 times faster.
Yes, 150 times – sometimes even 200 times faster. (I have screenshots in this article.)
You: “Get outta here, Max! That’s crazy talk.”
Don’t worry, I thrive off of disbelief. Let’s pack that skepticism into our carry-on, and get directly into the article.
P.S. There’s a free treat for everyone who makes it to the end of this article (come on, it’s not that long).
The Issue with Loops
If you know how to program, you know about loops. Loops are a wonderful method that allows us to repeat entire blocks of code. They make huge programming projects a bearable task. Without loops, I’d be yelling “RUN IT AGAIN” at my code like 3,000 times a day.
However, if you’ve ever had to deal with, or even heard of, asymptotic analysis of an algorithm (Big O notation), you’ll also know the efficiency problems that can come with loops.
If you have a loop whose iteration count depends on how much data you have, so that you look at and operate on every single element, your algorithm’s runtime is already linear in the input size (at best).
This means: the time the code takes to run is directly related to the amount of data you have. (aka. mo’ data, mo’ waiting around)
Now add in a second loop, and things soon start to look pretty scary. The algorithm you developed scales quadratically (squared) with input size.
This means: double the data, and the runtime roughly quadruples. (aka. mo’ data, CRAZY mo’ waiting around)
What worked so well and so fast in the prototyping stage starts wreaking havoc once you implement it at scale.
So… What do you do?
What if you can’t figure out how to reduce the asymptotic behavior of the algorithm, or even worse, it’s not reducible?
Maybe you start spending your time developing a highly complex, but slightly more efficient approximation algorithm, sacrificing accuracy for some efficiency.
- How much accuracy can you sacrifice?
- What unforeseen effects can these approximations have in the long run?
Sometimes, when you use loops, say for manipulating data, you’re really just applying the same operation to every element. Maybe you’re transforming a probability to a percentage, performing a normalization, or correcting for a systematic error.
Since your operation isn’t dependent on the previous outcomes of the loop, this would be a perfect case for parallelization, but writing an algorithm with parts running in parallel doesn’t sound that appealing.
It seems to add complexity. After all, once the child processes have been created, run their computations, and been recombined, will the algorithm really run that much faster?
The Solution through Vectorization
The best solution would be to perform that operation on every piece of data at the same time.
And lucky for us – that is something modern processors are able to do.
You just have to tell them to do it.
When you run a loop over your data, you’re executing a single instruction on a single piece of data, one element at a time.
What you want instead is to execute a single instruction on multiple pieces of data at once. In processor terms, this is called a SIMD (single instruction, multiple data) operation.
So you’re probably like ‘Cool story, Max. This is all great to know, but how can we implement this?’
We do this by implementing a little, glorious thing called vectorization.
Since we’re operating at a much higher level, we really don’t want to be writing processor instructions by hand. Fortunately, most languages have built-in support for these vector operations, and those that don’t have 3rd party libraries that enable them.
If we take a look at Python, the NumPy library can perform vector operations on its ndarray object: as long as you’re working with numpy arrays, vector operations are available to you.
For a normal Python list, the following code would give you an error:
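Something like this (a minimal sketch; the list values are just for illustration):

```python
my_list = [1.0, 2.0, 3.0, 4.0]

try:
    result = my_list / 5
except TypeError as e:
    # Plain Python lists don't define division by a number:
    print(e)  # unsupported operand type(s) for /: 'list' and 'int'
```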
It will tell you it can’t divide a list by a number, simply because that operation isn’t defined. What you would have to do instead is something like the following:
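For instance, looping over the list yourself (a minimal sketch):

```python
my_list = [1.0, 2.0, 3.0, 4.0]

# Divide each element individually...
result = []
for x in my_list:
    result.append(x / 5)

# ...or, equivalently, with a list comprehension.
result = [x / 5 for x in my_list]
print(result)  # [0.2, 0.4, 0.6, 0.8]
```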
If you use a numpy array though, the following will work:
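A minimal sketch of the numpy version:

```python
import numpy as np

my_array = np.array([1.0, 2.0, 3.0, 4.0])

# One vector operation divides every element at once.
result = my_array / 5
print(result)  # [0.2 0.4 0.6 0.8]
```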
The reason this works is because the operation of dividing a numpy array by a number is defined.
And actually, what’s really happening is that numpy is doing something called broadcasting, which is a fancy way of saying it virtually stretches the smaller operand (here, the single number) to match the array’s shape so the two are compatible, but that’s not that important right now.
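If you’re curious, here’s a quick taste of broadcasting (a sketch; the arrays are just for illustration):

```python
import numpy as np

a = np.array([[1.0, 2.0],
              [3.0, 4.0]])      # shape (2, 2)
row = np.array([10.0, 20.0])    # shape (2,)

# numpy virtually "stretches" row to shape (2, 2) so the shapes match.
print(a + row)
# [[11. 22.]
#  [13. 24.]]
```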
What’s really important is what NumPy is doing when you perform that operation. It uses the SIMD processor operation that we saw earlier to make your algorithm run a lot faster.
Is it Actually That Much Faster?
Now comes the question:
Does this really have that noticeable of an effect on speed though? Let’s take a look!
We’ll generate 1,000,000 random numbers and perform a simple operation on each element, namely just dividing it by 5.
We’ll be using the numpy.random.randn() method to create our random numbers. Since it returns a numpy array, we can use vector operations on the result right away.
If we take a look at the following code:
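Something along these lines (a minimal sketch; exact timings will vary by machine):

```python
import time
import numpy as np

data = np.random.randn(1_000_000)

# Loop version: divide one element at a time.
start = time.perf_counter()
loop_result = np.zeros(len(data))
for i in range(len(data)):
    loop_result[i] = data[i] / 5
loop_time = time.perf_counter() - start

# Vectorized version: the whole array in one operation.
start = time.perf_counter()
vec_result = data / 5
vec_time = time.perf_counter() - start

print(f"Loop:       {loop_time:.4f} s")
print(f"Vectorized: {vec_time:.4f} s")
print(f"Speed-up:   {loop_time / vec_time:.0f}x")
```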
We get an algorithm speed-up of around 150x, sometimes even over 200x.
>> The power of vectorization is pretty badass, and should be implemented whenever possible, especially because these speed ups are incredible in real world applications. <<
It is important to note, though, that you can’t always replace loops with vector operations. One example of a loop you can’t replace: repeating a whole experiment, say to compute the average speed-up factor across several timing runs. Consider the following code:
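A sketch of what that could look like (the variable name speedUpFactor is the one this article refers to):

```python
import time
import numpy as np

speedUpFactor = []

# The outer loop repeats an entire timing experiment, so it can't be vectorized.
for trial in range(5):
    data = np.random.randn(1_000_000)

    start = time.perf_counter()
    loop_result = np.zeros(len(data))
    for i in range(len(data)):
        loop_result[i] = data[i] / 5
    loop_time = time.perf_counter() - start

    start = time.perf_counter()
    vec_result = data / 5
    vec_time = time.perf_counter() - start

    speedUpFactor.append(loop_time / vec_time)

print("Average speed-up:", np.mean(speedUpFactor))
```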
Here, the outer for loop is necessary, because we’re not applying one operation to a data structure; we want to repeat a specific block of code.
Knowing when you can use vectorization and when you need a loop is something you’ll get the hang of once you start using it in your own code. Still, it’s good practice to reach for numpy arrays and the methods in the numpy library by default, because they implement these performance optimizations for you (notice I used np.mean(speedUpFactor) instead of sum(speedUpFactor)/len(speedUpFactor)).
The other, possibly lesser-known, upside of numpy is that other data science libraries, e.g. scikit-learn and pandas, are built on numpy arrays as their foundation.
This means you can also use these vectorization methods when working with those libraries, which will help you shorten your code, speed up your algorithms, and prototype and test so much faster.
Why It’s So, So Badass
It’s going to sound kind of dramatic, but this is the only real way I can describe it: finding this method made me feel like this was my programming Enlightenment. I had gone my entire programming career incomplete, and I had finally found the missing piece of the puzzle.
It gave me such a sense of peace because it:
- Helps me prototype machine learning implementations much faster
- Shortens my code and makes it easier/more intuitive to read
- Allows me to pre-process my data a lot faster
- Enables me to quickly create new data indicators
TL;DR: I don’t have to wait around as long for my code to run so I get to do other fun stuff instead.