The world as we know it today would not exist without programming libraries.
I don’t want to be dramatic but cities would collapse, the internet would break in half, and programmers worldwide would fall into a deep, dark depression.
I know most people that have never programmed before think of programming as a very manual process of typing up code.
I watched the same action movies as everyone else growing up; the ones with the black screen of ‘code’, someone typing furiously fast and thousands of text lines running through the screen, and I remember thinking to myself: “That looks awfully complicated. Also, how are they typing that fast?”
The truth is: that guy was probably just using a programming library.
Fortunately for all of us, programming libraries are tools that data scientists and programmers use every single day to make their lives easy.
In this post, we’re going to dive into what programming libraries are, what to use them for in data science, and the top Python libraries for data science and Machine Learning. We’ll break down what these libraries do and how you can get set up using them.
What are Python Libraries?
Programming libraries are pre-written pieces of code that are available online for anyone to use. You can download these little pre-written sections of code and import them into your own programs so that you don’t have to start from scratch each time.
It is also the way that people who create programs share their work with others. This allows anyone to quickly build new programs without having to worry about creating everything manually while starting from a blank screen.
Code sharing, also called open-source code, is basically the way a programming language develops.
You can compare it to the auto industry and their process for building cars. Do car manufacturers reinvent the wheel for every car they build?
No, they don’t.
They use the same foundational designs not just for the wheel, but also for tires, for car bodies, and for engines.
Similarly, programmers don’t have to recreate the same code over and over again.
Programming libraries also get continuously updated and improved, so not only do you repeat yourself less, but things also get better and faster over time.
This is all around awesome.
Why?
For one, this saves a lot of time for everyone, because you can just continue where someone else has left off, and just as importantly, it allows newcomers and others that are interested to just take the code and play around with it.
Although the people that develop the open-source code may be researchers or have advanced training in mathematics and computer science, once the code has been completed and is up and working, all anyone else needs to do is just import it and hit run.
This means that anyone who, say, wants to scrape some data from the interwebs can easily do so by importing a web scraping library. That way they don’t have to worry about dealing with complicated issues like network communication.
You just import the relevant web scraping library, type in the website URL that you want to scrape, hit run and ta-da — in just two lines of code, your program has now visited a website.
Awesome, right?
Fortunately, people enjoy building all sorts of different things and sharing their code, be that privately from one person to another, or publicly through platforms such as GitHub. Code-sharing is a very big thing in the programming community.
More than that, everyone can do it. That means you can use the latest artificial intelligence libraries, developed by insanely smart researchers at Universities or at companies, and be able to start using them yourself.
Of course, if you ever feel comfortable enough, you can always contribute to open-source code, or create new code of your own to share and allow others to use.
Think of using open-source code like using a computer: the people who make and improve it have a deep understanding of different fields of technology, hardware, and software, but if you want to use the computer, all you really need to do is turn it on and play around with it to learn what you can do with it.
In addition, software updates roll out regularly that you just have to hit install on, and these make your computer faster, more secure, or add even more features for you to use.
This open source code is what we call modules or libraries and, chances are, if you’ve ever programmed, you’ve probably already used one.
Now you may say: “Alright, so some common things may have libraries built to help me do that, but what if I want to do something really unique and niche?”
Good question.
This is where the size of the community becomes important, and why it’s such a great idea to use popular languages. Python is one of the most popular languages world-wide, and the Python community is so big that it has over 200,000 libraries available.
Okay, so now that you have a general understanding of why libraries are fantastic, let’s move on to more practical applications of libraries.
What to Use Python Programming Libraries for in Data Science
The great thing about programming libraries is that all the features come pre-packaged, and so all you have to think about is what you want to use it for.
- Want to create your own game? Great! All you have to do is import PyGame and code out the rules of your game.
- Want to make amazing data visualizations? How about installing matplotlib and visualizing the data you got from your web scrape?
- Have to deal with dates and times in your program? Use datetime!
And you may be thinking: “Do I reaaaaally need a library to figure out dates and times in my program?”
Things like dates and times may seem easy until you start to factor in the apparent randomness of some months having 30 days, others 31, and then you’ve got February who’s stuck at 28 but gets an extra day dangled onto it every 4 years.
And then you’ve got all these time zones PST, EST, CST (not to be confused with CT), WIST, GMT (which is the same time as UTC). You can see how this can get very complicated very fast.
No worries though, just use datetime and all of this is taken care of for you.
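Just to give you a taste, here’s a tiny sketch (the dates and the UTC-8 offset are just made up for illustration) of the kind of thing datetime handles for you:
from datetime import datetime, timedelta, timezone

# Adding one day to Feb 28 in a leap year correctly lands on Feb 29
leapDay = datetime(2020, 2, 28) + timedelta(days=1)
print(leapDay)  # 2020-02-29 00:00:00

# Converting a UTC timestamp to a fixed-offset time zone (here UTC-8)
utcTime = datetime(2020, 6, 1, 15, 30, tzinfo=timezone.utc)
shiftedTime = utcTime.astimezone(timezone(timedelta(hours=-8)))
print(shiftedTime)  # 2020-06-01 07:30:00-08:00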
Programming libraries allow us to skip manually coding the standard, pre-defined, and sometimes majorly complex, underlying processes, and focus instead on what we want to do with the end result of our program.
Especially for data scientists, our focus should be on analyzing and drawing conclusions from the data and less on creating the programs that help us do our jobs.
Disclaimer: I know I’m making it sound very plug and play, but the whole process may well take more than 5 minutes. You do need to find what you want to use and figure out how you want to use it.
My general point is that doing all of these new and cool fancy things won’t require you to be some sort of programming prodigy that types 3,000 words per minute; it really just requires some motivation to get you through the looking/reading/course watching to learn how to code and use the libraries you’re interested in, and some excitement to then take what you’ve learnt and implement it into your own project.
7 Python Programming Libraries for Data Science & What to Use Them For
Alright, let’s get into it and take a look at the core Python libraries that every data scientist should know, and go into what they are, why you should know it, and what they’re used for.
1. NumPy
What is NumPy?
NumPy, short for Numerical Python, is, much like the name suggests, a library used for numerical calculations in Python. It’s all built upon the basic datatype called the NumPy array, which you can imagine as a Python list, except much faster and much more versatile.
How to use NumPy
Importing NumPy and converting a Python list to a NumPy array is very easy. You can just do the following:
import numpy as np

pythonList = [...]  # whatever your list is
numpyArray = np.array(pythonList)
So what makes this NumPy array so special?
Let’s say you want to multiply every number in your list by 100 to convert it from a proportion or probability to a percentage.
In Python, your code would look something like this
for i in range(len(listOfNumbers)):
    listOfNumbers[i] *= 100
Whereas using the NumPy array, it would look like this
listOfNumbers *= 100
Okay, so we’ve shortened our code from two to one line… what gives?
Not only can you use additional NumPy syntax to shorten large loops to much smaller ones, the real power lies in how fast these calculations are performed.
Let’s say we run the above code on a set of 100,000,000 numbers. If we use the standard Python list, it takes about 86.8 seconds to run. Not bad for one hundred million numbers.
But if we do the same thing using NumPy it takes a mere 1.3 seconds. How is that possible?
It’s possible because NumPy is programmed to automatically use your processor to its full extent.
What does that mean? Well, imagine you’re at a restaurant that has 10 waiters. All 10 of them are lined up against the wall and only one waiter is allowed to go out and take an order at a time; only once that waiter has come back does the next waiter go out to take the next order.
Inefficient, right? That’s about what the code is doing when you use the Python list code above.
NumPy, on the other hand, runs the restaurant much more efficiently by letting all the waiters take orders at the same time.
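If you want to check the speed difference on your own machine (your exact numbers will vary with your hardware, and this sketch uses ten million numbers instead of one hundred million to be kind to your memory), a quick timing comparison could look like this:
import time
import numpy as np

listOfNumbers = [0.5] * 10_000_000   # ten million numbers; scale up or down as you like
numpyNumbers = np.array(listOfNumbers)

start = time.perf_counter()
for i in range(len(listOfNumbers)):
    listOfNumbers[i] *= 100          # plain Python loop
print("Python list:", time.perf_counter() - start, "seconds")

start = time.perf_counter()
numpyNumbers *= 100                  # vectorized NumPy operation
print("NumPy array:", time.perf_counter() - start, "seconds")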
That is a big part of the appeal of using NumPy; many libraries, such as SciPy, Pandas, and Matplotlib, which you’ll learn more about below, are actually built using NumPy as their foundation.
If you want to know more about NumPy, you can read about it in my blog post about vectorization here.
2. Requests
What is Requests?
Requests, built on top of urllib, is a library used to make HTTP requests/send messages over the internet. It’s basically your gateway to the online world within Python.
Requests lets you do all sorts of cool things, like connecting to a website or an API to either get information or send information there.
This is awesome for data scientists because it gives you access to the vast sea of information and data that is the internet. It literally opens the (network) gateway for you to get data from anywhere online.
Besides letting you get data from dedicated APIs that are available either from a company you’re working for, or through an outside data source, it also opens up the world of web scraping to you.
In case you’re not familiar with web scraping or if you’re really interested in it, you can read more about it in my absolute beginner’s guide to web scraping.
How to use Requests
Let’s say you want to count how many times the word Python appears on my blog homepage; you can do it like this:
import requests
blogResponse = requests.get("https://codingwithmax.com/")
blogPage = blogResponse.text
blogPage.count(“Python”)
That’s it – easy peasy.
After importing Requests, we needed just one line of code to connect to my blog; the next line saves the page text into the blogPage variable, and in the final line we count the number of occurrences of "Python".
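Requests works just as well for APIs; as a hedged sketch (the URL and query parameters here are purely made up for illustration), fetching JSON data could look like this:
import requests

# Hypothetical API endpoint and query parameters, just for illustration
response = requests.get("https://api.example.com/beans",
                        params={"origin": "Ecuador", "limit": 10})
if response.status_code == 200:     # 200 means the request succeeded
    beanData = response.json()      # parse the JSON body into Python lists/dicts
    print(len(beanData))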
3. Beautiful Soup
What is Beautiful Soup?
Web scraping, however, is about much more than just counting occurrences of a word like "Python", and getting data from a website can be challenging because it’s usually not very structured.
After all, the main idea of a website is not to give you an easy and straightforward way to quickly extract data from it, but rather to be informative and useful for its visitors.
Luckily, websites are formatted using HTML (and CSS), which gives them some structure, and that’s where Beautiful Soup comes in.
Beautiful Soup is a library that lets you easily navigate HTML or XML tags, i.e. things like <title>This is a title</title>.
How to Use Beautiful Soup
Let’s say you want to find out the title of each of my blog posts, and you already have all the blog post links (through an earlier web scrape) and are now going through them one by one.
Since you already know this information is contained within the title tag of the page, you could just pull up the page source (the HTML of the blog), look for the title tag, and then take everything after the ">" up to the next "<" (not including, of course, any "<" that may appear in the title itself).
In Beautiful Soup, however, you can just directly access the title tag; it’s as easy as:
from bs4 import BeautifulSoup
soup = BeautifulSoup(pageText, features="html.parser")
soup.find(“title”).text
The pageText is, for example, the blogPage text from the requests section above.
The simplicity that Beautiful Soup brings to web scraping means you can quickly parse websites without needing to write out complex logic to get to the values you’re interested in.
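And if you wanted to collect all those blog post links in the first place, a minimal sketch (reusing the pageText from above; how you filter the links is up to you and your site) could look like this:
from bs4 import BeautifulSoup

soup = BeautifulSoup(pageText, features="html.parser")
# find_all returns every <a> tag on the page; the href attribute holds the actual link
blogLinks = [link.get("href") for link in soup.find_all("a") if link.get("href")]
print(blogLinks[:5])   # peek at the first few links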
4. Selenium
What is Selenium?
Most websites have some content that is generated as a user visits the website and interacts with it. Other things, such as live prices, are loaded from somewhere and updated for as long as the website is open.
If items on the page are generated, this means you can’t just get the HTML text once and then look through it; you need a way to open the website, load the content, and then read the HTML once everything (or at least everything you’re interested in) has loaded.
This is where Selenium comes in.
Selenium allows you to automatically open and navigate websites using a web browser, which lets you do anything from monitoring and extracting live-updated data to actively interacting through selecting drop-down items or even logging into websites.
How to Use Selenium
With Selenium, this is as simple as:
from selenium import webdriver
browser = webdriver.Chrome(chromedriverLocation)
browser.get("https://codingwithmax.com")
browser.find_element_by_xpath(xPath).click()
browser.quit()
chromedriverLocation here is just the location of the chromedriver executable on your computer, which is what lets you open a Chrome browser.
The xPath is just the location of the button on my website, which you can quickly get by opening the developer console on codingwithmax.com, finding where the button is, and then right-clicking its location in the developer console and copying out the XPath.
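Because dynamically loaded content may not be there the instant the page opens, you’ll often want to wait for it explicitly. Here’s a hedged sketch using Selenium’s built-in waits (chromedriverLocation and xPath are the same placeholders as above):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome(chromedriverLocation)
browser.get("https://codingwithmax.com")

# Wait up to 10 seconds for the element to become clickable before clicking it
button = WebDriverWait(browser, 10).until(
    EC.element_to_be_clickable((By.XPATH, xPath)))
button.click()
browser.quit()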
5. Pandas
What is Pandas?
If you’ve ever had to deal with any type of data, you’ve probably done so by using some sort of table or sheet software, something like Excel, Numbers, or Google Sheets.
And if you’ve ever had to play around with data in one of those programs, you’ve probably quickly noticed some annoying things, like:
- The software sometimes starts to lag, especially when you have more than just a few lines of data in it
- You drag formulas down or copy and paste cells, and then have to re-type them because they didn’t transfer properly
- Or you have to manually reshape tables into the format you need before you can use the data in a formula or for a visualization
Pandas basically solves all of these issues for you, while still keeping that easily manageable table format.
It does this by not constantly displaying all the data, so it doesn’t suffer the same lag, and because you write the formula in code and say exactly what you want to calculate, you don’t have to deal with manual re-shaping or re-typing.
How to Use Pandas
Let’s say you’re interested in cocoa beans, and you have a bunch of data that tells you where the beans are from, what their cocoa percentage is, where they’re processed, and so on and so forth.
You’re interested in knowing what the average cocoa percentage is for each bean type, but you’re not 100% sure what the best way to calculate this is.
With pandas, you can quickly and easily load all that data into a DataFrame that we’ll just refer to as df.
Then it’s just a loop and two short lines of code (which you can actually even combine into one if you like), that looks like this:
import pandas as pd

df = pd.DataFrame(beanData)
for bean in df["beanType"].unique().tolist():
    dfSpecificBean = df.loc[df["beanType"] == bean]
    meanCocoaPercent = dfSpecificBean["cocoaPercentage"].mean()
Not only is this code easy to write and read, but it also runs incredibly fast, much faster actually than if you were to do this all yourself from scratch.
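In fact, pandas has a built-in shortcut for exactly this kind of per-group average: groupby. Here’s a minimal sketch, assuming the same df and column names as above:
# Average cocoa percentage per bean type, computed in a single line
meanCocoaByBean = df.groupby("beanType")["cocoaPercentage"].mean()
print(meanCocoaByBean)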
6. Matplotlib
What is Matplotlib?
As a data scientist, it is vital that you are able to visualize your data. You do this to get a better understanding of the data by visually exploring your dataset and looking for trends and patterns, as well as to easily communicate results to others.
This is where matplotlib comes in. Matplotlib allows you to easily create all sorts of visualizations, from graphs like scatter plots and histograms to complex things like 3D animations of your data.
Because of just how many graphs you can create, how easy it is to create them, and how much you can customize them, matplotlib is the go-to library for data visualization in Python.
How to Use Matplotlib
Let’s say you have some trend data over time, such as the number of users of your product, and you want to neatly visualize it and communicate the effects of key events to others, such as a new product launch and a week when your servers had problems.
And this is the code for it:
import matplotlib.pyplot as plt

# x, y, outageDay, outageDuration and productLaunch are assumed to come from your own data
plt.plot(x, y)
plt.bar(outageDay + int(outageDuration / 2), height=max(y) + 50, width=outageDuration, color="red", alpha=0.3)
plt.bar(productLaunch, height=max(y) + 50, width=3, color="green", alpha=0.3)
plt.annotate("server outage",
             xy=(outageDay, 900),
             xytext=(outageDay - 70, 800), fontsize=12,
             arrowprops=dict(arrowstyle="<-",
                             connectionstyle="arc"),
             )
plt.annotate("New Feature\nLaunch",
             xy=(productLaunch, 900),
             xytext=(productLaunch - 70, 800), fontsize=12,
             arrowprops=dict(arrowstyle="<-",
                             connectionstyle="arc"),
             )
plt.ylim(0, 950)
plt.xlim(0, 365)
plt.xlabel("Days since launch", size=20)
plt.ylabel("Daily users", size=20)
plt.title("Daily users since product launch", size=25)
plt.show()  # display the figure (not needed inside a Jupyter notebook)
Note that we actually only have 3 lines of code where we visualize the data, these are:
plt.plot(x, y)
plt.bar(outageDay + int(outageDuration / 2), height=max(y) + 50, width=outageDuration, color="red", alpha=0.3)
plt.bar(productLaunch, height=max(y) + 50, width=3, color="green", alpha=0.3)
Everything else is extra fancy customizations we’ve added in.
With Matplotlib, it’s so quick to create a visualization (the three lines above), and there are so many options for customization, such as adding the arrows and text, adding axis labels and titles, and setting the x and y range of our graph.
Matplotlib provides a great combination of quick visualization and tons of customization possibilities, and you can keep it as simple, or make it as complex, as you want or need.
7. SQLite
What is SQLite?
A data scientist is either working on data, or working on getting the data they need to start working on it.
If you’re working with a company or an organization, it’s likely that they’ll have their data stored in some sort of database, and it’s likely that it is an SQL-based database.
Big projects usually use large databases like PostgreSQL or Oracle, but fortunately accessing SQL databases works basically the same way everywhere.
SQLite is a library that lets you create and access databases on your disk, which makes it extremely useful for quick prototyping, as well as for other projects like storing and accessing data in applications.
Because all SQL databases are so similar it’s very easy to transfer your prototype code into code for larger databases.
How to Use SQLite
Let’s say we need to get our data for the cocoa example earlier; this is how we can go about it if it’s in an SQL database.
Using SQLite is, again, very straightforward.
All we have to do is connect to our database, which is as simple as opening a file, and then just writing your SQL query.
import sqlite3

conn = sqlite3.connect("chocolate.db")
c = conn.cursor()
c.execute("SELECT beanType, cocoaPercentage FROM cocoaData")
beanData = c.fetchall()
c.close()
conn.close()
With sqlite, we can quickly connect to our database and just access all the tables and data inside using standard SQL.
Another great thing about this is that, now that we have it in Python, we can automatically build SQL queries in our code simply by changing the text we pass to the execute statement, which gives us even more flexibility.
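For example, if you only wanted the rows for one specific bean type, you could pass it in as a query parameter; a small sketch reusing the same database (the bean type "Criollo" is just an illustrative value) could look like this:
import sqlite3

conn = sqlite3.connect("chocolate.db")
c = conn.cursor()
# The ? placeholder is filled in safely from the tuple passed as the second argument
c.execute("SELECT beanType, cocoaPercentage FROM cocoaData WHERE beanType = ?",
          ("Criollo",))
criolloData = c.fetchall()
c.close()
conn.close()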
8 Machine Learning Specialization Libraries for Python
Now that we have seen all the core Data Science libraries you should know, we can build upon the Data Science work to specialize into Machine Learning. This way we can extend and apply our Data Science knowledge to create all sorts of machine learning models.
It’s important to keep in mind that Machine Learning libraries, like Data Science libraries, are made to easily create and test ideas, but they are not a substitute for subject knowledge.
A good data scientist has the subject knowledge and understanding of what they want to do and how they should approach problems using data, and then uses Data Science libraries to quickly implement their ideas and work through their analysis.
Similarly, Machine Learning libraries make your life easy once you’ve chosen to specialize in Machine Learning. However, they are no substitute for a solid Data Science foundation and further specialization into Machine Learning.
If you understand the different techniques you can use, which ones may be applicable to your current situation, how to properly prepare data, and how to go about evaluating and improving models, machine learning libraries let you quickly and easily implement and test your ideas.
1. SciKit Learn
What is SciKit Learn?
Machine learning is becoming a very popular field for research as well as potential business applications. The goal here is that instead of writing complex and long code you can have a machine learn a model that represents the data and the resulting outcome.
Machine learning is a huge field and has many different topics. Scikit learn focuses on what most would call the standard or fundamental machine learning algorithms, which is anything from simple linear regression to complex algorithms like manifold based dimensionality reduction.
Scikit learn can be your go-to for basically everything machine learning related that you’ll need to start out with. Because its algorithms are highly optimized (i.e. very fast) and kept up to date with new research, you’re going to find a lot of power in scikit learn.
Aside from machine learning algorithms, it also provides you with ways to test your model performance, such as through cross validation, and to improve your model performance by selecting optimal hyperparameters through grid search.
[Hyperparameters are free parameters that let you control the algorithm better, such as how fast it should learn, or how many decision trees you want in your random forest.]
It’s usually good practice to start off with simpler models for a problem, only later move to more complex ones, and only stick with those if the switch provides better performance. If everything else is equal (or about equal), always go with the simplest version.
Therefore, unless you’re looking to do very specific things, like build a complex recommendation system or a deep neural network, you can default to scikit learn to start building a machine learning model for whatever problem you’re looking to solve.
How to Use Scikit Learn
Although scikit learn provides many different algorithms, they basically all follow the same structure.
- You import your algorithm
- You define its hyperparameters, or use the default ones
- You fit the model to your training data and assess its performance
- You predict the values of your testing data or new data
Let’s use an example for predicting caffeine percentage of coffee beans from a particular region. Let’s say our dataset has data on things like bean origin, bean type, rainfall in the past month, average temperature, average cloud cover, and so on as well as, of course, the average caffeine percentage of the beans.
In this case, the caffeine percentage is the target value that you want your model to learn to predict, and everything else makes up your features.
Our training and testing data format can be a pandas dataframe, or it can be a 2D nested list or a 2D numpy array, where each row is one instance (e.g. data from field number 1), and each column is a specific feature (e.g. average temperature).
Our target data format can also be an array, list, or dataframe. Scikit learn is very adaptable and makes it very easy to work with data you would have prepared earlier during or after your analysis.
So, for example, your input trainingData could be a pandas dataframe that looks like this
| Rainfall | Average Temperature | Average cloud cover | … |
|----------|---------------------|---------------------|---|
| 1.5      | 22.5                | 0.15                | … |
| …        | …                   | …                   | … |
Or it could be in a nested list / 2D array like this
[[1.5, 22.5, 0.15, …], […], …]
And your targets could look like [0.012, …]
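If you haven’t already split your data into training and testing sets, scikit learn can do that for you too. Here’s a quick hedged sketch, assuming your features and targets live in variables called features and targets (names chosen just for this example):
from sklearn.model_selection import train_test_split

# Hold out 20% of the data for testing; random_state makes the split reproducible
trainingData, testingData, trainingTargets, testingTargets = train_test_split(
    features, targets, test_size=0.2, random_state=42)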
You could train a simple linear regression model on this data like this
from sklearn.linear_model import LinearRegression

lrReg = LinearRegression()
lrReg.fit(trainingData, trainingTargets)
predictedValues = lrReg.predict(testingData)
If you want to assess your model performance, you can do the following:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
import numpy as np

lrReg = LinearRegression()
results = cross_val_score(lrReg, trainingData, trainingTargets,
                          scoring="neg_mean_squared_error", cv=3)
rmseScores = np.sqrt(-1 * results)
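And if you want scikit learn to try out several hyperparameter combinations for you, a minimal grid search sketch (the parameter grid here is just illustrative) could look like this:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Try every combination of these (illustrative) hyperparameter values with 3-fold cross validation
paramGrid = {"max_depth": [3, 5, 7], "n_estimators": [100, 300]}
gridSearch = GridSearchCV(RandomForestRegressor(), paramGrid,
                          scoring="neg_mean_squared_error", cv=3)
gridSearch.fit(trainingData, trainingTargets)
print(gridSearch.best_params_)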
Or, if you want to train a random forest with 300 decision trees, each with a depth of at most 5 and each considering only a random 20% of all features at each split to increase diversity, you could just do
from sklearn.ensemble import RandomForestRegressor
rfReg = RandomForestRegressor(max_depth=5,
max_features=0.2,
n_estimators=300,
)
rfReg.fit(trainingData,trainingTargets)
predictedValues = rfReg.predict(testingData)
You can see just how little we had to do to train a machine learning model here, and just how easy it is to switch out one for another.
Machine learning, of course, is more than just picking a model and fitting it. It’s everything from preparing and pre-processing the data, to thoroughly evaluating and refining one or several models that may be combined together, to continually improving your model after it’s gone live, and everything in between.
Scikit learn provides you with all the tools required to do this.
Doing this well requires a good understanding of the different techniques available and how the different models work, so that you’re not just throwing something into a black box but are making educated decisions to get to the final result.
Scikit learn makes your life extremely easy here because as you’re doing the thinking it provides many handy options to test out your thoughts in only a couple of lines of code, meaning you can focus on what’s actually important, rather than having to write everything from scratch before you can test anything.
In short, scikit learn should be your default library when you want to implement basically anything that’s not a neural network, regardless of whether it’s a supervised, semi-supervised, or unsupervised task.
2. XGBoost
What is XGBoost?
XGBoost, which stands for Extreme Gradient Boosting, is a library that gives you access to highly optimized boosting techniques. Boosting algorithms are algorithms that use several models, where each model is specifically trained to focus on correcting the mistakes that the previous model made.
Although you can use Scikit learn to build boosting ensembles, your performance and speed will probably not be as good as if you use XGBoost since XGBoost was built specifically just for building boosting models, whereas Scikit learn focuses on giving you access to as many tools as possible.
Because boosting generally requires your models to be trained and evaluated in series (each model has to wait for the result of its predecessor), they can’t really be optimized using standard parallelization techniques.
How to Use XGBoost
Fortunately, XGBoost has been built in a way to build models very similarly to the way they are built in Scikit Learn.
Let’s say we want to tackle the same problem as we did in the Scikit learn example for predicting caffeine percentage.
To build a model using XGBoost, we first have to change our data format from the train and target format we use in Scikit learn to the format used for XGBoost.
XGBoost’s format has the target value in the first position, then a space, and then a series of featureIndex:featureValue pairs. For example, re-writing the Scikit learn data format from above into XGBoost format:
0.012 0:1.5 1:22.5 2:0.15 …
…
This means that our target value is 0.012, the value of feature 0 is 1.5, feature 1 is 22.5, feature 2 is 0.15, and so on.
We do the same for our test data, and now that we have this new data format we can quickly build our XGBoost model, using decision trees as the base learners, as follows
import xgboost as xgb

# newTrainData and newTestData are assumed to be xgb.DMatrix objects,
# for example loaded from text files saved in the format shown above
param = {"max_depth": 5, "objective": "reg:squarederror"}
bst = xgb.train(param, newTrainData)
preds = bst.predict(newTestData)
You can see that it has a very similar simplicity to scikit learn, and in just a few lines we’ve created a gradient boosted tree model.
3. TensorFlow
What is TensorFlow?
While Scikit learn focuses on classical machine learning, TensorFlow is designed to let you easily perform numerical computations at scale, which means it can be used in many different areas.
TensorFlow works by defining computation graphs before you run the calculations, which lets it understand which parts are important when, how components can be distributed, and how the whole system can be scaled onto multiple machines using both CPUs and GPUs.
For data science, a great thing is of course its ability to let you freely design your neural networks, as it has many pre-defined layer architectures that you can use, and it allows you to create and train them (or even do distributed training) with relative ease.
Of course, if you want to build your own algorithms that are not standard parts of machine and deep learning, TensorFlow is also a great place to create these because you still get all the power, performance, and scalability features.
TensorFlow is built and maintained by Google, so you can be sure that its design is built with both scalability and performance in mind.
The goal of TensorFlow is to allow you to easily create shallow and deep neural networks, and you do this through building up your neural network one layer at a time, easily and neatly stacking one on-top of the other.
You can add in a fully connected layer, or even directly use more complex ones like convolution or recurrent layers.
The goal of TensorFlow is to make building, training, and deploying (deep) neural networks extremely easy, so that you don’t have to worry about scalability or algorithm performance, but can focus on building great models that you can scale at will.
How to Use TensorFlow
Although TensorFlow makes creating neural networks substantially easier, there’s still a lot that goes into it.
Even creating a very “simple” 3 layered neural network to, once again, predict caffeine percentage as we did above, would require much more setup than above.
This is because the lower level API requires you to define the parameter shapes, the steps to take, the cost function to use, the optimizer, as well as the running of the calculation and the evaluation of the model.
Fortunately, TensorFlow also has a higher level API that runs on top of it, called Keras, which allows you to achieve the same result with much more ease.
4. Keras
What is Keras?
Keras, as mentioned above, is a high level API that you can run on top of TensorFlow (as well as Theano and CNTK). This makes it even easier to create (deep) neural networks, since you get the power from TensorFlow, but you have even more readability and need to set up even less.
Of course, you can still dip back down into TensorFlow’s lower level API to create basically any type of neural network architecture you like, but for most people and most applications the flexibility provided by Keras is more than enough.
How to Use Keras
Let’s say we want to, once again, predict caffeine percentage as above. Let’s say we’ve got 10 features (i.e. avg. temperature, avg. cloud cover, + 8 more), and we want to build a 3 layered neural network.
The first layer, with a size equal to 10, will be our input layer; this is where the data comes in. The second layer, our hidden layer, can be any size we want, so let’s give it a size of 50 for now. The third layer will be our output layer, which will have a size of 1, since we want to predict one final value (the caffeine percentage).
On our hidden layer, we’ll apply L2 regularization with an alpha of 0.001 and use a ReLU activation function, and we’ll optimize with the Adam optimizer for 100 epochs using the mean squared error loss function.
import keras
from keras.models import Model

beanInput = keras.layers.Input(shape=[10], name="inputData")
dense1 = keras.layers.Dense(units=50,
                            activation=keras.activations.relu,
                            kernel_regularizer=keras.regularizers.l2(0.001))(beanInput)
output = keras.layers.Dense(units=1, activation=None)(dense1)

model = Model(inputs=beanInput, outputs=output)
model.compile("adam", "mean_squared_error")
model.fit(trainData, trainTargets, epochs=100)
preds = model.predict(testData)
Not bad for building a neural network. It may look a bit intimidating at first but after a second or third read it will make more sense and seem logical and connected.
The nice thing is, if we want to add a second layer with, say, 25 units, all we need to do is add in the other layer, and update the input of our output layer:
import keras
from keras.models import Model

beanInput = keras.layers.Input(shape=[10], name="inputData")
dense1 = keras.layers.Dense(units=50,
                            activation=keras.activations.relu,
                            kernel_regularizer=keras.regularizers.l2(0.001))(beanInput)
dense2 = keras.layers.Dense(units=25,
                            activation=keras.activations.relu,
                            kernel_regularizer=keras.regularizers.l2(0.001))(dense1)  # the new layer takes dense1 as its input
output = keras.layers.Dense(units=1, activation=None)(dense2)

model = Model(inputs=beanInput, outputs=output)
model.compile("adam", "mean_squared_error")
model.fit(trainData, trainTargets, epochs=100)
preds = model.predict(testData)
As you can see from above, the nice thing is we can also define the regularization for each layer, as well as the activation function we use for each layer. Actually, we can define many more things for each layer, like initialization, bias, and constraints, so we truly have a lot of freedom at each step.
5. PyTorch
What is PyTorch?
PyTorch is a library designed specifically to help you easily create, manage, and scale deep neural networks, or applications using those (like computer vision or natural language processing).
Just like TensorFlow, it can also be used for other numerical calculations, but we won’t be looking at that.
It also works using a graph system and can also make use of both CPUs and GPUs when scaling.
It’s mainly maintained by Facebook, and it’s designed specifically with Python in mind. A lot of the complexity is hidden, so that implementations with it feel more like scikit learn implementations than the lower level TensorFlow implementations.
How to Use PyTorch
To create the same model as we did above in our Keras example, we would do:
import torch
import torch.nn as nn

nIn = 10
nHidden = 50
nOut = 1
nEpochs = 100

nNModel = nn.Sequential(nn.Linear(nIn, nHidden),
                        nn.ReLU(),
                        nn.Linear(nHidden, nOut))
lossFn = nn.MSELoss()
optimizer = torch.optim.Adam(nNModel.parameters(), lr=0.001)

for epoch in range(nEpochs):
    trainPred = nNModel(trainData)          # forward pass on the training data
    loss = lossFn(trainPred, trainTargets)  # mean squared error loss
    optimizer.zero_grad()                   # reset gradients from the previous step
    loss.backward()                         # backpropagate
    optimizer.step()                        # update the weights

preds = nNModel(testData)
Generally, the implementation looks pretty similar to Keras; however, there are some notable differences.
First, when we define our model, we define it all as one sequence, and the activation functions are applied after each layer rather than being part of the layer.
Second, we write out the epoch training ourselves, whereas in Keras we just defined the number of epochs and told it to run. Nevertheless, even that process is pretty straightforward and still very readable, even if you’re not familiar with the library.
At this point you’re probably asking: do I use TensorFlow’s low level API, which is more intricate but gives me a lot of flexibility, do I use the high level API that Keras provides on top of TensorFlow, or do I use PyTorch?
Well, that answer is mostly up to you and what you prefer or what your colleagues or the companies you’re looking at use.
I would generally recommend using Keras over TensorFlow because it’s simpler to code up and you probably won’t use a lot of the freedom TensorFlow provides. That being said, if you do feel like you will be going deep into that research then go with TensorFlow.
In case you like the look of PyTorch, your colleagues are using it, or you have some experience with it, then go with, or stick with, PyTorch.
You don’t need to worry much about "which one is better"; they’re all amazing, so it’s mainly just down to personal preference.
6. PySpark MLLib
What is PySpark MLLib?
With Scikit learn above, we saw a very large number of tools for doing all sorts of machine learning, and we’ve also seen that libraries like TensorFlow, Keras, and PyTorch allow us to do numerical computations and deep neural networks at scale.
There’s one area we’ve left a bit unexplored until now: what happens if we want to use more standard machine learning techniques, but our datasets are so large that they don’t fit into memory?
Although Scikit learn does offer some settings that let you fit on partial datasets over several iterations, Spark MLLib is specifically designed to do machine learning on very large datasets.
Spark MLLib is part of Apache Spark, which you can use in Python using the PySpark library.
Spark’s standard datatypes are the RDD (Resilient Distributed Dataset) and the Spark DataFrame and the main goal of these two datatypes is to enable computations to be distributed over multiple machines (hence the distributed) and then rejoined at the end.
When distributing work, Spark also keeps track of any jobs that fail and re-assigns them to be re-calculated. That way you don’t have to re-calculate everything from scratch in case a machine goes offline, a job fails, or a small part needs to be redone (this is the resilient part).
PySpark’s design allows both streaming and batch calculations, which makes it great for Big Data use cases where data is constantly streaming in, or where you have large data stores that you want to analyze.
How to Use PySpark MLLib
Let’s go back to our caffeine prediction example from Scikit learn, where we have a set of features that we want to use to predict the caffeine percentage of a coffee bean.
The first thing we need to do is convert the data into a format that PySpark MLlib works with, called the LabeledPoint.
A LabeledPoint has the format LabeledPoint(y,x), where y is your target value or label, and x are your features, which you supply in formats like a NumPy array, a Python list, or a sparse vector.
To convert the full data then, we can store it as a list of LabeledPoints, like this
from pyspark.mllib.regression import LabeledPoint
trainData = [LabeledPoint(y_1,x_1),
LabeledPoint(y_2,x_2),
…
LabeledPoint(y_m,x_m)]
…where each LabeledPoint corresponds to one instance used for training.
There’s one more transformation we have to do, which is taking this list of LabeledPoints and turning it into Spark’s RDD data format. To do this we just do the following.
from pyspark import SparkContext

sc = SparkContext("local", "Testing Linear Regression")
rddData = sc.parallelize(trainData)
Now let’s build a simple linear regression model to train on this data
from pyspark.mllib.regression import LinearRegressionWithSGD

linReg = LinearRegressionWithSGD.train(rddData)
And that’s it, now we’ve trained our Linear Regression.
To make predictions, we don’t have to turn new data into LabeledPoints; we can feed in one row of features directly, like so:
prediction = linReg.predict([feature_1,feature_2,…,feature_n])
Or, if we want to predict for multiple instances, we first have to convert them into an RDD again, which we can either do beforehand like this
testData = sc.parallelize([listOfTestFeatures_1,listOfTestFeatures_2,…])
preds = linReg.predict(testData)
…where each listOfTestFeatures is a list of the format [feature_1, feature_2, …, feature_n]
Or we can also just do it directly in the prediction like this
preds = linReg.predict(sc.parallelize([listOfTestFeatures_1, listOfTestFeatures_2, …]))
To get our predictions out, we have to call .collect() on them, like so
print(preds.collect())
As you can see, machine learning with PySpark MLLib is also made very easy for us; the important thing is just to transform the data correctly into PySpark’s RDD format.
Of course, for PySpark as for all the other Machine Learning libraries, a lot of Machine Learning work lies in the different ways you can improve model performance and optimize the speed of your model. All the libraries we’ve looked at make these tasks much easier for us, so that we can spend our time focusing on finding what works, and not spend the majority of our time coding up algorithms from scratch.
7. NLTK
What is NLTK?
NLTK is a little bit different from the rest of the libraries we’ve looked at, in the sense that all of the machine learning libraries above provide general tools to let us tackle any type of project we want.
NLTK, which stands for Natural Language Toolkit, focuses specifically on Natural Language Processing.
Although you may be able to build up your own Natural Language processing architecture using the tools provided, for example, by TensorFlow, things are a lot easier if you can instead use pre-made architectures.
Furthermore, NLTK provides a large set of tools to process and prepare text data, which can be extremely helpful since text data is so different from numerical data. For example, you may know that the words "In addition" and "Furthermore" have very similar meanings, but that’s not at all obvious to a computer.
Not only that, but numerical data usually has an obvious order and structure to it; for example, 1 and 2 are close to each other, and 2 is larger than 1. These relationships are much more complex for language.
Creating good natural language software is not only about creating good model architectures but is heavily dependent, just as the rest of machine learning, on great data preparation. NLTK makes your life easier in both data preparation as well as model creation.
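To give you a flavour of what that preparation can look like, here’s a minimal, hedged sketch of tokenizing a sentence and removing English stopwords with NLTK (the sentence itself is just made up for illustration):
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time downloads of the tokenizer model and the stopword list
nltk.download("punkt")
nltk.download("stopwords")

sentence = "Furthermore, the beans were surprisingly good!"
tokens = word_tokenize(sentence.lower())   # split into individual word tokens
stopWords = set(stopwords.words("english"))
keptTokens = [t for t in tokens if t.isalpha() and t not in stopWords]
print(keptTokens)   # the content words left after cleaning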
How to Use NLTK
Let’s take a look at sentiment analysis, and how we can go about tackling that problem.
We’re not going to focus on all the extensive pre-processing you can do to better break down your sentences and improve model performance, such as dealing with slang, negations, conjunctions, etc. but will focus on showing how easy it can be to get started with natural language processing using NLTK.
Again, like with everything else we’ve looked at, to create great NLTK models it’s important to have solid domain knowledge and an understanding of topics like language structure, how it can best be broken down to provide useful information to your algorithm, and the host of other problems you may encounter when dealing with grammar and text.
We can use NLTK’s VADER model, based on the paper by Hutto and Gilbert, to get a ready-made model for sentiment analysis. (Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.)
Performing sentiment analysis then becomes as easy as:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")   # one-time download of the VADER lexicon

sentence = "I love codingwithmax!"
sid = SentimentIntensityAnalyzer()
print(sid.polarity_scores(sentence))
… which then tells us that the sentence is very positive.
Using one of the pre-trained models from NLTK has saved us a ton of time on everything from looking for and labeling data to building and optimizing a model, and it provides us with a great and easy way to quickly analyze sentence sentiment.
8. Eli5
What is ELI5?
ELI5 is a library that helps you understand how the machine learning classifiers you built make predictions. From the libraries we’ve looked at above, it is compatible with models from Scikit learn, XGBoost, and Keras.
You can use it to explain individual predictions by looking at how the features were used to predict the target, to inspect the weights your model has learned, and to understand feature importance and how much it changes as the data is shuffled around.
It’s usually best to use ELI5 in a Jupyter notebook, because ELI5 explains parts of the model with HTML output that notebooks can render.
How to Use ELI5
Since ELI5’s explanations are easiest to show with a classifier, and our caffeine percentage example has so far been a regression problem of predicting an exact number, let’s reword our example.
Our model is now interested in predicting whether a bean will have a caffeine percentage of at least 1.5%.
To adapt our data to this all we need to do is change our targets, which we can do as follows (assuming our trainTargets are in a NumPy array)
newTrainTargets = trainTargets>=0.015
To understand variations in feature importance, ELI5 requires us to use a validation set to experiment with. We’ll just do a single validation split, assuming our data is nicely distributed and there’s no need for stratified sampling, and set aside the last 20% of our training data to use for validation.
validationData = trainData[int(0.8*len(trainData)):]
validationTargets = newTrainTargets[int(0.8*len(trainData)):]
newTrainTargets = newTrainTargets[:int(0.8*len(trainData))]
trainData = trainData[:int(0.8*len(trainData))]
Remember that our trainTargets and trainData are the same length, so we can use the same cutoff index for both.
Now we can train a classifier using this, let’s pick a simple logistic regressor using Scikit learn.
from sklearn.linear_model import LogisticRegression
logReg = LogisticRegression()
logReg.fit(trainData,newTrainTargets)
And now we can use ELI5 to explain feature importance for the logReg model we just trained. The variable featureNames holds the names of your features, which can either be stored in a separate list/array or, if you’re using pandas dataframes, will just be the column names (extracted like this: featureNames = trainDataDF.columns.tolist()).
import eli5
from eli5.sklearn import PermutationImportance

importance = PermutationImportance(logReg).fit(validationData, validationTargets)
eli5.show_weights(importance, feature_names=featureNames)
And there we have it, a table showing our feature names and their importance as well as how much they can fluctuate.
Quick Summary
So let’s quickly recap: programming libraries are sections of code made available to the community by other developers that let us integrate complex features into our programs, so that we don’t have to write them ourselves.
Programming libraries are the reason we don’t need to reinvent the wheel every time we write a program, and that is why they are great tools for efficiency and quality.
It’s also important to mention once again that programming libraries don’t replace the need for subject matter and domain knowledge. They provide us with a set of tools that make our lives extremely easy, and let us quickly implement and test our ideas with very little hassle.
And that’s it!