Data scientists often have to communicate results to other people. In my case, my supervisors might want to see some...
import pandas as pd
import seaborn as sns
%matplotlib inline
When you have a single continuous variable and want to visualise the distribution of its values in your dataset, a histogram is generally what you need. This groups the values into bins, where each bin is an interval within the range of values your variable can take. The x axis will show the interval of each bin, while the y axis shows the number of values in your dataset that fall within that interval.
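To make the binning concrete, here is a minimal sketch using numpy's histogram function on a handful of made-up numbers (the values array is hypothetical and not part of any dataset used here). It prints the interval of each bin and how many values fall inside it, which is exactly what the x and y axes of a histogram show.

```python
import numpy as np

# Hypothetical values, made up purely for illustration.
values = np.array([1, 2, 2, 3, 5, 6, 6, 7, 9, 10])

# Split the range of the data (1 to 10) into 3 equal-width bins
# and count how many values fall into each one.
counts, edges = np.histogram(values, bins=3)

# Each bin includes its left edge; only the last bin also includes its right edge.
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {n} values")
```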
Let's load in some data using seaborn's handy load_dataset() function. The flights dataset has three variables: two ordered categorical (year, month) and one continuous (number of passengers).
Input:
flights = sns.load_dataset('flights')
flights.head(3)
Output:
A simple histogram will show the overall distribution of the passenger variable. This is easy to plot, as pandas dataframes have a built-in method for generating it.
Input:
flights.passengers.hist()
Output:
By default, pandas plots histograms using 10 bins but you could fine-tune this. Displaying more bins gives a more detailed overview of the distribution, up to a point: it all depends on how many observations you have overall and how they are distributed. You can see how using 20 bins shows more information about the distributions inside the larger 5 bins.
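As a rough sketch of why the finer bins add detail without changing the overall picture, the snippet below bins the same made-up values twice (the random values are hypothetical, not the flights data). Because 20 is a multiple of 5, each wide bin is split into four narrow ones, so adding up the narrow counts in groups of four gives back the wide counts.

```python
import numpy as np

# Hypothetical data: 144 random values, generated only for this illustration.
rng = np.random.default_rng(0)
values = rng.normal(loc=280, scale=120, size=144)

# Bin the same values coarsely and finely over the same overall range.
coarse, coarse_edges = np.histogram(values, bins=5)
fine, fine_edges = np.histogram(values, bins=20)

# Each coarse bin spans four consecutive fine bins, so summing the
# fine counts in groups of four recovers the coarse counts.
regrouped = fine.reshape(5, 4).sum(axis=1)

print("5 bins:           ", coarse)
print("20 bins:          ", fine)
print("20 bins regrouped:", regrouped)
```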
Input:
flights.passengers.hist(bins=5) # The blue bars
flights.passengers.hist(bins=20) # The orange bars
Output:
So the range of passenger numbers is a little over 100 to a bit over 600, with most months towards the lower end. For a more precise overview, the describe method of a dataframe column will give general descriptive statistics.
Input:
flights.passengers.describe()
Output:
count 144.000000
mean 280.298611
std 119.966317
min 104.000000
25% 180.000000
50% 265.500000
75% 360.500000
max 622.000000
Name: passengers, dtype: float64
For a visual representation of describe, a boxplot will show the minimum and maximum values (the left and right whiskers, provided there are no outliers), the range of values covered by the 25th to 75th percentiles (the box) and the value of the median (the line inside the box).
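One detail worth checking is whether the whiskers really do reach the minimum and maximum. By default seaborn draws each whisker out to the furthest data point within 1.5 times the interquartile range of the box, and plots anything beyond that as a separate point. Here is a small sketch of that check, using the numbers copied from the describe output above.

```python
# Numbers copied from the describe() output for the passengers column.
q1, q3 = 180.0, 360.5
data_min, data_max = 104.0, 622.0

# The interquartile range is the width of the box.
iqr = q3 - q1

# Points beyond these limits would be drawn as outliers, not whiskers.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# The whiskers reach the min and max only if both lie inside the limits.
whiskers_reach_extremes = lower_limit <= data_min and data_max <= upper_limit

print(f"IQR: {iqr}, limits: {lower_limit} to {upper_limit}")
print("Whiskers reach min and max:", whiskers_reach_extremes)
```

Both extremes fall inside the limits for this column, so the whiskers do mark the minimum and maximum here.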
Input:
sns.boxplot(x=flights.passengers)
Output:
When you have a variable which takes on named, rather than numerical, values then the most common way of representing them is with a bar chart.
Here, we'll load the titanic dataset. Each row is a passenger on the ship, while the class variable gives the class of that passenger's ticket.
Input:
titanic = sns.load_dataset('titanic')
titanic['class'].value_counts()
Output:
Third 491
First 216
Second 184
Name: class, dtype: int64
You can chain .plot(kind='bar') to the above value_counts() method, but I prefer to use seaborn as you can directly pass it the original data. It will then do the counting for you and allow you more control over appearance. For example, if you do not like the ordering seaborn used for the x axis, then you can set it manually as a list e.g. order=['Third', 'Second', 'First'].
Input:
sns.countplot(data=titanic, x='class')
Output:
If you want to normalise the counts so as to see relative proportions rather than raw counts, then you just need to do that to the data before plotting it as a normal barplot.
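Normalising just means dividing each count by the total. A minimal sketch, using the three class counts copied from the value_counts() output above (the variable names are my own):

```python
import pandas as pd

# Counts copied from the value_counts() output for the class variable.
counts = pd.Series({"Third": 491, "First": 216, "Second": 184})

# Divide by the total so the values sum to 1.
proportions = counts / counts.sum()

# Multiply by 100 if you would rather read them as percentages.
percentages = (proportions * 100).round(1)
print(percentages)
```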
Input:
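# Name the columns explicitly, as the default names differ between pandas versions.
titanic_normed = titanic['class'].value_counts(normalize=True).rename_axis('class').reset_index(name='proportion')
sns.barplot(data=titanic_normed, x='class', y='proportion')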
Output:
Above, we only had a single variable. We examined it by looking at the frequency of values (the histogram) or by plotting descriptive statistics (the boxplot). But often we want to see how one variable is linked to another - as the value of one variable changes, what happens to the value of the other variable?
With continuous and ordered/unordered categorical variables, we have four possible combinations. Let's look at them in turn.
The mpg dataset contains information about cars, measuring their weight, fuel efficiency and so on. We might expect heavier cars to have lower fuel efficiency.
When plotting continuous variables, the one you place on the x-axis should be the independent variable. This is generally some property or value we observe. The y-axis should display the dependent variable. This is a function of the values on the x-axis and is generally something we measure for each observed value on the x-axis. Here, we will place weight on the x-axis and miles per gallon on the y-axis.
Generally, the best choice of visualisation for this is a scatterplot. Each point represents the relation between a single value on the x-axis and its corresponding y value.
Input:
mpg = sns.load_dataset('mpg')
g = sns.scatterplot(data=mpg, x='weight', y='mpg')
Output:
There are several variations on this, which are made available through seaborn's jointplot. The default will add histograms on the margins, for each of the two variables.
Input:
mpg.head(3)
Output:
Input:
sns.jointplot(data=mpg, x='weight', y='mpg')
Output:
By setting the kind argument to kde, you can instead plot a joint kernel density estimate, with individual density estimates on the margins.
Input:
sns.jointplot(data=mpg, x='weight', y='mpg', kind='kde')
Output:
Or you can set it to hex and plot the values as hexagons, which represent histogram-type bins. This can be very useful if you have a lot of observations in your dataset and plotting all those points is slow or messy.
Input:
sns.jointplot(data=mpg, x='weight', y='mpg', kind='hex')
Output:
There are a few more options when it comes to jointly plotting continuous and categorical data. In general, the categorical data will go on the x-axis and you may need to change the order in which they are displayed.
Let's look at the relationship between fuel efficiency (continuous) and a car's country of origin (unordered categorical). Seaborn's stripplot will make a separate scatterplot for each category and place it on the x axis, with its own colour. It will also stagger the points a little to help see their distribution - this can be controlled with the jitter argument.
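The idea behind jitter can be sketched in a few lines: each category sits at an integer position on the x axis, and every point is nudged sideways by a small random amount. This is only a rough illustration of the idea with made-up category codes, not seaborn's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: each point belongs to category 0, 1 or 2.
category_codes = np.array([0, 0, 0, 1, 1, 2, 2, 2])

# Nudge every point left or right by at most `jitter`.
jitter = 0.3
x_positions = category_codes + rng.uniform(-jitter, jitter, size=category_codes.size)

# Points stay close to their own category, but no longer sit on one vertical line.
print(np.round(x_positions, 2))
```

Keeping the jitter below 0.5 means that points from neighbouring categories cannot overlap horizontally.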
Input:
sns.stripplot(data=mpg, x='origin', y='mpg', jitter=0.3)
Output:
The swarmplot does the same but arranges the points so that there is no overlapping.
Input:
sns.swarmplot(data=mpg, x='origin', y='mpg')
Output:
And if you want a boxplot for each category, there is no need to do them separately and manually place them in a figure - catplot is a great way to plot categorical x continuous data.
Input:
sns.catplot(data=mpg, x='origin', y='mpg', kind='box')
Output:
Sometimes, the categorical data will have a natural order to it. The most common of these is times or dates. This can sensibly be plotted as a line, to show how the continuous variable changes over time. Generally, the categorical data must be unique - no value should appear more than once.
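If a timepoint does appear more than once, one option is to collapse the repeats into a single value before drawing the line. Here is a small sketch with made-up readings (the readings dataframe and its column names are hypothetical), taking the mean at each timepoint. Seaborn's lineplot does something similar by default when it meets repeated x values, plotting the mean along with a confidence band.

```python
import pandas as pd

# Hypothetical readings: two measurements at each timepoint.
readings = pd.DataFrame({
    "timepoint": [0, 0, 1, 1, 2, 2],
    "signal": [1.0, 3.0, 2.0, 4.0, 6.0, 8.0],
})

# Collapse the repeats so that every timepoint appears exactly once.
mean_per_timepoint = readings.groupby("timepoint")["signal"].mean()
print(mean_per_timepoint)
```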
The gammas dataset contains fMRI measurements taken from multiple subjects. Let's look at subject 0, and see how a signal which is dependent on blood oxygen levels (BOLD signal) changed over time in various regions of interest (ROI) in the brain.
Seaborn's lineplot function has a hue argument that will separate out the three different values for ROI and plot them as their own lines.
Input:
gammas = sns.load_dataset('gammas')
subject_0_data = gammas[(gammas.subject == 0)]
sns.lineplot(data=subject_0_data, x='timepoint', y='BOLD signal', hue='ROI')
Output:
We could also focus on a particular ROI and then see how all subjects compare by setting hue="subject".
Input:
sns.lineplot(data=gammas[gammas.ROI == 'IPS'], x='timepoint', y='BOLD signal', hue='subject', legend=False)
# Remove the legend as it gets in the way with the default plot size.
Output:
The most common non-graphical way of representing two joint categorical variables is as a contingency table. Each row of the table represents a possible value of one variable and each column a possible value of the other. Each cell holds the number of observations with that pair of values.
We can create that table using pandas' crosstab function - just tell it which columns of a dataframe to use.
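To see what the table looks like on something small enough to check by eye, here is a sketch with six made-up passengers (the people dataframe is hypothetical, not the titanic data).

```python
import pandas as pd

# Six hypothetical passengers, made up for illustration.
people = pd.DataFrame({
    "sex": ["male", "female", "male", "female", "male", "male"],
    "ticket": ["First", "First", "Third", "Third", "Third", "Second"],
})

# Rows are the values of sex, columns are the values of ticket,
# and each cell counts the passengers with that combination.
table = pd.crosstab(people["sex"], people["ticket"])
print(table)
```

Combinations that never occur, such as a female passenger with a Second class ticket in this toy data, get a count of zero.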
Input:
titanic = sns.load_dataset('titanic')
titanic.head(3)
sex_class = pd.crosstab(titanic.sex, titanic['class'])
sex_class
Output:
We can also normalise the values to show percentages, rather than counts.
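It is worth being clear about what the percentages are percentages of. With normalize=True, every cell is divided by the grand total, so the whole table sums to 100. Passing normalize='index' instead divides each row by its own total. A small sketch on made-up passengers (the people dataframe is hypothetical):

```python
import pandas as pd

# Six hypothetical passengers, made up for illustration.
people = pd.DataFrame({
    "sex": ["male", "female", "male", "female", "male", "male"],
    "ticket": ["First", "First", "Third", "Third", "Third", "Second"],
})

# Each cell as a percentage of all passengers: the whole table sums to 100.
share_of_all = pd.crosstab(people["sex"], people["ticket"], normalize=True) * 100

# Each cell as a percentage of its own row: every row sums to 100.
share_within_sex = pd.crosstab(people["sex"], people["ticket"], normalize="index") * 100

print(share_of_all.round(1))
print(share_within_sex.round(1))
```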
Input:
sex_class_normed = pd.crosstab(titanic.sex, titanic['class'], normalize=True) * 100
sex_class_normed
Output:
This tabular data is easy to represent visually as a heatmap. This essentially colours in the cells of the table, based on their value. It can be a great way to very quickly communicate the joint distribution of two categorical variables, especially where you want to highlight the fact that some particular combinations are very high or low.
Input:
sns.heatmap(sex_class, cmap='Blues', square=True, annot=True, fmt='g')
Output:
Input:
sns.heatmap(sex_class_normed, cmap='Blues', square=True, annot=True, fmt='.2f', cbar=False)
Output:
Here are the questions to ask before you start plotting:
What is the purpose of my visualisation?
What kind of variables do I have? For each variable:
Besides these variables, is there some other informative distinction I want to show? Do my variables come from...
Have I included all the necessary information?
And a quick list, linking types of data to types of visualisation:
continuous x continuous
continuous x unordered categorical
Look into seaborn's documentation for figure aesthetics and choosing colour palettes - these can make your visualisations look really great. The ones I did here use the default settings and could definitely be improved upon!
Think about how the plots could be improved in terms of the questions under "Have I included all the necessary information?". Seaborn makes it very easy to add titles and so on to figures.
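As one possible starting point, here is a minimal sketch of adding a title and axis labels. Seaborn's axes-level functions return a matplotlib Axes object, and its set method accepts these as keyword arguments. The cars dataframe, its values and the label text are all made up for illustration.

```python
import pandas as pd
import seaborn as sns

# Hypothetical data, standing in for a real dataset.
cars = pd.DataFrame({
    "weight": [2000, 2500, 3000, 3500, 4000],
    "mpg": [35, 30, 24, 19, 15],
})

# scatterplot returns the matplotlib Axes it drew on.
ax = sns.scatterplot(data=cars, x="weight", y="mpg")

# Add a title and more readable axis labels in one call.
ax.set(
    title="Heavier cars use more fuel",
    xlabel="Weight (lb)",
    ylabel="Fuel efficiency (miles per gallon)",
)
```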
Seaborn also makes it easy to visualise many aspects of the data at once, rather than individually as we did here. Read the documentation for jointplot and catplot to see how flexible and easy to use these methods are!
Try applying the above to real data that you have, rather than the toy datasets used here.
Alexander Robertson is a Data Science PhD student at the University of Edinburgh, where his research focuses on variation, usage and change in natural language and also emoji.