Data visualization with seaborn: Let's drink whisky

Fri 22 June 2018

Let's start with data visualization with seaborn¶

Today we are going to learn about plotting with seaborn and styling these plots. We will make univariate, bivariate and multivariate plots. For a change we will use the Scotch reviews dataset from Kaggle, because whisky is as tasty as ramen.

What is seaborn?¶

Seaborn is a library built specifically for data visualization in Python. Like the plotting functions of pandas, it is built on top of matplotlib. It has a lot of nice features for easy visualization and styling.

So let's explore Seaborn and Whisky ratings¶

In [1]:

import numpy as np
import pandas as pd
import seaborn as sns

As usual we just load the CSV and look at the head, tail, types and summary.

In [2]:

df = pd.read_csv("scotch_review.csv")
df = df.dropna()
df.head()

Out[2]:

	Unnamed: 0	name	category	review.point	price	currency	description
0	1	Johnnie Walker Blue Label, 40%	Blended Scotch Whisky	97	225	$	Magnificently powerful and intense. Caramels, ...
1	2	Black Bowmore, 1964 vintage, 42 year old, 40.5%	Single Malt Scotch	97	4500.00	$	What impresses me most is how this whisky evol...
2	3	Bowmore 46 year old (distilled 1964), 42.9%	Single Malt Scotch	97	13500.00	$	There have been some legendary Bowmores from t...
3	4	Compass Box The General, 53.4%	Blended Malt Scotch Whisky	96	325	$	With a name inspired by a 1926 Buster Keaton m...
4	5	Chivas Regal Ultis, 40%	Blended Malt Scotch Whisky	96	160	$	Captivating, enticing, and wonderfully charmin...

In [3]:

df.tail()

Out[3]:

	Unnamed: 0	name	category	review.point	price	currency	description
2242	2243	Duncan Taylor (distilled at Cameronbridge), Ca...	Grain Scotch Whisky	72	125.00	$	Its best attributes are vanilla, toasted cocon...
2243	2244	Distillery Select 'Craiglodge' (distilled at L...	Single Malt Scotch	71	60.00	$	Aged in a sherry cask, which adds sweet notes ...
2244	2245	Edradour Barolo Finish, 11 year old, 57.1%	Single Malt Scotch	70	80.00	$	Earthy, fleshy notes with brooding grape notes...
2245	2246	Highland Park, Cask #7380, 1981 vintage, 25 ye...	Single Malt Scotch	70	225.00	$	The sherry is very dominant and cloying, which...
2246	2247	Distillery Select 'Inchmoan' (distilled at Loc...	Single Malt Scotch	63	60.00	$	Fiery peat kiln smoke, tar, and ripe barley on...

In [4]:

df.dtypes

Out[4]:

Unnamed: 0       int64
name            object
category        object
review.point     int64
price           object
currency        object
description     object
dtype: object

In [5]:

df['price'] = pd.to_numeric(df['price'], errors='coerce')
df.describe()

Out[5]:

	Unnamed: 0	review.point	price
count	2247.000000	2247.000000	2228.000000
mean	1124.000000	86.700045	479.311490
std	648.797349	4.054055	3719.372546
min	1.000000	63.000000	12.000000
25%	562.500000	84.000000	70.000000
50%	1124.000000	87.000000	109.000000
75%	1685.500000	90.000000	200.000000
max	2247.000000	97.000000	157000.000000
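Note that the price count in the summary (2228) is lower than the row count (2247). With errors='coerce', pd.to_numeric turns every value it cannot parse into NaN, and describe() skips those. A minimal sketch of that behaviour, using made-up price strings rather than actual rows from the dataset:

```python
import pandas as pd

# Made-up price strings for illustration only; the last one is not a plain number.
raw = pd.Series(["225", "4500.00", "1,500"], name="price")

# errors="coerce" replaces unparseable entries with NaN instead of raising an error.
cleaned = pd.to_numeric(raw, errors="coerce")

print(cleaned)
print("unparseable:", cleaned.isna().sum())
```

Counting the NaN values after the conversion is a quick way to see how many prices were lost.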

What have we learned¶

We have 2247 whisky reviews, each with a price, a rating and a category. That allows some interesting visualizations, for example of category against price. From the describe() output we can also already see that the range of prices is very large (12 to 157,000) and that outliers skew the mean: the mean price is about 479, while the median is only 109.
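To see how a single extreme value can pull the mean away from the median, here is a small sketch. The prices are made up for illustration (only the idea of one very expensive bottle is borrowed from the summary above):

```python
import pandas as pd

# Made-up prices for illustration only; not rows from scotch_review.csv.
prices = pd.Series([60, 70, 109, 200, 157000], name="price")

summary = {
    "mean": prices.mean(),      # pulled up by the single extreme value
    "median": prices.median(),  # robust against the outlier
    "skew": prices.skew(),      # positive for a long right tail
}
print(summary)
```

A mean far above the median together with a positive skew is the typical fingerprint of a few very expensive outliers.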

In [6]:

ax = sns.countplot(x="category", data=df)  # keyword arguments also work in newer seaborn versions

That's not a nice graphic¶

We can rename these long whisky category names to something shorter :). To do this, we call unique() on the column to get its unique names and then give replace() a dictionary that maps each old name to its new one.

In [7]:

df.category.unique()

Out[7]:

array(['Blended Scotch Whisky', 'Single Malt Scotch',
       'Blended Malt Scotch Whisky', 'Grain Scotch Whisky',
       'Single Grain Whisky'], dtype=object)

In [8]:

df.category = df.category.replace({'Blended Scotch Whisky': 'Blended', 'Single Malt Scotch':'Single Malt', 
                     'Blended Malt Scotch Whisky': 'Malt', 'Grain Scotch Whisky':'Grain Scotch',
                     'Single Grain Whisky': 'Single Grain' })
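One thing to keep in mind: replace() silently leaves every value alone that is not a key in the dictionary. A small sketch of how one might check the mapping for gaps before applying it. The toy column and the short_names dictionary are illustrative assumptions, not the real data:

```python
import pandas as pd

# Toy column for illustration only.
toy = pd.Series(
    ["Single Malt Scotch", "Blended Scotch Whisky", "Single Malt Scotch"],
    name="category",
)

short_names = {
    "Single Malt Scotch": "Single Malt",
    "Blended Scotch Whisky": "Blended",
}

# Names that occur in the column but have no entry in the mapping.
missing = set(toy.unique()) - set(short_names)

renamed = toy.replace(short_names)
print("unmapped names:", missing)
print(renamed.tolist())
```

If missing is not empty, those categories keep their long names in the plot.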

Better? Better!¶

The graph should look nicer now. Additionally, as a preview of the blog post "styling your plots", we choose the plain "white" style for this plot, optionally change the font scale (commented out below) and despine it. Despining means removing the axis spines, the border lines around the plot, which often makes for a cleaner look.

In [9]:

sns.set_style("white")
#sns.set(font_scale=1.5)
ax = sns.countplot(y="category", data=df)  # horizontal bars: the categories go on the y-axis
sns.despine(bottom=True, left=True)
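A count plot is often easier to read when the bars are sorted by frequency. The following is only a sketch under assumptions: the toy DataFrame is made up, and the Agg backend line is just there so the snippet runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import pandas as pd
import seaborn as sns

# Made-up category column for illustration only.
toy = pd.DataFrame(
    {"category": ["Single Malt"] * 3 + ["Blended"] * 2 + ["Grain Scotch"]}
)

# value_counts() sorts by frequency, so its index is a ready-made bar order.
order = list(toy["category"].value_counts().index)

sns.set_style("white")
ax = sns.countplot(y="category", data=toy, order=order)
sns.despine(bottom=True, left=True)
```

The order parameter of countplot takes the category labels in the sequence in which they should be drawn.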

Distplot¶

Next up in our seaborn arsenal is the distplot. It is essentially a histogram, much like hist from pandas. (In seaborn 0.11 and later, distplot is deprecated in favour of histplot and displot.) One difference between pandas and seaborn is the data formatting: pandas needs the data passed in exactly the right format, whereas seaborn does most of the work for us. That's one of the reasons why I prefer pandas for quick exploration and seaborn for the more polished plots.

Back to the distplot: it is the same as a histogram, but if you want you can also plot the kernel density estimate (KDE), a statistical method that estimates the underlying continuous distribution from the data. As we can see in the plot, the whisky ratings are roughly normally distributed, shifted toward the higher values.
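For the curious, here is a rough sketch of what a kernel density estimate computes: a small Gaussian bump is placed on every data point and the bumps are averaged. The scores and the bandwidth below are made up, and seaborn picks its bandwidth automatically, so this is not meant to reproduce seaborn's curve exactly.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian kernels centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardised distance of every grid point to every sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Made-up review scores for illustration only.
scores = np.array([80, 84, 86, 87, 87, 88, 90, 92])
grid = np.linspace(60, 110, 501)
density = gaussian_kde_1d(scores, grid, bandwidth=2.0)

# A density should integrate to roughly 1 (simple Riemann sum here).
area = density.sum() * (grid[1] - grid[0])
peak = grid[density.argmax()]
print(area, peak)
```

A larger bandwidth gives a smoother curve, a smaller one follows the individual data points more closely.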

In [10]:

sns.distplot(df["review.point"], bins=20, kde=True)

Out[10]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f844de43470>

Scatterplot¶

The scatterplot and similar bivariate plots are grouped under the jointplot function in seaborn. As you can see, the scatterplot is messy even with a reduced number of data points. In seaborn we can just pass another parameter to turn it into a hexplot.
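The reduced number of data points comes from the price < 500 filter used in the plotting cells, which is a hand-picked threshold. A common data-driven alternative is Tukey's rule, the same rule box plots use by default for their whiskers: everything more than 1.5 times the interquartile range above the upper quartile counts as an outlier. A small worked sketch using the price quartiles from the describe() output above:

```python
# 25% and 75% price quantiles taken from the describe() output above.
q1, q3 = 70.0, 200.0

iqr = q3 - q1                  # interquartile range
upper_fence = q3 + 1.5 * iqr   # values above this count as outliers
lower_fence = q1 - 1.5 * iqr   # negative here, so no low-price outliers

print(iqr, lower_fence, upper_fence)
```

With these quartiles the upper fence lands at 395, which is in the same ballpark as the cut-off of 500.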

In [11]:

sns.jointplot(x='review.point', y='price', data=df[df['price']< 500].dropna())

Out[11]:

<seaborn.axisgrid.JointGrid at 0x7f844ff1bc88>

The Jointplot comes with big guns:¶

As you can see, the jointplot in seaborn also gives you the histograms of the two variables along the margins, which is a very nice addition to the plot.

Hexplot¶

To make a hexplot we just add kind='hex' and the gridsize to the jointplot function. We can now detect a small trend that more expensive whisky gets better ratings, but only by a small margin.
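To put a number on a trend like this, one option is a rank correlation (Spearman), which is less sensitive to extreme prices than the usual Pearson correlation. The sketch below uses made-up ratings and prices, so the resulting value says nothing about the real dataset:

```python
import pandas as pd

# Made-up ratings and prices for illustration only.
toy = pd.DataFrame({
    "review.point": [80, 84, 86, 88, 90, 92],
    "price": [60, 40, 120, 55, 300, 90],
})

# Same filter as in the plots.
subset = toy[toy["price"] < 500]

# Spearman correlation = Pearson correlation of the ranks.
rho = subset["review.point"].rank().corr(subset["price"].rank())
print(round(rho, 3))
```

A value near 0 means no monotonic relationship, a value near 1 means that pricier bottles almost always get better ratings.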

Can we do this even better?¶

Yes we can!
We plot two continuous variables against each other, so perhaps it's not the best idea to put them into bins.
We have already seen the kernel density estimate for one variable.
We can do it with the jointplot, too!

In [12]:

sns.jointplot(x='review.point', y='price', data=df[df['price']< 500].dropna(), kind='hex', gridsize=16)

Out[12]:

<seaborn.axisgrid.JointGrid at 0x7f844dd2d630>

2d KDE Plot¶

With kind='kde' we get a 2d-kernel density estimate. Here we can see the trend even better.

What do you think about this plot?
I think it's awesome

And finally, the best one¶

In [13]:

sns.jointplot(x='review.point', y='price', data=df[df['price']< 500].dropna(), kind='kde')

Out[13]:

<seaborn.axisgrid.JointGrid at 0x7f844db81fd0>

Boxes and Violins¶

With boxplot and violinplot we can draw box plots and violin plots. Violin plots are a nicer visualization technique, especially for continuous variables and data with outliers.
I find it very interesting that Grain Scotch is the whisky with the second-largest deviation.
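Reading the spread off a plot is a bit of a guess, so one way to double-check it is to compute the standard deviation per category. The toy numbers below are made up, so their ranking is not the ranking in the real data:

```python
import pandas as pd

# Made-up ratings for illustration only.
toy = pd.DataFrame({
    "category": ["Single Malt", "Single Malt", "Blended", "Blended",
                 "Grain Scotch", "Grain Scotch", "Grain Scotch"],
    "review.point": [80, 90, 85, 87, 70, 80, 90],
})

# Standard deviation of the ratings per category, largest first.
spread = (
    toy.groupby("category")["review.point"]
    .std()
    .sort_values(ascending=False)
)
print(spread)
```

Running the same groupby on df would give the actual ranking of the categories.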

In [14]:

sns.boxplot(x="category", y="review.point", data = df)

Out[14]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f844d8fffd0>

In [15]:

sns.violinplot(x="category", y="review.point", data = df)

Out[15]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f844d81c2e8>

In [16]:

sns.boxplot(x="category", y="price", data=df[df["price"]<500])

Out[16]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f844d7b7da0>

In [17]:

sns.violinplot(x="category", y="price", data=df[df["price"]<500])

Out[17]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f844d752400>

Conclusion¶

With seaborn we can build very nice visualizations
I like Whisky more than ramen :)
seaborn offers more and better plots than pandas, but we lose some customization

What's next¶

Do you want to hear more about styling your plots?
Like adjusting fonts, colour palettes, subplots and labels?
Or do you want a tutorial and blog post about neural networks and deep learning?

Micke's Data Science Blog