Data visualization with seaborn: Let's drink whisky

Let's start with data visualization with seaborn

Today we are gonna learn about plotting with seaborn and styling these plots. We will do univariate, bivariate and multivariate plots. For a change we will use the Scotch reviews dataset from Kaggle, because Whisky is as tasty as Ramen.

What is seaborn?

Seaborn is a library built specifically for data visualization in Python. Like the plotting functions of pandas, it is built on top of matplotlib. It has a lot of nice features for easy visualization and styling.

So let's explore Seaborn and Whisky ratings

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

As usual we just load the csv, look at the head, types, tail and summary.

In [2]:
df = pd.read_csv("scotch_review.csv")
df = df.dropna()
df.head()
Out[2]:
   Unnamed: 0  name                                             category                    review.point  price     currency  description
0  1           Johnnie Walker Blue Label, 40%                   Blended Scotch Whisky       97            225       $         Magnificently powerful and intense. Caramels, ...
1  2           Black Bowmore, 1964 vintage, 42 year old, 40.5%  Single Malt Scotch          97            4500.00   $         What impresses me most is how this whisky evol...
2  3           Bowmore 46 year old (distilled 1964), 42.9%      Single Malt Scotch          97            13500.00  $         There have been some legendary Bowmores from t...
3  4           Compass Box The General, 53.4%                   Blended Malt Scotch Whisky  96            325       $         With a name inspired by a 1926 Buster Keaton m...
4  5           Chivas Regal Ultis, 40%                          Blended Malt Scotch Whisky  96            160       $         Captivating, enticing, and wonderfully charmin...
In [3]:
df.tail()
Out[3]:
      Unnamed: 0  name                                               category             review.point  price   currency  description
2242  2243        Duncan Taylor (distilled at Cameronbridge), Ca...  Grain Scotch Whisky  72            125.00  $         Its best attributes are vanilla, toasted cocon...
2243  2244        Distillery Select 'Craiglodge' (distilled at L...  Single Malt Scotch   71            60.00   $         Aged in a sherry cask, which adds sweet notes ...
2244  2245        Edradour Barolo Finish, 11 year old, 57.1%         Single Malt Scotch   70            80.00   $         Earthy, fleshy notes with brooding grape notes...
2245  2246        Highland Park, Cask #7380, 1981 vintage, 25 ye...  Single Malt Scotch   70            225.00  $         The sherry is very dominant and cloying, which...
2246  2247        Distillery Select 'Inchmoan' (distilled at Loc...  Single Malt Scotch   63            60.00   $         Fiery peat kiln smoke, tar, and ripe barley on...
In [4]:
df.dtypes
Out[4]:
Unnamed: 0       int64
name            object
category        object
review.point     int64
price           object
currency        object
description     object
dtype: object
In [5]:
df['price'] = pd.to_numeric(df['price'], errors='coerce')
df.describe()
Out[5]:
        Unnamed: 0  review.point          price
count  2247.000000   2247.000000    2228.000000
mean   1124.000000     86.700045     479.311490
std     648.797349      4.054055    3719.372546
min       1.000000     63.000000      12.000000
25%     562.500000     84.000000      70.000000
50%    1124.000000     87.000000     109.000000
75%    1685.500000     90.000000     200.000000
max    2247.000000     97.000000  157000.000000

What have we learned?

We have 2247 whisky reviews, each with a price, a rating and a category. That allows for some interesting visualizations, for example category against price. From the describe() output we can also already see that the range of prices is very wide and that outliers skew the mean: the median price is 109, the mean is about 479, and the maximum is 157000.
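
To see how strongly a single extreme price pulls the mean, it helps to compare it with the median. A minimal sketch on made-up prices (the numbers below are only for illustration and are not taken from the dataset):

```python
import pandas as pd

# Made-up prices for illustration: ordinary bottles plus one very rare one.
prices = pd.Series([45, 60, 70, 85, 110, 150, 200, 13500], name="price")

mean_price = prices.mean()      # pulled up strongly by the single outlier
median_price = prices.median()  # barely affected by the outlier

print(f"mean:   {mean_price:.2f}")
print(f"median: {median_price:.2f}")
```

On the real data the same comparison would be df["price"].mean() against df["price"].median().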

In [6]:
# pass the column by keyword; recent seaborn versions no longer accept it positionally
ax = sns.countplot(x="category", data=df)

That's not a nice graphic

We can rename these long whisky category names to something shorter :). We use unique() to get the distinct names in the column and then give replace() a dictionary that maps each old name to its replacement.

In [7]:
df.category.unique()
Out[7]:
array(['Blended Scotch Whisky', 'Single Malt Scotch',
       'Blended Malt Scotch Whisky', 'Grain Scotch Whisky',
       'Single Grain Whisky'], dtype=object)
In [8]:
df.category = df.category.replace({'Blended Scotch Whisky': 'Blended', 'Single Malt Scotch':'Single Malt', 
                     'Blended Malt Scotch Whisky': 'Malt', 'Grain Scotch Whisky':'Grain Scotch',
                     'Single Grain Whisky': 'Single Grain' })
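
As a small sketch of what replace does with a dictionary, here it is on a toy Series (the values are made up for illustration). Names that are not keys in the dictionary are left untouched, and value_counts() gives the numbers that a countplot draws as bars.

```python
import pandas as pd

# Toy column for illustration only; the real one comes from scotch_review.csv.
category = pd.Series([
    "Single Malt Scotch", "Blended Scotch Whisky", "Single Malt Scotch",
    "Grain Scotch Whisky", "Single Malt Scotch",
])

# Map long names to short ones; "Grain Scotch Whisky" is not a key, so it stays as-is.
short = category.replace({
    "Single Malt Scotch": "Single Malt",
    "Blended Scotch Whisky": "Blended",
})

counts = short.value_counts()  # the bar heights a countplot would show
print(counts)
```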

Better? Better!

The graph should look nicer now. Additionally, as a preview of the blog post "styling your plots", we choose the "white" style for this plot, optionally change the font size, and despine it. Despining means removing the border lines (spines) of the axes, which often makes for a cleaner look.

In [9]:
sns.set_style("white")
#sns.set(font_scale=1.5)
ax = sns.countplot(y="category", data=df)  # y= draws horizontal bars
sns.despine(bottom=True, left=True)
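
A self-contained sketch of the same styling steps on a tiny made-up DataFrame. It assumes a reasonably recent seaborn and uses matplotlib's non-interactive Agg backend so that it also runs outside a notebook; the data and the name toy are invented for the example.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up categories for illustration only.
toy = pd.DataFrame({"category": ["Single Malt", "Blended", "Single Malt", "Malt"]})

sns.set_style("white")                        # plain white background, no grid
fig, ax = plt.subplots()
sns.countplot(y="category", data=toy, ax=ax)  # y= draws horizontal bars
sns.despine(ax=ax, bottom=True, left=True)    # top and right are removed by default

# After despining, none of the four spines is visible any more.
visible = {side: spine.get_visible() for side, spine in ax.spines.items()}
print(visible)
```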

Distplot

Next up in our seaborn arsenal is the distplot. It is a histogram, much like the hist plot from pandas. One difference between pandas and seaborn is the data formatting: pandas needs the data in exactly the right format, whereas seaborn does most of that work for us. That's one of the reasons why I prefer pandas for quick exploration and seaborn for the more polished plots.

Back to the distplot: on top of the histogram it can also draw the kernel density estimate (KDE), a mathematical method that estimates the underlying continuous distribution from the data. As we can see in the plot, the whisky ratings are roughly normally distributed, shifted toward the higher values.

In [10]:
# histplot replaces distplot, which is deprecated since seaborn 0.11
sns.histplot(x=df["review.point"], bins=20, kde=True)
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f844de43470>

Scatterplot

The scatterplot and similar bivariate plots are grouped under the jointplot command in seaborn. As you can see, the scatterplot is messy even with a reduced number of data points (we only keep whiskies with a price below 500). In seaborn we can just pass another parameter to make a hexplot out of it.

In [11]:
sns.jointplot(x='review.point', y='price', data=df[df['price']< 500].dropna())
Out[11]:
<seaborn.axisgrid.JointGrid at 0x7f844ff1bc88>

The Jointplot comes with big guns:

As you can see, the jointplot in seaborn also gives you the histograms of the two variables along the two sides, which is a very nice addition to the plot.

Hexplot

To make a hexplot we just add kind='hex' and the gridsize to the jointplot function. We can now detect a small trend that more expensive whisky gets better ratings, but only by a small margin.

Can we do this even better?

  • Yes we can
  • We plot two continuous variables against each other, so perhaps it's not the best idea to put them into bins.
  • We have already seen the kernel density estimate for one variable
  • We can do it with the jointplot, too!
In [12]:
sns.jointplot(x='review.point', y='price', data=df[df['price']< 500].dropna(), kind='hex', gridsize=16)
Out[12]:
<seaborn.axisgrid.JointGrid at 0x7f844dd2d630>

2d KDE Plot

With kind='kde' we get a 2d-kernel density estimate. Here we can see the trend even better.

  • What do you think about this plot?
  • I think it's awesome

And finally, the best one:

In [13]:
sns.jointplot(x='review.point', y='price', data=df[df['price']< 500].dropna(), kind='kde')
Out[13]:
<seaborn.axisgrid.JointGrid at 0x7f844db81fd0>
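
To get a feeling for what the contours represent: a 2D kernel density estimate places a small Gaussian bump on every data point and averages them. Below is a bare-bones NumPy sketch of that idea. It is not seaborn's implementation; the data is random stand-in data, the helper names are invented, and the bandwidths bw_x and bw_y are picked by hand, whereas seaborn chooses them automatically.

```python
import numpy as np

def gaussian_1d(grid, points, bw):
    """Gaussian bumps of width bw: one row per data point, one column per grid value."""
    z = (grid[None, :] - points[:, None]) / bw
    return np.exp(-0.5 * z**2) / (bw * np.sqrt(2 * np.pi))

def kde_2d(x, y, grid_x, grid_y, bw_x, bw_y):
    """Average of product Gaussian kernels, evaluated on a grid_y by grid_x mesh."""
    kx = gaussian_1d(grid_x, x, bw_x)  # shape (n_points, len(grid_x))
    ky = gaussian_1d(grid_y, y, bw_y)  # shape (n_points, len(grid_y))
    return ky.T @ kx / len(x)          # shape (len(grid_y), len(grid_x))

rng = np.random.default_rng(2)
rating = rng.normal(87, 4, size=200)    # stand-in for review.point
price = rng.uniform(20, 500, size=200)  # stand-in for price

bw_x, bw_y = 1.5, 40.0  # hand-picked bandwidths (assumption)
grid_x = np.linspace(rating.min() - 5 * bw_x, rating.max() + 5 * bw_x, 200)
grid_y = np.linspace(price.min() - 5 * bw_y, price.max() + 5 * bw_y, 200)

density = kde_2d(rating, price, grid_x, grid_y, bw_x, bw_y)

# Sanity check: a density should integrate to roughly 1 over the grid.
cell_area = (grid_x[1] - grid_x[0]) * (grid_y[1] - grid_y[0])
total = density.sum() * cell_area
print(round(total, 3))
```

The contour lines of the kde jointplot are level curves of such a density surface.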

Boxes and Violins

  • With boxplot and violinplot we can draw boxplots and violin plots. Violin plots are often the nicer visualization technique, especially for continuous variables and data with outliers.
  • I find it very interesting that Grain Scotch is the whisky with the second-largest spread
In [14]:
sns.boxplot(x="category", y="review.point", data = df)
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f844d8fffd0>
In [15]:
sns.violinplot(x="category", y="review.point", data = df)
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f844d81c2e8>
In [16]:
sns.boxplot(x="category", y="price", data=df[df["price"]<500])
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f844d7b7da0>
In [17]:
sns.violinplot(x="category", y="price", data=df[df["price"]<500])
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f844d752400>
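
What the boxes encode can be computed directly with pandas: the box spans the first to the third quartile, the line inside is the median, and with seaborn's default whis=1.5 the whiskers reach to the most extreme points within 1.5 times the interquartile range (IQR) of the box. Anything beyond those fences is drawn as an outlier. A sketch on made-up ratings (the values are for illustration only):

```python
import pandas as pd

# Made-up ratings for illustration only.
toy = pd.DataFrame({
    "category": ["Blended"] * 5 + ["Single Malt"] * 5,
    "rating": [80, 84, 86, 88, 90, 63, 85, 87, 89, 91],
})

# Quartiles per category: one row per category, one column per quantile.
q = toy.groupby("category")["rating"].quantile([0.25, 0.5, 0.75]).unstack()
q.columns = ["q1", "median", "q3"]

q["iqr"] = q["q3"] - q["q1"]
q["lower_fence"] = q["q1"] - 1.5 * q["iqr"]  # points below this are drawn as outliers
q["upper_fence"] = q["q3"] + 1.5 * q["iqr"]  # points above this are drawn as outliers
print(q)
```

In this toy data the Single Malt rating of 63 lies below its lower fence, so a boxplot would show it as a separate point.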

Conclusion

  • With seaborn we can build very nice visualizations
  • I like Whisky more than ramen :)
  • seaborn offers more and better plots than pandas, but we lose some customization

What's next

  • Do you want to hear more about styling your plots?
  • Like adjusting fonts, colour palettes, subplots and labels?
  • Or do you want a tutorial and blog post about neural networks and deep learning?

I always love suggestions and criticism

Hope you learned something and had fun