Data visualization seaborn: Let's drink whisky
Fri 22 June 2018
Let's start with data visualization with seaborn¶
Today we are gonna learn about plotting with seaborn and styling these plots. We will do univariate, bivariate and multivariate plots. For a change we will use the Scotch reviews dataset from Kaggle, because Whisky is as tasty as Ramen.
What is seaborn?¶
Seaborn is a library built specifically for data visualization in Python. It sits on top of matplotlib and feels a lot like the plotting functions of pandas. It has a lot of nice features for easy visualization and styling.
So let's explore Seaborn and Whisky ratings¶
import numpy as np
import pandas as pd
import seaborn as sns
As usual we just load the csv, look at the head, types, tail and summary.
df = pd.read_csv("scotch_review.csv")
df = df.dropna()
df.head()
df.tail()
df.dtypes
df['price'] = pd.to_numeric(df['price'], errors='coerce')
df.describe()
What have we learned¶
We have 2247 whisky ratings, each with a price, a review score and a category. That should make for some interesting visualizations, such as price by category. From the output of describe() we can also already see that the range of prices is very wide and that the mean is skewed by outliers.
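One quick way to check for this kind of skew is to compare the mean with the median. The following is a minimal sketch on a small made-up price list (the numbers are not taken from the dataset):

```python
import pandas as pd

# Hypothetical prices (not from the dataset): mostly moderate, one extreme outlier.
prices = pd.Series([40, 55, 60, 70, 85, 15000], name="price")

mean_price = prices.mean()
median_price = prices.median()

# A mean far above the median hints at a right-skewed distribution.
print(f"mean: {mean_price:.2f}, median: {median_price:.2f}")
```

On the real data the same comparison would be `df["price"].mean()` against `df["price"].median()`.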
ax = sns.countplot(x="category", data=df)
That's not a nice graphic¶
We can rename these long whisky names to something shorter :). To do this, we call unique() on the column to get its unique names and then give replace() a dictionary of the names that should be replaced.
df.category.unique()
df.category = df.category.replace({'Blended Scotch Whisky': 'Blended', 'Single Malt Scotch':'Single Malt',
'Blended Malt Scotch Whisky': 'Malt', 'Grain Scotch Whisky':'Grain Scotch',
'Single Grain Whisky': 'Single Grain' })
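To confirm that a renaming like this worked, it can help to count the values afterwards. This is a small sketch on a made-up mini-sample; the `sample` frame and the `short_names` dictionary are illustrative names, not part of the original notebook:

```python
import pandas as pd

# Hypothetical mini-sample that reuses two category names from the dataset.
sample = pd.DataFrame({
    "category": ["Blended Scotch Whisky", "Single Malt Scotch", "Single Malt Scotch"]
})

# Map the long names to shorter labels; names not in the dictionary stay unchanged.
short_names = {"Blended Scotch Whisky": "Blended", "Single Malt Scotch": "Single Malt"}
sample["category"] = sample["category"].replace(short_names)

# Count how often each (now shortened) category occurs.
counts = sample["category"].value_counts()
print(counts)
```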
Better ? better!¶
The graph should look nicer now. Additionally, as a preview of the blog post "styling your plots", we choose a white style for this plot, draw the bars horizontally and despine it (a larger font scale is left commented out). Despining means removing the axis lines, which often makes for a cleaner look.
sns.set_style("white")
#sns.set(font_scale=1.5)
ax = sns.countplot(y="category", data=df)  # passing the column as y draws horizontal bars
sns.despine(bottom=True, left=True)
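A count plot is often easier to read when the bars are sorted by frequency. The sketch below assumes the `order` parameter of `countplot` and uses a small made-up sample instead of the real dataframe:

```python
import pandas as pd
import seaborn as sns

# Hypothetical sample: the category counts are invented for illustration.
sample = pd.DataFrame({
    "category": ["Single Malt"] * 5 + ["Blended"] * 3 + ["Grain Scotch"]
})

# value_counts() sorts by frequency (descending), so its index gives the bar order.
order = sample["category"].value_counts().index

sns.set_style("white")
ax = sns.countplot(y="category", data=sample, order=order)
sns.despine(bottom=True, left=True)
```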
Distplot¶
Next up in our seaborn arsenal is the distplot. It is essentially a histogram, much like the hist plot from pandas. One difference between pandas and seaborn is the data formatting: pandas needs the data passed in exactly the right format, whereas seaborn does most of the work for us here. That's one of the reasons why I prefer pandas for quick exploration and seaborn for the more polished plots.
Back to the distplot: it is the same as a histogram, but if you want you can also plot the kernel density estimate, a mathematical method that estimates the underlying continuous distribution. As we can see in the plot, the whisky ratings are roughly normally distributed, shifted toward the higher values.
sns.distplot(df["review.point"], bins=20, kde=True)
Scatterplot¶
Scatterplots and similar plots are grouped under the jointplot function in seaborn. As you can see, the scatterplot is messy even with a reduced number of data points. In seaborn we can just pass another parameter to make a hexplot out of it.
sns.jointplot(x='review.point', y='price', data=df[df['price']< 500].dropna())
The Jointplot comes with big guns:¶
As you can see, the jointplot in seaborn also gives you the histograms of the two variables along the margins, which is a very nice addition to the plot.
Hexplot¶
To make a hexplot we just add kind='hex' and the gridsize to the jointplot function. We can now detect a small trend that more expensive whisky gets better ratings, but only by a small margin.
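A trend like this can also be put into a number with a correlation coefficient. Here is a minimal sketch using pandas' default Pearson correlation on invented values; only the column names are borrowed from the dataset:

```python
import pandas as pd

# Hypothetical ratings and prices, invented for illustration.
sample = pd.DataFrame({
    "review.point": [80, 84, 86, 88, 90, 93],
    "price": [35, 60, 45, 120, 90, 300],
})

# Same filter as in the plots: ignore the very expensive bottles.
cheap = sample[sample["price"] < 500]

# Pearson correlation: +1 is a perfect positive linear relation, 0 means none.
r = cheap["review.point"].corr(cheap["price"])
print(f"Pearson r = {r:.2f}")
```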
Can we do this even better?¶
- Yes we can
- We plot two continuous variables against each other, so perhaps it's not the best idea to put them into bins.
- We have already seen the kernel density estimate for one variable
- We can do it with the jointplot, too !
sns.jointplot(x='review.point', y='price', data=df[df['price']< 500].dropna(), kind='hex', gridsize=16)
2d KDE Plot¶
With kind='kde' we get a 2d-kernel density estimate. Here we can see the trend even better.
- What do you think about this plot?
- I think it's awesome
And finally for the best¶
sns.jointplot(x='review.point', y='price', data=df[df['price']< 500].dropna(), kind='kde')
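To make the idea behind a kernel density estimate more concrete, here is a hand-rolled one-dimensional sketch in numpy: it places a Gaussian bump on every data point and averages them. The function name, the bandwidth and the sample values are assumptions for illustration, and this is not how seaborn implements it internally. The 2d version follows the same idea with two-dimensional bumps.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average a Gaussian bump centred on every sample, evaluated on grid."""
    # Scaled distance of every grid point to every sample.
    diffs = (grid[:, None] - samples[None, :]) / bandwidth
    # Standard normal density evaluated at those distances.
    kernels = np.exp(-0.5 * diffs**2) / np.sqrt(2 * np.pi)
    # Mean over the samples, rescaled so the result integrates to about 1.
    return kernels.mean(axis=1) / bandwidth

samples = np.array([82.0, 85.0, 86.0, 88.0, 91.0])  # made-up review scores
grid = np.linspace(60, 110, 501)
density = gaussian_kde_1d(samples, grid, bandwidth=2.0)

# Sanity check: a density should have an area of roughly 1.
area = density.sum() * (grid[1] - grid[0])
print(f"area under the curve: {area:.3f}")
```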
Boxes and Violins¶
- With boxplot and violinplot we can draw box plots and violin plots. Violin plots are a nicer visualization technique, especially for continuous variables and data with outliers.
- I find it very interesting that Grain Scotch is the whisky with the second-largest spread
sns.boxplot(x="category", y="review.point", data = df)
sns.violinplot(x="category", y="review.point", data = df)
sns.boxplot(x="category", y="price", data=df[df["price"]<500])
sns.violinplot(x="category", y="price", data=df[df["price"]<500])
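The spread that the box and violin plots show per category can also be checked numerically with a groupby. This is a minimal sketch on invented scores; only the column names are borrowed from the dataset:

```python
import pandas as pd

# Hypothetical scores, invented for illustration.
sample = pd.DataFrame({
    "category": ["Blended", "Blended", "Blended",
                 "Single Malt", "Single Malt", "Single Malt"],
    "review.point": [80, 85, 90, 86, 87, 88],
})

# Standard deviation of the score per category, largest spread first.
spread = (
    sample.groupby("category")["review.point"]
    .std()
    .sort_values(ascending=False)
)
print(spread)
```

Running the same groupby on `df` would show which category really has the second-largest deviation.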
Conclusion¶
- With seaborn we can build very nice visualizations
- I like Whisky more than ramen :)
- seaborn offers more and better plots than pandas, but we lose some customization
What's next¶
- Do you want to hear more about styling your plots?
- Like adjusting fonts, colour palettes, subplots and labels?
- Or do you want some tutorial and blog post about neural networks and deep learning ?