Data visualization with pandas: Part 2

Bivariate plotting with Pandas

Welcome to the second post in my series on data visualization with Python.

Previous Post

  1. Today we will do bivariate visualization with pandas
  2. Then we will do the same with seaborn
  3. Also we will explore multivariate plotting in seaborn
  4. The next library to explore will be plotly
  5. Finally we will do an example data science project with the dataset: formulate a question, explore the data, and answer it.

The dataset can be found on Kaggle: Ramen Dataset

In [1]:
import pandas as pd
%matplotlib inline

In the last post we saw that there could be some interesting relationships to explore with bivariate plotting, for example rating/country, rating/style, or style/country. We read the data in, convert the ratings to numerical values, and fill the missing values with zero.

In [2]:
data = pd.read_csv("ramen-ratings.csv")
data['Stars'] = pd.to_numeric(data['Stars'], errors='coerce')
data['Stars'] = data['Stars'].fillna(0)
data.head(12)
Out[2]:
Review # Brand Variety Style Country Stars Top Ten
0 2580 New Touch T's Restaurant Tantanmen Cup Japan 3.75 NaN
1 2579 Just Way Noodles Spicy Hot Sesame Spicy Hot Sesame Guan... Pack Taiwan 1.00 NaN
2 2578 Nissin Cup Noodles Chicken Vegetable Cup USA 2.25 NaN
3 2577 Wei Lih GGE Ramen Snack Tomato Flavor Pack Taiwan 2.75 NaN
4 2576 Ching's Secret Singapore Curry Pack India 3.75 NaN
5 2575 Samyang Foods Kimchi song Song Ramen Pack South Korea 4.75 NaN
6 2574 Acecook Spice Deli Tantan Men With Cilantro Cup Japan 4.00 NaN
7 2573 Ikeda Shoku Nabeyaki Kitsune Udon Tray Japan 3.75 NaN
8 2572 Ripe'n'Dry Hokkaido Soy Sauce Ramen Pack Japan 0.25 NaN
9 2571 KOKA The Original Spicy Stir-Fried Noodles Pack Singapore 2.50 NaN
10 2570 Tao Kae Noi Creamy tom Yum Kung Flavour Pack Thailand 5.00 NaN
11 2569 Yamachan Yokohama Tonkotsu Shoyu Pack USA 5.00 NaN
In [3]:
Style_by_country = data.groupby('Country')['Style'].value_counts().unstack().fillna(0)
Style_by_country.head()
Out[3]:
Style Bar Bowl Box Can Cup Pack Tray
Country
Australia 0.0 0.0 0.0 0.0 17.0 5.0 0.0
Bangladesh 0.0 0.0 0.0 0.0 0.0 7.0 0.0
Brazil 0.0 0.0 0.0 0.0 2.0 3.0 0.0
Cambodia 0.0 0.0 0.0 0.0 0.0 5.0 0.0
Canada 0.0 8.0 0.0 0.0 17.0 16.0 0.0

With this command we group the data by Country, count how often each Style occurs for every country, and unstack the result so each style becomes its own column. DataFrames in this wide format are what stacked plots expect.
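As an aside, the same wide count table can also be built in a single call with pd.crosstab, which skips the unstack step. A minimal sketch on a few made-up rows (not taken from the ramen dataset):

```python
import pandas as pd

# Tiny made-up sample standing in for the ramen data
data = pd.DataFrame({
    "Country": ["Japan", "Japan", "USA", "Taiwan"],
    "Style":   ["Bowl",  "Pack",  "Cup", "Pack"],
})

# groupby + value_counts + unstack, as above
by_group = data.groupby("Country")["Style"].value_counts().unstack().fillna(0)

# pd.crosstab builds the same contingency table in one call
by_crosstab = pd.crosstab(data["Country"], data["Style"])

print(by_crosstab)
```

The only difference is the dtype: crosstab returns integer counts directly, while the fillna(0) route leaves floats.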

In [4]:
Style_by_country.plot.bar(stacked=True)
Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ec4d522e8>

This plot is very messy, so we filter the DataFrame down to the countries with more than 100 entries. We sum the entries along axis 1 (across the styles) and keep only the rows whose total exceeds 100. The resulting plot is a lot better.

In [5]:
top_countries = Style_by_country[Style_by_country.sum(axis=1) > 100]
top_countries.plot.bar(stacked = True)
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ec4c5b240>

Horizontal is even better

With barh we get a horizontal bar chart, which shows the distribution by country even better. We can now see that in Japan there are a lot more ratings for bowls than for the ready-made packs from the supermarket. In Malaysia this trend is reversed.

In [6]:
top_countries.plot.barh(stacked=True)
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ec4907b70>
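If we want the longest bars at the top, we can reorder the rows by their totals before plotting. A small sketch with made-up counts standing in for top_countries:

```python
import pandas as pd

# Made-up counts standing in for the top_countries frame
top_countries = pd.DataFrame(
    {"Bowl": [5, 120, 10], "Pack": [300, 180, 250]},
    index=["USA", "Japan", "Malaysia"],
)

# Sort rows by their total; barh draws rows bottom-to-top,
# so an ascending sort puts the biggest total on top
order = top_countries.sum(axis=1).sort_values().index
ordered = top_countries.loc[order]

# ordered.plot.barh(stacked=True) then draws the sorted chart
print(ordered.index.tolist())
```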

Boxplots

With a boxplot we can visualize, for each style, how its counts are distributed across the countries in our top_countries DataFrame. The outlier in the Bowl column is Japan, and we can also see that ramen packs are the most popular style by a wide margin.

In [7]:
top_countries.plot.box()
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ec47d26d8>

Stars vs Style

Let's look at the relationship between Stars and Style. First we need to select only the Stars and Style columns from the dataset. This is very easy with the following pandas command.

In [8]:
stars_by_style = data[["Style", "Stars"]].copy()  # .copy() avoids a SettingWithCopyWarning when we add columns later
stars_by_style.head()
Out[8]:
Style Stars
0 Cup 3.75
1 Pack 1.00
2 Cup 2.25
3 Pack 2.75
4 Pack 3.75

Label encoding

For the relationship between Stars and Style we can draw some nice scatter plots with pandas. For this we need to encode the Style as a numerical value, which we do with label encoding. First we convert the Style to a categorical variable with astype("category").

In [9]:
stars_by_style["Style"] = stars_by_style["Style"].astype("category")
stars_by_style.dtypes
Out[9]:
Style    category
Stars     float64
dtype: object

With cat.codes every category gets assigned a unique id, as we can see with the head command.

In [10]:
stars_by_style["Style_encode"] = stars_by_style["Style"].cat.codes
stars_by_style.head(10)
Out[10]:
Style Stars Style_encode
0 Cup 3.75 4
1 Pack 1.00 5
2 Cup 2.25 4
3 Pack 2.75 5
4 Pack 3.75 5
5 Pack 4.75 5
6 Cup 4.00 4
7 Tray 3.75 6
8 Pack 0.25 5
9 Pack 2.50 5
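If we later need to translate the ids back to their labels, the mapping behind cat.codes can be recovered from cat.categories. A minimal sketch on a toy Style column:

```python
import pandas as pd

# Toy Style column; categories are sorted alphabetically by default
styles = pd.Series(["Cup", "Pack", "Cup", "Tray"], dtype="category")

codes = styles.cat.codes                           # numeric id per row
mapping = dict(enumerate(styles.cat.categories))   # id -> original label

print(codes.tolist())
print(mapping)
```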

Scatter plot

A scatter plot in pandas is very easy with the following command. Since we have a lot of data points for every style, we can't make out a clear trend. For this case, pandas provides the hexbin method, which we will see in the next step: it aggregates the values into hexagonal bins and color-codes each bin by its count.

In [11]:
stars_by_style.plot.scatter(x="Style_encode", y="Stars")
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ec2658908>

This scatter plot doesn't tell us much, so let's try a hexplot. In the hexplot we see that the data centers around packs and ratings between three and four.

In [12]:
stars_by_style.plot.hexbin(x="Style_encode", y="Stars", gridsize=16)
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ec25bf2b0>
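Instead of eyeballing the hexplot, we can also quantify that center directly with a groupby mean and a value count. A sketch on made-up rows standing in for stars_by_style:

```python
import pandas as pd

# Made-up sample standing in for the stars_by_style frame
stars_by_style = pd.DataFrame({
    "Style": ["Cup", "Pack", "Cup", "Pack", "Tray"],
    "Stars": [3.75, 1.00, 2.25, 2.75, 3.75],
})

# Average rating and number of entries per style
mean_by_style = stars_by_style.groupby("Style")["Stars"].mean()
count_by_style = stars_by_style["Style"].value_counts()

print(mean_by_style)
print(count_by_style)
```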

Stars by Country

Let's do the same for Stars by Country.

In [13]:
stars_by_country = data[["Country", "Stars"]].copy()  # .copy() avoids a SettingWithCopyWarning
stars_by_country["Country"] = stars_by_country["Country"].astype("category")
stars_by_country["Country_encode"] = stars_by_country["Country"].cat.codes
In [14]:
stars_by_country.head(10)
Out[14]:
Country Stars Country_encode
0 Japan 3.75 18
1 Taiwan 1.00 32
2 USA 2.25 35
3 Taiwan 2.75 32
4 India 3.75 16
5 South Korea 4.75 30
6 Japan 4.00 18
7 Japan 3.75 18
8 Japan 0.25 18
9 Singapore 2.50 29
In [15]:
stars_by_country.plot.scatter(x = "Country_encode", y = "Stars")
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ec250e860>
In [16]:
stars_by_country.plot.hexbin(x = "Country_encode", y = "Stars", gridsize=18)
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ec24746d8>

The only conclusion we can draw here is that the ramen ratings are generally high.
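That impression is easy to check numerically with describe() and median() on the Stars column. A sketch on made-up ratings standing in for data["Stars"]:

```python
import pandas as pd

# Made-up ratings standing in for data["Stars"]
stars = pd.Series([3.75, 1.00, 2.25, 4.75, 4.00, 3.75, 5.00])

print(stars.median())     # middle rating
print(stars.describe())   # count, mean, std, quartiles
```

On the full dataset, a median well above the midpoint of the 0-5 scale would back up the "generally high" reading.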

Conclusion

I hope this was a nice intro to bivariate plotting with pandas. Do you have any questions? For the visualizations with seaborn and plotly I'm looking to use a different dataset. Do you have any recommendations, or questions you would like to see answered? :) Comment or write to me on Twitter or LinkedIn.

Hope to hear from you soon.