Bivariate plotting with Pandas¶

Welcome to the second post in my series on data visualization with Python¶

Previous Post ¶

Today we will do bivariate visualization with pandas
Then we will do the same with seaborn
Also we will explore multivariate plotting in seaborn
The next library to explore will be plotly
Finally, we will do an example data science project with the dataset: formulate a question, explore the data, and answer it.

The Dataset can be found on Kaggle at Ramen Dataset

In [1]:

import pandas as pd
%matplotlib inline

In the last post we saw that there could be some interesting relationships to explore with bivariate plotting, for example rating/country, rating/style, or style/country. We read the data in, convert the ratings to numeric values (entries that cannot be parsed become NaN), and fill the missing values with zero.

In [2]:

data = pd.read_csv("ramen-ratings.csv")
data['Stars'] = pd.to_numeric(data['Stars'], errors='coerce')
data['Stars'] = data['Stars'].fillna(0)
data.head(12)

Out[2]:

	Review #	Brand	Variety	Style	Country	Stars	Top Ten
0	2580	New Touch	T's Restaurant Tantanmen	Cup	Japan	3.75	NaN
1	2579	Just Way	Noodles Spicy Hot Sesame Spicy Hot Sesame Guan...	Pack	Taiwan	1.00	NaN
2	2578	Nissin	Cup Noodles Chicken Vegetable	Cup	USA	2.25	NaN
3	2577	Wei Lih	GGE Ramen Snack Tomato Flavor	Pack	Taiwan	2.75	NaN
4	2576	Ching's Secret	Singapore Curry	Pack	India	3.75	NaN
5	2575	Samyang Foods	Kimchi song Song Ramen	Pack	South Korea	4.75	NaN
6	2574	Acecook	Spice Deli Tantan Men With Cilantro	Cup	Japan	4.00	NaN
7	2573	Ikeda Shoku	Nabeyaki Kitsune Udon	Tray	Japan	3.75	NaN
8	2572	Ripe'n'Dry	Hokkaido Soy Sauce Ramen	Pack	Japan	0.25	NaN
9	2571	KOKA	The Original Spicy Stir-Fried Noodles	Pack	Singapore	2.50	NaN
10	2570	Tao Kae Noi	Creamy tom Yum Kung Flavour	Pack	Thailand	5.00	NaN
11	2569	Yamachan	Yokohama Tonkotsu Shoyu	Pack	USA	5.00	NaN

In [3]:

Style_by_country = data.groupby('Country')['Style'].value_counts().unstack().fillna(0)
Style_by_country.head()

Out[3]:

Style	Bar	Bowl	Box	Can	Cup	Pack	Tray
Country
Australia	0.0	0.0	0.0	0.0	17.0	5.0	0.0
Bangladesh	0.0	0.0	0.0	0.0	0.0	7.0	0.0
Brazil	0.0	0.0	0.0	0.0	2.0	3.0	0.0
Cambodia	0.0	0.0	0.0	0.0	0.0	5.0	0.0
Canada	0.0	8.0	0.0	0.0	17.0	16.0	0.0

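With this command we group the data by Country, count how often each Style occurs within each country, and use unstack to turn the styles into columns. Country/style combinations that never occur become NaN, which we fill with 0. A DataFrame in this format, with one row per country and one column per style, is what a stacked bar plot needs, as we will see with the following command.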
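Before plotting, the reshaping step can be tried on a tiny hand-made frame. This is only a minimal sketch: the `toy` data and the names `counts` and `toy_table` are made up for illustration and are not part of the ramen dataset.

```python
import pandas as pd

# Toy data (made up for illustration, not taken from the ramen dataset).
toy = pd.DataFrame({
    "Country": ["Japan", "Japan", "Japan", "USA", "USA"],
    "Style":   ["Bowl",  "Bowl",  "Pack",  "Cup", "Pack"],
})

# Count each Style within each Country.
# The result is a Series indexed by (Country, Style).
counts = toy.groupby("Country")["Style"].value_counts()

# Move the Style level into the columns. Combinations that never occur
# become NaN, so fill them with 0 to get a complete country-by-style table.
toy_table = counts.unstack().fillna(0)
print(toy_table)
```

Calling `toy_table.plot.bar(stacked=True)` on such a table should then draw one bar per row, with one colored segment per column.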

In [4]:

Style_by_country.plot.bar(stacked=True)

Out[4]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f8ec4d522e8>

This plot is very messy, so we filter the DataFrame down to the countries with more than 100 entries in total. We sum the entries along axis 1 (across the style columns) and keep only the rows where that sum exceeds 100. The resulting plot is a lot better.

In [5]:

top_countries = Style_by_country[Style_by_country.sum(axis=1) > 100]
top_countries.plot.bar(stacked=True)

Out[5]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f8ec4c5b240>
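The row filter can be looked at in isolation. The following is a small sketch with made-up numbers; `toy_table`, `row_totals`, `threshold`, and `big_rows` are illustrative names, not objects from the notebook.

```python
import pandas as pd

# Toy country-by-style counts (made-up numbers, not from the ramen dataset).
toy_table = pd.DataFrame(
    {"Cup": [60, 2, 10], "Pack": [70, 3, 95]},
    index=["A", "B", "C"],
)

# axis=1 sums across the columns, giving one total per row (per country).
row_totals = toy_table.sum(axis=1)

# The comparison yields a boolean Series; indexing the frame with it
# keeps only the rows where the value is True.
threshold = 100  # same cutoff as used in the post
big_rows = toy_table[row_totals > threshold]
print(big_rows)
```

Because the comparison is a strict `>`, a row whose total is exactly 100 would be dropped; `>=` would keep it.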

Horizontal is even better¶

With barh we get a horizontal bar chart, which shows the distribution by country even better. We can now see that in Japan there are a lot more ratings for bowls than for the ready-made packs from the supermarket. In Malaysia this trend is reversed.

In [6]:

top_countries.plot.barh(stacked=True)

Out[6]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f8ec4907b70>

Boxplots¶

With a box plot we can visualize how the counts for each style are distributed across our top countries. Here we can see the outlier among the bowls, which is Japan. We can also see that ramen packs are the most popular style by a wide margin.

In [7]:

top_countries.plot.box()

Out[7]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f8ec47d26d8>
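A box plot does not label its points, so to check which country sits behind an outlier it can help to look at the numbers directly. A minimal sketch on made-up counts (the frame and the names `summary` and `top_per_style` are assumptions for illustration):

```python
import pandas as pd

# Toy country-by-style counts (made-up numbers, not from the ramen dataset).
toy_table = pd.DataFrame(
    {"Bowl": [5, 40, 8], "Pack": [150, 120, 140]},
    index=["A", "B", "C"],
)

# Per-column summary statistics; the quartiles (25%, 50%, 75%) are the
# values a box plot draws its box from.
summary = toy_table.describe()
print(summary)

# For every style column, the row label (country) with the largest count.
top_per_style = toy_table.idxmax()
print(top_per_style)
```

On the real data, `top_countries.idxmax()` would be the analogous call to confirm which country has the most bowls.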

Stars vs Style¶

Let's look at the relationship between Stars and Style. First we need to select only the Style and Stars columns from the dataset. This is very easy with the following pandas command.

In [8]:

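# .copy() gives an independent frame, so adding columns later does not raise SettingWithCopyWarning
stars_by_style = data[["Style", "Stars"]].copy()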
stars_by_style.head()

Out[8]:

	Style	Stars
0	Cup	3.75
1	Pack	1.00
2	Cup	2.25
3	Pack	2.75
4	Pack	3.75

Label encoding¶

For the relationship between Stars and Style we can do some nice scatter plots with pandas. A scatter plot needs numerical values on both axes, so we have to encode the Style as a number. For this we use label encoding. First we need to convert the Style column to a categorical variable with astype("category").

In [9]:

stars_by_style["Style"] = stars_by_style["Style"].astype("category")
stars_by_style.dtypes

Out[9]:

Style    category
Stars     float64
dtype: object

With cat.codes every category gets assigned a unique integer code, as we can see with the head command. The codes follow the sorted order of the categories, so Bar is 0 and Tray is 6.

In [10]:

stars_by_style["Style_encode"] = stars_by_style["Style"].cat.codes
stars_by_style.head(10)

Out[10]:

	Style	Stars	Style_encode
0	Cup	3.75	4
1	Pack	1.00	5
2	Cup	2.25	4
3	Pack	2.75	5
4	Pack	3.75	5
5	Pack	4.75	5
6	Cup	4.00	4
7	Tray	3.75	6
8	Pack	0.25	5
9	Pack	2.50	5
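To read an encoded axis, it helps to have the mapping from code back to label. A short sketch on a made-up Series (`styles` and `code_to_style` are illustrative names); it relies on pandas storing the inferred categories in sorted order, with each code being the position in that list.

```python
import pandas as pd

# Toy column (made up for illustration).
styles = pd.Series(["Cup", "Pack", "Cup", "Tray", "Bowl"]).astype("category")

# The categories are kept in sorted order; a value's code is its
# position in this list.
code_to_style = dict(enumerate(styles.cat.categories))
print(code_to_style)

# The integer codes for each row of the Series.
print(styles.cat.codes.tolist())
```

The codes depend on which categories are present: here Cup is 1, while in the ramen data it is 4 because Bar, Bowl, Box, and Can sort before it. The numbers are labels only and carry no ordering of the styles.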

Scatter plot¶

A scatter plot in pandas is very easy with the following command. Because we have a lot of data points for every style, many of them overlap and we can't see a clear trend. For this case, pandas provides the hexbin method, which we will see in the following step. It aggregates the values into hexagonal bins and color-codes each bin by the number of points it contains.

In [11]:

stars_by_style.plot.scatter(x="Style_encode", y="Stars")

Out[11]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f8ec2658908>

This scatter plot doesn't tell us much, so let's try a hexbin plot. In the hexbin plot we see that the data centers around packs and ratings between three and four.

In [12]:

stars_by_style.plot.hexbin(x="Style_encode", y="Stars", gridsize=16)

Out[12]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f8ec25bf2b0>
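What a hexbin plot shows in color can also be checked as a plain table of counts. The following is a sketch on made-up data, using rectangular whole-star bins instead of hexagons; the frame and the names `star_bins` and `binned_counts` are assumptions for illustration.

```python
import pandas as pd

# Toy reviews (made-up values, not from the ramen dataset).
toy = pd.DataFrame({
    "Style": ["Pack", "Pack", "Pack", "Cup", "Cup", "Bowl"],
    "Stars": [3.5, 3.75, 5.0, 3.25, 1.0, 4.5],
})

# Bin the ratings into whole-star intervals. include_lowest=True makes
# the first interval also contain a rating of exactly 0.
star_bins = pd.cut(toy["Stars"], bins=[0, 1, 2, 3, 4, 5], include_lowest=True)

# Rows: style, columns: rating interval, values: number of reviews.
binned_counts = pd.crosstab(toy["Style"], star_bins)
print(binned_counts)
```

Run against `stars_by_style`, the same two calls would show how many reviews fall into each style and rating interval.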

Stars by Country¶

Let's do the same for Stars by Country.

In [13]:

# .copy() avoids SettingWithCopyWarning when the columns below are assigned
stars_by_country = data[["Country", "Stars"]].copy()
stars_by_country["Country"] = stars_by_country["Country"].astype("category")
stars_by_country["Country_encode"] = stars_by_country["Country"].cat.codes

In [14]:

stars_by_country.head(10)

Out[14]:

	Country	Stars	Country_encode
0	Japan	3.75	18
1	Taiwan	1.00	32
2	USA	2.25	35
3	Taiwan	2.75	32
4	India	3.75	16
5	South Korea	4.75	30
6	Japan	4.00	18
7	Japan	3.75	18
8	Japan	0.25	18
9	Singapore	2.50	29

In [15]:

stars_by_country.plot.scatter(x="Country_encode", y="Stars")

Out[15]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f8ec250e860>

In [16]:

stars_by_country.plot.hexbin(x="Country_encode", y="Stars", gridsize=18)

Out[16]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f8ec24746d8>

The only conclusion we can draw here is that the ramen ratings are generally high.
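When the plot itself is hard to read, a grouped summary is one possible complement. This is a sketch on made-up data, not something the post computes; `toy` and `per_country` are illustrative names.

```python
import pandas as pd

# Toy reviews (made-up values, not from the ramen dataset).
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B"],
    "Stars": [4.0, 3.5, 5.0, 2.0, 3.0],
})

# One row per country: number of reviews, mean rating, and median rating,
# sorted so the highest mean comes first.
per_country = (
    toy.groupby("Country")["Stars"]
       .agg(["count", "mean", "median"])
       .sort_values("mean", ascending=False)
)
print(per_country)
```

Keep in mind that missing ratings were filled with 0 earlier in this post, so on the real data they would pull a country's mean down.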

Conclusion¶

I hope you had a nice intro to bivariate plotting with pandas. Do you have any questions? For the visualization with seaborn and plotly I'm looking to use a different dataset. Do you have any recommendations, or questions you would like to have answered? :) Comment or write me on Twitter or LinkedIn.

Micke's Data Science Blog

Data visualization with pandas: Part 2