Data visualization with pandas: Part 2
Sun 17 June 2018
Bivariate plotting with Pandas¶
Welcome to the second post in my series on data visualization with Python¶
Previous Post ¶
- Today we will do bivariate visualization with pandas
- Then we will do the same with seaborn
- Also we will explore multivariate plotting in seaborn
- The next library to explore will be plotly
- Finally, we will do an example data science project with the dataset: formulating a question, exploring the data, and answering it.
The dataset can be found on Kaggle: Ramen Dataset
import pandas as pd
%matplotlib inline
In the last post we saw that there are some interesting relationships to explore with bivariate plotting, for example rating/country, rating/style, or style/country. We read the data in, convert the ratings to numerical values (entries that cannot be parsed become NaN), and fill the missing values with zero.
data = pd.read_csv("ramen-ratings.csv")
data['Stars'] = pd.to_numeric(data['Stars'], errors='coerce')
data['Stars'] = data['Stars'].fillna(0)
data.head(12)
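To make the cleaning step above concrete, here is a minimal sketch on a few made-up values (the sample strings are assumptions, not rows from the dataset). It shows that errors="coerce" turns anything that cannot be parsed as a number into NaN, which fillna(0) then replaces with zero.

```python
import pandas as pd

# Made-up ratings for illustration only; not taken from the ramen dataset
raw = pd.Series(["3.75", "5", "not a number", None])

# Values that cannot be parsed become NaN instead of raising an error
stars = pd.to_numeric(raw, errors="coerce")

# Replace the NaN entries with 0, as done for the Stars column above
stars_filled = stars.fillna(0)
print(stars_filled.tolist())
```

Keep in mind that the zeros are then treated like real ratings by every later plot.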
Style_by_country = data.groupby('Country')['Style'].value_counts().unstack().fillna(0)
Style_by_country.head()
With this command we group the data by Country, count how often each Style occurs in each country, and pivot the styles into columns. A DataFrame in this wide format, with one row per country and one column per style, is what a stacked bar plot needs. We draw it with the following command.
Style_by_country.plot.bar(stacked=True)
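The shape of the table behind the stacked bar chart is easier to see on a tiny example. The sketch below uses made-up rows (an assumption for illustration) and also compares the result with pd.crosstab, a different pandas function that should give the same counts.

```python
import pandas as pd

# Toy data for illustration only; not taken from the ramen dataset
toy = pd.DataFrame({
    "Country": ["Japan", "Japan", "Japan", "Malaysia", "Malaysia"],
    "Style": ["Bowl", "Bowl", "Pack", "Pack", "Pack"],
})

# Same chain as above: count styles per country, then pivot styles into columns
wide = toy.groupby("Country")["Style"].value_counts().unstack().fillna(0)
print(wide)

# pd.crosstab builds the same count table in one call
ct = pd.crosstab(toy["Country"], toy["Style"])
print(ct)
```

Combinations that never occur, such as Malaysia/Bowl here, come out of unstack() as NaN, which is why the fillna(0) step is needed before plotting.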
The stacked bar plot for all countries is very messy, so we keep only the countries with more than 100 entries. We sum the entries along axis 1, that is across the style columns of each row, and filter on that total. The resulting plot is a lot better.
top_countries = Style_by_country[Style_by_country.sum(axis=1) > 100]
top_countries.plot.bar(stacked=True)
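The filter above combines a row sum with a boolean mask. Here is a small sketch of that pattern on a made-up count table (the labels A, B, C and all numbers are assumptions for illustration):

```python
import pandas as pd

# Toy count table for illustration only; the numbers are made up
counts = pd.DataFrame(
    {"Bowl": [60, 5, 10], "Cup": [30, 10, 20], "Pack": [50, 40, 90]},
    index=["A", "B", "C"],
)

# Total entries per row (axis=1 sums across the columns)
totals = counts.sum(axis=1)

# The boolean mask keeps only the rows whose total exceeds the threshold
large = counts[totals > 100]
print(large.index.tolist())
```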
Horizontal is even better¶
With barh we get a horizontal bar chart, which shows the distribution by country even better. We can now see that in Japan there are a lot more ratings for Bowls than for the ready-made Packs from the supermarket. In Malaysia this trend is reversed.
top_countries.plot.barh(stacked=True)
Boxplots¶
With a boxplot we can visualize how the counts for each style are distributed across our top countries. Here we can see the outlier among the Bowls, which is Japan. We can also see that Packs are the most common style by a wide margin.
top_countries.plot.box()
Stars vs Style¶
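Let's look at the relationship between Stars and Style. First we select only the Style and Stars columns from the dataset, which is very easy with the following pandas command.
# .copy() gives an independent DataFrame, so adding columns later does not trigger a SettingWithCopyWarning
stars_by_style = data[["Style", "Stars"]].copy()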
stars_by_style.head()
Label encoding¶
To relate Stars and Style we can do some nice scatter plots with pandas. For this we need to encode the Style as a numerical value, which we do with label encoding. First we convert the Style column to a categorical variable with astype("category").
stars_by_style["Style"] = stars_by_style["Style"].astype("category")
stars_by_style.dtypes
With cat.codes every category gets assigned a unique integer code, as we can see with the head command.
stars_by_style["Style_encode"] = stars_by_style["Style"].cat.codes
stars_by_style.head(10)
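To see how the label encoding above assigns its numbers, here is a minimal sketch on a few made-up style values (the sample data and the code_to_label helper are assumptions for illustration):

```python
import pandas as pd

# Toy styles for illustration only, including one missing value
styles = pd.Series(["Pack", "Bowl", "Cup", "Pack", None]).astype("category")

# Categories are sorted alphabetically, and each code is a position in that list
print(list(styles.cat.categories))   # ['Bowl', 'Cup', 'Pack']
print(styles.cat.codes.tolist())     # [2, 0, 1, 2, -1]

# Lookup table from code back to label, handy for reading the plot axes
code_to_label = dict(enumerate(styles.cat.categories))
print(code_to_label)
```

Missing values get the code -1, and the codes are only labels, so the distance between two codes on a plot axis carries no meaning.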
Scatter plot¶
A scatter plot in pandas is very easy with the following command. Because we have a lot of data points for every style, many of them overlap and we can't see a clear trend. For this case, pandas provides the hexbin method, which we will see in the following step. It aggregates the values in hexagonal bins and color-codes each bin by the number of points it contains.
stars_by_style.plot.scatter(x="Style_encode", y="Stars")
This scatter plot doesn't tell us much, so let's try a hexbin plot. In the hexbin plot we see that the data centers around Packs and ratings between three and four.
stars_by_style.plot.hexbin(x="Style_encode", y="Stars", gridsize=16)
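As a rough numeric counterpart to the hexbin plot, the counts per bin can also be tabulated. The sketch below uses rectangular bins built with pd.cut and pd.crosstab rather than hexagons, so it is an analogue and not what hexbin computes internally. The sample values and bin edges are assumptions for illustration.

```python
import pandas as pd

# Toy ratings for illustration only; not taken from the ramen dataset
toy = pd.DataFrame({
    "Style": ["Pack", "Pack", "Pack", "Bowl", "Cup", "Pack"],
    "Stars": [3.5, 3.75, 4.5, 3.25, 5.0, 0.0],
})

# Cut the star ratings into one-star-wide intervals from 0 to 5
star_bins = pd.cut(toy["Stars"], bins=[0, 1, 2, 3, 4, 5], include_lowest=True)

# Count how many ratings fall into each (style, star interval) cell
binned = pd.crosstab(toy["Style"], star_bins)
print(binned)
```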
Stars by Country¶
Let's do the same for Stars by Country.
# .copy() avoids a SettingWithCopyWarning when we add columns below
stars_by_country = data[["Country", "Stars"]].copy()
stars_by_country["Country"] = stars_by_country["Country"].astype("category")
stars_by_country["Country_encode"] = stars_by_country["Country"].cat.codes
stars_by_country.head(10)
stars_by_country.plot.scatter(x="Country_encode", y="Stars")
stars_by_country.plot.hexbin(x="Country_encode", y="Stars", gridsize=18)
The only conclusion we can find here is that the ramen ratings are generally high.
Conclusion¶
I hope you had a nice intro to bivariate plotting with pandas. Do you have any questions? For the visualization with seaborn and plotly I'm looking to use a different dataset. Do you have any recommendations or questions you would like to have answered? :) Comment or write me on Twitter or LinkedIn.