Data visualization with pandas: Part 1

Sat 09 June 2018

Data Visualization with pandas on the Ramen Ratings Dataset

To improve my data visualization skills, I decided to build visualizations with a number of different frameworks and libraries.

This post covers univariate visualization with pandas
After that, we will do bivariate visualization with pandas
Then we will do the same with seaborn
We will also explore multivariate plotting in seaborn
The next library to explore will be plotly
Finally, we will do an example data science project with the dataset: formulate a question, explore the data, and answer it

The Dataset can be found on Kaggle at Ramen Dataset

Why only pandas

While seaborn and plotly offer a lot of functionality for polished plots and visualizations, pandas alone lets us visualize data quickly and well enough. As we will see, pandas expects its input in a different structure than seaborn or plotly. This makes very quick visualization easy if you just want to do an exploratory data analysis for yourself.

We just import pandas. Yes, only one import statement.

In [1]:

import pandas as pd

With pd.read_csv we can quickly read the CSV file into a pandas DataFrame named data. For an explanation of DataFrames visit (link)

With data.head(10) we display the first 10 entries to get a first glimpse of the data. Sometimes we can already spot missing values or unusual formats this way. For example, here we see the column Top Ten, for which most of the values are missing.

In [2]:

data = pd.read_csv("ramen-ratings.csv")
data.head(10)

Out[2]:

	Review #	Brand	Variety	Style	Country	Stars	Top Ten
0	2580	New Touch	T's Restaurant Tantanmen	Cup	Japan	3.75	NaN
1	2579	Just Way	Noodles Spicy Hot Sesame Spicy Hot Sesame Guan...	Pack	Taiwan	1	NaN
2	2578	Nissin	Cup Noodles Chicken Vegetable	Cup	USA	2.25	NaN
3	2577	Wei Lih	GGE Ramen Snack Tomato Flavor	Pack	Taiwan	2.75	NaN
4	2576	Ching's Secret	Singapore Curry	Pack	India	3.75	NaN
5	2575	Samyang Foods	Kimchi song Song Ramen	Pack	South Korea	4.75	NaN
6	2574	Acecook	Spice Deli Tantan Men With Cilantro	Cup	Japan	4	NaN
7	2573	Ikeda Shoku	Nabeyaki Kitsune Udon	Tray	Japan	3.75	NaN
8	2572	Ripe'n'Dry	Hokkaido Soy Sauce Ramen	Pack	Japan	0.25	NaN
9	2571	KOKA	The Original Spicy Stir-Fried Noodles	Pack	Singapore	2.5	NaN
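To put a number on the "mostly missing" impression from the preview, one option is to count the missing values per column. The following is only a minimal sketch on a tiny stand-in DataFrame; the name sample and its values are made up for illustration and are not taken from the ramen file.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the ramen data; all values here are made up.
sample = pd.DataFrame({
    "Brand": ["Brand A", "Brand B", "Brand C"],
    "Stars": ["3.75", "2.5", "4"],
    "Top Ten": [np.nan, np.nan, "some rank"],
})

# isna() marks missing cells, sum() counts them column by column.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

On the real data, the same call would be data.isna().sum().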

Types

It's always important to look at the data types in a given dataset. Here we see that all columns besides the Review # id are of the object type. For further plotting, it would be beneficial if the ratings were numerical. We can do this with the pandas to_numeric function. With errors='coerce', values that cannot be parsed as numbers become NaN instead of raising an error.

In [3]:

data.dtypes

Out[3]:

Review #     int64
Brand       object
Variety     object
Style       object
Country     object
Stars       object
Top Ten     object
dtype: object

In [4]:

data['Stars'] = pd.to_numeric(data['Stars'], errors='coerce')
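To see what errors='coerce' does, here is a small sketch on a hand-made Series. The entry "Unrated" is a hypothetical non-numeric value used only to trigger the coercion; I am not claiming it reflects the exact contents of the file.

```python
import pandas as pd

# Ratings stored as strings; "Unrated" is a hypothetical non-numeric entry.
stars = pd.Series(["3.75", "1", "Unrated", "2.5"])

# Unparseable strings become NaN instead of raising a ValueError.
numeric = pd.to_numeric(stars, errors="coerce")

print(numeric.dtype)         # float64
print(numeric.isna().sum())  # 1
```

It can be worth comparing the number of missing values before and after the conversion, since every coerced entry silently becomes NaN.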

The tail

We already looked at the head of the dataset for a peek. Why should we do it again with the tail? Often we can find missing values or fragmented data at the end of a file.

In [5]:

data.tail(10)

Out[5]:

	Review #	Brand	Variety	Style	Country	Stars	Top Ten
2570	10	Smack	Vegetable Beef	Pack	USA	1.5	NaN
2571	9	Sutah	Cup Noodle	Cup	South Korea	2.0	NaN
2572	8	Tung-I	Chinese Beef Instant Rice Noodle	Pack	Taiwan	3.0	NaN
2573	7	Ve Wong	Mushroom Pork	Pack	Vietnam	1.0	NaN
2574	6	Vifon	Nam Vang	Pack	Vietnam	2.5	NaN
2575	5	Vifon	Hu Tiu Nam Vang ["Phnom Penh" style] Asian Sty...	Bowl	Vietnam	3.5	NaN
2576	4	Wai Wai	Oriental Style Instant Noodles	Pack	Thailand	1.0	NaN
2577	3	Wai Wai	Tom Yum Shrimp	Pack	Thailand	2.0	NaN
2578	2	Wai Wai	Tom Yum Chili Flavor	Pack	Thailand	2.0	NaN
2579	1	Westbrae	Miso Ramen	Pack	USA	0.5	NaN

Countries

With this code we count the ratings per country and sort them in descending order. There are no big surprises here: Asian countries and the USA have the most ramen ratings. In the first plot we show all countries, but that looks very messy. In the second plot we show only the top 10 to get a more readable chart.

In [6]:

data["Country"].value_counts().plot.bar()

Out[6]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f9d77c0c5f8>

In [7]:

data["Country"].value_counts().head(10).plot.bar()

Out[7]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f9d757e92e8>
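The reason head(10) gives us the top 10 is that value_counts returns the counts already sorted in descending order. A minimal sketch on a made-up Series (the names countries, counts and top2 are only for illustration):

```python
import pandas as pd

# Made-up country column, just to show the mechanics.
countries = pd.Series(["Japan", "USA", "Japan", "Taiwan", "Japan", "USA"])

# value_counts sorts by count, largest first.
counts = countries.value_counts()

# head(2) therefore keeps the two most frequent countries.
top2 = counts.head(2)
print(top2)

# top2.plot.bar() would draw the bar chart (requires matplotlib).
```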

Let's look at the different styles of ramen

With this quick plot we see that most of the ramen are of the instant pack variety, with bowl and cup following. Box, can and bar are nearly nonexistent. This is interesting because an analysis of countries or ratings could look quite different depending on the style. We will start exploring this now by filtering on a single style, and continue in the next post with bivariate plotting.

In [8]:

data["Style"].value_counts().plot.bar()

Out[8]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f9d758546a0>
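If we want shares instead of raw counts, value_counts accepts normalize=True. This is just a sketch on a made-up Series, so the proportions below say nothing about the real dataset.

```python
import pandas as pd

# Made-up style column for illustration only.
styles = pd.Series(["Pack", "Pack", "Pack", "Cup", "Bowl", "Pack", "Cup", "Box"])

# normalize=True returns relative frequencies that sum to 1.
shares = styles.value_counts(normalize=True)
print(shares)

# shares.plot.bar() would plot the proportions (requires matplotlib).
```

On the real data this would be data["Style"].value_counts(normalize=True).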

Cleaning

How do we get only the Pack entries of the dataset? With pandas it takes a single line of code. After that we check whether it was successful: selecting every row whose Style is not Pack should return an empty table. If we plot the countries again, we see a different ordering, with South Korea and Taiwan no longer at the top.

In [9]:

pack = data[data.Style == "Pack"]
pack[pack.Style != "Pack"].head()

Out[9]:

	Review #	Brand	Variety	Style	Country	Stars	Top Ten
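The empty table above is a visual check. A programmatic variant of the same idea could look like the sketch below, which uses a small made-up DataFrame named sample instead of the real file.

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    "Style": ["Pack", "Cup", "Pack", "Bowl"],
    "Country": ["Japan", "USA", "Taiwan", "Japan"],
})

# Boolean mask: keep only rows whose Style equals "Pack".
pack = sample[sample["Style"] == "Pack"]

# Check 1: every remaining row is a Pack.
all_pack = (pack["Style"] == "Pack").all()

# Check 2: selecting non-Pack rows from the subset yields nothing.
leftovers = pack[pack["Style"] != "Pack"]

print(all_pack, leftovers.empty, len(pack))
```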

In [10]:

pack["Country"].value_counts().head(10).plot.bar()

Out[10]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f9d756bb4e0>

Brand

We can do the same with the brand and see that Nissin is the most frequent one. For a good data analysis we would now need to acquire some domain knowledge.

In [11]:

data["Brand"].value_counts().head(10).plot.bar()

Out[11]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f9d756c7e80>

Conclusion

In this post:

We explored some simple data formatting and univariate plotting with pandas
This is just to get a first idea about the dataset
These charts are just a preview for the data analyst, not the ones you would show to others
We learned some things about ramen :)

In the next post

We will do the same with more complicated transformations and bivariate plotting
This will conclude the first overview of the data. After that, we will make beautiful plots with seaborn.

Micke's Data Science Blog