Data visualization with pandas: Part 1

Data Visualization with pandas on the Ramen Ratings Dataset

To improve my data visualization skills, I decided to work through visualizations with a number of different frameworks and libraries.

  1. This post covers univariate visualization with pandas.
  2. After that, we will do bivariate visualization with pandas.
  3. Then we will do the same with seaborn.
  4. We will also explore multivariate plotting in seaborn.
  5. The next library to explore will be plotly.
  6. Finally, we will do an example data science project with the dataset: formulate a question, explore the data, and answer it.

The Dataset can be found on Kaggle at Ramen Dataset

Why only pandas

While seaborn and plotly offer a lot of functionality for polished plots, pandas lets us visualize data quickly with very little code. As we will see, pandas expects its input in a different structure than seaborn or plotly do. This makes it well suited to a quick exploratory data analysis that you do just for yourself.

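We just import pandas. Yes, only one import statement. (pandas uses matplotlib behind the scenes for plotting, so matplotlib has to be installed, but we do not need to import it ourselves.)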

In [1]:
import pandas as pd

With pd.read_csv we can quickly read the CSV file into a pandas DataFrame named data. For an explanation of DataFrames, visit (link).

With data.head(10) we display the first 10 entries to get a first glimpse of the data. Sometimes this already reveals missing values or unusual formats. For example, here we see the column Top Ten, where most of the values are missing.

In [2]:
data = pd.read_csv("ramen-ratings.csv")
data.head(10)
Out[2]:
   Review #  Brand           Variety                                            Style  Country      Stars  Top Ten
0  2580      New Touch       T's Restaurant Tantanmen                           Cup    Japan        3.75   NaN
1  2579      Just Way        Noodles Spicy Hot Sesame Spicy Hot Sesame Guan...  Pack   Taiwan       1      NaN
2  2578      Nissin          Cup Noodles Chicken Vegetable                      Cup    USA          2.25   NaN
3  2577      Wei Lih         GGE Ramen Snack Tomato Flavor                      Pack   Taiwan       2.75   NaN
4  2576      Ching's Secret  Singapore Curry                                    Pack   India        3.75   NaN
5  2575      Samyang Foods   Kimchi song Song Ramen                             Pack   South Korea  4.75   NaN
6  2574      Acecook         Spice Deli Tantan Men With Cilantro                Cup    Japan        4      NaN
7  2573      Ikeda Shoku     Nabeyaki Kitsune Udon                              Tray   Japan        3.75   NaN
8  2572      Ripe'n'Dry      Hokkaido Soy Sauce Ramen                           Pack   Japan        0.25   NaN
9  2571      KOKA            The Original Spicy Stir-Fried Noodles              Pack   Singapore    2.5    NaN
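
Eyeballing the first rows only hints at missing values. As a rough sketch, the missing entries can also be counted per column. The small frame below is made up for illustration and only stands in for the ramen data:

```python
import pandas as pd

# Toy frame standing in for the ramen data (all values are invented).
sample = pd.DataFrame({
    "Brand": ["A", "B", "C", "D"],
    "Stars": ["3.75", "1", "2.25", "4"],
    "Top Ten": [None, None, "2016 #1", None],
})

# Number of missing entries in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Fraction of missing entries in each column (between 0 and 1).
missing_share = sample.isna().mean()
print(missing_share)
```

On the real dataset the same two calls would be made on data instead of sample.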

Types

It's always important to look at the data types in a given dataset. Here we see that every column except the review id has the object type. For plotting later on, the ratings should be numeric. We can convert them with the pandas to_numeric function; errors='coerce' turns every value that cannot be parsed as a number into NaN.

In [3]:
data.dtypes
Out[3]:
Review #     int64
Brand       object
Variety     object
Style       object
Country     object
Stars       object
Top Ten     object
dtype: object
In [4]:
data['Stars'] = pd.to_numeric(data['Stars'], errors='coerce')
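
To make the effect of errors='coerce' concrete, here is a minimal sketch on a made-up Series. The entry "Unrated" is a hypothetical stand-in for any rating that is not a number:

```python
import pandas as pd

# Invented ratings stored as strings; "Unrated" cannot be parsed as a number.
stars = pd.Series(["3.75", "1", "Unrated", "4"])

# Unparseable entries become NaN instead of raising an error.
converted = pd.to_numeric(stars, errors="coerce")

print(converted.dtype)         # float64
print(converted.isna().sum())  # 1
```

Counting the NaN values before and after the conversion is a cheap way to see how many entries were coerced.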

The tail

We already looked at the head of the dataset for a peek. Why do it again with the tail? Often the tail reveals missing values or fragmented data.

In [5]:
data.tail(10)
Out[5]:
      Review #  Brand     Variety                                            Style  Country      Stars  Top Ten
2570  10        Smack     Vegetable Beef                                     Pack   USA          1.5    NaN
2571  9         Sutah     Cup Noodle                                         Cup    South Korea  2.0    NaN
2572  8         Tung-I    Chinese Beef Instant Rice Noodle                   Pack   Taiwan       3.0    NaN
2573  7         Ve Wong   Mushroom Pork                                      Pack   Vietnam      1.0    NaN
2574  6         Vifon     Nam Vang                                           Pack   Vietnam      2.5    NaN
2575  5         Vifon     Hu Tiu Nam Vang ["Phnom Penh" style] Asian Sty...  Bowl   Vietnam      3.5    NaN
2576  4         Wai Wai   Oriental Style Instant Noodles                     Pack   Thailand     1.0    NaN
2577  3         Wai Wai   Tom Yum Shrimp                                     Pack   Thailand     2.0    NaN
2578  2         Wai Wai   Tom Yum Chili Flavor                               Pack   Thailand     2.0    NaN
2579  1         Westbrae  Miso Ramen                                         Pack   USA          0.5    NaN

Countries

With this code we count the ratings per country and sort them by frequency. There are no big surprises here: Asian countries and the USA have the most ramen ratings. The first plot shows all countries, which looks very messy. The second plot restricts the view to the top 10, which is much easier to read.

In [6]:
data["Country"].value_counts().plot.bar()
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9d77c0c5f8>
In [7]:
data["Country"].value_counts().head(10).plot.bar()
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9d757e92e8>
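
The top 10 selection works because value_counts returns the counts sorted from most to least frequent, so head simply cuts off the front of that ranking. A small sketch with invented country values:

```python
import pandas as pd

# Invented values standing in for the Country column.
countries = pd.Series(["Japan", "USA", "Japan", "Taiwan", "Japan", "USA"])

# Counts per value, sorted in descending order by default.
counts = countries.value_counts()

# Keep only the two most frequent values.
top2 = counts.head(2)
print(top2)
```

Chaining .plot.bar() onto top2 would draw the same kind of bar chart as in the cells above.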

Let's look at the different styles of ramen

With this quick plot we see that most of the ramen is of the instant pack variety, with bowl and cup following. Box, can and bar are nearly nonexistent. This is interesting because an analysis of countries or ratings could look quite different for each style. We will explore this with a subset of the data now and with bivariate plotting in the next post.

In [8]:
data["Style"].value_counts().plot.bar()
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9d758546a0>
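
If the share of each style matters more than the raw count, value_counts can return relative frequencies with normalize=True. This is only a sketch on invented values, not the numbers from the dataset:

```python
import pandas as pd

# Invented values standing in for the Style column.
styles = pd.Series(["Pack", "Pack", "Cup", "Bowl"])

# Relative frequency of each style; the shares add up to 1.
shares = styles.value_counts(normalize=True)
print(shares)
```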

Cleaning

How do we get only the Pack entries of the dataset? With pandas it takes a single line of code. After that, we check whether it was successful: selecting the rows of pack whose style is not Pack should return an empty table. If we plot the countries again, we see a different ordering, with South Korea and Taiwan not at the top.

In [9]:
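# Keep only the rows whose Style is "Pack"
pack = data[data.Style == "Pack"]
# Sanity check: no other style should remain, so this should be empty
pack[pack.Style != "Pack"].head()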
Out[9]:
Review # Brand Variety Style Country Stars Top Ten
In [10]:
pack["Country"].value_counts().head(10).plot.bar()
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9d756bb4e0>
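
The filter above is a boolean mask: the comparison produces one True or False per row, and indexing with it keeps only the True rows. Here is a minimal sketch on an invented frame, with the check written as an assertion instead of an empty table:

```python
import pandas as pd

# Invented rows standing in for the ramen data.
df = pd.DataFrame({
    "Style": ["Pack", "Cup", "Pack", "Bowl"],
    "Country": ["Japan", "USA", "Taiwan", "Japan"],
})

# One boolean per row: True where the style is "Pack".
is_pack = df["Style"] == "Pack"

# Keep the matching rows; .copy() gives an independent frame to work on.
pack_only = df[is_pack].copy()

# Every remaining row must have the style "Pack".
assert (pack_only["Style"] == "Pack").all()
print(len(pack_only))  # 2
```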

Brand

We can do the same with the brand and see that Nissin appears most often. For a good data analysis we would now need to acquire some domain knowledge.

In [11]:
data["Brand"].value_counts().head(10).plot.bar()
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9d756c7e80>

Conclusion

In this post:

  • We explored some simple data formatting and univariate plotting with pandas
  • This is just to get a first idea of the dataset
  • These charts are just a preview for the data analyst, not the ones you would show to others
  • We learned some things about ramen :)

In the next post

  • We will do the same with more complicated transformations and bivariate plotting
  • This will conclude the first overview of the data, then we will make beautiful plots with seaborn.

I hope you had fun reading and learned something.