Data visualization with pandas: Part 1
Sat 09 June 2018
Data Visualization with pandas on the Ramen Ratings Dataset
To improve my data visualization skills, I decided to try out visualization with a lot of different frameworks/libraries.
- This post will be about univariate Visualization with pandas
- After that we will do bivariate visualization with pandas
- Then we will do the same with seaborn
- Also we will explore multivariate plotting in seaborn
- The next library to explore will be plotly
- Finally we will do an example data science project with the dataset: formulating a question, exploring the data and answering it.
The Dataset can be found on Kaggle at Ramen Dataset
Why only pandas
While seaborn and plotly offer a lot of functionality for polished plots and visualizations, pandas lets us visualize data nicely and quickly. As we will see, the structure of the input in pandas differs from that in seaborn or plotly. This makes quick visualization easy if you just want to do an exploratory data analysis for yourself.
We just import pandas. Yes, only one import statement.
import pandas as pd
With pd.read_csv we can quickly read the CSV file into a pandas dataframe named data. For an explanation about dataframes visit (link)
With data.head(10) we display the first 10 entries. We do this to get a first glimpse of the data. Sometimes we spot missing values or unusual formats this way. For example, here we see the column Top Ten, in which most of the values are missing.
data = pd.read_csv("ramen-ratings.csv")
data.head(10)
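Looking at the head only hints at missing values. A minimal sketch of how they could be counted per column is shown here, assuming a small made-up dataframe (the rows in `sample` are invented for illustration and are not taken from the ramen dataset):

```python
import pandas as pd

# Hypothetical rows that only mimic the shape of the ramen data; not real ratings.
sample = pd.DataFrame({
    "Brand": ["A", "B", "C", "D"],
    "Stars": ["3.75", "5", "Unrated", "4.25"],
    "Top Ten": [None, "2016 #1", None, None],
})

# Number of missing values in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Share of missing values in each column, between 0 and 1.
missing_share = sample.isna().mean()
print(missing_share)
```

On the real dataframe the same two calls would be data.isna().sum() and data.isna().mean().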
Types
It's always important to look at the types of data in a given dataset. Here we see that all columns besides the id are of the object type. For further plotting, it would be beneficial if the ratings were numerical. We can do this with the handy pandas to_numeric function. With errors='coerce', any value that cannot be parsed as a number becomes NaN instead of raising an error.
data.dtypes
data['Stars'] = pd.to_numeric(data['Stars'], errors='coerce')
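To see what errors='coerce' does, here is a small sketch on a made-up series. The values, including the "Unrated" entry, are assumptions chosen for illustration and only stand in for whatever non-numeric text a ratings column might contain:

```python
import pandas as pd

# Made-up values for illustration; "Unrated" stands in for any non-numeric entry.
stars = pd.Series(["3.75", "5", "Unrated", "4.25"])

# Entries that cannot be parsed as numbers are turned into NaN.
numeric = pd.to_numeric(stars, errors="coerce")

print(numeric)
# Count how many entries were coerced to NaN.
print(numeric.isna().sum())
```

Comparing the NaN count before and after the conversion is a quick way to check how many values were lost to coercion.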
The tail
We already looked at the head of the dataset for a peek. Why should we do it again with the tail? Often we can find missing values or fragmented data at the end of a dataset.
data.tail(10)
Countries
With this code we count the number of ratings per country and sort them; value_counts returns the counts in descending order. There are no big surprises here: the Asian countries and the USA have the most ramen ratings. In the first plot we plot all countries, but that looks very messy. In the second plot we look only at the top 10, which is much easier to read.
data["Country"].value_counts().plot.bar()
data["Country"].value_counts().head(10).plot.bar()
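The reason head(10) yields the top 10 is that value_counts sorts by frequency in descending order. A minimal sketch on a made-up series (the country list is invented for illustration, and head(2) is used instead of head(10) because the sample is tiny):

```python
import pandas as pd

# Made-up country labels for illustration; not the real distribution.
countries = pd.Series(
    ["Japan", "USA", "Japan", "Taiwan", "USA", "Japan", "India"],
    name="Country",
)

# Counts per country, most frequent first.
counts = countries.value_counts()
print(counts)

# Because the counts are already sorted, head(n) gives the n most frequent values.
top2 = counts.head(2)
print(top2)

# normalize=True returns shares instead of absolute counts.
share = countries.value_counts(normalize=True)
print(share)
```

Appending .plot.bar() to counts or top2 would draw the same kind of bar chart as in the post.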
Let's look at the different styles of ramen
With this quick plot we see that most of the ramen are of the instant pack variety, followed by bowl and cup. Box, can and bar are nearly nonexistent. This is interesting because an analysis of countries or ratings could differ between styles. We will explore this with a subset now and with bivariate plotting in the next post.
data["Style"].value_counts().plot.bar()
Cleaning
How do we get only the Pack entries of the dataset? With pandas it takes a single line of code. After that we check whether it was successful: filtering the result for any style other than Pack should return an empty dataframe. If we plot the countries again, we see a different ordering, with South Korea and Taiwan no longer at the top.
pack = data[data.Style == "Pack"]
pack[pack.Style != "Pack"].head()
pack["Country"].value_counts().head(10).plot.bar()
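The filter works through a boolean mask: the comparison produces one True/False value per row, and indexing with it keeps only the True rows. A short sketch on a made-up dataframe (the rows in `sample` are invented for illustration):

```python
import pandas as pd

# Made-up rows for illustration; not taken from the ramen dataset.
sample = pd.DataFrame({
    "Country": ["Japan", "USA", "Japan", "Taiwan"],
    "Style": ["Pack", "Cup", "Bowl", "Pack"],
})

# Boolean mask: True where the style is "Pack".
is_pack = sample["Style"] == "Pack"

# Keep only the rows where the mask is True.
pack = sample[is_pack]

# Sanity check: no other style should be left in the subset.
leftovers = pack[pack["Style"] != "Pack"]
print(leftovers.empty)
print(len(pack), "of", len(sample), "rows are packs")
```

Checking leftovers.empty gives the same answer as inspecting the empty output of head(), but as a single boolean.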
Brand
We can do the same with the brand and see that Nissin is the most frequent brand. For a good data analysis we would now need to acquire some domain knowledge.
data["Brand"].value_counts().head(10).plot.bar()
Conclusion
In this post:
- We explored some simple data formatting and univariate plotting with pandas
- This is just to get a first idea of the dataset
- These charts are just a preview for the data analyst, not the ones you would show to others
- We learned some things about ramen :)
In the next post
- We will do the same with more complicated transformations and bivariate plotting
- This will conclude the first overview of the data; then we will make beautiful plots with seaborn.