
Analyzing Iris Flower Data Set

The Iris flower data set consists of 50 specimens from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. We will find out whether any correlation exists between these features that will help us predict the species of a flower. The data set can be downloaded from here.

First, let us draw a scatter plot with Sepal Length on the X-axis and Sepal Width on the Y-axis, color coded by Species, to see whether the sepal measurements separate the species.

Python Code:

import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np

# Reading Data
flower_data = pd.read_excel(r'PATH')

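# Overwrite the default Matplotlib settings with Seaborn's so the plots look more elegant
sns.set()

# x and y are passed as keyword arguments; recent Seaborn versions no longer accept them positionally
sns.scatterplot(x=flower_data['Sepal length'], y=flower_data['Sepal width'], hue=flower_data['Species'], s=90)

's=90' in the code above sets the size of the dots in the scatter plot.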

Sepal Scatter Plot

We can observe that even though Setosa is clearly separated from the others, Versicolor and Virginica overlap quite often. We can deduce that predicting the species from sepal lengths and widths alone may not always be accurate.

So, let us draw the scatter plot again, but this time with petal lengths and widths.

sns.scatterplot(x=flower_data['Petal length'], y=flower_data['Petal width'], hue=flower_data['Species'], s=90)

Petal Scatter Plot

Here, the species are much more clearly separated. Given the petal features of a flower, we can try to predict which species it belongs to. Iris Setosa has a very small petal area and Virginica has the largest petal area of the three.

If we want to analyze plots between all the features present, we can always use a pair plot. Creating a pair plot is really easy. All it takes is a single line of code to get the output below.
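The petal area comparison can also be checked numerically. The snippet below is a minimal sketch, not part of the original analysis: it treats petal length times petal width as a rough stand-in for area, and it uses a handful of illustrative rows in place of flower_data. The helper name mean_petal_area and the species labels are assumptions; the labels in the actual file may be spelled differently.

```python
import pandas as pd

# A few illustrative rows only; in the article this would be flower_data.
# The species labels are assumed and may differ from those in the real file.
sample = pd.DataFrame({
    'Petal length': [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    'Petal width': [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    'Species': ['setosa', 'setosa', 'versicolor',
                'versicolor', 'virginica', 'virginica'],
})


def mean_petal_area(df):
    """Return the mean of (petal length x petal width) per species, smallest first."""
    # Length x width is only a rectangular approximation of the real petal area.
    area = df['Petal length'] * df['Petal width']
    return area.groupby(df['Species']).mean().sort_values()


area_by_species = mean_petal_area(sample)
print(area_by_species)
```

Calling mean_petal_area(flower_data) on the full data set should give the same kind of per-species summary, provided the column names match the ones used in the plots above.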

Iris Flower Data set pair plots

And the following line of code:

sns.pairplot(flower_data, hue='Species')

Notice that we used hue='Species' in the pair plot, as opposed to hue=flower_data['Species'] in the scatter plots above. This is because the pair plot receives the whole flower_data data frame, so Seaborn can look up 'Species' as a column name in it. In the scatter plots, we passed only individual columns, so Seaborn had no data frame in which to look up a column name, and we had to pass the column itself with hue=flower_data['Species'].
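The same column-name lookup is available to the scatter plot if the data frame is handed over through the data parameter. The snippet below is a small sketch of that variant, using a few illustrative rows in place of flower_data; the Agg backend line is only there so it runs without a display and is not needed in a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display

import pandas as pd
import seaborn as sns

# A few illustrative rows only; in the article this would be flower_data.
sample = pd.DataFrame({
    'Petal length': [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    'Petal width': [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    'Species': ['setosa', 'setosa', 'versicolor',
                'versicolor', 'virginica', 'virginica'],
})

# With data= supplied, x, y and hue can all be given as column names,
# just like hue='Species' in the pair plot.
ax = sns.scatterplot(data=sample, x='Petal length', y='Petal width',
                     hue='Species', s=90)
```

Both forms should draw the same points; passing column names also means the axis labels come straight from the data frame.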
