Exploratory Data Analysis (EDA) is a critical step in the data analysis process. It involves visually and statistically summarizing the main characteristics of a dataset, often with the help of graphical representations. The goal of EDA is to understand the underlying structure of the data, detect any anomalies or outliers, test assumptions, and check the quality of the data before applying more sophisticated data analysis or machine learning techniques.
In this blog, we’ll discuss the importance of EDA, the typical steps involved, and how to perform EDA in Python using popular libraries like Pandas, Matplotlib, Seaborn, and NumPy.
EDA serves several purposes in the data analysis process: it reveals the underlying structure of the data, surfaces anomalies and outliers, lets you test assumptions, and confirms that the data is of sufficient quality before you move on to modeling.
Before performing any analysis, you need to load the dataset. Common sources include CSV files, Excel spreadsheets, and SQL databases. In Python, the Pandas library is typically used to load datasets into DataFrames.
import pandas as pd
# Load dataset
df = pd.read_csv('your_dataset.csv')
# Display the first few rows
print(df.head())
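The same step covers Excel files and SQL databases; here is a minimal sketch of those loaders, where the file, database, and table names are placeholders (reading .xlsx files requires the openpyxl package).
import sqlite3
# Load an Excel spreadsheet (requires openpyxl for .xlsx files)
df_excel = pd.read_excel('your_dataset.xlsx')
# Load a table from a SQL database (illustrated here with SQLite)
conn = sqlite3.connect('your_database.db')
df_sql = pd.read_sql('SELECT * FROM your_table', conn)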
Data cleaning involves handling missing values, duplicates, and converting data types. In some cases, you may need to rename columns, remove irrelevant columns, or reformat data.
# Check for missing values
print(df.isnull().sum())
# Fill missing values in a column with a statistic such as the mean or median
df['column_name'] = df['column_name'].fillna(df['column_name'].mean())
# Alternatively, drop rows with missing values
df.dropna(inplace=True)
# Check for duplicate rows
print(df.duplicated().sum())
# Remove duplicates
df.drop_duplicates(inplace=True)
# Convert a column to a specific data type
df['column_name'] = df['column_name'].astype('int')
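The cleaning step above also mentions renaming columns and removing irrelevant ones; a minimal sketch, with placeholder column names:
# Rename a column
df = df.rename(columns={'old_name': 'new_name'})
# Drop a column that is not relevant to the analysis
df = df.drop(columns=['irrelevant_column'])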
To understand the dataset better, we should inspect its shape, types of columns, and summary statistics.
# Get the number of rows and columns
print(df.shape)
# Check the data types of each column
print(df.dtypes)
# Get summary statistics for numeric columns
print(df.describe())
Univariate analysis involves examining the distribution and basic statistics of each individual variable.
You can use statistical measures like mean, median, mode, variance, and standard deviation to summarize a numerical column. Visualization tools like histograms, boxplots, and density plots can also help assess the distribution.
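Before plotting, it can help to print those summary statistics directly; a minimal sketch using a placeholder column name:
# Summary statistics for a single numeric column
print(df['numeric_column'].mean())     # mean
print(df['numeric_column'].median())   # median
print(df['numeric_column'].mode()[0])  # mode (first value if there are ties)
print(df['numeric_column'].var())      # variance
print(df['numeric_column'].std())      # standard deviation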
import matplotlib.pyplot as plt
import seaborn as sns
# Histogram
df['numeric_column'].hist(bins=20, color='skyblue', edgecolor='black')
plt.title('Histogram of Numeric Column')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
# Boxplot
sns.boxplot(data=df, x='numeric_column')
plt.title('Boxplot of Numeric Column')
plt.show()
# Density plot
sns.kdeplot(df['numeric_column'], fill=True, color='green')
plt.title('Density Plot of Numeric Column')
plt.show()
You can use bar plots to visualize the frequency of each category in a categorical variable.
# Barplot for categorical column
sns.countplot(data=df, x='categorical_column', hue='categorical_column', palette='Set2', legend=False)
plt.title('Barplot of Categorical Column')
plt.show()
Bivariate analysis examines the relationship between two variables. This is particularly useful for understanding correlations and identifying potential predictors for modeling.
You can use scatter plots, correlation matrices, or pair plots to visualize the relationship between two numeric variables.
# Scatter plot
plt.scatter(df['numeric_column_1'], df['numeric_column_2'])
plt.title('Scatter plot of Numeric Column 1 vs Numeric Column 2')
plt.xlabel('Numeric Column 1')
plt.ylabel('Numeric Column 2')
plt.show()
# Correlation matrix
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
Boxplots or violin plots are great tools to explore the relationship between numeric and categorical variables.
# Boxplot for Numeric vs Categorical
sns.boxplot(data=df, x='categorical_column', y='numeric_column')
plt.title('Boxplot of Numeric Column by Categorical Column')
plt.show()
# Violin plot for Numeric vs Categorical
sns.violinplot(data=df, x='categorical_column', y='numeric_column')
plt.title('Violin Plot of Numeric Column by Categorical Column')
plt.show()
Multivariate analysis explores relationships between three or more variables. In EDA, this often involves creating pair plots or using dimensionality reduction techniques.
A pair plot shows the pairwise relationships between multiple variables and is useful when dealing with multiple numeric columns.
g = sns.pairplot(df)
g.figure.suptitle('Pair Plot of the DataFrame', y=1.02)
plt.show()
For datasets with a large number of features, Principal Component Analysis (PCA) can reduce dimensionality while retaining as much of the variance as possible. PCA helps in visualizing high-dimensional data in 2D or 3D space.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Standardize the features first so that no single column dominates the variance
features = df[['numeric_column_1', 'numeric_column_2', 'numeric_column_3']]
scaled_features = StandardScaler().fit_transform(features)
# Project the data onto the first two principal components
pca = PCA(n_components=2)
pca_result = pca.fit_transform(scaled_features)
# Create a DataFrame with the PCA results
pca_df = pd.DataFrame(data=pca_result, columns=['PCA1', 'PCA2'])
# Plot the PCA result
plt.scatter(pca_df['PCA1'], pca_df['PCA2'])
plt.title('PCA Plot')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
Outliers are values that are significantly different from most other data points. They can distort statistical analyses and model predictions. You can detect outliers using visualizations like boxplots or by calculating the Z-score or IQR (Interquartile Range).
Boxplots are useful for detecting outliers. Any data points outside of the "whiskers" in the boxplot are considered outliers.
sns.boxplot(data=df, x='numeric_column')
plt.title('Boxplot for Outlier Detection')
plt.show()
The Z-score represents how many standard deviations away a data point is from the mean. A Z-score greater than 3 or less than -3 is often considered an outlier.
from scipy.stats import zscore
df['z_score'] = zscore(df['numeric_column'])
outliers = df[df['z_score'].abs() > 3]
print(outliers)
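The IQR method mentioned above flags values that fall more than 1.5 times the interquartile range below the first quartile or above the third quartile; a minimal sketch:
# IQR method for outlier detection
q1 = df['numeric_column'].quantile(0.25)
q3 = df['numeric_column'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
iqr_outliers = df[(df['numeric_column'] < lower_bound) | (df['numeric_column'] > upper_bound)]
print(iqr_outliers)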
Feature engineering involves creating new features or transforming existing ones to improve model performance. Based on insights from EDA, you may decide to combine features, create new ones, or perform transformations (e.g., logarithmic transformation) to improve the distribution of the data.
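As a concrete illustration, here is a minimal sketch of two common transformations; the column names are placeholders, and np.log1p is used so that zero values are handled safely:
import numpy as np
# Log-transform a right-skewed column to make its distribution more symmetric
df['numeric_column_log'] = np.log1p(df['numeric_column'])
# Combine two existing columns into a simple ratio feature
df['ratio_feature'] = df['numeric_column_1'] / df['numeric_column_2']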
Once you’ve explored the data, you may want to test assumptions about the data using statistical tests, such as t-tests, chi-square tests, or ANOVA.
Example:
from scipy.stats import ttest_ind
# T-test to compare means of two groups
group1 = df[df['categorical_column'] == 'Group1']['numeric_column']
group2 = df[df['categorical_column'] == 'Group2']['numeric_column']
t_stat, p_value = ttest_ind(group1, group2)
print(f"T-statistic: {t_stat}, P-value: {p_value}")