Linear regression is one of the most fundamental statistical methods used in data analysis and predictive modeling. It allows us to explore relationships between variables and predict outcomes based on input data. In this blog post, we will dive into some practical case studies where linear regression has been used to solve real-world problems. These case studies will showcase how linear regression can be applied across various industries such as finance, healthcare, marketing, and more.
Before diving into case studies, let’s quickly review what linear regression is. Linear regression models the relationship between a dependent variable (target) and one or more independent variables (predictors) using a straight line. The model tries to fit the best line that minimizes the sum of squared errors between the observed values and predicted values.
In simple linear regression, the relationship between the dependent variable and the independent variable is expressed as:
Where:
In the real estate industry, linear regression is frequently used to predict the price of a house based on various features like size, location, number of bedrooms, etc. This is an excellent example of how linear regression helps both buyers and sellers make informed decisions.
A real estate agency wants to predict the price of a house based on its size (square footage) and the number of bedrooms. The agency has data on previous home sales.
Using linear regression, we can model the relationship between the price of the house (dependent variable) and the size of the house and number of bedrooms (independent variables). The resulting model will help predict the price of a house for potential buyers and sellers.
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
# Sample data
data = {
'Size (sqft)': [1500, 1800, 2400, 3000, 3500],
'Bedrooms': [3, 3, 4, 4, 5],
'Price ($)': [400000, 450000, 500000, 600000, 650000]
}
# Create a DataFrame
df = pd.DataFrame(data)
# Define independent variables (Size and Bedrooms)
X = df[['Size (sqft)', 'Bedrooms']]
# Define dependent variable (Price)
y = df['Price ($)']
# Create the linear regression model
model = LinearRegression()
model.fit(X, y)
# Predicting house price based on new data
predicted_price = model.predict([[2000, 3]]) # Predict price for 2000 sqft, 3 bedrooms
print(f"Predicted House Price: ${predicted_price[0]:,.2f}")
In the field of human resources, linear regression can be used to predict employee salaries based on experience, education level, and other factors. This is a practical example for businesses that want to make data-driven decisions about compensation.
A company wants to predict an employee’s salary based on their years of experience and educational qualification. The company has data from previous employees.
By using linear regression, the company can create a model that predicts salaries based on these predictors. This model helps businesses ensure competitive compensation while maintaining internal equity.
# Sample data
data = {
'Experience (years)': [1, 2, 3, 4, 5],
'Education Level': [1, 2, 2, 3, 3], # 1=High School, 2=Bachelor's, 3=Master's
'Salary ($)': [40000, 45000, 50000, 60000, 70000]
}
# Create a DataFrame
df = pd.DataFrame(data)
# Define independent variables (Experience and Education)
X = df[['Experience (years)', 'Education Level']]
# Define dependent variable (Salary)
y = df['Salary ($)']
# Create the linear regression model
model = LinearRegression()
model.fit(X, y)
# Predicting salary based on new data
predicted_salary = model.predict([[6, 3]]) # Predict salary for 6 years of experience, Master's degree
print(f"Predicted Salary: ${predicted_salary[0]:,.2f}")
Linear regression can help companies assess the effectiveness of their marketing campaigns by evaluating how different factors (e.g., budget, social media engagement) influence sales.
A company wants to know how its advertising budget and social media engagement impact sales. They want to know if increasing the budget for online ads will result in increased sales.
By using linear regression, the company can model the relationship between advertising budget, social media engagement, and sales. This helps the company understand the return on investment (ROI) from their marketing activities.
# Sample data
data = {
'Ad Budget ($)': [1000, 2000, 3000, 4000, 5000],
'Social Media Engagement': [150, 200, 250, 300, 350],
'Sales ($)': [12000, 15000, 18000, 22000, 25000]
}
# Create a DataFrame
df = pd.DataFrame(data)
# Define independent variables (Ad Budget and Social Media Engagement)
X = df[['Ad Budget ($)', 'Social Media Engagement']]
# Define dependent variable (Sales)
y = df['Sales ($)']
# Create the linear regression model
model = LinearRegression()
model.fit(X, y)
# Predicting sales based on new data
predicted_sales = model.predict([[3500, 320]]) # Predict sales for $3500 ad budget and 320 social media engagements
print(f"Predicted Sales: ${predicted_sales[0]:,.2f}")
In healthcare, linear regression is often used to predict patient outcomes based on various clinical factors. One example is predicting recovery time for patients after surgery based on factors like age, pre-existing conditions, and type of surgery.
A hospital wants to predict how long it will take for a patient to recover after surgery. The hospital has data on patients' age, the type of surgery they had, and other medical factors.
Using linear regression, the hospital can build a predictive model that estimates recovery time, helping to plan patient care and hospital resources.
# Sample data
data = {
'Age (years)': [25, 40, 60, 30, 50],
'Surgery Type': [1, 2, 2, 1, 3], # 1=Minor, 2=Moderate, 3=Major
'Recovery Time (days)': [7, 14, 30, 10, 25]
}
# Create a DataFrame
df = pd.DataFrame(data)
# Define independent variables (Age and Surgery Type)
X = df[['Age (years)', 'Surgery Type']]
# Define dependent variable (Recovery Time)
y = df['Recovery Time (days)']
# Create the linear regression model
model = LinearRegression()
model.fit(X, y)
# Predicting recovery time based on new data
predicted_recovery = model.predict([[35, 2]]) # Predict recovery time for 35 years old, Moderate surgery
print(f"Predicted Recovery Time: {predicted_recovery[0]:,.2f} days")