Linear Regression Basics
Linear regression is one of the most fundamental and widely used statistical techniques in data analysis and machine learning. It helps model the relationship between a dependent variable (target) and one or more independent variables (predictors). In this blog, we’ll dive into the basics of linear regression, explain its concepts, and provide an overview of how it works, along with an implementation example.
Linear regression is a method used to model the relationship between two or more variables by fitting a linear equation to observed data. The goal is to find the line (or hyperplane) that best represents the relationship between the independent variables (predictors) and the dependent variable (target).
The primary goal is to predict the dependent variable (target) based on the values of the independent variables.
Simple Linear Regression: This involves a single independent variable and is used to predict a continuous dependent variable. The relationship is modeled as a straight line.
y = β₀ + β₁x
Here, y is the dependent variable, x is the independent variable, β₀ is the intercept, and β₁ is the slope of the line.
Multiple Linear Regression: This involves two or more independent variables to predict a dependent variable. It is useful when the target depends on several predictors at once. The model is still linear in its coefficients, but it fits a plane (or hyperplane) rather than a single straight line.
y = β₀ + β₁x₁ + β₂x₂ + ... + βₖxₖ
Here, y is the dependent variable, x₁, x₂, ..., xₖ are the independent variables, and β₀, β₁, β₂, ..., βₖ are the coefficients (the intercept β₀ and the slopes β₁, ..., βₖ).
In simple linear regression, the relationship between the dependent variable y and the independent variable x is modeled using the following formula:
y = β₀ + β₁x + ε
Where:
y is the dependent variable.
x is the independent variable.
β₀ is the intercept of the line.
β₁ is the slope of the line.
ε is the error term, which represents the difference between the observed values and the values predicted by the line.
For multiple linear regression, the formula extends to multiple independent variables:
y = β₀ + β₁x₁ + β₂x₂ + ... + βₖxₖ + ε
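To make the simple and multiple regression formulas concrete, here is a minimal NumPy sketch that generates data from each one. The coefficient values, noise level, and variable names are made up for illustration and do not come from a real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Simple linear regression: y = β₀ + β₁x + ε (illustrative coefficients)
beta_0, beta_1 = 45.0, 5.0
x = np.arange(1, 11, dtype=float)
epsilon = rng.normal(loc=0.0, scale=2.0, size=x.shape)  # random error term
y_simple = beta_0 + beta_1 * x + epsilon

# Multiple linear regression: y = β₀ + β₁x₁ + ... + βₖxₖ + ε
# Each row of X_multi is one observation; each column is one predictor.
X_multi = rng.uniform(0.0, 10.0, size=(10, 3))
betas = np.array([2.0, -1.0, 0.5])  # hypothetical slopes β₁..β₃
y_multi = beta_0 + X_multi @ betas + rng.normal(0.0, 2.0, size=10)

print(y_simple.shape, y_multi.shape)
```

The matrix product X_multi @ betas is just the sum β₁x₁ + β₂x₂ + β₃x₃ evaluated for every row at once.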
Linear regression makes several assumptions about the data, which, if violated, can affect the accuracy and reliability of the model:
The key objective in linear regression is to find the best-fitting line (or hyperplane in higher dimensions) that minimizes the error between the predicted values and the actual values. The error is typically measured by the sum of squared errors (SSE), which adds up the squared differences between the actual and predicted values over all observations:
SSE = Σᵢ (yᵢ - ŷᵢ)²
Here, yᵢ is the actual value and ŷᵢ is the predicted value for observation i.
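The sum of squared errors can be computed in a couple of lines. This is a small sketch with made-up numbers; the helper name sum_of_squared_errors is hypothetical and not part of any library.

```python
import numpy as np

def sum_of_squared_errors(y_actual, y_predicted):
    """Return the sum of squared differences between actual and predicted values."""
    residuals = np.asarray(y_actual, dtype=float) - np.asarray(y_predicted, dtype=float)
    return float(np.sum(residuals ** 2))

# Made-up values for illustration
y_actual = [50, 55, 60, 65]
y_predicted = [52, 54, 61, 63]

sse = sum_of_squared_errors(y_actual, y_predicted)
print(sse)  # (-2)² + 1² + (-1)² + 2² = 10.0
```

Squaring keeps positive and negative errors from cancelling out and penalizes large errors more heavily than small ones.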
The algorithm uses a method called Ordinary Least Squares (OLS) to minimize this error and estimate the coefficients (β₀, β₁, ...) of the linear model.
OLS works by finding the line that minimizes the sum of the squared residuals (the difference between observed and predicted values).
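For a single predictor, the least-squares solution has a well-known closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The sketch below applies it to the hours-studied numbers used later in this post and cross-checks the result with np.linalg.lstsq. The function name fit_simple_ols is hypothetical.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Estimate (intercept, slope) for y = β₀ + β₁x by ordinary least squares."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: sum of cross-deviations divided by sum of squared x-deviations
    beta_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: forces the line through the point (x_mean, y_mean)
    beta_0 = y_mean - beta_1 * x_mean
    return beta_0, beta_1

hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
scores = [50, 55, 60, 65, 70, 75, 80, 85, 90, 95]
beta_0, beta_1 = fit_simple_ols(hours, scores)
print(f"intercept = {beta_0:.2f}, slope = {beta_1:.2f}")

# Cross-check: solve the same least-squares problem with a design matrix
# whose first column of ones corresponds to the intercept.
A = np.column_stack([np.ones(len(hours)), hours])
coeffs, *_ = np.linalg.lstsq(A, np.asarray(scores, dtype=float), rcond=None)
print(coeffs)
```

Both approaches should agree up to floating-point rounding. The design-matrix form also works unchanged when there is more than one predictor column.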
Let’s walk through the process of implementing a simple linear regression model using Python and the scikit-learn library.
First, make sure to install the necessary libraries.
pip install numpy pandas matplotlib scikit-learn
For this example, let’s use a simple dataset with a relationship between hours studied and exam scores.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# Create a sample dataset
data = {
'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Exam_Score': [50, 55, 60, 65, 70, 75, 80, 85, 90, 95]
}
# Convert to a pandas DataFrame
df = pd.DataFrame(data)
# Split the data into independent (X) and dependent (y) variables
X = df[['Hours_Studied']] # Independent variable
y = df['Exam_Score'] # Dependent variable
# Hold out 20% of the rows as a test set (random_state fixes the shuffle)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the model
model = LinearRegression()
# Train the model on the training data
model.fit(X_train, y_train)
# Make predictions on the test data
y_pred = model.predict(X_test)
# Display the predictions
print("Predicted Exam Scores:", y_pred)
# Evaluate the model using the coefficient of determination (R^2)
print("R^2 score:", model.score(X_test, y_test))
# Visualize the regression line
plt.scatter(X, y, color='blue') # Scatter plot of the data points
plt.plot(X, model.predict(X), color='red') # Line representing the linear model
plt.xlabel('Hours Studied')
plt.ylabel('Exam Score')
plt.title('Linear Regression: Hours Studied vs. Exam Score')
plt.show()
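Once a model is fitted, its estimated intercept and slope can be read from the intercept_ and coef_ attributes of LinearRegression. The sketch below rebuilds the same dataset so that it runs on its own. As a simplifying assumption, it fits on all ten rows rather than on the training split.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same sample data as in the walkthrough
df = pd.DataFrame({
    'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Exam_Score': [50, 55, 60, 65, 70, 75, 80, 85, 90, 95],
})

# Assumption for brevity: fit on every row instead of a train/test split
model = LinearRegression().fit(df[['Hours_Studied']], df['Exam_Score'])

intercept = model.intercept_  # estimate of β₀
slope = model.coef_[0]        # estimate of β₁ (one entry per predictor)
print(f"Exam_Score ≈ {intercept:.2f} + {slope:.2f} * Hours_Studied")

# Predict the score for a new, unseen number of study hours
new_hours = pd.DataFrame({'Hours_Studied': [11]})
predicted = model.predict(new_hours)
print(predicted)
```

Because this toy dataset lies exactly on a straight line, the fitted coefficients reproduce it almost perfectly. Real data will rarely be this clean.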
To evaluate the linear regression model, you can use several metrics, including:
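As one possible illustration, the sketch below computes mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), and R² using sklearn.metrics. These four are common choices for regression and are picked here as an assumption. The actual and predicted values are made up.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted values for illustration
y_true = np.array([60.0, 70.0, 80.0, 90.0])
y_pred = np.array([62.0, 69.0, 78.0, 93.0])

mae = mean_absolute_error(y_true, y_pred)  # average absolute error
mse = mean_squared_error(y_true, y_pred)   # average squared error
rmse = np.sqrt(mse)                        # same units as the target
r2 = r2_score(y_true, y_pred)              # share of variance explained

print(f"MAE:  {mae:.3f}")
print(f"MSE:  {mse:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"R^2:  {r2:.3f}")
```

Lower MAE, MSE, and RMSE indicate a closer fit, while an R² near 1 means the model explains most of the variation in the target.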