Linear Regression Basics


Linear regression is one of the most fundamental and widely used statistical techniques in data analysis and machine learning. It models the relationship between a dependent variable (target) and one or more independent variables (predictors). In this post, we’ll cover the core concepts of linear regression, explain how it works, and walk through an implementation example.

Table of Contents

  1. What is Linear Regression?
  2. Types of Linear Regression
  3. Key Concepts in Linear Regression
  4. The Mathematical Formula
  5. Assumptions of Linear Regression
  6. How Linear Regression Works
  7. Implementing Linear Regression in Python
  8. Evaluating the Model

1. What is Linear Regression?

Linear regression is a method used to model the relationship between two or more variables by fitting a linear equation to observed data. The goal is to find the line (or hyperplane) that best represents the relationship between the independent variables (predictors) and the dependent variable (target).

  • Simple Linear Regression: Involves one independent variable.
  • Multiple Linear Regression: Involves two or more independent variables.

The primary goal is to predict the dependent variable (target) based on the values of the independent variables.


2. Types of Linear Regression

  • Simple Linear Regression: This involves a single independent variable and is used to predict a continuous dependent variable. The relationship is modeled as a straight line.

    • Formula: y = β₀ + β₁x
    • y is the dependent variable, x is the independent variable, β₀ is the intercept, and β₁ is the slope of the line.
  • Multiple Linear Regression: This involves two or more independent variables to predict a dependent variable. It is useful when the target depends on several predictors at once. The model is still linear in its coefficients; with two predictors it describes a plane, and with more it describes a hyperplane, rather than a straight line.

    • Formula: y = β₀ + β₁x₁ + β₂x₂ + ... + βₖxₖ
    • y is the dependent variable, x₁, x₂, ..., xₖ are the independent variables, β₀ is the intercept, and β₁, β₂, ..., βₖ are the slope coefficients, one per independent variable.
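
To make the two formulas concrete, here is a minimal sketch that evaluates each one in plain Python. The function names `predict_simple` and `predict_multiple` and all coefficient values are made up for illustration; they are not part of any library.

```python
def predict_simple(x, b0, b1):
    """Simple linear regression: y = b0 + b1 * x."""
    return b0 + b1 * x


def predict_multiple(xs, b0, bs):
    """Multiple linear regression: y = b0 + b1*x1 + ... + bk*xk."""
    if len(xs) != len(bs):
        raise ValueError("need one coefficient per independent variable")
    return b0 + sum(b * x for b, x in zip(bs, xs))


# Hypothetical coefficients, chosen only for illustration.
y_simple = predict_simple(x=4, b0=45.0, b1=5.0)                   # 45 + 5*4
y_multiple = predict_multiple(xs=[4, 7], b0=20.0, bs=[5.0, 2.0])  # 20 + 5*4 + 2*7

print(y_simple)
print(y_multiple)
```

Both functions are linear in the coefficients; the multiple version simply adds one term per extra predictor.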

3. Key Concepts in Linear Regression

  • Dependent Variable (y): The variable you're trying to predict or explain.
  • Independent Variable (X): The variable(s) used to make predictions.
  • Intercept (β₀): The value of the dependent variable when all independent variables are zero.
  • Slope (β₁, β₂, ..., βₖ): The amount by which the dependent variable changes when the corresponding independent variable increases by one unit, with the other independent variables held constant.

4. The Mathematical Formula

In simple linear regression, the relationship between the dependent variable y and the independent variable x is modeled using the following formula:

y = β₀ + β₁x + ε

Where:

  • y is the dependent variable.
  • x is the independent variable.
  • β₀ is the intercept of the line.
  • β₁ is the slope of the line.
  • ε is the error term, which captures the variation in y that the line does not explain. Its estimates from a fitted model are the residuals: the differences between the observed and predicted values.

For multiple linear regression, the formula extends to multiple independent variables:

y = β₀ + β₁x₁ + β₂x₂ + ... + βₖxₖ + ε
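
The role of the error term ε may be easier to see in a small simulation. The sketch below generates data from the simple model y = β₀ + β₁x + ε with NumPy. The coefficient values, noise level, and random seed are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

beta0, beta1 = 45.0, 5.0        # assumed "true" intercept and slope
x = np.linspace(1, 10, 50)      # 50 evenly spaced values of x

# Error term: random noise with mean 0 and standard deviation 2.
epsilon = rng.normal(loc=0.0, scale=2.0, size=x.shape)

y_line = beta0 + beta1 * x      # points lying exactly on the line
y = y_line + epsilon            # observed values scatter around the line

# The gap between each observed value and the line is the error term.
print(np.allclose(y - y_line, epsilon))
```

In practice the true coefficients and ε are unknown; regression estimates the line from the observed x and y alone.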


5. Assumptions of Linear Regression

Linear regression makes several assumptions about the data, which, if violated, can affect the accuracy and reliability of the model:

  1. Linearity: The relationship between the dependent and independent variables is linear.
  2. Independence: The residuals (errors) are independent of each other.
  3. Homoscedasticity: The variance of the residuals is constant across all values of the independent variables.
  4. Normality: The residuals are normally distributed. This matters mainly for confidence intervals and hypothesis tests on the coefficients.
  5. No multicollinearity (for multiple linear regression): The independent variables should not be highly correlated with each other.
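
Some of these assumptions can be probed with quick numeric checks. The sketch below uses synthetic data and NumPy only; the thresholds you would act on, and the second predictor `x2`, are assumptions for illustration. These are rough sanity checks, not formal statistical tests.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 100)
y = 3.0 + 2.0 * x + rng.normal(0.0, 1.0, size=x.size)  # synthetic data

# Fit a straight line; np.polyfit returns the highest-degree coefficient first.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# With an intercept in the model, least-squares residuals average to about zero.
print("mean residual:", residuals.mean())

# Rough homoscedasticity check: compare the residual spread in the lower
# and upper halves of x. Similar values are consistent with constant variance.
half = x.size // 2
print("std (low x): ", residuals[:half].std())
print("std (high x):", residuals[half:].std())

# Rough multicollinearity check for a hypothetical second predictor x2,
# deliberately built here to be almost a copy of x.
x2 = x + rng.normal(0.0, 0.1, size=x.size)
corr = np.corrcoef(x, x2)[0, 1]
print("corr(x, x2):", corr)  # a value near 1 signals multicollinearity
```

Plotting the residuals against the predicted values is a common complement to these numbers, since patterns such as curves or funnels are easier to spot visually.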

6. How Linear Regression Works

The key objective in linear regression is to find the best-fitting line (or hyperplane in higher dimensions) that minimizes the error between the predicted values and the actual values. The error is typically measured by the sum of squared errors (SSE): the squared difference between the actual and predicted values, added up over all observations:

SSE = Σ (y_actual − y_predicted)²

The algorithm uses a method called Ordinary Least Squares (OLS) to estimate the coefficients (β₀, β₁, ...) of the linear model. OLS chooses the coefficient values that minimize the sum of squared residuals, where a residual is the difference between an observed value and the corresponding predicted value.
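
For a single predictor, the OLS estimates have a well-known closed form: the slope is Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², and the intercept is ȳ − slope · x̄. The sketch below applies these formulas with NumPy to the hours-studied data used in Section 7. The variable names are our own choices, and this is an illustration of the idea rather than how any particular library computes the fit internally.

```python
import numpy as np

# Hours studied and exam scores (the sample data from Section 7).
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([50, 55, 60, 65, 70, 75, 80, 85, 90, 95], dtype=float)

x_mean, y_mean = x.mean(), y.mean()

# Closed-form OLS estimates for one predictor.
beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta0 = y_mean - beta1 * x_mean

# Sum of squared errors for the fitted line.
y_pred = beta0 + beta1 * x
sse = np.sum((y - y_pred) ** 2)

print("slope:", beta1)
print("intercept:", beta0)
print("SSE:", sse)
```

Because this sample data lies exactly on a line (each score equals 45 plus 5 times the hours), the SSE comes out essentially zero. Real data will leave a positive SSE even at the best fit.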


7. Implementing Linear Regression in Python

Let’s walk through the process of implementing a simple linear regression model using Python and the scikit-learn library.

Step 1: Install Required Libraries

First, make sure to install the necessary libraries.

pip install numpy pandas matplotlib scikit-learn

Step 2: Prepare the Data

For this example, let’s use a simple dataset with a relationship between hours studied and exam scores.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Create a sample dataset
data = {
    'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Exam_Score': [50, 55, 60, 65, 70, 75, 80, 85, 90, 95]
}

# Convert to a pandas DataFrame
df = pd.DataFrame(data)

# Split the data into independent (X) and dependent (y) variables
X = df[['Hours_Studied']]  # Independent variable
y = df['Exam_Score']  # Dependent variable

Step 3: Split the Data into Training and Test Sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 4: Create and Train the Linear Regression Model

# Initialize the model
model = LinearRegression()

# Train the model on the training data
model.fit(X_train, y_train)
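
Once the model is trained, it can help to look at the coefficients it learned. The sketch below repeats the setup from the earlier steps so it runs on its own, then reads scikit-learn's `intercept_` and `coef_` attributes from the fitted model.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Same sample data and split as in the earlier steps.
df = pd.DataFrame({
    'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Exam_Score': [50, 55, 60, 65, 70, 75, 80, 85, 90, 95],
})
X = df[['Hours_Studied']]
y = df['Exam_Score']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

# intercept_ is the estimate of β₀; coef_ holds one slope per feature column.
print("Intercept (β₀):", model.intercept_)
print("Slope (β₁):", model.coef_[0])
```

With this sample data the estimates should land very close to an intercept of 45 and a slope of 5, since every score equals 45 plus 5 times the hours studied.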

Step 5: Make Predictions

# Make predictions on the test data
y_pred = model.predict(X_test)

# Display the predictions
print("Predicted Exam Scores:", y_pred)

Step 6: Evaluate the Model

# Evaluate the model using the coefficient of determination (R^2)
print("R^2 score:", model.score(X_test, y_test))

# Visualize the regression line
plt.scatter(X, y, color='blue')  # Scatter plot of the data points
plt.plot(X, model.predict(X), color='red')  # Line representing the linear model
plt.xlabel('Hours Studied')
plt.ylabel('Exam Score')
plt.title('Linear Regression: Hours Studied vs. Exam Score')
plt.show()

Output:

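  • R² score: This shows how much of the variability in the test data the model explains (closer to 1 means a better fit). Because the sample data lie exactly on a straight line, the score here should come out at, or within rounding error of, 1.0; real-world data will rarely fit this well.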
  • Visualization: A scatter plot of the data points and the fitted regression line.

8. Evaluating the Model

To evaluate the linear regression model, you can use several metrics, including:

  • R² Score (Coefficient of Determination): This measures how much of the variance in the dependent variable is explained by the independent variables. A value close to 1 indicates a good fit. On held-out data it can be negative when the model predicts worse than always guessing the mean.
  • Mean Squared Error (MSE): The average of the squared differences between the actual and predicted values, expressed in squared units of the target. A lower MSE indicates a better fit.
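
Both metrics are easy to compute by hand, which helps show what they measure. The sketch below uses a small set of made-up actual and predicted scores (an assumption for illustration, not output from the model above) and compares the manual results with `mean_squared_error` and `r2_score` from `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted exam scores, for illustration only.
y_true = np.array([60.0, 70.0, 80.0, 90.0])
y_pred = np.array([62.0, 69.0, 79.0, 92.0])

# MSE: the average squared difference between actual and predicted values.
mse_manual = np.mean((y_true - y_pred) ** 2)

# R²: 1 minus the ratio of unexplained variation to total variation.
ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

mse_sklearn = mean_squared_error(y_true, y_pred)
r2_sklearn = r2_score(y_true, y_pred)

print("MSE (manual, sklearn):", mse_manual, mse_sklearn)
print("R^2 (manual, sklearn):", r2_manual, r2_sklearn)
```

Here the squared errors are 4, 1, 1, and 4, so the MSE is 2.5, and R² is 1 − 10/500 = 0.98.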