Logistic regression is one of the most popular algorithms in machine learning, used primarily for binary classification tasks. Despite the name, logistic regression is a classification algorithm, not a regression algorithm. It’s used to predict the probability of a categorical dependent variable, typically one of two classes, based on one or more predictor variables.
In this post, we’ll cover what logistic regression is, how it works, and where it is applied, and then walk through sample code that demonstrates its implementation.
Logistic regression is a statistical method for analyzing datasets in which the outcome is a binary variable. It is used when the dependent variable (target) is categorical, and typically has two classes or categories (e.g., yes/no, 1/0, success/failure).
The goal of logistic regression is to model the probability that a given input belongs to a particular class. Unlike linear regression, which predicts unbounded continuous values, logistic regression predicts a probability of class membership, which always lies between 0 and 1.
The logistic regression model relies on the sigmoid function (also known as the logistic function), which is an S-shaped curve that transforms any real-valued number into a value between 0 and 1. This is ideal for classification, where we need to predict probabilities.
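The equation of the sigmoid function is:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Where:
$\sigma(z)$ is the output, a value between 0 and 1 that is read as a probability
$z$ is the input to the function (in logistic regression, the linear combination of the input features)
$e$ is Euler's number, the base of the natural logarithm (approximately 2.718)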
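As a quick illustration, the sigmoid can be written in a few lines of NumPy. This is a minimal sketch rather than a production implementation: the function name `sigmoid` and the sample inputs are chosen here for demonstration, and the naive formula can emit overflow warnings for very large negative inputs (`scipy.special.expit` is a numerically robust alternative).

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array of numbers) into the open interval (0, 1)."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# Large negative inputs approach 0, zero maps to exactly 0.5,
# and large positive inputs approach 1.
values = sigmoid([-5.0, 0.0, 5.0])
print(values)
```

Note the symmetry of the curve: `sigmoid(-z)` equals `1 - sigmoid(z)`, which is why an input of 0 sits exactly at the 0.5 midpoint.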
Logistic regression works in three key steps:
Modeling the Relationship: The first step is modeling the relationship between the independent variables and the target. We use a linear equation (similar to linear regression) to combine the input features.
Transforming Output with the Sigmoid Function: Instead of predicting continuous output, logistic regression applies the sigmoid function to the linear combination of inputs to produce a probability value between 0 and 1.
Making Predictions: After obtaining the probability, logistic regression classifies the input based on a threshold (usually 0.5). If the predicted probability is greater than or equal to 0.5, the input is classified as 1 (positive class); otherwise, it’s classified as 0 (negative class).
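The three steps above can be sketched directly in NumPy. The weights, bias, and input rows below are hypothetical values picked for illustration only; in a real model they would be learned from training data.

```python
import numpy as np

# Hypothetical parameters, chosen only for illustration (not learned from data).
weights = np.array([0.8, -0.4])
bias = -0.2

# Three made-up inputs with two features each.
X = np.array([
    [2.0, 1.0],
    [0.5, 3.0],
    [3.0, 0.5],
])

# Step 1: combine the features linearly, as in linear regression.
z = X @ weights + bias

# Step 2: squash the linear output into a probability with the sigmoid.
probabilities = 1.0 / (1.0 + np.exp(-z))

# Step 3: apply the 0.5 threshold to turn probabilities into class labels.
predictions = (probabilities >= 0.5).astype(int)

print(probabilities)
print(predictions)  # [1 0 1]
```

The threshold is a design choice, not part of the model itself. Lowering it makes the classifier more willing to predict the positive class, which can be useful when missing a positive case is costly.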
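The logistic regression model uses the following formula to estimate the probability that the target variable is equal to 1, given the input features $X$:

$$P(y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n)}}$$

Where:
$P(y = 1 \mid X)$ is the probability that the target variable equals 1 given the input features
$\beta_0$ is the intercept (bias) term
$\beta_1, \beta_2, \dots, \beta_n$ are the coefficients (weights) of the input features
$x_1, x_2, \dots, x_n$ are the input features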
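One useful way to read this model is that it is linear in the log-odds: if p = 1 / (1 + e^-(β0 + β1·x)), then log(p / (1 - p)) = β0 + β1·x. So increasing a feature by one unit changes the log-odds by exactly its coefficient. The sketch below checks this numerically for a single-feature model. The coefficient values are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical coefficients for a single-feature model (illustrative only).
beta_0 = -1.5  # intercept
beta_1 = 0.7   # coefficient of the single feature x

def probability(x):
    """P(y = 1 | x) under the logistic model."""
    z = beta_0 + beta_1 * x
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p):
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

p_at_2 = probability(2.0)
p_at_3 = probability(3.0)

# Moving x from 2 to 3 should shift the log-odds by beta_1.
log_odds_change = log_odds(p_at_3) - log_odds(p_at_2)
print(p_at_2, p_at_3, log_odds_change)
```

This is why coefficients are often reported as odds ratios: exponentiating a coefficient gives the factor by which the odds are multiplied per unit increase in that feature.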
The log-likelihood function is used to estimate the model parameters during training by maximizing the likelihood of observing the given data.
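To make the training idea concrete, here is a minimal sketch that maximizes the log-likelihood with plain gradient ascent. The tiny dataset, learning rate, and iteration count are assumptions made for this example. This is not how scikit-learn trains its model: its `LogisticRegression` uses more sophisticated solvers and applies L2 regularization by default.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(X, y, w, b):
    """Sum over samples of y*log(p) + (1 - y)*log(1 - p)."""
    p = sigmoid(X @ w + b)
    eps = 1e-12  # guards against log(0)
    return np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Tiny made-up dataset with one feature. The classes overlap on purpose,
# so the maximum-likelihood solution is finite.
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0]])
y = np.array([0, 0, 1, 0, 1, 1])

w = np.zeros(X.shape[1])  # start from all-zero parameters
b = 0.0
learning_rate = 0.1       # assumed value, not tuned

initial_ll = log_likelihood(X, y, w, b)

for _ in range(2000):
    p = sigmoid(X @ w + b)
    error = y - p  # gradient of the log-likelihood with respect to z
    # Step uphill along the averaged gradient.
    w += learning_rate * (X.T @ error) / len(y)
    b += learning_rate * error.mean()

final_ll = log_likelihood(X, y, w, b)
print(initial_ll, final_ll, w, b)
```

The update direction `y - p` has a simple reading: parameters move to raise the predicted probability where the true label is 1 and lower it where the true label is 0.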
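Logistic regression is widely used for binary classification tasks. Common applications include spam detection (spam or not spam), medical diagnosis (disease present or absent), and customer churn prediction (churn or stay).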
Let’s now walk through how to implement logistic regression using Python’s scikit-learn library. We'll use a simple example of classifying data points into two classes based on their features.
In practice, you might have a dataset of customer information and want to predict whether a customer will churn or stay. To keep this example self-contained, we’ll instead use the Iris dataset that ships with sklearn and turn it into a binary problem: classifying one class of flowers (class 0, setosa) versus the rest.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
# Sample dataset (using Iris dataset for simplicity)
from sklearn.datasets import load_iris
data = load_iris()
# We will predict whether the flower is of class 0 or not (binary classification)
X = data.data
y = (data.target == 0).astype(int) # Classifying if the flower is of class 0
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the logistic regression model
model = LogisticRegression()
# Train the model
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Model Evaluation
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
print(f'Accuracy: {accuracy}')
print(f'Confusion Matrix:\n{conf_matrix}')
# Plot the test points on the first two features, colored by predicted class
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_pred, cmap='coolwarm', edgecolors='k', s=100)
plt.title('Logistic Regression - Classification Results')
plt.xlabel('Sepal length (cm)')
plt.ylabel('Sepal width (cm)')
plt.show()
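In the code above, the model is trained with the fit() method, class labels for the test set come from predict(), and accuracy_score and confusion_matrix summarize how well those predictions match the true labels.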
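Beyond hard class labels, the fitted model also exposes the underlying probabilities. The sketch below rebuilds the same Iris setup so that it runs on its own, reads the probabilities with `predict_proba`, and recomputes them by hand from `coef_` and `intercept_` to show the sigmoid at work. The custom threshold of 0.3 is an arbitrary value used only to illustrate the idea.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same binary setup as the example above: class 0 versus the rest.
data = load_iris()
X = data.data
y = (data.target == 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)

# Column 1 of predict_proba holds P(y = 1 | X) for each test sample.
proba = model.predict_proba(X_test)[:, 1]

# Recompute the same probabilities manually: linear combination, then sigmoid.
z = X_test @ model.coef_.ravel() + model.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))
print(np.allclose(proba, manual_proba))

# Apply a custom decision threshold instead of the default 0.5.
threshold = 0.3  # arbitrary value for illustration
custom_pred = (proba >= threshold).astype(int)
print(custom_pred)
```

Working with probabilities rather than labels lets you pick the threshold that suits the cost of each kind of error in your application.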