Logistic Regression


1. Logistic Model

Consider a model with features $x_1, x_2, x_3, \ldots, x_n$. Let the binary output be denoted by $y$, which can take the values 0 or 1. Let $p$ be the probability of $y = 1$; we can write this as $p = P(y=1)$. The mathematical relationship between these variables can be denoted as:

$$\ln\left(\frac{p}{1-p}\right) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3$$

Here the term $\frac{p}{1-p}$ is known as the odds and denotes the likelihood of the event taking place. Thus $\ln\left(\frac{p}{1-p}\right)$ is known as the log odds and simply maps the probability, which lies between 0 and 1, to the range $(-\infty, +\infty)$. The terms $\theta_1, \theta_2, \theta_3, \ldots$ are parameters (or weights) that we will estimate during training.
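
For intuition, here is a minimal NumPy sketch (with illustrative probability values, not data from this page) showing how a probability maps to odds and then to log odds:

import numpy as np

p = np.array([0.1, 0.5, 0.9])   # probabilities in (0, 1)
odds = p / (1 - p)              # odds lie in (0, +inf)
log_odds = np.log(odds)         # log odds lie in (-inf, +inf)
print(odds)                     # [0.1111... 1. 9.]
print(log_odds)                 # [-2.1972... 0. 2.1972...]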

It is actually the sigmoid! Solving the log-odds equation for $p$:

$$
\begin{aligned}
\ln\left(\frac{p}{1-p}\right) &= \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 \\
\frac{p}{1-p} &= e^{\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3} \\
e^{\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3} - p\, e^{\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3} &= p \\
p + p\, e^{\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3} &= e^{\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3} \\
p \left(1 + e^{\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3}\right) &= e^{\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3} \\
p &= \frac{e^{\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3}}{1 + e^{\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3}} \\
p &= \frac{1}{1 + \frac{1}{e^{\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3}}} \\
p &= \frac{1}{1 + e^{-(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3)}} \\
S(x) &= \frac{1}{1 + e^{-x}}
\end{aligned}
$$

We will now use the above equation to make our predictions. Before that, we train the model to obtain the values of the parameters $\theta_0, \theta_1, \theta_2, \ldots$ that result in the least error.
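
As a quick sanity check on the algebra above (a small NumPy sketch), applying the sigmoid to the log odds recovers the original probability:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

p = 0.73
log_odds = np.log(p / (1 - p))  # probability -> log odds
print(sigmoid(log_odds))        # log odds -> probability, prints 0.73 (up to floating point)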

2. Define the Loss Function

A loss function such as the least squared error will do the job. (Cross entropy / log loss is the more common choice for logistic regression in practice, but squared error keeps this walkthrough simple.)

$$L = \sum_{i=1}^{n} \left(y_{\text{true}} - y_{\text{predicted}}\right)^2$$
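
In code, this loss looks as follows (a minimal sketch; y_true and y_pred are placeholder arrays, not values from the dataset used later):

import numpy as np

def squared_error_loss(y_true, y_pred):
    # L = sum_i (y_true_i - y_pred_i)^2
    return np.sum((y_true - y_pred) ** 2)

# Placeholder example values
y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0.2, 0.8, 0.6, 0.1])
print(squared_error_loss(y_true, y_pred))  # 0.25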

3. Utilize the Gradient Descent Algorithm

You might know that the partial derivatives of a function are equal to 0 at its minimum. Gradient descent uses this idea to estimate the parameters (weights) of our model: it repeatedly nudges the weights in the direction that decreases the loss function, until the gradient approaches zero.

  1. Initialize the weights: $\theta_0 = 0$ and $\theta_1 = 0$.

  2. Calculate the partial derivatives of the loss with respect to $\theta_0$ and $\theta_1$, where $\bar{y_i}$ denotes the predicted probability for the $i$-th sample (a short derivation of these expressions follows this list):

$$d_{\theta_0} = -2 \sum_{i=1}^{n} (y_i - \bar{y_i}) \cdot \bar{y_i} \cdot (1 - \bar{y_i})$$
$$d_{\theta_1} = -2 \sum_{i=1}^{n} (y_i - \bar{y_i}) \cdot \bar{y_i} \cdot (1 - \bar{y_i}) \cdot x_i$$

  3. Update the weights $\theta_0$ and $\theta_1$, where $l$ is the learning rate:

$$\theta_0 = \theta_0 - l \cdot d_{\theta_0}$$
$$\theta_1 = \theta_1 - l \cdot d_{\theta_1}$$
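
These derivatives follow from applying the chain rule to the squared-error loss, using $\bar{y_i} = S(\theta_0 + \theta_1 x_i)$ and the sigmoid identity $S'(z) = S(z)\,(1 - S(z))$:

$$
\begin{aligned}
L &= \sum_{i=1}^{n} (y_i - \bar{y_i})^2, \qquad \bar{y_i} = S(\theta_0 + \theta_1 x_i) \\
\frac{\partial L}{\partial \theta_0} &= \sum_{i=1}^{n} 2\,(y_i - \bar{y_i}) \cdot (-1) \cdot \bar{y_i}\,(1 - \bar{y_i}) = -2 \sum_{i=1}^{n} (y_i - \bar{y_i})\, \bar{y_i}\,(1 - \bar{y_i}) \\
\frac{\partial L}{\partial \theta_1} &= -2 \sum_{i=1}^{n} (y_i - \bar{y_i})\, \bar{y_i}\,(1 - \bar{y_i})\, x_i
\end{aligned}
$$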

Python Implementation

# Importing libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from math import exp

# Preparing the dataset
data = pd.DataFrame({'feature' : [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], 'label' : [0,0,0,0,0,0,0,1,1,1,1,1,1,1,1]})
# Divide the data into a training set and a test set (random_state fixed so the split is reproducible)
X_train, X_test, y_train, y_test = train_test_split(data['feature'], data['label'], test_size=0.30, random_state=42)

## Logistic Regression Model
# Helper function to normalize data
def normalize(X):
    return X - X.mean()

# Method to make predictions
def predict(X, theta0, theta1):
    # Here the predict function is: 1/(1+e^(-x))
    return np.array([1 / (1 + exp(-(theta0 + theta1*x))) for x in X])

# Method to train the model
def logistic_regression(X, Y):
    # Normalizing the data
    X = normalize(X)

    # Initializing variables
    theta0 = 0
    theta1 = 0
    learning_rate = 0.001
    epochs = 300

    # Training iteration
    for epoch in range(epochs):
        y_pred = predict(X, theta0, theta1)

        ## Here the loss function is: sum(y-y_pred)^2 a.k.a least squared error (LSE)
        # Derivative of loss w.r.t. theta0
        theta0_d = -2 * sum((Y - y_pred) * y_pred * (1 - y_pred))
        # Derivative of loss w.r.t. theta1
        theta1_d = -2 * sum(X * (Y - y_pred) * y_pred * (1 - y_pred))

        theta0 = theta0 - learning_rate * theta0_d
        theta1 = theta1 - learning_rate * theta1_d
    
    return theta0, theta1

# Training the model
theta0, theta1 = logistic_regression(X_train, y_train)   

# Making predictions
# Normalize the test features with the *training* mean so that train and test
# data are centered consistently (normalize(X_test) would center on the test mean)
X_test_norm = X_test - X_train.mean()
y_pred = predict(X_test_norm, theta0, theta1)
y_pred = [1 if p >= 0.5 else 0 for p in y_pred]

# Evaluating the model
print(list(y_test))
print(y_pred)
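
For comparison, the same toy problem can be solved with scikit-learn's LogisticRegression, which optimizes the log loss rather than the squared error used above. A minimal sketch, reusing X_train, X_test, and y_train from the code above:

from sklearn.linear_model import LogisticRegression

# scikit-learn expects a 2D feature matrix, hence the reshape
clf = LogisticRegression()
clf.fit(X_train.values.reshape(-1, 1), y_train)
sk_pred = clf.predict(X_test.values.reshape(-1, 1))
print(list(sk_pred))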

Related: L2 Loss function
Further reading: Logistic Regression in Machine Learning using Python (Medium)