Feed


Each ML solution has four major parts:

  • Machine learning algorithm

  • Training data

  • Signals (sometimes called features)

  • Validation and Metrics

For algorithms, which algorithm will you choose and why? Deep learning, linear regression, a random forest? What are the strengths and weaknesses of each? How well does each meet your system's needs?

For data, where will you get your training and test data? What data points will you draw from? How many data points will you handle?

For signals, what metric does your program use to determine relevant data? Will your signals focus on one aspect of the data, or will you synthesize them from multiple aspects? How long does it take to determine data relevancy?

For metrics, what metrics will you track for success and program learning? How would you measure the success of your system? How will you validate your hypothesis?

5 Steps to solve any ML system design problem

Our question is: Create a content feed to display personalized posts to users.

Step 1: Clarify the question (example)

If we were clarifying the feed question, we’d ask:

  • What type of feed will this be? Purely text? Text and images?

  • How many users do we expect to have? How many posts does each make per day?

  • What metric does our system optimize for? Do we want more engagement per post or to increase the number of posts?

  • Do we have a target latency?

  • How quickly will our system apply new learning?

Step 2: Identify the training data (example)

We’d write that our training data comes from our current social media platform. Fresh live data enters the system each time a new post is created, carrying the creator’s location, the popularity of the creator’s past posts, and the accounts that follow that creator.

We’ll use these signals to determine how relevant a post is to a user; relevancy will be computed when the user launches the app. Our goal is to increase engagement per post.
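
To make these signals concrete, here is a minimal sketch of what one training record might look like. Every field name here (post_id, creator_location, past_post_popularity, follower_ids, engaged) is an illustrative assumption, not a schema from the original design:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PostTrainingRecord:
    """One candidate post as a training example. All field names are
    illustrative assumptions, not a fixed schema."""
    post_id: str
    creator_location: str        # where the creator posted from
    past_post_popularity: float  # average engagement on the creator's past posts
    follower_ids: List[str] = field(default_factory=list)  # accounts following the creator
    engaged: bool = False        # label: did this user engage with the post?
```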

Step 3: Design the signals (example)

We’ll expect each user to follow 300 accounts and each account to make an average of 3 posts per day, so the system must evaluate roughly 1,000 candidate posts per user (300 accounts × 3 posts ≈ 900, rounded up for headroom). To keep latency low, we’ll evaluate posts in three layers. The first and quickest layer cuts the majority of posts based on post popularity.

The second layer uses location data to cut posts based on locality; this is our second-quickest layer. The third and slowest layer cuts posts using cross-engagement data between the follower and the followed.
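
A minimal sketch of this three-layer funnel in Python, assuming posts are plain dictionaries and using made-up thresholds and a stand-in cross_engagement lookup; the point is that each layer is more expensive than the last and only sees the survivors of the cheaper layers:

```python
from typing import Dict, List

# Illustrative thresholds -- assumptions, not tuned values.
MIN_POPULARITY = 0.2         # layer 1 cutoff
MIN_CROSS_ENGAGEMENT = 0.05  # layer 3 cutoff

def cross_engagement(creator_id: str, user_id: str) -> float:
    """Stand-in for the expensive lookup of historical engagement
    between this user and this creator."""
    return 0.1  # placeholder value

def select_posts(posts: List[Dict], user: Dict) -> List[Dict]:
    # Layer 1 (quickest): cut the majority of posts on popularity alone.
    survivors = [p for p in posts if p["popularity"] >= MIN_POPULARITY]
    # Layer 2 (second quickest): cut posts outside the user's region.
    survivors = [p for p in survivors if p["region"] == user["region"]]
    # Layer 3 (slowest): cut posts with weak follower/creator cross-engagement.
    return [p for p in survivors
            if cross_engagement(p["creator_id"], user["id"]) >= MIN_CROSS_ENGAGEMENT]
```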

Step 4: Choose the algorithm (example)

We’ll use a feedforward neural network to predict relevancy. This algorithm works well with our creator/user interaction signals because it forms predictions from acyclic connections, with no feedback loops.
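
As a sketch, a relevancy scorer like this could be written in PyTorch (covered elsewhere in this cheat sheet); the input width, hidden sizes, and the idea of feeding the funnel signals as inputs are all assumptions for illustration:

```python
import torch
import torch.nn as nn

class RelevancyNet(nn.Module):
    """Feedforward network: signals flow input -> hidden -> output with no
    cycles or feedback loops. Layer sizes are illustrative assumptions."""
    def __init__(self, n_signals: int = 8):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_signals, 32),  # e.g., popularity, locality, cross-engagement
            nn.ReLU(),
            nn.Linear(32, 16),
            nn.ReLU(),
            nn.Linear(16, 1),
            nn.Sigmoid(),              # relevancy score in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)

model = RelevancyNet()
scores = model(torch.randn(4, 8))  # score 4 candidate posts, 8 signals each
```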

Step 5: Define metrics and validate (example)

We hypothesize that our relevancy-based feed will increase user engagement by 0.5%. We’ll first use offline models programmed to simulate users and observe which types of posts make it into the feed.

Once we move online, we’ll track posts with the keywords “update” and “relevance” to determine effectiveness.
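
Once online, one simple way to check the 0.5% hypothesis is to compare engagement per post between the old feed and the new one; the counts below are placeholders standing in for production logs:

```python
def engagement_per_post(engagements: int, posts_served: int) -> float:
    return engagements / posts_served

# Placeholder counts -- real values would come from production logs.
baseline = engagement_per_post(engagements=51_000, posts_served=1_000_000)
treatment = engagement_per_post(engagements=51_255, posts_served=1_000_000)

lift = (treatment - baseline) / baseline
print(f"Engagement lift: {lift:.2%}")  # prints 0.50% -- the target lift
```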

[Figure: Anatomy of a machine learning system design interview question (source: Educative)]