Tags:concept Status:🟩


Scalability

Summary

Scalability is a system’s ability to cope with increased load as user numbers or data volumes grow, for example when going from 10,000 to 100,000 users. Load is described with load parameters such as requests per second or the number of active users. For example, Twitter initially struggled with the high read load on home timelines and improved performance by precomputing a cached timeline for each user. Performance is evaluated with metrics like latency and response time, and growth can be handled by scaling up, scaling out, or using elastic systems. An effective scalable architecture is tailored to the application’s specific needs, treats stateless and stateful components differently, and adapts as tools for distributed systems evolve.

Details

Scalability describes a system’s ability to cope with increased load. Just because a system works reliably today doesn’t mean it will work reliably in the future. A common reason for degradation is increased load, for example growth from 10,000 to 100,000 users. Today, we are handling significantly larger volumes of data than ever before.

Describing load

Load can be described with a few numbers we call load parameters. The best choice of parameters depends on the architecture of the system. It could be:

  • Requests per second to a web server
  • The ratio of reads to writes in a database
  • The number of simultaneously active users in a chat room
  • The hit rate on a cache
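As a rough illustration, the parameters listed above can be derived from a window of observed requests. This is a minimal sketch: the `Request` record, its fields, and the sample data are hypothetical and only serve to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Request:
    timestamp: float  # seconds since the start of the observation window
    is_write: bool    # True for a write, False for a read
    cache_hit: bool   # True if a read was served from the cache


def load_parameters(requests, window_seconds):
    """Summarise a window of requests as a few load parameters."""
    reads = [r for r in requests if not r.is_write]
    writes = [r for r in requests if r.is_write]
    hits = sum(1 for r in reads if r.cache_hit)
    return {
        "requests_per_second": len(requests) / window_seconds,
        "read_write_ratio": len(reads) / len(writes) if writes else float("inf"),
        "cache_hit_rate": hits / len(reads) if reads else 0.0,
    }


# Hypothetical sample: four reads (three cache hits) and one write in 2 seconds.
sample = [
    Request(0.1, False, True),
    Request(0.4, False, False),
    Request(0.9, True, False),
    Request(1.2, False, True),
    Request(1.8, False, True),
]
params = load_parameters(sample, window_seconds=2.0)
print(params)
```

Which of these numbers matters most depends on the system: a read-heavy service cares about the read/write ratio and hit rate, while a chat server cares more about concurrent users.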

Twitter scaling example

Post tweet: A user can publish a new message to their followers (4.6k requests/sec on average).

Home timeline: A user can view tweets posted by the people they follow (300k requests/sec).

Simply handling 12,000 writes per second (the peak rate for posting tweets) would be fairly easy. However, Twitter’s scaling challenge is not primarily due to tweet volume, but due to fan-out. Fan-out is originally an electronics term for the number of logic gate inputs attached to another gate’s output; in transaction processing systems, it describes the number of requests to other services needed to serve one incoming request. Each user follows many people, and each user is followed by many people.

Two ways of implementing Twitter’s operations

  1. Posting a tweet simply inserts the new tweet into a global collection of tweets. When a user requests their home timeline, look up all the people they follow, find all the tweets for each of those users, and merge them (sorted by time). In a relational database like in Figure 1-2, this can be written as a single query that joins the tweets with the follow relationships.
  2. Maintain a cache for each user’s home timeline—like a mailbox of tweets for each recipient user (see Figure 1-3). When a user posts a tweet, look up all the people who follow that user, and insert the new tweet into each of their home timeline caches. The request to read the home timeline is then cheap, because its result has been computed ahead of time.
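The two approaches can be sketched side by side in a few lines. This is an illustrative in-memory model, not Twitter’s implementation: the dictionaries stand in for database tables and caches, and all names (`follow`, `post_tweet`, `home_timeline_v1`, `home_timeline_v2`) are hypothetical. Timelines are sorted newest first, which is an assumption; the notes only say "sorted by time".

```python
from collections import defaultdict

follows = defaultdict(set)          # follower -> set of people they follow
followers = defaultdict(set)        # user -> set of people following them
all_tweets = []                     # approach 1: global collection of tweets
timeline_cache = defaultdict(list)  # approach 2: per-user precomputed timeline


def follow(follower, followee):
    follows[follower].add(followee)
    followers[followee].add(follower)


def post_tweet(timestamp, sender, text):
    tweet = (timestamp, sender, text)
    # Approach 1: one cheap insert into the global collection.
    all_tweets.append(tweet)
    # Approach 2: fan out to every follower's cache at write time.
    for user in followers[sender]:
        timeline_cache[user].append(tweet)


def home_timeline_v1(user):
    # Approach 1: find tweets by followed users and merge them at read time.
    return sorted((t for t in all_tweets if t[1] in follows[user]), reverse=True)


def home_timeline_v2(user):
    # Approach 2: the work was done at write time; reading is just a lookup.
    return sorted(timeline_cache[user], reverse=True)


follow("alice", "bob")
follow("alice", "carol")
post_tweet(1, "bob", "hello")
post_tweet(2, "carol", "hi there")
post_tweet(3, "dave", "not followed by alice")

print(home_timeline_v1("alice"))
print(home_timeline_v2("alice"))
```

Both functions return the same timeline here. The difference is where the cost lands: approach 1 scans and filters on every read, while approach 2 pays once per follower on every write.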

Initial Twitter Approach

Approach 1: The first version of Twitter used this approach, but the system struggled to keep up with the load of home timeline queries.

Switch to Approach 2

Reason for Change: The rate of published tweets is almost two orders of magnitude lower than the rate of timeline reads. Approach 2 reduces read-time load by doing more work at write-time.

Challenge: Posting a tweet now requires more work. On average, a tweet is delivered to about 75 followers, so 4.6k tweets per second become 345k writes per second to the home timeline caches. The average hides wide variation: some users have over 30 million followers, so a single tweet from them can result in over 30 million writes, which makes timely delivery challenging.
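The numbers above can be checked with a short calculation. The variable names are illustrative; the figures are the ones quoted in these notes.

```python
tweets_per_second = 4_600            # average rate of posted tweets
avg_followers_per_tweet = 75         # average fan-out per tweet
timeline_reads_per_second = 300_000  # home timeline requests

# Approach 2 turns each posted tweet into one cache write per follower.
timeline_writes_per_second = tweets_per_second * avg_followers_per_tweet

# Reads outnumber posts by almost two orders of magnitude.
read_to_post_ratio = timeline_reads_per_second / tweets_per_second

# Worst case: a single tweet from an account with 30 million followers.
celebrity_followers = 30_000_000
writes_for_one_celebrity_tweet = celebrity_followers

print(timeline_writes_per_second)
print(round(read_to_post_ratio, 1))
print(writes_for_one_celebrity_tweet)
```

One celebrity tweet therefore causes roughly as many cache writes as about 87 seconds of average traffic (30,000,000 / 345,000).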

Key Load Parameter

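Fan-out Load: The distribution of followers per user (possibly weighted by how often those users tweet) is a key load parameter for discussing scalability, because it determines the fan-out load. Other applications have different characteristics, but similar principles apply when reasoning about their load.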

Hybrid Approach

Current Strategy: Most tweets are fanned out at post-time, but tweets from celebrities with many followers are excepted from the fan-out. They are fetched separately and merged into the home timeline at read-time, as in Approach 1. This hybrid approach delivers consistently good performance.
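A minimal sketch of the hybrid strategy, under stated assumptions: the `CELEBRITY_THRESHOLD` cutoff is hypothetical (the notes give no number), the data structures are in-memory stand-ins, and timelines are sorted newest first.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 3  # hypothetical cutoff, chosen small for the demo

follows = defaultdict(set)            # follower -> set of people they follow
followers = defaultdict(set)          # user -> set of people following them
timeline_cache = defaultdict(list)    # per-user cache, filled at write time
celebrity_tweets = defaultdict(list)  # celebrity -> their tweets, read on demand


def follow(follower, followee):
    follows[follower].add(followee)
    followers[followee].add(follower)


def is_celebrity(user):
    return len(followers[user]) >= CELEBRITY_THRESHOLD


def post_tweet(timestamp, sender, text):
    tweet = (timestamp, sender, text)
    if is_celebrity(sender):
        # Too many followers: store once and skip the fan-out.
        celebrity_tweets[sender].append(tweet)
    else:
        # Ordinary user: fan out to each follower's cache.
        for user in followers[sender]:
            timeline_cache[user].append(tweet)


def home_timeline(user):
    # Start from the precomputed cache, then merge in celebrity tweets.
    merged = list(timeline_cache[user])
    for followee in follows[user]:
        merged.extend(celebrity_tweets.get(followee, []))
    return sorted(merged, reverse=True)


for fan in ("a", "b", "c"):
    follow(fan, "star")
follow("a", "bob")

post_tweet(1, "bob", "ordinary tweet")
post_tweet(2, "star", "celebrity tweet")

print(home_timeline("a"))
```

The celebrity tweet costs one write instead of one per follower, at the price of a small extra merge on each read by their followers.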

Describing performance

Latency and response time

Latency and response time are often used synonymously, but they are not the same. The response time is what the client sees: besides the actual time to process the request (the service time), it includes network delays and queueing delays. Latency is the duration that a request is waiting to be handled—during which it is latent, awaiting service.

Average, percentiles and median

It’s common to see the average (arithmetic mean) response time of a service reported. However, the mean is not a very good metric if you want to know your “typical” response time, because it doesn’t tell you how many users actually experienced that delay.

Usually it is better to use percentiles. If you take your list of response times and sort it from fastest to slowest, then the median is the halfway point. For example, if the median is 200 ms, half of the requests return in less than 200 ms and half take longer. The median is also known as the 50th percentile (p50).

Outliers

To figure out how bad your outliers are, look at higher percentiles: the 95th, 99th, and 99.9th percentiles (abbreviated p95, p99, and p999) are common.
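The difference between the mean, the median, and the higher percentiles shows up clearly on a small sample with one slow outlier. This sketch uses the nearest-rank definition of a percentile, which is one of several common definitions; the response times are made up for illustration.

```python
import math


def percentile(sorted_times, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the observations are less than or equal to it."""
    rank = max(1, math.ceil(p * len(sorted_times) / 100))
    return sorted_times[rank - 1]


# Hypothetical response times in milliseconds, including one slow outlier.
response_times_ms = [120, 95, 210, 180, 3000, 150, 170, 200, 130, 160]
ordered = sorted(response_times_ms)

mean = sum(ordered) / len(ordered)
p50 = percentile(ordered, 50)
p95 = percentile(ordered, 95)
p99 = percentile(ordered, 99)

print(mean, p50, p95, p99)
```

A single 3000 ms request pulls the mean to 441.5 ms, although nine out of ten requests finished in 210 ms or less. The median stays at 160 ms, and the outlier only appears in the high percentiles.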

Approaches for coping with load

Scaling up (vertical scaling): Moving to a more powerful machine. Scaling out (horizontal scaling): Distributing the load across multiple smaller machines, also known as a shared-nothing architecture.

Single Machine vs. Scaling Out

Running on a single machine is simpler, but high-end machines can be expensive. Intensive workloads often require scaling out to multiple machines. A mix of both approaches is common—using a few powerful machines can be cheaper and simpler than many small virtual machines.

Elastic vs. Manual Scaling

Elastic systems automatically add resources with load increases; manual systems rely on human intervention. Elastic scaling is useful for unpredictable loads; manual scaling is simpler with fewer surprises.
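The core decision an elastic system makes can be sketched as a small function. This is a simplified illustration, not how any particular autoscaler works: `desired_instances`, its parameters, and the capacity figures are all hypothetical.

```python
import math


def desired_instances(requests_per_second, capacity_per_instance,
                      min_instances=1, max_instances=20):
    """Return how many instances are needed for the observed load,
    clamped to a configured minimum and maximum."""
    needed = math.ceil(requests_per_second / capacity_per_instance)
    return max(min_instances, min(max_instances, needed))


# Assume each instance handles about 200 requests per second.
print(desired_instances(950, 200))     # moderate load
print(desired_instances(0, 200))       # idle: stay at the minimum
print(desired_instances(10_000, 200))  # spike: capped at the maximum
```

In a manually scaled system, a person performs this calculation and provisions the machines, which is slower but makes capacity changes deliberate and predictable.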

Stateless vs. Stateful Systems

Distributing stateless services is straightforward. Distributing stateful data systems adds complexity; traditionally, databases are scaled up on a single node until distributed scaling is necessary.

As tools for distributed systems improve, the default may shift toward distributed data systems, even for smaller workloads. The book will explore distributed data systems, focusing on scalability, ease of use, and maintainability.

Specificity of Scalable Architectures

Large-scale system architectures are highly specific to the application, with no generic solution (no “magic scaling sauce”). Factors like read and write volume, the amount of data to store, data complexity, response time requirements, and access patterns influence architecture design.

Example and Flexibility

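A system designed for 100,000 requests per second, each 1 kB in size, looks very different from one designed for 3 requests per minute, each 2 GB in size, even though the two have the same data throughput. A scalable architecture is built around assumptions about which operations are common and which are rare; if those assumptions turn out to be wrong, the engineering effort spent on scaling is wasted at best. In early-stage products, quick iteration on features is often more important than scaling for a hypothetical future load.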
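The throughput comparison can be checked with a short calculation. The payload sizes used here, 1 kB per small request and 2 GB per large request, are assumptions taken from the usual form of this example, and decimal units (1 kB = 1,000 bytes) are assumed as well.

```python
KB = 1_000          # bytes, decimal units assumed
GB = 1_000_000_000  # bytes, decimal units assumed

# System A: many small requests.
small_requests_per_second = 100_000
small_request_size = 1 * KB
throughput_a = small_requests_per_second * small_request_size  # bytes/sec

# System B: a few very large requests.
large_requests_per_minute = 3
large_request_size = 2 * GB
throughput_b = large_requests_per_minute * large_request_size / 60  # bytes/sec

print(throughput_a / 1_000_000, "MB/s")
print(throughput_b / 1_000_000, "MB/s")
```

Both come out at 100 MB/s, yet the first system is dominated by per-request overhead and concurrency, while the second is dominated by moving and storing large objects.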

General-Purpose Building Blocks

Scalable architectures are usually built from general-purpose components arranged in familiar patterns. The book discusses these building blocks and patterns.