Requirements and Goals

  • To limit the number of requests per client in a particular time window. For example, a client is only allowed 100 requests per minute.
  • The rate limit should be enforced across the different servers of a service. The user should get an error message whenever the defined threshold is crossed across a cluster of servers.
  • Explore different types of algorithms used in rate limiting.

Why?

  • To prevent DDoS attacks
  • To eliminate spikiness in traffic from one client. For example, a client may send an abundance of requests in a particular time period, starving other clients.

High Level Design (HLD)

During app initialisation, load the rate-limiting configuration of each client (such as the number of requests allowed per time interval) from the database. The Rate Limiter middleware is responsible for deciding which requests will be served by the API and which will be declined. When a new request arrives, the Rate Limiter decides whether it will be served or throttled. If the request is not throttled, it is passed on to the API for processing.

Different Types of Algorithms

  1. Token Bucket Algorithm

In the token bucket algorithm, for each client we record the last request's Unix timestamp and the available token count in a hash in Redis. Tokens are refilled at a constant rate.

Please see the algorithm in action below:
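As an illustration, here is a minimal in-memory sketch of that bookkeeping in Python (class and parameter names are mine; in production the timestamp and token count would live in the Redis hash described above):

```python
import time

class TokenBucket:
    """Minimal in-memory token bucket; Redis would hold this state in production."""

    def __init__(self, capacity, refill_rate):
        self.capacity = capacity              # maximum tokens the bucket can hold
        self.refill_rate = refill_rate        # tokens added per second (constant)
        self.tokens = capacity                # available token count
        self.last_refill = time.monotonic()   # last request's timestamp

    def allow_request(self):
        now = time.monotonic()
        # Refill tokens accumulated since the last request, capped at capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1   # consume one token for this request
            return True
        return False           # no tokens left: throttle the request
```

A bucket created with `TokenBucket(100, 100 / 60)` would roughly enforce the "100 requests per minute" example from the requirements.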

2. Leaky Bucket Algorithm

Suppose we have a bucket into which we are pouring water, but we need the water to come out at a fixed rate; to achieve this, we make a hole at the bottom of the bucket. The hole ensures that water flows out at a fixed rate, and if the bucket is full, we stop pouring into it. The input rate can vary, but the output rate remains constant.

Leaky Bucket uses a bucket or queue to hold the incoming requests. Whenever a new request arrives, it is appended to the rear of the queue, as long as the queue is not full.

The requests are processed at fixed time intervals in first-come-first-serve (FCFS) order, i.e. the oldest requests are executed first. If the queue is full, the remaining requests are dropped, or "leaked", with a proper error message or notification to the client.


Core Logic for the Leaky Bucket

Please note that in production the counters would be maintained inside Redis. For the sake of simplicity, Redis is not used in the code below.
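Here is a minimal sketch, with plain Python variables standing in for the Redis state (the `leak` method is assumed to be invoked by a scheduler once per fixed interval):

```python
from collections import deque

class LeakyBucket:
    """Queue-based leaky bucket: requests enter at any rate, leave at a fixed rate."""

    def __init__(self, capacity):
        self.capacity = capacity   # maximum queue length (bucket size)
        self.queue = deque()       # holds pending requests, oldest at the front

    def add_request(self, request):
        # Append to the rear of the queue as long as the bucket is not full.
        if len(self.queue) >= self.capacity:
            return False           # bucket full: drop the request with an error
        self.queue.append(request)
        return True

    def leak(self):
        # Called at fixed intervals by a scheduler; processes requests FCFS,
        # so the output rate stays constant regardless of the input rate.
        if self.queue:
            return self.queue.popleft()
        return None
```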

3. GCRA (Generic Cell Rate Algorithm)

Rate limits are defined by a limit, L, i.e. the number of requests, and a time period, P, such that a rate of only L/P is possible without the requests being limited (blocked).

CASE: When L = 1

To understand this approach, let the rate of requests be R = L/P, and for simplicity set L = 1 so that the rate is R = 1/P. With this target rate, we can expect the requests to be separated by at least P. Put simply, if the arrival time t of the current request is within P of the arrival time s of the previous request, i.e. t - s < P, then the request must be limited.

By defining a Theoretical Arrival Time, TAT, of the next request to be equal to the arrival time of the current request s plus P, i.e. TAT = s + P, we find that if t < TAT the request should be limited. Hence we can just calculate and store TAT on every request.

CASE: When L > 1

When L > 1, we need to reconsider what the Theoretical Arrival Time of the next request should be. In the L = 1 case it was always the current request time s plus P. However, when L > 1 it is possible to have L requests within P, so each request is separated by P/L. In addition, it is now possible for requests to bunch, i.e. for the arrival time s to be less than the expected TAT. When this happens, the next request's Theoretical Arrival Time is TAT' = TAT + P/L. However, when requests don't bunch and s is greater than TAT, TAT' = s + P/L. Hence TAT' = max(TAT, s) + P/L.

Now that we can calculate and store the TAT for the next request, we just need to decide when to limit. From the above it is clear that for each request that arrives before its TAT, the stored TAT increases by the interval P/L. Clearly, if the new TAT' exceeds the current time plus the period, the request should be limited, i.e. if TAT' - t > P, or TAT - t > P - P/L. When L = 1 this reduces to TAT - t > 0, as previously stated. Note that the TAT should not be updated for limited requests, as they don't have a theoretical arrival time.

Please have a look at the following Python code to get an idea of the implementation:
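A minimal in-memory sketch of the update rule derived above (the `now` parameter is an assumption I added for testability; in production the stored TAT would live in Redis):

```python
import time

class GCRA:
    """Generic Cell Rate Algorithm: at most L requests per period of P seconds."""

    def __init__(self, limit, period):
        self.period = period            # P
        self.interval = period / limit  # P/L, ideal spacing between requests
        self.tat = 0.0                  # stored Theoretical Arrival Time

    def allow_request(self, now=None):
        t = time.monotonic() if now is None else now
        # TAT' = max(TAT, s) + P/L
        tat_new = max(self.tat, t) + self.interval
        # Limit when TAT' - t > P (equivalently, TAT - t > P - P/L).
        if tat_new - t > self.period:
            return False    # limited: do not update the stored TAT
        self.tat = tat_new
        return True
```

With `GCRA(2, 60)`, two back-to-back requests pass and a third is limited until enough of the period has elapsed.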

4. Fixed Window Algorithm

In the fixed window rate-limiting algorithm, the timeline is divided into fixed windows (say 1 minute or 1 hour), and each window is provided with a counter (to count the number of requests in that window). If the value of the counter exceeds the limit, the remaining requests are dropped.

The counter resets after every window.

Suppose we have a rate limit of 10 requests/hour and a data model like the one below.

In the above example, if a new request arrives at 12:40, we get the count from the bucket (12:00–1:00), which is 7. Since this is less than our threshold of 10 requests/hour, the request will be processed and the count of the current window will become 8.

Now consider the window (1:00–2:00): a request arrives at 1:40 and the value of the counter in this window is 9, which is less than the permissible limit (10), so this request will be accepted and the counter will become 10. No more requests in the window (1:00–2:00) will be accepted.
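The counter bookkeeping above can be sketched as a minimal in-memory fixed-window limiter (names are illustrative; in production the counter would live in a shared store such as Redis, and the `now` parameter is an assumption added for testability):

```python
import time

class FixedWindow:
    """One counter per fixed window; the counter resets when a new window starts."""

    def __init__(self, limit, window_size):
        self.limit = limit              # e.g. 10 requests
        self.window_size = window_size  # e.g. 3600 seconds for a 1-hour window
        self.current_window = 0         # index of the window the counter belongs to
        self.count = 0                  # requests seen in the current window

    def allow_request(self, now=None):
        now = time.time() if now is None else now
        window = int(now // self.window_size)   # which fixed window are we in?
        if window != self.current_window:
            # A new window has started: the counter resets after every window.
            self.current_window = window
            self.count = 0
        if self.count >= self.limit:
            return False    # limit reached: drop the remaining requests
        self.count += 1
        return True
```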

5. Sliding Logs Algorithm

In sliding logs, we maintain a sliding window by keeping track of each request per client. We can store the timestamp of each request in a Redis sorted set.

Let’s assume our rate limiter allows 3 requests per minute per client. Whenever a new request comes in, the Rate Limiter will perform the following steps:

  • Remove all the timestamps from the sorted set that are older than “CurrentTime - 1 minute”.
  • Count the total number of elements in the sorted set. Reject the request if this count is greater than or equal to our throttling limit of “3”.
  • Otherwise, insert the current time into the sorted set and accept the request.
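The three steps above can be sketched with a plain Python list standing in for the Redis sorted set (names are illustrative; the `now` parameter is an assumption added for testability):

```python
import time
from collections import defaultdict

class SlidingLog:
    """Per-client request-timestamp log; a Redis sorted set in production."""

    def __init__(self, limit, window):
        self.limit = limit                 # e.g. 3 requests
        self.window = window               # e.g. 60 seconds
        self.logs = defaultdict(list)      # client_id -> list of timestamps

    def allow_request(self, client_id, now=None):
        now = time.time() if now is None else now
        # Step 1: drop timestamps older than CurrentTime - window.
        log = [t for t in self.logs[client_id] if t > now - self.window]
        self.logs[client_id] = log
        # Step 2: reject if the log already holds `limit` entries.
        if len(log) >= self.limit:
            return False
        # Step 3: record this request and accept it.
        log.append(now)
        return True
```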

6. Sliding Window with Counter Algorithm

We can maintain a sliding window by keeping track of each request per user using multiple fixed time windows, e.g., windows 1/60th the size of our rate limit’s time window. For example, if we have a minutely rate limit, we can keep a count for each second, and when a new request arrives, sum all the counters from the past minute to check the throttling limit.

Let’s take an example where we rate-limit at 100 requests per minute, with an optional additional limit of 5 requests per second. This means that when the sum of the counters with timestamps in the past one minute exceeds the request threshold (100), the client has exceeded the rate limit. In addition to that, the client can’t send more than 5 requests per second (optional).

We can store our counters in a Redis hash. When each request increments a counter in the hash, it also sets the hash to expire a minute later.

So, whenever a new request comes in, the Rate Limiter middleware will perform the following steps:

  • Remove all the counters with timestamps older than “CurrentTime - 1 minute” (in Redis, the expiry takes care of this automatically).
  • Calculate the sum of the remaining counters from the past 1 minute. Reject the request if this sum is greater than or equal to our throttling limit of “100”.
  • Otherwise, increment the counter for the current second, set its expiry to a minute later, and accept the request.
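The steps above can be sketched with a plain dictionary standing in for the Redis hash (names are illustrative; the `now` parameter is an assumption added for testability, and counters are keyed by whole seconds as in the per-second example above):

```python
import time
from collections import defaultdict

class SlidingWindowCounter:
    """Per-second counters summed over the past window; a Redis hash in production."""

    def __init__(self, limit, window=60):
        self.limit = limit                  # e.g. 100 requests
        self.window = window                # e.g. 60 seconds
        self.counters = defaultdict(int)    # second-timestamp -> request count

    def allow_request(self, now=None):
        now = int(time.time() if now is None else now)
        # Drop counters older than the window (Redis expiry would do this for us).
        for ts in [t for t in self.counters if t <= now - self.window]:
            del self.counters[ts]
        # Reject if the counters in the past window already reach the limit.
        if sum(self.counters.values()) >= self.limit:
            return False
        # Otherwise record this request in the current second's counter.
        self.counters[now] += 1
        return True
```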

Please see the algorithm in action in the following diagram:

Problems with Sliding Window with Counter

The only problem with this algorithm is that there will be some inconsistency under race conditions. For example, say two requests arrive at the load balancer at the same time and are redirected to different servers in the cluster. Assume we allow 10 requests per minute for a client and have already allowed 9 requests in the past minute. Now, at the 55th second of the minute, these two requests arrive; both will see that the total number of requests in the past minute is 9, and therefore both will be allowed.

One possible solution is to use a distributed Redis lock: if two requests come at the same time, only one is allowed to read the data, and the other has to wait until the first updates the sum in Redis. But this becomes a performance bottleneck, since latency takes a hit.

The other solution is relaxed rate limiting. For example, if we have a limit of 100 requests per minute, have served 99 requests in the past 50 seconds, and now get 5 requests in the 55th second of the minute, we can simply allow all 5, for a total of 104 requests in that minute.

Comparison Between Different Algorithms

That was all about different algorithms employed in rate limiting. Do let me know in case you have any doubts. Cheers :)
