Rate Limiter

Aman Jain
9 min read · Apr 30, 2021

Requirements and Goals

  • To limit the number of requests per client in a particular time window. For example, a client is only allowed 100 requests per minute.
  • Rate limits should be enforced across the different servers of a service: the user should get an error message whenever the defined threshold is crossed anywhere in the cluster of servers.
  • Explore different types of algorithms used in rate limiting.

Why?

  • To prevent DDoS attacks.
  • To eliminate spikiness in traffic from one client. For example, a client may send a burst of requests in a particular time period, starving other clients.

High Level Design (HLD)

During app initialisation, load the rate-limiting configuration of every client (e.g. the number of requests allowed per time interval) from the database. The Rate Limiter middleware is responsible for deciding which requests will be served by the API and which will be declined. Once a new request arrives, the Rate Limiter decides whether it will be served or throttled. If the request is not throttled, it is passed on to the API for processing.
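As a rough sketch of that decision point, the Go middleware below assumes clients are identified by a (hypothetical) X-Client-ID header and that isRequestAllowed is backed by one of the algorithms described in the rest of this post:

package main

import "net/http"

// rateLimitMiddleware wraps an API handler and throttles requests
// before they reach it.
func rateLimitMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        // Hypothetical client identification; real services may key on
        // API key, user ID or IP address instead.
        clientID := r.Header.Get("X-Client-ID")
        if !isRequestAllowed(clientID) {
            // Throttled: reply with 429 Too Many Requests.
            http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
            return
        }
        // Not throttled: pass the request on to the API.
        next.ServeHTTP(w, r)
    })
}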

Different Types of Algorithms

  1. Token Bucket Algorithm

In the token bucket algorithm, for each client we record the Unix timestamp of the last request and the available token count in a Redis hash. Tokens are refilled at a constant rate.

Please see the algorithm in action below:
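Since the state is just a timestamp and a token count, the core logic is small. Here is a minimal single-process Go sketch of it; in production the two fields would live in the per-client Redis hash mentioned above:

package main

import (
    "math"
    "time"
)

// TokenBucket holds the per-client state; a fresh bucket starts with
// tokens == capacity.
type TokenBucket struct {
    capacity   float64   // maximum number of tokens the bucket can hold
    refillRate float64   // tokens added per second (constant)
    tokens     float64   // tokens currently available
    lastRefill time.Time // when tokens were last refilled
}

// Allow refills the bucket in proportion to the elapsed time, then
// tries to consume one token for the incoming request.
func (b *TokenBucket) Allow() bool {
    now := time.Now()
    elapsed := now.Sub(b.lastRefill).Seconds()
    b.tokens = math.Min(b.capacity, b.tokens+elapsed*b.refillRate)
    b.lastRefill = now
    if b.tokens >= 1 {
        b.tokens--
        return true
    }
    return false
}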

2. Leaky Bucket Algorithm

Visualisation of Leaky Bucket

Suppose we have a bucket into which we pour water, but we need to get water out at a fixed rate; for this, we make a hole at the bottom of the bucket. This ensures that the water comes out at a fixed rate, and if the bucket is full, we stop pouring into it. The input rate can vary, but the output rate remains constant.

Leaky Bucket uses a bucket, or queue, to hold the incoming requests. Whenever a new request arrives, it is appended to the rear of the queue, provided the queue is not full.

The requests are processed at fixed time intervals in first come, first served (FCFS) order, i.e. the oldest requests are executed first. If the queue is full, new requests are dropped (leaked) with a proper error message or notification to the client.

Algorithm

Define useful variables as follows:

// The number of requests left in the bucket
var numOfRequestLeftInTheBucket int
// The Unix timestamp of the last successful request insertion
var lastInsertedTime int64
// The bucket capacity
var capacity int
// The time (in seconds) required for a full bucket to be drained
var duration int64
// The request leakage rate of the bucket, equal to capacity/duration
var rate float64
For example: with capacity = 240 and duration = 60 s, the requests drain at a rate of 4 req/second, i.e. every 250 ms one request leaves the bucket.

Core Logic for Leaky Bucket.

Please note that in a real deployment the counters would be maintained inside Redis. For the sake of simplicity, Redis is not used in the code below.

func isRequestAllowed(key string) bool {
    // key would select the per-client bucket state (e.g. a Redis hash);
    // a single bucket is used here for simplicity.
    currentTime := time.Now().Unix()

    // Requests leaked since the last insertion =
    //   (current time - last insertion time) * leakage rate.
    // Requests left in the bucket =
    //   previously left requests - requests leaked in the meantime.
    // If no request has been inserted for a long time, the bucket has
    // fully drained, so the count floors at zero.
    leaked := int(float64(currentTime-lastInsertedTime) * rate)
    numOfRequestLeftInTheBucket -= leaked
    if numOfRequestLeftInTheBucket < 0 {
        numOfRequestLeftInTheBucket = 0
    }

    if numOfRequestLeftInTheBucket < capacity {
        lastInsertedTime = currentTime
        numOfRequestLeftInTheBucket++
        return true
    }
    return false
}

3. GCRA (Generic Cell Rate Algorithm)

Rate limits are defined by a limit L, i.e. the number of requests, and a time period P, such that a rate of at most L/P is possible without requests being limited (blocked).

CASE: When L = 1

To understand this approach, let the rate of requests be R = L/P, and for simplicity set L = 1 so that R = 1/P. With this target rate, we can expect requests to be separated by at least P. Put simply, if the arrival time t of the current request is within P of the arrival time s of the previous request, i.e. t - s < P, then the request must be limited.

By defining a Theoretical Arrival Time, TAT, of the next request to be the arrival time of the current request s plus P, i.e. TAT = s + P, we find that if t < TAT the request should be limited. Hence we can simply calculate and store the TAT on every request.
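For example, with P = 10 s: if the previous request arrived at s = 100, then TAT = 110. A request arriving at t = 105 is limited (105 < 110), while one arriving at t = 112 is served and sets TAT = 122.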

CASE: When L > 1

When L > 1, we need to reconsider what the Theoretical Arrival Time of the next request should be. In the L = 1 case it was always the current request time s plus P. However, when L > 1 it is possible to have L requests within P, so consecutive requests are separated by only P/L. It is now also possible for requests to bunch, i.e. for the arrival time s to be less than the expected TAT. When this happens, the next request's Theoretical Arrival Time is TAT' = TAT + P/L; when requests don't bunch and s is greater than TAT, TAT' = s + P/L. Hence TAT' = max(TAT, s) + P/L.

Now that we can calculate and store the TAT for the next request, we just need to decide when to limit. From the above it is clear that for each request that arrives before its TAT, the stored TAT increases by the interval P/L. Clearly, if the new TAT' exceeds the current time plus the period, the request should be limited, i.e. if TAT' - t > P, or equivalently TAT - t > P - P/L. When L = 1 this reduces to TAT - t > 0, as previously stated. Note that the TAT should not be updated for limited requests, as they don't have a theoretical arrival time.
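To make this concrete, take L = 3 and P = 60 s, so P/L = 20 s and the limiting condition is TAT - t > 40 s. Requests arriving at t = 0, 1, 2 are all served, advancing the stored TAT to 20, 40 and then 60. A fourth request at t = 3 finds TAT - t = 57 > 40, so it is limited and the TAT stays at 60.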

Please have a look at the following Python code to get an idea of the implementation:

from datetime import datetime, timedelta


class RateLimit:
    def __init__(self, count: int, period: timedelta) -> None:
        self.count = count
        self.period = period

    @property
    def inverse(self) -> float:
        # The emission interval P/L: the expected gap between requests.
        return self.period.total_seconds() / self.count


class Store:
    def get_tat(self, key: str) -> datetime:
        # Should return the previously stored TAT for this key,
        # or the current time if none has been stored yet.
        pass

    def set_tat(self, key: str, tat: datetime) -> None:
        # Stores the TAT for the next request.
        pass

    def is_limit(self, key: str, limit: RateLimit) -> bool:
        now = datetime.utcnow()
        tat = max(self.get_tat(key), now)
        separation = (tat - now).total_seconds()
        max_interval = limit.period.total_seconds() - limit.inverse
        if separation > max_interval:
            # The request arrived too early: limit it and leave the TAT
            # unchanged (limited requests have no theoretical arrival
            # time, as noted above).
            return True
        self.set_tat(key, tat + timedelta(seconds=limit.inverse))
        return False

4. Fixed Window Algorithm

In the fixed window rate limiting algorithm, the timeline is divided into fixed windows (say 1 minute, 1 hour, etc.) and each window is given a counter to count the number of requests in that window. If the value of the counter exceeds the limit, the remaining requests are dropped.

The counter resets after every window.

Suppose we have a rate limit of 10 requests/hour and a data model like the one below.

Window        Counter
12:00–1:00    7
1:00–2:00     9

In the above example, if a new request arrives at 12:40, we get the count from the bucket (12:00–1:00), which is 7; since this is less than our threshold of 10 requests/hour, the request will be processed, and the count of the current window becomes 8.

Now assume a case for the window (1:00–2:00): a request arrives at 1:40 and the value of the counter in this window is 9, which is less than the permissible limit (10), so this request will be accepted and the value of the counter becomes 10. No more requests in the window (1:00–2:00) will be accepted after that.
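A minimal single-process Go sketch of this scheme is below; in a distributed setup the counter would typically be a Redis key, incremented with INCR and expired when the window ends:

package main

import "time"

// FixedWindow keeps one counter for the current window and resets it
// whenever a new window begins.
type FixedWindow struct {
    limit       int           // allowed requests per window, e.g. 10
    window      time.Duration // window size, e.g. time.Hour
    windowStart time.Time     // start of the current window
    count       int           // requests counted in the current window
}

func (f *FixedWindow) Allow() bool {
    now := time.Now()
    if now.Sub(f.windowStart) >= f.window {
        // A new window has begun: align to the window boundary and reset.
        f.windowStart = now.Truncate(f.window)
        f.count = 0
    }
    if f.count < f.limit {
        f.count++
        return true
    }
    return false
}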

5. Sliding Logs Algorithm

In sliding logs, we maintain a sliding window by keeping track of each request per client. We can store the timestamp of each request in a Redis sorted set.

Let’s assume our rate limiter allows 3 requests per minute per client. Whenever a new request comes in, the Rate Limiter performs the following steps (sketched in code after the list):

  • Remove from the Sorted Set all timestamps older than “CurrentTime - 1 minute”.
  • Count the elements remaining in the sorted set. Reject the request if this count has already reached our throttling limit of “3”.
  • Otherwise, insert the current time into the sorted set and accept the request.
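A minimal in-memory Go sketch of these three steps follows; with Redis, they map naturally onto ZREMRANGEBYSCORE, ZCARD and ZADD on the per-client sorted set:

package main

import "time"

// SlidingLog keeps one timestamp per accepted request.
type SlidingLog struct {
    limit  int           // allowed requests per window, e.g. 3
    window time.Duration // window size, e.g. time.Minute
    log    []time.Time   // timestamps of accepted requests
}

func (s *SlidingLog) Allow() bool {
    now := time.Now()
    cutoff := now.Add(-s.window)
    // Step 1: remove timestamps older than CurrentTime - window.
    kept := s.log[:0]
    for _, t := range s.log {
        if t.After(cutoff) {
            kept = append(kept, t)
        }
    }
    s.log = kept
    // Step 2: reject if the window already holds `limit` requests.
    if len(s.log) >= s.limit {
        return false
    }
    // Step 3: record the current time and accept the request.
    s.log = append(s.log, now)
    return true
}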

6. Sliding Window with Counter Algorithm

We can approximate a sliding window by keeping track of each client's requests using multiple smaller fixed time windows, e.g. 1/60th the size of our rate limit’s time window. For example, if we have a per-minute rate limit, we can keep a count for each second and, when a new request arrives, sum the counters of the past minute to apply the throttling limit.

Let’s take an example where we rate-limit at 100 requests per minute, with an optional additional limit of 5 requests per second. This means that when the sum of the counters with timestamps in the past minute exceeds the request threshold (100), the client has exceeded the rate limit. In addition, the client can’t send more than 5 requests per second.

We can store our counters in a Redis hash. When each request increments a counter in the hash, it also sets the hash to expire a minute later.

So, whenever a new request comes in, the Rate Limiter middleware will perform the following steps:

  • Remove all counters whose timestamps are older than “CurrentTime - 1 minute”.
  • Calculate the sum of the remaining counters. Reject the request if this sum has already reached our throttling limit of “100”.
  • Otherwise, increment the counter for the current second (setting it to expire a minute later if it is new) and accept the request.

Please see the algorithm in action below:
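As with the previous algorithms, here is a minimal single-process Go sketch; in production the per-second counters would live in a Redis hash, incremented with HINCRBY and expired a minute later:

package main

import "time"

// SlidingWindowCounter keeps one counter per second and sums the
// counters of the past window on each request.
type SlidingWindowCounter struct {
    limit  int           // allowed requests per window, e.g. 100
    window time.Duration // window size, e.g. time.Minute
    counts map[int64]int // unix second -> requests in that second
}

// NewSlidingWindowCounter initialises the counter map.
func NewSlidingWindowCounter(limit int, window time.Duration) *SlidingWindowCounter {
    return &SlidingWindowCounter{limit: limit, window: window, counts: make(map[int64]int)}
}

func (s *SlidingWindowCounter) Allow() bool {
    now := time.Now().Unix()
    cutoff := now - int64(s.window.Seconds())
    total := 0
    for sec, c := range s.counts {
        if sec <= cutoff {
            // Drop counters that have slid out of the window.
            delete(s.counts, sec)
            continue
        }
        total += c
    }
    if total >= s.limit {
        return false
    }
    s.counts[now]++
    return true
}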

Problems with Sliding window with Count

The only problem with this algorithm is that there can be inconsistency under race conditions. For example, say two requests arrive at the load balancer at the same time and are redirected to different servers in the cluster. Assume we allow 10 requests per minute for a client and have already allowed 9 requests in the past minute. Now, at the 55th second of the minute, these two requests arrive; both see that the total for the past minute is 9, and therefore both requests are allowed.

One possible solution is to use a distributed Redis lock such that, if two requests come at the same time, only one is allowed to read the data; the other request has to wait until the first one updates the count in Redis. But this becomes a performance bottleneck, since latency takes a hit.

The other solution would be relaxed rate limiting. For example, if we have a limit of 100 requests per minute, have already served 99 requests in the past 50 seconds, and now get 5 requests at the 55th second of the minute, we can simply allow these 5 requests, for a total of 104 requests in that minute.

Comparison Between Different Algorithms

Pros for different algorithms
Cons for different algorithms

That was all about different algorithms employed in rate limiting. Do let me know in case you have any doubts. Cheers :)
