How GOP Handles Traffic Spikes

A big traffic spike can cause your website to fail: an overwhelming number of requests will crash your site if it isn’t equipped to take the pressure. This post introduces what a traffic spike is, its potential impact, and how GOP, as an open platform serving all Garena products, handles traffic spikes to keep our service available 24/7.

What Is a Traffic Spike?

The term spike refers to a sudden and dramatic increase in incoming requests. It is often understood to be a short-term event, taking place over a few minutes or hours, although a spike can continue over a longer period.

The Potential Impact of a Spike

Imagine that there is a big in-game event and millions of users come to participate. Unfortunately, your service isn’t ready for this. Your database starts to topple under the load. With degraded performance, users start to reload or retry from the browser or game client, hoping to get a request through, which adds even more load to your server.
Then you are in trouble. Believe me: unless the incoming requests stop, your service will eventually go down because the database cannot respond as expected.

Get Ready for a Traffic Spike

The first step is to check your server capacity and network bandwidth. It is also better to deploy your service in Docker containers, so that you can easily scale it up and down according to the traffic volume.
It is always good to have more powerful servers; however, as engineers, we should focus on the application layer and optimise it as much as possible.
To handle a spike, the key is to protect your database: in our past experience, 80% of performance issues come from the database. Here are some suggestions:

  1. Design your database with proper indexes.
  2. Separate writes from reads: fetch data from slave nodes, and send only updates to the master node.
  3. Put a cache in front of the database layer for reads; 80% of the data should be returned from the cache (a sketch follows this list).
  4. Keep an eye on your table sizes, and shard your tables when necessary.
  5. Implement a rate-limit mechanism so you can block requests to the database in an emergency and keep it able to respond.
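
To make suggestions 2 and 3 concrete, here is a minimal cache-aside read sketch. It assumes a Redis cache via redis-py; fetch_user_from_replica, the key format and the TTL are hypothetical, for illustration only.

import json

import redis

cache = redis.Redis(host='localhost', port=6379)

def get_user(uid):
    key = 'user:%s' % uid
    cached = cache.get(key)
    if cached is not None:
        # Cache hit: most reads should be served here without touching the database.
        return json.loads(cached)
    # Cache miss: read from a slave node, then populate the cache with a TTL.
    user = fetch_user_from_replica(uid)  # hypothetical read against a slave node
    cache.set(key, json.dumps(user), ex=300)  # expire after 5 minutes
    return user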

How GOP Prepared for Spikes

However, sometimes bad things happen. As a platform serving many games, applications and websites, GOP has to do more for performance and robustness to ensure our service is available all the time, no matter what happens on the game side.
From our monitoring system, during Free Fire’s peak hours the QPS of login requests is more than 100K, which puts a lot of pressure on our master database, since we need to update the login grant time of every user who logs in. Consider also that network flapping or an unexpected bug in a game can cause a login spike, which makes things even more challenging.
You may realise that none of the five suggestions in the last section helps when the load is dominated by database write operations. Since the GOP login API updates the user’s login grant time, this write is definitely a bottleneck when traffic is heavy.
After some analysis, we found that the user’s login grant time is not used at all during the login process. A better approach is to queue the update requests somewhere and apply them asynchronously, so we can easily control the speed of database writes and never worry about a spike again.

Kafka as a Message Queue

When it comes to message queues, there are a few options; it is a big topic and I will not cover it today.
We decided to use Kafka as our message queue since we have some experience with it. You can explore Kafka further here: https://kafka.apache.org/intro

After setting up Kafka, we put the update request data into Kafka during user login:

import json  # needed for serialising the payload below

def update_grant_time(platform, uid, app_id):
    # get_timestamp, _producer and log are GOP-internal helpers.
    now = get_timestamp()
    grant_data = {
        'platform': platform,
        'uid': uid,
        'app_id': app_id,
        'now': now,
    }
    # Enqueue the update into Kafka instead of writing to the database directly,
    # so the login API returns without waiting on the master database.
    offset = _producer.produce(json.dumps(grant_data))
    log.data('enqueue_update_grant_time|request_data=%s', grant_data)
    return True
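
For reference, here is one way the module-level _producer could be built. GOP’s actual wrapper is not shown in this post, so the confluent-kafka client, the broker address and the topic name below are all assumptions.

from confluent_kafka import Producer

# Hypothetical broker address; in production this would come from configuration.
producer = Producer({'bootstrap.servers': 'kafka-broker-1:9092'})

def produce(payload):
    # Hand the message to the client's internal buffer; delivery to the broker
    # happens asynchronously in a background thread.
    producer.produce('login_grant_time', value=payload)  # hypothetical topic name
    producer.poll(0)  # serve any pending delivery callbacks without blocking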

A Dedicated Consumer to Update the Database

The next step is to consume the update requests produced by the login API and update the user’s login grant time. Considering that we may want to convert other, similar data updates to this asynchronous approach, we wrote a dedicated new service to consume the data.
Here, LoginConsumer is run by multiple workers. The worker count is controlled via a config file, so we can easily scale it up and down.

class LoginConsumer(KafkaBaseConsumer):

    def handle_task(self, offset, record):
        # Make sure this worker's database connections are still usable.
        refresh_db_connections()
        platform = record['platform']
        uid = record['uid']
        app_id = record['app_id']
        last_use_time = record['now']
        # The actual master-database write happens here, at a pace we control.
        grant_manager.update_grant_use_time(platform, uid, app_id, last_use_time)
        log.data('update_grant_use_time|offset=%d,record=%s,worker_id=%s', offset, record, self.worker_id)

In LoginConsumer.handle_task, the user’s login grant time is updated.
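
KafkaBaseConsumer is GOP-internal and not shown here, but a worker launcher along the following lines would match the behaviour described above. The confluent-kafka client, the topic, group id, constructor arguments and worker count are all assumptions for illustration.

import json
import multiprocessing

from confluent_kafka import Consumer

def run_worker(worker_id):
    consumer = Consumer({
        'bootstrap.servers': 'kafka-broker-1:9092',  # hypothetical broker
        'group.id': 'login-grant-time-updater',      # one group shared by all workers
        'auto.offset.reset': 'earliest',
    })
    consumer.subscribe(['login_grant_time'])         # hypothetical topic name
    worker = LoginConsumer(worker_id=worker_id)      # constructor is assumed
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        worker.handle_task(msg.offset(), json.loads(msg.value()))

if __name__ == '__main__':
    worker_count = 8  # in GOP this comes from the config file
    for worker_id in range(worker_count):
        multiprocessing.Process(target=run_worker, args=(worker_id,)).start()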

Summary

To conclude, by putting user login grant time update requests into a Kafka queue and consuming them separately, we benefit in the following ways:

  1. Decoupling the login grant time update from the login API makes the code logic cleaner.
  2. Since Kafka handles high concurrency well, the login API has never experienced issues from traffic spikes.
  3. Update requests are consumed by a dedicated service, so we can control the update speed to suit our needs.
  4. The service can easily be reused to support similar asynchronous data update scenarios.