How to optimize your SQL Database to handle millions of records — part 1
Data handling can be a mess, especially when we deal with a huge amount of data. Over the years, we have realised that the bottleneck of a project is usually not the application layer but the database layer.
I know we can fix this by:
- Upgrading our databases
- Scaling out our databases to multiple read-only nodes
- Replacing them with distributed databases (Vitess, CockroachDB, TiDB, etc.)
But are these options necessary most of the time? Even GitHub handles its load with MySQL clusters and partitioning; you can check it out here.
In my opinion, unless your traffic is growing uncontrollably (say, a million QPS), you probably don’t need to go for the options above. Is there a more cost-effective way to optimize your SQL database? Yes, the answer is “Avoid toxic queries”.
Having said that,
Prevention is always better than cure.
No matter how strong your database is, one toxic query can bring it to an unusable state. Here are some guidelines we follow all the time to keep our database healthy and performing even better than usual, and you may refer to them as well:
Cursor-based pagination
Offset-based pagination does not scale the way cursor-based pagination does: the deeper the offset, the more rows the database has to scan and discard, so query performance degrades over time (especially when records keep growing without housekeeping).
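A minimal sketch of the difference, assuming a hypothetical posts table with an indexed id column (the syntax works on both MySQL and PostgreSQL):

-- Offset-based: the database still reads and throws away the first 100000 rows
SELECT id, title FROM posts ORDER BY id LIMIT 20 OFFSET 100000;

-- Cursor-based: remember the last id the client saw and seek straight into the index
SELECT id, title FROM posts WHERE id > 100000 ORDER BY id LIMIT 20;

The second form stays fast no matter how deep the client pages, because the index seek does not depend on how many rows come before the cursor.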
Avoid auto-increment primary key
Most systems today no longer target a single region; they could be serving a global market. An auto-increment primary key is no longer able to fulfill our needs, because it is not recyclable and reusable across regions. To future-proof your schema, I would suggest going for a UUID or a timestamp as the id.
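For example, here is a MySQL 8 flavoured sketch with hypothetical orders tables (UUID_TO_BIN keeps the key compact; a plain CHAR(36) column also works):

-- Auto-increment: a single counter that is hard to reuse across regions
CREATE TABLE orders_auto (
  id BIGINT AUTO_INCREMENT PRIMARY KEY,
  amount DECIMAL(10, 2)
);

-- UUID: every region or service can generate ids independently
CREATE TABLE orders_uuid (
  id BINARY(16) PRIMARY KEY DEFAULT (UUID_TO_BIN(UUID())),
  amount DECIMAL(10, 2)
);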
Avoid joining table records (left join, outer join, inner join, etc.)
Joining is expensive, especially without a data range (record limit): as your records increase, the join operation keeps taking extra milliseconds. We never join records in the database; we push the operation to the application layer and use goroutines to fetch the pieces concurrently.
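A rough sketch of what this looks like at the query level, with hypothetical orders and users tables; the two bounded queries replace the join, and the application (for us, concurrent goroutines) stitches the results together:

-- Instead of one unbounded join ...
SELECT o.id, o.amount, u.name
FROM orders o
INNER JOIN users u ON u.id = o.user_id;

-- ... fetch a bounded page of orders first
SELECT id, amount, user_id FROM orders ORDER BY id LIMIT 100;

-- ... then look up only the users referenced by that page
SELECT id, name FROM users WHERE id IN (101, 102, 103); -- ids collected from the previous result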
Don’t use SQL lock
Yes, SQL locks are a good mechanism for ensuring data consistency. But in high-throughput scenarios, the database starts to take a performance hit when there are many concurrent locks. Consider replacing them with a conditional update or a Redis distributed lock.
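A minimal sketch of the conditional-update approach, assuming a hypothetical wallets table where a balance must never go negative:

-- Pessimistic locking: the row stays locked for the whole transaction
BEGIN;
SELECT balance FROM wallets WHERE id = 42 FOR UPDATE;
UPDATE wallets SET balance = balance - 100 WHERE id = 42;
COMMIT;

-- Conditional update: no lock is held, the WHERE clause guards consistency
UPDATE wallets
SET balance = balance - 100
WHERE id = 42 AND balance >= 100;
-- zero affected rows means the condition failed, so the application retries or rejects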
Avoid aggregation functions
SUM, AVG, COUNT, and similar functions are very toxic in SQL databases because they scan every matching row; these operations need special treatment. Try to handle them with Elasticsearch or a wide-column database instead.
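If the numbers have to stay inside the SQL database, one common workaround (not the article’s main recommendation, just an in-SQL alternative) is to pre-aggregate into a counter table that is updated on every write instead of counting at read time. A MySQL-flavoured sketch with a hypothetical orders table:

-- Counting at read time scans every matching row
SELECT COUNT(*) FROM orders WHERE status = 'PAID';

-- A pre-aggregated counter, bumped alongside each write
CREATE TABLE order_counters (
  status VARCHAR(20) PRIMARY KEY,
  total  BIGINT NOT NULL DEFAULT 0
);

INSERT INTO order_counters (status, total) VALUES ('PAID', 1)
ON DUPLICATE KEY UPDATE total = total + 1;

SELECT total FROM order_counters WHERE status = 'PAID';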
Try to use SQL functions only with single-record queries
The query below is pretty bad when you have a million rows in TableA: your SQL database will execute the SUBSTR function a million times at runtime. Try to replace it with a generated column (precompute the value).
SELECT SUBSTR("ColumnA", 1, 3) FROM "TableA";
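A sketch of the generated-column fix; the STORED generated column below is supported by MySQL 5.7+ and PostgreSQL 12+, and the double-quoted identifiers follow the query above (PostgreSQL style, or MySQL with ANSI_QUOTES). The prefix column name is just an example:

-- Compute the prefix once per write instead of once per row on every read
ALTER TABLE "TableA"
  ADD COLUMN "ColumnAPrefix" VARCHAR(3)
  GENERATED ALWAYS AS (SUBSTR("ColumnA", 1, 3)) STORED;

-- Reads now return the stored value directly, and the column can even be indexed
SELECT "ColumnAPrefix" FROM "TableA";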
OK, a lot of you may ask: if we are not supposed to use this and should skip that, why do SQL databases offer these operations natively? They do offer them, but you need to use them wisely based on the use case, not just because you can.
For more techniques, you may follow part 2.
To be continued…