{question}
My Singlestore cluster is using a lot of storage.
How can I see what's consuming it and what can I do about it?
{question}
{answer}
Overview
It is common for the database size to be only a fraction of the overall storage consumption in Singlestore. The rest comes from various factors, mainly transaction logs, snapshots, tracelogs, plancache files, and audit logs. In addition to these, sometimes a core file can be generated too; this is covered at the end of this article.
By default, these directories sit under /var/lib/memsql/<NODE_ID>/, although the exact path may differ depending on how the cluster was deployed.
To identify what's consuming the most space, you can use the Linux command du. This command traverses the directory structure and provides a summary of the disk usage of each entry. For example, the following summarizes the disk usage inside the data directory; here, data/logs is what's consuming the most disk (6.4 GB).
ubuntu@ip-10-0-0-100:/var/lib/memsql/2ce6f569-6e1d-491c-b380-f8954189fc14/data$ du -sh *
92K blobs
4.0K datetime_to_version
4.0K ephemeral_errors
4.0K ingest_staging
6.4G logs
0 memsql.sock
4.0K memsql_id
0 memsql_proxy.sock
4.0K memsql_role
4.0K memsqld.pid
4.0K memsqld_safe.pid
2.5M persisted_errors
1.3M roots
1008K snapshots
4.0K spill
4.0K staging_bottomless
36K tempblobs
4.0K temprestoreblobs
Let's go through these one by one, describe what each is, and understand what you can and cannot do to free up space.
Database Data
All directories and files not listed below are, most probably, database files containing user data; the two main ones are snapshots and blobs. If those are the ones taking the most storage, then the only solution is to clean up your database by dropping unneeded tables or deleting older data.
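For illustration, a couple of hedged cleanup statements (the table and column names here are hypothetical; adapt them to your own schema):
-- Drop a table that is no longer needed
DROP TABLE old_staging_table;
-- Or delete rows older than 90 days from a large table
DELETE FROM events WHERE event_date < NOW() - INTERVAL 90 DAY;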
Transaction Logs
What is it?
Transaction logs are a key piece of any transactional database. These files contain records of all the changes made to the database before being committed to a snapshot. Each partition has its own set of transaction logs. These files are located in data/logs, and their filenames start with the name of the database they belong to.
When the size of the logs reaches the value defined in the snapshot_trigger_size variable, a snapshot is taken and written to disk. A snapshot is a full backup of the database. Following the creation of a snapshot, subsequent DDL and DML updates are written to the logs until snapshot_trigger_size is reached again.
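You can check the current value of this variable (along with snapshots_to_keep, which comes up below) from any SQL client connected to the cluster, for example:
SHOW VARIABLES LIKE 'snapshot_trigger_size';
SHOW VARIABLES LIKE 'snapshots_to_keep';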
How large can the transaction logs grow?
A lot! Each transaction log is 256 MB for data partitions and 64 MB for master partitions. Also, the default snapshot_trigger_size is 2 GB. If a node is running with 8 partitions, this means the transaction logs can use up to 2 GB x 8 partitions = 16 GB. If the cluster has High Availability enabled, this doubles, and if the snapshots_to_keep setting stays at the recommended default of 2, it doubles again. If instead of 1 database the cluster has 10 databases, multiply by 10 again. So for a node with H/A, 10 databases, and 8 partitions per database, we could easily be talking about up to 640 GB just for transaction logs and snapshots.
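Written out, the back-of-the-envelope estimate from this example looks like this:
2 GB snapshot_trigger_size x 8 partitions  = 16 GB
x 2 (High Availability)                    = 32 GB
x 2 (snapshots_to_keep = 2)                = 64 GB
x 10 databases                             = 640 GB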
Can I delete the transaction logs?
NO! This is the same as deleting your data! NEVER manually delete transaction logs. Transaction logs can only be deleted automatically in the following situations:
1. They reach the value set in snapshot_trigger_size and an automatic SNAPSHOT is performed
2. A BACKUP command is issued, which also triggers a SNAPSHOT
3. Some node maintenance operations, such as a cluster upgrade, also cause a SNAPSHOT
4. You manually run SNAPSHOT DATABASE (see the example after this list)
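For example, to manually snapshot a hypothetical database named metrics (the same name used in the du example below), connect to the master aggregator and run:
-- Take a snapshot now; committed logs below the new snapshot become eligible for cleanup
SNAPSHOT DATABASE metrics;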
What can I do to deal with the logs disk usage growth?
If transaction logs are growing as part of expected behavior, there are two options. The easiest is to expand the disk size to accommodate the potential growth. The second is to issue a SNAPSHOT command against the database whose logs are growing. You can identify that database by looking at the log filenames; the Linux du command can quickly tell you which database's logs are growing the most. For example, from inside the node's data/logs directory:
$ du -smc metrics_*
This prints the size in MB of each log file for a database named metrics on that node, with a total at the bottom.
Once you issue the SNAPSHOT command, logs up to the current log sequence number (LSN) are imaged into a snapshot and we clear out the transaction logs that we can (we delete log files that are fully below the oldest snapshot, but we "cache" up to four of them to minimize IO from reallocation).
- After a blob is deleted, it takes the number of snapshots set in snapshots_to_keep before it is physically removed from disk.
- Ideally, you should have enough disk space to hold transaction logs and snapshots at their most saturated point: snapshots_to_keep snapshots, plus transaction logs filled to just before snapshot_trigger_size.
Log preallocation
Transaction logs are preallocated when a database is created: two 256 MB transaction logs are created for each partition of that database. As more transactions run, these logs fill up and new logs are generated until snapshot_trigger_size is hit and a new snapshot is taken. The GC then cleans up all transaction logs other than the initial preallocated ones, and those are renamed to reflect the current LSN.
Plancache
What is it?
Compiled query plans are stored in a plancache for later use. When a plan expires, it remains in the on-disk plancache and is loaded back into memory the next time the query is run. When a node restarts, the in-memory plancache starts off empty and plans are loaded back in from the on-disk plancache as queries are run. This means query plans do not need to be recompiled after a plan expires from the in-memory plancache or after a node restart.
The more unique queries a cluster receives, the more plancache files are generated.
By default, a garbage collector (GC) runs every 12 hours and checks whether plancache files on disk are older than the value defined in the disk_plan_expiration_minutes variable. As long as those cache files are not in use, the GC deletes them.
How large can the plan cache files grow?
Sometimes more plancache files are generated than the GC can delete. This leads to gradual growth of the plancache directory, to the point where it can consume several hundred GB per node.
Can I delete the plan cache files?
Yes and no. You can delete them manually, but to avoid any potential issues, you should stop the node before deleting the files. The procedure is:
1. Stop the node
2. Delete the files in plancache
3. Start the node
4. Repeat for the remaining nodes
As this involves stopping a node, it should only be done in clusters with High Availability, or otherwise during a maintenance window.
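A minimal sketch of that procedure for a single node, assuming a Toolbox-managed cluster (sdb-admin) and using placeholders for the node ID and plancache path:
# 1. Stop the node (get the MemSQL ID from sdb-admin list-nodes)
sdb-admin stop-node --memsql-id <MEMSQL_ID>
# 2. Delete that node's on-disk plancache files
rm -rf <PLANCACHE_DIR>/*
# 3. Start the node again
sdb-admin start-node --memsql-id <MEMSQL_ID>
# 4. Wait for the node to recover, then repeat on the next node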
What can I do to deal with the plan cache disk usage growth?
The variable disk_plan_expiration_minutes is the key here. By default, it is set to expire plan cache files after 14 days (20160 minutes). If you see these files taking too much space, you may want to consider lowering the value. If it's an emergency and you're close to running out of disk, you can temporarily set it to 0. This means the GC will always run and will delete any plan cache files that are not loaded into memory. Also make sure that enable_disk_plan_expiration is set to ON; otherwise, the GC will not even try to run! For more details refer to this article.
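A quick way to check and adjust these settings from a SQL client (a sketch; it assumes the variables are settable at runtime in your version, otherwise persist the change through your usual configuration mechanism):
-- Check the current settings
SHOW VARIABLES LIKE 'disk_plan_expiration_minutes';
SHOW VARIABLES LIKE 'enable_disk_plan_expiration';
-- Example: expire on-disk plans after 7 days instead of the 14-day default
SET GLOBAL disk_plan_expiration_minutes = 10080;
SET GLOBAL enable_disk_plan_expiration = ON;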
Tracelogs
What is it?
The tracelog is where the engine writes all of its diagnostic messages, whether INFO, WARNING, ERROR, or FAILURE. It is located in the tracelogs directory under the file name memsql.log. When the server starts, it opens this file in append mode and begins logging messages.
How large can the tracelog file grow?
If you don't implement a log rotation policy, this file can keep growing until it hits the limits of the disk or the O/S, so it is important to rotate it regularly.
You can rotate the log by moving the memsql.log file and then sending SIGHUP to the memsqld process. This triggers the server to reopen memsql.log and continue writing.
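A minimal manual rotation sketch, assuming the default paths (the memsqld.pid file sits in the node's data directory, as shown in the du listing above):
# Rename the current tracelog; memsqld keeps writing to the renamed file until told to reopen
mv /var/lib/memsql/<NODE_ID>/tracelogs/memsql.log /var/lib/memsql/<NODE_ID>/tracelogs/memsql.log.1
# Tell memsqld to reopen memsql.log and continue writing there
kill -SIGHUP "$(cat /var/lib/memsql/<NODE_ID>/data/memsqld.pid)"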
Can I delete the tracelog files?
Yes, you can delete the file, but make sure you send a kill -SIGHUP to the memsqld process afterwards. Also, keep in mind that these files can be very useful for understanding the cause of specific issues, so it's a good idea to keep a copy somewhere, even outside of the cluster.
What can I do to deal with tracelog disk usage growth?
The best practice is to implement a log rotation policy. Rotation can be based on time (e.g. every 7 days) or on size (e.g. every 100 MB). Given that these are text files, compressing them after rotation achieves a very high compression ratio and frees up disk space. To learn more about how to use logrotate, visit this page.
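For example, a logrotate configuration along these lines could be dropped into /etc/logrotate.d/ (a sketch assuming the default directory layout; adjust the paths to your deployment):
# Rotate the engine tracelog once it reaches 100 MB and keep 7 compressed copies
/var/lib/memsql/<NODE_ID>/tracelogs/memsql.log {
    size 100M
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    postrotate
        # ask memsqld to reopen memsql.log after rotation
        kill -SIGHUP "$(cat /var/lib/memsql/<NODE_ID>/data/memsqld.pid)"
    endscript
}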
Other
You may have enabled certain features that also create big files. For example, both AUDIT LOGS and QUERY LOGS can generate large files over time. To deal with that, follow the same recommendations as for tracelogs and implement a log rotation policy with compression.
Core Files
What is it?
Core files are an important debugging resource. They contain a copy of what was loaded in memory (process state, threads, call stacks, etc.), capturing the status of the system at a given time. There are two situations that can lead to the generation of a core file:
1. Error/Bug: when there's some unhandled exception in the code or a low-level illegal operation is being attempted, such as writing to an invalid memory address.
2. On-demand: sometimes, to diagnose an issue, a support engineer can ask you to generate a core dump.
If cores are generated automatically, multiple core dump files can accumulate over time. Each can be as large as the memory in use by the process that generated it. If we ask you to generate a core dump on demand, we will normally also tell you that you can delete it once it has been uploaded for us to analyze.
Usually, core dumps have the format core.<pid>, where <pid> is the ID of the process that caused the core dump. Your system might have core dumps generated by the memsqld process, which will normally be in the node's data directory, as well as core files from other processes.
You can scan your filesystem to locate core files with the command below:
find / -name "core.*" -type f 2>/dev/null | grep -E "core\.[[:digit:]]+$"
Can I delete the core files?
Unless it's a recent file, it is usually fine to delete these files; recent files might be important for diagnosing a specific issue. In particular, if recent core files are in a Singlestore node folder, you should reach out to Singlestore Support for advice.
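If you only want to review older files, the same find command can be filtered by age, for example anything older than 30 days:
# List core files older than 30 days so you can review them before deleting
find / -name "core.*" -type f -mtime +30 2>/dev/null | grep -E "core\.[[:digit:]]+$"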
{answer}