Understanding and resolving High Cache Utilization scenarios in your Helios Workspace Group – SingleStore Support

{question}

How can I resolve high persistent cache utilization on the Helios Workspace Group?

{question}

{answer}

Utilization dashboards can be found in the SingleStore Portal under Monitoring -> Historical Monitoring. Here, Cache monitoring is used to analyze the disk usage across all nodes in your Deployments.

Click on Open Cache Monitoring, and the Persistent Cache Monitoring dashboard will appear.

The primary filters for this dashboard are located at the top.

The workspace filters specific workspaces. The components dropdown filters based on the node type (MA, CA, or the leaf nodes). The filters in the top right corner allow you to set the time range and the dashboard's auto-refresh rate.

Read the Cache monitoring chart definitions to understand what each graph indicates.

High Cache Utilization

During high cache utilization, the Distribution of Components using Cache and the Breakdown of Cache Utilization by Data graphs are helpful for understanding which SingleStore component is consuming the most disk space.

SingleStore stores your actual data on the leaf nodes during periods of high cache utilization or LOW_DISK_SPACE events. It's important to filter for the correct workspace and the appropriate node type component.

The Persistent Cache Utilization graph shows the percentage of disk space used. It is advisable to filter for the specific workspace you wish to examine and select DML-Processing (Leaves) under components.

The Distribution of Components using Cache chart shows the cache utilization by data, plancache, auditlogs, and tracelogs.

In the Helios Cloud platform, auditlogs and tracelogs generally occupy only a fraction of the total space, as these are frequently rotated and their data retention is controlled by auditlog_retention_period and tracelog_retention_period, respectively.

The main components to examine are your data and plancache. In the example above, the plancache is larger than the data for a specific deployment.

Understanding Plancache

When a query is first run, it activates code generation, gets optimized, and is translated into lower-level machine language. During the first run of the query, the code generation will be recorded in the logs as plans. This lower-level version of the query or the query plans is stored on disk in the plancache for later use.

Managing on-disk Plancache

A plan that has not been read from disk within the specified disk_plan_expiration_minutes limit will expire from the on-disk plancache and be removed from disk.
Dropping plans is safe, but it incurs a cost of slower query performance on the next execution, as the plans need to be generated again for the query.

To delete all plans from the on-disk plancache without waiting for garbage collection, set disk_plan_expiration_minutes to 0. Setting this engine variable to 0 will mark the plans in the plancache as invalid and lead to their subsequent deletion by the garbage collector.

SHOW VARIABLES LIKE 'disk_plan_expiration_minutes'; /* note this value */

Set the plan expiration minutes engine variables to 0 and invoke SLEEP to allow the garbage collector time to delete the plans

SET GLOBAL disk_plan_expiration_minutes=0;
SELECT SLEEP(300);

Reset the disk plan expiration minutes engine variable back to the previous value.

SET GLOBAL disk_plan_expiration_minutes=<oldvalue>;

Understanding Disk Usage by Data Component

If disk utilization is higher in the data component, it is important to examine the Breakdown of Cache Utilization by Data chart. As shown in the example below, data occupies most of the disk space.

The major usage in this scenario is by the blobs, which represent the actual data stored in SingleStore. If the /blobs and /snapshots are consuming the most storage, then you need to either clean up the databases (drop unneeded tables or delete older data) or scale up your workspace. You can only scale your cache as outlined in the Cache Configuration scaling document.

Note on Disk Utilization Under data/others

In some scenarios, disk utilization may increase due to data stored under the data/others component. This component includes everything other than blobs, snapshots, and transaction logs.

If queries are spilling to disk and a rise in disk usage is specifically observed, it typically indicates that this increase is due to query spilling activity.

NOTE for Aggregators: If you see high cache utilization under data/others in your Master Aggregator, this is likely due to the temporary ingest files files stored when pipelines are running. This temp files are removed once data is loaded on to the destination table. Running high number of concurrent pipelines can lead to this scenario and you can try limiting the pipeline concurrency by toggling pipelines_max_concurrent variable to a low value or use a staggered ingestion approach to allow the pipeline batches to complete and alleviate the disk pressure on the MA.

Transaction logs

Transaction logs contain records of all the changes made to the database before being committed to a snapshot. Each partition will have its own set of transaction logs. Depending on the number of partitions per database and the number of databases, the transaction logs can build up to very large values. The maximum size of a transaction logs is set by snapshot_trigger_size (by default 2GB), so for example, if you have 8 partitions per db, the transaction logs can build up to 8 * 2 =16GB, with High Availability this will double to 32GB, this will again double if snapshots_to_keep variable is set to the default value of 2 = 64GB. If you have 10 databases, your transaction logs can build up to 640GB alone.

If the major contributor to disk utilization is the /transaction logs, then it is recommended to run `SNAPSHOT DATABASE <database_name>;` 2-3 times for all large databases. Running Snapshots flushes the transaction logs to a snapshot, which reduces disk usage.

NOTE: Another likely reason for build up of transaction logs is long running queries that are blocking the flushing of transaction logs to a snapshot. If running snapshot database does not reduce the cache usage by transaction logs, check if you have any long running queries and either kill or let the queries complete before running snapshot database again.

Additionally, depending on the number of databases and partitions as highlighted above, you might actually need to scale up your workspace cache configuration to tackle the buildup of transaction logs.

{answer}

Articles in this section

High Cache Utilization

Understanding Plancache

Managing on-disk Plancache

Understanding Disk Usage by Data Component

Related articles