{question}
How do I troubleshoot `Leaf Error (127.0.0.1:3307): Unknown database 'testdb_2'` error?
{question}
{answer}
There are a couple of scenarios where you might see this error. One occurs when you query a DR secondary cluster while it is in the middle of reprovisioning some of its partitions from the primary cluster. Another occurs when you query the database while the reference database on a leaf node is in the middle of reprovisioning, which can happen in any cluster environment, with or without DR.
SingleStore will automatically take the steps necessary to ensure the secondary cluster is consistent with the primary cluster. Additionally, SingleStore will automatically take the steps necessary to ensure the reference databases on each of the nodes in the cluster are consistent with the master aggregator's reference database.
For example, suppose the primary cluster is running with redundancy 2 and, due to a network or other irregularity, a replica partition on the secondary cluster gets ahead of the corresponding replica partition on a leaf node in the primary cluster. If that partition on the primary cluster is then promoted to master, SingleStore will automatically drop and reprovision the replica partition on the secondary cluster to make it consistent with the newly promoted master partition on the primary cluster.
If a query is executed while the partition has been removed in preparation for reprovisioning, you will see "Unknown database" errors because the partition does not exist at the moment the query runs.
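The partition database named in the error (testdb_2) corresponds to partition ordinal 2 of the user database testdb. If you want to see where that partition currently lives and what role it holds, one way (assuming the user database in your case is indeed testdb) is to run the following on an aggregator:

-- List the partitions of the database from the error, with their host, port, and role.
SHOW PARTITIONS ON testdb;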
Troubleshooting Guide
Identify if Reprovisioning has occurred
- Consult the information_schema.MV_EVENTS table for reprovisioning events.
- Consider the following error:
Leaf Error (127.0.0.1:3307): Unknown database 'testdb_2'
- Query the information_schema.MV_EVENTS table to confirm whether there were any DATABASE_REPROVISION events for the partition mentioned in the error message above (see the sample query after this list). For example:
+----------------+----------------------+----------+----------------------+-------------------------+
| ORIGIN_NODE_ID | EVENT_TIME           | SEVERITY | EVENT_TYPE           | DETAILS                 |
+----------------+----------------------+----------+----------------------+-------------------------+
| 6              | 2022-02-20T17:48:19Z | NOTICE   | DATABASE_REPROVISION | {"database":"testdb_2"} |
+----------------+----------------------+----------+----------------------+-------------------------+
- Additionally, you will find a trace referencing the removal of files in preparation for reprovisioning in the memsql.log file of the leaf node that hosted the reprovisioned partition.
For example:
3868792805232 2022-04-06 13:27:38.737 INFO: Thread 114618: ProcessContentDescPacket: `testdb_2` log: Removing all files to prepare for reprovisioning.
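For reference, here is one way to pull these events directly from information_schema.MV_EVENTS. The column names come from the sample output above; the LIKE filter on DETAILS is only an illustration and should be adjusted to the partition database named in your error:

-- Find reprovision events mentioning the partition database from the error message.
SELECT ORIGIN_NODE_ID, EVENT_TIME, SEVERITY, EVENT_TYPE, DETAILS
FROM information_schema.MV_EVENTS
WHERE EVENT_TYPE = 'DATABASE_REPROVISION'
  AND DETAILS LIKE '%testdb_2%'
ORDER BY EVENT_TIME DESC;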
Remediation Steps
Step 1) Troubleshoot any long-running queries
- Check the processlist for active long-running queries (see the example after this list).
- Long-running queries can make it difficult for the cluster to keep up with replication, which in turn can trigger a reprovision event.
- If there are no active long-running queries, look at the historical run time of your queries and check whether there are any instances of unusually long-running queries.
- Consult the following documentation to troubleshoot any problematic long-running queries:
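As a starting point, the standard processlist commands will show what is currently running. This is a minimal sketch; the exact columns and the 60-second threshold below are illustrative:

-- Show everything currently running on this aggregator (TIME is elapsed seconds).
SHOW FULL PROCESSLIST;

-- Or filter for statements that have been running longer than 60 seconds.
SELECT ID, USER, DB, TIME, STATE, INFO
FROM information_schema.PROCESSLIST
WHERE COMMAND = 'Query' AND TIME > 60
ORDER BY TIME DESC;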
Step 2) Increase snapshots_to_keep
- If you continue hitting reprovision issues in a DR cluster, consider increasing snapshots_to_keep (by default this is set to 2) on the primary cluster.
- Increasing the value of snapshots_to_keep on the primary will give the DR cluster more time to replicate before it needs to start from scratch and reprovision, as explained above (see the example after this list).
- Please note: confirm that you have enough disk space before increasing snapshots_to_keep, as more snapshots will be saved to disk.
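As a rough sketch of how the variable is typically inspected and adjusted, you could run the following on the master aggregator of the primary cluster. The value 3 is only an example, and depending on your version and deployment you may manage this variable through your configuration tooling rather than SET GLOBAL:

-- Check the current value (default is 2).
SHOW GLOBAL VARIABLES LIKE 'snapshots_to_keep';

-- Retain one additional snapshot per database (example value only).
SET GLOBAL snapshots_to_keep = 3;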
Step 3) Confirm there are no hardware or network-related issues in the cluster.
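From the SQL layer, a quick sanity check is to confirm that every leaf is online and that roundtrip latencies look reasonable; deeper hardware and network diagnosis (disk health, NIC errors, packet loss) happens at the operating-system level. The column names mentioned below reflect typical SHOW LEAVES output and may vary by version:

-- Run on an aggregator: each leaf should report State = 'online', and an unusually
-- high Average_Roundtrip_Latency can point to network trouble.
SHOW LEAVES;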
{answer}