{question}
How do I troubleshoot `Leaf Error (127.0.0.1:3307): Unknown database 'testdb_2'` error?
{question}
{answer}
There are a couple of scenarios where you might see this error. One occurs when you query a DR secondary cluster while it is in the middle of reprovisioning some of its partitions from the primary cluster. Another occurs when you query the database while the reference database on a leaf node is in the middle of reprovisioning, which can happen in any cluster environment, with or without DR.
SingleStore will automatically take the steps necessary to ensure the secondary cluster is consistent with the primary cluster. Additionally, SingleStore will automatically take the steps necessary to ensure the reference databases on each of the nodes in the cluster are consistent with the master aggregator's reference database.
For example, suppose the primary cluster is running with redundancy 2 and, due to a network or other irregularity, a replica partition on the secondary cluster gets ahead of the corresponding replica partition on a leaf node in the primary cluster. If that partition on the primary cluster is then promoted to master, SingleStore will automatically drop and reprovision the replica partition on the secondary cluster to make it consistent with the newly promoted master partition on the primary cluster.
If a query is executed while the partition has been removed in preparation for reprovisioning, you will see "Unknown database" errors because the partition does not exist at the moment the query runs.
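The partition database named in the error (testdb_2) corresponds to partition ordinal 2 of the user database testdb. If you want to see where that partition currently lives and what role it holds, one way (assuming the user database in your case is indeed testdb) is to run the following on an aggregator:

-- List the partitions of the database from the error, with their host, port, and role.
SHOW PARTITIONS ON testdb;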
Troubleshooting Guide
Identify if Reprovisioning has occurred
- Consult the information_schema.MV_EVENTS table for reprovisioning events.
- Consider the following error:
Leaf Error (127.0.0.1:3307): Unknown database 'testdb_2'
- Query the information_schema.MV_EVENTS table to confirm whether there were any DATABASE_REPROVISION events for the partition mentioned in the error message above (see the sample query after this list). For example:
+----------------+----------------------+----------+----------------------+-------------------------+
| ORIGIN_NODE_ID | EVENT_TIME           | SEVERITY | EVENT_TYPE           | DETAILS                 |
+----------------+----------------------+----------+----------------------+-------------------------+
| 6              | 2022-02-20T17:48:19Z | NOTICE   | DATABASE_REPROVISION | {"database":"testdb_2"} |
+----------------+----------------------+----------+----------------------+-------------------------+
- Additionally, you will find a trace referencing the removal of files in preparation for reprovisioning in the memsql.log file of the leaf node that hosted the reprovisioned partition.
For example:
3868792805232 2022-04-06 13:27:38.737 INFO: Thread 114618: ProcessContentDescPacket: `testdb_2` log: Removing all files to prepare for reprovisioning.
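For reference, here is one way to pull these events directly from information_schema.MV_EVENTS. The column names come from the sample output above; the LIKE filter on DETAILS is only an illustration and should be adjusted to the partition database named in your error:

-- Find reprovision events mentioning the partition database from the error message.
SELECT ORIGIN_NODE_ID, EVENT_TIME, SEVERITY, EVENT_TYPE, DETAILS
FROM information_schema.MV_EVENTS
WHERE EVENT_TYPE = 'DATABASE_REPROVISION'
  AND DETAILS LIKE '%testdb_2%'
ORDER BY EVENT_TIME DESC;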
Remediation Steps
Step 1) Troubleshoot any long-running queries
- Check the processlist for active long-running queries (see the example after this list).
- Long-running queries can make it difficult for the cluster to keep up with replication, which in turn can trigger a reprovision event.
- If there are no active long-running queries, look at the historical run time of your queries and check whether there are any instances of unusually long-running queries.
- Consult the following documentation to troubleshoot any problematic long-running queries:
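As a starting point, the standard processlist commands will show what is currently running. This is a minimal sketch; the exact columns and the 60-second threshold below are illustrative:

-- Show everything currently running on this aggregator (TIME is elapsed seconds).
SHOW FULL PROCESSLIST;

-- Or filter for statements that have been running longer than 60 seconds.
SELECT ID, USER, DB, TIME, STATE, INFO
FROM information_schema.PROCESSLIST
WHERE COMMAND = 'Query' AND TIME > 60
ORDER BY TIME DESC;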
Step 2) Increase snapshots_to_keep
- If you continue hitting reprovision issues in a DR cluster, consider increasing snapshots_to_keep (by default this is set to 2) on the primary cluster.
- Increasing the value of snapshots_to_keep on the primary will give the DR cluster more time to replicate before it needs to start from scratch and reprovision, as explained above (see the example after this list).
- Please note: confirm that you have enough disk space before increasing snapshots_to_keep, as more snapshots will be saved to disk.
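As a rough sketch of how the variable is typically inspected and adjusted, you could run the following on the master aggregator of the primary cluster. The value 3 is only an example, and depending on your version and deployment you may manage this variable through your configuration tooling rather than SET GLOBAL:

-- Check the current value (default is 2).
SHOW GLOBAL VARIABLES LIKE 'snapshots_to_keep';

-- Retain one additional snapshot per database (example value only).
SET GLOBAL snapshots_to_keep = 3;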
Step 3) Confirm there are no hardware or network-related issues in the cluster.
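From the SQL layer, a quick sanity check is to confirm that every leaf is online and that roundtrip latencies look reasonable; deeper hardware and network diagnosis (disk health, NIC errors, packet loss) happens at the operating-system level. The column names mentioned below reflect typical SHOW LEAVES output and may vary by version:

-- Run on an aggregator: each leaf should report State = 'online', and an unusually
-- high Average_Roundtrip_Latency can point to network trouble.
SHOW LEAVES;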
{answer}