What causes the BACKUP/RESTORE error "Remote connection timed out waiting for a response from the external endpoint" or "connection reset by peer"?
This error indicates that the backup or restore is failing while trying to connect to the cloud provider. A likely cause is that the EC2 instance metadata server is throttling or limiting access requests, which leads to connection timeouts. Check our Cloud Deployment Recommendations to make sure your deployment follows them for optimized performance. The errors discussed here all point to connection timeouts.
Example of the error:
SQL Error  [HY000]: Leaf Error (10.0.0.0:3306): Remote connection timed out waiting for a response from the external endpoint. Stderr:
2022/04/12 13:40:28 --storage-type s3 --target memsql-prod-backup/OPS/memsql/cluster-backup/MemSQL-PROD/2022-04-01T09:00:45.583Z/spend.backup/spend_126 --output-dir /memsql/data --output-name snapshots/spend_126_snapshot_v1_0
SQL Error  [HY000]: Leaf Error (10.0.0.0:3306): Attempting to get the size of the backup target failed. RequestError: send request failed
caused by: Post "https://sts.amazonaws.com/": read tcp 10.0.0.199:54592->0.0.29.25:443: read: connection reset by peer
How to fix this issue?
Try increasing the following two engine variables:
subprocess_ec2_metadata_timeout_ms: The maximum amount of time, in milliseconds, the engine waits for or retries a request for the metadata used to verify that the cluster is on EC2, from which implicit credentials can be obtained, before timing out. The default is 1000 ms. Increase it incrementally, for example to 5000 ms and, if the error persists, to 10000 ms, 40000 ms, and so on.
subprocess_io_idle_timeout_ms: The maximum amount of time, in milliseconds, the engine waits for or retries a request when connecting to cloud providers before timing out and failing the backup. The default is 240000 ms (4 minutes). When you set this variable, its value is propagated to all nodes.
See the SingleStore engine variables documentation for more information on both variables.
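Before changing anything, it can help to record the current values so they can be restored later. The following is a minimal sketch, assuming you are connected to the master aggregator and that your version supports the standard SHOW GLOBAL VARIABLES syntax:

```sql
-- Inspect the current value of each timeout variable before changing it.
SHOW GLOBAL VARIABLES LIKE 'subprocess_ec2_metadata_timeout_ms';
SHOW GLOBAL VARIABLES LIKE 'subprocess_io_idle_timeout_ms';
```

If neither variable has been changed before, the output should show the defaults described above (1000 and 240000).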
How to set these variables?
Connect to the master aggregator SQL prompt and run the following SET GLOBAL commands, replacing ***** with the desired values:
set global subprocess_ec2_metadata_timeout_ms = *****;
set global subprocess_io_idle_timeout_ms = *****;
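As an illustration, a first attempt might look like the sketch below. The values are examples only: 5000 is the first increment suggested above for the EC2 metadata timeout, and 480000 is a hypothetical choice that simply doubles the 4-minute default. Pick values that suit your environment.

```sql
-- Illustrative values only, not recommendations.
-- First step up from the 1000 ms default.
SET GLOBAL subprocess_ec2_metadata_timeout_ms = 5000;
-- Hypothetical example: double the 240000 ms (4 minute) default.
SET GLOBAL subprocess_io_idle_timeout_ms = 480000;

-- Confirm that the new values are in effect.
SELECT @@subprocess_ec2_metadata_timeout_ms, @@subprocess_io_idle_timeout_ms;
```

After setting the values, retry the backup or restore. If the same error appears, raise the values further and try again.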
If adjusting these variables does not fix the issue, contact SingleStore Support (see How do I file a support ticket?).
Use the links below to learn more about the BACKUP and RESTORE commands: