Replication between clusters is not healthy – SingleStore Support

{question}

What messages/errors will show in the Tracelogs (memsql.log) when replication between two clusters fails?

{question}

{answer}

SingleStore DB can replicate data between two clusters: a Primary cluster (master) and a DR cluster (replica). When a replica is unable to replicate from the master, two classes of messages appear in the Tracelogs on the DR cluster:

1. Error in replication

The following error messages are closely related: each indicates a replication issue on the replica node and reports the error number (the %d after "with error") followed by a string description of the error (the %s in parentheses):

"%s: Slave data write failed with error %d (%s) while in state %s."

"%s: Slave packet read (%d) failed with error %d (%s) while in state %s."

"%s: Slave data read failed with error %d (%s) while in state %s."

These errors are emitted by socket operations, and the error number and description are often sufficient to narrow down where the issue lies (for example, on the local host or in the network).
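As a rough illustration, the error number and description can be pulled out of such a line with a regular expression built from the message formats above. This is a minimal sketch, not an official tool: the function name `parse_replication_error` is hypothetical, and it assumes the message text appears verbatim somewhere in each Tracelog line (the surrounding timestamp and prefix layout is not described in this article, so the pattern searches rather than anchoring at the start of the line).

```python
import re

# Built from the format strings quoted above:
#   "%s: Slave data write failed with error %d (%s) while in state %s."
#   "%s: Slave packet read (%d) failed with error %d (%s) while in state %s."
#   "%s: Slave data read failed with error %d (%s) while in state %s."
REPL_ERROR = re.compile(
    r"Slave (?P<op>data write|data read|packet read)(?: \((?P<packet>-?\d+)\))? "
    r"failed with error (?P<errno>-?\d+) \((?P<desc>.*)\) "
    r"while in state (?P<state>.+?)\.?$"
)


def parse_replication_error(line):
    """Return the fields of a replication error line, or None if no match."""
    match = REPL_ERROR.search(line.strip())
    if match is None:
        return None
    return {
        "operation": match.group("op"),
        "errno": int(match.group("errno")),
        "description": match.group("desc"),
        "state": match.group("state"),
    }
```

For example, given a made-up line such as `db_0: Slave data read failed with error 104 (Connection reset by peer) while in state Online.` (the database name, error values, and state name here are invented for illustration), the function returns the operation `data read`, the error number `104`, and the description `Connection reset by peer`.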

When these errors appear, expect the link between the master and the replica to have been disconnected. Replication management will then attempt to reconnect the replica to the master, so you may see the following message on the replica cluster:

"Trying to establish replication connection for database '%s' from %s:x@'%s':%d/'%s'."

If the replica succeeds in reconnecting to the master, you will see a success message. However, if the replica fails to reconnect to the master, you will see the reconnect errors discussed below.

2. Reconnect errors

The following error messages are self-explanatory and indicate a connection issue between the replica and master:

"Connection failure connecting to node id %d: %s:x@'%s':%d. Failed with error: %s"

"Failure querying remote host to begin replication connection."

"Unexpected response while establishing replication connection."

"Replicating from a master older than 7.0 is not supported."

"Establishing replication connection: peer is not a valid MemSQL peer."

"Failed to setup the new master (term %lu) with the slave (term %lu)."

Further troubleshooting may be required to determine the underlying connection issue between replica and master.

Please note that most error messages relating to replication are logged on the DR replica cluster. The Primary cluster only notices when replicas are disconnected. For example, you may see one of the following disconnect messages in the Tracelogs on the Primary cluster:

"%s: Removed slave at node %ld because of disconnect."

"%s: Disconnecting slave at node %ld because its removal was requested."
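To get a quick overview of which of these messages a Tracelog contains, the fixed portions of the messages quoted in this article can be used as search strings. The following is a minimal sketch under that assumption; the category names and the `classify` and `summarize` helpers are hypothetical and not part of SingleStore, and the matching is a plain substring test, so it depends on the message wording staying exactly as quoted here.

```python
from collections import Counter

# Fixed fragments taken from the messages quoted in this article.
# The category names are illustrative labels only.
PATTERNS = {
    "replication_error": (
        "Slave data write failed with error",
        "Slave data read failed with error",
        "Slave packet read (",
    ),
    "reconnect_attempt": (
        "Trying to establish replication connection",
    ),
    "reconnect_error": (
        "Connection failure connecting to node id",
        "Failure querying remote host to begin replication connection",
        "Unexpected response while establishing replication connection",
        "Replicating from a master older than",
        "peer is not a valid MemSQL peer",
        "Failed to setup the new master",
    ),
    "primary_disconnect": (
        "Removed slave at node",
        "Disconnecting slave at node",
    ),
}


def classify(line):
    """Return the category of a Tracelog line, or None if it matches nothing."""
    for category, fragments in PATTERNS.items():
        if any(fragment in line for fragment in fragments):
            return category
    return None


def summarize(lines):
    """Count how many lines fall into each category."""
    return Counter(c for c in map(classify, lines) if c is not None)
```

A typical use would be to open memsql.log on the relevant node and pass the file object to `summarize`; the log file location depends on the installation and is not specified here. Many `replication_error` and `reconnect_error` hits on the DR cluster, paired with `primary_disconnect` hits on the Primary cluster, would match the pattern this article describes.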

More information on replication can be found here.

{answer}
