{question}
My Master Aggregator (MA) suddenly became unstable and is crashing continuously. How can I prevent the MA from crashing, and what is the root cause?
{question}
{answer}
Affected Versions: v7.1.13 or older
Potential Symptoms
- Master Aggregator crashes intermittently
- Master Aggregator is unstable and crashes continuously
- Child Aggregator(s) crash intermittently or on a set cadence
Potential Root Cause
- Likely caused by an issue, fixed in v7.1.14, that affects stored procedures containing a TO_QUERY statement preceded by a CREATE TABLE statement
- The CREATE TABLE statement sets the session's current database context to null
- TO_QUERY references the session's current database context when materializing the query type variable (QTV)
- Dereferencing the null pointer crashes the thread and, with it, the aggregator
- Whenever the culprit procedure is called, the Aggregator (Master or Child) that executes it crashes, hence the intermittent crashes observed
- Cron jobs calling the procedure lead to aggregator crashes that follow the cron schedule
- Background pipelines calling the culprit procedure result in the following sequence:
- The pipeline runs on the Master Aggregator, so the MA crashes
- Ops or the system service watches the Master Aggregator node and restarts it when it is down
- Upon restart, the Master Aggregator restarts background pipelines, which triggers the crash again and leads to a continuous crash loop
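The pattern described above (a CREATE TABLE followed later by a TO_QUERY call in the same procedure) can be searched for with a rough text filter. This is a minimal sketch, not a SQL parser: it assumes each procedure body has already been exported to its own text file by whatever means is available, and has_risky_pattern is a hypothetical helper name. It can report false positives, for example when both keywords share one line in the opposite order.

```shell
# Heuristic check: exit status 0 if the file contains a CREATE TABLE
# statement followed later by a TO_QUERY call, 1 otherwise.
# has_risky_pattern is a hypothetical helper name, not a MemSQL tool.
has_risky_pattern() {
  awk '
    { line = tolower($0) }
    # Remember that a CREATE [modifier ...] TABLE statement has been seen.
    line ~ /create[ \t]+([a-z]+[ \t]+)*table/ { seen_create = 1 }
    # A TO_QUERY call after that point matches the pattern described above.
    seen_create && line ~ /to_query[ \t]*\(/ { found = 1 }
    END { exit (found ? 0 : 1) }
  ' "$1"
}
```

A procedure flagged this way is only a candidate; the backtrace checks in the next section confirm whether it is the one behind the crash.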
Verifying the issue
1. Using dmp.stack file
- Locate the dmp.stack file created for the crash in the data directory of the master node
- dmp.stack filename format: YYYY-MM-DD_HH:MM:SS.dmp.stack
2. Using memsql.log for the crashing Aggregator
- Location: the tracelogs subdirectory under the main node directory (default: /var/lib/memsql/<nodedir>)
- Locate the backtrace at the end of the log
- If the Aggregator is crashing continuously, make a copy of the log and locate the last restart; the backtrace appears just before the restart messages
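When a node has crashed many times, the data directory may hold many dump files. The sketch below picks the newest one; it relies only on the filename format given above, whose timestamp prefix sorts chronologically. The function name latest_stack_dump is hypothetical, and the directory to pass in depends on your installation.

```shell
# Print the most recent *.dmp.stack file in the given directory.
# Relies on the YYYY-MM-DD_HH:MM:SS name prefix sorting chronologically,
# so no file modification times are consulted.
latest_stack_dump() {
  find "$1" -maxdepth 1 -type f -name '*.dmp.stack' | sort | tail -n 1
}
```

Call it with the node's data directory as the only argument; it prints nothing if the directory contains no dump files.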
Format of the first line in the backtrace:
query: call <culprit_procedure_name_here>
The rest of the stack in the file may contain some or all of the following function names:
CodePrinterV5_DoNotUse::PrintBacktickedSqlName()
OperatorTable::ToSQL
OperatorSelect::ToSQL
GetQueryStringWithTypeCasts
StrToQuery()
opToQuery()
SAMPLE dmp.stack file
query: call populatecalllegdmandcdrdm()
[libmemtrack.so (0x7f883093e32b)] backtrace 0x3B
[memsqld (0x33470c5)] PrintCallStack(_IO_FILE*) 0x25
[memsqld (0x16e8960)] RegisterCrashReport 0xD0
/opt/memsql-server-7.1.11-6c108deb15/memsqld() [0x12388a8]
[libpthread.so.0 (0x7f88303107e0)] 0x117E0
[libc.so.6 (0x7f882bc6f006)] 0x91006
[memsqld (0x31ffce8)] CodePrinterV5_DoNotUse::PrintBacktickedSqlName(char const*, int, bool) 0x128
[memsqld (0x1dab164)] OperatorTable::ToSQL(QueryBuilder&) const 0x804
[memsqld (0x1daa6b2)] OperatorSelect::ToSQL(QueryBuilder&) const 0x582
[memsqld (0x1b3245e)] GetQueryStringWithTypeCasts(Types::QueryType const*, Query*, std::string&) 0x17E
[memsqld (0x1e1c23e)] StrToQuery(QueryTypeVariable*, char const*, char const*, unsigned int, char const*, char const*, char const*, char const*) 0x2DE
[memsqld (0x12642c1)] opToQuery 0x91
[0x7f86abf4df34]
[0x7f86abf35088]
[memsqld (0x169f6c7)] ExecuteImpl::CallOrEcho() 0x1E7
[memsqld (0x166fed7)] MemsqlAutoParamExecute(QueryContext&, char*, unsigned int, char*&, EOQ_PACKET_MODE, QueryStats&, ConnectionTask&, int, int, bool&, bool&) 0x1CC7
[memsqld (0x1690664)] MemSqlExecute(char*, unsigned int, int, int, EOQ_PACKET_MODE, QueryContext&) 0x184
[memsqld (0x1607a02)] HandleRequest(ConnectionContext*, char*, unsigned long) 0x4C2
[memsqld (0x1608ac1)] ReadAndExecute(ConnectionContext*) 0x171
[memsqld (0x1608e0b)] ConnectionThreadScheduler::HandleConnectionThread(voi
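A dump can be checked against this signature with two small text filters. This is a sketch written for the dmp.stack layout shown in the sample, where the query line starts at column one; lines in memsql.log may carry a prefix, in which case the first pattern would need adjusting. The function names culprit_procedure and signature_hits are hypothetical.

```shell
# Print the procedure from the first "query: call ..." line of a dump.
culprit_procedure() {
  sed -n 's/^query: call[[:space:]]*//p' "$1" | head -n 1
}

# Count the lines that mention any of the function names listed above.
# grep exits non-zero when nothing matches, so "|| true" keeps the
# function usable in scripts that run with "set -e".
signature_hits() {
  grep -c -E 'PrintBacktickedSqlName|OperatorTable::ToSQL|OperatorSelect::ToSQL|GetQueryStringWithTypeCasts|StrToQuery|opToQuery' "$1" || true
}
```

A non-empty procedure name together with a hit count above zero suggests the crash matches this issue; since the stack may contain only some of the names, no particular count is required.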
How to stabilize the Aggregator
1. Identify whether the culprit stored procedure is called by a pipeline, a cron job, or an application
2. For cron job or application calls, stop calling the procedure
3. For pipeline-triggered crashes
- Set the global variable pipelines_max_concurrent to 0 using the command for your management tool (MemSQL Ops vs. MemSQL Tools)
- MemSQL Ops:
memsql-ops memsql-update-config --key pipelines_max_concurrent --value 0 --all
- MemSQL Tools:
memsql-admin update-config --key pipelines_max_concurrent --value 0 --all
- This prevents the Master Aggregator from starting background pipelines, which stabilizes the MA
- Once the MA is stable, stop the culprit pipeline, then reset the global variable to its default value (or its original value) using the command for your management tool (MemSQL Ops vs. MemSQL Tools)
- MemSQL Ops:
memsql-ops memsql-update-config --key pipelines_max_concurrent --value 50 --all
- MemSQL Tools:
memsql-admin update-config --key pipelines_max_concurrent --value 50 --all
4. Upgrade to the latest 7.1.x release (v7.1.14 or later) or to a higher version (the fix is included in v7.3 GA)
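For step 3, which of the two commands applies depends on the management tool installed on the host. The sketch below prints the matching command for a given value instead of running it, so it can be reviewed first. The helper name pipelines_cmd is hypothetical, and preferring memsql-admin when both tools are present is an assumption, not a documented rule.

```shell
# Print the update-config command for the management tool found on PATH.
# $1 is the value to set, e.g. 0 to pause background pipelines or 50 to
# restore the value used in the steps above.
pipelines_cmd() {
  if command -v memsql-admin >/dev/null 2>&1; then
    echo "memsql-admin update-config --key pipelines_max_concurrent --value $1 --all"
  elif command -v memsql-ops >/dev/null 2>&1; then
    echo "memsql-ops memsql-update-config --key pipelines_max_concurrent --value $1 --all"
  else
    echo "neither memsql-admin nor memsql-ops found on PATH" >&2
    return 1
  fi
}
```

Running pipelines_cmd 0 prints the command that pauses pipelines; pipelines_cmd 50 prints the one that restores the value once the culprit pipeline has been stopped.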
{answer}