How to fix a disconnecting Graylog node after upgrading from version 3.0 to 3.1


I upgraded Graylog from version 3.0 to 3.1 and immediately ran into issues with my master node: it was disconnecting every few seconds.

The message "Notification condition [NO_MASTER] has been fixed." was logged in System messages, and "[NodePingThread] Did not find meta info of this node. Re-registering." appeared repeatedly in the Graylog log file on the master node.

[[email protected] ~]$ tail -f /var/log/graylog-server/server.log
[...]
2020-04-12T12:20:22.777Z WARN  [NodePingThread] Did not find meta info of this node. Re-registering.
2020-04-12T12:20:33.761Z WARN  [NodePingThread] Did not find meta info of this node. Re-registering.
2020-04-12T12:22:13.708Z WARN  [NodePingThread] Did not find meta info of this node. Re-registering.
2020-04-12T12:22:57.770Z WARN  [NodePingThread] Did not find meta info of this node. Re-registering.
[...]
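To see how often the node re-registers, it can help to pull just the timestamps of these warnings out of the log. The following is a minimal sketch; the function name reregister_timestamps is hypothetical, and it assumes the timestamp is the first whitespace-separated field, as in the log excerpt above.

```shell
#!/bin/sh
# Print the timestamp (first field) of every NodePingThread
# "Did not find meta info of this node" warning in the given log file.
# reregister_timestamps is a hypothetical helper name, not a Graylog tool.
reregister_timestamps() {
    awk '/\[NodePingThread\] Did not find meta info of this node/ { print $1 }' "$1"
}

# Example (path taken from the article):
#   reregister_timestamps /var/log/graylog-server/server.log
```

Piping the output through wc -l gives the total number of re-registrations recorded in the file.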

The solution is to increase stale_master_timeout from its default of 2000 milliseconds (2 seconds) in the Graylog server configuration file, server.conf.

# Time in milliseconds after which a detected stale master node is being rechecked on startup.
#stale_master_timeout = 2000

Inspect the current stale_master_timeout value.

[[email protected] ~]$ grep stale_master_timeout /etc/graylog/server/server.conf
#stale_master_timeout = 2000

Increase stale_master_timeout to 10 seconds.

[[email protected] ~]$ sudo sed -i -e "s/#stale_master_timeout = 2000/stale_master_timeout = 10000/" /etc/graylog/server/server.conf

Verify the updated stale_master_timeout value.

[[email protected] ~]$ grep stale_master_timeout /etc/graylog/server/server.conf
stale_master_timeout = 10000
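The sed command above only matches the exact commented-out default line. If the setting might already be uncommented or carry a different value, a slightly more defensive variant could look like the sketch below. The function name set_stale_master_timeout is hypothetical, and the sketch assumes GNU sed (for -i and -E), as on the system used in this article.

```shell
#!/bin/sh
# Set stale_master_timeout in a Graylog server.conf-style file.
# set_stale_master_timeout is a hypothetical helper, not part of Graylog.
# Usage: set_stale_master_timeout <config-file> <milliseconds>
set_stale_master_timeout() {
    conf=$1
    value=$2

    # Accept only a plain non-negative integer, since the value is
    # interpolated into the sed expression below.
    case $value in
        ''|*[!0-9]*)
            echo "value must be a number of milliseconds" >&2
            return 1
            ;;
    esac

    # Keep a copy of the file as it was before this call.
    cp -p "$conf" "$conf.bak" || return 1

    pattern='^[[:space:]]*#?[[:space:]]*stale_master_timeout[[:space:]]*='
    if grep -Eq "$pattern" "$conf"; then
        # Replace the existing line, whether commented out or not.
        sed -i -E "s/$pattern.*/stale_master_timeout = $value/" "$conf"
    else
        # The setting is absent, so append it.
        printf 'stale_master_timeout = %s\n' "$value" >> "$conf"
    fi
}

# Example (needs root for the real file):
#   set_stale_master_timeout /etc/graylog/server/server.conf 10000
```

Running it a second time with another value rewrites the same line instead of adding a duplicate. Note that the .bak file always reflects the state just before the most recent call.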

Restart the graylog-server service.

[[email protected] ~]$ sudo systemctl restart graylog-server

Inspect the graylog-server service status.

[[email protected] ~]$ systemctl status graylog-server
● graylog-server.service - Graylog server
   Loaded: loaded (/usr/lib/systemd/system/graylog-server.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2020-04-12 12:28:25 UTC; 6s ago
     Docs: http://docs.graylog.org/
 Main PID: 7493 (graylog-server)
   CGroup: /system.slice/graylog-server.service
           ├─7493 /bin/sh /usr/share/graylog-server/bin/graylog-server
           └─7507 /usr/bin/java -Xms1g -Xmx1g -XX:NewRatio=1 -server -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:-OmitStackTraceInFastThrow -XX:+UseParNewGC -jar -Dlog4j.configurationFile=file:///etc/graylog/server/...

Apr 12 12:28:25 graylog-server-5 systemd[1]: Started Graylog server.

The unexpected behaviour and the warning messages are gone.

[[email protected] ~]$ tail -f /var/log/graylog-server/server.log
2020-04-12T12:29:48.540Z INFO  [NetworkListener] Started listener bound to [192.0.2.11:9000]
2020-04-12T12:29:48.542Z INFO  [HttpServer] [HttpServer] Started.
2020-04-12T12:29:48.542Z INFO  [JerseyService] Started REST API at <192.0.2.11:9000>
2020-04-12T12:29:48.543Z INFO  [ServiceManagerListener] Services are healthy
2020-04-12T12:29:48.544Z INFO  [ServerBootstrap] Services started, startup times in ms: {InputSetupService [RUNNING]=3, EtagService [RUNNING]=73, ConfigurationEtagService [RUNNING]=73, OutputSetupService [RUNNING]=73, JobSchedulerService [RUNNING]=73, GracefulShutdownService [RUNNING]=94, JournalReader [RUNNING]=95, UrlWhitelistService [RUNNING]=127, KafkaJournal [RUNNING]=143, MongoDBProcessingStatusRecorderService [RUNNING]=147, PeriodicalsService [RUNNING]=147, BufferSynchronizerService [RUNNING]=160, LookupTableService [RUNNING]=184, StreamCacheService [RUNNING]=420, JerseyService [RUNNING]=27175}
2020-04-12T09:29:48.546Z INFO  [InputSetupService] Triggering launching persisted inputs, node transitioned from Uninitialized [LB:DEAD] to Running [LB:ALIVE]
2020-04-12T12:29:48.548Z INFO  [ServerBootstrap] Graylog server up and running.
2020-04-12T12:29:48.567Z INFO  [InputStateListener] Input [Beats/5cfe0aeaec88901911304649] is now STARTING
2020-04-12T12:29:48.864Z INFO  [InputStateListener] Input [Beats/5cfe0aeaec88901911304649] is now RUNNING
[...]
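Because server.log still contains the warnings from before the restart, watching tail -f is easier to trust when paired with a count of warnings logged since the most recent startup. The sketch below resets its counter at every "Graylog server up and running." line; the function name warnings_since_last_start is hypothetical.

```shell
#!/bin/sh
# Count NodePingThread re-registration warnings that appear after the
# last "Graylog server up and running." line in the given log file.
# warnings_since_last_start is a hypothetical helper name.
warnings_since_last_start() {
    awk '
        # A new startup marker resets the counter.
        /Graylog server up and running\./ { count = 0; next }
        /Did not find meta info of this node/ { count++ }
        # "count + 0" prints 0 even if no line ever matched.
        END { print count + 0 }
    ' "$1"
}

# Example (path taken from the article):
#   warnings_since_last_start /var/log/graylog-server/server.log
```

A result of 0 some minutes after the restart suggests the node is no longer re-registering.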

Use Ansible or any other configuration management tool to apply the update to the other Graylog servers and restart the service on each of them.
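Where no configuration management tool is available, a plain SSH loop is a rough substitute. This is only a sketch, not an Ansible replacement: the rollout function and the host names in the example are hypothetical, and it assumes key-based SSH access and passwordless sudo on each node. It defaults to a dry run that prints the commands instead of executing them.

```shell
#!/bin/sh
# Apply the stale_master_timeout change and restart graylog-server on
# each host given as an argument, one host at a time.
# rollout is a hypothetical helper; DRY_RUN defaults to 1 (print only).
rollout() {
    remote_cmd="sudo sed -i -e 's/#stale_master_timeout = 2000/stale_master_timeout = 10000/' /etc/graylog/server/server.conf && sudo systemctl restart graylog-server"
    for host in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Show what would be executed on this host.
            printf 'ssh %s "%s"\n' "$host" "$remote_cmd"
        else
            # Stop at the first failure so that a bad change is not
            # pushed to the remaining nodes.
            ssh "$host" "$remote_cmd" || return 1
        fi
    done
}

# Example with hypothetical host names:
#   rollout graylog-server-2 graylog-server-3               # dry run
#   DRY_RUN=0; rollout graylog-server-2 graylog-server-3    # execute
```

Processing the hosts sequentially means that only one node restarts at a time.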