How to rebuild Hawkular Metrics after cluster deployment on OpenShift 3.11

Rebuild Hawkular Metrics after running the cluster deployment playbook on OpenShift 3.11.

All operations are performed on the management node.

Assume that a change required you to execute the deploy cluster playbook.

$ ansible-playbook -i hosts playbooks/deploy_cluster.yml

Afterwards, there are no metrics.

Log in to the OpenShift cluster.

$ oc login https://openshift-example.example.org:8443 --token=GWJqvER5-N3PBEGt14bwjx9K39ztqUqoUGQxki19kud -n openshift-infra
Logged into "https://openshift-example.example.org:8443" as "admin" using the token provided.
You have access to the following projects and can switch between them with 'oc project <projectname>':
    default
    development-milosz
    kube-public
    kube-service-catalog
    kube-system
    management-infra
    openshift
    openshift-ansible-service-broker
    openshift-console
    openshift-descheduler
  * openshift-infra
    openshift-logging
    openshift-metrics-server
    openshift-monitoring
    openshift-node
    openshift-sdn
    openshift-template-service-broker
    openshift-web-console
    ops-view
Using project "openshift-infra".

Notice that the hawkular-metrics and heapster pods are running, but not ready (0/1).

$ oc get pods -n openshift-infra
NAME                            READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-kwgrf      1/1       Running   0          10m
hawkular-metrics-jxr42          0/1       Running   1          10m
hawkular-metrics-schema-psqgr   1/1       Running   0          10m
heapster-ckfpw                  0/1       Running   1          10m
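On a larger namespace it is easy to miss a 0/1 entry. The following is a minimal sketch of a helper (the name not_ready_pods is hypothetical) that reads the default table printed by oc get pods and prints only the pods whose READY column shows fewer ready containers than expected. It assumes the default column layout shown above and skips Completed job pods on purpose.

```shell
#!/bin/sh
# Print the names of pods that report fewer ready containers than they have.
# Input: the default table printed by "oc get pods" (NAME READY STATUS ...).
# Pods in the Completed state (finished job pods) are skipped on purpose.
not_ready_pods() {
  awk 'NR > 1 && $3 != "Completed" {
    split($2, ready, "/")        # ready[1] = ready containers, ready[2] = total
    if (ready[1] + 0 < ready[2] + 0) print $1
  }'
}
```

For the listing above, piping the output through it prints hawkular-metrics-jxr42 and heapster-ckfpw.

$ oc get pods -n openshift-infra | not_ready_pods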

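Everything here is falling apart. Hawkular Metrics cannot find the hawkular_metrics keyspace, Cassandra logs TLS handshake failures (null cert chain), the schema job keeps waiting for Cassandra, and Heapster cannot reach Hawkular Metrics.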

$ oc logs --tail=10 hawkular-metrics-jxr42 -n openshift-infra
2020-04-15 10:17:38,031 INFO  [org.jboss.as] (Controller Boot Thread) WFLYSRV0051: Admin console listening on http://127.0.0.1:9990
2020-04-15 10:17:38,031 INFO  [org.jboss.as] (Controller Boot Thread) WFLYSRV0025: WildFly Full 11.0.0.Final (WildFly Core 3.0.8.Final) started in 9055ms - Started 343 of 593 services (340 services are lazy, passive or on-demand)
2020-04-15 10:17:47,406 INFO  [org.hawkular.metrics.api.jaxrs.util.SchemaVersionChecker] (metricsservice-lifecycle-thread) Version check failed: Keyspace hawkular_metrics does not exist
2020-04-15 10:17:47,406 INFO  [org.hawkular.metrics.api.jaxrs.util.SchemaVersionChecker] (metricsservice-lifecycle-thread) Trying again in 10000 ms
2020-04-15 10:17:57,410 INFO  [org.hawkular.metrics.api.jaxrs.util.SchemaVersionChecker] (metricsservice-lifecycle-thread) Version check failed: Keyspace hawkular_metrics does not exist
2020-04-15 10:17:57,411 INFO  [org.hawkular.metrics.api.jaxrs.util.SchemaVersionChecker] (metricsservice-lifecycle-thread) Trying again in 10000 ms
2020-04-15 10:18:07,414 INFO  [org.hawkular.metrics.api.jaxrs.util.SchemaVersionChecker] (metricsservice-lifecycle-thread) Version check failed: Keyspace hawkular_metrics does not exist
2020-04-15 10:18:07,414 INFO  [org.hawkular.metrics.api.jaxrs.util.SchemaVersionChecker] (metricsservice-lifecycle-thread) Trying again in 10000 ms
2020-04-15 10:18:17,417 INFO  [org.hawkular.metrics.api.jaxrs.util.SchemaVersionChecker] (metricsservice-lifecycle-thread) Version check failed: Keyspace hawkular_metrics does not exist
2020-04-15 10:18:17,417 INFO  [org.hawkular.metrics.api.jaxrs.util.SchemaVersionChecker] (metricsservice-lifecycle-thread) Trying again in 10000 ms
$ oc logs --tail=15 hawkular-cassandra-1-kwgrf -n openshift-infra
Caused by: javax.net.ssl.SSLHandshakeException: null cert chain
	at sun.security.ssl.Alerts.getSSLException(Alerts.java:192) ~[na:1.8.0_181]
	at sun.security.ssl.SSLEngineImpl.fatal(SSLEngineImpl.java:1666) ~[na:1.8.0_181]
	at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:330) ~[na:1.8.0_181]
	at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:318) ~[na:1.8.0_181]
	at sun.security.ssl.ServerHandshaker.clientCertificate(ServerHandshaker.java:1935) ~[na:1.8.0_181]
	at sun.security.ssl.ServerHandshaker.processMessage(ServerHandshaker.java:237) ~[na:1.8.0_181]
	at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1052) ~[na:1.8.0_181]
	at sun.security.ssl.Handshaker$1.run(Handshaker.java:992) ~[na:1.8.0_181]
	at sun.security.ssl.Handshaker$1.run(Handshaker.java:989) ~[na:1.8.0_181]
	at java.security.AccessController.doPrivileged(Native Method) ~[na:1.8.0_181]
	at sun.security.ssl.Handshaker$DelegatedTask.run(Handshaker.java:1467) ~[na:1.8.0_181]
	at io.netty.handler.ssl.SslHandler.runDelegatedTasks(SslHandler.java:1256) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1169) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
	... 16 common frames omitted
$ oc logs --tail=10 hawkular-metrics-schema-psqgr -n openshift-infra
INFO  2020-04-15 10:22:06,740 [main] com.datastax.driver.core.ClockFactory:newInstance:52 - Using native clock to generate timestamps.
INFO  2020-04-15 10:22:06,746 [main] org.hawkular.metrics.schema.Installer:initSession:134 - Cassandra may not be up yet. Retrying in 5000 ms
INFO  2020-04-15 10:22:11,746 [main] com.datastax.driver.core.ClockFactory:newInstance:52 - Using native clock to generate timestamps.
INFO  2020-04-15 10:22:11,755 [main] org.hawkular.metrics.schema.Installer:initSession:134 - Cassandra may not be up yet. Retrying in 5000 ms
INFO  2020-04-15 10:22:16,755 [main] com.datastax.driver.core.ClockFactory:newInstance:52 - Using native clock to generate timestamps.
INFO  2020-04-15 10:22:16,761 [main] org.hawkular.metrics.schema.Installer:initSession:134 - Cassandra may not be up yet. Retrying in 5000 ms
INFO  2020-04-15 10:22:21,761 [main] com.datastax.driver.core.ClockFactory:newInstance:52 - Using native clock to generate timestamps.
INFO  2020-04-15 10:22:21,768 [main] org.hawkular.metrics.schema.Installer:initSession:134 - Cassandra may not be up yet. Retrying in 5000 ms
INFO  2020-04-15 10:22:26,769 [main] com.datastax.driver.core.ClockFactory:newInstance:52 - Using native clock to generate timestamps.
INFO  2020-04-15 10:22:26,775 [main] org.hawkular.metrics.schema.Installer:initSession:134 - Cassandra may not be up yet. Retrying in 5000 ms
$ oc logs --tail=10 heapster-ckfpw -n openshift-infra
Could not connect to https://hawkular-metrics:443/hawkular/metrics/status. Curl exit code: 7. Status Code 000
'https://hawkular-metrics:443/hawkular/metrics/status' is not accessible [HTTP status code: 000. Curl exit code 7]. Retrying.
Could not connect to https://hawkular-metrics:443/hawkular/metrics/status. Curl exit code: 7. Status Code 000
'https://hawkular-metrics:443/hawkular/metrics/status' is not accessible [HTTP status code: 000. Curl exit code 7]. Retrying.
Could not connect to https://hawkular-metrics:443/hawkular/metrics/status. Curl exit code: 7. Status Code 000
'https://hawkular-metrics:443/hawkular/metrics/status' is not accessible [HTTP status code: 000. Curl exit code 7]. Retrying.
Could not connect to https://hawkular-metrics:443/hawkular/metrics/status. Curl exit code: 7. Status Code 000
'https://hawkular-metrics:443/hawkular/metrics/status' is not accessible [HTTP status code: 000. Curl exit code 7]. Retrying.
Could not connect to https://hawkular-metrics:443/hawkular/metrics/status. Curl exit code: 7. Status Code 000
'https://hawkular-metrics:443/hawkular/metrics/status' is not accessible [HTTP status code: 000. Curl exit code 7]. Retrying.
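Instead of running oc logs once per pod, the same inspection can be done in a loop. This is a sketch under the assumption that every pod in the namespace has a single container (true for the pods listed above), so no -c option is needed; the function name tail_pod_logs is hypothetical.

```shell
#!/bin/sh
# Print the last lines of the log of every pod in a namespace, one section per pod.
# $1 - namespace (for example openshift-infra)
# $2 - number of lines to show per pod (defaults to 10)
# Assumes single-container pods, so "oc logs" needs no -c option.
tail_pod_logs() {
  namespace="$1"
  lines="${2:-10}"
  # "oc get pods -o name" prints one "pod/<name>" entry per line
  for pod in $(oc get pods -n "$namespace" -o name); do
    echo "==> $pod <=="
    oc logs --tail="$lines" "$pod" -n "$namespace"
  done
}
```

$ tail_pod_logs openshift-infra 10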

The Hawkular Metrics route returns 503 Service Unavailable.

$ curl --head https://hawkular-metrics.openshift-example.example.org/hawkular/metrics
HTTP/1.0 503 Service Unavailable
Pragma: no-cache
Cache-Control: private, max-age=0, no-cache, no-store
Connection: close
Content-Type: text/html

To rebuild Hawkular Metrics, delete the hawkular-metrics-schema job, which has not completed successfully.

$ oc get jobs -n openshift-infra
NAME                      DESIRED   SUCCESSFUL   AGE
hawkular-metrics-schema   1         0            19m
$ oc delete jobs/hawkular-metrics-schema -n openshift-infra
job.batch "hawkular-metrics-schema" deleted
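If this step is scripted, it may be worth deleting the job only when it has not succeeded yet. The sketch below reads the job's .status.succeeded field through a JSONPath query; that field stays empty until a pod of the job finishes successfully. The function names are hypothetical.

```shell
#!/bin/sh
# Succeeds when the given job reports at least one successfully finished pod.
# $1 - job name, $2 - namespace
# .status.succeeded is empty until a pod of the job finishes successfully.
job_succeeded() {
  succeeded=$(oc get job "$1" -n "$2" -o jsonpath='{.status.succeeded}')
  [ "${succeeded:-0}" -ge 1 ]
}

# Delete the job only when it has not succeeded, so the playbook can recreate it.
delete_job_unless_succeeded() {
  if job_succeeded "$1" "$2"; then
    echo "job $1 already succeeded, nothing to do"
  else
    oc delete job "$1" -n "$2"
  fi
}
```

$ delete_job_unless_succeeded hawkular-metrics-schema openshift-infra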

Execute the openshift-metrics/schema.yml playbook, which recreates the schema job.

$ ansible-playbook -i hosts playbooks/openshift-metrics/schema.yml
[...]
PLAY RECAP *******************************************************************************************************************************************************
localhost                              : ok=12   changed=0    unreachable=0    failed=0    skipped=4    rescued=0    ignored=0
openshift-example-infra-1.example.org  : ok=0    changed=0    unreachable=0    failed=0    skipped=6    rescued=0    ignored=0
openshift-example-lb-1.example.org     : ok=1    changed=0    unreachable=0    failed=0    skipped=5    rescued=0    ignored=0
openshift-example-master-1.example.org : ok=53   changed=2    unreachable=0    failed=0    skipped=37   rescued=0    ignored=0
openshift-example-node-1.example.org   : ok=0    changed=0    unreachable=0    failed=0    skipped=6    rescued=0    ignored=0
openshift-example-node-2.example.org   : ok=0    changed=0    unreachable=0    failed=0    skipped=6    rescued=0    ignored=0
INSTALLER STATUS *************************************************************************************************************************************************
Initialization  : Complete (0:00:11)
Wednesday 15 April 2020  12:25:24 +0200 (0:00:00.143)       0:00:15.571 *******
===============================================================================
Gathering Facts ------------------------------------------------------------------------------------------------------------------------------------------- 1.75s
openshift_metrics : generate hawkular-metrics schema job -------------------------------------------------------------------------------------------------- 0.76s
Gather Cluster facts -------------------------------------------------------------------------------------------------------------------------------------- 0.64s
get openshift_current_version ----------------------------------------------------------------------------------------------------------------------------- 0.59s
openshift_metrics : Applying /tmp/openshift-metrics-ansible-7Eo6Im/templates/hawkular_metrics_schema_job.yaml --------------------------------------------- 0.54s
openshift_metrics : Checking generation of Job hawkular-metrics-schema ------------------------------------------------------------------------------------ 0.43s
openshift_metrics : Create temp directory for all our templates ------------------------------------------------------------------------------------------- 0.40s
openshift_metrics : Determine change status of Job hawkular-metrics-schema -------------------------------------------------------------------------------- 0.39s
openshift_control_plane : slurp --------------------------------------------------------------------------------------------------------------------------- 0.39s
openshift_metrics : list installed jobs ------------------------------------------------------------------------------------------------------------------- 0.38s
Detecting Operating System from ostree_booted ------------------------------------------------------------------------------------------------------------- 0.38s
openshift_metrics : Create temp directory for doing work in on target ------------------------------------------------------------------------------------- 0.36s
Initialize openshift.node.sdn_mtu ------------------------------------------------------------------------------------------------------------------------- 0.33s
openshift_metrics : Create temp directory local on control node ------------------------------------------------------------------------------------------- 0.25s
Fetch ca.crt from cluster if exists ----------------------------------------------------------------------------------------------------------------------- 0.21s
openshift_metrics : Copy the admin client config(s) ------------------------------------------------------------------------------------------------------- 0.20s
set_fact -------------------------------------------------------------------------------------------------------------------------------------------------- 0.19s
openshift_control_plane : stat ---------------------------------------------------------------------------------------------------------------------------- 0.18s
openshift_metrics : slurp --------------------------------------------------------------------------------------------------------------------------------- 0.18s
openshift_sanitize_inventory : Check for usage of deprecated variables ------------------------------------------------------------------------------------ 0.17s

Inspect the hawkular-metrics-schema job. This time it has completed successfully.

$ oc get jobs -n openshift-infra
NAME                      DESIRED   SUCCESSFUL   AGE
hawkular-metrics-schema   1         1            15s
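Rather than re-running oc get jobs until the SUCCESSFUL column changes, the job can be polled. This is a minimal sketch assuming the same .status.succeeded JSONPath query; the function name wait_for_job and its default limits (60 attempts, 5 seconds apart) are arbitrary choices for illustration.

```shell
#!/bin/sh
# Poll a job until .status.succeeded reports at least one finished pod.
# $1 - job name, $2 - namespace
# $3 - number of attempts (default 60), $4 - seconds between attempts (default 5)
# Returns 0 on success and 1 when the attempts run out.
wait_for_job() {
  attempt=1
  while [ "$attempt" -le "${3:-60}" ]; do
    succeeded=$(oc get job "$1" -n "$2" -o jsonpath='{.status.succeeded}')
    if [ "${succeeded:-0}" -ge 1 ]; then
      echo "job $1 succeeded"
      return 0
    fi
    sleep "${4:-5}"
    attempt=$((attempt + 1))
  done
  echo "job $1 did not succeed in time" >&2
  return 1
}
```

$ wait_for_job hawkular-metrics-schema openshift-infra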

Inspect the pods. The hawkular-metrics and heapster pods are ready, and the new schema pod has completed.

$ oc get pods -n openshift-infra
NAME                            READY     STATUS      RESTARTS   AGE
hawkular-cassandra-1-kwgrf      1/1       Running     0          20m
hawkular-metrics-jxr42          1/1       Running     3          20m
hawkular-metrics-schema-td29l   0/1       Completed   0          1m
heapster-ckfpw                  1/1       Running     2          20m

The Hawkular Metrics route now returns 200 OK.

$ curl --head https://hawkular-metrics.openshift-example.example.org/hawkular/metrics
HTTP/1.1 200 OK
Cache-Control: no-cache
Vary: Origin,Accept-Encoding
X-Powered-By: Undertow/1
Server: WildFly/11
Content-Type: application/json
Content-Length: 132
Date: Wed, 15 Apr 2020 10:27:12 GMT
Set-Cookie: a054b5d9e987bf679f10c9d29be39478=3ce5579d1b00caa62afe078c982aca15; path=/; HttpOnly; Secure
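Hawkular Metrics needs a moment to start after the schema is created, so the route may keep answering 503 for a while. The sketch below polls the route with HEAD requests, like the curl --head call above, until it answers 200. The function name and the default limits (30 attempts, 10 seconds apart) are assumptions; curl reports 000 when it receives no HTTP response at all.

```shell
#!/bin/sh
# Poll a URL with HEAD requests until it answers 200 OK.
# $1 - URL, $2 - number of attempts (default 30), $3 - seconds between attempts (default 10)
# curl prints 000 as the status code when no HTTP response is received.
wait_for_http_200() {
  try=1
  while [ "$try" -le "${2:-30}" ]; do
    code=$(curl --silent --head --output /dev/null --write-out '%{http_code}' "$1")
    echo "attempt $try: HTTP $code"
    [ "$code" = "200" ] && return 0
    sleep "${3:-10}"
    try=$((try + 1))
  done
  return 1
}
```

$ wait_for_http_200 https://hawkular-metrics.openshift-example.example.org/hawkular/metrics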

Inspect the hawkular-metrics logs.

$ oc logs --tail=10 hawkular-metrics-jxr42 -n openshift-infra
2020-04-15 10:25:40,102 INFO  [org.hawkular.metrics.core.util.GCGraceSecondsManager] (RxComputationScheduler-5) gc_grace_seconds for locks is set to 864000. Resetting to 0
2020-04-15 10:25:40,192 INFO  [org.hawkular.metrics.core.util.GCGraceSecondsManager] (RxComputationScheduler-4) gc_grace_seconds for metrics_idx is set to 864000. Resetting to 0
2020-04-15 10:25:40,331 INFO  [org.hawkular.metrics.core.util.GCGraceSecondsManager] (RxComputationScheduler-6) gc_grace_seconds for metrics_tags_idx is set to 864000. Resetting to 0
2020-04-15 10:25:40,332 INFO  [org.hawkular.metrics.core.util.GCGraceSecondsManager] (RxComputationScheduler-6) gc_grace_seconds for retentions_idx is set to 864000. Resetting to 0
2020-04-15 10:25:40,391 INFO  [org.hawkular.metrics.core.util.GCGraceSecondsManager] (RxComputationScheduler-8) gc_grace_seconds for scheduled_jobs_idx is set to 864000. Resetting to 0
2020-04-15 10:25:40,497 INFO  [org.hawkular.metrics.core.util.GCGraceSecondsManager] (RxComputationScheduler-1) gc_grace_seconds for sys_config is set to 864000. Resetting to 0
2020-04-15 10:25:40,497 INFO  [org.hawkular.metrics.core.util.GCGraceSecondsManager] (RxComputationScheduler-1) gc_grace_seconds for tasks is set to 864000. Resetting to 0
2020-04-15 10:25:40,543 INFO  [org.hawkular.metrics.core.util.GCGraceSecondsManager] (RxComputationScheduler-3) gc_grace_seconds for tenants is set to 864000. Resetting to 0
2020-04-15 10:25:40,799 INFO  [org.hawkular.metrics.core.util.GCGraceSecondsManager] (RxComputationScheduler-6) Finished gc_grace_seconds updates in 1344 ms
2020-04-15 10:27:12,678 WARN  [org.jboss.resteasy.resteasy_jaxrs.i18n] (default task-23) RESTEASY002142: Multiple resource methods match request "HEAD /". Selecting one. Matching methods: [public javax.ws.rs.core.Response org.hawkular.metrics.api.jaxrs.handler.BaseHandler.baseJSON(), public void org.hawkular.metrics.api.jaxrs.handler.BaseHandler.baseHTML(javax.servlet.ServletContext) throws java.lang.Exception]

Everything is fine now.