I am getting more and more experience patching clusters with the local-mode automaton. The whole process would otherwise be quite complex, but the local-mode automaton makes it really easy.
Nevertheless, I have had a couple of clusters where the process did not work:
#1: The very first cluster that I installed in 18c
This cluster “kind of failed” while patching the first node. The rhpctl command exited with an error:
$ rhpctl move gihome -sourcehome /u01/crs/crs1830 -desthome /u01/crs/crs1860 -node server1
server1.cern.ch: Audit ID: 2
server1.cern.ch: verifying versions of Oracle homes ...
server1.cern.ch: verifying owners of Oracle homes ...
server1.cern.ch: verifying groups of Oracle homes ...
server1.cern.ch: starting to move the Oracle Grid Infrastructure home from "/u01/crs/crs1830" to "/u01/crs/crs1860" on server cluster "AISTEST-RAC16"
[...]
2019/07/08 09:45:06 CLSRSC-329: Replacing Clusterware entries in file 'oracle-ohasd.service'
PRCG-1239 : failed to close a proxy connection
Connection refused to host: server1.cern.ch; nested exception is:
java.net.ConnectException: Connection refused (Connection refused)
PRCG-1079 : Internal error: ClientFactoryImpl-submitAction-error1
PROC-32: Cluster Ready Services on the local node is not running
Messaging error [gipcretConnectionRefused] [29]
But actually, the helper kept running and configured everything properly:
$ tail -f /ORA/dbs01/oracle/crsdata/server1/crsconfig/crs_postpatch_server1_2019-07-08_09-41-36AM.log
2019-07-08 09:55:25:
2019-07-08 09:55:25: Succeeded in writing the checkpoint:'ROOTCRS_POSTPATCH' with status:SUCCESS
2019-07-08 09:55:25: Executing cmd: /u01/crs/crs1860/bin/clsecho -p has -f clsrsc -m 672
2019-07-08 09:55:25: Executing cmd: /u01/crs/crs1860/bin/clsecho -p has -f clsrsc -m 672
2019-07-08 09:55:25: Command output:
>  CLSRSC-672: Post-patch steps for patching GI home successfully completed.
>End Command output
2019-07-08 09:55:25: CLSRSC-672: Post-patch steps for patching GI home successfully completed.
The cluster was OK on the first node, with the correct patch level. The second node, however, was failing with:
$ rhpctl move gihome -sourcehome /u01/crs/crs1830 -desthome /u01/crs/crs1860 -node server2
server1.cern.ch: retrieving status of databases ...
server1.cern.ch: retrieving status of services of databases ...
PRCT-1011 : Failed to run "rhphelper". Detailed error: <HLP_EMSG>,RHPHELP_procCmdLine-05,</HLP_EMSG>,<HLP_VRES>3</HLP_VRES>,<HLP_IEEMSG>,PRCG-1079 : Internal error: RHPHELP122_main-01,</HLP_IEEMSG>,<HLP_ERES>1</HLP_ERES>
I am not sure about the cause, but let’s assume it is irrelevant for the moment.
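In situations like this, a quick way to see where each node actually stands is to ask the Clusterware for its patch level, node by node. This is only a sketch (run from the new GI home on a node where the stack is up; the patch levels in brackets are placeholders, not real values):
(oracle)$ /u01/crs/crs1860/bin/crsctl query crs softwarepatch server1
Oracle Clusterware patch level on node server1 is [<patch level of the 18.6 home>].
(oracle)$ /u01/crs/crs1860/bin/crsctl query crs softwarepatch server2
Oracle Clusterware patch level on node server2 is [<patch level of the old 18.3 home>].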
#2: A cluster with new GI home not properly linked with RAC
This was another funny case, where the first node patched successfully, but the second one failed in the middle of the process with a Java NullPointerException. We made a few bad attempts with prePatch and postPatch to fix it, but after that the second node of the cluster was in an inconsistent state: stuck in ROLLING_UPGRADE mode and impossible to patch anymore.
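The rolling state of the cluster can be checked by querying the active version with the -f flag; the output looks roughly like this (illustrative, with placeholders; depending on the operation the upgrade state shows [ROLLING UPGRADE] or [ROLLING PATCH] instead of [NORMAL]):
(oracle)$ crsctl query crs activeversion -f
Oracle Clusterware active version on the cluster is [18.0.0.0.0]. The cluster upgrade state is [ROLLING PATCH]. The cluster active patch level is [<patch level>].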
Common solution: removing the node from the cluster and adding it back
In both cases we were in the following situation:
- one node was successfully patched to 18.6
- one node was not patched and could not be patched anymore (at least not without heavy intervention)
So, for me, the easiest solution was to remove the failing node and add it back with the new, patched version.
Steps to remove the node
Although the steps are described here: https://docs.oracle.com/en/database/oracle/oracle-database/18/cwadd/adding-and-deleting-cluster-nodes.html#GUID-8ADA9667-EC27-4EF9-9F34-C8F65A757F2A, there are a few differences that I will highlight:
Stop the cluster:
(root)# crsctl stop crs
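Before going on with the deinstall, it does not hurt to verify that the stack is really down on that node; once everything is stopped, crsctl simply cannot contact OHASD anymore (the error below is only an illustration):
(root)# crsctl check crs
CRS-4639: Could not contact Oracle High Availability Services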
The official procedure to remove a node asks you to deconfigure the databases and managed homes from the active cluster version. But as we manage our homes with golden images, we do not need this; we would rather keep all the entries in the OCR so that when we add the node back, everything is already in place.
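To double-check what stays registered in the OCR for the node we are about to remove, a few standard srvctl/crsctl queries can be run from the surviving node (just a sketch, with our node name as an example):
(oracle)$ srvctl config database              # databases registered in the OCR
(oracle)$ srvctl config vip -node server2     # VIP of the node to be removed
(oracle)$ crsctl stat res -t                  # overall resource status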
Once the CRS was stopped, we deinstalled the CRS home on the failing node:
(oracle)$ $OH/deinstall/deinstall -local
This complained about the CRS being down, but continued and asked for this script to be executed:
/u01/crs/crs1830/crs/install/rootcrs.sh -force -deconfig -paramfile "/tmp/deinstall2019-07-08_11-37-20AM/response/deinstall_1830.rsp"
We got errors from this script as well, but the removal process was OK after all.
Then, from the surviving node:
(root)# crsctl delete node -n server2
(oracle)$ srvctl stop vip -vip server2
(root)# srvctl remove vip -vip server2
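At this point the node should be gone from the cluster definition. The standard way to verify it is listing the cluster nodes and running the post-nodedel stage of cluvfy; a sketch from the surviving node:
(oracle)$ olsnodes -s -t                             # server2 should no longer appear
(oracle)$ cluvfy stage -post nodedel -n server2 -verbose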
Adding the node back
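Before re-running gridSetup.sh, the standard prerequisite check for node addition can be run from the surviving node (only a sketch, with our node name as an example):
(oracle)$ cluvfy stage -pre nodeadd -n server2 -verbose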
From the surviving node, we ran gridSetup.sh and followed the steps to add the node.
Wait before running root.sh.
In our case, we had originally installed the cluster starting with a SW_ONLY install. This type of installation leaves some leftovers in the configuration files that prevent root.sh from configuring the cluster, so we had to modify rootconfig.sh:
check/modify /u01/crs/crs1860/crs/config/rootconfig.sh and change this:
# before:
SW_ONLY=true
# after:
SW_ONLY=false
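A quick way to check and flip the flag (just a sketch, assuming GNU sed on Linux; review the file manually afterwards):
(oracle)$ grep '^SW_ONLY=' /u01/crs/crs1860/crs/config/rootconfig.sh
SW_ONLY=true
(oracle)$ sed -i 's/^SW_ONLY=true/SW_ONLY=false/' /u01/crs/crs1860/crs/config/rootconfig.sh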
Then, after running root.sh and the configuration tools, everything was back as it was before removing the node from the cluster.
For one of the clusters, both nodes ended up at the same patch level, but the cluster was still in ROLLING_PATCH mode, so we had to run:
(root) # crsctl stop rollingpatch
—
Ludo