
Real-Time Cascade Standby Container Databases without Oracle Managed Files


OK, the title might not be the best… I would just like to add more detail to content you can already find in other blogs (e.g. this nice one from Philippe Fierens: http://pfierens.blogspot.com/2020/04/19c-data-guard-series-part-iii-adding.html).

I have this Cascade Standby configuration:

DGMGRL> connect /
Connected to "TOOLCDB1_SITE1"
Connected as SYSDG.
DGMGRL> show configuration;

Configuration - toolcdb1

  Protection Mode: MaxPerformance
  Members:
  toolcdb1_site1 - Primary database
    toolcdb1_site2 - Physical standby database
      toolcdx1_site2 - Physical standby database (receiving current redo)

Fast-Start Failover:  Disabled

Configuration Status:
SUCCESS   (status updated 42 seconds ago)

Years ago I wrote this whitepaper about cascaded standbys:
https://fr.slideshare.net/ludovicocaldara/2014-603-caldarappr
While it is still relevant for non-CDBs, things have changed with the Multitenant architecture.

In my config, the Oracle Database version is 19.7 and the databases are actually CDBs. No Grid Infrastructure, non-OMF datafiles.
It is important to highlight that a lot has changed since 12.1, and because 19c is the Long Term Support release now, it does not make sense to test anything older.

First, I just want to make sure that my standbys are aligned.

Primary:

alter system switch logfile;

1st Standby alert log:

2020-07-07T10:20:23.370868+02:00
 rfs (PID:6408): Archived Log entry 58 added for B-1044796516.T-1.S-39 ID 0xf15601c6 LAD:2
 rfs (PID:6408): No SRLs available for T-1
2020-07-07T10:20:23.386410+02:00
 rfs (PID:6408): Opened log for T-1.S-40 dbid 4048667172 branch 1044796516
2020-07-07T10:20:24.552766+02:00
PR00 (PID:6478): Media Recovery Log /u03/oradata/fra/TOOLCDB1_SITE2/archivelog/2020_07_07/o1_mf_1_39_hj8cs7vo_.arc
PR00 (PID:6478): Media Recovery Waiting for T-1.S-40 (in transit)

2nd Standby alert log:

2020-07-07T10:20:31.051281+02:00
 rfs (PID:6498): Opened log for T-1.S-39 dbid 4048667172 branch 1044796516
2020-07-07T10:20:31.150748+02:00
 rfs (PID:6498): Archived Log entry 38 added for B-1044796516.T-1.S-39 ID 0xf15601c6 LAD:2
2020-07-07T10:20:31.862337+02:00
PR00 (PID:6718): Media Recovery Log /u03/oradata/fra/TOOLCDX1_SITE2/archivelog/2020_07_07/o1_mf_1_39_hj8d2h1k_.arc
PR00 (PID:6718): Media Recovery Waiting for T-1.S-40

Then, I create a pluggable database (from PDB$SEED):

SQL>         CREATE PLUGGABLE DATABASE LATERALUS ADMIN USER PDBADMIN IDENTIFIED BY "NfrwTgbjwq7MbPNT92cH"  ROLES=(DBA)
  2                  FILE_NAME_CONVERT=('/pdbseed/','/LATERALUS/')
  3                  DEFAULT TABLESPACE USERS DATAFILE '/u02/oradata/TOOLCDB1/data/LATERALUS/USERS01.dbf' SIZE 50M AUTOEXTEND ON NEXT 50M MAXSIZE 1G;

Pluggable database created.

SQL>         ALTER PLUGGABLE DATABASE LATERALUS OPEN;

Pluggable database altered.

SQL>         ALTER PLUGGABLE DATABASE LATERALUS SAVE STATE;

Pluggable database altered.

On the first standby I get:

2020-07-07T10:23:33.148457+02:00
 rfs (PID:6408): Archived Log entry 60 added for B-1044796516.T-1.S-40 ID 0xf15601c6 LAD:2
 rfs (PID:6408): No SRLs available for T-1
2020-07-07T10:23:33.184335+02:00
 rfs (PID:6408): Opened log for T-1.S-41 dbid 4048667172 branch 1044796516
2020-07-07T10:23:33.887665+02:00
PR00 (PID:6478): Media Recovery Log /u03/oradata/fra/TOOLCDB1_SITE2/archivelog/2020_07_07/o1_mf_1_40_hj8d27d0_.arc
Recovery created pluggable database LATERALUS
Recovery copied files for tablespace SYSTEM
Recovery successfully copied file /u02/oradata/TOOLCDB1/data/LATERALUS/system01.dbf from /u02/oradata/TOOLCDB1/data/pdbseed/system01.dbf
LATERALUS(4):WARNING: File being created with same name as in Primary
LATERALUS(4):Existing file may be overwritten
LATERALUS(4):Recovery created file /u02/oradata/TOOLCDB1/data/LATERALUS/system01.dbf
LATERALUS(4):Successfully added datafile 16 to media recovery
LATERALUS(4):Datafile #16: '/u02/oradata/TOOLCDB1/data/LATERALUS/system01.dbf'
2020-07-07T10:23:35.846985+02:00
Recovery copied files for tablespace SYSAUX
Recovery successfully copied file /u02/oradata/TOOLCDB1/data/LATERALUS/sysaux01.dbf from /u02/oradata/TOOLCDB1/data/pdbseed/sysaux01.dbf
LATERALUS(4):WARNING: File being created with same name as in Primary
LATERALUS(4):Existing file may be overwritten
LATERALUS(4):Recovery created file /u02/oradata/TOOLCDB1/data/LATERALUS/sysaux01.dbf
LATERALUS(4):Successfully added datafile 17 to media recovery
LATERALUS(4):Datafile #17: '/u02/oradata/TOOLCDB1/data/LATERALUS/sysaux01.dbf'
2020-07-07T10:23:41.004383+02:00
Recovery copied files for tablespace UNDOTBS1
Recovery successfully copied file /u02/oradata/TOOLCDB1/data/LATERALUS/undotbs01.dbf from /u02/oradata/TOOLCDB1/data/pdbseed/undotbs01.dbf
LATERALUS(4):WARNING: File being created with same name as in Primary
LATERALUS(4):Existing file may be overwritten
LATERALUS(4):Recovery created file /u02/oradata/TOOLCDB1/data/LATERALUS/undotbs01.dbf
LATERALUS(4):Successfully added datafile 18 to media recovery
LATERALUS(4):Datafile #18: '/u02/oradata/TOOLCDB1/data/LATERALUS/undotbs01.dbf'
2020-07-07T10:23:42.191607+02:00
(4):WARNING: File being created with same name as in Primary
(4):Existing file may be overwritten
(4):Recovery created file /u02/oradata/TOOLCDB1/data/LATERALUS/USERS01.dbf
(4):Successfully added datafile 19 to media recovery
(4):Datafile #19: '/u02/oradata/TOOLCDB1/data/LATERALUS/USERS01.dbf'
PR00 (PID:6478): Media Recovery Waiting for T-1.S-41 (in transit)

On the second:

2020-07-07T10:24:31.393410+02:00
 rfs (PID:6500): Opened log for T-1.S-40 dbid 4048667172 branch 1044796516
2020-07-07T10:24:31.460391+02:00
 rfs (PID:6500): Archived Log entry 39 added for B-1044796516.T-1.S-40 ID 0xf15601c6 LAD:2
2020-07-07T10:24:32.360726+02:00
PR00 (PID:6718): Media Recovery Log /u03/oradata/fra/TOOLCDX1_SITE2/archivelog/2020_07_07/o1_mf_1_40_hj8d9zd7_.arc
Recovery created pluggable database LATERALUS
2020-07-07T10:24:36.000250+02:00
Recovery copied files for tablespace SYSTEM
Recovery successfully copied file /u02/oradata/TOOLCDX1/data/LATERALUS/system01.dbf from /u02/oradata/TOOLCDX1/data/pdbseed/system01.dbf
LATERALUS(4):Recovery created file /u02/oradata/TOOLCDX1/data/LATERALUS/system01.dbf
LATERALUS(4):Successfully added datafile 16 to media recovery
LATERALUS(4):Datafile #16: '/u02/oradata/TOOLCDX1/data/LATERALUS/system01.dbf'
2020-07-07T10:24:40.657596+02:00
Recovery copied files for tablespace SYSAUX
Recovery successfully copied file /u02/oradata/TOOLCDX1/data/LATERALUS/sysaux01.dbf from /u02/oradata/TOOLCDX1/data/pdbseed/sysaux01.dbf
LATERALUS(4):Recovery created file /u02/oradata/TOOLCDX1/data/LATERALUS/sysaux01.dbf
LATERALUS(4):Successfully added datafile 17 to media recovery
LATERALUS(4):Datafile #17: '/u02/oradata/TOOLCDX1/data/LATERALUS/sysaux01.dbf'
2020-07-07T10:24:47.688298+02:00
Recovery copied files for tablespace UNDOTBS1
Recovery successfully copied file /u02/oradata/TOOLCDX1/data/LATERALUS/undotbs01.dbf from /u02/oradata/TOOLCDX1/data/pdbseed/undotbs01.dbf
LATERALUS(4):Recovery created file /u02/oradata/TOOLCDX1/data/LATERALUS/undotbs01.dbf
LATERALUS(4):Successfully added datafile 18 to media recovery
LATERALUS(4):Datafile #18: '/u02/oradata/TOOLCDX1/data/LATERALUS/undotbs01.dbf'
(4):Recovery created file /u02/oradata/TOOLCDX1/data/LATERALUS/USERS01.dbf
(4):Successfully added datafile 19 to media recovery
(4):Datafile #19: '/u02/oradata/TOOLCDX1/data/LATERALUS/USERS01.dbf'
2020-07-07T10:24:48.924510+02:00
PR00 (PID:6718): Media Recovery Waiting for T-1.S-41

So, yeah, not having OMF might get you some warnings like: WARNING: File being created with same name as in Primary
But it is good to know that the cascade standby deals well with new PDBs.

Of course, this is not of great interest per se: I know that the real problem with Multitenant comes from cloning PDBs from local or remote PDBs that are open read-write.

So let’s try a relocate from another CDB:

CREATE PLUGGABLE DATABASE PNEUMA FROM PNEUMA@LUDOCDB1_PNEUMA_tempclone
         RELOCATE AVAILABILITY NORMAL
         file_name_convert=('/LUDOCDB1/data/PNEUMA/','/TOOLCDB1/data/PNEUMA/')
         PARALLEL 2;

Pluggable database created.

SQL>         ALTER PLUGGABLE DATABASE PNEUMA OPEN;

Pluggable database altered.

SQL>         ALTER PLUGGABLE DATABASE PNEUMA SAVE STATE;

Pluggable database altered.

This is what I get on the first standby:

2020-07-07T12:03:02.364271+02:00
Recovery created pluggable database PNEUMA
PNEUMA(5):Tablespace-SYSTEM during PDB create skipped since source is in            r/w mode or this is a refresh clone
PNEUMA(5):File #20 added to control file as 'UNNAMED00020'. Originally created as:
PNEUMA(5):'/u02/oradata/TOOLCDB1/data/PNEUMA/system01.dbf'
PNEUMA(5):because the pluggable database was created with nostandby
PNEUMA(5):or the tablespace belonging to the pluggable database is
PNEUMA(5):offline.
PNEUMA(5):Tablespace-SYSAUX during PDB create skipped since source is in            r/w mode or this is a refresh clone
PNEUMA(5):File #21 added to control file as 'UNNAMED00021'. Originally created as:
PNEUMA(5):'/u02/oradata/TOOLCDB1/data/PNEUMA/sysaux01.dbf'
PNEUMA(5):because the pluggable database was created with nostandby
PNEUMA(5):or the tablespace belonging to the pluggable database is
PNEUMA(5):offline.
PNEUMA(5):Tablespace-UNDOTBS1 during PDB create skipped since source is in            r/w mode or this is a refresh clone
PNEUMA(5):File #22 added to control file as 'UNNAMED00022'. Originally created as:
PNEUMA(5):'/u02/oradata/TOOLCDB1/data/PNEUMA/undotbs01.dbf'
PNEUMA(5):because the pluggable database was created with nostandby
PNEUMA(5):or the tablespace belonging to the pluggable database is
PNEUMA(5):offline.
PNEUMA(5):Tablespace-TEMP during PDB create skipped since source is in            r/w mode or this is a refresh clone
PNEUMA(5):Tablespace-USERS during PDB create skipped since source is in            r/w mode or this is a refresh clone
PNEUMA(5):File #23 added to control file as 'UNNAMED00023'. Originally created as:
PNEUMA(5):'/u02/oradata/TOOLCDB1/data/PNEUMA/USERS01.dbf'
PNEUMA(5):because the pluggable database was created with nostandby
PNEUMA(5):or the tablespace belonging to the pluggable database is
PNEUMA(5):offline.

and this is on the cascaded standby:

2020-07-07T12:03:02.368014+02:00
Recovery created pluggable database PNEUMA
PNEUMA(5):Tablespace-SYSTEM during PDB create skipped since source is in            r/w mode or this is a refresh clone
PNEUMA(5):File #20 added to control file as 'UNNAMED00020'. Originally created as:
PNEUMA(5):'/u02/oradata/TOOLCDB1/data/PNEUMA/system01.dbf'
PNEUMA(5):because the pluggable database was created with nostandby
PNEUMA(5):or the tablespace belonging to the pluggable database is
PNEUMA(5):offline.
PNEUMA(5):Tablespace-SYSAUX during PDB create skipped since source is in            r/w mode or this is a refresh clone
PNEUMA(5):File #21 added to control file as 'UNNAMED00021'. Originally created as:
PNEUMA(5):'/u02/oradata/TOOLCDB1/data/PNEUMA/sysaux01.dbf'
PNEUMA(5):because the pluggable database was created with nostandby
PNEUMA(5):or the tablespace belonging to the pluggable database is
PNEUMA(5):offline.
PNEUMA(5):Tablespace-UNDOTBS1 during PDB create skipped since source is in            r/w mode or this is a refresh clone
PNEUMA(5):File #22 added to control file as 'UNNAMED00022'. Originally created as:
PNEUMA(5):'/u02/oradata/TOOLCDB1/data/PNEUMA/undotbs01.dbf'
PNEUMA(5):because the pluggable database was created with nostandby
PNEUMA(5):or the tablespace belonging to the pluggable database is
PNEUMA(5):offline.
PNEUMA(5):Tablespace-TEMP during PDB create skipped since source is in            r/w mode or this is a refresh clone
PNEUMA(5):Tablespace-USERS during PDB create skipped since source is in            r/w mode or this is a refresh clone
PNEUMA(5):File #23 added to control file as 'UNNAMED00023'. Originally created as:
PNEUMA(5):'/u02/oradata/TOOLCDB1/data/PNEUMA/USERS01.dbf'
PNEUMA(5):because the pluggable database was created with nostandby
PNEUMA(5):or the tablespace belonging to the pluggable database is
PNEUMA(5):offline.

So absolutely the same behavior between the two levels of standby.
According to the documentation: https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/CREATE-PLUGGABLE-DATABASE.html#GUID-F2DBA8DD-EEA8-4BB7-A07F-78DC04DB1FFC
I quote what is specified for the parameter STANDBYS={ALL|NONE|…}:
“If you include a PDB in a standby CDB, then during standby recovery the standby CDB will search for the data files for the PDB. If the data files are not found, then standby recovery will stop and you must copy the data files to the correct location before you can restart recovery.”

“Specify ALL to include the new PDB in all standby CDBs. This is the default.”

“Specify NONE to exclude the new PDB from all standby CDBs. When a PDB is excluded from all standby CDBs, the PDB’s data files are unnamed and marked offline on all of the standby CDBs. Standby recovery will not stop if the data files for the PDB are not found on the standby. […]”

So, in order to avoid crashing the MRP, I should have included STANDBYS=NONE.
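This is roughly how the clause would fit into my PNEUMA statement (just a sketch; the STANDBYS clause is documented in the CREATE PLUGGABLE DATABASE reference linked above):

CREATE PLUGGABLE DATABASE PNEUMA FROM PNEUMA@LUDOCDB1_PNEUMA_tempclone
         RELOCATE AVAILABILITY NORMAL
         STANDBYS=NONE
         file_name_convert=('/LUDOCDB1/data/PNEUMA/','/TOOLCDB1/data/PNEUMA/')
         PARALLEL 2;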
But the documentation is not up to date: in my case the PDB is skipped automatically and the recovery process DOES NOT STOP:

SQL> r
  1* select process, status, sequence#, client_process from v$managed_standby

PROCESS   STATUS        SEQUENCE# CLIENT_P
--------- ------------ ---------- --------
ARCH      CONNECTED             0 ARCH
DGRD      ALLOCATED             0 N/A
DGRD      ALLOCATED             0 N/A
ARCH      CLOSING              43 ARCH
ARCH      CLOSING              40 ARCH
ARCH      CLOSING              42 ARCH
RFS       IDLE                  0 Archival
RFS       IDLE                  0 UNKNOWN
RFS       IDLE                 44 LGWR
RFS       IDLE                  0 UNKNOWN
MRP0      APPLYING_LOG         44 N/A
LNS       WRITING              44 LNS
DGRD      ALLOCATED             0 N/A

13 rows selected.

However, recovery is marked ENABLED for the PDB on the standby, while with STANDBYS=NONE it would have been DISABLED.

1* select name, recovery_status from v$pdbs

NAME                           RECOVERY
------------------------------ --------
PDB$SEED                       ENABLED
LATERALUS                      ENABLED
PNEUMA                         ENABLED

So, another difference from the documentation, which states:
“You can enable a PDB on a standby CDB after it was excluded on that standby CDB by copying the data files to the correct location, bringing the PDB online, and marking it as enabled for recovery.”

This reflects the findings of Philippe Fierens in his blog (http://pfierens.blogspot.com/2020/04/19c-data-guard-series-part-iii-adding.html).

This behavior was probably introduced between 12.2 and 19c, but I could not find out exactly when, as it is not explicitly stated in the documentation.
However, I remember well that in 12.1.0.2 the MRP process was crashing.

In my configuration (not on purpose, but interesting for this article), the first standby has the very same directory structure as the primary, while the cascaded standby has not.

In any case, there is a potentially big problem for all the customers implementing Multitenant on Data Guard:

With the old behavior (MRP crashing), it was easy to spot when a PDB was cloned online into a primary database, because a simple dgmgrl “show configuration” would have displayed a warning because of the increasing lag (following the MRP crash).

With the current behavior, the MRP keeps recovering and “show configuration” displays “SUCCESS” despite there being a PDB that is not copied to the standby (and thus not protected).

Indeed, this is what I get after the clone:

DGMGRL> show configuration;

Configuration - toolcdb1

  Protection Mode: MaxPerformance
  Members:
  toolcdb1_site1 - Primary database
    toolcdb1_site2 - Physical standby database
      toolcdx1_site2 - Physical standby database (receiving current redo)

Fast-Start Failover:  Disabled

Configuration Status:
SUCCESS   (status updated 21 seconds ago)

DGMGRL> show database  toolcdb1_site2;

Database - toolcdb1_site2

  Role:               PHYSICAL STANDBY
  Intended State:     APPLY-ON
  Transport Lag:      0 seconds (computed 1 second ago)
  Apply Lag:          0 seconds (computed 1 second ago)
  Average Apply Rate: 8.00 KByte/s
  Real Time Query:    ON
  Instance(s):
    TOOLCDB1

Database Status:
SUCCESS

I can see that the Data Guard Broker is completely silent about the missing PDB. So I might think my PDB is protected while it is not!

I actually have to add a check on the standby DBs to detect any missing datafiles:

1* select con_id, name, status from v$datafile where status not in ('SYSTEM','ONLINE');

    CON_ID NAME                                                  STATUS
---------- ----------------------------------------------------- -------
         5 /u01/app/oracle/product/db_19_7_0/dbs/UNNAMED00020    SYSOFF
         5 /u01/app/oracle/product/db_19_7_0/dbs/UNNAMED00021    RECOVER
         5 /u01/app/oracle/product/db_19_7_0/dbs/UNNAMED00022    RECOVER
         5 /u01/app/oracle/product/db_19_7_0/dbs/UNNAMED00023    RECOVER

Although this first query seems OK for getting the missing datafiles, the next one is actually the correct one to use:

SQL> select * from v$recover_file where online_status='OFFLINE';

     FILE# ONLINE  ONLINE_ ERROR               CHANGE# TIME             CON_ID
---------- ------- ------- ---------------- ---------- ------------ ----------
        20 OFFLINE OFFLINE FILE MISSING              0                       5
        21 OFFLINE OFFLINE FILE MISSING              0                       5
        22 OFFLINE OFFLINE FILE MISSING              0                       5
        23 OFFLINE OFFLINE FILE MISSING              0                       5

This check should be implemented and put under monitoring (custom metrics in OEM?):

SQL> select 'ERROR: CON_ID '||con_id||' has '||count(*)||' datafiles offline!' from v$recover_file where online_status='OFFLINE' group by con_id;

'ERROR:CON_ID'||CON_ID||'HAS'||COUNT(*)||'DATAFILESOFFLINE!'
--------------------------------------------------------------------------------
ERROR: CON_ID 5 has 4 datafiles offline!

The missing PDB is easy to spot once I know that I have to look for it. However, for each PDB to recover (I might have many!), I have to prepare the renaming of the datafiles and the creation of the directories (do not forget I am using non-OMF here).
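Because I am not using OMF, the target directories are not created automatically, so they must exist on the standby before restoring the datafiles. On my cascaded standby that means something like:

$ mkdir -p /u02/oradata/TOOLCDX1/data/PNEUMA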

Now, the datafile names on the standby got changed to …/UNNAMEDnnnnn.

So I have to get the original ones from the primary database and do the same replace that db_file_name_convert would do:

set trim on
col rename_file for a300
set lines 400
select 'set newname for datafile '||file#||' to '''||replace(name,'/TOOLCDB1/','/TOOLCDX1/')||''';' as rename_file  from v$datafile where con_id=6;

and put this in an RMAN script (this one is for the second standby; the first standby has the same directory structure as the primary, so the paths stay the same):

run {
set newname for datafile 20 to '/u02/oradata/TOOLCDX1/data/PNEUMA/system01.dbf';
set newname for datafile 21 to '/u02/oradata/TOOLCDX1/data/PNEUMA/sysaux01.dbf';
set newname for datafile 22 to '/u02/oradata/TOOLCDX1/data/PNEUMA/undotbs01.dbf';
set newname for datafile 23 to '/u02/oradata/TOOLCDX1/data/PNEUMA/USERS01.dbf';
restore pluggable database PNEUMA from service 'newbox01:1521/TOOLCDB1_SITE1_DGMGRL' ;
}
switch pluggable database PNEUMA to copy;

executing command: SET NEWNAME

executing command: SET NEWNAME

executing command: SET NEWNAME

executing command: SET NEWNAME

Starting restore at 07-JUL-2020 14:19:22
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=1530 device type=DISK

channel ORA_DISK_1: starting datafile backup set restore
channel ORA_DISK_1: using network backup set from service newbox01:1521/TOOLCDB1_SITE1_DGMGRL
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
channel ORA_DISK_1: restoring datafile 00020 to /u02/oradata/TOOLCDB1/data/PNEUMA/system01.dbf
channel ORA_DISK_1: restore complete, elapsed time: 00:00:03
channel ORA_DISK_1: starting datafile backup set restore
channel ORA_DISK_1: using network backup set from service newbox01:1521/TOOLCDB1_SITE1_DGMGRL
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
channel ORA_DISK_1: restoring datafile 00021 to /u02/oradata/TOOLCDB1/data/PNEUMA/sysaux01.dbf
channel ORA_DISK_1: restore complete, elapsed time: 00:00:07
channel ORA_DISK_1: starting datafile backup set restore
channel ORA_DISK_1: using network backup set from service newbox01:1521/TOOLCDB1_SITE1_DGMGRL
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
channel ORA_DISK_1: restoring datafile 00022 to /u02/oradata/TOOLCDB1/data/PNEUMA/undotbs01.dbf
channel ORA_DISK_1: restore complete, elapsed time: 00:00:03
channel ORA_DISK_1: starting datafile backup set restore
channel ORA_DISK_1: using network backup set from service newbox01:1521/TOOLCDB1_SITE1_DGMGRL
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
channel ORA_DISK_1: restoring datafile 00023 to /u02/oradata/TOOLCDB1/data/PNEUMA/USERS01.dbf
channel ORA_DISK_1: restore complete, elapsed time: 00:00:07
Finished restore at 07-JUL-2020 14:19:43

datafile 20 switched to datafile copy "/u02/oradata/TOOLCDB1/data/PNEUMA/system01.dbf"
datafile 21 switched to datafile copy "/u02/oradata/TOOLCDB1/data/PNEUMA/sysaux01.dbf"
datafile 22 switched to datafile copy "/u02/oradata/TOOLCDB1/data/PNEUMA/undotbs01.dbf"
datafile 23 switched to datafile copy "/u02/oradata/TOOLCDB1/data/PNEUMA/USERS01.dbf"

Then, I need to stop the recovery, start it and stop it again, put the datafiles online, and finally restart the recovery.
These are the same steps used by Philippe in his blog post, just adapted to my taste 🙂

DGMGRL> edit database "TOOLCDB1_SITE2" set state='APPLY-OFF';

For the second part, I use this heredoc to bring all the offline datafiles back online:

$ sqlplus / as sysdba <<EOF
RECOVER STANDBY DATABASE UNTIL CANCEL;
CANCEL
ALTER SESSION SET CONTAINER=PNEUMA;
DECLARE
        CURSOR c_fileids IS
                SELECT  file#  FROM v\$recover_file where online_STATUS='OFFLINE';

		r_fileid c_fileids%ROWTYPE;
BEGIN
        OPEN c_fileids;
        LOOP
                FETCH  c_fileids  INTO r_fileid;
                EXIT WHEN c_fileids%NOTFOUND;
                BEGIN
					EXECUTE IMMEDIATE 'ALTER DATABASE DATAFILE '||to_char(r_fileid.file#)||' ONLINE';
                END;
        END LOOP;
END;
/
exit
EOF

and finally:

DGMGRL> edit database "TOOLCDB1_SITE2" set state='APPLY-ON';

Now I no longer have any offline datafiles on the standby:

SQL> select 'ERROR: CON_ID '||con_id||' has '||count(*)||' datafiles offline!' from v$recover_file where online_status='OFFLINE' group by con_id;

no rows selected

I will not publish the steps for the second standby: they are exactly the same (same output as well).

In the end, it is important to highlight that monitoring the OFFLINE datafiles on the standby becomes crucial to guarantee the health of Data Guard in Multitenant. Relying on the Broker status or on “PDB recovery disabled” is not enough.

On the bright side, it is nice to see that Cascade Standby configurations do not introduce any variation, so cascaded standbys can be treated the same as “direct” standby databases.

HTH

Ludovico


Script to check Data Guard status from SQL


In a previous blog post I explained how to get the basic configuration from x$drc and display something like:

OBJECT_ID DATABASE       INTENDED_STATE    CONNECT_STRING               ENABLED ROLE     RECEIVE_FROM SHIP_TO        FSFOVALIDITY STATUS
--------- -------------- ----------------- ---------------------------- ------- -------- ------------ -------------- ------------ -------
 16842752 toolcdb1_site1 READ-WRITE-XPTON  newbox01:1521/TOOLCDB1_SITE1 YES     PRIMARY  -N/A-        toolcdb1_site2 2            SUCCESS
 33619968 toolcdb1_site2 PHYSICAL-APPLY-ON newbox02:1521/TOOLCDB1_SITE2 YES     PHYSICAL -UNKNOWN-    -N/A-          1            SUCCESS

There are other possibilities, by using the DBMS_DRS PL/SQL package.

The package is quite rich. In order to get more details, I use CHECK_CONNECT to check the connectivity to the member databases:

PROCEDURE CHECK_CONNECT
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 MEMBER_NAME                    VARCHAR2                IN
 INSTANCE_NAME                  VARCHAR2                IN

example:

SQL> execute dbms_drs.check_connect('TOOLCDB1_SITE2','TOOLCDB1');

PL/SQL procedure successfully completed.

SQL> execute dbms_drs.check_connect('TOOLCDB1_SITE3','TOOLCDB1');
BEGIN dbms_drs.check_connect('TOOLCDB1_SITE3','TOOLCDB1'); END;

*
ERROR at line 1:
ORA-16596: member not part of the Oracle Data Guard broker configuration
ORA-06512: at "SYS.DBMS_DRS", line 1851
ORA-06512: at line 1


SQL> execute dbms_drs.check_connect('TOOLCDB1_SITE2','TOOLCDB1');
BEGIN dbms_drs.check_connect('TOOLCDB1_SITE2','TOOLCDB1'); END;

*
ERROR at line 1:
ORA-12514: TNS:listener does not currently know of service requested in connect
descriptor
ORA-06512: at "SYS.DBMS_DRS", line 1851
ORA-06512: at line 1

In the first case I get no exception, which means that the database is reachable using the DGConnectIdentifier specified in the configuration (‘TOOLCDB1_SITE2’ is my database name in the configuration, NOT a TNS entry; I use EZConnect in my lab).

In the second case I specify a database that is not in the configuration.

In the third case, it looks like the database is down (no service), or the DGConnectIdentifier is not correct.
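A minimal sketch of how I can use this check programmatically, catching the exception instead of letting it propagate (it is the same approach I use in the big script below):

SET SERVEROUTPUT ON
BEGIN
    dbms_drs.check_connect('TOOLCDB1_SITE2','TOOLCDB1');
    dbms_output.put_line('OK: member is reachable.');
EXCEPTION WHEN OTHERS THEN
    dbms_output.put_line('ERROR '||SQLCODE||': '||SQLERRM);
END;
/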

 

GET_PROPERTY_OBJ is useful to get a single property of a database/instance:

FUNCTION GET_PROPERTY_OBJ RETURNS VARCHAR2
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 OBJECT_ID                      NUMBER(38)              IN
 PROPERTY_NAME                  VARCHAR2                IN

Example:

SQL> SELECT numtodsinterval(dbms_drs.get_property_obj(16842752,'TransportLagThreshold'),'second')
>  FROM dual;

NUMTODSINTERVAL(DBMS_DRS.GET_PROPERTY_OBJ(16842752,'TRANSPORTLAGTHRESHOLD')
---------------------------------------------------------------------------
+000000000 00:00:30.000000000

Here I have, for the primary (the object_id from x$drc), a TransportLagThreshold of 30 seconds.
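These thresholds become interesting when compared with the actual lag values exposed by v$dataguard_stats on the standby (this is what the big script below does):

SQL> select name, value from v$dataguard_stats where name in ('apply lag','transport lag');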

DO_CONTROL does a specific check and returns a document with the results:

PROCEDURE DO_CONTROL
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 INDOC                          VARCHAR2                IN
 OUTDOC                         VARCHAR2                OUT
 REQUEST_ID                     NUMBER(38)              IN/OUT
 PIECE                          NUMBER(38)              IN
 CONTEXT                        VARCHAR2                IN     DEFAULT

The problem is… what’s the format for indoc?

To get the correct format, I enabled SQL trace to capture the executions, with bind variables, of the dgmgrl commands. It turns out that the input format is XML and the output format is HTML.

This is how you can get the LogXptStatus, for example:

SQL> r
  1  DECLARE
  2  -- variables for the dbms_drs.do_control
  3  v_indoc VARCHAR2 ( 4000 );
  4  v_outdoc VARCHAR2 ( 4000 );
  5  v_rid NUMBER;
  6  v_context VARCHAR2(100);
  7  v_pieceno NUMBER ;
  8  BEGIN
  9  v_indoc := '<DO_MONITOR version="19.1"><PROPERTY name="LogXptStatus" object_id="16842752"/></DO_MONITOR>';
 10  v_pieceno  := 1;
 11  dbms_drs.do_control(v_indoc, v_outdoc, v_rid, v_pieceno, v_context);
 12  dbms_output.put_line (v_outdoc);
 13* END;
<TABLE  name="LOG TRANSPORT STATUS"><DESCRIPTION ><COLUMN  name="PRIMARY_INSTANCE_NAME" type="string" max_length="22"></COLUMN><COLUMN  name="STANDBY_DATABASE_NAME" type="string"
max_length="31"></COLUMN><COLUMN  name="STATUS" type="string" max_length="10"></COLUMN><COLUMN  name="ERROR" type="string" max_length="256"></COLUMN></DESCRIPTION><TR ><TD >TOOLCDB1</TD><TD
>toolcdb1_site2</TD><TD >VALID</TD><TD ></TD></TR></TABLE>

 

The big script

So I said… why not try to write a comprehensive SQL script that checks a few vital statuses of Data Guard?

This is the script that came out:

-- 
-- Author    : Ludovico Caldara
-- Version   : 0.1
-- Purpose   : Checks the health of a Data Guard configuration on ONE database
-- Run as    : SYSDBA , to execute on each DB in the config 
--             it does not check ALL the DBs in the configuration but only the current one
--             You can use a wrapper to check all the DBs in the configuration 
--             or all the standby instances on a server
-- Limitations: Does not work on RAC environments yet
--              Does not work on 11g databases (tested on 19c)

set serveroutput on
set lines 200
DECLARE
	v_dgconfig BINARY_INTEGER;
	v_num_errors BINARY_INTEGER;
	v_num_warnings BINARY_INTEGER;
	v_apply_lag INTERVAL DAY TO SECOND;
	v_transport_lag INTERVAL DAY TO SECOND;
	v_apply_th INTERVAL DAY TO SECOND;
	v_transport_th INTERVAL DAY TO SECOND;
	v_delay INTERVAL DAY TO SECOND;
	v_delaymins BINARY_INTEGER;
	v_flashback v$database.flashback_on%type;

	CURSOR c_dgconfig IS SELECT piv.*, obj.status FROM (
		SELECT object_id, attribute, value FROM x$drc
			WHERE object_id IN ( SELECT object_id FROM x$drc WHERE attribute = 'DATABASE')
	) drc PIVOT ( MAX ( value ) FOR attribute
		IN (
		'DATABASE' DATABASE ,
		'intended_state' intended_state ,
		'connect_string' connect_string ,
		'enabled' enabled ,
		'role' role ,
		'receive_from' receive_from ,
		'ship_to' ship_to ,
		'FSFOTargetValidity' FSFOTargetValidity
		)
	) piv JOIN x$drc obj ON ( obj.object_id = piv.object_id AND obj.attribute = 'DATABASE' )
	WHERE upper(piv.database)=sys_context('USERENV','DB_UNIQUE_NAME');

	CURSOR c_priconfig IS SELECT piv.*, obj.status FROM (
		SELECT object_id, attribute, value FROM x$drc
			WHERE object_id IN ( SELECT object_id FROM x$drc WHERE attribute = 'DATABASE')
	) drc PIVOT ( MAX ( value ) FOR attribute
		IN (
		'DATABASE' DATABASE ,
		'intended_state' intended_state ,
		'connect_string' connect_string ,
		'enabled' enabled ,
		'role' role ,
		'receive_from' receive_from ,
		'ship_to' ship_to ,
		'FSFOTargetValidity' FSFOTargetValidity
		)
	) piv JOIN x$drc obj ON ( obj.object_id = piv.object_id AND obj.attribute = 'DATABASE' )
	WHERE piv.role='PRIMARY';

	r_dgconfig c_dgconfig%ROWTYPE;
	r_priconfig c_priconfig%ROWTYPE;

	v_open_mode v$database.open_mode%TYPE;

	-- variables for the dbms_drs.do_control
	v_indoc VARCHAR2 ( 4000 );
	v_outdoc VARCHAR2 ( 4000 );
	v_rid NUMBER;
	v_context VARCHAR2(100);
	v_pieceno NUMBER ;
	/* xmltype does not work on mounted databases 
	v_y CLOB;
	v_z XMLTYPE;
	v_xml XMLTYPE;
	*/
	v_status VARCHAR2(100);
	v_error VARCHAR2(100);
	v_p_connect BINARY_INTEGER;
	v_s_connect BINARY_INTEGER;
	v_offline_datafiles BINARY_INTEGER;
	
BEGIN

	v_num_errors := 0;
	v_num_warnings := 0;
	v_p_connect := 0;
	v_s_connect := 0;

	dbms_output.put_line('Checking Data Guard Configuration for '||sys_context('USERENV','DB_UNIQUE_NAME'));
	dbms_output.put_line('--------------------------------------');
	-- get open_mode
	SELECT open_mode INTO v_open_mode FROM v$database;

	-- check if the configuration exists
	SELECT count(*) INTO v_dgconfig FROM x$drc;
	IF v_dgconfig = 0 THEN
		dbms_output.put_line('ERROR: Current database does not have a Data Guard config.');
		v_num_errors := v_num_errors + 1;
		GOTO stop_checks;
	else
		dbms_output.put_line('___OK: Current database has a Data Guard config.');
	END IF;

	-- fetch the current DB config in record
	-- (a FETCH from an explicit cursor does not raise NO_DATA_FOUND, so check %NOTFOUND instead)
	OPEN c_dgconfig;
	FETCH c_dgconfig INTO r_dgconfig;
	IF c_dgconfig%NOTFOUND THEN
		dbms_output.put_line('ERROR: Current database does not have a Data Guard config.');
		v_num_errors := v_num_errors + 1;
		GOTO stop_checks;
	END IF;

	-- fetch the primary DB config in record
	OPEN c_priconfig;
	FETCH c_priconfig INTO r_priconfig;
	IF c_priconfig%NOTFOUND THEN
		dbms_output.put_line('ERROR: There is no primary database in the config?');
		v_num_errors := v_num_errors + 1;
		GOTO stop_checks;
	END IF;

	-- enabled?
	IF r_dgconfig.enabled = 'YES' THEN
		dbms_output.put_line('___OK: Current database is enabled in Data Guard.');
	ELSE
		dbms_output.put_line('ERROR: Current database is not enabled in Data Guard.');
		v_num_errors := v_num_errors + 1;
	END IF;

	-- status SUCCESS?
	IF r_dgconfig.status = 'SUCCESS' THEN
		dbms_output.put_line('___OK: Data Guard status for the database is: '||r_dgconfig.status);
	ELSE
		dbms_output.put_line('ERROR: Data Guard status for the database is: '||r_dgconfig.status);
		v_num_errors := v_num_errors + 1;
	END IF;

	-- reachability of the primary
	BEGIN
		dbms_drs.CHECK_CONNECT (r_priconfig.database ,r_priconfig.database);
		dbms_output.put_line('___OK: Primary ('||r_priconfig.database||') is reachable.');
		v_p_connect := 1;
	EXCEPTION
		WHEN OTHERS THEN
		dbms_output.put_line('ERROR: Primary ('||r_priconfig.database||') unreachable. Error code ' || SQLCODE || ': ' || SQLERRM);
		v_num_errors := v_num_errors + 1;
	END;

	-- if we are not on the primary, check the current database connectivity as well through the broker
	IF r_priconfig.object_id <> r_dgconfig.object_id THEN
		BEGIN
			dbms_drs.CHECK_CONNECT (r_dgconfig.database ,r_dgconfig.database);
			dbms_output.put_line('___OK: current DB ('||r_dgconfig.database||') is reachable.');
			v_s_connect := 1;
		EXCEPTION
			WHEN OTHERS THEN
			dbms_output.put_line('ERROR: current DB ('||r_dgconfig.database||') unreachable. Error code ' || SQLCODE || ': ' || SQLERRM);
			v_num_errors := v_num_errors + 1;
		END;
	END IF;


	-- we check primary transport only if reachable
	IF v_p_connect = 1 THEN
		-- primary logxpt?
		v_indoc := '<DO_MONITOR version="19.1"><PROPERTY name="LogXptStatus" object_id="'||r_priconfig.object_id||'"/></DO_MONITOR>';
		v_pieceno  := 1;
		dbms_drs.do_control(v_indoc, v_outdoc, v_rid, v_pieceno, v_context);
	
		select regexp_substr(v_outdoc, '(<TD >)([[:alnum:]].*?)(</TD>)',1,3,'i',2) into v_status from dual;

		/* does not work on MOUNTED databases 
		v_y := TO_CLOB ( v_outdoc );
		v_z := XMLType ( v_y );

		select xt.status , xt.error into v_status, v_error from xmltable  ('/TABLE/TR' passing v_z columns 
			status varchar2(100) PATH 'TD[3]',
			error varchar2(100) PATH 'TD[4]'
		) xt ;
		*/

		IF v_status = 'VALID' THEN
			dbms_output.put_line('___OK: LogXptStatus of primary is VALID.');
		ELSE
			dbms_output.put_line('ERROR: LogXptStatus of primary is '||nvl(v_status,'NULL'));
			v_num_errors := v_num_errors + 1;
		END IF;
	END IF;

	-- flashback?
	SELECT flashback_on into v_flashback
	FROM v$database;
	IF v_flashback = 'YES' THEN
		dbms_output.put_line('___OK: Flashback Logging is enabled.');
	ELSE
		dbms_output.put_line('_WARN: Flashback Logging is disabled.');
		v_num_warnings := v_num_warnings + 1;
	END IF;

	-- role?
	IF r_dgconfig.ROLE = 'PRIMARY' THEN
		dbms_output.put_line('___OK: The database is PRIMARY, skipping standby checks.');
		GOTO stop_checks;
	ELSE
		dbms_output.put_line('___OK: The database is STANDBY, executing standby checks.');
	END IF;

	-- intended state?
	IF r_dgconfig.intended_state = 'PHYSICAL-APPLY-ON' THEN
		dbms_output.put_line('___OK: The database intended state is APPLY-ON.');
	ELSIF r_dgconfig.intended_state = 'PHYSICAL-APPLY-READY' THEN
		dbms_output.put_line('_WARN: The database intended state is APPLY-OFF.');
		v_num_warnings := v_num_warnings + 1;
	ELSE
		dbms_output.put_line('ERROR: The database intended state is '||r_dgconfig.intended_state);
		v_num_errors := v_num_errors + 1;
	END IF;

	-- real time apply?
	IF v_open_mode = 'READ ONLY WITH APPLY' THEN
		dbms_output.put_line('_WARN: Real Time Apply is used.');
		v_num_warnings := v_num_warnings + 1;
	ELSIF v_open_mode = 'MOUNTED' THEN
		dbms_output.put_line('___OK: The standby database is mounted.');
	ELSE
		dbms_output.put_line('ERROR: The database open_mode is '||v_open_mode);
		v_num_errors := v_num_errors + 1;
	END IF;
	

	-- offline datafiles? count the PDBs having at least one OFFLINE datafile
	-- (a GROUP BY with SELECT INTO would raise TOO_MANY_ROWS when more than one PDB is affected)
	SELECT count(distinct con_id) INTO v_offline_datafiles FROM v$recover_file WHERE online_status='OFFLINE';
	IF v_offline_datafiles > 0 THEN
		dbms_output.put_line('ERROR: There are '||v_offline_datafiles||' PDBs with OFFLINE datafiles');
		v_num_errors := v_num_errors + 1;
	ELSE
		dbms_output.put_line('___OK: There are no PDBs with OFFLINE datafiles');
	END IF;

	-- we get the delay as well, so that we can compute the apply threshold in a more intelligent way than the broker...
	v_delaymins := dbms_drs.get_property_obj(r_dgconfig.object_id,'DelayMins');
	v_delay := numtodsinterval(v_delaymins,'minute');

	IF v_delaymins > 0 THEN
		dbms_output.put_line('_WARN: Standby delayed by '||v_delaymins||' minutes.');
		v_num_warnings := v_num_warnings + 1;
	END IF;

	-- apply lag?
	v_apply_th := numtodsinterval(dbms_drs.get_property_obj(r_dgconfig.object_id,'ApplyLagThreshold'),'second');
	BEGIN
		SELECT TO_DSINTERVAL(value) into v_apply_lag FROM v$dataguard_stats WHERE name='apply lag';
		IF v_apply_lag > ( v_apply_th + v_delay ) THEN
			dbms_output.put_line('ERROR: apply lag is '||v_apply_lag);
			v_num_errors := v_num_errors + 1;
		ELSE
			dbms_output.put_line('___OK: apply lag is '||v_apply_lag);
		END IF;
	EXCEPTION WHEN OTHERS THEN
		dbms_output.put_line('ERROR: cannot determine apply lag.');
		v_num_errors := v_num_errors + 1;
	END;


	-- transport lag? (the threshold goes into v_transport_th, not v_transport_lag)
	v_transport_th := numtodsinterval(dbms_drs.get_property_obj(r_dgconfig.object_id,'TransportLagThreshold'),'second');
	BEGIN
		SELECT TO_DSINTERVAL(value) into v_transport_lag FROM v$dataguard_stats WHERE name='transport lag';
		IF v_transport_lag > v_transport_th THEN
			dbms_output.put_line('ERROR: transport lag is '||v_transport_lag);
			v_num_errors := v_num_errors + 1;
		ELSE
			dbms_output.put_line('___OK: transport lag is '||v_transport_lag);
		END IF;
	EXCEPTION WHEN OTHERS THEN
		dbms_output.put_line('_WARN: cannot determine transport lag.');
		v_num_warnings := v_num_warnings + 1;
	END;
	

	<<stop_checks>>
	
	dbms_output.put_line('--------------------------------------');
	IF v_num_errors > 0 THEN
		dbms_output.put_line('RESULT: ERROR: '||to_char(v_num_errors)||' errors - '||to_char(v_num_warnings)||' warnings');
	ELSIF v_num_warnings > 0 THEN
		dbms_output.put_line('RESULT: _WARN: '||to_char(v_num_errors)||' errors - '||to_char(v_num_warnings)||' warnings');
	ELSE
		dbms_output.put_line('RESULT: ___OK: '||to_char(v_num_errors)||' errors - '||to_char(v_num_warnings)||' warnings');
	END IF;
END;
/

Of course, it is not perfect (many checks are missing: FSFO readiness, observer checks, etc.), but it is good enough for basic monitoring. Also, it is faster than a normal shell+dgmgrl script.

Output on a Primary database:

SQL> @check_dg_config
Checking Data Guard Configuration for TOOLCDB1_SITE1
--------------------------------------
___OK: Current database has a Data Guard config.
___OK: Current database is enabled in Data Guard.
___OK: Data Guard status for the database is: SUCCESS
___OK: Primary (toolcdb1_site1) is reachable.
___OK: LogXptStatus of primary is VALID.
___OK: Flashback Logging is enabled.
___OK: The database is PRIMARY, skipping standby checks.
--------------------------------------
RESULT: ___OK: 0 errors - 0 warnings

PL/SQL procedure successfully completed.

Output on a standby database:

SQL> @check_dg_config.sql
Checking Data Guard Configuration for TOOLCDB1_SITE2
--------------------------------------
___OK: Current database has a Data Guard config.
___OK: Current database is enabled in Data Guard.
___OK: Data Guard status for the database is: SUCCESS
___OK: Primary (toolcdb1_site1) is reachable.
___OK: current DB (toolcdb1_site2) is reachable.
___OK: LogXptStatus of primary is VALID.
___OK: Flashback Logging is enabled.
___OK: The database is STANDBY, executing standby checks.
___OK: The database intended state is APPLY-ON.
_WARN: Real Time Apply is used.
___OK: There are no PDBs with OFFLINE datafiles
___OK: apply lag is +00 00:00:00.000000
___OK: transport lag is +00 00:00:00.000000
--------------------------------------
RESULT: _WARN: 0 errors - 1 warnings

PL/SQL procedure successfully completed.

In case of errors (e.g. standby listener stopped), I would get:

SQL> @check_dg_config.sql
Checking Data Guard Configuration for TOOLCDB1_SITE2
--------------------------------------
___OK: Current database has a Data Guard config.
___OK: Current database is enabled in Data Guard.
___OK: Data Guard status for the database is: SUCCESS
___OK: Primary (toolcdb1_site1) is reachable.
ERROR: current DB (toolcdb1_site2) unreachable. Error code -12541: ORA-12541: TNS:no listener
___OK: LogXptStatus of primary is VALID.
___OK: Flashback Logging is enabled.
___OK: The database is STANDBY, executing standby checks.
___OK: The database intended state is APPLY-ON.
_WARN: Real Time Apply is used.
___OK: There are no PDBs with OFFLINE datafiles
___OK: apply lag is +00 00:00:00.000000
___OK: transport lag is +00 00:00:00.000000
--------------------------------------
RESULT: ERROR: 1 errors - 1 warnings

PL/SQL procedure successfully completed.

So it is easy to spot the error and use a shell wrapper to grep for ^ERROR or similar.
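A minimal sketch of such a wrapper, assuming the script above is saved as check_dg_config.sql and the environment already points to the target instance:

#!/bin/bash
# Run the Data Guard check and fail if any ERROR line is reported.
out=$(sqlplus -S -L / as sysdba @check_dg_config.sql)
if echo "$out" | grep -q "^ERROR"; then
    echo "$out" | grep "^ERROR"
    exit 1
fi
echo "Data Guard check OK"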

Be careful, the script is not RAC aware, and it lacks some checks, so you might want to reuse it and extend it to fit your exact configuration.

Hope you like it!

Ludovico

Data Guard, Easy Connect and the Observer for multiple configurations


EZConnect

One of the challenges of automation in big Oracle environments is dealing with tnsnames.ora files.
These files might grow big and are sometimes hard to distribute/maintain properly.
The worst is when manual modifications are needed: manual operations, if not made carefully, can screw up the connection to the databases.
The best solution is always using LDAP naming resolution. I have seen customers using OID, OUD, Active Directory, openldapd, all with a great level of control and automation. However, some customers don’t have/want this possibility and keep relying on TNS naming resolution.
When Data Guard (and possibly RAC) is in place, the tnsnames.ora gets filled with entries for the DGConnectIdentifier and StaticConnectIdentifier properties. If I add the observer, an additional entry is required to access the dbname_CFG service created by Fast-Start Failover.

Actually, none of these entries are required if I use Easy Connect.
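For example, instead of maintaining a tnsnames.ora entry, I can connect with a plain EZConnect string (here with the host and service names of my lab):

$ sqlplus sys@//newbox01:1521/TOOLCDB1_SITE1 as sysdba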

My friend Franck Pachot wrote a couple of nice blog posts about Easy Connect while working with me at CERN:
https://medium.com/@FranckPachot/19c-easy-connect-e0c3b77968d7

https://medium.com/@FranckPachot/19c-ezconnect-and-wallet-easy-connect-and-external-password-file-8e326bb8c9f5

Basic Data Guard configuration

The basic configuration with Data Guard is quite simple to achieve with Easy Connect. In this example I have:
– The primary database TOOLCDB1_SITE1
– The duplicated database for standby TOOLCDB1_SITE2

After setting up the static registration (no Grid Infrastructure in my lab):

SID_LIST_LISTENER=
  (SID_LIST=
    (SID_DESC=
      (GLOBAL_DBNAME=TOOLCDB1_SITE1_DGMGRL)
      (SID_NAME=TOOLCDB1)
      (ORACLE_HOME=/u01/app/oracle/product/db_19_8_0)
    )
  )

and copying the passwordfile, the configuration can be created with:

DGMGRL> create configuration TOOLCDB1 as primary database is TOOLCDB1_SITE1 connect identifier is 'newbox01:1521/TOOLCDB1_SITE1';
Configuration "toolcdb1" created with primary database "toolcdb1_site1"

DGMGRL>  edit database TOOLCDB1_SITE1 set property 'StaticConnectIdentifier'='newbox01:1521/TOOLCDB1_SITE1_DGMGRL';
Property "StaticConnectIdentifier" updated

DGMGRL>  add database TOOLCDB1_SITE2 as connect identifier is 'newbox02:1521/TOOLCDB1_SITE2';
Database "toolcdb1_site2" added

DGMGRL>  edit database TOOLCDB1_SITE2 set property 'StaticConnectIdentifier'='newbox02:1521/TOOLCDB1_SITE2_DGMGRL';
Property "StaticConnectIdentifier" updated

DGMGRL>  enable configuration;
Enabled.

That’s it.

Now, if I want to have the configuration observed, I need to activate the Fast Start Failover:

DGMGRL> edit database toolcdb1_site1 set property LogXptMode='SYNC';
Property "logxptmode" updated

DGMGRL> edit database toolcdb1_site2 set property LogXptMode='SYNC';
Property "logxptmode" updated

DGMGRL> edit database toolcdb1_site1 set property FastStartFailoverTarget='toolcdb1_site2';
Property "faststartfailovertarget" updated

DGMGRL> edit database toolcdb1_site2 set property FastStartFailoverTarget='toolcdb1_site1';
Property "faststartfailovertarget" updated

DGMGRL> edit configuration set protection mode as maxavailability;
Succeeded.

DGMGRL> enable fast_start failover;
Enabled in Zero Data Loss Mode.

With just two databases, FastStartFailoverTarget is not explicitly needed, but I usually set it anyway, as other databases might be added to the configuration in the future.
After that, the broker complains that FSFO is enabled but there is no observer yet:

DGMGRL> show fast_start failover;

Fast-Start Failover: Enabled in Zero Data Loss Mode

  Protection Mode:    MaxAvailability
  Lag Limit:          0 seconds

  Threshold:          180 seconds
  Active Target:      toolcdb1_site2
  Potential Targets:  "toolcdb1_site2"
    toolcdb1_site2 valid
  Observer:           (none)
  Shutdown Primary:   TRUE
  Auto-reinstate:     TRUE
  Observer Reconnect: 180 seconds
  Observer Override:  FALSE

Configurable Failover Conditions
  Health Conditions:
    Corrupted Controlfile          YES
    Corrupted Dictionary           YES
    Inaccessible Logfile            NO
    Stuck Archiver                  NO
    Datafile Write Errors          YES

  Oracle Error Conditions:
    (none)


DGMGRL> show configuration;

Configuration - toolcdb1

  Protection Mode: MaxAvailability
  Members:
  toolcdb1_site1 - Primary database
    Warning: ORA-16819: fast-start failover observer not started

    toolcdb1_site2 - (*) Physical standby database

Fast-Start Failover: Enabled in Zero Data Loss Mode

Configuration Status:
WARNING   (status updated 39 seconds ago)

 

Observer for multiple configurations

This feature was introduced in 12.2, but it is still not widely used.
Before 12.2, the Observer was a foreground process: the DBAs had to start it in a wrapper script executed with nohup in order to keep it alive.
Since 12.2, the observer can run as a background process as long as there is a valid wallet for the connection to the databases.
Also, 12.2 introduced the capability of starting the observers for multiple configurations with a single dgmgrl command: “START OBSERVING”.

For more information about it, you can check the documentation here:
https://docs.oracle.com/en/database/oracle/oracle-database/19/dgbkr/using-data-guard-broker-to-manage-switchovers-failovers.html#GUID-BC513CDB-1E06-4EB3-9FE1-E1331E15E492

How to set it up with Easy Connect?

First, I need a wallet. And here comes the first compromise:
Having a single dgmgrl session to start all my configurations means that I have a single wallet for all the databases that I want to observe.
Fair enough, all the DBs (CDBs?) are managed by the same team in this case.
If my host runs only observers, I can easily point to the wallet from my central sqlnet.ora:

WALLET_LOCATION =
   (SOURCE =
      (METHOD = FILE)
      (METHOD_DATA = (DIRECTORY = /u01/app/oracle/admin/observers/wallet))
  )
SQLNET.WALLET_OVERRIDE = TRUE

Otherwise I need to create a separate TNS_ADMIN for my observer management environment.
Then, I create the wallet:

$ WALLET_DIR=$ORACLE_BASE/admin/observers/wallet
$ mkdir -p $WALLET_DIR
$ orapki wallet create -wallet $WALLET_DIR -auto_login_local -pwd Password2020
Oracle PKI Tool Release 21.0.0.0.0 - Production
Version 21.0.0.0.0
Copyright (c) 2004, 2020, Oracle and/or its affiliates. All rights reserved.

Operation is successfully completed.

Now I need to add the connection descriptors.

Which connection descriptors do I need?
The Observer uses the DGConnectIdentifier to keep observing the databases, but it needs a connection to both of them using the TOOLCDB1_CFG service (unless I specify something different with the broker configuration property ConfigurationWideServiceName) to connect to the configuration and get the DGConnectIdentifier information. Again, you can check it in the documentation or in the note Oracle 12.2 – Simplified OBSERVER Management for Multiple Fast-Start Failover Configurations (Doc ID 2285891.1).

So I need to specify three secrets for three connection descriptors:

$ mkstore -wrl "$TNS_ADMIN" -createCredential newbox01,newbox02:1521/TOOLCDB1_CFG sysdg
Oracle Secret Store Tool Release 21.0.0.0.0 - Production
Version 21.0.0.0.0
Copyright (c) 2004, 2020, Oracle and/or its affiliates. All rights reserved.

Your secret/Password is missing in the command line
Enter your secret/Password:
Re-enter your secret/Password:
Enter wallet password:

$ mkstore -wrl "$TNS_ADMIN" -createCredential newbox01:1521/TOOLCDB1_SITE1 sysdg
Oracle Secret Store Tool Release 21.0.0.0.0 - Production
Version 21.0.0.0.0
Copyright (c) 2004, 2020, Oracle and/or its affiliates. All rights reserved.

Your secret/Password is missing in the command line
Enter your secret/Password:
Re-enter your secret/Password:
Enter wallet password:


$ mkstore -wrl "$TNS_ADMIN" -createCredential newbox02:1521/TOOLCDB1_SITE2 sysdg
Oracle Secret Store Tool Release 21.0.0.0.0 - Production
Version 21.0.0.0.0
Copyright (c) 2004, 2020, Oracle and/or its affiliates. All rights reserved.

Your secret/Password is missing in the command line
Enter your secret/Password:
Re-enter your secret/Password:
Enter wallet password:

The first one will be used for the initial connection. The other two will be used to observe the Primary and the Standby.
I need to be careful that the first EZConnect descriptor matches EXACTLY what I put in observer.ora (see next step) and that the last two match my DGConnectIdentifiers (unless I specify something different with ObserverConnectIdentifier); otherwise I will get errors and the observer will not observe correctly (or will not start at all).
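Before starting the observer, I can validate each wallet entry with a quick connection test (the same check suggested in the troubleshooting section below):

$ sqlplus /@newbox01,newbox02:1521/TOOLCDB1_CFG as sysdg
$ sqlplus /@newbox01:1521/TOOLCDB1_SITE1 as sysdg
$ sqlplus /@newbox02:1521/TOOLCDB1_SITE2 as sysdg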

dgmgrl then needs a file named observer.ora.
$ORACLE_BASE/admin/observers or the central TNS_ADMIN would be good locations, but what if I have observers that must be started from multiple Oracle Homes?
In that case, having an observer.ora in $ORACLE_HOME/network/admin (or $ORACLE_BASE/homes/{OHNAME}/network/admin/ if Read-Only Oracle Home is enabled) would be a better solution; in this case I would need to start one dgmgrl session per Oracle Home.

The content of my observer.ora must be something like:

BROKER_CONFIGS=
   (
     (CONFIG=
       (NAME=TOOLCDB1)
       (CONNECT_ID=newbox01,newbox02:1521/TOOLCDB1_CFG)
       (CONFIG_HOME=/export/soft/oracle/admin/TOOLCDB1/observer)
     )
   )

This is the example for my configuration, but I can put as many (CONFIG=…) entries as I want in order to observe multiple configurations.
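For instance, a file observing two configurations might look like this (the second configuration, TOOLCDB2, is purely hypothetical):

BROKER_CONFIGS=
   (
     (CONFIG=
       (NAME=TOOLCDB1)
       (CONNECT_ID=newbox01,newbox02:1521/TOOLCDB1_CFG)
       (CONFIG_HOME=/export/soft/oracle/admin/TOOLCDB1/observer)
     )
     (CONFIG=
       (NAME=TOOLCDB2)
       (CONNECT_ID=newbox01,newbox02:1521/TOOLCDB2_CFG)
       (CONFIG_HOME=/export/soft/oracle/admin/TOOLCDB2/observer)
     )
   )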
Then, if everything is configured properly, I can start all the observers with a single command:

DGMGRL> SET OBSERVERCONFIGFILE=/u01/app/oracle/admin/observers/observer.ora
DGMGRL> START OBSERVING
ObserverConfigFile=observer.ora
observer configuration file parsing succeeded
Submitted command "START OBSERVER" using connect identifier "newbox01,newbox02:1521/TOOLCDB1_CFG"

Check superobserver.log, individual observer logs and Data Guard Broker logs for execution details.

DGMGRL> show observers
ObserverConfigFile=/u01/app/oracle/admin/observers/observer.ora
observer configuration file parsing succeeded
Submitted command "SHOW OBSERVER" using connect identifier "newbox01,newbox02:1521/TOOLCDB1_CFG"
Connected to "TOOLCDB1_SITE2"

Configuration - toolcdb1

  Primary:            toolcdb1_site1
  Active Target:      toolcdb1_site2

Observer "newbox03.trivadistraining.com1" - Master

  Host Name:                    newbox03.trivadistraining.com
  Last Ping to Primary:         1 second ago
  Last Ping to Target:          2 seconds ago

Troubleshooting

If the observer does not work, sometimes it is not easy to understand the cause.

  • Has SYSDG been granted to SYSDG user? Is SYSDG account unlocked?
  • Does sqlnet.ora contain the correct wallet location?
  • Is the wallet accessible in autologin?
  • Are the entries in the wallet correct? (check with “sqlplus /@connstring as sysdg”)

Missing pieces

Here, a few features that I think would be a nice addition in the future:

  • Awareness for the ORACLE_HOME to be used for each observer
  • Possibility to specify a different TNS_ADMIN per observer (different wallets)
  • Integration with Grid Infrastructure (srvctl add observer…) and support for multiple observers

Ludovico

The fear of (availability) loss is a path to the dark side.


I have been a DBA/consultant for customers and big production environments for over twenty years. I have explained more or less my career path in this blog post.

Database (and application) high availability has always been one of my favorite areas. Over the years I have become a high availability expert (my many blog posts are there to confirm it) and I have spent a lot of time building, troubleshooting, teaching, presenting, advocating these gems of technology that are RAC, Data Guard, Application Continuity and the many other products that are part of the Oracle Maximum Availability Architecture solution. Customers fear downtime, and I have always been with them on that. But in my case, it looks like Yoda’s famous quote worked well for me (in a good way):

I’ll be joining the Oracle Maximum Availability Architecture Product Management team as MAA Product Manager (or rather Cloud MAA, I will not explain here ;-))  next November.

(for those who are not familiar with the joke, the “Dark Side” is how we often refer to the Oracle employees in the Oracle Community 😉 )

I remember as if it were yesterday: I was presenting some Data Guard 12c new features in front of a big audience at Collaborate 2014. There I met two incredible people who were part of the MAA Product Management team: Larry Carpenter and Markus Michalewicz. Larry has been a great source of inspiration to improve my seniority and ease when presenting in public, while Markus has become a friend over the years, in addition to being one of the most influential people in my professional network.

Now I have the opportunity to join that team, and it feels like the most natural step in my career.

And because I imagine some of you will have questions, here are some answers to the questions I’ve been asked most frequently so far:

  • MAA PM does not mean becoming team lead or supervising other colleagues, I’ll be a “regular” PM
  • I will stay in Switzerland and work remotely from here
  • I will stay in “the conference circus” and keep presenting as soon as the COVID-19 situation allows it
  • Yes, I was VERY happy in Trivadis and it will always have a special place in my heart
  • Yep, that means no ACE Director award anymore 😉

Exciting times ahead! 🙂

Oracle Fleet Patching and Provisioning (FPP): My new role as PM and a brand new series of blog posts


It’s been 6 years since I tried FPP (formerly Rapid Home Provisioning, or RHP) for the first time.

Rapid Home Provisioning

FPP was still young and lacking many features at that time, but it already changed the way I worked in the following years. I embraced out-of-place patching, developed some basic scripts to install Oracle Homes, and sought automation and standardization at all costs:

Oracle Home Management – part 7: Putting all together

When 18c came with the FPP local-mode automaton, I implemented it for the Grid Infrastructure patching strategy at CERN:

Oracle Grid Infrastructure 18c patching part 3: Executing out-of-place patching with the local-mode automaton

And I discovered that FPP had meanwhile made giant steps, with many new features and fixes for quite a few usability and performance problems.

Last year, when joining the Oracle Database High Availability (HA), Scalability and Maximum Availability Architecture (MAA) Product Management Team at Oracle, I took (among others) the Product Manager role for FPP.

Becoming an Oracle employee after 20 years of working with the Oracle technology is a big leap. It allows me to understand how big the company is, and how collaborative and friendly the Oracle employees are (yes, I was used to marketing nonsense, insistent salesmen and unfriendly license auditors. This is slowly changing with Oracle embracing the Cloud, but it is still a fresh wound for many customers. Expect this to change even more! Regarding me… I’ll be the same as I’ve always been 🙂 ).

Now I have daily meetings with big customers (bigger than any I have ever had in the past), development teams, other product managers, Oracle consultants, and community experts. My primary goal is making the product better, increasing its adoption, and helping customers have the best possible experience with it. This includes testing the product myself, writing specs, preparing presentations and videos, collecting feedback from customers, tracking bugs, and managing escalations.

I am Product Manager for other products as well, but I have to admit that FPP is the product that takes most of my Product Manager time. Why?

I will give a few reasons in my next blog post(s).

Ludo

Why do PMs ask you to open Service Requests for almost EVERYTHING?


If you attend Oracle-related events or are active on Twitter or other social media used by technologists, you might know many of us Product Managers directly. If that is the case, you know that we are in general very easy to reach and always happy to help.

When you contact us directly, however, sometimes we answer “Please open an SR for that“. Somewhat irritating, huh? “We had chats and drinks together at conferences, and now this bureaucracy?” This is understandable. Who likes opening SRs, after all? Isn’t it just easier to forward that e-mail internally and get the answer first hand?

This happened to me as well in the past, when I was not yet working for Oracle, and it still happens now, with the answer coming from me as the PM.

Why? The first answer is “it depends on the question“. If it is anything that we can answer directly, we will probably do it.

It might be a question about a specific feature: “Does product X support Y?”, “Can you add this feature to your product?”, or a known problem for which the PM already knows the bug (in that case, it is just a matter of looking up the bug number), or anything that is relatively easy to answer: “What are the best practices for X?”, “Do you have a paper explaining that?”, “Does this bug have a fix already?“

But there is a plethora of questions for which we need more information.

I try this, but it does not work“. “I get this error and I think it is a bug“. “I have THIS performance problem“.

This is when I’d personally ask you to open an SR most of the time (unless I have a quick answer to give). And there are a few reasons:

Data protection

Oracle takes data protection very seriously. Oracle employees are trained to deal with potentially sensitive data and cannot forward customer information via e-mail: it could be exposed or forwarded to the wrong recipients by mistake. We don’t ask for TFA collections or logs via e-mail (even if sometimes customers send them to us anyway…).

Special privileges are required to access customer SRs; that is the only secure way we provide to transfer logs and protected information. The files uploaded into SRs must be accessed through a specific application, and all checkouts and downloads are tracked. When we need to forward customer information internally, we just specify the SR number and let our colleagues access the information themselves. Sometimes we use SRs simply as placeholders to exchange data with customers, without having a support engineer working on them.

This is the single most important point, and it somehow makes the other points irrelevant. Still, the remaining ones are good points.

Important pieces in the discussion do not get lost

The answer does not always come first hand… it might take 3-4 hops (sometimes more), plus analysis, comments, explanations, and discussions.

E-mail is not a good tool for this. Long threads can split and end up including just part of the audience (the “don’t reply to all” effect). Attachments are deleted when replying instead of forwarding… and pieces get lost.

This is where you would use Jira or another trouble-ticketing system. Guess which one Oracle uses for its customers? 🙂

MOS has internal views to dig into TFA logs (that’s why it is a good idea to provide one whenever it might be relevant), and all the attachments, comments, and internal discussions are centralized there. But we need an SR to add the information to!

Win-win: knowledge base, feedback, continuous improvement

If you discover something new from a technical discussion, what do you do? Do you share it or keep it to yourself? MOS is part of our knowledge base, and it is a good idea to store important discussions in it. Support engineers can find solutions in SRs with similar cases. It is also a good opportunity for the support engineers themselves to be involved in one more interesting discussion, so next time they might have the answer at their fingertips.

To conclude, think about it as a win-win: you give us interesting problems that might help improve the product, and you get a Guardian Angel on your SR for free 😉

Ludo

Changing FPP temporary directory (/tmp in noexec and other issues)


When using FPP, you might experience the following error (PRVF-7546):

$ rhpctl add workingcopy -workingcopy WC_db_19_11_FPPC -image db_19_11 -path /u01/app/oracle/product/WC_db_19_11_FPPC -client fppc -oraclebase /u01/app/oracle
fpps01: Audit ID: 121
PRGO-1260 : Cluster Verification checks for database home provisioning  failed for the specified working copy WC_db_19_11_FPPC.
PRCR-1178 : Execution of command failed on one or more nodes
 
PRVF-7546 : The work directory "/tmp/CVU_19.0.0.0.0_oracle/" cannot be used on node "fppc02"

This is often related to the /tmp filesystem being mounted with the “noexec” option:

$ mount | grep /tmp
tmpfs on /tmp type tmpfs (rw,nosuid,nodev,noexec)

Although it is tempting to just remount the filesystem with “exec”, you might be in this situation because your systems are configured to adhere to the STIG recommendations:

The noexec option must be added to the /tmp partition (https://www.stigviewer.com/stig/red_hat_enterprise_linux_6/2016-12-16/finding/V-57569)

FPP 19.9 contains fix 30885598, which allows specifying the temporary location for FPP operations:

$ srvctl modify rhpserver  -tmploc <new_tmp>
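For example, assuming a dedicated directory /u01/app/oracle/fpptmp (a hypothetical path) that exists and allows execution on all involved nodes, the change would look like this:

$ mkdir -p /u01/app/oracle/fpptmp
$ srvctl modify rhpserver -tmploc /u01/app/oracle/fpptmp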

After that, the operation should run smoothly:

fppc02: Successfully executed clone operation.
fppc02: Executing root script on nodes fppc01,fppc02.
fppc02: Successfully executed root script on nodes fppc01,fppc02.
fppc02: Working copy creation completed.
fppc02: Oracle home provisioned.
fpps01: Client-side action completed.

HTH

Ludo

rhpctl addnode gihome: specify HUB or LEAF when adding new nodes to a Flex Cluster


I have a customer trying to add a new node to a cluster using Fleet Patching and Provisioning.

The error in the command output is not very friendly:

[grid@fpps ~]$ rhpctl addnode gihome -workingcopy WC_gi19110_FPPC3 \
  -newnodes fppc3:fppc3-vip  -cred fppc-cred
fpps: Audit ID: 269
PRCT-1003 : failed to run "rhphelper" on node "fppc2"
PRCT-1014 : Internal error: RHPHELP_preNodeAddVal-05null

The “RHPHELP_preNodeAddVal” might already give an idea of the cause: something related to the “cluvfy stage -pre nodeadd” evaluation that we normally do when adding a node by hand. FPP does not really run cluvfy, but it calls the same primitives cluvfy is based on.

In FPP, when the error does not give any useful information, this is the flow to follow:

  • use “rhpctl query audit” to get the date and time of the failing operation
  • open the “rhpserver.log.0” and look for the operation log in that time frame
  • get the UID of the operation; e.g., in the following line it is “-1556344143”:

[UID:-1556344143] [RMI TCP Connection(153)-192.168.1.151] [ 2021-07-27 00:25:20.741 KST ]
  [ServerCommon.processParameters:485]  before parsing: params = 
  {-methodName=addnodesWorkingCopy, -userName=grid, -version=19.0.0.0.0, -auditId=-1556344143,
  -auditCli=rhpctl addnode gihome -workingcopy WC_gi19110_FPPC3 -newnodes fppc3:fppc3-vip -cred cred_fppc,
  -plsnrPort=31605, -noun=gihome, -isSingleNodeProv=FALSE, -nls_lang=AMERICAN_AMERICA.AL32UTF8,
  -clusterName=fpps-cluster, -plsnrHost=fpps, -SA11204ClusterName=null,
  -lang=en_US, -clientNode=fpps, -verb=addnode, -ghopuid=-1556344143}

  • Isolate the log for the operation: grep -- "$UID" rhpserver.log.0 > "$UID".log (note the “--”, since the UID can be negative)
  • Locate the trace file of the rhphelper remote execution:

[UID:-1556344143] [RMI TCP Connection(153)-192.168.1.151] [ 2021-07-27 00:26:07.031 KST ] [RHPHELPERUtil.getTraceEnvs:4386] 
  TraceFileLocEnv is :RHPHELPER_TRACEFILE=/u01/app/grid/crsdata/fppc2/rhp/rhphelp_20210727002603.trc

  • Find the root cause in the rhphelper trace:

[main] [ 2021-07-27 00:27:02.600 KST ] [reflect.GeneratedMethodAccessor1.invoke:-1]  PRVG-11406 : API with node roles argument must be called for Flex Cluster

In this case, the target cluster is a Flex Cluster, so the command must be run specifying the node_role.

The documentation is not clear (we will fix it soon):

rhpctl addnode gihome {-workingcopy workingcopy_name | -client cluster_name}
  -newnodes node_name:node_vip[:node_role][,node_name:node_vip[:node_role]...]

node_role must be specified for Flex Clusters, and it must be either HUB or LEAF.

After using the correct command line, the command succeeded.

rhpctl addnode gihome -workingcopy WC_gi19110_FPPC3 \
 -newnodes fppc3:fppc3-vip:HUB  -cred fppc-cred

HTH

Ludovico


Can I rename a PDB in a Data Guard configuration?


Someone asked me this question recently.

The answer is: yes!

Let’s see it in action.

On the primary I have:

----- PRIMARY
SQL> show pdbs;

    CON_ID CON_NAME                       OPEN MODE  RESTRICTED
---------- ------------------------------ ---------- ----------
         2 PDB$SEED                       READ ONLY  NO
         3 RED                            READ WRITE NO
         4 SAND                           READ WRITE NO

And of course the same PDBs on the standby:

----- STANDBY
SQL> show pdbs

    CON_ID CON_NAME                       OPEN MODE  RESTRICTED
---------- ------------------------------ ---------- ----------
         2 PDB$SEED                       MOUNTED
         3 RED                            MOUNTED
         4 SAND                           MOUNTED

Let’s change the name of the PDB RED to TOBY. The rename operation is straightforward (but it requires a brief downtime), and it must be done on the primary:

SQL> alter pluggable database red close;

Pluggable database altered.

SQL> alter pluggable database red open restricted;

Pluggable database altered.

SQL> alter session set container=red;

Session altered.

SQL> alter pluggable database rename global_name to toby;

Pluggable database altered.

SQL> alter session set container=cdb$root;

Session altered.

SQL> show pdbs

    CON_ID CON_NAME                       OPEN MODE  RESTRICTED
---------- ------------------------------ ---------- ----------
         2 PDB$SEED                       READ ONLY  NO
         3 TOBY                           READ WRITE YES
         4 SAND                           READ WRITE NO

SQL> alter pluggable database toby close;

Pluggable database altered.


SQL> alter pluggable database toby open;

Pluggable database altered.

SQL>

On the standby, I can see that the PDB changed its name:

SQL> show pdbs

    CON_ID CON_NAME                       OPEN MODE  RESTRICTED
---------- ------------------------------ ---------- ----------
         2 PDB$SEED                       MOUNTED
         3 TOBY                           MOUNTED
         4 SAND                           MOUNTED
SQL>

The PDB name change is propagated transparently with the redo apply.

Ludo

Can a physical standby database receive the redo SYNC if the Far Sync instance fails?


The answer is YES.

In the following configuration, cdgsima_lhr1pq (primary) sends synchronously to cdgsima_farsync1 (far sync), which forwards the redo stream asynchronously to cdgsima_lhr1bm (physical standby):

DGMGRL> show configuration verbose

Configuration - cdgsima

  Protection Mode: MaxPerformance
  Members:
  cdgsima_lhr1pq   - Primary database
    cdgsima_farsync1 - Far sync instance
      cdgsima_lhr1bm   - Physical standby database
    cdgsima_lhr1bm   - Physical standby database (alternate of cdgsima_farsync1)

  Members Not Receiving Redo:
  cdgsima_farsync2 - Far sync instance

But if cdgsima_farsync1 is not available, I want the primary to send synchronously to the physical standby database. I accept a performance penalty, but I do not want to compromise my data protection.

I just need to set up the RedoRoutes as follows:

-- when primary is cdgsima_lhr1pq 
EDIT DATABASE 'cdgsima_lhr1pq' SET PROPERTY 'RedoRoutes' = '(LOCAL : (cdgsima_farsync1 SYNC PRIORITY=1, cdgsima_lhr1bm SYNC PRIORITY=2 ))';
EDIT FAR_SYNC 'cdgsima_farsync1' SET PROPERTY 'RedoRoutes' = '(cdgsima_lhr1pq : cdgsima_lhr1bm ASYNC)';

-- when primary is cdgsima_lhr1bm
EDIT DATABASE 'cdgsima_lhr1bm' SET PROPERTY 'RedoRoutes' = '(LOCAL : (cdgsima_farsync2 SYNC PRIORITY=1, cdgsima_lhr1pq SYNC PRIORITY=2 ))';
EDIT FAR_SYNC 'cdgsima_farsync2' SET PROPERTY 'RedoRoutes' = '(cdgsima_lhr1bm : cdgsima_lhr1pq ASYNC)';

This is defined by the second part of the RedoRoutes rule:

cdgsima_lhr1bm SYNC PRIORITY=2

Let’s test. If I run shutdown abort on the far sync instance:

$ rlwrap sqlplus / as sysdba

SQL*Plus: Release 19.0.0.0.0 - Production on Sat Mar 26 10:55:31 2022
Version 19.13.0.0.0

Copyright (c) 1982, 2021, Oracle.  All rights reserved.


Connected to:
Oracle Database 19c EE Extreme Perf Release 19.0.0.0.0 - Production
Version 19.13.0.0.0

SQL> shutdown abort
ORACLE instance shut down.
SQL>

I can see the new SYNC destination being opened almost instantaneously (because the old destination fails immediately with ORA-03113):

2022-03-26T10:55:35.581460+00:00
LGWR (PID:42101): Attempting LAD:2 network reconnect (3113)
LGWR (PID:42101): LAD:2 network reconnect abandoned
2022-03-26T10:55:35.602542+00:00
Errors in file /u01/app/oracle/diag/rdbms/cdgsima_lhr1pq/cdgsima/trace/cdgsima_lgwr_42101.trc:
ORA-03113: end-of-file on communication channel
LGWR (PID:42101): Error 3113 for LNO:3 to 'dgsima1.dbdgsima.misclabs.oraclevcn.com:1521/cdgsima_farsync1.dbdgsima.misclabs.oraclevcn.com'
2022-03-26T10:55:35.608691+00:00
LGWR (PID:42101): LAD:2 is UNSYNCHRONIZED
2022-03-26T10:55:36.610098+00:00
LGWR (PID:42101): Failed to archive LNO:3 T-1.S-141, error=3113
LGWR (PID:42101): Error 1041 disconnecting from LAD:2 standby host 'dgsima1.dbdgsima.misclabs.oraclevcn.com:1521/cdgsima_farsync1.dbdgsima.misclabs.oraclevcn.com'
2022-03-26T10:55:37.143448+00:00
LGWR (PID:42101): LAD:3 is UNSYNCHRONIZED
2022-03-26T10:55:37.143569+00:00
LGWR (PID:42101): LAD:2 no longer supports SYNCHRONIZATION
Starting background process NSS3
2022-03-26T10:55:37.227954+00:00
NSS3 started with pid=38, OS id=78251
2022-03-26T10:55:40.733905+00:00
Thread 1 advanced to log sequence 142 (LGWR switch),  current SCN: 8068734
  Current log# 1 seq# 142 mem# 0: /u03/app/oracle/redo/CDGSIMA_LHR1PQ/onlinelog/o1_mf_1_k251hfvk_.log
2022-03-26T10:55:40.781499+00:00
ARC0 (PID:42266): Archived Log entry 220 added for T-1.S-141 ID 0x9eb046ef LAD:1
2022-03-26T10:55:41.606175+00:00
ALTER SYSTEM SET log_archive_dest_state_3='ENABLE' SCOPE=MEMORY SID='*';
2022-03-26T10:55:43.747483+00:00
LGWR (PID:42101): LAD:3 is SYNCHRONIZED
2022-03-26T10:55:43.816978+00:00
Thread 1 advanced to log sequence 143 (LGWR switch),  current SCN: 8068743
  Current log# 2 seq# 143 mem# 0: /u03/app/oracle/redo/CDGSIMA_LHR1PQ/onlinelog/o1_mf_2_k251hfwz_.log

Indeed, I can see the new NSS process (synchronous redo transport) spawned at that time:

SQL> r
  1  select NAME
  2  ,PID
  3  ,TYPE
  4  ,ROLE ACTION
  5  ,CLIENT_PID
  6  ,CLIENT_ROLE
  7  ,GROUP#
  8  ,RESETLOG_ID
  9  ,THREAD#
 10  ,SEQUENCE#
 11  ,BLOCK#
 12* from v$dataguard_process where name like 'NSS%'

NAME  PID                      TYP ACTION                   CLIENT_PID CLIENT_ROLE          GROUP# RESETLOG_ID    THREAD#  SEQUENCE#     BLOCK#
----- ------------------------ --- ------------------------ ---------- ---------------- ---------- ----------- ---------- ---------- ----------
NSS2  54961                    KSB sync                              0 none                      0           0          0          0          0
NSS3  78251                    KSB sync                              0 none                      0           0          0          0          0

SQL> !ps -eaf | grep ora_nss
oracle   54961     1  0 Mar10 ?        00:00:55 ora_nss2_cdgsima
oracle   78251     1  0 10:55 ?        00:00:00 ora_nss3_cdgsima
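As an additional sanity check, one could also query v$archive_dest_status to confirm which destination is synchronized (a sketch; my assumption here is that LAD:3 in the alert log corresponds to DEST_ID 3):

SQL> select dest_id, db_unique_name, status, synchronization_status
  2  from v$archive_dest_status
  3  where status <> 'INACTIVE';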

Ludo

Far Sync and Fast-Start Failover Protection modes


Oracle advertises Far Sync as a solution for “Zero Data Loss at any distance”. This is because the primary sends its redo stream synchronously to the Far Sync, which relays it to the remote physical standby.

There are many reasons why Far Sync is an optimal solution for this use case, but that’s not the topic of this post 🙂

Some customers ask: Can I configure Far Sync to receive the redo stream asynchronously?

Although a direct standby receiving asynchronously would be a better idea, Far Sync can receive asynchronously as well.

And one reason might be to send asynchronously to one Far Sync member that redistributes locally to many standbys.

It is very simple to achieve: just change the RedoRoutes property on the primary.

RedoRoutes = '(LOCAL : cdgsima_farsync1 ASYNC)'

This will work seamlessly. The v$dataguard_process view will show the async transport process:

NAME  PID TYP ACTION          CLIENT_PID CLIENT_ROLE GROUP# RESETLOG_ID THREAD# SEQUENCE# BLOCK#
----- --- --- --------------- ---------- ----------- ------ ----------- ------- --------- ------
TT02  440 KSV async ORL multi          0 none             2  1098480879       1       146    456

What about Fast-Start Failover?

Up to and including 19c, ASYNC transport to Far Sync will not work with Fast-Start Failover (FSFO).

ASYNC redo transport mandates Maximum Performance protection mode, and FSFO supports that in conjunction with Far Sync only starting with 21c.

Before 21c, trying to enable FSFO with a Far Sync will fail with:

effective redo transport mode is incompatible with the configuration protection mode

DGMGRL> show fast_start failover

Fast-Start Failover:  Disabled

  Protection Mode:    MaxPerformance
  Lag Limit:          30 seconds

  Threshold:          30 seconds
  Active Target:      (none)
  Potential Targets:  "cdgsima_lhr1bm"
    cdgsima_lhr1bm invalid - effective redo transport mode is incompatible with the configuration protection mode
  Observer:           (none)
  Shutdown Primary:   TRUE
  Auto-reinstate:     TRUE
  Observer Reconnect: (none)
  Observer Override:  FALSE

Configurable Failover Conditions
  Health Conditions:
    Corrupted Controlfile          YES
    Corrupted Dictionary           YES
    Inaccessible Logfile            NO
    Stuck Archiver                  NO
    Datafile Write Errors          YES

  Oracle Error Conditions:
    (none)

So if you want FSFO with Far Sync in 19c, it has to be MaxAvailability (and SYNC redo transport to the Far Sync).
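A minimal sketch of that 19c setup, reusing the member names from the previous post (adapt the RedoRoutes to your configuration):

DGMGRL> EDIT DATABASE 'cdgsima_lhr1pq' SET PROPERTY 'RedoRoutes' = '(LOCAL : cdgsima_farsync1 SYNC)';
DGMGRL> EDIT CONFIGURATION SET PROTECTION MODE AS MaxAvailability;
DGMGRL> ENABLE FAST_START FAILOVER;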


If you don’t need FSFO, as we have seen, there is no problem. The only protection mode that will never work with Far Sync is Maximum Protection.

If FSFO is required, and you want Maximum Performance before 21c, or Maximum Protection, you have to remove Far Sync from the redo route.

Ludovico

Check, check… Does the mic still work? #JoelKallmanday


Update PHP: ✔
Update WordPress: ✔
New content: ⌛

It’s almost six months without blogging from my side. What a bad score!
It’s no coincidence that I’m blogging today, on #JoelKallmanDay:
a day that reminds the community how important it is to share. Knowledge, mostly. But also good and bad experiences, emotions…

A bittersweet day, at least for me.
On the bitter side: it reminds me of Joel, Pieter, and other friends who are no longer with us. It reminds me that, as a Product Manager, I have big shoes to fill, and that no matter how well I do, I will always feel it’s not good enough for the high expectations I set for myself. Guess what! Being a PM is way more complicated than I expected when I applied for the position two years ago. So many things to do or learn, so many requests, and so many customers! And being a PM at Oracle is probably twice as complicated, because no matter how well I (or we as a team) do, there will always be a portion of the community that picks on Oracle technology for one reason or another.

On the bright side: it reminds me that I am incredibly privileged to have this role, working in a great team and helping the most demanding customers get the most out of incredible technology. I love sharing, teaching, giving constructive feedback, producing quality content, and improving the customer experience. This is the sweet part of the job, where I am still taking baby steps compared to the PM legends we have in our organization. They are always glad to explain our products to the community, customers, and colleagues! And they are all excellent mentors, each with a different style, background, and personal life.

And knowing people personally is, at least for me, the best thing about being part of a community (outside Oracle) and a team (inside Oracle). We all strive for the best technical solutions, performance, developer experience, or uptime for the business. But we are human first of all. And this is what #JoelKallmanDay stands for, at least for me: trying to be a better human as a goal, so that everything else comes naturally, including being a great colleague, community servant, or friend. ♥

Find Ludovico at Oracle Cloud World 2022!


Are you attending OCW, and do you want to find me to learn more about how to avoid downtime and data loss? Or how to optimize your application configuration to make the most out of MAA technologies? Or any other database or technology-related topic?

Maybe you prefer just a chat, discussing life? Over a coffee or tea? (Or maybe a beer?)

👇This is where you can find me during OCW.👇

Monday, October 17, 2022

6:30 PM – 10:00 PM – Customer Appreciation Event

Where: Mandalay Bay Shark Reef

This is an invitation-only event. If you are one of the lucky customers who have an invitation, let’s meet there! It will be fun to discuss technology, business, and life while watching sharks and enjoying a drink together.

Tuesday, October 18, 2022

2:00 PM – 4:30 PM – Oracle Maximum Availability Architecture with Oracle RAC and Active Data Guard

Where: CloudWorld Hub, Database booth DB-01

Come together and ask anything about Data Guard, Active Data Guard, RAC, FPP, or High Availability! See some products in action, and get some insights from my colleagues and me. The booth will be open during the whole exhibition time, but I will certainly be there on Tuesday during these two hours.

4:00 PM – 5:30 PM – Protect Your Business Using Oracle Full Stack Disaster Recovery Service –  Interactive Hands-On-Lab [HOL4089]

Where: Bellini 2003, The Venetian, Level 2

I will help my colleague Suraj Ramesh run the hands-on lab of this brand-new (actually, yet-to-be-released!) service for general-purpose Disaster Recovery in the cloud.

After HOL4089 until 7:00 PM – Welcome Reception

Where: CloudWorld Hub, Database booth DB-01

I will probably join to say hello during the Welcome Reception. Maybe you can spot me there 🙂

Wednesday, October 19, 2022

10:00 AM – 12:00 PM – Oracle Maximum Availability Architecture with Oracle RAC and Active Data Guard

Where: CloudWorld Hub, Database booth DB-01

I will be there once again to answer all your questions and show some fancy stuff 🙂

1:15 PM – 2:00 PM – Oracle Data Guard—Active, Autonomous, and Always Protective [LRN3528]

Where: San Polo 3403, The Venetian, Level 3

I will talk about Data Guard, Active Data Guard, and what I consider the most important features today. Come to the session to know more!

3:00 PM – 4:30 PM – Protect Your Data with Oracle Active Data Guard – Interactive Hands-On-Lab [HOL4054]

Where: Bellini 2003, The Venetian, Level 2

I will run this hands-on lab. You will have an Active Data Guard 19c configuration in the cloud at your fingertips, and you will play with role changes, corruption detection and repair, and other features. I will be there to share insights, hints, and recommendations on how to implement it in your work environment.

Thursday, October 20, 2022

11:40 AM – 12:00 PM – The Least-Known Facts About Oracle Data Guard and Oracle Active Data Guard [LIT4029]

Where: Ascend Lounge, CloudWorld Hub, The Venetian

This will be great! I bet you will discover MANY things that you did not know about Data Guard and Active Data Guard. Come to know more!

 

See you there!

Ludovico

Video: The importance of Fast-Start Failover in an Oracle Data Guard configuration


Why is Fast-Start Failover a crucial component for mission-critical Data Guard deployments?
The observer lowers the RTO in case of failure, and the Fast-Start Failover protection modes protect the database from split-brain and data loss.

Video: Where should I put the Observer in a Fast-Start Failover configuration?


The video explains best practices and different failure scenarios for different observer placements. It also shows how to configure high availability for the observer.

Here’s the summary:

  • Always try to put the observer(s) on an external site.
  • If you don’t have any, put it where the primary database is, and have one ready on the secondary site after the role transition.
  • Don’t put the observer together with the standby database!
  • Configure multiple observers for high availability, and use the PreferredObserverHosts Data Guard member property to ensure you never run the observer where the standby database is (see the example below).
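For the last point, a minimal sketch (hypothetical member names and observer hosts; the property lists the hosts allowed to run the observer when that member is the primary):

DGMGRL> EDIT DATABASE 'mydb_site1' SET PROPERTY 'PreferredObserverHosts' = 'obs-site1.example.com';
DGMGRL> EDIT DATABASE 'mydb_site2' SET PROPERTY 'PreferredObserverHosts' = 'obs-site2.example.com';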

 


When it comes to using Oracle, trust Oracle…


A month ago, I saw this article published on the AWS architecture blog:

Disaster Recovery for Oracle Database on Amazon EC2 with Fast-Start Failover

I love seeing people suggesting Oracle Data Guard Fast-Start Failover for high availability. Nevertheless, there are a few problems with the architecture and steps proposed in the article.

I sent my comments via Disqus on the AWS blogging platform, but after a month, my comment was rejected, and the blog content hasn’t changed.

For this reason, I have no other place to post my comment but here…

  1. The link to the setup procedure is from 2009.
    We have official documentation that we keep up to date. The Fast-Start Failover part:
    https://docs.oracle.com/en/database/oracle/oracle-database/19/dgbkr/using-data-guard-broker-to-manage-switchovers-failovers.html#GUID-D26D79F2-0093-4C0E-98CD-224A5C8CBFA4
    and the Best Practices guide:
    https://docs.oracle.com/en/database/oracle/oracle-database/19/haovw/oracle-data-guard-best-practices.html#GUID-C3A78B07-6584-4380-8D53-E5B831A5894C
  2. The part about cascading standbys references a step-by-step guide from an external blog written many years ago for 11gR2.
  3. The DBMS_SERVICE doc is from 12cR1, while other links are from the 21c or 19c doc. As of today, most customers implement 19c. That’s probably the version to use.
    https://docs.oracle.com/en/database/oracle/oracle-database/19/arpls/DBMS_SERVICE.html#GUID-C11449DC-EEDE-4BB8-9D2C-0A45198C1928
  4. The steps used to create the database service do not include any HA property, which will make most efforts useless. (see Table 153-6 in the link above).
  5. The article talks about TAF, but there are no steps to configure it. We don’t recommend TAF since 12c anyway. Today (19c), the recommendation is TAC (Transparent Application Continuity).
    https://www.oracle.com/docs/tech/application-checklist-for-continuous-availability-for-maa.pdf
  6. But, most important, TAF (or Oracle connectivity in general) does NOT require a host IP change! There is no need to change the DNS when using the recommended connection string with multiple address_lists (see the example after this list).
  7. Some RedoRoutes examples are not correct. In this video I explain how they work and how to set them up:
    https://www.youtube.com/watch?v=huG8JPu_s4Q
  8. The diagram shows the master observer together with the standby database, which is a bad practice. I explain why and how here:
    https://www.youtube.com/watch?v=e81UPLfnLi0
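For point 6, this is the general shape of the recommended connection string (a sketch with hypothetical hosts and service name; see the MAA application checklist linked above for the recommended timeout values):

MYAPP =
  (DESCRIPTION =
    (CONNECT_TIMEOUT = 90)(RETRY_COUNT = 30)(RETRY_DELAY = 3)(TRANSPORT_CONNECT_TIMEOUT = 3)
    (ADDRESS_LIST =
      (LOAD_BALANCE = ON)
      (ADDRESS = (PROTOCOL = TCP)(HOST = primary-scan.example.com)(PORT = 1521)))
    (ADDRESS_LIST =
      (LOAD_BALANCE = ON)
      (ADDRESS = (PROTOCOL = TCP)(HOST = standby-scan.example.com)(PORT = 1521)))
    (CONNECT_DATA = (SERVICE_NAME = myapp.example.com)))

No matter which database is the primary, the client tries both address lists and connects to wherever the service is running, with no DNS change involved.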

The central message is:

If you need to implement a complex architecture using a software solution, make sure that the practices suggested by the partner/integrator/3rd party match the ones from the software vendor. In the case of Oracle Data Guard, Oracle knows better 😉

Cheers

Ludovico

 

Does FLASHBACK QUERY work across incarnations or after a Data Guard failover?


Short answer: yes.

Let’s just see it in action.

First, I have a Data Guard configuration in place. On the primary database, the current incarnation has a single parent (the template from which it was created):

SQL> select * from v$database_incarnation;

INCARNATION# RESETLOGS_CHANGE# RESETLOGS PRIOR_RESETLOGS_CHANGE# PRIOR_RES
------------ ----------------- --------- ----------------------- ---------
STATUS  RESETLOGS_ID PRIOR_INCARNATION# FLASHBACK_DATABASE_ALLOWED     CON_ID
------- ------------ ------------------ -------------------------- ----------
           1                 1 14-AUG-23                       0
PARENT    1144840863                  0 NO                                  0

           2           1343420 08-DEC-23                       1 14-AUG-23
CURRENT   1155034180                  1 NO                                  0

Just to make sure I retain enough undo, I increase the undo_retention. On a PDB, this requires LOCAL UNDO to be configured (I hope it’s the default everywhere nowadays).
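A quick way to verify it (DATABASE_PROPERTIES is the documented place to look):

SQL> select property_value from database_properties
  2  where property_name = 'LOCAL_UNDO_ENABLED';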

SQL> alter session set container=PDB1;

Session altered.

SQL> alter system set undo_retention=86400;

System altered.

Then, I update some data to test flashback query:

SQL> alter session set current_schema=HR;

Session altered.

SQL> update hr.employees set HIRE_DATE=sysdate where employee_id=100;

1 row updated.

SQL> commit;

Commit complete.

At this point, I can see the current data, and the data as it was 1 hour ago:

SQL> select hire_date from hr.employees where employee_id=100;

HIRE_DATE
---------
13-DEC-23

SQL> select hire_date from hr.employees as of timestamp systimestamp-1/24 where employee_id=100;

HIRE_DATE
---------
17-JUN-03

Now, I kill the primary database and fail over to the standby database:

# on the primary:
[ primary ] bash-4.4$ ps -eaf | grep pmon
lcaldara 1485907       1  0 10:29 ?        00:00:00 ora_pmon_orcl
lcaldara 1486768 1484883  0 10:37 pts/0    00:00:00 grep pmon
[ primary ] bash-4.4$ kill -9 1485907

# on the standby:
DGMGRL> connect /
Connected to "orcl_site2"
Connected as SYSDG.
DGMGRL> failover to "orcl_site2";
2023-12-13T10:38:31.179+00:00
Performing failover NOW, please wait...

2023-12-13T10:38:37.728+00:00
Failover succeeded, new primary is "orcl_site2".

2023-12-13T10:38:37.729+00:00
Failover processing complete, broker ready.
DGMGRL>

After connecting to the new primary, I can see the new incarnation created by the open resetlogs during the failover:

SQL> select * from v$database_incarnation;

INCARNATION# RESETLOGS_CHANGE# RESETLOGS PRIOR_RESETLOGS_CHANGE# PRIOR_RES
------------ ----------------- --------- ----------------------- ---------
STATUS  RESETLOGS_ID PRIOR_INCARNATION# FLASHBACK_DATABASE_ALLOWED     CON_ID
------- ------------ ------------------ -------------------------- ----------
           1                 1 14-AUG-23                       0
PARENT    1144840863                  0 NO                                  0

           2           1343420 08-DEC-23                       1 14-AUG-23
PARENT    1155034180                  1 NO                                  0

           3           2704078 13-DEC-23                 1343420 08-DEC-23
CURRENT   1155465511                  2 NO                                  0

And I can still query the data as of a previous timestamp:

SQL> select hire_date from hr.employees where employee_id=100;

HIRE_DATE
---------
13-DEC-23

SQL> select hire_date from hr.employees as of timestamp systimestamp-1/24 where employee_id=100;

HIRE_DATE
---------
17-JUN-03

Or flash back the table, if required:

SQL> flashback table hr.employees to timestamp sysdate-1/24;
flashback table hr.employees to timestamp sysdate-1/24
                   *
ERROR at line 1:
ORA-08189: cannot flashback the table because row movement is not enabled


SQL> alter table hr.employees enable row movement;

Table altered.

SQL> flashback table hr.employees to timestamp sysdate-1/24;

Flashback complete.

SQL> select hire_date from hr.employees where employee_id=100;

HIRE_DATE
---------
17-JUN-03

So yes, that works. The caveat is still that you need to retain enough data in the undo tablespace to rebuild the rows in their previous state.
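To get a rough idea of how far back the undo currently allows you to go, one can look at v$undostat (a sketch; TUNED_UNDORETENTION is expressed in seconds):

SQL> select min(begin_time) oldest_stats, max(tuned_undoretention) tuned_retention_sec
  2  from v$undostat;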

Ludo

New in Data Guard 21c and 23c: Automatic preparation of the primary


Oracle Data Guard 21c came with a new command:

prepare database for data guard
with db_unique_name is {db_unique_name}
db_recovery_file_dest_size is "{size}"
db_recovery_file_dest is "{dest}" ;

The prepare database for data guard command prepares a database to become the primary in a Data Guard configuration: it automatically sets parameters and creates standby redo logs according to best practices.

It sets many recommended parameters:

DB_FILES                      = 1024
LOG_BUFFER                    = 256M
DB_BLOCK_CHECKSUM             = TYPICAL
DB_LOST_WRITE_PROTECT         = TYPICAL
DB_FLASHBACK_RETENTION_TARGET = 120
PARALLEL_THREADS_PER_CPU      = 1
STANDBY_FILE_MANAGEMENT       = AUTO
DG_BROKER_START               = TRUE

It also sets the RMAN archivelog deletion policy, enables Flashback Database and force logging, creates the standby redo logs according to the online redo log configuration, and creates an spfile if the database is running with an init file.

If you tried this in 21c, you may have noticed that there is an automatic restart of the database to set all the static parameters. If you weren’t expecting it, the sudden restart could feel a bit brutal.

In 23c, we added an additional keyword, “restart”, to specify that you are OK with the restart of the database. If you don’t specify it, the broker complains that it cannot proceed without a restart:

DGMGRL> prepare database for data guard
> with db_unique_name is chol23c_hwq_lhr
> db_recovery_file_dest_size is "200g"
> db_recovery_file_dest is "/u03/app/oracle/fast_recovery_area"
> ;
Validating database "cdb1" before executing the command.
  DGM-17552: Primary database must be restarted after setting static initialization parameters.
  DGM-17327: Primary database must be restarted to enable archivelog mode.
Failed.
DGMGRL>

If you specify it, it will proceed with the restart:

DGMGRL> prepare database for data guard
>   with db_unique_name is chol23c_hwq_lhr
>   db_recovery_file_dest_size is "200g"
>   db_recovery_file_dest is "/u03/app/oracle/fast_recovery_area"
>   restart;
Validating database "chol23c_hwq_lhr" before executing the command.
Preparing database "chol23c_hwq_lhr" for Data Guard.
Initialization parameter DB_FILES set to 1024.
Initialization parameter LOG_BUFFER set to 268435456.
Primary database must be restarted after setting static initialization parameters.
Shutting down database "chol23c_hwq_lhr".
Database closed.
Database dismounted.
ORACLE instance shut down.
Starting database "chol23c_hwq_lhr" to mounted mode.
ORACLE instance started.
Database mounted.
Initialization parameter DB_FLASHBACK_RETENTION_TARGET set to 120.
Initialization parameter DB_LOST_WRITE_PROTECT set to 'TYPICAL'.
RMAN configuration archivelog deletion policy set to SHIPPED TO ALL STANDBY.
Initialization parameter DB_RECOVERY_FILE_DEST_SIZE set to '200g'.
Initialization parameter DB_RECOVERY_FILE_DEST set to '/u03/app/oracle/fast_recovery_area'.
LOG_ARCHIVE_DEST_n initialization parameter already set for local archival.
Initialization parameter LOG_ARCHIVE_DEST_2 set to 'location=use_db_recovery_file_dest valid_for=(all_logfiles, all_roles)'.
Initialization parameter LOG_ARCHIVE_DEST_STATE_2 set to 'Enable'.
Adding standby log group size 1073741824 and assigning it to thread 1.
Adding standby log group size 1073741824 and assigning it to thread 1.
Adding standby log group size 1073741824 and assigning it to thread 1.
Initialization parameter STANDBY_FILE_MANAGEMENT set to 'AUTO'.
Initialization parameter DG_BROKER_START set to TRUE.
Database set to FLASHBACK ON.
Database opened.
Succeeded.
DGMGRL>

Notice that if you already have these static parameters set, the broker will just set the missing dynamic parameters without the need for a restart:

DGMGRL> prepare database for data guard
>   with db_unique_name is chol23c_hwq_lhr
>   db_recovery_file_dest_size is "200g"
>   db_recovery_file_dest is "/u03/app/oracle/fast_recovery_area"
> ;
Validating database "chol23c_hwq_lhr" before executing the command.
Preparing database "chol23c_hwq_lhr" for Data Guard.
Initialization parameter DB_RECOVERY_FILE_DEST_SIZE set to '200g'.
Initialization parameter DB_RECOVERY_FILE_DEST set to '/u03/app/oracle/fast_recovery_area'.
LOG_ARCHIVE_DEST_n initialization parameter already set for local archival.
Initialization parameter LOG_ARCHIVE_DEST_1 set to 'location=use_db_recovery_file_dest valid_for=(all_logfiles, all_roles)'.
Initialization parameter LOG_ARCHIVE_DEST_STATE_1 set to 'Enable'.
Succeeded.

This new command greatly simplifies the preparation of a Data Guard configuration!

Before 21c, you had to do everything by hand.

Ludo

New views in Oracle Data Guard 23c


Oracle Data Guard 23c comes with many nice improvements for observability, which greatly increase the usability of Data Guard in environments with a high level of automation.

For the 23c version, we have the following new views.

V$DG_BROKER_ROLE_CHANGE

This view tracks the last role transitions that occurred in the configuration. Example:

SQL> select * from v$dg_broker_role_change;

EVENT         STANDBY_TYPE    OLD_PRIMARY       NEW_PRIMARY       FS_FAILOVER_REASON    BEGIN_TIME                         END_TIME                              CON_ID
_____________ _______________ _________________ _________________ _____________________ __________________________________ __________________________________ _________
Switchover    Physical        adghol_53k_lhr    adghol_p4n_lhr                          18-DEC-23 10.40.12.000000000 AM    18-DEC-23 10.40.32.000000000 AM            0
Switchover    Physical        adghol_p4n_lhr    adghol_53k_lhr                          18-DEC-23 10.48.55.000000000 AM    18-DEC-23 10.49.15.000000000 AM            0

The event might be a Switchover, a Failover, or a Fast-Start Failover.

In the case of Fast-Start Failover, you will also see the reason (typically “Primary Disconnected” if it comes from the observer, or whatever reason you pass to DBMS_DG.INITIATE_FS_FAILOVER).

No more need to analyze the logs to find out which database was primary at any moment in time!
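For example, a quick way to reconstruct the role transition history (a sketch using only the columns shown above):

SQL> select end_time, event, old_primary, new_primary
  2  from v$dg_broker_role_change
  3  order by end_time;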

V$DG_BROKER_PROPERTY

Before 23c, the only way to get a broker property from SQL was to use undocumented (and unsupported) procedures in the fixed package DBMS_DRS. I blogged about it in the past, before joining Oracle.

Now, it’s as easy as selecting from a view, where you can get the properties per member or per configuration:

SQL> select member, property, value from V$DG_BROKER_PROPERTY where value is not null;

MEMBER      PROPERTY                        VALUE
___________ _______________________________ _________
mydb        FastStartFailoverThreshold      180
mydb        OperationTimeout                30
mydb        TraceLevel                      USER
mydb        FastStartFailoverLagLimit       300
mydb        CommunicationTimeout            180
mydb        ObserverReconnect               0
mydb        ObserverPingInterval            0
mydb        ObserverPingRetry               0
mydb        FastStartFailoverAutoReinstate  TRUE
mydb        FastStartFailoverPmyShutdown    TRUE
...
mydb_site1  DGConnectIdentifier             mydb_site1
mydb_site1  FastStartFailoverTarget         mydb_site2
mydb_site1  LogShipping                     ON
mydb_site1  LogXptMode                      ASYNC
mydb_site1  DelayMins                       0
...
mydb_site1  StaticConnectIdentifier         (DESCRIPTION=<...>)))
mydb_site1  TopWaitEvents                   (monitor)
mydb_site1  SidName                         (monitor)
mydb_site2  DGConnectIdentifier             mydb_site2
mydb_site2  FastStartFailoverTarget         mydb_site1

The example selects just three columns, but the view is rich in detail, showing which properties apply to which situation (scope, valid_role):

SQL> set sqlformat json-formatted
SQL> select * from v$dg_broker_property where member='adghol_p4n_lhr' and upper(property) like '%REDO%';
{
  "results" : [
    {
    ...
      "items" : [
        {
          "member" : "adghol_p4n_lhr",
          "instance" : "N/A",
          "dataguard_role" : "PHYSICAL STANDBY",
          "property" : "PreferredObserverHosts",
          "property_type" : "CONFIGURABLE",
          "value" : "",
          "value_type" : "STRING",
          "scope" : "MEMBER",
          "valid_role" : "N/A",
          "con_id" : 0
        },
        {
          "member" : "adghol_p4n_lhr",
          "instance" : "N/A",
          "dataguard_role" : "PHYSICAL STANDBY",
          "property" : "RedoRoutes",
          "property_type" : "CONFIGURABLE",
          "value" : "",
          "value_type" : "STRING",
          "scope" : "MEMBER",
          "valid_role" : "N/A",
          "con_id" : 0
        },
        {
          "member" : "adghol_p4n_lhr",
          "instance" : "N/A",
          "dataguard_role" : "PHYSICAL STANDBY",
          "property" : "RedoCompression",
          "property_type" : "CONFIGURABLE",
          "value" : "DISABLE",
          "value_type" : "STRING",
          "scope" : "MEMBER",
          "valid_role" : "STANDBY",
          "con_id" : 0
        }
      ]
    }
  ]
}

The monitorable properties can be retrieved using DBMS_DG.GET_PROPERTY(). I’ll write a blog post about the new PL/SQL APIs in the upcoming weeks.

I wish I had this view when I was a DBA 🙂

V$FAST_START_FAILOVER_CONFIG

If you have a Fast-Start Failover configuration, this view will show its details:

SQL> SELECT fsfo_mode, status, current_target, threshold, observer_present, observer_host,
 2> protection_mode, lag_limit, auto_reinstate, observer_override, shutdown_primary FROM V$FAST_START_FAILOVER_CONFIG;

FSFO_MODE           STATUS                 CURRENT_TARGET THRESHOLD OBSERVE OBSERVER_HOST PROTECTION_MODE  LAG_LIMIT AUTO_ OBSER SHUTD
___________________ ______________________ ______________ _________ _______ _____________ ________________ _________ _____ _____ _____
POTENTIAL DATA LOSS TARGET UNDER LAG LIMIT mydb_site2           180 YES     mydb-obs      MaxPerformance         300 TRUE  FALSE TRUE

This view replaces some columns of v$database, which are therefore deprecated:

SQL> desc v$database

Name                            Null?    Type
_______________________________ ________ ________________
...
FS_FAILOVER_MODE                         VARCHAR2(19)
FS_FAILOVER_STATUS                       VARCHAR2(22)
FS_FAILOVER_CURRENT_TARGET               VARCHAR2(30)
FS_FAILOVER_THRESHOLD                    NUMBER
FS_FAILOVER_OBSERVER_PRESENT             VARCHAR2(7)
FS_FAILOVER_OBSERVER_HOST                VARCHAR2(512)
...

V$FS_LAG_HISTOGRAM

This view is useful to determine the optimal FastStartFailoverLagLimit.

SQL> select * from v$fs_lag_histogram;

   THREAD# LAG_TYPE      LAG_TIME  LAG_COUNT LAST_UPDATE_TIME         CON_ID                        
---------- ----------- ---------- ---------- -------------------- ----------                        
         1 APPLY                5        122 01/23/2023 10:46:07           0                        
         1 APPLY               10          5 01/02/2023 16:12:42           0                        
         1 APPLY               15          2 12/25/2022 12:01:23           0                        
         1 APPLY               30          0                               0                        
         1 APPLY               60          0                               0                        
         1 APPLY              120          0                               0                        
         1 APPLY              180          0                               0                        
         1 APPLY              300          0                               0                        
         1 APPLY            65535          0                               0

It shows the frequency of Fast-Start Failover lags and the most recent occurrence for each bucket.

LAG_TIME is the upper bound of the bucket, e.g.

  • 5 -> between 0 and 5 seconds
  • 10 -> between 5 and 10 seconds
  • etc.

It’s refreshed every minute, and only when Fast-Start Failover is enabled (including in observe-only mode).
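As an example, a rough starting point for the lag limit could be the highest bucket that actually has occurrences (a sketch using only the columns shown above):

SQL> select max(lag_time) observed_upper_bound
  2  from v$fs_lag_histogram
  3  where lag_count > 0;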

V$FS_FAILOVER_OBSERVERS

This view is not new; however, its definition now contains more columns:

SQL> desc  v$fs_failover_observers
 Name                           Null?    Type
 ------------------------------ -------- -----------------
 NAME                                    VARCHAR2(513)
 REGISTERED                              VARCHAR2(4)
 HOST                                    VARCHAR2(513)
 ISMASTER                                VARCHAR2(4)
 TIME_SELECTED                           TIMESTAMP(9)
 PINGING_PRIMARY                         VARCHAR2(4)
 PINGING_TARGET                          VARCHAR2(4)
 CON_ID                                  NUMBER
 
 -- new in 23c:
 LAST_PING_PRIMARY                       NUMBER
 LAST_PING_TARGET                        NUMBER
 LOG_FILE                                VARCHAR2(513)
 STATE_FILE                              VARCHAR2(513)
 CURRENT_TIME                            TIMESTAMP(9)

This gives important additional information about the observers, for example, the number of seconds since a specific observer last pinged the primary or the target.

Also, the paths of the observer log file and runtime state file are available, making it easier to find them on the observer host in case of a problem.
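For example, a quick health check of the observers (a sketch; it assumes LAST_PING_PRIMARY and LAST_PING_TARGET report the seconds since the last successful ping, as described above):

SQL> select name, ismaster, last_ping_primary, last_ping_target, log_file
  2  from v$fs_failover_observers;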

Conclusion

These new views should greatly improve the experience when monitoring or diagnosing problems with Data Guard. And they are just a part of the many improvements we introduced in 23c. Stay tuned for more 🙂

Ludovico
