Author: Emerson

Installing HP Server Automation client for Solaris

For Solaris I included the flag --withrpm to enable installation of software from Opsware:

root@solaris:/tmp # ./opsware-agent-34c.0.0.135-solaris-5.10 -s -o -c --loglevel info --logfile /var/tmp/log.txt --withrpm --opsw_gw_addr_list 148.95.133.132:3001 --force_new_device

[INFO] Log started.

[WARN] The command-line parameter --clean has been deprecated.
Will check the gateway list for connectivity
[INFO] Opsware Gateway (148.95.133.132:3001) available at '148.95.133.132:3001'.

[INFO] legacy config file '/var/lc/cogbot/etc/cogbot.args' not found - skipping force --withrpm check.

UnZipSFX 5.42 of 14 January 2001, by Info-ZIP (Zip-Bugs@lists.wku.edu).
inflating: /tmp/~7119-1.WRK/opsware-agent
inflating: /tmp/~7119-1.WRK/opsware-agent-installer.py
creating: /tmp/~7119-1.WRK/crypto/agent/
inflating: /tmp/~7119-1.WRK/crypto/agent/agent.srv
inflating: /tmp/~7119-1.WRK/README
inflating: /tmp/~7119-1.WRK/install_opswagentrpm.sh
inflating: /tmp/~7119-1.WRK/install_rpm.sh
inflating: /tmp/~7119-1.WRK/install_syncbot.sh
inflating: /tmp/~7119-1.WRK/ISMsyncb0.pkg
inflating: /tmp/~7119-1.WRK/OPSWrpm-32h.0.0.2-1.pkg
inflating: /tmp/~7119-1.WRK/opswagentrpm.rpm
inflating: /tmp/~7119-1.WRK/opswagentrpm.sun4us.rpm

[INFO] Installing Opsware agent into '/opt/opsware/agent'.

[INFO] Agent uninstaller script copied successfully.

[INFO] Installation completed successfully.

[INFO] Checking for restorations at '/tmp/~7119-1.WRK/cog_extension'

[INFO] Registering server with the Opsware core.

[INFO] Writing no_full_hw_reg_requested file: /var/opt/opsware/agent/no_full_hw_reg_requested

[INFO] Successfully created no_full_hw_reg_requested file: /var/opt/opsware/agent/no_full_hw_reg_requested
INFO: Startup of initial hardware registration at '05/26/10 11:09:59'
INFO: No crypto found, using bootstrap crypto.
TRACE: bootstrap crypto path: /var/opt/opsware/crypto/agent/bootstrap
TRACE: copy /var/opt/opsware/crypto/agent/bootstrap/agent.srv to /var/opt/opsware/crypto/agent/agent.srv
TRACE: flushing the certmaster cache
INFO: Checking with spin to see about clearing mid and/or crypto...
TRACE: check that '/etc/opt/opsware/agent/mid' exists: 0
INFO: no mid file was found at: '/etc/opt/opsware/agent/mid' assuming new install.
TRACE: bootstrap crypto path: /var/opt/opsware/crypto/agent/bootstrap
TRACE: remove: /var/opt/opsware/crypto/agent/agent.srv
TRACE: copy /var/opt/opsware/crypto/agent/bootstrap/agent.srv to /var/opt/opsware/crypto/agent/agent.srv
TRACE: flushing the certmaster cache
INFO: Initially registering hardware and operating system information...
Retrieved Machine ID is null.
Chassis ID: 847e3198
TRACE: Connecting to 'https://spin:1004/spinrpc.py'...
Opsware machine ID : '415820499'
Received 'agent-ca.crt'.
Received 'opsware-ca.crt'.
Received 'cogbot.srv'.
Storing 'cogbot.srv' contents as 'agent.srv'
Received 'admin-ca.crt'.

[INFO] Minimal server registration completed successfully.

[INFO] Registering Opsware agent with the Opsware core.

[INFO] Writing no_sw_reg_requested file: /var/opt/opsware/agent/no_sw_reg_requested

[INFO] Successfully created no_sw_reg_requested file: /var/opt/opsware/agent/no_sw_reg_requested

[INFO] Writing do_check_reachability file: /var/opt/opsware/agent/do_check_reachability

[INFO] Successfully created do_check_reachability file: /var/opt/opsware/agent/do_check_reachability

[INFO] Starting Opsware agent.
Starting agent
Daemonbot: Wed May 26 11:10:01 2010: Looks like nothing is listening on :1002
Daemonbot: Wed May 26 11:10:01 2010: Daemonbot confirms that nothing is listening on :1002
Daemonbot: Wed May 26 11:10:01 2010: Daemonbot will try and start a new shadowbot...
Daemonbot: Wed May 26 11:10:01 2010: Starting /opt/opsware/agent/pylibs/shadowbot/daemonbot.pyc...
Daemonbot: Wed May 26 11:10:01 2010: pidpath /var/opt/opsware/agent
Daemonbot: Wed May 26 11:10:01 2010: logpath /var/log/opsware/agent
Daemonbot: Wed May 26 11:10:01 2010: ports ['1002']
Daemonbot: Wed May 26 11:10:01 2010: Started process group 7160

[INFO] Opsware agent started.

[INFO] Log ended.

Linux – Password has been used already. Choose another

root@linux:~ # passwd emerson
Changing password for emerson.
New Password:
Reenter New Password:
Password has been used already. Choose another.
Password changed

Linux keeps the old passwords stored in /etc/security/opasswd. Delete the lines containing the user whose password you are trying to change.

You can also check the file /etc/pam.d/common-password and look for a line with the remember parameter.

password required pam_pwhistory.so use_authtok remember=6 retry=3
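Clearing the history can be sketched like this. This is an assumption-labeled example, not a supported procedure: it demonstrates the edit on a temp copy of an opasswd-style file (format user:uid:count:hashes); on a real system the file is /etc/security/opasswd, and you should back it up first.

```shell
# Sketch: remove one user's entries from a pam_pwhistory opasswd-style file.
# Demonstrated on a temp copy; the real file is /etc/security/opasswd
# (take a backup before editing it).
user=emerson
tmp=$(mktemp)
printf '%s\n' 'root:0:2:hashA,hashB' 'emerson:1000:3:hashC,hashD,hashE' > "$tmp"
sed -i "/^${user}:/d" "$tmp"   # delete that user's history line(s)
cat "$tmp"
```

After this, pam_pwhistory no longer has old hashes for that user, so the "Password has been used already" check passes.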

End of life information about HP-UX, Solaris, AIX and Linux

If you need to know whether a release of a Unix operating system is still supported by the vendor, check these links for information:

End of life information about HP-UX (PDF)

End of life information about Solaris

End of life information about AIX

End of life information about Red Hat Enterprise Linux

End of life information about Suse Linux Enterprise

Solaris passwd: System error: no files password

root@solaris:/ # passwd emerson
New Password:
Re-enter new Password:
passwd: System error: no files password for emerson.
Permission denied

In this case there was a problem in the /etc/passwd file. There was a blank line between two users.
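A quick way to spot this kind of corruption is to look for blank or malformed entries. The sketch below runs against a temp copy (point it at /etc/passwd on the real system); the only assumption is the standard 7-field passwd format.

```shell
# Sketch: find blank or malformed lines in a passwd-style file.
# A valid entry has exactly 7 colon-separated fields; a blank line has 0.
f=$(mktemp)
printf '%s\n' 'root:x:0:0:root:/root:/bin/sh' '' 'emerson:x:1000:1000::/home/emerson:/bin/sh' > "$f"
awk -F: 'NF != 7 { printf "line %d: %s\n", NR, ($0 == "" ? "<blank>" : $0) }' "$f"
```

Here it reports the blank line 2, which is exactly the kind of entry that broke passwd above.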

Checking which satellite HP SA is pointing to

To check which satellite HP Server Automation (formerly known as Opsware) is pointing to, check the file /etc/opt/opsware/agent/opswgw.args:

root@solaris:/ # cat /etc/opt/opsware/agent/opswgw.args
opswgw.gw_list: 164.56.164.222:3001,164.56.164.221:3001

This server is pointing to two satellites.
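If you want one gateway per line (handy for scripting checks against each satellite), a small sketch that splits the comma-separated list works; it assumes only the opswgw.gw_list key format shown above, and is demonstrated on a temp copy of the file.

```shell
# Sketch: print each configured Opsware gateway on its own line.
# Demonstrated on a temp copy of the opswgw.args format shown above;
# point it at /etc/opt/opsware/agent/opswgw.args on a real server.
f=$(mktemp)
echo 'opswgw.gw_list: 164.56.164.222:3001,164.56.164.221:3001' > "$f"
awk -F'[ ,]' '/^opswgw.gw_list:/ { for (i = 2; i <= NF; i++) print $i }' "$f"
```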

Problem with HMC – rebooting

hscroot@localhost:~> vtmenu

Retrieving name of managed system(s) . . . 10D400C

----------------------------------------------------------
Partitions On Managed System: 10D400C
----------------------------------------------------------
1) LPAR1 Not Available:
2) LPAR2 Not Available:

Enter Number of Running Partition (q to quit): q

Bye.

The server with the two LPAR partitions was shut down due to electrical maintenance. When I tried to start the partitions, I got this error:

hscroot@localhost:~> chsysstate -r lpar -m 10D400C -o on -n LPAR1
Unable to lock the Service Processor. Perform one of the following steps: (1) Check serial cable connection; (2) Check if another Console is communicating with the Service Processor; (3) Perform the Release Lock task; (4) Perform Rebuild task to re-establish the connection.

I tried again and I got a different error.

hscroot@localhost:~> chsysstate -r lpar -m 10D400C -o on -n LPAR1
Command sent to Service Processor failed. Error Response 4.

To reboot the IBM HMC, type the command below:

hscroot@localhost:~> hmcshutdown -t now -r

Broadcast message from root (Sun Jun 6 08:35:38 2010):

The system is going down for reboot NOW!

I had problems with the reboot itself, so I asked for the HMC to be powered off and back on. After that I had no more problems.

Display the status of a tape drive on Solaris

root@solaris:/ # luxadm probe
No Network Array enclosures found in /dev/es

Found Fibre Channel device(s):
(Removed to show only the tape devices)
Node WWN:500104f0009429c3 Device Type:Tape device
Logical Path:/dev/rmt/0n
Node WWN:500104f0009429c6 Device Type:Tape device
Logical Path:/dev/rmt/7n
Node WWN:500104f0009429c9 Device Type:Tape device
Logical Path:/dev/rmt/9n
Node WWN:500104f0009429cc Device Type:Tape device
Logical Path:/dev/rmt/11n

This command showed the status of a loaded tape drive:

root@solaris:/ # mt -f /dev/rmt/7n status
HP Ultrium LTO 3 tape drive:
sense key(0x0)= No Additional Sense residual= 0 retries= 0
file no= 0 block no= 0

This command shows that no tape is loaded or the drive is offline:

root@solaris:/ # mt -f /dev/rmt/9n status
/dev/rmt/9n: no tape loaded or drive offline
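When checking several drives, the two output shapes above can be told apart with a small helper. This is a sketch: classify() is a hypothetical function I'm introducing here, and the string patterns it matches are assumptions based only on the two mt outputs shown above.

```shell
# Sketch: classify `mt -f <dev> status` output as ready or offline.
# classify() is a hypothetical helper, not part of mt; the patterns
# come from the two output examples above.
classify() {
  case "$1" in
    *"no tape loaded"*|*"offline"*) echo offline ;;
    *"sense key"*)                  echo ready ;;
    *)                              echo unknown ;;
  esac
}

classify "HP Ultrium LTO 3 tape drive: sense key(0x0)= No Additional Sense"
classify "/dev/rmt/9n: no tape loaded or drive offline"
```

In a loop over /dev/rmt/*n this gives a one-word status per drive instead of the full mt output.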

Filesystem mounted but showing I/O error

I was getting I/O errors on filesystems below /var:

root@solaris:~ # df -h
Filesystem size used avail capacity Mounted on
/dev/md/dsk/d0 7.7G 4.9G 2.7G 64% /
/devices 0K 0K 0K 0% /devices
ctfs 0K 0K 0K 0% /system/contract
proc 0K 0K 0K 0% /proc
mnttab 0K 0K 0K 0% /etc/mnttab
swap 39G 1.2M 39G 1% /etc/svc/volatile
objfs 0K 0K 0K 0% /system/object
/platform/SUNW,Sun-Fire-T200/lib/libc_psr/libc_psr_hwcap1.so.1
7.7G 4.9G 2.7G 64% /platform/sun4v/lib/libc_psr.so.1
/platform/SUNW,Sun-Fire-T200/lib/sparcv9/libc_psr/libc_psr_hwcap1.so.1
7.7G 4.9G 2.7G 64% /platform/sun4v/lib/sparcv9/libc_psr.so.1
fd 0K 0K 0K 0% /dev/fd
/dev/md/dsk/d50 15G 2.2G 13G 15% /var
swap 2.0G 2.2M 2.0G 1% /tmp
df: cannot statvfs /var/run: I/O error
/dev/md/dsk/d140 11G 8.9G 2.0G 82% /sites
/dev/md/dsk/d60 7.7G 2.7G 4.9G 37% /opt
/dev/md/dsk/d90 963M 1.0M 904M 1% /logs
/dev/md/dsk/d85 963M 3.3M 902M 1% /home
/dev/md/dsk/d110 963M 730M 175M 81% /u01
df: cannot statvfs /var/mqm: I/O error
df: cannot statvfs /var/crash: I/O error

I was able to enter the directory /var but I was not able to list anything.

root@solaris:~ # cd /var
cd: error retrieving current directory: getcwd: cannot access parent directories: I/O error

root@solaris:/var # ls -al
.: I/O error

On the console I was seeing this error from makeutx:

Jul 22 21:53:41 svc.startd[9669]: makeutx failed, retrying: I/O error
Jul 22 21:53:42 svc.startd[9669]: makeutx failed, retrying: No such file or directory

I decided to unmount the filesystem and mount it again, and the error disappeared:

root@solaris:/root # umount /var
root@solaris:/root # df -h /var
Filesystem size used avail capacity Mounted on
/dev/md/dsk/d0 7.7G 4.9G 2.7G 64% /

root@solaris:/root # mount /var
root@solaris:/root # df -h /var
Filesystem size used avail capacity Mounted on
/dev/md/dsk/d50 15G 2.2G 13G 15% /var

Resetting an LPAR on a Power4 system

First you need to gather some information about your system on the HMC. Issue the command vtmenu to get the managed system ID and the names of the partitions:

hscroot@localhost:~> vtmenu

Retrieving name of managed system(s) . . . 108F19C

----------------------------------------------------------
Partitions On Managed System: 108F19C
----------------------------------------------------------
1) MANUFACTURING Running:
2) RETAIL Running:

Enter Number of Running Partition (q to quit): q

Bye.

In this example I tried a soft reset of the partition called MANUFACTURING:

hscroot@localhost:~> chsysstate -m 108F19C -r lpar -n MANUFACTURING -o reset

Since it didn't work out as expected, I decided to power off the LPAR:

hscroot@localhost:~> chsysstate -m 108F19C -r lpar -n MANUFACTURING -o off

hscroot@localhost:~> vtmenu

Retrieving name of managed system(s) . . . 108F19C

----------------------------------------------------------
Partitions On Managed System: 108F19C
----------------------------------------------------------
1) MANUFACTURING Ready:
2) RETAIL Running:

Enter Number of Running Partition (q to quit): q

Bye.

I turned the partition on, and after that it worked flawlessly:

hscroot@localhost:~> chsysstate -r lpar -m 108F19C -o on -n MANUFACTURING

hscroot@localhost:~> vtmenu

Retrieving name of managed system(s) . . . 108F19C

----------------------------------------------------------
Partitions On Managed System: 108F19C
----------------------------------------------------------
1) MANUFACTURING Starting:
2) RETAIL Running:

Enter Number of Running Partition (q to quit): q

Bye.

Stopping and Starting eTrust Access Control

To stop eTrust Access Control, use secons -s:

root@solaris:/ # /usr/seos/bin/secons -s
eTrust secons v5.30 (5.30) – Console Utility
Copyright 2003 Computer Associates International, Inc.
eTrust is now DOWN !!!

To start it, use seload:

root@solaris:/ # /usr/seos/bin/seload
eTrust seload v5.30 (5.30) – Loader Utility
Copyright 2003 Computer Associates International, Inc.
eTrust kernel extension is already loaded.
Starting eTrust daemon. (/usr/seos/bin/seosd)
18 Jul 2010 11:59:48> WAKE_UP : Server going up
18 Jul 2010 11:59:48> INFO : Filter Mask: 'WATCHDOG*' is registered
18 Jul 2010 11:59:48> INFO : Filter Mask: 'INFO : Setting PV*' is registered
18 Jul 2010 11:59:48> INFO : Filter Mask: 'INFO : DB*' is registered
18 Jul 2010 11:59:48> INFO : Filter Mask: '*seosd.trace*' is registered
18 Jul 2010 11:59:48> INFO : Filter Mask: '*FILE*secons*(*/log/*)*' is registered
Starting seosd. PID = 8115.
Starting seagent. PID = 8117
Starting seoswd. PID = 8137
seagent: Loading database image...
Executing [daemons] command: /usr/seos/bin/serevu
seagent: Initialization phase completed
Starting serevu. PID = 8141
serevu: Multiple instances of serevu are not allowed.