
HPOM – Checking fmadm faulty and there is a pool without a name

Node : sc02-app03.setaoffice.com
Node Type : Sun SPARC (HTTPS)
Severity : major
OM Server Time: 2018-02-06 11:11:01
Message : UXMON: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-D3, TYPE: Fault, VER: 1, SEVERITY: Major
Msg Group : OS
Application : SOL_mon
Object : FMT
Event Type : not_found

Instance Name : not_found

Instruction : “The Fault Management agent has identified a HW or OS related problem with the severity presented by the ticket.
The problem(s) can be viewed and managed with the command – fmdump
To get a better understanding of the problem and on how to resolve it, locate the event that generated
the ticket in the syslog file /var/adm/messages, a URL will be found (http://sun.com/msg/xxx-nnnn-yy),
follow the link using your Oracle portal account for instructions.”
EventDataSource :
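
The Instruction text points to fmdump. One quick way to pull the raw detail for the event behind this ticket (using the event ID that fmadm faulty reports below) is:

root@sc02-app03:~# fmdump -v -u 4577aa7a-2b00-6eb6-f139-bf2848542fb2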

Checking fmadm faulty, I see a pool without a name:

root@sc02-app03:~# fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Feb 06 15:59:15 4577aa7a-2b00-6eb6-f139-bf2848542fb2 ZFS-8000-HC    Major

Problem Status : open
Diag Engine : zfs-diagnosis / 1.0
System
Manufacturer : unknown
Name : -
Part_Number : unknown
Serial_Number : unknown

System Component
Manufacturer : Oracle-Corporation
Name : ORCL,SPARC-T5-8
Part_Number : unknown
Serial_Number : unknown
Host_ID : (null)

----------------------------------------
Suspect 1 of 1 :
Problem class : fault.fs.zfs.io_failure_wait
Certainty : 100%
Affects : zfs://pool=9fbf9b5d11236d0a
Status : faulted but still in service

Resource
FMRI : "zfs://pool=9fbf9b5d11236d0a"
Status : faulted but still in service

Description : ZFS pool '' has experienced currently unrecoverable I/O failures.

Response : No automated response will occur.

Impact : Read and write I/Os cannot be serviced.

Action : Use 'fmadm faulty' to provide a more detailed view of this event.
Make sure the affected devices are connected, then run 'zpool
clear'. Please refer to the associated reference document at
http://support.oracle.com/msg/ZFS-8000-HC for the latest service
procedures and policies regarding this diagnosis.

To solve this problem, you need to reboot the server.
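
Before rebooting, it can be useful to tie the nameless entry back to a real pool. The FMRI carries the pool GUID, and zpool get guid prints the GUID of each imported pool, so a quick grep shows which one is affected (a minimal sketch; the GUID is the one from the fault above):

root@sc02-app03:~# zpool get guid | grep 9fbf9b5d11236d0a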


How to clear fmadm faulty entries in Solaris 10

Clear fmadm log

root@sc02-app04:~ # fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Feb 06 10:17:04 0ad5260a-f9e0-ef7a-c91f-efcc76b9b164 ZFS-8000-HC    Major

Host : sc02-app04
Platform : ORCL,SPARC-T5-8 Chassis_id :
Product_sn :

Fault class : fault.fs.zfs.io_failure_wait
Affects : zfs://pool=prd171
faulted but still in service
Problem in : zfs://pool=prd171
faulted but still in service

Description : The ZFS pool has experienced currently unrecoverable I/O
failures.

Response : No automated response will be taken.

Impact : Read and write I/Os cannot be serviced.

Action : Use 'fmadm faulty' to provide a more detailed view of this event.
Make sure the affected devices are connected, then run 'zpool
clear'. Please refer to the associated reference document at
http://sun.com/msg/ZFS-8000-HC for the latest service procedures
and policies regarding this diagnosis.

--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Feb 06 10:17:03 fe2302d7-99c9-c0c8-d54b-92495cc94fc9 ZFS-8000-D3    Major

Host : sc02-app04
Platform : ORCL,SPARC-T5-8 Chassis_id :
Product_sn :

Fault class : fault.fs.zfs.device
Affects : zfs://pool=prd171/vdev=5d2cdc446e947471
faulted and taken out of service
Problem in : zfs://pool=prd171/vdev=5d2cdc446e947471
faulted and taken out of service

Description : A ZFS device failed.

Response : No automated response will occur.

Impact : Fault tolerance of the pool may be compromised.

Action : Run 'zpool status -x' for more information. Please refer to the
associated reference document at http://sun.com/msg/ZFS-8000-D3
for the latest service procedures and policies regarding this
diagnosis.

root@sc02-app04:~ # fmadm repair 0ad5260a-f9e0-ef7a-c91f-efcc76b9b164
fmadm: recorded repair to 0ad5260a-f9e0-ef7a-c91f-efcc76b9b164
root@sc02-app04:~ # fmadm repair fe2302d7-99c9-c0c8-d54b-92495cc94fc9
fmadm: recorded repair to fe2302d7-99c9-c0c8-d54b-92495cc94fc9

Clear ereports and resource cache

root@sc02-app04:~ # cd /var/fm/fmd
root@sc02-app04:/var/fm/fmd # rm e* f* c*/eft/* r*/*
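
That rm is destructive, so a cautious variant is to preview what the globs match before deleting (same globs, ls instead of rm):

root@sc02-app04:/var/fm/fmd # ls -d e* f* c*/eft/* r*/*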

Clearing out FMA files with no reboot needed

root@sc02-app04:~ # svcadm disable -s svc:/system/fmd:default
root@sc02-app04:~ # cd /var/fm/fmd
root@sc02-app04:/var/fm/fmd # find /var/fm/fmd -type f -exec ls {} \;
/var/fm/fmd/topo/90ab82b5-08eb-6f9f-9a9a-af2975a2808b/hc-topology.xml
/var/fm/fmd/topo/6b4eba63-3576-e155-ac3b-8f6609f0b968/hc-topology.xml
/var/fm/fmd/topo/1badc01d-82b9-6203-9440-9dd440aedaca/hc-topology.xml
/var/fm/fmd/topo/f32b13d0-63a1-4b5a-e811-bfda6bddcba1/hc-topology.xml
/var/fm/fmd/topo/c4824832-ced3-672a-ec69-a9490f94d2c0/hc-topology.xml
/var/fm/fmd/ckpt/etm/etm
/var/fm/fmd/ckpt/zfs-diagnosis/zfs-diagnosis
root@sc02-app04:/var/fm/fmd # find /var/fm/fmd -type f -exec rm {} \;
root@sc02-app04:/var/fm/fmd # svcadm enable svc:/system/fmd:default

Checking fmadm faulty

root@sc02-app04:~ # fmadm faulty
root@sc02-app04:~ #
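
To double-check that the on-disk logs really are empty too, fmdump can read the fault log directly, and fmdump -e the error-report log; beyond the column header, little or nothing should come back:

root@sc02-app04:~ # fmdump
root@sc02-app04:~ # fmdump -e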

Reset the fmd SERD modules

root@sc02-app04:~ # fmadm reset cpumem-diagnosis
fmadm: failed to reset module cpumem-diagnosis: specified module is not loaded in fault manager
root@sc02-app04:~ # fmadm reset cpumem-retire
fmadm: cpumem-retire module has been reset
root@sc02-app04:~ # fmadm reset eft
fmadm: eft module has been reset
root@sc02-app04:~ # fmadm reset io-retire
fmadm: io-retire module has been reset
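
The first reset failed simply because cpumem-diagnosis is not loaded on this machine. To see which modules are actually loaded before resetting, fmadm config lists them all:

root@sc02-app04:~ # fmadm config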

Source: https://saifulaziz.com/2011/12/26/how-to-clear-fmadm-log-or-fma-faults-log/

Solaris – UXMON: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-D3, TYPE: Fault, VER: 1, SEVERITY: Major

UXMON: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-D3, TYPE: Fault, VER: 1, SEVERITY: Major

Node : solaris.setaoffice.com
Node Type : Sun SPARC (HTTPS)
Severity : major
OM Server Time: 2017-08-12 10:27:31
Message : UXMON: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-D3, TYPE: Fault, VER: 1, SEVERITY: Major
Msg Group : OS
Application : SOL_mon
Object : FMT
Event Type : not_found

Instance Name : not_found

Instruction : “The Fault Management agent has identified a HW or OS related problem with the severity presented by the ticket.
The problem(s) can be viewed and managed with the command – fmdump
To get a better understanding of the problem and on how to resolve it, locate the event that generated
the ticket in the syslog file /var/adm/messages, a URL will be found (http://sun.com/msg/xxx-nnnn-yy),
follow the link using your Oracle portal account for instructions.”

After running fmadm faulty, we see that there is a problem with a zpool. Running zpool status, we see that pool prd027_software is having problems:

root@solaris:~ # zpool status prd027_software
  pool: prd027_software
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
        Run 'zpool status -v' to see device specific details.
   see: http://support.oracle.com/msg/ZFS-8000-8A
  scan: none requested
config:

        NAME                                     STATE     READ WRITE CKSUM
        prd027_software                          ONLINE       0     0 14.7K
          c0t600507680191818C1000000000000BE9d0  ONLINE       0     0     0
          c0t600507680191818C1000000000000BEAd0  ONLINE       0     0     0
          c0t600507680191818C1000000000000BEBd0  ONLINE       0     0     0
          c0t600507680191818C1000000000000BECd0  ONLINE       0     0     0

errors: 3 data errors, use '-v' for a list

Run zpool scrub prd027_software

root@solaris:~ # zpool scrub prd027_software

root@solaris:~ # zpool status -xv
  pool: prd027_software
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://support.oracle.com/msg/ZFS-8000-8A
  scan: scrub in progress since Wed Dec 31 21:00:00 1969
        50.7M scanned out of 1.08T at 25.3M/s, 12h25m to go
        0 repaired, 0.00% done
config:

        NAME                                     STATE     READ WRITE CKSUM
        prd027_software                          ONLINE       0     0 14.7K
          c0t600507680191818C1000000000000BE9d0  ONLINE       0     0     0
          c0t600507680191818C1000000000000BEAd0  ONLINE       0     0     0
          c0t600507680191818C1000000000000BEBd0  ONLINE       0     0     0
          c0t600507680191818C1000000000000BECd0  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /zones/prd027/root/usr/software/best1/Patrol3/Solaris-2-10-sparc-64/best1/7.4.00/bgs/monitor/log/prd027-bgsagent_6767.als
        prd027_software/software027:
        prd027_software/software027:

After the pool is scrubbed, check whether there is still a problem:

root@solaris:~ # zpool status -xv
all pools are healthy
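
Here the scrub was enough. Had the permanent errors survived it, the usual next step (a sketch, not part of the original session) would be to restore or remove the affected files, reset the pool's error counters with zpool clear, and scrub again:

root@solaris:~ # zpool clear prd027_software
root@solaris:~ # zpool scrub prd027_software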

Repairing fmadm entries

root@solaris:~ # fmadm faulty|grep "Aug"
Aug 12 11:23:22 82fe93a5-8120-657b-9e61-e33252b84d30 ZFS-8000-D3 Major
Aug 12 11:22:01 74c61e33-7c56-4aca-d707-a32ce06a9bd8 ZFS-8000-CS Major

root@solaris:~ # fmadm repair 82fe93a5-8120-657b-9e61-e33252b84d30
fmadm: recorded repair to 82fe93a5-8120-657b-9e61-e33252b84d30

root@solaris:~ # fmadm repair 74c61e33-7c56-4aca-d707-a32ce06a9bd8
fmadm: recorded repair to 74c61e33-7c56-4aca-d707-a32ce06a9bd8

root@solaris:~ # fmadm faulty
root@solaris:~ #

You can’t disable SOL_mon.

These alerts are generated from the global hardware policy, not from any configuration files, so there is no option to suppress them on the HPOM side.

Please enable suppression in the Jet tool using the free-style format:

Source Type  Template Name
-----------  ----------------------------------------------------------------
Logfile      UXMON_sol_hw_syslog_PRE(1.2)

Message Text
------------
UXMON: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-D3, TYPE: Fault, VER: 1, SEVERITY: Major

Custom Message Attributes
-------------------------
EventSource     MS_OVO
EventUniqueID   UXMON-HW-000376
condition_name  FMD events of Fault type
