SetupMonitoringControlerSunfireV20z
Procedure to enable LSI controller monitoring on a Sunfire v20z
We had a Sunfire v20z server, not managed by Quattor, on which we wanted to monitor the disk controller.
- Get the brand and model of the controller:
cat /proc/scsi/scsi
which in our case gave: LSILOGIC Model: 1030 IM.
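When several devices are attached, the vendor and model fields can be extracted from that listing instead of being read by eye. The following is a minimal sketch, assuming the usual "Vendor: ... Model: ... Rev: ..." line layout of /proc/scsi/scsi; the function name scsi_models is hypothetical and not part of any package.

```shell
#!/bin/sh
# scsi_models [FILE]: print "vendor|model" for each device found in a
# /proc/scsi/scsi-style listing (hypothetical helper name).
# Assumption: device lines look like "  Vendor: X Model: Y Rev: Z".
scsi_models() {
    file="${1:-/proc/scsi/scsi}"
    # Capture the text between "Vendor:" and "Model:", and between
    # "Model:" and "Rev:", trimming the padding spaces around each field.
    sed -n 's/^ *Vendor: *\(.*[^ ]\) *Model: *\(.*[^ ]\) *Rev:.*$/\1|\2/p' "$file"
}
```

Called without an argument, scsi_models reads /proc/scsi/scsi and prints one line per attached device.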
- Get the version of the operating system:
cat /proc/version
which in our case indicated RHEL4.
- Go to the manufacturer's website:
http://www.lsi.com/cm/DownloadSearch.do?locale=EN
and download the drivers for your model and operating system.
- Uncompress the downloaded file. Inside, you'll find a directory ../message/fusion. Copy this directory into /usr/src/linux/drivers.
At this stage, you should have a directory /usr/src/linux/drivers/message/fusion containing the following files:
lsi/mpi_type.h
lsi/mpi.h
lsi/mpi_ioc.h
lsi/mpi_cnfg.h
lsi/mpi_raid.h
mptctl.h
These header files are needed to compile mpt-status.
- Download the mpt-status source tarball from the following address:
http://www.drugphish.ch/~ratz/mpt-status/
- Uncompress the tarball. Inside, you'll find a file mpt-status that you must edit to comment out the following line:
#include <linux/compiler.h>
and then run:
make
Once make has finished successfully, you should have a binary file mpt-status.
- Before testing the new mpt-status command, you must first load the mptctl kernel module:
modprobe mptctl
- Test the mpt-status command with:
mpt-status -s
In our case, we got:
log_id 0 OPTIMAL
phys_id 0 ONLINE
phys_id 1 ONLINE
- Once everything works, you can install the files:
make install
cp man/mpt-status.8 /usr/share/man/man8
- To make sure the mptctl kernel module is loaded each time the server starts:
echo "alias char-major-10-220 mptctl" >> /etc/modprobe.conf
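Note that the echo command above appends the alias again every time it is run. A minimal sketch of a guarded append is shown here; the helper name ensure_line is hypothetical, and it only assumes a grep that supports the standard -q, -x and -F options.

```shell
#!/bin/sh
# ensure_line FILE LINE: append LINE to FILE only if FILE does not
# already contain an identical line (hypothetical helper name).
ensure_line() {
    file="$1"
    line="$2"
    # -q: quiet, -x: match the whole line, -F: fixed string (no regex).
    # A missing FILE makes grep fail, so the line is appended and the
    # file is created.
    if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}
```

It would be called as: ensure_line /etc/modprobe.conf "alias char-major-10-220 mptctl"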
- To make the check on the disk controller fully automatic, you can create the following script as /etc/cron.hourly/checklsi.sh:
#!/bin/sh
# Check of controller state
ADMIN="grid_admin@listserv.vub.ac.be"

# The return code from mpt-status is a bit mask, and can be interpreted
# according to the following table (current as of 1.2.0):
#   Bit  Value  Meaning
#   -----------------------------------------------------------------
#   0    1      Abnormal condition / unknown error
#   1    2      A logical volume has failed
#   2    4      A logical volume is degraded
#   3    8      A logical volume is resyncing
#   4    16     At least one physical disk failed
#   5    32     At least one physical disk is in warning condition

/usr/sbin/mpt-status -s >/dev/null
if [ $? -ne 0 ]
then
    # Email out to let us know the disk failed
    echo "" | mail -s "Disk drive failure on $HOSTNAME" ${ADMIN} -- -r gridstorage@ulb.ac.be
    # Write a message to syslog so big brother can notify operations
    logger -p daemon.info "STORAGE ERROR: A failure was detected with the LSI Logic RAID controller or one of the disk drives"
    logger -p daemon.info "STORAGE ERROR: Run /usr/sbin/mpt-status to view the status of the storage subsystem"
fi
exit 0
Finally, make the script executable:
chmod 755 /etc/cron.hourly/checklsi.sh
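The cron script only tests whether the return code is non-zero. Since that code is a bit mask, it can also be decoded into the individual conditions listed in the script's comment table. The following is a minimal sketch based only on that table (stated there as current for mpt-status 1.2.0); the function name decode_mpt_status is hypothetical.

```shell
#!/bin/sh
# decode_mpt_status RC: print one line for each condition bit set in RC,
# using the bit table quoted in the cron script comments.
# Hypothetical helper name; the table may differ in other versions.
decode_mpt_status() {
    rc="$1"
    if [ "$rc" -eq 0 ]; then
        echo "no problem reported"
        return 0
    fi
    # Each entry is "value:meaning"; split on the first colon.
    for entry in \
        "1:Abnormal condition / unknown error" \
        "2:A logical volume has failed" \
        "4:A logical volume is degraded" \
        "8:A logical volume is resyncing" \
        "16:At least one physical disk failed" \
        "32:At least one physical disk is in warning condition"
    do
        bit=${entry%%:*}
        msg=${entry#*:}
        # Bitwise AND tells us whether this particular bit is set.
        if [ $(( $rc & $bit )) -ne 0 ]; then
            echo "$msg"
        fi
    done
    return 0
}
```

In the cron script, the return code could be saved with rc=$? immediately after the mpt-status call, and the output of decode_mpt_status "$rc" piped to mail in place of the empty message body.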