Friday, March 6, 2015

Recovering from temporary disk failure

============================================================================================================================================================
In this exercise, a temporary disk failure is simulated. The goal is to recover all of the redundant and nonredundant volumes that were on the failed drive.
============================================================================================================================================================

Check the current disk availability on the system:
==================================================

[root@server1 /]# vxdisk -o alldgs list
DEVICE       TYPE            DISK         GROUP        STATUS
emc0_dd1     auto:cdsdisk    -            -            online
emc0_dd2     auto:cdsdisk    -            -            online
emc0_dd3     auto:cdsdisk    -            -            online
emc0_dd4     auto:cdsdisk    -            -            online
emc0_dd5     auto:none       -            -            online invalid
emc0_dd6     auto:none       -            -            online invalid
emc0_dd7     auto:none       -            -            online invalid
emc0_dd8     auto:none       -            -            online invalid
emc0_dd9     auto:none       -            -            online invalid
emc0_d10     auto:none       -            -            online invalid
emc0_d11     auto:none       -            -            online invalid
emc0_d12     auto:none       -            -            online invalid
sda          auto:none       -            -            online invalid
sdb          auto:none       -            -            online invalid
[root@server1 /]#
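
The listing above has four disks that are initialized for VxVM (TYPE auto:cdsdisk) but not yet in any disk group; these are the candidates for the new DG. As a rough sketch, a small filter can pick them out of the listing. The function name free_cds_disks is made up for this post, and it assumes the five-column layout shown above.

```shell
#!/bin/sh
# free_cds_disks (hypothetical helper): read "vxdisk -o alldgs list" output
# on stdin and print devices that are initialized as cdsdisk, are online,
# and have no disk media name and no disk group.
free_cds_disks() {
    awk '$2 == "auto:cdsdisk" && $3 == "-" && $4 == "-" && $5 == "online" && NF == 5 { print $1 }'
}

# Typical use on a host with VxVM installed:
#   vxdisk -o alldgs list | free_cds_disks
```

Disks shown with a group in parentheses, or with status "online invalid", are skipped on purpose.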


Create a DG:
============

[root@server1 /]# vxdg init testdg testdg01=emc0_dd1 testdg02=emc0_dd2 testdg03=emc0_dd3
[root@server1 /]#


List the devices and the DG:
============================


[root@server1 /]# vxdisk -o alldgs list                                   
DEVICE       TYPE            DISK         GROUP        STATUS
emc0_dd1     auto:cdsdisk    testdg01     testdg       online
emc0_dd2     auto:cdsdisk    testdg02     testdg       online
emc0_dd3     auto:cdsdisk    testdg03     testdg       online
emc0_dd4     auto:cdsdisk    -            -            online
emc0_dd5     auto:none       -            -            online invalid
emc0_dd6     auto:none       -            -            online invalid
emc0_dd7     auto:none       -            -            online invalid
emc0_dd8     auto:none       -            -            online invalid
emc0_dd9     auto:none       -            -            online invalid
emc0_d10     auto:none       -            -            online invalid
emc0_d11     auto:none       -            -            online invalid
emc0_d12     auto:none       -            -            online invalid
sda          auto:none       -            -            online invalid
sdb          auto:none       -            -            online invalid
[root@server1 /]#


___________________________________________________________________________

Now create the setup and the disk failure as follows. First, disable the hot-relocation daemon.
Then create two volumes (test1 and test2), create a vxfs file system on each,
mount both file systems, and copy the same files to each. The two volumes
use different layouts:

* test1 with a mirrored layout
* test2 with a concatenated layout

Then introduce the temporary disk failure and follow the procedure described here to fix it.
___________________________________________________________________________


Killing the hot relocation daemon:
==================================


* Kill the hot-relocation daemon "vxrelocd".
vxrelocd is the hot-relocation daemon that monitors events
that affect data redundancy. If redundancy failures are detected, vxrelocd
automatically relocates affected data from mirrored or RAID-5 subdisks to
spare disks or other free space within the disk group.


* Disabling vxrelocd:
If you do not want automatic subdisk relocation, you can disable the hot-relocation feature by killing the relocation daemon, vxrelocd, and preventing it from restarting. However, do not kill the daemon while it is performing a relocation.

To kill the daemon, list the running processes and find the two entries for vxrelocd:
        ps -ef
Then kill both processes, substituting PID1 and PID2 with the process IDs of the two vxrelocd processes:
        kill -9 PID1 PID2
To prevent vxrelocd from being started again, comment out the line that starts vxrelocd in the startup script /etc/init.d/vxvm-recover.

* Note: For reference, the relocation daemon (vxrelocd) can be started again with:  vxrelocd root &
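
Finding the two PIDs by eye in a long ps listing is error-prone. A minimal sketch of a filter that does it is shown here; the function name vxrelocd_pids is hypothetical, and it assumes the usual "ps -ef" layout where the PID is the second column.

```shell
#!/bin/sh
# vxrelocd_pids (hypothetical helper): read "ps -ef" output on stdin and
# print the PID (column 2) of every line that mentions vxrelocd.
# Lines mentioning awk are skipped so the filter does not report itself.
vxrelocd_pids() {
    awk '/vxrelocd/ && !/awk/ { print $2 }'
}

# Typical use:
#   ps -ef | vxrelocd_pids
# Check the printed PIDs before passing them to kill.
```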
______________________________________________________

Create 2 volumes as discussed above:

        /usr/sbin/vxassist -g testdg make test1 102400 layout=mirror testdg01 testdg02
                The test1 volume was created successfully.
        /usr/sbin/vxassist -g testdg make test2 102400 layout=concat testdg02
                The test2 volume was created successfully.
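
The length 102400 is given without a unit suffix, so vxassist reads it as sectors. Assuming 512-byte sectors, that works out to the 50.00m that vxprint reports for both volumes later on. The helper name sectors_to_mb is only for illustration.

```shell
#!/bin/sh
# sectors_to_mb (hypothetical helper): convert a length in 512-byte sectors
# to whole mebibytes. Assumes a 512-byte sector size.
sectors_to_mb() {
    echo $(( $1 * 512 / 1024 / 1024 ))
}

sectors_to_mb 102400    # prints 50
```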
______________________________________________________

Create 2 file systems and mount them:

        /sbin/mkfs -t vxfs /dev/vx/rdsk/testdg/test1
        mkdir /test1
        /bin/mount -t vxfs /dev/vx/dsk/testdg/test1 /test1

        /sbin/mkfs -t vxfs /dev/vx/rdsk/testdg/test2
        mkdir /test2
        /bin/mount -t vxfs /dev/vx/dsk/testdg/test2 /test2
______________________________________________________

Copy some files into both file systems:

        /bin/cp /etc/default/* /test1
        /bin/cp /etc/default/* /test2

______________________________________________________

Fail the device as follows:
        The device to fail is emc0_dd2.
        /usr/lib/vxvm/bin/vxpartinfo /dev/vx/rdmp/emc0_dd2 8 > /tmp/emc0_dd2.part8
        /bin/dd if=/dev/vx/rdmp/emc0_dd2 of=/tmp/emc0_dd2.private bs=128k skip=1 count=256 >/dev/null 2>&1
        Overwriting the private region of emc0_dd2. Please wait...
        /bin/dd if=/dev/zero of=/dev/vx/rdmp/emc0_dd2 bs=128k seek=1 count=2 >/dev/null 2>&1

Force VxVM to detect the failure.
        /sbin/vxdctl disable
        /sbin/vxdctl enable
        /sbin/vxdctl enable
        Restoring the private region of emc0_dd2. Please wait...
        fmthard -s /tmp/emc0_dd2.vtoc /dev/vx/rdmp/emc0_dd2
        /bin/dd if=/tmp/emc0_dd2.private of=/dev/vx/rdmp/emc0_dd2 bs=128k seek=1 count=256 >/dev/null 2>&1
        /sbin/vxdisk scandisks

        A temporary disk failure has now occurred in the testdg disk group.
        Troubleshoot and repair the failure.  Ensure all volumes are started!
        NOTE - for this failure you are using the same disk because this was only
        a temporary failure.
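
The dd arguments in the lab script are worth unpacking. bs=128k skip=1 count=256 saves 256 x 128 KiB = 32 MiB starting 128 KiB into the device, which matches the 32.00m PRIVLEN that vxprint reports for these disks. The overwrite then zeroes only the first two blocks (256 KiB) of that region, and the restore writes the saved copy back at the same offset. The same save / overwrite / restore pattern can be rehearsed safely on an ordinary file. The sketch below uses a scaled-down block size and made-up function names, and it never touches a real disk.

```shell
#!/bin/sh
# Rehearsal of the save / overwrite / restore pattern on a scratch file.
# Function names and sizes are illustrative only.

# save_and_clobber IMAGE BACKUP: copy 8 blocks of IMAGE, starting one block
# in, to BACKUP; then zero the first 2 of those blocks in place.
# conv=notrunc stops dd from truncating a regular file when writing into it.
save_and_clobber() {
    dd if="$1" of="$2" bs=4k skip=1 count=8 2>/dev/null &&
    dd if=/dev/zero of="$1" bs=4k seek=1 count=2 conv=notrunc 2>/dev/null
}

# restore_region IMAGE BACKUP: write the saved blocks back at the same offset.
restore_region() {
    dd if="$2" of="$1" bs=4k seek=1 count=8 conv=notrunc 2>/dev/null
}
```

Comparing a checksum of the scratch file before the overwrite and after the restore is a quick way to confirm that the offsets line up.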


______________________________________________________

Verify the disk has failed:

[root@server1 /]# vxdisk -o alldgs list
DEVICE       TYPE            DISK         GROUP        STATUS
emc0_dd1     auto:cdsdisk    testdg01     testdg       online
emc0_dd2     auto:cdsdisk    -            (testdg)     online
emc0_dd3     auto:cdsdisk    testdg03     testdg       online
emc0_dd4     auto:cdsdisk    -            -            online
emc0_dd5     auto:none       -            -            online invalid
emc0_dd6     auto:none       -            -            online invalid
emc0_dd7     auto:none       -            -            online invalid
emc0_dd8     auto:none       -            -            online invalid
emc0_dd9     auto:none       -            -            online invalid
emc0_d10     auto:none       -            -            online invalid
emc0_d11     auto:none       -            -            online invalid
emc0_d12     auto:none       -            -            online invalid
sda          auto:none       -            -            online invalid
sdb          auto:none       -            -            online invalid
-            -         testdg02     testdg       failed was:emc0_dd2
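
The last line of the listing is the one that matters: the disk media record testdg02 has lost its device, while emc0_dd2 itself appears as a disk that belongs to no imported record. A minimal sketch of a filter for such lines follows; the name failed_disk_media is hypothetical and the parsing assumes the "failed was:DEVICE" status format shown above.

```shell
#!/bin/sh
# failed_disk_media (hypothetical helper): read "vxdisk -o alldgs list"
# output on stdin and print "DISK GROUP OLD_DEVICE" for every disk media
# record whose status is "failed was:<device>".
failed_disk_media() {
    awk '$5 == "failed" && $6 ~ /^was:/ { sub(/^was:/, "", $6); print $3, $4, $6 }'
}

# Typical use:
#   vxdisk -o alldgs list | failed_disk_media
```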


[root@server1 /]# vxprint -g testdg -htu h
DG NAME         NCONFIG      NLOG     MINORS   GROUP-ID
ST NAME         STATE        DM_CNT   SPARE_CNT         APPVOL_CNT
DM NAME         DEVICE       TYPE     PRIVLEN  PUBLEN   STATE
RV NAME         RLINK_CNT    KSTATE   STATE    PRIMARY  DATAVOLS  SRL
RL NAME         RVG          KSTATE   STATE    REM_HOST REM_DG    REM_RLNK
CO NAME         CACHEVOL     KSTATE   STATE
VT NAME         RVG          KSTATE   STATE    NVOLUME
V  NAME         RVG/VSET/CO  KSTATE   STATE    LENGTH   READPOL   PREFPLEX UTYPE
PL NAME         VOLUME       KSTATE   STATE    LENGTH   LAYOUT    NCOL/WID MODE
SD NAME         PLEX         DISK     DISKOFFS LENGTH   [COL/]OFF DEVICE   MODE
SV NAME         PLEX         VOLNAME  NVOLLAYR LENGTH   [COL/]OFF AM/NM    MODE
SC NAME         PLEX         CACHE    DISKOFFS LENGTH   [COL/]OFF DEVICE   MODE
DC NAME         PARENTVOL    LOGVOL
SP NAME         SNAPVOL      DCO
EX NAME         ASSOC        VC                       PERMS    MODE     STATE
SR NAME         KSTATE

dg testdg       default      default  1000     1422525023.21.sym1
dm testdg01     emc0_dd1     auto     32.00m   1.96g    -
dm testdg02     -            -        -        -        NODEVICE
dm testdg03     emc0_dd3     auto     32.00m   1.96g    -

v  test1        -            ENABLED  ACTIVE   50.00m   SELECT    -        fsgen
pl test1-01     test1        ENABLED  ACTIVE   50.00m   CONCAT    -        RW
sd testdg01-01  test1-01     testdg01 0.00     50.00m   0.00      emc0_dd1 ENA
pl test1-02     test1        DISABLED NODEVICE 50.00m   CONCAT    -        RW
sd testdg02-01  test1-02     testdg02 0.00     50.00m   0.00      -        NDEV

v  test2        -            DISABLED ACTIVE   50.00m   SELECT    -        fsgen
pl test2-01     test2        DISABLED NODEVICE 50.00m   CONCAT    -        RW
sd testdg02-02  test2-01     testdg02 50.00m   50.00m   0.00      -        NDEV
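
In a larger disk group the damaged objects are harder to spot. As a sketch, the volume and plex records that are not ENABLED can be filtered out of the vxprint listing. The function name not_enabled_objects is made up, and it assumes the "-ht" column order shown above (record type, name, association, KSTATE, STATE).

```shell
#!/bin/sh
# not_enabled_objects (hypothetical helper): read "vxprint -ht" output on
# stdin and print type, name, KSTATE and STATE for every volume (v) or
# plex (pl) record whose kernel state is not ENABLED.
# The legend lines use upper-case "V" and "PL", so they never match.
not_enabled_objects() {
    awk '($1 == "v" || $1 == "pl") && $4 != "ENABLED" { print $1, $2, $4, $5 }'
}

# Typical use:
#   vxprint -g testdg -ht | not_enabled_objects
```

For the listing above this reports plex test1-02, volume test2 and plex test2-01, which is exactly the damage caused by losing testdg02.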


______________________________________________________

Attempt to view the files that were copied to mount points /test1 and /test2.
Because the test1 volume is mirrored, the files in the /test1 mount point are
still accessible. When trying to view the files in /test2, you should see the
following error:

/test2: I/O error

[root@server1 /]# df -k
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda1             21225712   4709932  15420152  24% /
tmpfs                  1029756         0   1029756   0% /dev/shm
tmpfs                        4         0         4   0% /dev/vx
mgt:/student          21225728   6176640  13953472  31% /student
/dev/vx/dsk/testdg/test1
                         51200      3177     45028   7% /test1
df: `/test2': Input/output error


[root@server1 /]# ls -l /test1
total 4
drwxr-xr-x 2 root root   96 Jan 29 01:53 lost+found
-rw-r--r-- 1 root root 1302 Jan 29 01:53 nss
-rw-r----- 1 root root  525 Jan 29 01:53 sfm_resolv.conf
-rw------- 1 root root  119 Jan 29 01:53 useradd

[root@server1 /]# ls -l /test2
ls: /test2: Input/output error
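
With more than two file systems, the same check can be looped. This is only a sketch: the function name unreadable_mounts is hypothetical, and it reports any mount point that ls cannot list, whatever the reason, not just I/O errors.

```shell
#!/bin/sh
# unreadable_mounts (hypothetical helper): print every given mount point
# that cannot be listed with ls. It does not distinguish an I/O error from
# other failures such as a missing directory.
unreadable_mounts() {
    for mnt in "$@"; do
        ls "$mnt" >/dev/null 2>&1 || echo "$mnt"
    done
}

# Typical use in this exercise:
#   unreadable_mounts /test1 /test2
```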


====================================
Recover from the failure as follows:
====================================

a) If you are using enclosure-based naming, identify the OS native name of
the disk that has temporarily failed. You will use this OS disk name when
verifying that the operating system recognizes the device.


[root@server1 /]# vxdisk -e list emc0_dd2
DEVICE       TYPE           DISK        GROUP        STATUS               OS_NATIVE_NAME   ATTR
emc0_dd2     auto:cdsdisk   -            -           online               sde              lun
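
To use the native name in the next steps without retyping it, it can be pulled out of the listing. This is a hedged sketch: os_native_name is a made-up function name, and it assumes OS_NATIVE_NAME is the second-to-last column and ATTR is a single word, as in the output above.

```shell
#!/bin/sh
# os_native_name DEVICE (hypothetical helper): read "vxdisk -e list" output
# on stdin and print the OS native name of DEVICE. The second-to-last field
# is used because STATUS can be more than one word (e.g. "online invalid").
os_native_name() {
    awk -v dev="$1" '$1 == dev { print $(NF - 1) }'
}

# Typical use:
#   vxdisk -e list | os_native_name emc0_dd2
```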



b) Ensure that the operating system recognizes the device using the
appropriate OS commands.


[root@server1 /]# partprobe /dev/sde
Warning: The disk CHS geometry (261,255,63) reported by the operating system does not match the geometry stored on the disk label (1024,128,32).


** Note: partprobe informs the operating system kernel of partition table changes by asking it to re-read the partition table.

c) Verify that the operating system recognizes the device using the
appropriate OS commands.


[root@server1 /]# fdisk -l /dev/sde
Disk /dev/sde (Sun disk label): 128 heads, 32 sectors, 1022 cylinders
Units = cylinders of 4096 * 512 bytes

   Device Flag    Start       End    Blocks   Id  System
/dev/sde3  u          0      1022   2093056    5  Whole disk
/dev/sde8  u          0      1022   2093056    f  Unknown


d) Force the VxVM configuration daemon to reread all of the drives in the system.

[root@server1 /]# vxdctl enable
[root@server1 /]# vxdisk -o alldgs list
DEVICE       TYPE            DISK         GROUP        STATUS
emc0_dd1     auto:cdsdisk    testdg01     testdg       online
emc0_dd2     auto:cdsdisk    -            (testdg)     online
emc0_dd3     auto:cdsdisk    testdg03     testdg       online
emc0_dd4     auto:cdsdisk    -            -            online
emc0_dd5     auto:none       -            -            online invalid
emc0_dd6     auto:none       -            -            online invalid
emc0_dd7     auto:none       -            -            online invalid
emc0_dd8     auto:none       -            -            online invalid
emc0_dd9     auto:none       -            -            online invalid
emc0_d10     auto:none       -            -            online invalid
emc0_d11     auto:none       -            -            online invalid
emc0_d12     auto:none       -            -            online invalid
sda          auto:none       -            -            online invalid
sdb          auto:none       -            -            online invalid
-            -         testdg02     testdg       failed was:emc0_dd2



e) Reattach the device to the disk media record using the vxreattach command.

[root@server1 /]# vxreattach
** The command below shows that the disk is back online:
[root@server1 /]# vxdisk -o alldgs list
DEVICE       TYPE            DISK         GROUP        STATUS
emc0_dd1     auto:cdsdisk    testdg01     testdg       online
emc0_dd2     auto:cdsdisk    testdg02     testdg       online
emc0_dd3     auto:cdsdisk    testdg03     testdg       online
emc0_dd4     auto:cdsdisk    -            -            online
emc0_dd5     auto:none       -            -            online invalid
emc0_dd6     auto:none       -            -            online invalid
emc0_dd7     auto:none       -            -            online invalid
emc0_dd8     auto:none       -            -            online invalid
emc0_dd9     auto:none       -            -            online invalid
emc0_d10     auto:none       -            -            online invalid
emc0_d11     auto:none       -            -            online invalid
emc0_d12     auto:none       -            -            online invalid
sda          auto:none       -            -            online invalid
sdb          auto:none       -            -            online invalid


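** Check whether the volumes have been recovered. They have not yet: the plexes on testdg02 now show IOFAIL instead of NODEVICE, and test2 is still DISABLED.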

[root@server1 /]# vxprint -g testdg -htu h
DG NAME         NCONFIG      NLOG     MINORS   GROUP-ID
ST NAME         STATE        DM_CNT   SPARE_CNT         APPVOL_CNT
DM NAME         DEVICE       TYPE     PRIVLEN  PUBLEN   STATE
RV NAME         RLINK_CNT    KSTATE   STATE    PRIMARY  DATAVOLS  SRL
RL NAME         RVG          KSTATE   STATE    REM_HOST REM_DG    REM_RLNK
CO NAME         CACHEVOL     KSTATE   STATE
VT NAME         RVG          KSTATE   STATE    NVOLUME
V  NAME         RVG/VSET/CO  KSTATE   STATE    LENGTH   READPOL   PREFPLEX UTYPE
PL NAME         VOLUME       KSTATE   STATE    LENGTH   LAYOUT    NCOL/WID MODE
SD NAME         PLEX         DISK     DISKOFFS LENGTH   [COL/]OFF DEVICE   MODE
SV NAME         PLEX         VOLNAME  NVOLLAYR LENGTH   [COL/]OFF AM/NM    MODE
SC NAME         PLEX         CACHE    DISKOFFS LENGTH   [COL/]OFF DEVICE   MODE
DC NAME         PARENTVOL    LOGVOL
SP NAME         SNAPVOL      DCO
EX NAME         ASSOC        VC                       PERMS    MODE     STATE
SR NAME         KSTATE

dg testdg       default      default  1000     1422525023.21.sym1
dm testdg01     emc0_dd1     auto     32.00m   1.96g    -
dm testdg02     emc0_dd2     auto     32.00m   1.96g    -
dm testdg03     emc0_dd3     auto     32.00m   1.96g    -

v  test1        -            ENABLED  ACTIVE   50.00m   SELECT    -        fsgen
pl test1-01     test1        ENABLED  ACTIVE   50.00m   CONCAT    -        RW
sd testdg01-01  test1-01     testdg01 0.00     50.00m   0.00      emc0_dd1 ENA
pl test1-02     test1        DISABLED IOFAIL   50.00m   CONCAT    -        RW
sd testdg02-01  test1-02     testdg02 0.00     50.00m   0.00      emc0_dd2 ENA

v  test2        -            DISABLED ACTIVE   50.00m   SELECT    -        fsgen
pl test2-01     test2        DISABLED IOFAIL   50.00m   CONCAT    -        RW
sd testdg02-02  test2-01     testdg02 50.00m   50.00m   0.00      emc0_dd2 ENA



f) Recover the volumes using the vxrecover command.
[root@server1 /]# vxrecover


[root@server1 /]# vxprint -g testdg -htu h
DG NAME         NCONFIG      NLOG     MINORS   GROUP-ID
ST NAME         STATE        DM_CNT   SPARE_CNT         APPVOL_CNT
DM NAME         DEVICE       TYPE     PRIVLEN  PUBLEN   STATE
RV NAME         RLINK_CNT    KSTATE   STATE    PRIMARY  DATAVOLS  SRL
RL NAME         RVG          KSTATE   STATE    REM_HOST REM_DG    REM_RLNK
CO NAME         CACHEVOL     KSTATE   STATE
VT NAME         RVG          KSTATE   STATE    NVOLUME
V  NAME         RVG/VSET/CO  KSTATE   STATE    LENGTH   READPOL   PREFPLEX UTYPE
PL NAME         VOLUME       KSTATE   STATE    LENGTH   LAYOUT    NCOL/WID MODE
SD NAME         PLEX         DISK     DISKOFFS LENGTH   [COL/]OFF DEVICE   MODE
SV NAME         PLEX         VOLNAME  NVOLLAYR LENGTH   [COL/]OFF AM/NM    MODE
SC NAME         PLEX         CACHE    DISKOFFS LENGTH   [COL/]OFF DEVICE   MODE
DC NAME         PARENTVOL    LOGVOL
SP NAME         SNAPVOL      DCO
EX NAME         ASSOC        VC                       PERMS    MODE     STATE
SR NAME         KSTATE

dg testdg       default      default  1000     1422525023.21.sym1
dm testdg01     emc0_dd1     auto     32.00m   1.96g    -
dm testdg02     emc0_dd2     auto     32.00m   1.96g    -
dm testdg03     emc0_dd3     auto     32.00m   1.96g    -

v  test1        -            ENABLED  ACTIVE   50.00m   SELECT    -        fsgen
pl test1-01     test1        ENABLED  ACTIVE   50.00m   CONCAT    -        RW
sd testdg01-01  test1-01     testdg01 0.00     50.00m   0.00      emc0_dd1 ENA
pl test1-02     test1        ENABLED  ACTIVE   50.00m   CONCAT    -        RW
sd testdg02-01  test1-02     testdg02 0.00     50.00m   0.00      emc0_dd2 ENA

v  test2        -            DISABLED ACTIVE   50.00m   SELECT    -        fsgen
pl test2-01     test2        DISABLED IOFAIL   50.00m   CONCAT    -        RW
sd testdg02-02  test2-01     testdg02 50.00m   50.00m   0.00      emc0_dd2 ENA
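
In the listing above, vxrecover has resynchronized the mirror plex test1-02, but test2 is still DISABLED because its only plex is marked IOFAIL and there is no healthy copy to recover from. Such volumes need to be started by hand. As a sketch, the helper below picks them out of the vxprint listing and only prints the vxvol commands rather than running them. The function name force_start_commands is hypothetical.

```shell
#!/bin/sh
# force_start_commands DG (hypothetical helper): read "vxprint -ht" output
# for disk group DG on stdin and print one "vxvol -f start" command for
# every volume record whose kernel state is DISABLED. Nothing is executed.
force_start_commands() {
    awk -v dg="$1" '$1 == "v" && $4 == "DISABLED" { print "vxvol -g " dg " -f start " $2 }'
}

# Typical use:
#   vxprint -g testdg -ht | force_start_commands testdg
```

Printing instead of executing is deliberate: forcing a start is only reasonable when you know the data on the disk is intact, as it is in this temporary-failure exercise.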


[root@server1 /]# df -k
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda1             21225712   4709968  15420116  24% /
tmpfs                  1029756         0   1029756   0% /dev/shm
tmpfs                        4         0         4   0% /dev/vx
mgt:/student          21225728   6176640  13953472  31% /student
/dev/vx/dsk/testdg/test1
                         51200      3177     45028   7% /test1
df: `/test2': Input/output error



g) Use the vxvol command with the -f (force) option to start the nonredundant volume, which is still DISABLED after vxrecover.
[root@server1 /]# vxvol -g testdg -f start test2
[root@server1 /]# vxprint -g testdg -htu h
DG NAME         NCONFIG      NLOG     MINORS   GROUP-ID
ST NAME         STATE        DM_CNT   SPARE_CNT         APPVOL_CNT
DM NAME         DEVICE       TYPE     PRIVLEN  PUBLEN   STATE
RV NAME         RLINK_CNT    KSTATE   STATE    PRIMARY  DATAVOLS  SRL
RL NAME         RVG          KSTATE   STATE    REM_HOST REM_DG    REM_RLNK
CO NAME         CACHEVOL     KSTATE   STATE
VT NAME         RVG          KSTATE   STATE    NVOLUME
V  NAME         RVG/VSET/CO  KSTATE   STATE    LENGTH   READPOL   PREFPLEX UTYPE
PL NAME         VOLUME       KSTATE   STATE    LENGTH   LAYOUT    NCOL/WID MODE
SD NAME         PLEX         DISK     DISKOFFS LENGTH   [COL/]OFF DEVICE   MODE
SV NAME         PLEX         VOLNAME  NVOLLAYR LENGTH   [COL/]OFF AM/NM    MODE
SC NAME         PLEX         CACHE    DISKOFFS LENGTH   [COL/]OFF DEVICE   MODE
DC NAME         PARENTVOL    LOGVOL
SP NAME         SNAPVOL      DCO
EX NAME         ASSOC        VC                       PERMS    MODE     STATE
SR NAME         KSTATE

dg testdg       default      default  1000     1422525023.21.sym1
dm testdg01     emc0_dd1     auto     32.00m   1.96g    -
dm testdg02     emc0_dd2     auto     32.00m   1.96g    -
dm testdg03     emc0_dd3     auto     32.00m   1.96g    -

v  test1        -            ENABLED  ACTIVE   50.00m   SELECT    -        fsgen
pl test1-01     test1        ENABLED  ACTIVE   50.00m   CONCAT    -        RW
sd testdg01-01  test1-01     testdg01 0.00     50.00m   0.00      emc0_dd1 ENA
pl test1-02     test1        ENABLED  ACTIVE   50.00m   CONCAT    -        RW
sd testdg02-01  test1-02     testdg02 0.00     50.00m   0.00      emc0_dd2 ENA

v  test2        -            ENABLED  ACTIVE   50.00m   SELECT    -        fsgen
pl test2-01     test2        ENABLED  ACTIVE   50.00m   CONCAT    -        RW
sd testdg02-02  test2-01     testdg02 50.00m   50.00m   0.00      emc0_dd2 ENA


h) Check the mount points. /test2 still returns an I/O error even though the test2 volume is now ENABLED:

[root@server1 /]# df -k
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda1             21225712   4709976  15420108  24% /
tmpfs                  1029756         0   1029756   0% /dev/shm
tmpfs                        4         0         4   0% /dev/vx
mgt:/student          21225728   6176640  13953472  31% /student
/dev/vx/dsk/testdg/test1
                         51200      3177     45028   7% /test1
df: `/test2': Input/output error



i) Because this is a temporary failure, the files in the test2 volume (and file
system) are still available. Recover the mount point by performing the
following:


a> Unmount the /test2 mount point.
b> Perform an fsck on the file system.
c> Mount the test2 volume to /test2.

[root@server1 /]# umount /test2


[root@server1 /]# fsck -t vxfs /dev/vx/rdsk/testdg/test2
fsck 1.39 (29-May-2006)
log replay in progress
replay complete - marking super-block as CLEAN


[root@server1 /]# mount -t vxfs /dev/vx/dsk/testdg/test2 /test2
[root@server1 /]# df -k
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda1             21225712   4709980  15420104  24% /
tmpfs                  1029756         0   1029756   0% /dev/shm
tmpfs                        4         0         4   0% /dev/vx
mgt:/student          21225728   6176640  13953472  31% /student
/dev/vx/dsk/testdg/test1
                         51200      3177     45028   7% /test1
/dev/vx/dsk/testdg/test2
                         51200      3173     45032   7% /test2
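
The unmount / fsck / mount sequence used above can be wrapped so that a failed step stops the rest. This is a minimal sketch, not part of the lab: the function name remount_vxfs is made up, and the device paths simply follow the /dev/vx/rdsk and /dev/vx/dsk pattern used throughout this post.

```shell
#!/bin/sh
# remount_vxfs DG VOLUME MOUNTPOINT (hypothetical helper): unmount the file
# system, check it on the raw device, and mount it again from the block
# device. Stops at the first step that fails so a file system that did not
# pass fsck is never mounted.
remount_vxfs() {
    dg=$1 vol=$2 mnt=$3
    umount "$mnt" || return 1
    fsck -t vxfs "/dev/vx/rdsk/$dg/$vol" || return 1
    mount -t vxfs "/dev/vx/dsk/$dg/$vol" "$mnt"
}

# Typical use in this exercise:
#   remount_vxfs testdg test2 /test2
```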


j) Unmount the file systems and delete the test1 and test2 volumes:

[root@server1 /]# umount /test1
[root@server1 /]# umount /test2
[root@server1 /]# vxassist -g testdg remove volume test1
[root@server1 /]# vxassist -g testdg remove volume test2
[root@server1 /]#
