# [The wrong sysadmin](#)
## — 28 April, 2015

*NOTE: This was replicated from the [Unix
Diary](http://nixers.net/showthread.php?tid=1539&pid=11836#pid11836) thread at
[http://nixers.net](http://nixers.net)*

Dear Unix diary,

today I've been a bad sysadmin.
It just happened. I host my own git repository, and earlier this evening I was working on my CRUX ports tree when I decided to commit and push my work. But this time, something went wrong and git wouldn't let me push any reference. Amongst all the messages returned by git, I saw this one:

    remote: fatal: write error: No space left on device

Fucking shit. I instantly imagined what was happening: my /var partition wasn't correctly sized upon creation. This is where I host my website, gopherhole, git repo, pictures, videos, ... Every 'production' service. And after serving me well for several years, it's now full.

Luckily, I had set up all my partitions on top of LVM, and left around 200GiB unallocated, just in case things went wrong. And they did.

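For the record, that spare room is nothing fancy: you simply don't allocate the whole volume group up front. A rough sketch of such a setup, not my actual history (the device name is an assumption):

    # carve out fixed-size volumes and leave the remaining
    # extents free in vg0, for the day a filesystem must grow
    pvcreate /dev/sda2             # device name is made up
    vgcreate vg0 /dev/sda2
    lvcreate -L 50G -n var vg0
    lvcreate -L 100G -n home vg0
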
So here I am, staring at my red prompt, typing a few commands:

    root ~# df -h
    Filesystem                Size      Used Available Use% Mounted on
    mdev                      1.0M         0      1.0M   0% /dev
    shm                     499.4M         0    499.4M   0% /dev/shm
    /dev/dm-1                 4.0G    797.9M      3.2G  20% /
    tmpfs                    99.9M    208.0K     99.7M   0% /run
    cgroup_root              10.0M         0     10.0M   0% /sys/fs/cgroup
    /dev/sda1                96.8M     14.5M     77.3M  16% /boot
    /dev/mapper/vg0-var      50.0G     50.0G     20.0K 100% /var
    /dev/mapper/vg0-home    100.0G     12.9G     85.2G  13% /home
    /dev/mapper/vg0-data    600.0G    346.7G    252.1G  58% /data
    tmpfs                   499.4M         0    499.4M   0% /tmp
    tmpfs                   499.4M     32.4M    467.0M   6% /home/z3bra/tmp
    /dev/mapper/vg0-data    600.0G    346.7G    252.1G  58% /var/lib/mpd/music

    root ~# mount | grep /var
    /dev/mapper/vg0-var on /var type xfs (rw,relatime,attr2,inode64,noquota)

    root ~# lvs
      LV   VG   Attr       LSize
      data vg0  -wi-ao---- 600.00g
      home vg0  -wi-ao---- 100.00g
      root vg0  -wi-ao----   4.00g
      swap vg0  -wi-ao----   1.00g
      var  vg0  -wi-ao----  50.00g

    root ~# vgs
      VG   #PV #LV #SN Attr   VSize   VFree
      vg0    1   5   0 wz--n- 931.41g 176.41g

Ok, so this isn't the first time this has happened, remember? You already grew your /home partition, and it went well! Just do the same with /var! It works without a reboot!

What were those commands again?

    root ~# lvextend -L +20G vg0/var
      Extending logical volume var to 70.00 GiB
      63e74d07f000-63e74d2c1000 r-xp 00000000 fd:01 8430401                    /lib/libdevmapper.so.1.02: mlock failed: Out of memory
      63e74d2c6000-63e74d4cb000 r-xp 00000000 fd:01 8430404                    /lib/libdevmapper-event.so.1.02: mlock failed: Out of memory
      Logical volume var successfully resized
      Internal error: Reserved memory (9064448) not enough: used 9084928. Increase activation/reserved_memory?

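Those mlock warnings gave me a cold sweat, but the resize went through. If I read the last message right, LVM keeps a reserve of locked memory for its critical sections, and the error itself points at the knob in lvm.conf (the value below is a guess):

    # /etc/lvm/lvm.conf
    activation {
        reserved_memory = 16384    # in KiB; the default was too small here
    }

Anyway, lvs confirms the new size:
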
    root ~# lvs
      LV   VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
      data vg0  -wi-ao---- 600.00g
      home vg0  -wi-ao---- 100.00g
      root vg0  -wi-ao----   4.00g
      swap vg0  -wi-ao----   1.00g
      var  vg0  -wi-ao----  70.00g

    root ~# xfs_growfs -d /var
    meta-data=/dev/mapper/vg0-var    isize=256    agcount=4, agsize=3276800 blks
             =                       sectsz=4096  attr=2, projid32bit=1
             =                       crc=0
    data     =                       bsize=4096   blocks=13107200, imaxpct=25
             =                       sunit=0      swidth=0 blks
    naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
    log      =internal               bsize=4096   blocks=6400, version=2
             =                       sectsz=4096  sunit=1 blks, lazy-count=1
    realtime =none                   extsz=4096   blocks=0, rtextents=0
    data blocks changed from 13107200 to 18350080

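For next time: I believe a recent enough LVM can do both steps at once, growing the filesystem right after the volume (a sketch, same names as above):

    lvextend -r -L +20G vg0/var    # -r calls the filesystem grow tool itself

One last df to be sure:
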
    root ~# df -h
    Filesystem                Size      Used Available Use% Mounted on
    mdev                      1.0M         0      1.0M   0% /dev
    shm                     499.4M         0    499.4M   0% /dev/shm
    /dev/dm-1                 4.0G    797.9M      3.2G  20% /
    tmpfs                    99.9M    208.0K     99.7M   0% /run
    cgroup_root              10.0M         0     10.0M   0% /sys/fs/cgroup
    /dev/sda1                96.8M     14.5M     77.3M  16% /boot
    /dev/mapper/vg0-var      70.0G     50.0G     20.0G  71% /var
    /dev/mapper/vg0-home    100.0G     12.9G     85.2G  13% /home
    /dev/mapper/vg0-data    600.0G    346.7G    252.1G  58% /data
    tmpfs                   499.4M         0    499.4M   0% /tmp
    tmpfs                   499.4M     32.4M    467.0M   6% /home/z3bra/tmp
    /dev/mapper/vg0-data    600.0G    346.7G    252.1G  58% /var/lib/mpd/music

Phew... I'm safe now! So what the hell was going on? I decided to investigate a bit further, to see what I should watch out for next time.
That's how I realised that I had made a **HUGE** mistake...

    root ~# cd /var/
    root var# du -sh *
    48.5G   backup
    156.7M  cache
    0       db
    0       empty
    228.8M  git
    5.7M    gopher
    4.5G    lib
    0       local
    0       lock
    7.9M    log
    0       mail
    0       run
    40.0K   spool
    0       tmp
    1.1G    www

    root var# cd backup/

    root backup# du -sh *
    12.0K   bin
    20.0K   etc
    48.5G   out
    20.0K   usr
    84.0K   var

    root backup# mountpoint out
    out is not a mountpoint

    root backup# cd out/

    root out# ll
    total 50841516
    drwxr-sr-x    2 backup   users       4.0K Apr 28 02:11 ./
    drwxr-sr-x    8 backup   users       4.0K Feb  2 20:24 ../
    -rw-r--r--    1 backup   users       5.3G Apr 25 07:43 data
    -rw-r--r--    1 backup   users          0 Apr 25 07:43 data.0.BAK
    -rw-r--r--    1 backup   users      12.0G Apr 26 04:37 homedir
    -rw-r--r--    1 backup   users      12.0G Apr 22 04:43 homedir.0.BAK
    -rw-r--r--    1 backup   users      12.0G Apr 25 05:00 homedir.1.BAK
    -rw-r--r--    1 backup   users      44.0K Apr 26 04:42 homedir.2.BAK
    -rw-r--r--    1 backup   users       1.2G Apr 28 02:11 production
    -rw-r--r--    1 backup   users       1.2G Apr 21 02:10 production.0.BAK
    -rw-r--r--    1 backup   users       1.2G Apr 22 02:11 production.1.BAK
    -rw-r--r--    1 backup   users       1.2G Apr 23 02:11 production.2.BAK
    -rw-r--r--    1 backup   users       1.2G Apr 24 02:11 production.3.BAK
    -rw-r--r--    1 backup   users       1.2G Apr 25 02:12 production.4.BAK
    -rw-r--r--    1 backup   users          0 Apr 26 02:11 production.5.BAK
    -rw-r--r--    1 backup   users       5.3M Apr 27 02:12 production.6.BAK
    -rw-r--r--    1 backup   users          0 Apr 28 02:11 production.7.BAK

My backup system doesn't check whether it saves to a mountpoint or not. Shit.
For a whole week, all my backups were created on my /var partition instead of the backup USB drive meant for this purpose. And they filled it up pretty quickly.

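The guard my backup script was missing is a handful of lines. A sketch, using the actual path from the listing above (the surrounding script is hypothetical):

    #!/bin/sh
    # refuse to write backups anywhere but onto the mounted USB drive
    out=/var/backup/out
    if ! mountpoint -q "$out"; then
        echo "$out is not a mountpoint, aborting backup" >&2
        exit 1
    fi
    # ... actual backup job goes here ...
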
My backup system sends me a mail after each backup, explaining how it went. Whether or not it saved to a mountpoint is written in there. I just stopped checking. Silly me.

I realise that this issue could have been easily solved by mounting my backup disk elsewhere, moving the files onto it, then remounting it where it should be. But I didn't. Instead, I grew a partition that didn't need to grow (the backups filled 48GiB out of the 50GiB allocated to /var), and this partition can't be shrunk anymore, as it's an XFS filesystem.

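For posterity, here is roughly what the right move would have looked like (the device name of the USB drive is an assumption):

    mount /dev/sdb1 /mnt                  # backup drive, mounted anywhere
    mv /var/backup/out/* /mnt/            # move the stray backups onto it
    umount /mnt
    mount /dev/sdb1 /var/backup/out       # remounted where it belongs
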
So today I learnt two things, the hard way:

1. Don't do anything until you know what's going on
2. Configure system checks and READ THEM (see the sketch below)

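Lesson 2 in practice can be as dumb as a cron job that only speaks up when something is wrong, so that a mail actually means something. A sketch (the threshold, the recipient, and the availability of mail(1) are assumptions):

    #!/bin/sh
    # run from cron; mails root only when /var crosses the threshold
    usage=$(df -P /var | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
    if [ "$usage" -ge 90 ]; then
        echo "/var is ${usage}% full" | mail -s "disk alert: $(hostname)" root
    fi
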
I hope you'll learn from my mistakes. For now I think I'll just print this out and hang it over my desktop, as a reminder:

    root ~# df -h /var/
    Filesystem                Size      Used Available Use% Mounted on
    /dev/mapper/vg0-var      70.0G      1.5G     68.5G   2% /var

<!-- vim: set ft=markdown ts=4 et tw=80: -->