DCache
monitoring
- SRM monitor http://maite.iihe.ac.be:8098/billing/xml/
- dcache head page http://maite.iihe.ac.be:2288
- SRM manual https://sdm.lbl.gov/srm-wg/doc/SRM.v2.2.html
postgres tuning
benchmark
- http://edoceo.com/liber/db-postgresql-benchmark
- switch to postgres user
su - postgres
- initialise the benchmark db
- -s sets the scaling factor; it should be at least as large as the maximum number of clients you plan to test with
createdb testdb
pgbench -i testdb -s 25
- run the benchmark
pgbench -c 10 -t 100 testdb
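The number to watch in the pgbench output is the tps line; it looks roughly like this (numbers made up), and the script below averages the "including" value:
tps = 85.184871 (including connections establishing)
tps = 86.587859 (excluding connections establishing)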
#!/bin/bash
## number of clients
clients="1 10 25 50 75 100"
#clients="1 2"
## number of transactions
trans="100 500 1000 2000 5000 10000 50000"
#trans="100 200"
## scaling factor used for initialisation
scaling_fact="1 5 10 15 20"
#scaling_fact="1 2"
## number of retries
stats=100
#stats=5
for sc in $scaling_fact
do
    dropdb testdb >& /dev/null
    createdb testdb >& /dev/null
    pgbench -i testdb -s $sc >& /dev/null
    for c in $clients
    do
        for t in $trans
        do
            tot_i=0
            for s in `seq 1 $stats`
            do
                pgbench -c $c -t $t testdb >& /tmp/out-pgbench
                echo "pgbench -c $c -t $t testdb" >> /tmp/out-pgbench
                tps=`cat /tmp/out-pgbench|grep tps|grep including|sed 's/.* \([0-9]\+\).*/\1/'`
                echo "TPS_INTERMEDIATE= $tps Run $s Clients $c Transfers $t Scaling $sc"
                tot_i=$(($tot_i + $tps))
            done
            tot_i=$(($tot_i/$stats))
            echo "TPS= $tot_i Clients $c Transfers $t Scaling $sc"
        done
    done
done
tuning
disk layout
- WAL logs on a different disk or not?
- 4 disk setup
- 1 device in 4 disk RAID10
- higher throughput, mainly interesting for reads (presumably)
- 2 devices in 2 * 2 disk RAID1
- WAL/OS on a separate disk, WAL on own partition
- OS/other software doesn't do a lot of IO (or let's hope it's not all fsync'ed)
- write-through for WAL device
- small sized partition for WAL is fine
- http://www.postgresql.org/docs/8.2/interactive/wal-configuration.html
- current production system (postgres 8.1) has 7 of these files (as in documentation)
- no journalling on WAL partition
- mount with noatime and symlink to pg_xlog (a sketch follows this list)
- DB on the other device
- write-back on
- XFS
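A rough sketch of the WAL-on-its-own-partition setup described above; /dev/sdc1 and /wal are made-up names, and /var/lib/pgsql/data is assumed to be the data directory:
mkfs.ext2 /dev/sdc1                                  # ext2: no journalling on the WAL partition
mkdir -p /wal
mount -o noatime /dev/sdc1 /wal
echo "/dev/sdc1 /wal ext2 defaults,noatime 0 0" >> /etc/fstab
/etc/init.d/postgresql stop                          # pg_xlog can only be moved while postgres is down
mv /var/lib/pgsql/data/pg_xlog /wal/pg_xlog
chown -R postgres.postgres /wal/pg_xlog
ln -s /wal/pg_xlog /var/lib/pgsql/data/pg_xlog
/etc/init.d/postgresql start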
disk benchmarks
- small script to test disk performance with iozone
#!/bin/bash
## size in MB
size_step=5
size_steps_start=2
size_steps_nr=2
## max number of files, step of 1
steps=2
dest="/var/lib/pgsql/test /var/tmp/test"
out=iozone-out-`date +%s`
for d in $dest
do
    for j in `seq $size_steps_start $(($size_steps_start+$size_steps_nr))`
    do
        size=$(($j*$size_step))
        ioz="iozone -eM -s${size}M -r64k -i0 -i1 "
        for i in `seq 1 $steps`
        do
            files=""
            rm -Rf $d >& /dev/null
            mkdir -p $d >& /dev/null
            for t in `seq 1 $i`
            do
                files="$files $d/out$t"
            done
            echo "BEGIN" >> $out
            echo `date` >> $out
            echo "$ioz -t $i -F $files" >> $out
            $ioz -t $i -F $files >> $out 2>&1
            echo `date` >> $out
            echo "END" >> $out
        done
        rm -Rf $d >& /dev/null
    done
done
- on Solaris, replace the seq invocations with
perl -e "\$,=' '; print 1 .. $i;"
links
- http://edoceo.com/liber/db-postgresql-performance
- http://www.postgresql.org/docs/8.2/interactive/kernel-resources.html
- http://www.powerpostgresql.com/PerfList
- http://www.gmod.org/wiki/index.php/PostgreSQL_Performance_Tips
postgres DBs
dump
pg_dump
- The dump is compressed and can be used with pg_restore (e.g. to restore only the data).
pg_dump -F c -f <destinationfile> <name of db>
pg_dumpall
- brute force
- selective restore is not (easily) possible
- the gzip part is for compression
pg_dumpall | gzip > <destination file>
recover
pg_restore
- example 1: only restore data
- assumes the schema is there through some other means
- name of the DB is extracted from the dump file
- needs output from pg_dump with -F c
pg_restore -a -F c -d <name of DB> <name of file>
from pg_dumpall
- needs minimal postgres setup
- best to stop postgresql server, move /var/lib/pgsql/data and reinitialise
- as user postgres
su - postgres
zcat <path to gzipped full dump> | psql postgres
overwrite current DBs
db=test1
pg_dump $db > ${db}-new
psql $db < /path/to/$db
dropping current DBs
- this can cause problems; if so, revert to the previous state using the ${db}-new dumps
db=test1
pg_dump $db > ${db}-new
dropdb $db
createdb -T template0 $db
psql $db < /path/to/$db
Upgrade dcache 1.7 to 1.8 on same machine for quattor users
- trigger a stop for ncm-dcache
touch /opt/d-cache/DONT_RUN_NCM_DCACHE
- stop dcache/pnfs
/etc/init.d/dcache-pool stop
/etc/init.d/dcache-core stop
/etc/init.d/pnfs stop
- make a backup
- first step of procedure below
- make the upgrade
- build and notify the new configuration
- ncm-dcache (version > 3.0.0-11) won't do anything.
- do all remaining steps below.
- before rerunning ncm-dcache, remove the file /opt/d-cache/DONT_RUN_NCM_DCACHE
Move to new server using quattor/ncm-postgresql/ncm-dcache
- Do a full dump on the original host (compressed for space savings):
mkdir -p /var/post_backup/
chown postgres.postgres /var/post_backup/
dest=/var/post_backup/full_`date '+%s'`
tar cvf ${dest}-pnfsdb.tar /opt/pnfsdb
su - postgres -c "pg_dumpall | gzip > $dest.gz"
- On the new machine:
- setup exact same config (so don't add new pnfs-dbs yet!)
- stop dcache, pnfs and postgresql
- remove postgres DBs
cd /var/lib/pgsql
mv data data-unused-`date +%s`
- reinitialise DBs
- postgres < 8.2
/etc/init.d/postgresql start
- postgres >= 8.2
/etc/init.d/postgresql initdb
/etc/init.d/postgresql start
- copy postgres DB to machine and inject real data
su - postgres
zcat <full_...gz> | psql postgres
- if this is also a major upgrade
- major means eg from 1.7.0-X to 1.8.0-Y (unless release notes say otherwise)
- drop non-pnfs DBs
- become postgres user
su - postgres
- list all DBs
psql -l
- look for dcache DBs that are not pnfs DBs
- typical examples are dcache, billing and replicas
- the companion DB should be left alone
- pnfs DBs are admin, data1 and whatever you added yourself
- base postgres DBs (should be left alone!): postgres, template0, template1
- you can always redo this if you still have the postgres dump
- drop the dcache non-pnfs DBs
for i in dcache billing replicas
do
    dropdb $i
done
- run ncm-postgres (as root), it will recreate the required dcache DBs
ncm-ncd --co postgresql
- restore the pnfsdb files (as root)
cd /
mv /opt/pnfsdb /opt/pnfsdb-orig-`date +%s`
tar xvf <path to pnfsdb.tar file>
- you should be able to start pnfs and list the content of the directories again
- you can now also rerun ncm-dcache
problems
Staging forever
- Try to locate the pools of the file
- is the pool still up
- is the file still in pool/data
- can you copy it
- LAST MEASURE: restart dcache-pool service
- in PoolManager (via the admin interface; see the note after this list)
- rc ls <pnfsid>.*
- what state is it in
- something with Pool2Pool
- wait, it's transferring
- if it stays there, go to the destination pool and try to kill the p2p transfer
- if that fails, try rc destroy <the whole pnfsid identifier from rc ls>, e.g. rc destroy 000C00000000000000BEEFE0@0.0.0.0/0.0.0.0-*/*
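The PoolManager and rc commands above (and the RemoteGsiftpTransferManager commands further down) are issued in the dCache admin interface. A rough connection sketch, assuming the dCache 1.x defaults (ssh protocol 1, blowfish cipher, port 22223) on the admin/head node:
ssh -c blowfish -p 22223 -l admin <admin node>
cd PoolManager
rc ls <pnfsid>.*
..
logoff
The .. command leaves the current cell before entering another one; logoff closes the session.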
TransferManager error too many transfers!
02/13 14:40:48 Cell(SRM-maite@srm-maiteDomain) : org.dcache.srm.scheduler.NonFatalJobFailure: org.dcache.srm.SRMException: TransferManager errortoo many transfers!
cd RemoteGsiftpTransferManager
set max transfers <#max transfers>
99 Repository got lost
When the pool is gone, it can be due to a lot of things.
- XFS problem
- check dmesg or /var/log/messages for error messages like
XFS internal error XFS_WANT_CORRUPTED_GOTO at line 1561 of file /usr/src/redhat/BUILD/xfs-kmod-0.4/_kmod_build_/xfs_alloc.c. Caller 0xffffffff88251392
xfs_force_shutdown(sdb1,0x8) called from line 4267 of file /usr/src/redhat/BUILD/xfs-kmod-0.4/_kmod_build_/xfs_bmap.c. Return address = 0xffffffff8825e2d0
Filesystem "sdb1": Corruption of in-memory data detected. Shutting down filesystem: sdb1
Please umount the filesystem, and rectify the problem(s)
- if that's the case, do the following (in this order)
- stop dcache on the node
/etc/init.d/dcache-pool stop
/etc/init.d/dcache-core stop
- unmount the storage
umount /storage/1
- try to remount it
mount /storage/1
- check dmesg to see whether it was successful; the last lines should read something like
XFS mounting filesystem sdb1
Ending clean XFS mount for filesystem: sdb1
- if not, run xfs_repair
xfs_repair /dev/sdb1
- if it was successful, remount the storage and check dmesg to see whether the mount was successful
- restart dcache on the node
/etc/init.d/dcache-pool start
/etc/init.d/dcache-core start
- if not, it can end with eg
ERROR: The filesystem has valuable metadata changes in a log which needs to be replayed. Mount the filesystem to replay the log, and unmount it before re-running xfs_repair. If you are unable to mount the filesystem, then use the -L option to destroy the log and attempt a repair. Note that destroying the log may cause corruption -- please attempt a mount of the filesystem before doing this.
- then rerun xfs_repair with the -L option (you already retried mounting a few steps ago)
xfs_repair -L /dev/sdb1
- after it's done, remount and check dmesg for a successful mount
- restart dcache
/etc/init.d/dcache-pool start
/etc/init.d/dcache-core start
Inner workings of the PSU
From an email by Patrick Fuhrmann:
Ok, here we go. The procedure to find all matching pools: the request comes with a precise (StorageClass, dCacheClass, IPnumber). In the future there will be the protocol as well. Now we have to find the corresponding unit:
- Store Unit: we try to find an exact match of the incoming StorageClass in the list of storage classes (unit -store). If not found, we try to find some combinations of wildcards (*@osm, *.*).
- dCache unit: same as the Store Unit, but not used by anybody yet.
- Network Unit: the incoming IP number is compared with all net units. We start with the 255.255.255.255 mask and proceed with smaller masks until we reach 0.0.0.0 or we find a match (like IP routers do). If we find a match, we stop. As said in the previous e-mail, if 131.169.1.1 comes in and there are two net-units 0.0.0.0/0.0.0.0 and 131.169.0.0/255.255.0.0, we choose 131.169.0.0/255.255.0.0 (at this point we don't know anything about links yet). If there is no 131.1.. net-unit, the result of this step is the 0.0.0.0/0.0.0.0 net-unit.
Now we try to find ALL links which require a Store-Unit and a Net-Unit AND match the units found in the above step. Next we add those links which only require a Store-Unit or a Net-Unit and match the unit found above. (As a matter of fact there is an additional criterium, the transfer direction, but this doesn't help us to understand the mechanism.)
Now we take all links found above and sort them according to the preference for this transfer direction. For big setups this results in a matrix of links:
pref 100 : link1 link2 link3
pref 50 : link4 link5
pref ...
Now we resolve the links into pools (via pool groups). The result is:
pref 100 : pool1, pool2 .....
.....
If at least one pool of the highest preference is available, we send the resulting set of pools (of the highest preference) to the CM; the rest is clear. If none of the highest preference is available, we step to the next preference (a.s.o.).
This mechanism allows one to construct all possible setups. As said before: the treatment of the net-unit seems strange, but is really necessary if you think about it for a while.
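As an illustration of the units, links and preferences described above, a rough PoolManager.conf sketch with made-up names (syntax from memory for dCache 1.7/1.8; verify the exact commands and options against the dCache Book for your version):
psu create unit -store cms:data@osm
psu create unit -net 131.169.0.0/255.255.0.0
psu create unit -net 0.0.0.0/0.0.0.0
psu create ugroup cms-store
psu addto ugroup cms-store cms:data@osm
psu create ugroup local-net
psu addto ugroup local-net 131.169.0.0/255.255.0.0
psu create pool pool1
psu create pool pool2
psu create pgroup cms-pools
psu addto pgroup cms-pools pool1
psu addto pgroup cms-pools pool2
psu create link cms-local-link cms-store local-net
psu set link cms-local-link -readpref=100 -writepref=100 -cachepref=100
psu add link cms-local-link cms-pools
The link only matches when both the store unit and the net unit match; the read/write/cache preferences are what drives the sorting into the preference matrix described in the mail.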
Transfer tests
See all remote transfers:
cd RemoteGsiftpTransferManager ls
Problems and Solutions
While running ExDigiStatistics with some PU data, the system is really put to the test.
ulimit -n
Some of the jobs gave the following error message:
Server error message for [35]: "Unexpected Exception : java.net.SocketException: Too many open files" (errno 33). Failed open file in the dCache.
Apparently this is due to a restricted number of file descriptors. You can find out the current limit with
ulimit -n
A tip from Michael Ernst: (ulimit -n)/(# of active movers) = 5 or more (now it's 16 ;). The best thing is to put this in the dcache-pool script and then restart the pool.
mv /opt/d-cache/bin/dcache-pool /opt/d-cache/bin/dcache-pool-orig
cat /opt/d-cache/bin/dcache-pool-orig | sed 's/start)/start)\n\tulimit -n 16384/' > /opt/d-cache/bin/dcache-pool
chmod +x /opt/d-cache/bin/dcache-pool
/etc/init.d/dcache-pool restart
Detected Tx Unit Hang
One of the pools got the following kernel message:
e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
This resulted in a redetection of the network with the wrong network speed. Obviously, the adapter was a bit overloaded. A network restart fixed the problem.
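On an SL3-style node managed by initscripts that boils down to:
/etc/init.d/network restart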
Channel bonding
Using 2 e1000 PCI adapters, the maximum packet rate rose from 70k to 90k. Obviously the hardware was a bottleneck (3 adapters on PCI, a single P4), but it is still an improvement nonetheless. The setup was pretty basic and easy:
- following the setup from <kernel>/Documentation/networking/bonding.txt
- SL3 supports bonding config with initscripts. It's pretty easy and straightforward (a minimal example follows this list).
- don't forget to recompile ifenslave when using non-standard kernel.
- I tried bonding mode alb (which needs no special switch settings), but it requires the network driver to support set_dev_mac_address (not the case for the r8169!!); it also needs the miimon setting.
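A minimal sketch of the initscripts-based bonding config from bonding.txt; the address and the mode are made-up examples, and on SL3 (2.4 kernel) the module options go in /etc/modules.conf (on 2.6: /etc/modprobe.conf):
# /etc/modules.conf
alias bond0 bonding
# mode 6 = balance-alb; miimon in ms
options bond0 mode=balance-alb miimon=100

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=192.168.1.10
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-eth0 (idem for ifcfg-eth1 with DEVICE=eth1)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
Afterwards restart the network as in the previous section and check /proc/net/bonding/bond0 for the slave status.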
--
Links
- IIHE DCache http://maite.iihe.ac.be:2288/