DCache

From T2B Wiki

monitoring

Chimera commands

Find all files that are supposed to be on a certain pool

psql --no-align --tuples-only --username postgres --dbname chimera \
  --command "select inode2path(ipnfsid) from t_inodes t1, t_locationinfo t2 where t1.inumber=t2.inumber and t2.ilocation='behar081_5';" \
  --output list-of-files-that-should-be-on-behar081_5-according-to-Chimera
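
To run the query above for other pools, the SQL can be generated from the pool name. A minimal sketch; the helper name chimera_pool_sql is made up for illustration, and the pool name is pasted into the SQL unescaped, so only use it with trusted input:

```shell
#!/bin/bash
# Hypothetical helper (not part of Chimera): print the query from above
# for the pool name given as first argument.
chimera_pool_sql() {
  printf "select inode2path(ipnfsid) from t_inodes t1, t_locationinfo t2 where t1.inumber=t2.inumber and t2.ilocation='%s';\n" "$1"
}

# Usage (needs access to the chimera DB, so left commented out):
# psql --no-align --tuples-only --username postgres --dbname chimera \
#   --command "$(chimera_pool_sql behar081_5)" --output files-on-behar081_5
```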

Many more Chimera shortcuts can be found in a talk by Dmitry Litvintsev.

postgres tuning

benchmark

su - postgres
  • initialise the benchmark db
    • -s sets the scaling factor; it should be at least as large as the maximum number of active clients
createdb testdb
pgbench -i testdb -s 25
  • run the benchmark
pgbench -c 10 -t 100 testdb
#!/bin/bash

## number of clients
clients="1 10 25 50 75 100"
#clients="1 2"
## number of transactions
trans="100 500 1000 2000 5000 10000 50000"
#trans="100 200"
## scaling factor used for initialisation
scaling_fact="1 5 10 15 20"
#scaling_fact="1 2"

## number of runs per combination (TPS is averaged over these)
stats=100
#stats=5

for sc in $scaling_fact
do
 dropdb testdb >& /dev/null
 createdb testdb >& /dev/null
 pgbench -i testdb -s $sc >& /dev/null
 for c in $clients
 do
   for t in $trans
   do
     tot_i=0
     for s in $(seq 1 $stats)
     do
       pgbench -c $c -t $t testdb >& /tmp/out-pgbench
       echo "pgbench -c $c -t $t testdb" >> /tmp/out-pgbench
       tps=$(grep tps /tmp/out-pgbench | grep including | sed 's/.* \([0-9]\+\).*/\1/')
       echo "TPS_INTERMEDIATE= $tps Run $s Clients $c Transfers $t Scaling $sc"
       tot_i=$(($tot_i + $tps))
     done
     tot_i=$(($tot_i/$stats))
     echo "TPS= $tot_i Clients $c Transfers $t Scaling $sc"
   done
 done
done
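
The loop prints one "TPS= ..." summary line per combination. A minimal sketch for picking the best combination out of a saved log, assuming the script output was redirected to a file; the function name best_tps and the file names in the usage comment are illustrative only:

```shell
#!/bin/bash
# Read the benchmark output on stdin and print the summary line with the
# highest averaged TPS. Only lines whose first field is exactly "TPS=" are
# considered, so the TPS_INTERMEDIATE lines are skipped.
best_tps() {
  awk '$1 == "TPS=" && ($2 + 0 > max || !seen) { max = $2 + 0; best = $0; seen = 1 }
       END { if (seen) print best }'
}

# Usage (hypothetical file names):
# ./pgbench-loop.sh > bench.log
# best_tps < bench.log
```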


tuning

disk layout

  • WAL logs on a different disk or not?
    • 4 disk setup
    • 1 device in 4 disk RAID10
      • higher throughput, interesting for reading (i guess)
    • 2 devices in 2 * 2 disk RAID1
      • WAL/OS on a separate disk, WAL on its own partition
        • OS/other software doesn't do a lot of IO (or let's hope it's not all fsync'ed)
        • write-through for WAL device
        • small sized partition for WAL is fine
        • no journalling on WAL partition
        • mount with noatime and symlink to pg_xlog
      • DB on another
        • write-back on
        • XFS

disk benchmarks

  • small script to test iozone performance.
#!/bin/bash

## size in MB
size_step=5
size_steps_start=2
size_steps_nr=2

## max number of files, step of 1
steps=2
dest="/var/lib/pgsql/test /var/tmp/test"

out=iozone-out-$(date +%s)

for d in $dest
do
  for j in $(seq $size_steps_start $(($size_steps_start+$size_steps_nr)))
  do
    size=$(($j*$size_step))
    ioz="iozone -eM -s${size}M -r64k -i0 -i1 "

    for i in $(seq 1 $steps)
    do
      files=""
      rm -Rf $d >& /dev/null
      mkdir -p $d >& /dev/null
      for t in $(seq 1 $i)
      do
        files="$files $d/out$t"
      done
      echo "BEGIN" >> $out
      date >> $out
      echo "$ioz -t $i -F $files" >> $out
      $ioz -t $i -F $files >> $out 2>&1
      date >> $out
      echo "END" >> $out
    done
    rm -Rf $d >& /dev/null
  done
done

  • on solaris, replace seq sequences with
perl -e "\$,=' '; print 1 .. $i;"
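
If neither seq nor perl is at hand, the same sequence can be produced in the shell itself. A minimal sketch; the function name count_to is made up for illustration:

```shell
#!/bin/bash
# count_to FIRST LAST
# Print the integers FIRST..LAST (inclusive), space-separated on one line.
# Prints an empty line if FIRST is larger than LAST.
count_to() {
  local n=$1 acc=""
  while [ "$n" -le "$2" ]; do
    acc="$acc${acc:+ }$n"   # add a separating space except before the first number
    n=$((n + 1))
  done
  echo "$acc"
}

# Usage inside the iozone script:
# for t in $(count_to 1 $i)
```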

postgres DBs

dump

pg_dump

  • This is compressed and allows usage of pg_restore (eg to only restore the data).
pg_dump -F c -f <destinationfile> <name of db>

pg_dumpall

  • brute force
  • no selective restore (easily) possible
  • the gzip part is for compression
pg_dumpall | gzip > <destination file>

recover

pg_restore

    • example 1: only restore data
    • assumes the schema is there through some other means
    • name of DB is extracted from filedump
    • needs output from pg_dump with -F c
pg_restore -a -F c -d <name of DB> <name of file>

from pg_dumpall

  • needs minimal postgres setup
    • best to stop postgresql server, move /var/lib/pgsql/data and reinitialise
  • as user postgres
su - postgres
zcat <path to gzipped full dump> | psql postgres

overwrite current DBs

db=test1
pg_dump $db > ${db}-new
psql $db < /path/to/$db

dropping current DBs

  • this can cause problems; if it does, revert to the previous state using the -new dump.
db=test1
pg_dump $db > ${db}-new
dropdb $db
createdb -T template0 $db
psql $db < /path/to/$db

Upgrade dcache 1.7 to 1.8 on same machine for quattor users

  • trigger a stop for ncm-dcache
touch /opt/d-cache/DONT_RUN_NCM_DCACHE
  • stop dcache/pnfs
/etc/init.d/dcache-pool stop
/etc/init.d/dcache-core stop
/etc/init.d/pnfs stop
  • make a backup
    • first step of procedure below
  • make the upgrade
    • build and notify the new configuration
    • ncm-dcache (version > 3.0.0-11) won't do anything.
  • do all remaining steps below.
  • before rerunning ncm-dcache, remove the file /opt/d-cache/DONT_RUN_NCM_DCACHE

Move to new server using quattor/ncm-postgresql/ncm-dcache

  • Do a full dump on the original host (compressed for space savings):
mkdir -p /var/post_backup/
chown postgres:postgres /var/post_backup/
dest=/var/post_backup/full_$(date '+%s')
tar cvf ${dest}-pnfsdb.tar /opt/pnfsdb
su - postgres -c "pg_dumpall | gzip > $dest.gz"
  • On the new machine:
    • setup exact same config (so don't add new pnfs-dbs yet!)
    • stop dcache, pnfs and postgresql
    • remove postgres DBs
cd /var/lib/pgsql
mv data data-unused-$(date +%s)
    • reinitialise DBs
    • postgres < 8.2
/etc/init.d/postgresql start
    • postgres >= 8.2
/etc/init.d/postgresql initdb
/etc/init.d/postgresql start
    • copy postgres DB to machine and inject real data
su - postgres
zcat <full_...gz> | psql postgres
    • if this is also a major upgrade
    • major means eg from 1.7.0-X to 1.8.0-Y (unless release notes say otherwise)
    • drop non-pnfs DBs
      • become postgres user
su - postgres
      • list all DBs
psql -l
      • look for dcache DBs that are not pnfs DBs
      • typical examples are dcache,billing and replicas
      • the companion DB should be left alone
      • pnfs DBs are admin,data1, whatever added yourself
      • base postgres DBs (should be left alone!) postgres,template0,template1
      • you can always redo this if you still have the postgres dump
      • drop the dcache non-pnfs DBs
for i in dcache billing replicas
do
  dropdb $i
done
    • run ncm-postgresql (as root); it will recreate the required dcache DBs
ncm-ncd --co postgresql
    • restore the pnfsdb files (as root)
cd /
mv /opt/pnfsdb /opt/pnfsdb-orig-$(date +%s)
tar xvf <path to pnfsdb.tar file>
    • you should be able to start pnfs and list the content of the directories again
    • you can now also rerun ncm-dcache

problems

Staging forever

  • Try to locate the pools of the file
    • is the pool still up
    • is the file still in pool/data
    • can you copy it
    • LAST MEASURE: restart dcache-pool service
  • in PoolManager
    • rc ls <pnfsid>.*
    • what state is it in
      • something with Pool2Pool
      • wait, it's transferring
      • if it stays there, go to the destination pool and try to kill the p2p transfer
      • if that fails, try rc destroy <the whole pnfsid identifier in rc ls>, e.g. rc destroy 000C00000000000000BEEFE0@0.0.0.0/0.0.0.0-*/*


TransferManager error too many transfers!

  02/13 14:40:48 Cell(SRM-maite@srm-maiteDomain) : org.dcache.srm.scheduler.NonFatalJobFailure: org.dcache.srm.SRMException: TransferManager errortoo many transfers!
  cd RemoteGsiftpTransferManager
  set max transfers <#max transfers>

99 Repository got lost

When the pool is gone, it can be due to a lot of things.

  • XFS problem
    • check dmesg or /var/log/messages for error messages like
XFS internal error XFS_WANT_CORRUPTED_GOTO at line 1561 of file /usr/src/redhat/BUILD/xfs-kmod-0.4/_kmod_build_/xfs_alloc.c.  Caller 0xffffffff88251392
xfs_force_shutdown(sdb1,0x8) called from line 4267 of file /usr/src/redhat/BUILD/xfs-kmod-0.4/_kmod_build_/xfs_bmap.c.  Return address = 0xffffffff8825e2d0
Filesystem "sdb1": Corruption of in-memory data detected.  Shutting down filesystem: sdb1
Please umount the filesystem, and rectify the problem(s)
    • if it's the case do the following (and in this order)
    • stop dcache on the node
/etc/init.d/dcache-pool stop
/etc/init.d/dcache-core stop
    • unmount the storage
umount /storage/1
    • try to remount it
mount /storage/1
    • check dmesg to see if it was successful. The last lines should read something like
XFS mounting filesystem sdb1
Ending clean XFS mount for filesystem: sdb1
    • if not, run xfs_repair
xfs_repair /dev/sdb1
      • if it was successful, remount the storage and check dmesg to see if the mount was successful
      • restart dcache on the node
/etc/init.d/dcache-pool start
/etc/init.d/dcache-core start
      • if not, it can end with eg
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
      • then rerun xfs_repair with the -L option (you already retried mounting a few steps ago)
xfs_repair -L /dev/sdb1
      • after it's done, remount and check dmesg for a successful mount
      • restart dcache
/etc/init.d/dcache-pool start
/etc/init.d/dcache-core start
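
The first check (looking for an XFS shutdown in the kernel messages) can be scripted. A minimal sketch; the function name is illustrative and the pattern is based only on the sample message shown above, so other kernel versions may format the line differently:

```shell
#!/bin/bash
# Read kernel log text on stdin and print each device for which
# xfs_force_shutdown was called, one per line, without duplicates.
xfs_shutdown_devices() {
  sed -n 's/.*xfs_force_shutdown(\([^,)]*\),.*/\1/p' | sort -u
}

# Usage:
# dmesg | xfs_shutdown_devices
# xfs_shutdown_devices < /var/log/messages
```

No output means no shutdown line was found, which does not by itself prove the filesystem is healthy.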


Inner working of the PSU

From an email by Patrick Fuhrmann:

  Ok here we go :
  The procedure to find all matching pools :
  The request comes with a precise
    (StorageClass,dCacheClass,IPnumber)
  In the future there will be the protocol as well.
  Now we have to find the corresponding unit :
    Store Unit :
      We try to find an exact match of the incoming 
      StorageClass in the list of storage classes (unit -store).
      If not found we try to find some combinations of
      wildcards ( *@osm *.* ).
    dCache unit :
      Same as Store Unit but not used by anybody yet.
    Network Unit :
      The incoming IP number is compared with all
      net units. We start with the 255.255.255.255 and
      proceed with smaller masks until we reach the
      0.0.0.0 or we find a match. (like ip routers do).
      If we find a match we stop. As said in the previous e-mail,
      if 131.169.1.1 comes in and there are two net-units
      0.0.0.0/0.0.0.0 and 131.169.0.0/255.255.0.0 we choose
      131.169.0.0/255.255.0.0 (At this point we don't know
      anything about links yet). If there is no 131.1.. net-unit
      the result of this step is the 0.0.0.0/0.0.0.0 net-unit.

  Now we try to find ALL links which require a Store-Unit
  and a NetUnit AND match the units found in the above step.
  Next we add those links which only require Store-Unit or
  Net-Unit and match the unit found above. (As a matter of fact
  the is this additional criterium, which is the transfer 
  direction but this don't help us to understand the mechasm)
  Now we take all links found above and sort according to the
  preference of this transfer direction. For big setups this
  results in a matrix of links
    pref 100 : link1 link2 link3
    pref  50 : link4 link5
    pref ....
  Now we resolve the links into pools (via pool groups)
  Result is 
   pref100  : pool1, pool2 .....
   .....
  If at least one pool of the highest preference is available
  we send the resulting set of pools (of highest preference
  to the CM) rest is clear
  If non of highest preference is available we step to the next
  preference (a.s.o)
  This mechanism allows to construct all possible setups.
  As said before : The treatment of the net-unit seems
  strange but is really necessary if you think about it
  for awhile.
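
The net-unit step described in the email amounts to picking the matching net-unit with the most specific mask. A minimal sketch of that single step, assuming net-units are written as network/netmask in dotted form as in the email; the function names are made up here and this is not dCache code (links, preferences and pool groups are not modelled):

```shell
#!/bin/bash
# Convert a dotted-quad IPv4 address to an integer.
# The body runs in a subshell, so the IFS change stays local.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# select_net_unit IP UNIT...
# Print the matching net-unit with the largest netmask, which is the unit
# that is reached first when starting at 255.255.255.255 and moving to
# smaller masks. Prints nothing if no unit matches.
select_net_unit() (
  ip=$(ip_to_int "$1"); shift
  best=""; best_mask=-1
  for unit in "$@"; do
    net=$(ip_to_int "${unit%/*}")
    mask=$(ip_to_int "${unit#*/}")
    if [ $((ip & mask)) -eq $((net & mask)) ] && [ "$mask" -gt "$best_mask" ]; then
      best=$unit; best_mask=$mask
    fi
  done
  if [ -n "$best" ]; then echo "$best"; fi
)

# Example from the email:
# select_net_unit 131.169.1.1 0.0.0.0/0.0.0.0 131.169.0.0/255.255.0.0
```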

Transfer tests

See all remote transfers:

cd RemoteGsiftpTransferManager
ls

Problems and Solutions

While running ExDigiStatistics with some PU data, the system is really tested.

ulimit -n

Some of the jobs gave the following error message:

 Server error message for [35]: "Unexpected Exception :
 java.net.SocketException: Too many open files" (errno 33).
 Failed open file in the dCache.

Apparently this is due to a restricted number of file descriptors. You can check the current limit with

 ulimit -n

A tip from Michael Ernst: (ulimit -n)/(# of active movers) = 5 or more (now it's 16 ;). The best thing is to put this in the dcache-pool script and then restart the pool.

  mv /opt/d-cache/bin/dcache-pool /opt/d-cache/bin/dcache-pool-orig
  sed 's/start)/start)\n\tulimit -n 16384/' /opt/d-cache/bin/dcache-pool-orig > /opt/d-cache/bin/dcache-pool
  chmod +x /opt/d-cache/bin/dcache-pool
  /etc/init.d/dcache-pool restart
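
The rule of thumb is plain arithmetic. A small sketch for working out the descriptor limit needed for a given number of movers; the function name and the mover counts in the examples are illustrative:

```shell
#!/bin/bash
# required_nofile MOVERS [RATIO]
# Print the smallest "ulimit -n" value for which
# (ulimit -n) / (number of active movers) >= RATIO (default 5).
required_nofile() {
  echo $(( $1 * ${2:-5} ))
}

# Examples:
# required_nofile 1000       (ratio 5)
# required_nofile 1024 16    (ratio 16)
```

With a ratio of 16, a limit of 16384 corresponds to 1024 active movers.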

Detected Tx Unit Hang

One of the pools got the following kernel message:

 e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang

This resulted in a redetection of the network with the wrong network speed. Obviously, the adapter was a bit overloaded. A network restart fixed the problem.

Channel bonding

Using 2 e1000 pci adapters, the maximum number of packets rose from 70k to 90k. Obviously the hardware was a bottleneck (3 adapters on pci, single P4), but still some result nonetheless. The setup was pretty basic and easy:

  • following the setup from <kernel>/Documentation/networking/bonding.txt
  • SL3 supports bonding config with initscripts. It's pretty easy and straightforward.
  • don't forget to recompile ifenslave when using non-standard kernel.
  • I tried bonding mode alb (which needs no special switch settings), but it requires the network driver to support set_dev_mac_address (not available on the r8169!!); it also needs the miimon setting.
