DCache
monitoring
- SRM monitor http://maite.iihe.ac.be:8098/billing/xml/
- dcache head page http://maite.iihe.ac.be:2288
- SRM manual https://sdm.lbl.gov/srm-wg/doc/SRM.v2.2.html
postgres tuning
benchmark
- http://edoceo.com/liber/db-postgresql-benchmark
- switch to postgres user
su - postgres
- initialise the benchmark db
- -s sets the scaling factor; it should be at least as large as the maximum number of clients you plan to test with
createdb testdb
pgbench -i testdb -s 25
- run the benchmark
pgbench -c 10 -t 100 testdb
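The number to watch in the pgbench output is the tps line; it looks roughly like this (numbers made up), and the script below averages the "including" value:
tps = 85.184871 (including connections establishing)
tps = 86.587859 (excluding connections establishing)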
#!/bin/bash
## number of clients
clients="1 10 25 50 75 100"
#clients="1 2"
## number of transactions
trans="100 500 1000 2000 5000 10000 50000"
#trans="100 200"
## scaling factor used for initialisation
scaling_fact="1 5 10 15 20"
#scaling_fact="1 2"
## number of retries
stats=100
#stats=5
for sc in $scaling_fact
do
    dropdb testdb >& /dev/null
    createdb testdb >& /dev/null
    pgbench -i testdb -s $sc >& /dev/null
    for c in $clients
    do
        for t in $trans
        do
            tot_i=0
            for s in `seq 1 $stats`
            do
                pgbench -c $c -t $t testdb >& /tmp/out-pgbench
                echo "pgbench -c $c -t $t testdb" >> /tmp/out-pgbench
                tps=`cat /tmp/out-pgbench|grep tps|grep including|sed 's/.* \([0-9]\+\).*/\1/'`
                echo "TPS_INTERMEDIATE= $tps Run $s Clients $c Transfers $t Scaling $sc"
                tot_i=$(($tot_i + $tps))
            done
            tot_i=$(($tot_i/$stats))
            echo "TPS= $tot_i Clients $c Transfers $t Scaling $sc"
        done
    done
done
tuning
disk layout
- WAL logs on a different disk or not?
- 4 disk setup
- 1 device in 4 disk RAID10
- higher throughput, mainly interesting for reads (presumably)
- 2 devices in 2 * 2 disk RAID1
- WAL/OS on a separate disk, WAL on own partition
- OS/other software doesn't do a lot of IO (or let's hope it's not all fsync'ed)
- write-through for WAL device
- small sized partition for WAL is fine
- http://www.postgresql.org/docs/8.2/interactive/wal-configuration.html
- current production system (postgres 8.1) has 7 of these files (as in documentation)
- no journalling on WAL partition
- mount with noatime and symlink to pg_xlog (a sketch follows this list)
- DB on the other device
- write-back on
- XFS
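A rough sketch of the WAL-on-its-own-partition setup described above; /dev/sdc1 and /wal are made-up names, and /var/lib/pgsql/data is assumed to be the data directory:
mkfs.ext2 /dev/sdc1                                  # ext2: no journalling on the WAL partition
mkdir -p /wal
mount -o noatime /dev/sdc1 /wal
echo "/dev/sdc1 /wal ext2 defaults,noatime 0 0" >> /etc/fstab
/etc/init.d/postgresql stop                          # pg_xlog can only be moved while postgres is down
mv /var/lib/pgsql/data/pg_xlog /wal/pg_xlog
chown -R postgres.postgres /wal/pg_xlog
ln -s /wal/pg_xlog /var/lib/pgsql/data/pg_xlog
/etc/init.d/postgresql start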
disk benchmarks
- small script to test disk performance with iozone
#!/bin/bash
## size in MB
size_step=5
size_steps_start=2
size_steps_nr=2
## max number of files, step of 1
steps=2
dest="/var/lib/pgsql/test /var/tmp/test"
out=iozone-out-`date +%s`
for d in $dest
do
    for j in `seq $size_steps_start $(($size_steps_start+$size_steps_nr))`
    do
        size=$(($j*$size_step))
        ioz="iozone -eM -s${size}M -r64k -i0 -i1 "
        for i in `seq 1 $steps`
        do
            files=""
            rm -Rf $d >& /dev/null
            mkdir -p $d >& /dev/null
            for t in `seq 1 $i`
            do
                files="$files $d/out$t"
            done
            echo "BEGIN" >> $out
            echo `date` >> $out
            echo "$ioz -t $i -F $files" >> $out
            $ioz -t $i -F $files >> $out 2>&1
            echo `date` >> $out
            echo "END" >> $out
        done
        rm -Rf $d >& /dev/null
    done
done
- on Solaris, replace the seq invocations with
perl -e "\$,=' '; print 1 .. $i;"
links
- http://edoceo.com/liber/db-postgresql-performance
- http://www.postgresql.org/docs/8.2/interactive/kernel-resources.html
- http://www.powerpostgresql.com/PerfList
- http://www.gmod.org/wiki/index.php/PostgreSQL_Performance_Tips
postgres DBs
dump
pg_dump
- The dump is compressed and can be used with pg_restore (e.g. to restore only the data).
pg_dump -F c -f <destinationfile> <name of db>
pg_dumpall
- brute force
- selective restore is not (easily) possible
- the gzip part is for compression
pg_dumpall | gzip > <destination file>
recover
pg_restore
- example 1: only restore data
- assumes the schema is there through some other means
- name of the DB is extracted from the dump file
- needs output from pg_dump with -F c
pg_restore -a -F c -d <name of DB> <name of file>
from pg_dumpall
- needs minimal postgres setup
- best to stop postgresql server, move /var/lib/pgsql/data and reinitialise
- as user postgres
su - postgres
zcat <path to gzipped full dump> | psql postgres
overwrite current DBs
db=test1
pg_dump $db > ${db}-new
psql $db < /path/to/$db
dropping current DBs
- this can cause problems; if so, revert to the previous state using the ${db}-new dumps
db=test1
pg_dump $db > ${db}-new
dropdb $db
createdb -T template0 $db
psql $db < /path/to/$db
Upgrade dcache 1.7 to 1.8 on same machine for quattor users
- trigger a stop for ncm-dcache
touch /opt/d-cache/DONT_RUN_NCM_DCACHE
- stop dcache/pnfs
/etc/init.d/dcache-pool stop
/etc/init.d/dcache-core stop
/etc/init.d/pnfs stop
- make a backup
- first step of procedure below
- make the upgrade
- build and notify the new configuration
- ncm-dcache (version > 3.0.0-11) won't do anything.
- do all remaining steps below.
- before rerunning ncm-dcache, remove the file /opt/d-cache/DONT_RUN_NCM_DCACHE
Move to new server using quattor/ncm-postgresql/ncm-dcache
- Do a full dump on the original host (compressed for space savings):
mkdir -p /var/post_backup/
chown postgres.postgres /var/post_backup/
dest=/var/post_backup/full_`date '+%s'`
tar cvf ${dest}-pnfsdb.tar /opt/pnfsdb
su - postgres -c "pg_dumpall | gzip > $dest.gz"
- On the new machine:
- setup exact same config (so don't add new pnfs-dbs yet!)
- stop dcache, pnfs and postgresql
- remove postgres DBs
cd /var/lib/pgsql
mv data data-unused-`date +%s`
- reinitialise DBs
- postgres < 8.2
/etc/init.d/postgresql start
- postgres >= 8.2
/etc/init.d/postgresql initdb
/etc/init.d/postgresql start
- copy postgres DB to machine and inject real data
su - postgres
zcat <full_...gz> | psql postgres
- if this is also a major upgrade
- major means eg from 1.7.0-X to 1.8.0-Y (unless release notes say otherwise)
- drop non-pnfs DBs
- become postgres user
su - postgres
- list all DBs
psql -l
- look for dcache DBs that are not pnfs DBs
- typical examples are dcache, billing and replicas
- the companion DB should be left alone
- pnfs DBs are admin, data1 and whatever you added yourself
- base postgres DBs (should be left alone!): postgres, template0, template1
- you can always redo this if you still have the postgres dump
- drop the dcache non-pnfs DBs
for i in dcache billing replicas
do
    dropdb $i
done
- run ncm-postgres (as root), it will recreate the required dcache DBs
ncm-ncd --co postgresql
- restore the pnfsdb files (as root)
cd /
mv /opt/pnfsdb /opt/pnfsdb-orig-`date +%s`
tar xvf <path to pnfsdb.tar file>
- you should be able to start pnfs and list the content of the directories again
- you can now also rerun ncm-dcache
problems
Staging forever
- Try to locate the pools of the file
- is the pool still up
- is the file still in pool/data
- can you copy it
- LAST MEASURE: restart dcache-pool service
- in PoolManager (via the admin interface; see the note after this list)
- rc ls <pnfsid>.*
- what state is it in
- something with Pool2Pool
- wait, it's transferring
- if it stays there, go to the destination pool and try to kill the p2p transfer
- if that fails, try rc destroy <the whole pnfsid identifier from rc ls>, e.g. rc destroy 000C00000000000000BEEFE0@0.0.0.0/0.0.0.0-*/*
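The PoolManager and rc commands above (and the RemoteGsiftpTransferManager commands further down) are issued in the dCache admin interface. A rough connection sketch, assuming the dCache 1.x defaults (ssh protocol 1, blowfish cipher, port 22223) on the admin/head node:
ssh -c blowfish -p 22223 -l admin <admin node>
cd PoolManager
rc ls <pnfsid>.*
..
logoff
The .. command leaves the current cell before entering another one; logoff closes the session.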
TransferManager error too many transfers!
02/13 14:40:48 Cell(SRM-maite@srm-maiteDomain) : org.dcache.srm.scheduler.NonFatalJobFailure: org.dcache.srm.SRMException: TransferManager errortoo many transfers!
cd RemoteGsiftpTransferManager
set max transfers <#max transfers>
99 Repository got lost
When the pool is gone, it can be due to a lot of things.
- XFS problem
- check dmesg or /var/log/messages for error messages like
XFS internal error XFS_WANT_CORRUPTED_GOTO at line 1561 of file /usr/src/redhat/BUILD/xfs-kmod-0.4/_kmod_build_/xfs_alloc.c. Caller 0xffffffff88251392
xfs_force_shutdown(sdb1,0x8) called from line 4267 of file /usr/src/redhat/BUILD/xfs-kmod-0.4/_kmod_build_/xfs_bmap.c. Return address = 0xffffffff8825e2d0
Filesystem "sdb1": Corruption of in-memory data detected. Shutting down filesystem: sdb1
Please umount the filesystem, and rectify the problem(s)
- if that's the case, do the following (in this order)
- stop dcache on the node
/etc/init.d/dcache-pool stop
/etc/init.d/dcache-core stop
- unmount the storage
umount /storage/1
- try to remount it
mount /storage/1
- check dmesg to see whether it was successful; the last lines should read something like
XFS mounting filesystem sdb1
Ending clean XFS mount for filesystem: sdb1
- if not, run xfs_repair
xfs_repair /dev/sdb1
- if it was successful, remount the storage and check dmesg to see whether the mount was successful
- restart dcache on the node
/etc/init.d/dcache-pool start
/etc/init.d/dcache-core start
- if not, it can end with eg
ERROR: The filesystem has valuable metadata changes in a log which needs to be replayed. Mount the filesystem to replay the log, and unmount it before re-running xfs_repair. If you are unable to mount the filesystem, then use the -L option to destroy the log and attempt a repair. Note that destroying the log may cause corruption -- please attempt a mount of the filesystem before doing this.
- then rerun xfs_repair with the -L option (you already retried mounting a few steps ago)
xfs_repair -L /dev/sdb1
- after it's done, remount and check dmesg for a successful mount
- restart dcache
/etc/init.d/dcache-pool start
/etc/init.d/dcache-core start
Inner workings of the PSU
From an email by Patrick Fuhrmann:
Ok, here we go. The procedure to find all matching pools: the request comes with a precise (StorageClass, dCacheClass, IPnumber). In the future there will be the protocol as well. Now we have to find the corresponding unit:
- Store Unit: we try to find an exact match of the incoming StorageClass in the list of storage classes (unit -store). If not found, we try to find some combinations of wildcards (*@osm, *.*).
- dCache unit: same as the Store Unit, but not used by anybody yet.
- Network Unit: the incoming IP number is compared with all net units. We start with the 255.255.255.255 mask and proceed with smaller masks until we reach 0.0.0.0 or we find a match (like IP routers do). If we find a match, we stop. As said in the previous e-mail, if 131.169.1.1 comes in and there are two net-units 0.0.0.0/0.0.0.0 and 131.169.0.0/255.255.0.0, we choose 131.169.0.0/255.255.0.0 (at this point we don't know anything about links yet). If there is no 131.1.. net-unit, the result of this step is the 0.0.0.0/0.0.0.0 net-unit.
Now we try to find ALL links which require a Store-Unit and a Net-Unit AND match the units found in the above step. Next we add those links which only require a Store-Unit or a Net-Unit and match the unit found above. (As a matter of fact there is an additional criterium, the transfer direction, but this doesn't help us to understand the mechanism.)
Now we take all links found above and sort them according to the preference for this transfer direction. For big setups this results in a matrix of links:
pref 100 : link1 link2 link3
pref 50 : link4 link5
pref ...
Now we resolve the links into pools (via pool groups). The result is:
pref 100 : pool1, pool2 .....
.....
If at least one pool of the highest preference is available, we send the resulting set of pools (of the highest preference) to the CM; the rest is clear. If none of the highest preference is available, we step to the next preference (a.s.o.).
This mechanism allows one to construct all possible setups. As said before: the treatment of the net-unit seems strange, but is really necessary if you think about it for a while.
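As an illustration of the units, links and preferences described above, a rough PoolManager.conf sketch with made-up names (syntax from memory for dCache 1.7/1.8; verify the exact commands and options against the dCache Book for your version):
psu create unit -store cms:data@osm
psu create unit -net 131.169.0.0/255.255.0.0
psu create unit -net 0.0.0.0/0.0.0.0
psu create ugroup cms-store
psu addto ugroup cms-store cms:data@osm
psu create ugroup local-net
psu addto ugroup local-net 131.169.0.0/255.255.0.0
psu create pool pool1
psu create pool pool2
psu create pgroup cms-pools
psu addto pgroup cms-pools pool1
psu addto pgroup cms-pools pool2
psu create link cms-local-link cms-store local-net
psu set link cms-local-link -readpref=100 -writepref=100 -cachepref=100
psu add link cms-local-link cms-pools
The link only matches when both the store unit and the net unit match; the read/write/cache preferences are what drives the sorting into the preference matrix described in the mail.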
Transfer tests
See all remote transfers:
cd RemoteGsiftpTransferManager ls
Problems and Solutions
While running ExDigiStatistics with some PU data, the system is really put to the test.
ulimit -n
Some of the jobs gave the following error message:
Server error message for [35]: "Unexpected Exception : java.net.SocketException: Too many open files" (errno 33). Failed open file in the dCache.
Apparently this is due to a restricted number of file descriptors. You can find out the current limit with
ulimit -n
A tip from Michael Ernst: (ulimit -n)/(# of active movers) = 5 or more (now it's 16 ;). The best thing is to put this in the dcache-pool script and then restart the pool.
mv /opt/d-cache/bin/dcache-pool /opt/d-cache/bin/dcache-pool-orig
cat /opt/d-cache/bin/dcache-pool-orig | sed 's/start)/start)\n\tulimit -n 16384/' > /opt/d-cache/bin/dcache-pool
chmod +x /opt/d-cache/bin/dcache-pool
/etc/init.d/dcache-pool restart
Detected Tx Unit Hang
One of the pools got the following kernel message:
e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
This resulted in a redetection of the network with the wrong network speed. Obviously, the adapter was a bit overloaded. A network restart fixed the problem.
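On an SL3-style node managed by initscripts that boils down to:
/etc/init.d/network restart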
Channel bonding
Using 2 e1000 PCI adapters, the maximum packet rate rose from 70k to 90k. Obviously the hardware was a bottleneck (3 adapters on PCI, a single P4), but it is still an improvement nonetheless. The setup was pretty basic and easy:
- following the setup from <kernel>/Documentation/networking/bonding.txt
- SL3 supports bonding config with initscripts. It's pretty easy and straightforward (a minimal example follows this list).
- don't forget to recompile ifenslave when using non-standard kernel.
- I tried bonding mode alb (which needs no special switch settings), but it requires the network driver to support set_dev_mac_address (not the case for the r8169!!); it also needs the miimon setting.
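A minimal sketch of the initscripts-based bonding config from bonding.txt; the address and the mode are made-up examples, and on SL3 (2.4 kernel) the module options go in /etc/modules.conf (on 2.6: /etc/modprobe.conf):
# /etc/modules.conf
alias bond0 bonding
# mode 6 = balance-alb; miimon in ms
options bond0 mode=balance-alb miimon=100

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=192.168.1.10
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-eth0 (idem for ifcfg-eth1 with DEVICE=eth1)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
Afterwards restart the network as in the previous section and check /proc/net/bonding/bond0 for the slave status.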
--
Links
- IIHE DCache http://maite.iihe.ac.be:2288/