Discussion:
[Samba] ctdb: Strange behaviour after upgrade
eisofen
2010-11-18 15:14:06 UTC
Hi,

Last weekend I updated samba and ctdb on my 2-node cluster. Samba is
now at 3.5.6 (from 3.3.4), ctdb at 1.0.114 (from 1.0.84). Both were installed
from the repo via yum and the ctdb packages.

After restarting both nodes everything was fine, we could access files on
the cluster.

On Monday I noticed that the nodes no longer had their initial addresses:

Node 1:
hostname dscln01, public IP 10.0.0.41/8, now 10.0.0.42/8
/etc/sysconfig/network-scripts/ifcfg-bond0:

DEVICE=bond0
BOOTPROTO=none
IPADDR=10.0.0.41
NETWORK=10.0.0.0
BROADCAST=10.0.0.255
NETMASK=255.0.0.0
ONBOOT=yes
USERCTL=no



Node 2:
hostname dscln02, public IP 10.0.0.42/8, now 10.0.0.41/8
/etc/sysconfig/network-scripts/ifcfg-bond0:

DEVICE=bond0
BOOTPROTO=none
IPADDR=10.0.0.42
NETWORK=10.0.0.0
BROADCAST=10.0.0.255
NETMASK=255.0.0.0
ONBOOT=yes
USERCTL=no

Yesterday it fell over, so we had to reboot both nodes, and the IPs were
still mixed up.

log.ctdb got some interesting entries after the reboot:

2010/11/17 09:48:02.613807 [ 4383]: killed 30 TCP connections to released IP 10.0.0.42
2010/11/17 09:48:02.633263 [ 4383]: re-adding secondary address 10.0.0.41/8 to dev bond0
2010/11/17 09:48:02.646140 [ 4383]: /etc/ctdb/interface_modify.sh: line 71: /etc/ctdb/state/interface_modify/bond0.readd.d/10.0.0.41.8/*: No such file or directory
2010/11/17 09:48:02.646446 [ 4383]: /etc/ctdb/state/interface_modify/bond0.readd.d/10.0.0.41.8/* 'bond0' '10.0.0.41' '8' - failed - 127
2010/11/17 09:48:02.646514 [ 4383]: call /etc/ctdb/state/interface_modify/bond0.readd.d/10.0.0.41.8/* 'bond0' '10.0.0.41' '8'
2010/11/17 09:48:02.647412 [ 4383]: Failed to del 10.0.0.42 on dev bond0
2010/11/17 09:48:02.649354 [ 4383]: server/ctdb_daemon.c:688 waitpid() returned error. errno:10

I have also noticed, or rather users report, slow performance when shutting
down their PCs. At closing time the load climbs to ~70 on both
nodes, with high CPU load on ctdbd and mmfsd. OK, that is 220 PCs writing
back their profiles...

Could ctdb be the blocking element when writing to its persistent DB, since
the local disks are not that fast?

Both nodes are hooked up to an Infortrend SAN, connected via FC-AL. The FS
is GPFS, running on CentOS 5.3.
Did I do something wrong after or before upgrading?


Matthias
Michael Adam
2010-11-18 20:44:10 UTC
Moin Eisofen!
Post by eisofen
Hi,
Last weekend I updated samba and ctdb on my 2-node cluster. Samba is
now at 3.5.6 (from 3.3.4), ctdb at 1.0.114 (from 1.0.84). Both were installed
from the repo via yum and the ctdb packages.
After restarting both nodes everything was fine, we could access files on
the cluster.
hostname dscln01, public IP 10.0.0.41/8, now 10.0.0.42/8
DEVICE=bond0
BOOTPROTO=none
IPADDR=10.0.0.41
NETWORK=10.0.0.0
BROADCAST=10.0.0.255
NETMASK=255.0.0.0
ONBOOT=yes
USERCTL=no
hostname dscln02, public IP 10.0.0.42/8, now 10.0.0.41/8
DEVICE=bond0
BOOTPROTO=none
IPADDR=10.0.0.42
NETWORK=10.0.0.0
BROADCAST=10.0.0.255
NETMASK=255.0.0.0
ONBOOT=yes
USERCTL=no
Yesterday it fell over, so we had to reboot both nodes, and the IPs were
still mixed up.
That is actually merely cosmetic.
When using public addresses with ctdb, you should not rely on a
specific node having a specific IP address.
It seems that in some release between 1.0.84 and 1.0.114
(I do not currently know exactly which) the algorithm for
distributing IPs across nodes was reversed.
I think this was also discussed on the #ctdb IRC channel
some weeks or even months ago.

Your clients should only ever access the cluster by its name, to
which the whole pool of public IP addresses is assigned, so it
should really not matter which node an address is assigned to.
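
As a rough illustration of "one name, whole pool": a round-robin DNS entry could look like the zone fragment below. The name dscl is made up for this example; only the two addresses come from this thread.

```
; hypothetical cluster name, one A record per public address in the pool
dscl    IN  A   10.0.0.41
dscl    IN  A   10.0.0.42
```

Clients would then connect to the cluster name rather than to dscln01 or dscln02. Running `ctdb ip` on a node lists which node currently serves each public address.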
Post by eisofen
2010/11/17 09:48:02.613807 [ 4383]: killed 30 TCP connections to released
IP 10.0.0.42
2010/11/17 09:48:02.633263 [ 4383]: re-adding secondary address
10.0.0.41/8 to dev bond0
2010/11/17 09:48:02.646140 [ 4383]: /etc/ctdb/interface_modify.sh: line
71: /etc/ctdb/state/interface_modify/bond0.readd.d/10.0.0.41.8/*: No such
file or
directory
/etc/ctdb/state/interface_modify/bond0.readd.d/10.0.0.41.8/* 'bond0'
'10.0.0.41' '8' - failed - 127
2010/11/17 09:48:02.646514 [ 4383]: call
/etc/ctdb/state/interface_modify/bond0.readd.d/10.0.0.41.8/* 'bond0'
'10.0.0.41' '8'
2010/11/17 09:48:02.647412 [ 4383]: Failed to del 10.0.0.42 on dev bond0
2010/11/17 09:48:02.649354 [ 4383]: server/ctdb_daemon.c:688 waitpid()
returned error. errno:10
Hmmm. Did you assign the public addresses 10.0.0.41 and 10.0.0.42
to the nodes statically? This is not good. If you need static IP
addresses on the public interfaces (e.g. for login etc.), you should
use a different set of addresses.
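
A minimal sketch of that separation, assuming the usual default file locations. The static address 10.0.0.51 is invented for the example, and the lines starting with # are labels for the two files, not file content.

```
# /etc/ctdb/public_addresses (identical on both nodes): the pool ctdb moves around
10.0.0.41/8 bond0
10.0.0.42/8 bond0

# /etc/sysconfig/network-scripts/ifcfg-bond0 on dscln01: static address outside the pool
DEVICE=bond0
BOOTPROTO=none
IPADDR=10.0.0.51
NETMASK=255.0.0.0
ONBOOT=yes
USERCTL=no
```

The second node would get its own static address (again outside the pool) in the same way, so that ctdb is the only thing adding or removing 10.0.0.41 and 10.0.0.42.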

Anyway, the above is a sign of a bug in the interface_modify.sh
script. I am not sure that it is very bad, though.
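
One plausible reading of the "No such file or directory ... failed - 127" lines, sketched without looking at the actual script: a glob that matches nothing is left as a literal string, and the shell then tries to execute that string as a command. The temporary directory below is a hypothetical stand-in for the per-address hook directory, and the guarded loop is only an illustration, not the fix from the patch.

```shell
#!/bin/sh
# Hypothetical stand-in for the per-address hook directory; it exists but is empty.
hookdir=$(mktemp -d)

# With nothing to match, the pattern stays the literal string "$hookdir/*",
# and the shell tries to run that path as a command.
$hookdir/* bond0 10.0.0.41 8 2>/dev/null
status=$?
echo "exit status: $status"

# Guarded variant: only run entries that really exist and are executable.
ran=0
for hook in "$hookdir"/*; do
    [ -x "$hook" ] || continue
    "$hook" bond0 10.0.0.41 8
    ran=$((ran + 1))
done
echo "hooks run: $ran"

rmdir "$hookdir"
```

Exit status 127 is what a shell reports when the command it was asked to run cannot be found, which matches the "failed - 127" in the log.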

There is a patch for this in the master branch, though, and I think
it should apply to the 1.0.114 version:

http://gitweb.samba.org/?p=sahlberg/ctdb.git;a=commit;h=e665cfde03fc9ec2264e99512ed5470872a2fd04

But we need to get clear about the pool vs. static IPs first.
Post by eisofen
I have also noticed, or rather users report, slow performance when shutting
down their PCs. At closing time the load climbs to ~70 on both
nodes, with high CPU load on ctdbd and mmfsd. OK, that is 220 PCs writing
back their profiles...
Has that been slow before?
Has the workload changed or just the samba+ctdb versions?
Workload of course also changes when profiles grow...
Post by eisofen
Could ctdb be the blocking element when writing to its persistent DB, since
the local disks are not that fast?
Depends on what the workload really looks like, but I guess rather not.
Post by eisofen
Both nodes are hooked up to an Infortrend SAN, connected via FC-AL. The FS
is GPFS, running on CentOS 5.3.
Did I do something wrong after or before upgrading?
I can't say for sure.
I'd need to look at your configs (ctdb + samba).

Cheers - Michael
Post by eisofen
Matthias