[Samba] NT_STATUS_NO_LOGON_SERVERS errors sporadically occurring
Jason Haar
2007-11-26 02:58:30 UTC
Hi there

I have samba-3.0.27a rolled out over a large number of servers, and
every once in a while one of them will start failing to allow people to
connect, with winbind reporting NT_STATUS_NO_LOGON_SERVERS, and
ntlm_auth failing with "NT_STATUS_NO_LOGON_SERVERS: No logon servers".
The same problem occurred with earlier versions too.

I think I've tracked down the cause of the problem as being "our fault",
but Samba really isn't handling it well. We have a 10.* network, and
servers with dual Ethernet cards, and sometimes/somehow the IP address
of the unused 2nd card (a 192.168.* address) starts getting broadcast
onto our Active Directory as being a domain controller IP. Then if
winbind decides to choose that address, it all starts failing, as that
address space isn't reachable.

If I do a "nslookup domain.AD" I get a listing of all our valid DC 10.*
addresses back - plus the unwanted 192.168 address - but it appears that
sometimes winbind decides that is the valid address, and won't try any
of the other addresses? And then you get the NT_STATUS_NO_LOGON_SERVERS
- as it isn't reachable.
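
The nslookup check above can be scripted. This is a minimal sketch, assuming the DCs are expected to sit in 10.0.0.0/8 as described; `resolve_ipv4` and `split_by_network` are hypothetical helper names, and the default port of 389 is taken from the log excerpt in this post.

```python
import ipaddress
import socket


def resolve_ipv4(name, port=389):
    """Return the sorted, de-duplicated IPv4 addresses that `name` resolves to."""
    infos = socket.getaddrinfo(name, port, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def split_by_network(addresses, network="10.0.0.0/8"):
    """Split addresses into (inside, outside) relative to the expected network."""
    net = ipaddress.ip_network(network)
    inside, outside = [], []
    for addr in addresses:
        # Anything outside the expected range is a candidate stray record.
        (inside if ipaddress.ip_address(addr) in net else outside).append(addr)
    return inside, outside
```

Running `split_by_network(resolve_ipv4("domain.AD"))` on an affected host should place the stray 192.168 address in the second list, which makes a bogus record easy to spot without reading the nslookup output by eye.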

Here are some excerpts from /var/log/samba/log.wb-DOMAIN:


ads_find_dc: looking for realm 'domain.AD'
get_sorted_dc_list: attempting lookup for name domain.AD (sitename
NULL) using [ads]
sitename_fetch: Returning sitename for domain.AD: "correct-sitename"
name domain.AD#20 found
get_dc_list: negative entry domain.AD removed from DC list
get_dc_list: returning 1 ip addresses in an ordered list
get_dc_list: 192.168.234.235:389


Those last two lines suggest why this problem occurs, but the problem
isn't being noticed within AD itself - I think Microsoft actually uses
ICMP pings to test that DCs are reachable? Does Samba? Also, I have no idea
why it returns only one, invalid IP - nslookup shows this particular
domain has 13 domain controller IPs listed, including the single 192.168 one.

Obviously to fix it I just have to whine at our AD people until they
clean out this bogus DC IP - but shouldn't Samba work its way around
this? As an added advantage, ping tests could even ensure Samba connects
to the closest DC by measuring the latency...?
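
The latency idea could look roughly like the sketch below. It is not how Samba selects a DC. ICMP needs raw sockets and usually root, so this version times a TCP connect to the LDAP port instead; that substitution, the 2-second timeout, and the function names are all assumptions made for illustration.

```python
import socket
import time


def tcp_connect_latency(ip, port=389, timeout=2.0):
    """Return the seconds taken to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, unreachable networks and timeouts.
        return None


def order_by_latency(addresses, probe=tcp_connect_latency):
    """Drop addresses the probe cannot reach and sort the rest, fastest first."""
    timed = [(probe(ip), ip) for ip in addresses]
    return [ip for _, ip in sorted(t for t in timed if t[0] is not None)]
```

The `probe` argument is injectable so the ordering logic can be exercised without touching the network. An unreachable address such as the 192.168 one would simply drop out of the returned list.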

Thanks!
--
Cheers

Jason Haar
Information Security Manager, Trimble Navigation Ltd.
Phone: +64 3 9635 377 Fax: +64 3 9635 417
PGP Fingerprint: 7A2E 0407 C9A6 CAF6 2B9F 8422 C063 5EBB FE1D 66D1
Jeremy Allison
2007-11-26 03:35:37 UTC
Post by Jason Haar
If I do a "nslookup domain.AD" I get a listing of all our valid DC 10.*
addresses back - plus the unwanted 192.168 address - but it appears that
sometimes winbind decides that is the valid address, and won't try any
of the other addresses? And then you get the NT_STATUS_NO_LOGON_SERVERS
- as it isn't reachable.
Here are some excerpts from /var/log/samba/log.wb-DOMAIN:
ads_find_dc: looking for realm 'domain.AD'
get_sorted_dc_list: attempting lookup for name domain.AD (sitename
NULL) using [ads]
sitename_fetch: Returning sitename for domain.AD: "correct-sitename"
name domain.AD#20 found
get_dc_list: negative entry domain.AD removed from DC list
get_dc_list: returning 1 ip addresses in an ordered list
get_dc_list: 192.168.234.235:389
those last two lines imply why this problem occurs, but this problem
isn't being noticed within AD itself - I think Microsoft actually uses
ICMP pings to test DCs are reachable? Does Samba? Also, I have no idea
why it returns only one, invalid IP - nslookup shows this particular
domain has 13 domain controller IPs listed - including the one 192.168 one.
Obviously to fix it I just have to whine at our AD people until they
clean out this bogus DC IP - but shouldn't Samba work its way around
this? As an added advantage, ping tests could even ensure Samba connects
to the closest DC by measuring the latency...?
We should notice this address is bad and add it to the negative
connection cache once we fail to connect - we actually use a lot
of techniques to ensure we don't get stuck on a bad DC (server
affinity cache, negative connection cache etc.). Is there a
chance you can get me a debug level 10 log when you're running into
this problem, so I can see what is going on?

Jeremy.
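
For readers unfamiliar with the term, the general idea of a negative connection cache can be sketched as follows. This is an illustration of the concept only, not Samba's implementation: the class name, the `pick_dc` helper, and the 60-second expiry are all assumptions.

```python
import time


class NegativeConnCache:
    """Remember addresses that recently failed so they are skipped for a while."""

    def __init__(self, ttl=60.0, clock=time.monotonic):
        self.ttl = ttl          # seconds a failure is remembered (assumed value)
        self.clock = clock      # injectable so expiry can be tested
        self._failed_at = {}

    def add(self, ip):
        self._failed_at[ip] = self.clock()

    def is_bad(self, ip):
        failed = self._failed_at.get(ip)
        if failed is None:
            return False
        if self.clock() - failed >= self.ttl:
            del self._failed_at[ip]  # entry expired; allow a retry
            return False
        return True


def pick_dc(candidates, cache, try_connect):
    """Return the first candidate that connects, caching every failure."""
    for ip in candidates:
        if cache.is_bad(ip):
            continue
        if try_connect(ip):
            return ip
        cache.add(ip)
    return None
```

With a list such as `["192.168.234.235", "10.0.0.1"]`, the first call pays for one failed connect to the unreachable address, and later calls skip it until the entry expires. Note that this only helps when the candidate list holds more than one address, unlike the single-entry list in the log above.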
