High number of collisions = intermittent network timeouts


jaltuve
2004-07-21, 21:20 PM
Hi,

I've ALWAYS had collisions on the network interface, and I used to think they were completely normal (they are in half-duplex environments). I also got used to frequent network timeouts when accessing my server over both ssh and http. This was not a big deal, because the symptoms were simply broken images or pages stalling (http) or ssh sometimes not connecting, and a quick second or third try always seemed to solve the problem.

Occasionally, I received an email or two a week from a user complaining that my site was not loading, or that they had to retry the address repeatedly in Internet Explorer before connecting. I'm now receiving more and more complaints from users reporting exactly what I've been experiencing. I'm referring to frequent network timeouts regardless of connection type, speed or location. Not just http, but ssh as well.

Last month, I contracted a third server with another hosting provider to do off-site backups of my data, so that should a failure occur, an alternate site would be ready and available. The backup site is a mirror of what I currently have with ServerBeach (even at a similar price). However, for the past three weeks I have not experienced a single drop, timeout or page stall when accessing this new server. What is really puzzling me, though, is that there is not a single collision on any of its network interfaces, and this is a 10Mb environment (half duplex, I guess?).

And yes, there's a LOT of traffic between the servers, because I keep them in sync via rsync. My ServerBeach server's CPU, memory and swap utilization levels are VERY low, and my site is 99% static HTML running Apache 2. (I use Nagios / MRTG to keep tabs on all the server subsystems, and the ONLY occasional alarms I get are collisions and the ping to the default router reporting packet loss every 2-3 days.)

I tried to compile some statistics. First I pinged BOTH servers, and neither reported packet loss on a ping loop lasting 4 hours. Then I tried 30 ssh connections to EACH server over two days (15 connections each day) and logged the results. 30 out of 30 ssh connections were successfully established on the FIRST try when connecting to the backup server. Unfortunately, the same test against the ServerBeach server failed 10 times out of 30!!!! By "failed" I mean ssh just hangs and after a few seconds reports a timeout; if I retry immediately afterwards, sometimes it will connect. I did some more testing by downloading big files and noticed the same trend: the initial connection sometimes results in a timeout, but once a connection has been established it NEVER drops. The problem seems to be in establishing connections for the first time.
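A first-try connection test like the one described above could be scripted roughly as follows. This is only a sketch, not the method actually used for the numbers in this post: the function names `probe` and `summarize` and the 10-second timeout are assumptions made for illustration, and it only checks that a TCP connection to the port can be opened, not that an ssh login completes. The demo at the bottom replays the tally reported above (20 successes out of 30) rather than touching the network.

```python
import socket


def probe(host, port, timeout=10.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def summarize(results):
    """Return (successes, attempts, failure percentage) for a list of booleans."""
    total = len(results)
    ok = sum(1 for r in results if r)
    failed_pct = 100.0 * (total - ok) / total if total else 0.0
    return ok, total, failed_pct


# Replay of the tally reported in the post: 10 failures out of 30 attempts.
reported = [True] * 20 + [False] * 10
ok, total, failed_pct = summarize(reported)
print(f"{ok}/{total} connected on the first try ({failed_pct:.1f}% failed)")
```

To collect fresh results, something like `results.append(probe("your.server.example", 22))` could be run from cron a few times a day (the hostname is a placeholder), with `summarize(results)` applied at the end.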

Can someone please explain to me what is happening? What can I or ServerBeach do to prevent it? I really DO like ServerBeach. I've been a client since 2003, and you have always responded promptly to my requests. I just want to know what's happening and have someone fix it. If it costs more money, I'll pay. If you need to charge me for the troubleshooting time, do it. I just want the problem fixed regardless of who's at fault.

bow-wow
2004-07-22, 00:20 AM
Originally posted by jaltuve
... I just want to know what's happening and have someone fix it. If it costs more money I'll pay, If you'll charge me the troubleshooting time, do it. I just want the problem fixed regardless of who's at fault.

In that case open a trouble ticket and have SB support look at it. Tell them everything you told us. As a rule, SB support does not monitor the forums.

csteelatgburg
2004-07-22, 08:21 AM
I don't have any experience with Nagios, although I have seen demonstrations of it in use. I do, however, have a good deal of MRTG experience.

Are you tracking collisions with MRTG? If so, would you mind sharing the config information that does it? I have experienced similar timeout problems, and I would like to compare notes, but I haven't been tracking collisions.

Thanks.

jaltuve
2004-07-22, 08:52 AM
Simply poll the network interface every 5 minutes, grep the collisions [0-9] field using a regex, then graph it.

The raw number of collisions is a completely irrelevant value on its own, though; you need to pair it with the amount of traffic to make sense of it. What you want is the % of collisions, which is the collision total vs. the total transfer.
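For anyone who wants to try this, here is a minimal sketch of the "grep the collisions field and turn it into a percentage" idea. It assumes the classic Linux ifconfig output layout (`TX packets:N` and `collisions:N`), and it assumes "total transfer" means transmitted packets, which is one common way to define the rate. The sample text and the function names are made up for illustration; they are not output from any real server.

```python
import re

# Hypothetical ifconfig output, trimmed to the relevant lines.
SAMPLE = """\
eth0      Link encap:Ethernet  HWaddr 00:00:00:00:00:00
          RX packets:500000 errors:0 dropped:0 overruns:0 frame:0
          TX packets:400000 errors:0 dropped:0 overruns:0 carrier:0
          collisions:20000 txqueuelen:1000
"""


def parse_counters(ifconfig_text):
    """Extract (tx_packets, collisions) from ifconfig-style text."""
    tx = re.search(r"TX packets[: ](\d+)", ifconfig_text)
    col = re.search(r"collisions[: ](\d+)", ifconfig_text)
    if tx is None or col is None:
        raise ValueError("could not find TX packets / collisions fields")
    return int(tx.group(1)), int(col.group(1))


def collision_percent(tx_packets, collisions):
    """Collisions as a percentage of transmitted packets."""
    if tx_packets == 0:
        return 0.0
    return 100.0 * collisions / tx_packets


tx_packets, collisions = parse_counters(SAMPLE)
print(f"{collision_percent(tx_packets, collisions):.1f}% collisions")  # prints "5.0% collisions"
```

The percentage, rather than the raw counter, is the value worth feeding to a grapher, since the counter only ever grows.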

The timeouts in my case have been common since day one, but now that I have a backup provider to compare with, I'm realizing there's something not right with ServerBeach. I was not expecting someone to tell me to open a ticket (I'll do it anyway).

I was not really expecting a support person to take a look at this, but rather a network engineer. Someone from netops should look into it, since this might be a network design problem affecting all your customers. I am one of them, and since what I want is a stable network connection, I'm trying to provide as much info as I can so this can be troubleshooted.

Rgds,

P.S. I will contact support and open a ticket, but please consider upgrading your clients to 100Mb or to VLANs with their own collision domains. Network timeouts will be common in any environment where a high number of collisions occur.

pindividual
2004-07-30, 12:29 PM
I'm having the same problem with ftp and ssh timing out. It has happened occasionally in the past but has gotten much worse in the last couple of days (July 28/29). I have also opened a ticket with SB (I read your post afterwards).

I, too, like SB but if they have a major network issue then I hope they are making it a top priority to address it.

I find it interesting that your post is on the same day as this SB network notice:

http://forums.serverbeach.com/showthread.php?s=&threadid=3877

That should only have affected the San Antonio datacenter, but I'm in the Virginia datacenter and have had problems since.

While it's nice to be offered 2000 GB of bandwidth, perhaps the heavy usage by some is clogging up the pipes?

-pi

P.S. Can you give more detail on how you're gathering collision/transfer stats?

jaltuve
2004-07-30, 12:54 PM
Simply run /sbin/ifconfig -a

That will show you data transfer and collisions. Time them: see how much you transfer in a specific period of time and how many collisions are generated in that same period.
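Timing the counters means taking two readings and looking at the difference, since ifconfig reports running totals since boot. A rough sketch of that arithmetic is below. The `Sample` class, the field names and the numbers are hypothetical, chosen only to make the sums easy to follow, and the sketch ignores counter wraparound, which can happen with 32-bit counters on a busy interface.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One reading of the interface counters (running totals)."""
    tx_packets: int
    tx_bytes: int
    collisions: int


def interval_stats(before, after, seconds):
    """Return (Mbit/s sent, collision % of sent packets) between two readings."""
    d_packets = after.tx_packets - before.tx_packets
    d_bytes = after.tx_bytes - before.tx_bytes
    d_collisions = after.collisions - before.collisions
    mbit_per_s = d_bytes * 8 / seconds / 1_000_000
    pct = 100.0 * d_collisions / d_packets if d_packets else 0.0
    return mbit_per_s, pct


# Two made-up readings taken 5 minutes (300 seconds) apart.
before = Sample(tx_packets=1_000_000, tx_bytes=900_000_000, collisions=40_000)
after = Sample(tx_packets=1_060_000, tx_bytes=975_000_000, collisions=43_000)

mbit, pct = interval_stats(before, after, seconds=300)
print(f"{mbit:.2f} Mbit/s sent, {pct:.1f}% collisions")  # prints "2.00 Mbit/s sent, 5.0% collisions"
```

Looking at the rate per interval, instead of the totals since boot, is what makes it possible to line up bursts of collisions with bursts of traffic.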

There's no way you can prevent the collisions; they will simply happen because of the way the ServerBeach network is set up (half duplex).

The problem SB is having (and they know it, which is why you don't see them refuting my posts about the network issues) comes from being a discount provider. You are connected to a cheap hub with 8 ports, where ALL ports share a 10Mb connection. That's why it's half duplex, and there's a lot of traffic going on in every hub. Problems arise when one of the clients on a specific hub generates a lot of traffic. Since the environment is NOT switched, throughput and reliability ARE affected.

Ways to solve this? Simply use VPNs and quality switches, where you can still limit the port to 10Mb without the collisions and/or reliability problems.

Overall, ServerBeach is a very good provider. It is very reliable, and when I had critical problems with my servers they were solved. But they need to take a look at their network infrastructure ASAP, otherwise clients will start looking somewhere else. My problems STILL PERSIST: I get frequent timeouts when making ssh connections and also when connecting to the web server. I did as SB suggested and opened a ticket, but I've not received any answer yet. I imagine they're still troubleshooting the problem.

The next step is a call to Richard Yoo. However, I DO think that this problem can be solved. After all, I've done exactly as SB advised me to do. The strange thing is that technical support has yet to take care of the ticket I opened last week. I wonder why it is taking so long.

knightfoo
2004-07-31, 02:34 AM
Originally posted by jaltuve
There's no way you can prevent the collisions; they will simply happen because of the way the ServerBeach network is set up (half duplex).

The problem SB is having (and they know it, which is why you don't see them refuting my posts about the network issues) comes from being a discount provider. You are connected to a cheap hub with 8 ports, where ALL ports share a 10Mb connection. That's why it's half duplex, and there's a lot of traffic going on in every hub. Problems arise when one of the clients on a specific hub generates a lot of traffic. Since the environment is NOT switched, throughput and reliability ARE affected.

Ahem. There is not a single hub in any ServerBeach datacenter, and it would be appreciated if you did not make assumptions about our operations that may not be true. Every connection to every customer server is 100% switched, from the NIC to the edge routers. The reason you see collisions is the 10Mbit half duplex switch port, and you will see collisions on any half duplex switch port. The uplinks between customer switches and the aggregate switches are no less than 100Mbit full duplex, so you are not seeing collisions because of other customers.


Ways to solve this? Simply use VPNs and quality switches, where you can still limit the port to 10Mb without the collisions and/or reliability problems.

Adding the complexity and protocol overhead of a VPN is not going to solve a network collision problem. In addition, we only use quality Cisco hardware that has proven to be reliable over many years of service. If you consider the cost of an upgrade from a 10Mbit switch to a 100Mbit switch, it really does not make sense in a discount environment. This does not mean that the switches are poor quality; they are simply not top-of-the-line GigE speed demons. :)

It is acceptable to see up to a 10% collision rate on a very busy server (6-8Mbit of traffic), especially if there is a lot of interactive traffic. It should not adversely affect your server unless it gets above 10%, which could indicate other problems.

-knightfoo