#1450 - 08/22/05 01:46 PM Help with reading a log and email problem
Rusty Offline


Registered: 08/22/05
Posts: 4
Loc: Southwest Washington
I'm working on a peer-to-peer network where we are having problems with email stalls. When a particular user checks their email at the remote email server, a connection is made and then it just stalls. Outlook Express says "connected" and then just sits there. I've run a lot of plots from two different machines to see if there is a clue in them. Here is a 24-hour snapshot from one of the machines that has problems; I have multiple time windows on this same setup. The problem is very intermittent: most of the time it works fine, or if you wait a bit and come back it's OK.

Thanks for the help,
Rusty


Attachments
1470-debbie13.png



#1451 - 08/22/05 02:00 PM Re: Help with reading a log and email problem [Re: Rusty]
Rusty Offline


Here's another plot with a shorter window, same machine. I also have some plots from a different machine on the network to the same IP.

Thanks,
Rusty


Attachments
1471-debbie14.png



#1452 - 08/22/05 02:36 PM Re: Help with reading a log and email problem [Re: Rusty]
Pete Ness Offline



Registered: 08/30/99
Posts: 1106
Loc: Boise, Idaho
That's not pretty!

You've got a couple of instances of possible problems here, but some more clear than others.

First, it looks like you have an internal router of some kind at hop 1 (ARIN thinks you own that IP). No packet loss - that's good.

Hop 2 is probably your border router to Eschelon Telecommunications - probably inside their complex someplace. The bit of packet loss here is slightly disturbing. I can't really see when that happened, but hop 4 is getting sporadic packet loss across the whole time in the picture, so hop 2 is probably seeing something similar. In fact, hop 12 is seeing similar patterns as well - some lost packets all over.

The inconsistent latency jumps (example: 3:35pm, 11:15am, 2:05pm) look to be bandwidth limitations at hop 4 or the hop 3->4 link (I'm guessing since I can't see the time graph for hop 3). That's pretty normal - you occasionally use your bandwidth, and when it's saturated, latency increases a bit. You'd *feel* some of this, but it shouldn't cause your email to stall.

Now, as you probably expect, the major issue starts at hop 8. Both hop 7 and hop 8 are owned by Alter.net. It's odd that it would be so absolutely obvious - this is probably a pretty core backbone for them, and problems like this would certainly be noticed.

This *COULD* be a return route problem. Hop 7 -> hop 8 is inside the Alter.net backbone, and problems of this magnitude and duration are a lot more likely at borders than inside the core of an established backbone provider (I would expect them to see this and not allow it to happen), so I suspect the real problem is on the return route. If, once your packets are in Atlanta, Alter.net decides that another network can route the replies back to you more effectively than they can, then we might see something like this.

The only way to know this for sure is actually to trace from the remote server back to you and see how the packets flow. Ideally, you'd be able to run PingPlotter from that network, but any traceroute would be able to tell you about the network that is being used. Especially interesting would be to do this during a period where you're seeing problems (ie: trace from the remote side back to you).

If I were you, I would start out by calling Eschelon (I assume you're buying your bandwidth from Eschelon) and ask them about this huge packet loss problem at hop 8. They can then contact Alter.net (or any other peers that may be routing data - in case the return route is different) to find out what's going on (or they may already know what's going on).

When you're seeing big periods of packet loss in PingPlotter, you might try opening a browser to that same target. There appears to be a web server running there with a relatively light-weight web page. This would be a good test to make sure the PingPlotter results are correlated with real-world problems (like inability to connect with POP and/or web pages timing out). It's always important to verify PingPlotter results against real-world impacts. If you see 80% packet loss but everything's working snappy and fast, then the network may be routing ICMP packets differently from TCP packets - we need to make sure we're chasing the right problem.
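The browser check can also be done with a small script that times a plain TCP connection to the same target. This is only a minimal sketch: the function name `tcp_connect_ms` and the host `mail.example.com` in the usage comment are placeholders for illustration, not anything from the plots.

```python
import socket
import time
from typing import Optional


def tcp_connect_ms(host: str, port: int, timeout: float = 5.0) -> Optional[float]:
    """Return the time in milliseconds needed to open a TCP connection.

    Returns None if the connection fails or times out.
    """
    start = time.monotonic()
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return None


# Hypothetical usage (placeholder host): port 80 is the web server,
# port 110 is POP3.
#   print(tcp_connect_ms("mail.example.com", 80))
#   print(tcp_connect_ms("mail.example.com", 110))
```

If this connects quickly while PingPlotter shows heavy loss to the same address, ICMP and TCP are probably being treated differently somewhere along the path.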

The periods of red show a few packets getting through, which means you might get the page to load, but it would be horribly slow. If those are the results you see, then you know the red in PingPlotter does actually correlate to network problems and you're pretty certain where the problem is: the Seattle -> Atlanta link inside Alter.net. The latency jump here is normal (that's quite a few miles), but the packet loss is not.
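As a rough sanity check on why a latency jump over that distance is normal, here is a back-of-the-envelope estimate. The city coordinates and the figure of about 200,000 km/s for light in fibre are assumptions added for illustration, and real cable routes are longer than the great-circle line, so the result is only a lower bound.

```python
import math

EARTH_RADIUS_KM = 6371.0
FIBRE_KM_PER_S = 200_000.0  # assumed: roughly two thirds of the speed of light

# Approximate city-centre coordinates (latitude, longitude), assumed values.
SEATTLE = (47.61, -122.33)
ATLANTA = (33.75, -84.39)


def great_circle_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Haversine distance between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def min_rtt_ms(distance_km: float) -> float:
    """Best-case round-trip time over a fibre of the given length."""
    return 2.0 * distance_km / FIBRE_KM_PER_S * 1000.0


distance_km = great_circle_km(*SEATTLE, *ATLANTA)
print(f"{distance_km:.0f} km, at least {min_rtt_ms(distance_km):.0f} ms round trip")
```

Tens of milliseconds are therefore expected from distance alone; packet loss is a separate matter that distance does not explain.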

So, my summary:

* It looks like the major problem is the hop 7 -> hop 8 link.
* Verify this by loading the web page at the target IP during times with low packet loss and high packet loss - make sure there is a noticeable difference.
* If you can, trace from the remote side back to you and see what networks are participating in returning data to you.
* If you can confirm the problem as shown in PingPlotter is really what's happening, report this to your ISP, along with your collected PingPlotter data / graphs, and ask them what's going on.
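
To pick the low-loss and high-loss periods for the comparison in the second bullet, the ping samples can be bucketed into fixed windows. This is a sketch only: the list-of-samples format (latency in ms, or `None` for a lost packet) and the sample values are assumptions for illustration, not PingPlotter's export format.

```python
from typing import List, Optional


def loss_percent(samples: List[Optional[float]]) -> float:
    """Percentage of samples that were lost (None means no reply)."""
    if not samples:
        return 0.0
    lost = sum(1 for s in samples if s is None)
    return 100.0 * lost / len(samples)


def loss_by_window(samples: List[Optional[float]], window: int) -> List[float]:
    """Split the samples into consecutive windows and report loss for each."""
    return [
        loss_percent(samples[i:i + window])
        for i in range(0, len(samples), window)
    ]


# Made-up samples: a bad stretch followed by a clean one.
example = [50.0, None, None, None, None, 52.0, 51.0, 50.0, 49.0, 50.0]
print(loss_by_window(example, 5))
```

Windows with high loss are the ones to line up against the times when the web page or POP connection misbehaves.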

That's the way I'd go after this.

Good luck! Let us know what you find out. Also, if you find out any additional information (like the return route) you want us to comment on, post back here.

- Pete

#1453 - 08/22/05 03:14 PM Re: Help with reading a log and email problem [Re: Pete Ness]
Rusty Offline


Thanks Pete!! Especially for the quick reply!!

I'll get started on doing some homework. Your suggestion to verify will be a bit tougher since I'm not actually on site all the time. Also, the problem is very sporadic when it shows up, and the only place we see it is when an email stalls. I can probably get the reverse plot done pretty easily. A couple of other tidbits that might help us along the way:

1) I've run a same time plot from another machine on the network and it looks completely different--any ideas there? We haven't seen any problems inside the building.
2) I've done extensive internal plotting and pinging and have seen no problems at all
3) I've been in the middle of an email stall on a particular machine and able to check the internet connection and speed at the same time. Speed from CNET tests at all times is pretty consistent.
4) Yes, you are correct on the router. They use a Netgear unit that connects to the Eschelon equipment, which is in the same room and rack for the whole building. Eschelon is also carrying their phone service on the same equipment.

Thanks again!!!

Have a Great Day,
Rusty

#1454 - 08/23/05 01:05 AM Re: Help with reading a log and email problem [Re: Rusty]
Pete Ness Offline



Quote:
1) I've run a same time plot from another machine on the network and it looks completely different--any ideas there? We haven't seen any problems inside the building.

What does "completely different" mean? Different route? Different results? This is almost certainly worth looking into, as differences in results can often lead to clues.

Quote:
2) I've done extensive internal plotting and pinging and have seen no problems at all

That's what I would expect - since there are no lost packets on your internal router.

Quote:
3) I've been in the middle of an email stall on a particular machine and able to check the internet connection and speed at the same time. Speed from CNET tests at all times is pretty consistent.

The packet loss starting at hop 8 would almost certainly not happen against *all* sites - it's probably specific to a certain subset of IP addresses from a specific area of the country. A trace from the mail server back to your network would help us understand this - especially if we could do this during a problem period. If you have control of that server (and can install a Perl script), contact us at support@pingplotter.com and we might be able to help you trace from that machine remotely with PingPlotter.

- Pete

#1455 - 08/31/05 04:01 PM Re: Help with reading a log and email problem [Re: Pete Ness]
Rusty Offline


Thanks Pete,

We've managed to get some plots back from the other end using different programs. The same alter.net connection shows up on them as well. Let me see if I can paste them here:

1 0 0 0 216.46.228.229 port-216-3073253-es128.devices.datareturn.com
2 0 0 0 64.29.192.145 port-64-1949841-zzt0prespect.devices.datareturn.com
3 0 0 0 64.29.192.226 daa.g921.ispb.datareturn.com
4 0 0 0 168.215.241.133 hagg-01-ae0-1001.dlfw.twtelecom.net
5 0 0 0 66.192.253.124 core-02-ge-3-1-0-504.dlfw.twtelecom.net
6 8 0 0 66.192.246.53 peer-02-so-0-0-0-0.dlfw.twtelecom.net
7 1 1 1 63.218.23.30 uunet.pos6-2.cr02.dal01.pccwbtn.net
8 8 1 1 152.63.97.57 0.so-1-0-0.xl1.dfw9.alter.net
9 49 49 49 152.63.38.82 0.so-6-0-0.xl1.sea1.alter.net
10 49 49 48 152.63.106.226 pos5-0.xr1.sea1.alter.net
11 49 49 49 152.63.105.201 195.atm6-0.gw3.sea1.alter.net
12 105 49 49 157.130.190.130 eschelon-gw1.customer.alter.net
13 53 129 122 64.65.171.10 access0-atm0-0-0-200.pdx.eschelon.com
14 54 53 55 64.65.188.98

1 0 2 0 216.46.228.229 port-216-3073253-es128.devices.datareturn.com
2 0 0 0 64.29.192.145 port-64-1949841-zzt0prespect.devices.datareturn.com
3 0 0 0 64.29.192.226 daa.g921.ispb.datareturn.com
4 0 0 0 168.215.241.133 hagg-01-ae0-1001.dlfw.twtelecom.net
5 0 0 0 66.192.253.124 core-02-ge-3-1-0-504.dlfw.twtelecom.net
6 0 0 0 66.192.246.53 peer-02-so-0-0-0-0.dlfw.twtelecom.net
7 1 1 1 63.218.23.30 uunet.pos6-2.cr02.dal01.pccwbtn.net
8 1 58 2 152.63.97.57 0.so-1-0-0.xl1.dfw9.alter.net
9 49 49 49 152.63.38.82 0.so-6-0-0.xl1.sea1.alter.net
10 49 49 49 152.63.106.226 pos5-0.xr1.sea1.alter.net
11 49 58 49 152.63.105.201 195.atm6-0.gw3.sea1.alter.net
12 49 49 49 157.130.190.130 eschelon-gw1.customer.alter.net
13 60 54 55 64.65.171.10 access0-atm0-0-0-200.pdx.eschelon.com
14 54 53 54 64.65.188.98
15 63 61 61 66.213.199.209

Trace complete

1 0 0 0 216.46.228.229 port-216-3073253-es128.devices.datareturn.com
2 0 0 0 64.29.192.145 port-64-1949841-zzt0prespect.devices.datareturn.com
3 0 0 0 64.29.192.226 daa.g921.ispb.datareturn.com
4 0 0 0 168.215.241.133 hagg-01-ae0-1001.dlfw.twtelecom.net
5 0 0 0 66.192.253.124 core-02-ge-3-1-0-504.dlfw.twtelecom.net
6 0 0 * 66.192.246.53 peer-02-so-0-0-0-0.dlfw.twtelecom.net
7 1 1 1 63.218.23.30 uunet.pos6-2.cr02.dal01.pccwbtn.net
8 1 1 1 152.63.97.57 0.so-1-0-0.xl1.dfw9.alter.net
9 49 49 49 152.63.38.82 0.so-6-0-0.xl1.sea1.alter.net
10 49 48 49 152.63.106.226 pos5-0.xr1.sea1.alter.net
11 49 49 49 152.63.105.201 195.atm6-0.gw3.sea1.alter.net
12 49 49 49 157.130.190.130 eschelon-gw1.customer.alter.net
13 58 54 53 64.65.171.10 access0-atm0-0-0-200.pdx.eschelon.com
14 54 54 67 64.65.188.98
15 81 150 135 66.213.199.209

Trace complete

1 0 0 0 216.46.228.229 port-216-3073253-es128.devices.datareturn.com
2 0 0 0 64.29.192.145 port-64-1949841-zzt0prespect.devices.datareturn.com
3 0 0 0 64.29.192.226 daa.g921.ispb.datareturn.com
4 0 0 0 168.215.241.133 hagg-01-ae0-1001.dlfw.twtelecom.net
5 0 0 0 66.192.253.124 core-02-ge-3-1-0-504.dlfw.twtelecom.net
6 0 0 0 66.192.246.53 peer-02-so-0-0-0-0.dlfw.twtelecom.net
7 1 1 1 63.218.23.30 uunet.pos6-2.cr02.dal01.pccwbtn.net
8 1 1 1 152.63.97.57 0.so-1-0-0.xl1.dfw9.alter.net
9 49 49 49 152.63.38.82 0.so-6-0-0.xl1.sea1.alter.net
10 49 49 49 152.63.106.226 pos5-0.xr1.sea1.alter.net
11 49 49 49 152.63.105.201 195.atm6-0.gw3.sea1.alter.net
12 49 49 49 157.130.190.130 eschelon-gw1.customer.alter.net
13 54 115 54 64.65.171.10 access0-atm0-0-0-200.pdx.eschelon.com
14 58 53 115 64.65.188.98
15 62 62 62 66.213.199.209

Trace complete

1 0 4 0 0.4 ms 66.36.240.2 c-vl102-d1.acc.dca2.hopone.net.
2 0 0 0 0.4 ms [+0ms] 66.224.226 AS0 core1.dca2.hopone.net.
3 2 1 1 1.9 ms [+1ms] 66.36.224.18 AS3 ge3-0.core1.iad1.hopone.net.
4 3 3 4 3.8 ms [+1ms] 207.228.224.189 AS0 pos5-1.core2.dca1.hopone.net.
5 4 4 4 4.0 ms [+0ms] 65.207.95.197 AS0 500.pos5-1.gw3.dca8.alter.net.
6 4 4 9 4.0 ms [+0ms] 152.63.37.30 AS0 0.so-4-0-0.xl1.dca8.alter.net.
7 62 64 61 61 ms [+57ms] 152.63.145.237 AS0 0.so-0-1-0.xl1.sea1.alter.net.
8 68 70 62 62 ms [+0ms] 152.63.106.226 AS0 pos5-0.xr1.sea1.alter.net.
9 76 64 65 62 ms [+0ms] 152.63.105.193 AS0 195.atm5-0.gw3.sea1.alter.net.
10 69 62 62 62 ms [+0ms] 157.130.190.130 AS1942 eschelon-gw1.customer.alter.net.
11 77 70 68 66 ms [+3ms] 64.65.171.238 AS0 access0-atm1-0-150.pdx.eschelon.com.
12 77 67 66 66 ms [+0ms] 64.65.188.98 AS0 unknown.eschelon.com
13 436 429 * 429 ms [+362ms] 66.213.199.209 AS4999 [Reached Destination]66.213.199.209 [No PTR]

Does this confirm we have the problem at the same spot, coming from either direction? We're also working to get more info from the other end.
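
One way to read listings like these is to look for the largest rise in best-case latency between consecutive hops. A minimal Python sketch follows, assuming each row is hop number, three probe times in ms, IP address and optional hostname, as in the first four listings above. The function names are made up for illustration, and the sample rows are copied from the first listing.

```python
from typing import List, Optional, Tuple


def parse_hop(line: str) -> Tuple[int, List[Optional[int]], str]:
    """Parse one row of the form: hop t1 t2 t3 ip [hostname]."""
    fields = line.split()
    hop = int(fields[0])
    # "*" marks a probe that got no reply; keep it as None.
    times = [None if f == "*" else int(f) for f in fields[1:4]]
    return hop, times, fields[4]


def largest_jump(lines: List[str]) -> Tuple[int, int, int]:
    """Return (from_hop, to_hop, increase_ms) for the biggest rise in best-case latency."""
    best = []
    for line in lines:
        hop, times, _ip = parse_hop(line)
        answered = [t for t in times if t is not None]
        if answered:  # skip hops where every probe was lost
            best.append((hop, min(answered)))
    jumps = [(a[0], b[0], b[1] - a[1]) for a, b in zip(best, best[1:])]
    return max(jumps, key=lambda j: j[2])


SAMPLE = [
    "6 8 0 0 66.192.246.53 peer-02-so-0-0-0-0.dlfw.twtelecom.net",
    "7 1 1 1 63.218.23.30 uunet.pos6-2.cr02.dal01.pccwbtn.net",
    "8 8 1 1 152.63.97.57 0.so-1-0-0.xl1.dfw9.alter.net",
    "9 49 49 49 152.63.38.82 0.so-6-0-0.xl1.sea1.alter.net",
    "10 49 49 48 152.63.106.226 pos5-0.xr1.sea1.alter.net",
]

print(largest_jump(SAMPLE))  # (8, 9, 48)
```

A jump like this marks a long-haul link, not necessarily a fault. Three probes per hop also say very little about packet loss, so loss still has to be measured over a longer run.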

Have a Great Day,
Rusty
