Windstream's Downstream Throughput

This is Windstream's downstream throughput (in Mbps, with a linear regression line that ignores zeros) in Arroyo del Agua, NM (87012). The blue grid line is the nominal rate; the regression line is red. The purple ghost bars represent the throughput from s2.

The original page for this site describes the abysmal situation that prompted its creation in 2010. It was clear when we started that our local access point's congestion was the main cause of the problem, but a lot has changed in 9 years. Recently, while reworking the charts to deal with Google killing the Image Charts API, it became clear that Windstream's last-mile bandwidth is no longer the problem. What has come to be the bottleneck at 25 Mbps is the performance of the ordinary backbone, the congestion and latency of the Tier 1 providers. While many net-neutrality advocates were focused elsewhere, our long-standing fear of internet fast & slow lanes has become a reality through the proliferation of privileged IXPs, so don't get too excited if you're planning to get gigabit fiber for the last mile; it probably won't help much. Sure, it will be useful when you're dealing with Google's edge PoPs and their cold-potato routing, or close-proximity sub-10-ms-latency speed tests and CDN endpoints, or if you share your LAN with bandwidth-hungry roommates or family members, but for the part of the internet in the hot-potato slow lanes, you'll probably be disappointed.

The Problem: Hot-Potato Routing Through a Congested Backbone

Three Machines

Some Facts

All 3 machines are running recent Linux kernels with modern TCP stacks and automatic TCP window scaling. Bear in mind that Google's Premium Tier egress is extremely expensive, currently (June 2019) between $0.12 and $0.23 per GB, depending on the endpoints. Let's look at some typical throughput ('rtt' is latency, and the throughput test runs for 8 seconds, so 'total bytes' is the throughput in bits per second).

First, to establish that our socket pair is fast, here's local calling local (~24 Gbps):

rtt: 0.392 ms; total bytes: 24,296,794,752

Now, to see it's not a problem with s1 being slow, here's s2 calling s1 (~675 Mbps):

rtt: 17.6 ms; total bytes: 675,334,784

Now let's look at our Windstream access point throughput, local calling s2:

rtt: 47.854 ms; total bytes: 25,752,320

And local calling s1:

rtt: 73.008 ms; total bytes: 25,934,400

Wait! you say. The so-called slow lane is faster! And yes —good catch— in this particular instance the throughput from s1 is a little better than that of s2, despite the s1 route's latency being consistently about 50% higher. That latency is at least partly introduced by the route always going the wrong way from New Mexico, "tromboning" through LA on it's way to St. Louis. Let's look at some typical traceroute hops, piped through a geolocation database. Here are some hops to s1.
4 40.138.81.154 ae2-0.agr03.albq01-nm.us.windstream.net ---Albuquerque, New Mexico, United States--- 9.373 ms 10.098 ms 10.264 ms
5 40.138.82.38 ae4-0.cr02.lsaj01-ca.us.windstream.net ---Los Angeles, California, United States--- 36.089 ms 36.091 ms 33.286 ms
6 69.174.17.201 ae23.cr4-lax2.ip4.gtt.net ---Paris, Île-de-France, France--- 52.278 ms 48.856 ms 51.670 ms
7 4.68.62.1 ae7.edge1.LosAngeles6.Level3.net ---Los Angeles, California, United States--- 44.908 ms 46.808 ms 47.107 ms
8  * * *
9 4.7.52.38 ae5.cr-atlas.stl1 ---St Louis, Missouri, United States--- 74.239 ms 73.478 ms 73.994 ms
The first thing to notice is that geolocation databases often incorrectly report the same city for big blocks of IP addresses owned by companies like GTT and Google, even when the actual nodes are in different data centers. Paris is obviously not possibe for 69.174.17.201 because, just doing a quick estimate using RTT in seconds through glass ≈ (fiber distance / (c * .67)) * 2, a smooth air arc RTT from Albuquerque to Paris with no equipment lag would be (8371 / (299792.458 * 0.67)) * 2 ≈ 0.08335 seconds, or 83.35 ms. Often the reverse DNS is more useful; "lax2" and the surrounding LA locations are a pretty good sign we stayed in LA.
Now here's a typical traceroute to s2, where the packets are handed off directly from Windstream to Google in Dallas (the above database caveat also applies to Ashburn, Virginia).
7 173.189.57.97 h97.57.189.173.static.ip.windstream.net ---Dallas, Texas, United States--- 37.149 ms 37.231 ms 37.406 ms
8 108.170.252.131 108.170.252.131 ---Ashburn, Virginia, United States--- 34.170 ms 34.282 ms 34.377 ms

When everything goes perfectly, reasonably high latency is not a big deal because the two ends of a long fat pipe can tune the connection automatically, but when something goes wrong (e.g. packets need to be resent) performance can drop off radically. Any particular stream of packets on the internet can get lucky, so the pattern of fluctuations only appears over time. Notice how much more generally consistent Google's private network is than the public backbone, and how 14:00-15:00 UTC and 03:00 UTC seem to be particulary rough times for hard-working packets to get ahead out there.

Some History

In June of 2017 Windstream upgraded me from 12 Mbps to 25 Mbps, but with both twisted pairs from the DSLAM, so potentially as high as 50 Mbps. Here's a typical speed test, done just now with a server 50 miles away.

speedtest

Life was sweet until the middle of February, 2018, when things started to go bad. Worried that the problem might be with s1 itself, I set up s2 in March 2018. Throughput looked much better, so I blamed it on s1 and went back to looking for what hears. A year later, when the GCE free trial ran out and (coincidentally) Google killed the Image Charts API, I revisited things. The test above showed that s1 itself was in fact fine, so I switched back to s1 in April of 2019. The result was not very good.

Digging more deeply into the current state of Linux's TCP congestion control, I discovered that these days the first 8 seconds of a TCP connection are often not a good measure of the real throughput, so I started skipping the first 8 seconds and measuring the next 8. That change increased the number of fast times, but still left the slowest very slow. Today (190604) I changed the socket client to use whichever 8-second chunk is better, and restarted testing with s2. The throughput from s2 now appears as a purple ghost on the chart, and the bar info now includes the s2 - s1 delta where available (a negative delta means s1 is better).

Conclusion

At its best the hot-potato backbone handles a packet stream at least as well as (and occasionally much better than) most cold-potato networks. The problem has become its extreme variability. The same pattern that once afflicted Windstream's network now afflicts the public backbone, and it seems (June 2019) to be getting worse. ◊ contact me ◊

 

τῇ καλλίστῃ