Over the last few months, I’ve been working with a UK Government customer to move them from a legacy managed services contract with a systems integrator to a disaggregated solution built around SaaS services and a virtual datacentre in Azure. I’d like to write a blog post on that but will have to be careful about confidentiality and it’s probably better that I wait until (hopefully) a risual case study is created.
One of the challenges we came across in recent weeks was application performance to a third-party-hosted solution that is accessed via a site-to-site VPN from the virtual datacentre in Azure.
My understanding is that outside access to Microsoft services hits a local point of presence (using geographically-localised DNS entries) and then is routed across the Microsoft global network to the appropriate datacentre.
The third-party application in Bedford (UK) and the virtual datacentre is in West Europe (Netherlands) so the data flows should have just been in Europe. Even so, a traceroute from the third-party provider’s routers to our VPN endpoint suggested several long (~140ms) hops once traffic hit the Microsoft network. These long hops were adding significant latency and reducing application performance.
I logged a call under the customer’s Azure support contract and after several days of looking into the issue, then identifying a resolution, Microsoft came back and said words to the effect of “it should be fixed now – can you try again?”. Sure enough, ping times (not the most accurate performance test it should be said) were significantly reduced and a traceroute showed that the last few hops on the route were now down to a few milliseconds (and some changes in the route). And overnight reports that had been taking significantly longer than previously came down to a fraction of the time – a massive improvement in application performance.
I asked Microsoft what had been done and they told me that the upstream provider was an Asian telco (Singtel) and that Microsoft didn’t have direct peering with them in Europe – only in Los Angeles and San Francisco, as well as in Asia.
The Microsoft global network defaults to sending peer routes learned in one location to the rest of the network. Since the preference of the Singtel routes on the West Coast of the USA was higher than the preference of the Singtel routes learned in Europe, the Microsoft network preferred to carry the traffic to the West Coast of the US. Because most of Singtel’s customers are based in Asia, it generally makes sense to carry traffic in that direction.
The resolution was to reconfigure the network to stop sending the Singtel routes learned in North America to Europe and to use one of Singtel’s local transit providers in Europe to reach them.
So, if you’re experiencing poor application performance when integrating with services in Azure, the route taken by the network traffic might just be something to consider. Getting changes made in the Microsoft network may not be so easy – but it’s worth a try if something genuinely is awry.