We Hate Singapore!
Nobody actually hates Singapore…
Well, nobody in this story, anyway.
The e-commerce web server cluster was due for an upgrade. New hardware was purchased. Windows and IIS were installed. Microsoft's NLB clustering was configured and the various e-commerce websites and services were deployed to the new two-node cluster.
And everything seemed just fine.
…but you couldn't really tell from how the website behaved
A few weeks after the new cluster was rolled out, a customer from Singapore contacted support to say that they were unable to purchase anything on the store. Dutifully, a couple of developers tried to reproduce the issue:
I add an item to my cart and click the shopping cart icon to check out. When I arrive at the 'cart' page, all of my items are there. But, when I click 'proceed' to go to the billing page, nothing happens.
The checkout path was fairly straightforward: cart -> billing -> shipping -> confirmation -> order summary.
From looking at the logs, it was clear that the server was receiving the HTTP request to load the billing page, but, surprisingly, was redirecting the customer back to the cart. Based on the code, such a redirect would only be returned if the cart was empty. So, the cart page displayed a list of items but, once the customer clicked to proceed to the billing page, those items disappeared. When the cart page loaded again, though, the items reappeared.
A few days later, the customer emailed to say that he had been able to complete his order, so the problem was chalked up to something on the customer's end.
Over the next few months, customers emailed to report this problem a few more times, usually from Singapore. But we had plenty of customers from Singapore and, based on their order histories, most of them were able to place their orders without problem. The IIS server logs were reviewed to find more instances of the cart -> billing page -> cart redirection pattern, but nothing really stood out. The issue was relatively infrequent, so it was put on the back burner accordingly.
But, what if the problem is with the customer's ISP?
Approximately six months after the new cluster was deployed, a handful of the customers—relatively big spenders from Singapore—had become quite frustrated and escalated their concerns. Tracking down the issue became a top priority.
During this round of investigation, a seemingly interesting detail emerged: all of the customers from Singapore were coming from a small number of IPs on a single subnet. Further, the IP address usage overlapped. So, it appeared that they were sharing these IPs in some kind of network address translation pool.
At this point, the team investigating it speculated that there was a misbehaving caching proxy somewhere in Singapore, beyond their control. Again, they considered the issue essentially "resolved" because the problem was believed to lie in a component they could not fix.
If that were true, wouldn't the customers have problems like this frequently and with other sites?
I joined the team responsible for most of the company's web applications, including the e-commerce websites, around the time the issue was escalated. I had a bit of a chance to weigh in with my new teammates as they worked through the issue, but I was the new guy and my background involved a minimal amount of ASP.NET/IIS on Windows. I had quite a lot of web experience, but felt a bit rusty having spent the past two years working on Windows and Mac OS desktop apps.
After they declared the problem beyond their control a second time, the emails kept coming in. The group's director was frustrated, but felt stuck—he didn't want to tell the customer tough luck. I asked him if I could have a few days to dig into the issue. I could bring a fresh set of eyes and it would be a useful exercise for accelerating my learning of the new-to-me technology stack. That, and I had a nagging sense in the back of my mind that the problem was on our side. If this was really an issue on the customer's end, it seemed like they'd have experienced it elsewhere, too. By all accounts, they only experienced it on our site.
Reading The Fucking Manual™
As noted, I was relatively new to ASP.NET/IIS. My background was building web applications with primarily Java, PHP, Python, and, regrettably, some Cold Fusion. So, I started by reading the manual and auditing our current configuration against the documentation's standard recommendations.
One thing became abundantly clear right away: the documentation expected all servers in the cluster to have identical configurations. The documentation-prescribed way to do this was to place the configuration on a highly-available network share and point all cluster members at that shared configuration so that they were all literally using the same config file. Our servers were not set up like this; they each had their own local configuration files.
Upon inspection, the files were very, very similar, but not identical. In particular, the id property of a given site on one server did not necessarily match the id property for the same site on the other server. This was notable because it was incorrect, but I didn't yet know if, how, or why it related to the problem at hand.
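A check like this can be automated. The following is a minimal sketch, assuming each server's configuration is an applicationHost.config file whose sites appear as site elements with name and id attributes; the site names and ids in the two excerpts are made up for illustration.

```python
import xml.etree.ElementTree as ET


def site_ids(config_xml):
    """Map each IIS site name to its id, given applicationHost.config XML text."""
    root = ET.fromstring(config_xml)
    return {site.get("name"): site.get("id") for site in root.iter("site")}


def mismatched_sites(a, b):
    """Return {name: (id_on_a, id_on_b)} for sites whose ids differ or are missing."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


# Hypothetical excerpts from the two servers' local config files.
WEB1 = """<configuration><system.applicationHost><sites>
  <site name="Store" id="1" />
  <site name="Services" id="2" />
</sites></system.applicationHost></configuration>"""

WEB2 = """<configuration><system.applicationHost><sites>
  <site name="Store" id="3" />
  <site name="Services" id="2" />
</sites></system.applicationHost></configuration>"""

diff = mismatched_sites(site_ids(WEB1), site_ids(WEB2))
print(diff)
```

Any site that shows up in the result has a different id on each node (or is missing from one of them) and deserves a closer look.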
With that config file discrepancy in the back of my mind, I dug into the logs and code to see if there was anything my fresh eyes would see that had previously gone unnoticed.
Now, the logging system wasn't set up quite the way I'd done it in the past: errors were logged to a mailing list (yes, it was terrible) and everything else just went to the local disk of the individual servers. I typically preferred to merge logs or forward them to a centralized logging server, which allowed seeing all requests to all servers in a unified history. The lack of a unified view of the logs masked an interesting detail for a little while, until I merged the log files I was reviewing.
The logs revealed that most customers' browsers would submit all requests to the same server for their entire session—product page, add to cart, cart, billing, shipping, order summary, etc. I suspected, and then confirmed, that we had configured "Class C" affinity on the NLB cluster. Once I unified the logs, I noticed that the customers from Singapore who were reporting the problem stayed on the same server for most of their session, but there was a single request to the other server in the cluster when loading the billing page. Given the discrepancy between site ids in the configuration files I'd noticed above, this turned what had been a yellow flag orange.
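The merge-then-scan step can be sketched roughly as follows. This is an illustration, not the tooling I actually used: the Entry record, the session field, and the sample timestamps are all hypothetical, and it assumes each server's log is already sorted by time and that requests can be tied to a session (for example via a logged cookie).

```python
import heapq
from collections import defaultdict
from typing import NamedTuple


class Entry(NamedTuple):
    timestamp: str  # ISO-8601 text, so string order is chronological order
    server: str
    session: str    # hypothetical session identifier
    uri: str


def merge_logs(*per_server_logs):
    """Merge already-sorted per-server logs into one chronological list."""
    return list(heapq.merge(*per_server_logs, key=lambda e: e.timestamp))


def server_hops(entries):
    """Return {session: [(previous_server, entry), ...]} for every request that
    landed on a different server than the same session's previous request."""
    last_server = {}
    hops = defaultdict(list)
    for e in entries:
        prev = last_server.get(e.session)
        if prev is not None and prev != e.server:
            hops[e.session].append((prev, e))
        last_server[e.session] = e.server
    return dict(hops)


# Made-up sample data: session s1 strays to web2 for the billing page.
WEB1 = [
    Entry("2000-01-01T10:00:00", "web1", "s1", "/cart"),
    Entry("2000-01-01T10:00:05", "web1", "s1", "/checkout/billing"),
    Entry("2000-01-01T10:00:07", "web1", "s1", "/cart"),
]
WEB2 = [
    Entry("2000-01-01T10:00:01", "web2", "s2", "/cart"),
    Entry("2000-01-01T10:00:06", "web2", "s1", "/checkout/billing"),
    Entry("2000-01-01T10:00:08", "web2", "s2", "/checkout/billing"),
]

hops = server_hops(merge_logs(WEB1, WEB2))
for session, events in hops.items():
    for prev, e in events:
        print(session, prev, "->", e.server, e.uri)
```

Looking at either server's log alone, nothing here looks odd; the stray request only shows up once the two histories are interleaved.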
Spelunking Through the Code
Reviewing the code, there were two preconditions enforced by the billing page: it required that the page was loaded via an HTTPS URI and that the cart contained at least one item.
The HTTPS precondition's implementation was a bit awkward. At the time, the website enforced HTTPS for "sensitive" pages like billing and account settings; but it also enforced unsecured HTTP for everything else. If it received a request via HTTP for a "sensitive" page, it would return an HTTP 302 redirect to the same URI using HTTPS. So, when a typical user clicked the "proceed" button from the cart page, what happened was:
  1. GET http://foo.com/checkout/billing
     302 Found w/ Location: https://foo.com/checkout/billing
  2. GET https://foo.com/checkout/billing
     200 OK
But for the affected Singapore users, it looked more like this:
  1. GET http://foo.com/checkout/billing
     302 Found w/ Location: https://foo.com/checkout/billing
  2. GET https://foo.com/checkout/billing
     302 Found w/ Location: https://foo.com/cart
  3. GET https://foo.com/cart
     302 Found w/ Location: http://foo.com/cart
  4. GET http://foo.com/cart
     200 OK
It was clear from the web server logs that the billing page was being loaded over HTTPS, so that was being enforced.
The failing precondition was the requirement that there be at least one item in the cart. The cart page and the billing page considered different data sources when evaluating the contents of the cart, and a single integer, quoteId, stored in ASP.NET Session State, related the two data sources. We could see from application logs that, in the failing case, quoteId was 0, which was an invalid value.
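To make the two preconditions concrete, here is a rough model of the billing page's decision logic. It is a sketch under assumptions, not the site's actual code: the QUOTES lookup, the session dictionary, and the billing_response function are invented for illustration, and only the redirect targets come from the traces above.

```python
# Hypothetical quote store: quoteId -> items in that quote.
QUOTES = {1234: ["widget"]}


def billing_response(scheme, session):
    """Return (status, location) for a request to the billing page."""
    # Precondition 1: the page must be loaded over HTTPS.
    if scheme != "https":
        return 302, "https://foo.com/checkout/billing"
    # Precondition 2: the cart, found via quoteId in the session, must be non-empty.
    # A missing quoteId is treated as 0, the invalid value seen in the logs.
    quote_id = session.get("quoteId", 0)
    if not QUOTES.get(quote_id):
        return 302, "https://foo.com/cart"
    return 200, None


print(billing_response("http", {"quoteId": 1234}))
print(billing_response("https", {"quoteId": 1234}))
print(billing_response("https", {}))  # session data lost
```

The third call is the failing case: the request arrives over HTTPS as required, but with no usable quoteId the page behaves exactly as if the cart were empty.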
But why did it only misplace the quoteId for people from Singapore?
Back to The Documentation
It turns out that SqlSessionStateStore incorporates the IIS site id into the session key it generates. So, if the site ids don't match across all servers in your farm, you'll seemingly lose session data when people jump from one server to another.
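A toy model shows why this looks like lost data. The class below is not how SqlSessionStateStore is implemented; the key format and the ToySessionStore name are invented, and the only behavior borrowed from the real provider is that the site id is part of the lookup key.

```python
class ToySessionStore:
    """Shared session store whose keys combine the session cookie with the
    site id of the server handling the request (key format is invented)."""

    def __init__(self):
        self._rows = {}

    @staticmethod
    def _key(session_cookie, site_id):
        return f"{session_cookie}:{site_id}"

    def save(self, session_cookie, site_id, data):
        self._rows[self._key(session_cookie, site_id)] = dict(data)

    def load(self, session_cookie, site_id):
        # An unknown key yields a fresh, empty session rather than an error.
        return dict(self._rows.get(self._key(session_cookie, site_id), {}))


store = ToySessionStore()

# The server where the site has id 1 handles the cart and saves the quote.
store.save("abc123", 1, {"quoteId": 1234})

same_server = store.load("abc123", 1)   # same site id: data found
other_server = store.load("abc123", 3)  # mismatched site id: empty session
print(same_server, other_server)
```

Nothing is deleted from the shared store. The second server simply looks under a different key, finds nothing, and starts from an empty session.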
But wasn't the NLB cluster configured to use "Class C" affinity? It turns out, from inspecting the logs again, that requests for the affected customer sessions were originating from one IP for unsecured HTTP and a different IP on a separate subnet for HTTPS. If, by chance, both IPs were affined to the same server, everything would "just work". But if they happened to become affined to separate servers, customers using that Singapore ISP could not proceed through checkout, because ASP.NET Session state could not be shared between servers if the site ids did not match up across the cluster.
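The effect of "Class C" affinity can be illustrated with a small sketch. This assumes affinity means "all clients in the same /24 network go to the same node"; the CRC32 hash is only a stand-in for NLB's own distribution algorithm, and the node names and addresses are examples.

```python
import ipaddress
import zlib

NODES = ["web1", "web2"]  # hypothetical two-node cluster


def class_c_network(ip):
    """Return the /24 ("Class C") network that contains the given IPv4 address."""
    return str(ipaddress.ip_network(f"{ip}/24", strict=False))


def node_for(ip):
    """Pick a node from the client's /24 network only (stand-in hash, not NLB's)."""
    network = class_c_network(ip)
    return NODES[zlib.crc32(network.encode()) % len(NODES)]


# Two addresses in the same /24 always land on the same node...
print(class_c_network("203.0.113.7"), node_for("203.0.113.7"))
print(class_c_network("203.0.113.200"), node_for("203.0.113.200"))

# ...but an address in a different /24 is mapped independently, so it may or
# may not land on the same node as the first two.
print(class_c_network("198.51.100.9"), node_for("198.51.100.9"))
```

With only two nodes, two unrelated subnets end up on the same server roughly half the time, which fits how intermittent the problem appeared to be.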
Once the site ids were aligned across both servers, the problem was actually solved. The remaining problems (relying on affinity, making the entire site use HTTPS, eliminating or at least reducing usage of ASP.NET Sessions, etc.) were addressed separately.
Take the time to truly understand the way the software you're using works
This is my favorite takeaway because it's the one I've seen people waste the most time on. More often than not, I see people mistakenly rule out some source based on an untested speculation rather than analysis. In this case, some hypothesized that a mysteriously misbehaving proxy or middlebox was at fault, but that hypothesis wasn't really tested. Mere plausibility is not an adequate threshold.
Other times, it's an unwarranted confidence that a software vendor (open source or commercial) wouldn't do something that foolish (yes, they would).
In other cases, it's more subtle—a bias toward looking for the source of a problem in the areas where the practitioner/investigator is most expert or stigmatizing technology that is foreign, unfamiliar, or has a bad reputation in general.
Instead, read the documentation and use tools like profilers, tracing frameworks, decompilers, and, if available, source code, to inspect and understand the software you're running.
Log Everything (or, at least a lot more than you do now)
I've begun to take centralized logging for granted, but not long ago, I heard from a friend and former co-worker who started at a new company where there was little, if any, logging set up for their production systems.
I rarely encounter people who are logging too much. In this case, the only logs I had were the IIS server logs and some really thin application logs. Had this issue come up a couple of years later, after we'd deployed centralized logging and application monitoring, it would have been solved much more easily.
Be willing to audit your code and config and revisit past decisions
Be willing to audit even seemingly basic things like the web server configuration to make sure things are actually configured properly.
Be willing to revise a decision made in the past that seemed reasonable given the context and information available at the time but is no longer appropriate given new information or changed context.