Cameron Watters: software engineer & erstwhile engineering manager / director

Take The Time To Truly Understand

The e-commerce web server cluster was due for an upgrade. New hardware was purchased. Windows and IIS were installed. Microsoft's NLB clustering was configured and the various e-commerce websites and services were deployed to the new two-node cluster.

And everything seemed just fine…

…and it was—for most folks, in most places.

A few weeks after the new cluster was rolled out, a customer from Singapore contacted support to say that they were unable to purchase anything on the store. Dutifully, a couple of developers tried to reproduce the issue the customer described:

I add an item to my cart and click the shopping cart icon to check out. When I arrive at the 'cart' page, all of my items are there. But, when I click 'proceed' to go to the billing page, nothing happens.

The checkout path was fairly straightforward: cart -> billing -> shipping -> confirmation -> order summary.

From looking at the logs, it was clear that the server was receiving the HTTP request to load the billing page, but, surprisingly, was redirecting the customer back to the cart. Based on the code, such a redirect would only be returned if the cart was empty. So, the cart page displayed a list of items but, once the customer clicked to proceed to the billing page, those items disappeared. When the cart page loaded again, though, the items reappeared.

A few days later, the customer emailed to say that he had been able to complete his order, so the problem was chalked up to something on the customer's end.

Over the next few months, customers emailed to report this problem a few more times—usually from Singapore.[1] But we had plenty of customers from Singapore and, based on their order histories, most of them were able to place their orders without a problem. The IIS server logs were reviewed to find more incidences of the cart -> billing page -> cart redirection pattern, but nothing stood out; the issue was relatively infrequent and was put on the back burner accordingly.

What if the problem is with the customer's ISP?

Approximately six months after the new cluster was deployed, a handful of the customers—relatively big spenders from Singapore—had become quite frustrated and escalated their concerns to the CEO. Tracking down the issue became a top priority.

During this round of investigation, an interesting detail emerged: all of the customers from Singapore were coming from a small number of IPs on a single subnet. Further, different customers showed up on the same IPs, so it appeared that they were sharing these IPs in some kind of network address translation pool.

At this point, the team investigating it speculated that there was a misbehaving caching proxy somewhere in Singapore—beyond their control. Again, they considered the issue essentially "resolved" because the problem was believed to lie with a component they couldn't fix.[2]

If that were true, wouldn't the customers have problems like this frequently and with other sites?

I joined the team responsible for most of the company's web applications—including the e-commerce websites—around the time the issue was escalated. I had a bit of a chance to weigh in with my new teammates as they worked through the issue, but I was the new guy and my background involved a minimal amount of ASP.NET/IIS on Windows. I had quite a lot of web experience, but felt a bit rusty having spent the past 2 years working on Windows and Mac OS desktop apps.

Even after they declared the problem beyond their control a second time, the customer emails kept coming in. The group's director was frustrated, but felt stuck—he didn't want to tell the customers tough luck. I asked him if we could have a few days to dig into the issue. I could bring a fresh set of eyes, and it would be a useful exercise for accelerating my learning of the new-to-me technology stack. That, and I had a nagging sense in the back of my mind that the problem was on our side. If this were really an issue on the customer's end, it seemed like they'd have experienced it elsewhere, too. By all accounts, they only experienced it on our site.

Reading The 🤬 Manual™

As noted, I was relatively new to ASP.NET/IIS. My background was building web applications primarily with Java, PHP, Python, and, regrettably, some Cold Fusion. So, I started by reading the IIS documentation and auditing our current configuration against the documentation's standard recommendations.

One thing became clear immediately—the documentation expected every server in the cluster to use an identical configuration. The documentation-prescribed way to do this was to place the configuration on a highly-available network share and point all cluster members at it, so that they were all reading literally the same config file. Our servers were not set up like this; they each had their own local configuration files.

Upon inspection, the files were very, very similar, but not identical. In particular, the id property of a given site on one server did not necessarily match the id property for the same site on the other server. This was notable because it was incorrect, but we didn't yet know if, how, or why it related to the problem at hand.
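We caught the mismatch by eyeballing the local configuration files side by side, but a small script would have surfaced it just as quickly. Here's a rough sketch (not the tooling we actually used) that compares site names and ids across nodes with the Microsoft.Web.Administration API—the node names are hypothetical:

    // Rough sketch (not our actual tooling): compare IIS site ids across cluster nodes
    // using the Microsoft.Web.Administration API. The node names are hypothetical.
    using System;
    using Microsoft.Web.Administration;

    class SiteIdAudit
    {
        static void Main()
        {
            foreach (var node in new[] { "WEB01", "WEB02" })   // hypothetical node names
            {
                using (ServerManager manager = ServerManager.OpenRemote(node))
                {
                    foreach (Site site in manager.Sites)
                    {
                        // A mismatch shows up as the same site name with different ids per node.
                        Console.WriteLine("{0}: {1} -> id {2}", node, site.Name, site.Id);
                    }
                }
            }
        }
    }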

Reviewing Logs

With that config file discrepancy in the back of my mind, we dug into the logs and code to see if there was anything my fresh eyes would see that had previously gone unnoticed.

Now, the logging system was set up a bit crudely—errors were logged to a mailing list (yes, it was terrible) and everything else just went to the local disk of the individual servers. The lack of a unified view of the logs masked an interesting detail for a little while, until we merged the log files.

The logs showed that most customers' browsers would submit all requests to the same server for their entire session—product page, add to cart, cart, billing, shipping, order summary, etc. We suspected, and then confirmed, that we had configured "Class C" affinity on the NLB cluster. Once we merged the logs, we noticed that the customers from Singapore who were reporting the problem stayed on the same server for most of their session, but there was a single request to the other server in the cluster when loading the billing page. Given the discrepancy between site ids in the configuration files we'd noticed above, this turned what had been a yellow flag orange.
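The merge itself was nothing sophisticated—roughly the equivalent of the sketch below, which interleaves the W3C-format IIS logs from the two nodes by timestamp and tags each line with its server. The file paths are hypothetical:

    // Rough sketch of merging per-server IIS W3C logs by timestamp.
    // Assumes the default W3C layout where the first two columns are date and time.
    using System;
    using System.Globalization;
    using System.IO;
    using System.Linq;

    class LogMerge
    {
        static void Main()
        {
            var sources = new[]   // hypothetical paths
            {
                new { Server = "WEB01", Path = @"C:\logs\web01\u_ex140101.log" },
                new { Server = "WEB02", Path = @"C:\logs\web02\u_ex140101.log" },
            };

            var merged = sources
                .SelectMany(s => File.ReadLines(s.Path)
                    .Where(line => !line.StartsWith("#"))   // skip W3C directive lines
                    .Select(line => new
                    {
                        s.Server,
                        Timestamp = DateTime.ParseExact(
                            string.Join(" ", line.Split(' ').Take(2)),
                            "yyyy-MM-dd HH:mm:ss",
                            CultureInfo.InvariantCulture),
                        Line = line
                    }))
                .OrderBy(e => e.Timestamp);

            foreach (var entry in merged)
                Console.WriteLine("{0} {1}", entry.Server, entry.Line);
        }
    }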

Spelunking Through the Code

Reviewing the code, we found that the billing page enforced two preconditions: the page had to be loaded via an HTTPS URI, and the cart had to contain at least one item.

The HTTPS precondition's implementation was a bit awkward. At the time, the website enforced HTTPS for "sensitive" pages like billing and account settings; but it also enforced unsecured HTTP for everything else.[3] If it received a request via HTTP for a "sensitive" page, it would return an HTTP 302 redirect to the same URI using HTTPS. So, when a typical user clicked the "proceed" button from the cart page, what happened was:

1 GET http://foo.com/checkout/billing
    302 Found w/ Location: https://foo.com/checkout/billing
2 GET https://foo.com/checkout/billing
    200 OK

But for the affected Singapore users, it looked more like this:

1 GET http://foo.com/checkout/billing
    302 Found w/ Location: https://foo.com/checkout/billing
2 GET https://foo.com/checkout/billing
    302 Found w/ Location: https://foo.com/cart
3 GET https://foo.com/cart
    302 Found w/ Location: http://foo.com/cart
4 GET http://foo.com/cart
    200 OK

It was clear from the web server logs that the billing page was being loaded over HTTPS, so that was being enforced.

The failing precondition was the requirement that there be at least one item in the cart. The cart page and the billing page consulted different data sources when evaluating the contents of the cart, and a single integer—quoteId—stored in ASP.NET Session State tied the two together. We could see from the application logs that, in the failing case, quoteId was 0—an invalid value.
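The original code isn't worth reproducing verbatim, but the two preconditions amounted to something like the following sketch. Names like GetCartItemCount are illustrative, not the real identifiers:

    // Illustrative sketch of the billing page's two preconditions—not the actual code.
    // GetCartItemCount and the exact redirect targets are assumptions for illustration.
    using System;
    using System.Web.UI;

    public class BillingPage : Page
    {
        protected void Page_Load(object sender, EventArgs e)
        {
            // Precondition 1: the billing page must be served over HTTPS.
            if (!Request.IsSecureConnection)
            {
                // 302 to the same URI over HTTPS, as seen in the traces above.
                Response.Redirect("https://" + Request.Url.Host + Request.RawUrl);
                return;
            }

            // Precondition 2: the cart must not be empty. The cart contents hang off a
            // quoteId kept in ASP.NET session state; 0 (or a missing value) means "no cart".
            int quoteId = (Session["quoteId"] as int?) ?? 0;
            if (quoteId == 0 || GetCartItemCount(quoteId) == 0)
            {
                // Bounce back to the cart; the cart page then enforces plain HTTP itself,
                // which produces the extra redirect in the failing trace.
                Response.Redirect("/cart");
                return;
            }

            // ...render the billing form...
        }

        private int GetCartItemCount(int quoteId)
        {
            // Hypothetical lookup against the quote/cart data source.
            return 0;
        }
    }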

But why did it only misplace the quoteId for people from Singapore?

Back to The Documentation

It turns out that SqlSessionStateStore incorporates the IIS site id into the session key it generates. So, if the site ids don't match across all servers in your farm, you'll seemingly lose session data when people jump from one server to another.
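Concretely, the SQL session state provider scopes session rows to an application name that ASP.NET derives from the app's IIS metabase path, and that path embeds the site id—for a site with id 1 it looks like "/LM/W3SVC/1/ROOT". A one-off probe like this, dropped into any page or handler, makes the mismatch visible:

    // The SQL session state provider scopes session rows to an application name that
    // ASP.NET derives from the app's IIS metabase path, which embeds the site id.
    using System.Web;

    public static class SessionAppIdProbe
    {
        public static string Describe()
        {
            // If the two nodes report different values here (e.g. "/LM/W3SVC/1/ROOT" vs
            // "/LM/W3SVC/2/ROOT"), they are reading and writing two different session
            // "applications" in the state database.
            return HttpRuntime.AppDomainAppId;
        }
    }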

But wasn't the NLB cluster configured to use "Class C" affinity? It turns out—from inspecting the logs again—that requests for the affected customer sessions were originating from one IP for unsecured HTTP and a different IP on a separate subnet for HTTPS. If, by chance, both IPs were affined to the same server, everything would "just work". But if they happened to become affined to separate servers, customers using that Singapore ISP could not proceed through checkout, because ASP.NET session state could not be shared between servers while the site ids did not match up across the cluster.
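For intuition: "Class C" affinity keys node selection on the client's /24 network rather than the full source address. NLB's actual algorithm is more involved than this, but conceptually it behaves something like the sketch below (the addresses are hypothetical stand-ins for the ISP's HTTP and HTTPS proxy pools):

    // Conceptual sketch only—NLB's real hashing is more sophisticated—of why "Class C"
    // affinity can send two source IPs on different /24 networks to different nodes.
    using System;
    using System.Linq;

    class ClassCAffinitySketch
    {
        static int PickNode(string clientIp, int nodeCount)
        {
            // "Class C" affinity considers only the first three octets (the /24 network).
            string network = string.Join(".", clientIp.Split('.').Take(3));
            return (network.GetHashCode() & 0x7FFFFFFF) % nodeCount;
        }

        static void Main()
        {
            // Two proxy pools on different subnets can hash to different nodes,
            // splitting one customer's "session" across the cluster.
            Console.WriteLine(PickNode("203.0.113.25", 2));
            Console.WriteLine(PickNode("198.51.100.73", 2));
        }
    }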

Once the site ids were aligned across both servers, the problem was actually solved. The remaining issues—relying on affinity, not serving the entire site over HTTPS, heavy use of ASP.NET Sessions, etc.—were addressed separately.

Takeaways

Take the time to truly understand the way the software you're using works

This is my favorite takeaway because skipping this step is what I've seen lead to the most wasted time and effort. Often, we settle on an explanation based on untested speculation rather than analysis. In this case, some hypothesized that a mysteriously misbehaving proxy or middlebox was at fault, but that hypothesis was never really validated. Mere plausibility is not an adequate threshold to stop investigating.

Other times, it's an unwarranted confidence that a software vendor (open source or commercial) wouldn't do something that seemingly foolish (yes, they would).

In other cases, it's more subtle—a bias toward looking for the source of a problem in the areas where the investigator is most expert, or toward blaming technology that is foreign, unfamiliar, or simply has a bad reputation.

Instead, we should read the documentation and use tools like profilers, tracing frameworks, decompilers, and, if available, source code, to inspect and understand the software we are running.

Log Everything (or, at least a lot more than you do now)

I've begun to take centralized logging for granted, but not long ago I heard from a friend and former co-worker who started at a new company where there was little, if any, logging set up for their production systems.

I rarely encounter people who are logging too much. In this case, the only logs we had were the IIS server logs and some really thin application logs. Had this issue come up a couple of years later, after we'd deployed centralized logging and application monitoring, it would have been solved much more easily.

Be willing to audit your code and config and even revisit past decisions

Be willing to audit even seemingly basic things like the web server configuration to make sure they are actually configured properly.

Always be willing to revise a decision made in the past that seemed reasonable given the context and information available at the time but is no longer appropriate given new information or changed context.

  1. There was one report from someone using HughesNet, too. 

  2. It's around this time that "We hate Singapore!" became a tongue-in-cheek way of talking about the issue, particularly in light of the initial conclusion that there was no way to solve it. Editor's Note: "We Hate Singapore!" was the original title of this post. 

  3. This is bad. And we later fixed it.