The e-commerce web server cluster was due for an upgrade. New hardware was purchased. Windows and IIS were installed. Microsoft’s NLB clustering was configured and the various e-commerce websites and services were deployed to the new two-node cluster.
And everything seemed just fine.
…but you couldn’t really tell from how the website behaved
A few weeks after the new cluster was rolled out, a customer from Singapore contacted support to say that they were unable to purchase anything on the store. Dutifully, a couple of developers tried to reproduce the issue:
I add an item to my cart and click the shopping cart icon to check out. When I arrive at the ‘cart’ page, all of my items are there. But, when I click ‘proceed’ to go to the billing page, nothing happens.
The checkout path was fairly straightforward: cart -> billing -> shipping -> confirmation -> order summary.
From looking at the logs, it was clear that the server was receiving the HTTP request to load the billing page, but, surprisingly, was redirecting the customer back to the cart. Based on the code, such a redirect would only be returned if the cart was empty. So, the cart page displayed a list of items but, once the customer clicked to proceed to the billing page, those items disappeared. When the cart page loaded again, though, the items reappeared.
A few days later, the customer emailed to say that he had been able to complete his order, so the problem was chalked up to something on the customer’s end.
Over the next few months, customers emailed to report this problem a few more times, usually from Singapore[1]. But we had plenty of customers from Singapore and, based on their order histories, most of them were able to place their orders without a problem. The IIS server logs were reviewed to find more instances of the cart -> billing -> cart redirection pattern, but nothing really stood out. The issue was relatively infrequent, so it was put on the back burner.
But, what if the problem is with the customer’s ISP?
Approximately six months after the new cluster was deployed, a handful of the customers—relatively big spenders from Singapore—had become quite frustrated and escalated their concerns. Tracking down the issue became a top priority.
During this round of investigation, a seemingly interesting detail emerged: all of the customers from Singapore were coming from a small number of IPs on a single subnet. Further, the IP address usage overlapped. So, it appeared that they were sharing these IPs in some kind of network address translation pool.
At this point, the team investigating the issue speculated that there was a misbehaving caching proxy somewhere in Singapore, beyond their control. Again, they considered the issue essentially “resolved” because the problem was believed to lie with a component they could not change.[2]
If that were true, wouldn’t the customers have problems like this frequently and with other sites?
I joined the team responsible for most of the company’s web applications, including the e-commerce websites, around the time the issue was escalated. I had a bit of a chance to weigh in with my new teammates as they worked through the issue, but I was the new guy and my background involved a minimal amount of ASP.NET/IIS on Windows. I had quite a lot of web experience, but felt a bit rusty having spent the past 2 years working on Windows and Mac OS desktop apps.
After they declared the problem beyond their control a second time, the emails kept coming in. The group’s director was frustrated, but felt stuck—he didn’t want to tell the customer tough luck. I asked him if I could have a few days to dig into the issue. I could bring a fresh set of eyes and it would be a useful exercise for accelerating my learning of the new-to-me technology stack. That, and I had a nagging sense in the back of my mind that the problem was on our side. If this was really an issue on the customer’s end, it seemed like they’d have experienced it elsewhere, too. By all accounts, they only experienced it on our site.
Reading The Fucking Manual™
As noted, I was relatively new to ASP.NET/IIS. My background was building web applications with primarily Java, PHP, Python, and, regrettably, some Cold Fusion. So, I started by reading the manual and auditing our current configuration against the documentation’s standard recommendations.
One thing became abundantly clear immediately: the documentation expected all servers in the cluster to have identical configurations. The documentation-prescribed way to do this was to place the configuration on a highly-available network share and point all cluster members at that shared configuration so that they were all literally using the same config file. Our servers were not set up like this; they each had their own, local configuration files.
Upon inspection, the files were very, very similar, but not identical. In particular, the id property of a given site on one server did not necessarily match the id property for the same site on the other server. This was notable because it was incorrect, but I didn’t yet know if, how, or why it related to the problem at hand.
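The kind of audit described above can be scripted. What follows is a minimal sketch, not the check I actually ran: it assumes an IIS 7-style applicationHost.config layout in which each site is a `<site>` element with `name` and `id` attributes, and the site names and ids in the two excerpts are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpts of each node's local configuration file.
# The site names and ids are made up for this example.
NODE_A = """<configuration><system.applicationHost><sites>
  <site name="store" id="2" />
  <site name="api" id="3" />
  <site name="blog" id="4" />
</sites></system.applicationHost></configuration>"""

NODE_B = """<configuration><system.applicationHost><sites>
  <site name="store" id="3" />
  <site name="api" id="2" />
  <site name="blog" id="4" />
</sites></system.applicationHost></configuration>"""


def site_ids(config_xml):
    """Map each site name to the value of its id attribute."""
    root = ET.fromstring(config_xml)
    return {site.get("name"): site.get("id") for site in root.iter("site")}


def mismatched_sites(config_a, config_b):
    """Return {site name: (id on node A, id on node B)} for ids that differ."""
    ids_a, ids_b = site_ids(config_a), site_ids(config_b)
    return {
        name: (ids_a.get(name), ids_b.get(name))
        for name in sorted(set(ids_a) | set(ids_b))
        if ids_a.get(name) != ids_b.get(name)
    }


for name, (id_a, id_b) in mismatched_sites(NODE_A, NODE_B).items():
    print(f"{name}: node A has id {id_a}, node B has id {id_b}")
```

A site that exists on only one node shows up with `None` on the other side, which is also worth knowing about.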
With that config file discrepancy in the back of my mind, I dug into the logs and code to see if there was anything my fresh eyes would see that had previously gone unnoticed.
Now, the logging system wasn’t set up quite the way I’d done it in the past: errors were logged to a mailing list (yes, it was terrible) and everything else just went to the local disk of the individual servers. I typically preferred to merge logs or forward to a centralized logging server, which allowed seeing all requests to all servers in a unified history. The lack of a unified view of the logs masked an interesting detail for a little while, until I merged the log files I was reviewing.
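Merging per-server logs into one timeline does not take much. Here is a rough sketch under stated assumptions: each line starts with a date and a time, every file is already in chronological order, and the server names, IPs, and log lines are invented for the example (real IIS logs carry more fields than this).

```python
import heapq
from datetime import datetime

# Hypothetical, simplified log lines: "date time client-ip method uri status".
WEB1_LOG = [
    "2010-03-01 10:00:01 203.0.113.7 GET /cart 200",
    "2010-03-01 10:00:09 203.0.113.7 GET /cart 200",
]
WEB2_LOG = [
    "2010-03-01 10:00:05 198.51.100.9 GET /checkout/billing 302",
]


def entries(server, lines):
    """Yield (timestamp, server, rest of line) tuples for one server's log."""
    for line in lines:
        date, time, rest = line.split(" ", 2)
        stamp = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
        yield stamp, server, rest


def merge_logs(logs_by_server):
    """Interleave already-sorted per-server logs into a single timeline."""
    streams = [entries(server, lines) for server, lines in logs_by_server.items()]
    return list(heapq.merge(*streams))


merged = merge_logs({"web1": WEB1_LOG, "web2": WEB2_LOG})
for stamp, server, rest in merged:
    print(stamp.isoformat(sep=" "), server, rest)
```

`heapq.merge` relies on each input already being sorted, which holds for append-only log files; tagging every entry with its server name is what makes a stray request to the other node stand out.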
The logs revealed that most customers’ browsers would submit all requests to the same server for their entire session—product page, add to cart, cart, billing, shipping, order summary, etc. I suspected, and then confirmed, that we had configured “Class C” affinity on the NLB cluster. Once I unified the logs, I noticed that the customers from Singapore who were reporting the problem stayed on the same server for most of their session, but there was a single request to the other server in the cluster when loading the billing page. Given the discrepancy between site ids in the configuration files I’d noticed above, this turned what had been a yellow flag orange.
Spelunking Through the Code
Reviewing the code, there were two preconditions enforced by the billing page: it required that the page was loaded via an HTTPS URI and that the cart contained at least one item.
The HTTPS precondition’s implementation was a bit awkward. At the time, the website enforced HTTPS for “sensitive” pages like billing and account settings; but it also enforced unsecured HTTP for everything else.[3] If it received a request via HTTP for a “sensitive” page, it would return an HTTP 302 redirect to the same URI using HTTPS. So, when a typical user clicked the “proceed” button from the cart page, what happened was:
1. HTTP GET to http://foo.com/checkout/billing
   302 Found w/ Location: https://foo.com/checkout/billing
2. HTTP GET to https://foo.com/checkout/billing
But for the affected Singapore users, it looked more like this:
1. GET http://foo.com/checkout/billing
   302 Found w/ Location: https://foo.com/checkout/billing
2. GET https://foo.com/checkout/billing
   302 Found w/ Location: https://foo.com/cart
3. GET https://foo.com/cart
   302 Found w/ Location: http://foo.com/cart
4. GET http://foo.com/cart
It was clear from the web server logs that the billing page was being loaded over HTTPS, so that was being enforced.
The failing precondition was the requirement that there be at least one item in the cart. The cart page and the billing page considered different data sources when evaluating the contents of the cart and a single integer—quoteId—stored in ASP.NET Session State related the two data sources. We could see from application logs that, in the failing case, quoteId was 0—which was an invalid value.
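The two preconditions can be sketched as a single function. This is an illustration rather than the site's actual code: the function name, the dictionary standing in for ASP.NET Session State, and the example quoteId of 42 are all assumptions. Only the redirect targets and the "quoteId of 0 is invalid" rule come from the behavior described above.

```python
BILLING_URL = "https://foo.com/checkout/billing"
CART_URL = "https://foo.com/cart"


def billing_response(scheme, session):
    """Return (status, redirect location) for a request to the billing page.

    `session` is a plain dict standing in for ASP.NET Session State.
    """
    # Precondition 1: the billing page must be loaded over HTTPS.
    if scheme != "https":
        return 302, BILLING_URL
    # Precondition 2: the cart must not be empty. A missing quoteId is
    # treated the same as the invalid value 0.
    if session.get("quoteId", 0) == 0:
        return 302, CART_URL
    return 200, None


# A typical customer: redirected to HTTPS once, then served the page.
print(billing_response("http", {"quoteId": 42}))
print(billing_response("https", {"quoteId": 42}))
# An affected customer: the session arrives without a usable quoteId.
print(billing_response("https", {}))
```

Seen this way, the bounce back to the cart is not a bug in the billing page itself. The page is correctly rejecting a session that no longer carries the quoteId.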
But why did it only misplace the quoteId for people from Singapore?
But wasn’t the NLB cluster configured to use “Class C” affinity? It turns out, from inspecting the logs again, that requests for the affected customer sessions were originating from one IP for unsecured HTTP and a different IP on a separate subnet for HTTPS. If, by chance, both IPs were affined to the same server, everything would “just work”. But if they happened to become affined to separate servers, customers using that Singapore ISP could not proceed through checkout, because ASP.NET Session State could not be shared between servers when the site ids did not match across the cluster.
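To make the interaction concrete, here is a toy model of the failure. It is a sketch under loud assumptions: `pick_server` is a stand-in that maps a /24 network to a node and is not NLB's real hashing; the session store keyed by (site id, session id) is a simplification of "session state cannot be shared when site ids differ"; and the IPs, site ids, and quoteId are invented.

```python
import ipaddress

SERVERS = ["web1", "web2"]


def pick_server(ip):
    """Toy "Class C" affinity: every address in the same /24 maps to one node."""
    network = ipaddress.ip_network(f"{ip}/24", strict=False)
    # Drop the last octet, then spread the networks across the nodes.
    return SERVERS[(int(network.network_address) >> 8) % len(SERVERS)]


def quote_id_at_billing(http_ip, https_ip, site_ids):
    """Return the quoteId the billing page sees for one simulated checkout."""
    store = {}  # shared session store, keyed by (site id, session id)
    cart_server = pick_server(http_ip)
    store[(site_ids[cart_server], "session-1")] = {"quoteId": 42}
    billing_server = pick_server(https_ip)
    session = store.get((site_ids[billing_server], "session-1"), {})
    return session.get("quoteId", 0)


MISMATCHED = {"web1": 2, "web2": 3}  # hypothetical ids, different per node
ALIGNED = {"web1": 2, "web2": 2}     # the same id on both nodes

# HTTP and HTTPS arrive from different subnets and land on different nodes.
print(quote_id_at_billing("203.0.113.7", "198.51.100.9", MISMATCHED))
# Both requests arrive from the same /24, so affinity hides the problem.
print(quote_id_at_billing("203.0.113.7", "203.0.113.200", MISMATCHED))
# Different nodes again, but the aligned ids let them find the same session.
print(quote_id_at_billing("203.0.113.7", "198.51.100.9", ALIGNED))
```

The middle case is why the bug stayed hidden for most customers: as long as affinity kept a whole session on one node, the mismatched ids never came into play.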
Once the site ids were aligned across both servers, the problem was actually solved. The remaining problems—relying on affinity, making the entire site use HTTPS, eliminating (or, at least, reducing) usage of ASP.NET Sessions, etc—were addressed separately.
Take the time to truly understand the way the software you’re using works
This is my favorite takeaway because it’s the one I’ve seen people waste the most time on. More often than not, I see people mistakenly rule out some source based on an untested speculation rather than analysis. In this case, some hypothesized that a mysteriously misbehaving proxy or middlebox was at fault, but that hypothesis wasn’t really tested. Mere plausibility is not an adequate threshold.
Other times, it’s an unwarranted confidence that a software vendor (open source or commercial) wouldn’t do something that foolish (yes, they would).
In other cases, it’s more subtle—a bias toward looking for the source of a problem in the areas where the practitioner/investigator is most expert or stigmatizing technology that is foreign, unfamiliar, or has a bad reputation in general.
Instead, read the documentation and use tools like profilers, tracing frameworks, decompilers, and, if available, source code, to inspect and understand the software you’re running.
Log Everything (or, at least a lot more than you do now)
I’ve begun to take centralized logging for granted, but not long ago, I heard from a friend and former co-worker who started at a new company where there was little, if any, logging set up for their production systems.
I rarely encounter people who are logging too much. In this case, the only logs I had were the IIS server logs and some really thin application logs. Had this issue come up a couple of years later, after we’d deployed centralized logging and application monitoring, it would have been solved much more easily.
Be willing to audit your code and config and revisit past decisions
Be willing to audit even seemingly basic things like the web server configuration to make sure things are actually configured properly.
Be willing to revise a decision made in the past that seemed reasonable given the context and information available at the time but is no longer appropriate given new information or changed context.
[1] There was one report from someone using HughesNet, too.
[2] It’s around this time that “We hate Singapore!” became a tongue-in-cheek way of talking about the issue, particularly in light of the initial conclusion that there was no way to solve it.
The software industry is constantly evolving, and you must actively pursue skill and knowledge growth in order to stay relevant. Some small amount of growth will occur as a natural side effect of your daily work, but that isn’t enough. The best engineers I have worked with over the past 20 years are constantly reading[1] and aggressively developing their skills.
In recent years, I increasingly encounter engineers who are unfamiliar with some subject that comes up in a discussion, or, more commonly (and tragically), their knowledge of the subject is very narrow or, at times, cargo cultish.
In response, I’ve created a reading list for software engineers[2] that covers a number of concepts that I increasingly consider fundamental. The list, of course, reflects my biases. I work primarily with web technologies in enterprise or enterprise-like contexts. Most of the list should be applicable to any moderately complex system, particularly those that involve communication between applications over the internet[3].
One book a month seems more than reasonable and I expect that many people could finish this list in half that time or faster. The recommended books are listed in the order I recommend reading them.
This seems like a reasonable minimum set of reading for the first year in addition to whatever language or platform-specific stuff you have to learn to actually perform your job.
[1] I focus on reading because it is my preferred medium for information acquisition and there is a plethora of good material. If you’re into audio or video, and you’re able to find enough good, relevant content, excellent.
[2] The primary target audience is software engineers in their first year on the job. In many cases, this isn’t just their first year working on a given team, it is also their first full-time software job.
[3] The proper capitalization of “internet” is not universally agreed upon. More info here, here and here. I choose not to capitalize it.
Today I called Microsoft to order some Windows 2000 Terminal Services CALs.
Once I get over just how foreign it is to have to call a company to acquire a license in order to let someone access the server in a slightly different fashion (e.g. via Terminal Services), it seems like the mechanics should be simple enough. In fact, they even appear to make it easy for you. Launch the Terminal Services Licensing application and select the menu item that says “Add Client Access Licenses” or something like that. It gives you a phone number and some crazy long License Server ID to give the nice Hyderabadi on the other end of the line. This is where it gets interesting…
Dial the number. “Welcome to the Windows XP Activation Service.” Huh? Keep listening. “…and I can also help you with Windows Terminal Services Licensing.” Finally. What, Microsoft can’t afford to dedicate an 800 number to this thing? O.K., step through 2 levels of voice tree only to be placed on hold. Wait time was short enough; so far, so good.
“Hello, my name is [unintelligible]. How can I help you today?”
“Yes, I’d like to obtain some Windows Terminal Services CALs.”
“Please read me the license key.”
“You mean the License Server ID?”
“You’re wanting to install Terminal Server CALs, correct?”
“Yes, but I don’t have the license key yet. That’s why I’m calling. I need to obtain them first. It says to call this number to do that.”
“Oh, o.k. read me the License Server ID.”
“[read really long alphanumeric string over the phone]”
“What is your volume licensing program id?”
“Huh? What’s that? We don’t use volume licensing.”
“O.K., you need to contact the Microsoft Reseller.”
“Which Microsoft Reseller? The product says to obtain the licenses from this phone number.”
“Oh, please tell me your product ID.”
“You mean the Product ID for Windows 2000? I don’t have it with me. Shouldn’t the License Server ID cover that? I mean, isn’t that why you require Terminal Server activation and all that?”
“It should be on the product packaging.”
“Right. I don’t usually keep the product packaging handy for a product I originally purchased 4 years ago. I’m really going to have to dig that out of wherever to do this?”
“Great. Thanks a million.”
“Thank you for calling Microsoft. Have a great day.”
So, now to go dig up the product key and then repeat the process. I’m still confused as to how this model of software constitutes a greater value to small businesses than Free Software. Knowing Microsoft, I’ll call back only to be told that Windows 2000 is an EOL product, so I can only buy additional CALs if I spend hundreds of dollars to upgrade to Windows 2003 Server or Windows Server 2003. All so that I can let one additional computer interact with the server.
The word "brand" has so many meanings now, some more whacked-out than others, that using it has ceased to be useful. — Hugh Macleod
Some marketers wrongly believe that the quality and effectiveness of a brand is a singular function of the evocative name they assign it and the nifty messaging they supply alongside it. Unfortunately (for them), this isn’t the case. Marketers may be able to pick the name, but nothing influences a brand more than the execution behind it. No amount of messaging will convince people that Microsoft is tops when it comes to security. And Microsoft is certainly not pushing the idea that the Zune is a clunky turd, but that’s the word on the street. No amount of messaging or naming can overcome poor execution.
Additionally, the names associated with the brand need not be spectacular. Sure, occasionally, someone comes up with a truly great name, and it certainly helps. And, of course, there are truly terrible names — names that are hard to remember, hard to pronounce or just plain confusing.
If you’re really lucky, your customers will help to choose your name for you (FedEx). If they do, embrace it. Or, if you give your customers multiple potential names to associate with your products and services (say, a product name AND a corporate name) and one of them is the clear winner in the marketplace, adopt the one they embrace and consider leaving the other(s) behind (as Oracle, Motorola, and Xerox did).
A less-than-perfect name, backed by stellar execution, is dramatically more valuable than a clever name and average or weak execution. Instead of arguing endlessly about naming strategies, adopt a name customers understand (assuming it’s not derogatory) and then focus your energies on executing the hell out of your products.
When you are the founder of a company, you want to skimp on frills; they seem like a waste of money to you. That’s fine. But don’t think that candidates interviewing at your company will have the same emotional attachment; they won’t. They are looking for a nice place to work.
…[I]f you want to be successful in the software business, you have to have a management team that thoroughly understands and loves programming, but they have to understand and love business, too. Finding a leader with strong aptitude in both dimensions is difficult, but it’s the only way to avoid making one of those fatal mistakes…
It’s an old article, but it fits with the theme of stuff I’ve been thinking about lately.
I’ve heard it said many times that the march of progress means that business people take over from the pioneers, but I’ve observed the opposite. When the boom is finished, the technology will still be here, and while progress may have suffered during the euphoria (the money is rarely used to fund new ideas), the ball never really stops rolling while everyone is focused on the money-obsessed. When the boom is over, we’ll still be here, pushing new ideas forward.
It struck me that someone who complains constantly should be marked down as remarkably optimistic. The complainer believes that people actually might care.
I liked this quote, but I’m not sure I entirely agree with the rest of his sentiment, which is that the most pessimistic people are the non-complainers. I imagine a large group of the non-complainers are pessimists, but there’s a sizable group of non-complainers that are just unaware that something is wrong. A smaller group of non-complainers is also made up of discouraged former complainers (but still optimists at heart), bitter and jaded that nobody ever listens.
When we say that one kind of work is overpaid and another underpaid, what are we really saying? In a free market, prices are determined by what buyers want. People like baseball more than poetry, so baseball players make more than poets. To say that a certain kind of work is underpaid is thus identical with saying that people want the wrong things.
The principles from this essay can readily be extrapolated to other related topics, including the relative value of employees to an organization.
Before you embrace your wonderful solution to the marketplace’s problem, first decide how many consumers are choosing to listen to messages like yours. Are they listening in a medium you can afford?
This highlights a style of software design shared by Microsoft and the open source movement, in both cases driven by a desire for consensus and for “Making Everybody Happy,” but it’s based on the misconceived notion that lots of choices make people happy, which we really need to rethink.
“The only way to change the world is to imagine it different than the way it is today,” he says. “Apply too much of the wisdom and knowledge that got us here, and you end up right where you started. Take a fresh look from a new perspective, and get a new result.”
The Internet is inherently seditious. It undermines unthinking respect for centralized authority, whether that “authority” is the neatly homogenized voice of broadcast advertising or the smarmy rhetoric of the corporate annual report.
No, this isn’t new. And no, it isn’t the first time I’ve read it. But I’m reading it again, and this sentence (again) resonated with me as I read it.
I’m not sure who comes up with these jewels, but it’s probably the same crowd that thinks vacuous motivational posters and their platitudes are actually effective. This argument suggests that dressing up will positively affect our performance. Notice, I didn’t say “reflect”, I said “affect”. These people actually believe that employee productivity is directly proportional to the “professionalism” of their attire. This may be true for a few people, but it isn’t universally true because it is wholly dependent on the unspoken assumption that everyone associates “well-dressed” with “successful”. I’m not talking about interaction with the public or customers here. I’m talking about how people feel about themselves. The degree to which someone associates “well-dressed” with “successful” is largely a function of their cultural upbringing. A person that grew up in the Midwestern United States in the 50s and 60s is going to feel very differently about this than someone raised on the West Coast during the 80s and 90s. It turns out that this argument is little more than the cultural preference of regional and generational subgroups masquerading as a universal principle. In many cases, this approach to dress also reflects cultural socio-economic prejudices. In past decades, particularly east of the Rocky Mountains, jeans and t-shirts were the traditional garb of the working class.
“But”, they say, “people will take you more seriously if you dress up more!” Maybe. That probably says a lot more about them than it does about anyone else. If you’re dealing face-to-face with the public or your customers, great, dress up. I understand that. At that point, how you dress becomes a part of the public-facing corporate image—the brand. What one wears while sitting at a desk, far, far away from customers does not.
This argument is especially difficult to swallow if one is also forced to sit at second-rate desks held up by stacks of books. How can what the employees wear possibly matter when the office equipment and furniture are the shabbiest things around the office? The quality of the tools and equipment speaks volumes more about the nature of the company than employees’ clothing choices.
Your Attire Reflects Your Character?
This argument probably pains me more than any other. It is ultimately a classist, socio-economically prejudiced position. This suggests that people well-schooled in “proper” attire and fashion are somehow of greater character and/or discipline than those who have not been similarly schooled. Putting shoulder pads and a helmet on Miss America wouldn’t make her a football player any more than putting a nice dress, makeup and a wig on Tom Brady would make him a beautiful woman. In order to make character judgments about people based on their attire, we must assume that the associations between various types of attire and character are well-understood and agreed upon by both the wearer and those evaluating his dress. Again, these assumptions derive not from a single, well-established, objective source, but instead from the culture (or sub-culture) in which each person was raised. Those cultural inputs are both regional and socio-economic. Neither one’s region nor one’s socio-economic status of origin is a legitimate primary basis for character evaluation. Additionally, it is both lazy and ignorant to place the bulk of the burden for bridging the potential expectation gap on the shoulders of the wearer, since neither person is objectively correct.
This view is particularly destructive because many (most?) people are more comfortable in the attire that comes naturally to them. Of course, there are limits and there is a legitimate baseline for cleanliness and hygiene, but needlessly forcing someone who grew up in a cultural environment where jeans were a wardrobe staple to wear slacks/khakis all the time is no less obnoxious than needlessly requiring someone to wear jeans every day if they grew up believing that a good suit is the hallmark of the wearer.
Uniform Dress Is More Equitable?
This is an argument I’ve heard more often recently and, while it is less distressing than the first two, it is equally flawed. This argument is predicated on the idea that sameness and equality are the same thing. Deconstructing that bit of nonsense requires a post unto itself, but suffice it to say that while the principle seems to be popular among equality activists, it’s a pretty significant fallacy.
In the workplace, such thinking leads to all kinds of bad decisions. For instance, it would suggest that everyone should work in the same physical environment and be provided the same equipment as well. That’s ridiculous. Different jobs require different environments and different tools. People who travel frequently need notebook computers; people who don’t may not. People who are on-call should have company-funded cell phones; for people who aren’t, they are optional. People who need quiet and the ability to focus on complex problems (programmers, for instance) benefit greatly from private (or at least shared/semi-private) offices; call center workers and administrative staff may not. People who interact face-to-face with customers should dress accordingly during those interactions; people who aren’t interacting face-to-face with customers shouldn’t be required to dress as if they were.
It’s easy to address the cries of “Unfair! Unfair! Why do programmers get to wear jeans and t-shirts?” Let everybody wear jeans and t-shirts if their job doesn’t involve face-to-face contact with customers. Heck, even if it does, let them dress as they like on days when they’re not meeting with customers.
Impact on Recruiting and Culture
Depending on your location, dress code may also be a factor in your recruiting efforts. If all of the software companies around you allow jeans and t-shirts, then it’s probably going to hurt recruiting if you don’t allow them.
I guess for me it all comes back to the idea that the dress code, and the corporate culture overall, should be focused on creating the environment that is most conducive to employee productivity. If you’re a software company, that means programmer productivity. If you’re a software company that doesn’t recognize that your company’s success is directly proportional to the quality of your software (and therefore, the quality and productivity of your programmers), you probably have deeper problems than your dress code.
If people are more comfortable in jeans and t-shirts, let them wear jeans and t-shirts.