Monday, July 9, 2012

Two things that piss me off.

1.  Our last release was complicated, chaotic, and poorly estimated.  Nevertheless, it made it into production without a single significant defect.  Rather than attributing this to cross-functional teamwork, thoughtful implementation by development, or (my favorite, of course) well-executed risk-based testing (with much help from other areas), I was told that the successful release was due to "luck."  (By someone who should know better.)

2.  Story problem: 

5 days in a sprint.
5 hours of allocation to sprint work per day.
One day of the week needs to be subtracted from allocation due to an offsite meeting.
One QA resource is assigned to two teams to cover vacationing analysts.
Said individual committed to 28 hours for one team.

Now I'm not good at math, but...
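
Let's sketch it out anyway (a quick back-of-the-envelope, generously assuming every remaining hour could go to a single team):

```python
# Sprint capacity arithmetic for the story problem above.
days_in_sprint = 5
hours_per_day = 5          # allocation to sprint work per day
offsite_days = 1           # one day lost to the offsite meeting

available_hours = (days_in_sprint - offsite_days) * hours_per_day
print(available_hours)     # 20 hours total, before splitting across two teams

per_team_hours = available_hours / 2
print(per_team_hours)      # 10 hours if the split between the teams is even

committed_hours = 28       # what was actually committed to one team
print(committed_hours - available_hours)  # 8 hours over, even with the entire allocation
```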

And why didn't anybody call him out on it during planning?

Thursday, July 5, 2012

The problem with root cause analysis.

I started riding Metra regularly into downtown Chicago about 22 years ago, back in the days of smoking cars, bar cars, and dozens of Chicago Tribunes blossoming from the seats.  The seats were orange and smelled of tobacco and old ketchup, and the conductors (usually) announced the stops manually as we approached each one.

The changes that have taken place on Metra since then have been both good (no smoking, newer cars) and bad (getting rid of the bar car, eye strain as we all stare at our iPhones for 70 minutes each way), but overall Metra has successfully, if slowly, rumbled into the 21st century.  Although it was barely 2 or 3 years ago that we could first buy tickets by credit card (let alone online), the trains now use GPS to call out locations based on our route and position.  There have been a few mistakes; I've heard skipped announcements and the wrong station name called out, but the conductors usually correct those immediately.

When the trains are delayed, which is a somewhat rare occurrence but not unheard of, a similar system will make a general automated announcement to the riders.  "This train is delayed (X) minutes due to (Y)."  Y usually is something like "track construction," "freight train interference," or "waiting on other trains."

On a recent morning, one of the trains on my line hit a pedestrian riding his bike.  Although this is sadly not an uncommon situation, the overall impact to commuters tends to be low and limited to the train involved in the incident.  That day, however, the entire line seemed to be in a cluster; trains were late inbound and outbound, many delayed by nearly an hour, and some runs were cancelled altogether.  After waiting on the platform for nearly an hour, I boarded the first train that arrived.  I board at nearly the opposite end of the line from Chicago and am usually one of the first people to board inbound trains, but I could hardly find a seat on this one.  As we headed eastward the train quickly filled to standing-room-only capacity; people packed the aisles, vestibules, and stairways.

As we neared Union Station, an automated voice announced:

"This train is running...50...minutes late...due to...passenger loading."

The crowded train erupted in laughter.

I wondered if Metra does root cause analysis of train delays, like airlines do, and if that particular train's delay was officially chalked up to a sudden and unexpected onslaught of passengers rather than the tragic incident that took place hours before.  There was no reason to cover up the initial incident as Metra was not at fault, but the reported reason hardly told the whole story.

And here comes the segue.

I see this all the time in QA when we're doing root cause analysis of production problems.  The last 5 production hotfixes I have investigated were chalked up to:

1.  Miswritten requirement
2.  Hardware issue on new server
3.  Known problem that we opted to hotfix after production release
4.  Marketing request
5.  External issue not seen in QA

QA managers like to do root cause analysis because the hope is that we can deflect some of the "why didn't you catch this" interrogation.  (To be fair, my current culture doesn't work that way but I do have some scars from old employers.)  But how "root" are my root causes?  Let's review...

1.  Miswritten requirement - ok, fair enough: we implemented the requirement as stated, and as stated it was wrong.  But how could we have corrected this?  Would a few minutes of extra conversation have spurred the product owner to correct this information?  Was the product owner herself misinformed on what the customer needed?  Does the tool we use allow for sufficient input of the *right* information?

2.  Hardware issue on new server.  Again, QA was not directly responsible for this, but it affected production nonetheless.  Should QA have been aware of this new server and involved in testing before launching it?  Do we need to make ourselves more available to do extra checks of this sort?  Have we pigeonholed ourselves as software testers when we should be supporting our TechOps teams also? What skills do we need to build to be able to support hardware testing?

3.  Known problem that we opted to hotfix later - what in our process is so flawed that we would rather put out code we know is bad than try to correct it and risk creating a bigger problem?  Would putting out smaller, more focused releases have kept the risk low enough that we could avoid shipping code we know is bad?

4.  Marketing request - similar to the above.  Could more regular, more focused releases support marketing initiatives so that we would not have to put the label "hotfix" on what was actually a desire to get a potentially money-making feature into production?  How can QA support getting these types of releases to production with minimal risk?

5.  External issue not seen in QA.  We love these, don't we?  Don't we feel a little self-satisfaction when we can throw these back?  But this is where we can potentially do the most good.  No system is an island, and with automated unit testing becoming de rigueur, the moving parts are the parts we need to be focusing on, not the actual lines of code.  If we're not catching this stuff in QA, why?  Any difference between QA and prod, whether intentional or not, needs to be accounted for in testing, and some risk assessment should be taking place.  Do we need to improve or alter our performance tests to account for some production scenario?  Are there configuration differences that we need to take into account or that we need to try to simulate?  Is there a third party we need to work more closely with in our test environments?  (A small sketch of one way to start checking for configuration drift between environments follows.)
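
Nothing fancy is required to start answering the configuration question.  Here is a minimal sketch that diffs two flat key=value config files; the file names (qa.properties, prod.properties) and the format are assumptions for illustration, not a description of any particular system.

```python
# A minimal sketch: surface drift between QA and production configuration.
# Assumes simple key=value files; adapt the parser to whatever format you actually use.

def load_config(path):
    """Parse a flat key=value file into a dict, ignoring blanks and comments."""
    config = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            config[key.strip()] = value.strip()
    return config

def diff_configs(qa, prod):
    """Report keys that differ between environments, or exist in only one."""
    for key in sorted(set(qa) | set(prod)):
        if qa.get(key) != prod.get(key):
            print(f"{key}: QA={qa.get(key, '<missing>')}  PROD={prod.get(key, '<missing>')}")

if __name__ == "__main__":
    diff_configs(load_config("qa.properties"), load_config("prod.properties"))
```

Even a crude report like this turns "are there configuration differences?" from a rhetorical question into a checklist item you can review before every release.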

Creating and maintaining a quality system, a quality user experience, means creating a quality process.  We need to do more than be able to explain how production issues aren't our fault (even when they aren't).  Production issues are what happens when good people meet bad processes, and it is our responsibility to apply what we know about the entire system, from an organizational level on down, to eliminate those issues.