Diagnosing Software Defects from Screenshots

Gary Sieling posted this on support-costs

When I was in school for computer science, I never appreciated the nuances of error handling: sure, you have to catch exceptions to close database connections, but what else? Assertions are a common tool for checking preconditions; you may write them feeling secure that they will never be violated, but this is not always the case.

 

A common problem is diagnosing an error using only a screenshot. In the simplest case, where you worked on the code and the error message is clear (e.g. null pointer exception at line 10), you can figure it out just by looking. The more erratic the error, the more obfuscated the code, the more complex the code path, or the more ambiguous the line number, the harder it is to narrow down the cause. This translates into increased support costs, increased QA costs, and quality defects that are missed or go unresolved for a long time. If the software is installed in a client environment, the problem is compounded by the need to coordinate between groups, which adds communication costs.

 

In an ideal world, only a screenshot of an error message would be needed to identify a problem. Armed with this knowledge, you could trivially reproduce and fix each error reported by a user. We know, however, that this isn’t the case, and end-user support is often quite challenging. There are clear, simple features of a good software system that can ease this, such as never using the same error message in two places. This works well as long as it doesn’t cause training issues for end users, and the difficulty is compounded by a frequent desire to avoid security breaches, which limits how much detail a message can reveal. One workaround is to provide an option for a user to email the error to a support queue, where the email contains detailed logging. I worked on a product that had an "email the support team" button, which allowed people to report problems or to say that they didn't think the screen was showing the right data. It worked pretty well because the email included some information about what the user was doing. It also allowed me to fill the email with details on behalf of the user (e.g. their browser type). This solution has the added benefit of making the user feel like they are working towards a solution to their problem.
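
To make the "email the support team" idea concrete, here is a minimal Python sketch of how such an email might be assembled. It is not the code from the product I mentioned; the function name `build_support_email`, the address `support@example.com`, and the context keys are all assumptions made up for illustration. It only builds the message; sending it (for instance with `smtplib`) is left out.

```python
from datetime import datetime, timezone
from email.message import EmailMessage


def build_support_email(user_note, context, recent_log_lines,
                        to_addr="support@example.com"):
    """Build (but do not send) a support email carrying diagnostic context.

    `context` is a hypothetical dict of facts the application already knows,
    such as the current screen and the browser type.
    """
    msg = EmailMessage()
    msg["To"] = to_addr
    msg["Subject"] = f"[support] {context.get('screen', 'unknown screen')}"

    lines = ["User note:", user_note or "(none given)", "",
             "Context captured by the application:"]
    for key in sorted(context):
        lines.append(f"  {key}: {context[key]}")
    lines += ["", "Recent log lines:"]
    # Only the tail of the log is attached, to keep the email readable.
    lines.extend(f"  {line}" for line in recent_log_lines[-20:])
    lines += ["", f"Generated at {datetime.now(timezone.utc).isoformat()}"]

    msg.set_content("\n".join(lines))
    return msg


if __name__ == "__main__":
    email = build_support_email(
        "The totals look wrong",
        {"screen": "Invoice list", "browser": "Firefox 115"},
        ["GET /invoices 200", "GET /invoices/totals 500"],
    )
    print(email)
```

The point of the design is that the user only types the note; everything else is filled in by the application, so support staff never have to ask for it.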

 

Code with hundreds or thousands of assertions will commonly throw errors for many reasons. In a web application, a failed assertion might indicate a security breach; in an enterprise environment, it may mean that support personnel have attempted to do data corrections and failed. Corporate firewalls and proxies can be configured to modify HTTP headers, to remove cross-site request headers, or to add caching. While the ideal answer is that these configurations should be fixed, the problems they cause must still be detected, and in corporate environments this is often easier said than done.

 

When faced with this type of situation, you can begin with simple strategies:

  • Arm yourself with as many tools for detecting problems as possible.

  • At the most basic level, provide read-only access to production databases; this can get you a long way towards resolving support questions.

  • Maintain logs of all table modifications (inserts/updates/deletes) to see who did what and when.

  • Give each error message in the system a unique number, like in the old Windows help days; this lets you pinpoint the exact source of known errors in the absence of stack traces. This can be very valuable in specific situations, such as when you’re on the phone with a user and would otherwise have to file a ticket with a third-party group to get access to server logs.
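
As an illustration of the last point, unique error numbers can be enforced with a small catalog that refuses duplicates. This is only a sketch under my own assumptions: the names `define_error`, `AppError`, and the codes `E1001` and `E1002` are invented for the example and do not come from any particular system.

```python
ERROR_CATALOG = {}  # code -> user-facing message


def define_error(code, message):
    """Register an error code, refusing to reuse one that already exists."""
    if code in ERROR_CATALOG:
        raise ValueError(f"error code {code} is already defined")
    ERROR_CATALOG[code] = message
    return code


# Hypothetical codes; each one is raised from exactly one place in the code.
E_DOC_NOT_FOUND = define_error("E1001", "The document could not be found.")
E_SAVE_FAILED = define_error("E1002", "Your changes could not be saved.")


class AppError(Exception):
    """An error whose displayed text always includes its unique code."""

    def __init__(self, code, **details):
        self.code = code
        self.details = details  # kept for server-side logs, not shown to users
        super().__init__(self.user_text())

    def user_text(self):
        return f"{ERROR_CATALOG[self.code]} (error {self.code})"


if __name__ == "__main__":
    try:
        raise AppError(E_DOC_NOT_FOUND, doc_id=7)
    except AppError as err:
        print(err)
```

Because the code appears in the text shown to the user, it also appears in any screenshot, and a search for that code in the source leads to the place that raised it.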

 

In a complex system with many subsystems (e.g. multiple databases, an LDAP data source, SharePoint, Solr, etc.), you will need to use more complex strategies to reach a solution:

  • Ensure that each subsystem has self-contained logging, with clear information about what’s happening. For instance, with database queries, you can add a comment such as “-- query to get documents” at the end of each query. If these comments are unique and clear, they can be a massive help in correlating actions and tracking things over time, especially if you work with a dedicated group of people who do database work only. This gives them a name for the query they’re looking at, as these groups typically have deeper database understanding but only nominal knowledge of the application's behavior.

  • Log as much as possible when an assertion fails. For instance, if possible, log the HTTP request that was sent to the server. The reason is that assertions are typically used to represent invariants; assuming your code is reasonably correct, you will still see these errors, but they will occur in unexpected scenarios (e.g. caching headers added without your permission, implicit but non-enforced database constraints that have been violated, network connectivity issues, browser defects that corrupt requests, etc.).

  • Trace APIs. This can be a huge help, even in the presence of obfuscated libraries, as even knowing the depth of a call stack can help a developer guess where to look.
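
The three strategies above can be sketched in a few lines of Python. Everything here is hypothetical and meant only as an illustration: `tag_query`, `check`, `traced`, and the logger name `app.diagnostics` are names I made up, and the request is represented by plain keyword arguments rather than any real framework's request object.

```python
import functools
import json
import logging

log = logging.getLogger("app.diagnostics")


def tag_query(sql, name):
    """Append a unique, human-readable comment that names the query."""
    if "\n" in name or "\r" in name:
        raise ValueError("query name must fit on one line")
    return f"{sql.rstrip()} -- {name}"


class InvariantViolation(AssertionError):
    """Raised by check() when an invariant does not hold."""


def check(condition, name, **context):
    """Like assert, but logs the supplied context before failing."""
    if condition:
        return
    record = {"invariant": name, "context": context}
    log.error("invariant failed: %s",
              json.dumps(record, default=str, sort_keys=True))
    raise InvariantViolation(name)


def traced(func):
    """Log entry to and exit from func, so logs show the path taken."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log.debug("enter %s", func.__qualname__)
        try:
            return func(*args, **kwargs)
        finally:
            log.debug("exit %s", func.__qualname__)
    return wrapper


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    print(tag_query("SELECT id FROM documents", "query to get documents"))

    @traced
    def load_document(headers):
        check("X-Requested-With" in headers, "request-has-xhr-header",
              method="GET", path="/documents/7", headers=headers)

    try:
        load_document({"Cache-Control": "max-age=3600"})
    except InvariantViolation as exc:
        print("failed:", exc)
```

One reason to use a function like `check` rather than a bare `assert` is that Python removes `assert` statements when run with the `-O` flag, and a bare `assert` has no natural place to record the request that triggered it.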

 

One of the largest challenges is JavaScript errors that are browser- or environment-specific, especially those which occur erratically. Timeouts are a special case of this: if you have a library or application which notifies a user when a timeout occurs, it is important to tell them which request timed out. In a rich application that works primarily through DOM manipulation, a user has typically moved on mentally by the time a timeout occurs, so the timeouts give them the impression that they are experiencing random error messages. Timeouts should be logged to the server as well, as this can help explain how a user got into an inconsistent client-side state. There are JavaScript libraries which attempt to log JavaScript errors to the server. Their big weakness is an inability to save stack traces and some state, which is possibly somewhat rectifiable if the code is minified and there are only a few variables; these issues should decrease with time as browsers mature and add new features.
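
On the server side, receiving these client reports can be quite simple. The Python sketch below assumes the browser posts a small JSON document describing the error; the field names (`kind`, `request`, and so on), the function `record_client_error`, and the length cap are all assumptions of mine, not the format of any particular JavaScript library.

```python
import json
import logging

client_log = logging.getLogger("app.client_errors")

MAX_FIELD_LENGTH = 2000  # arbitrary cap so one report cannot flood the log
ALLOWED_FIELDS = ("kind", "message", "url", "request", "user_agent", "stack")


def record_client_error(raw_body):
    """Parse a client error report and log it.

    Returns the cleaned report, or None if the body was not usable.
    """
    try:
        payload = json.loads(raw_body)
    except ValueError:
        client_log.warning("unparseable client error report: %r",
                           raw_body[:200])
        return None
    if not isinstance(payload, dict):
        client_log.warning("client error report was not a JSON object")
        return None

    report = {}
    for field in ALLOWED_FIELDS:
        value = payload.get(field)
        if value is not None:
            # Unknown fields are dropped; known ones are truncated.
            report[field] = str(value)[:MAX_FIELD_LENGTH]
    client_log.error("client error: %s", json.dumps(report, sort_keys=True))
    return report


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    record_client_error(
        '{"kind": "timeout", "request": "GET /api/documents?page=3"}'
    )
```

Including the `request` field for timeouts means the server log records which request timed out, which is the same detail the user should be shown.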

 

There is a wide variety of error messages you may encounter when dealing with enterprise software in a client environment. But if you ensure that the software is constructed well, and if you can convince the client to allow you to build in strategies for troubleshooting (such as logging and monitoring), you can save a lot of money and frustration for you and your client, and everyone will be happier in the long run.
