First thing this morning I got an email with ‘high’ priority. The subject was ‘The Learning Center seems to be down’. We’ve been having a lot of problems with the product that underlies our ‘Learning Center’ application lately (we’re currently making a major switch) so I thought it was one of the run-of-the-mill problems we knew exactly how to fix. How wrong I was. Before I explain the story, I should give you some background info…
Cue dreamy effect and go back in time…
About 3 years ago, we bought a product as the basis for our learning management system. The learning system coordinator didn’t want the users accessing it directly (having to learn a whole new UI) so we wrote a few custom portlets to browse the catalog and track your progress using the APIs of the product. The APIs seemed kinda clunky, but they worked.
Fast forward back to this morning
After trying all of our known ‘fixes’ to take care of the problem without success I had to dig. I found some funniness in the logs. Our app was getting a java.rmi.RemoteException from the web service when trying to execute the API calls. Basically the error message was zero help. We had made no changes to the code. We had not updated the product in any way. So why is this happening? We determined is was definitely an API issue only – we could use the out of the box client and do all functions just fine. We called in a support ticket to the vendor. Luckily, they could tell us that the exception we were seeing would only be thrown in a scenario where we were providing either a null value or a zero length string as a parameter to the API method call. I knew that we hadn’t changed any of the code or the product for a looooong time so I was confident that we would prove to the support tech that she was wrong and we were sending in a value so I added some logging statements to the app and redeployed it.
Shock and dismay
Immediately, the logs proved that we were sending in a null value.
“What the #$%!! This has worked for 6 months with no changes. The only thing we did was change the logging level 2 days ago from DEBUG to ERROR and that should have no affect on this kind of thing. Its worked since then… right?”
Never trust other people 🙂
Just to be sure, I checked with the originator of the error (the guy who gets the calls when something is wrong or a change is needed – and a good all around guy):
Me: “Hey… this has been working OK since Monday, right?”.
Him: “Yeah. It was fine yesterday”.
…another 15 minutes pass trying to figure out how it could have ever worked since we’re passing a null value as a parameter.
Finally I came out of my funk and decided I just had to fix it with what I know now rather than trying to figure out what changed.
The realization – Not crazy, just stupid
As I looked into what the problem was, I found that a value for an attribute that we get from calling a utility library was null rather than a value pertaining to the logged in user. “How could this be null? Its been working forever!”. I seriously thought logic was not prevailing, that something had gone awry in not one, but three servers and was conspiring to make me insane. After another 5 minutes, I found that there was a VERY KEY line of code inside of an if block where the condition was ‘logger.isDebugEnabled’. So changing the log level from DEBUG to ERROR caused a major issue in the program since it skipped that key line of code.
Immediately I tried to understand why it took me so long to figure out the problem. I reckoned if I’d have known right after the change that something was wrong I could have easily tracked it to the logging issue (and still kicked myself in the arse for such a silly bug). But since I heard from someone that it just broke ‘this morning’ I went on a wild goose chase.
This bug, had ‘slept’ for nearly 3 years without notice since the logging level was left at DEBUG on the QA and production servers.
– Test with your production logging level
– Systems that aren’t mission critical can have an issue for a good long time before someone actually complains about it
– Dont trust what anyone else says when it comes to observation of a software problem (This is why when you call support for any product you start at step 1, even though you are a seasoned computer person and dont need to get instructions on what alt-ctrl-delete means – apparently they’ve already learned the lesson I learned AGAIN today)
– Accept that some bugs are just plain silly and no matter how experienced you are, you can create them.