I finished reading The Curious Incident of the Dog in the Night-time less than twenty-four hours ago, so it was a pleasant surprise when, this morning, I was presented with my very own puzzle-mystery to solve. Better still for one who has just read “The Curious Incident”, the players in this puzzle-mystery are not people, but computer programs!
We have a production process,
logsrv, that takes log records from IBM MQ, interprets each one and then inserts it into a database table. A second process,
publisher, provides a SOAP web service that reads the database when requested by clients.
+------+ +--------+ +-----------+ +-----------+ | MQ | --> | logsrv | --> | SQLServer | --> | publisher | +------+ +--------+ +-----------+ +-----------+ | v +--------+ | client | +--------+
This is all written in Java and deployed to Windows.
publisher run as Windows services (using JavaService), and the database is Microsoft SQLServer. While it is a little complex – and there are good reasons for that, trust me – this system is rock-solid stable. The
publisher process, for instance, has a record up-time of 89 days, with no sign of memory leaks or performance. (I think it could run for a year or longer, but unfortunately had to take it down recently, to put in a one-line code change. Oh well.)
All was going fine, until this morning, when I got The Email That Changed The Course Of My Day. Log messages shown at the client were timestamped for ten hours later than they should have been.
If you’re a Java developer, and you live on the east coast of Australia, then that ten hours ought to make you suspicious. Ten hours is the time difference between Australian Eastern Standard Time and UTC.
So the Simon The Sys Admin and I had a look in the database and found A Clue. The database records both
WHEN_OCCURRED is the timestamp from the record that taken from MQ, while
WHEN_RECORDED is the time that the record was received at the
logsrv process. According to the database, the logs were being
RECORDED ten hours before they
We checked the Windows system time and timezone, and found they were set correctly. A quick check of the log4j debug output confirmed that the
logsrv process was ten hours behind normal time.
So now we had two problems. First,
logsrv had the wrong timezone, and second, the web service was publishing the
WHEN_RECORDED timestamp instead of the
In accordance with the ancient and mystic precepts of Issue Resolution, I asked the Simon The Sys Admin to stop and restart
logsrv. This did not correct the issue. The clients still showed the wrong timestamp. Again in accordance with the precepts, the sysadmin elected to Reboot The Server. This time, when the server came up, the clients began to show correct timestamps once more.
Feeling a little better about it all, I wrote an email to the IT manager, reporting that all was now well, and went hunting through the
publisher code for this showing-the-wrong-timestamp bug.
Funny. I couldn’t find a bug in the
publisher code. The
publisher code correctly ignored the
WHEN_RECORDED value, only using the
There was only place where the two values could have been confused: in Hibernate. Now, this wasn’t quite as silly as it sounds. The timestamps had been stored using a Hibernate UserType, something which is not as common as straight data types. We were also using an older version (2.1.6), and hadn’t upgraded. I downloaded 2.1.8 and read through the release notes, but there was nothing appropriate there.
I decided it was time for a cuppa. While waiting for the jug to boil, I realised that I was grasping at straws, and decided to put this particular theory away and see if I could think of something a little more likely.
Just then, a good thing happenned. Simon The Sys Admin emailed me the log4j debug output for the whole day so far, plus some of yesterday. Looking at the
logsrv output, I could see that, after restarting the service, it had the wrong idea of system time. However, after the reboot,
logsrv suddenly had the right time. Curious. How the service was started should have no effect on the timezone.
I happenned to have a copy of the SQLServer database on hand – one that I’d made only yesterday. A quick couple of queries showed that
logsrv only started to put the wrong
WHEN_RECORDED timestamp starting yesterday at – doh! – approximately the time I’d manually restarted the
logsrv service, breaking the 89 day uptime record.
logsrv was working fine, apart from putting in the wrong
Then it occurred to me that the
publisher likely had the same problem. Sure enough, I had had to restart it yesterday too, and after the restart it had the wrong timezone.
Fortunately, these applications dump the contents of
System.properties() to log4j on startup. I could see right there on every start up the value of the
user.timezone property. It was set to
GMT whenever the service was restarted manually, and
Australia/Sydney after the reboot.
GMT is what Java defaults to when it can’t determine the timezone, that explained the ten hour difference.
The simple fix is to configure the
publisher services to pass
-Duser.timezone=Australia/Sydney to java.exe on startup. Easy. I can do that next week, since I’ll be deploying a new release then anyway.
Production issue resolved, I went back to my development task.
However, something about this just didn’t make sense. Sure, it works fine when all the processes have the correct timezone, but still I wondered what exactly was causing the client to show the wrong times when the timezone is wrong.
logsrv was inserting the correct timestamp into the
WHEN_OCCURRED column of the database table.
publisher was not supposed to be modifying any dates or times at all.
I nutted it out while I was at the shops. This is what happens when the timezone in the
publisher is wrongly set to
logsrvneeds to know the timezone that applies to any given point in time. Since there is no explicit timezone, it interprets the “occurred time” as a GMT time.
logsrvthen writes the occurred time to the SQLServer database.
WHEN_OCCURREDends up with exactly the same time as was in the MQ record. So far so good.
publisherthen reads the data from the
WHEN_OCCURREDcolumn, assuming it to be a GMT time.
publisherthen constructs a SOAP message to send to the client. It marshals the “occurred time” as an XMLSchema dateTime. XMLSchema dateTimes include the timezone.
Now I’m much happier, because I know that if I have the correct timezone on
publisher and client, this problem is derinitely solved. I also have a better understanding of how this part of the application will behave if the timezones ever get of sync in future.
1 Observant readers will note that I was reasoning backwards.
WHEN_RECORDED was certainly off by ten hours, but it was ten hours early, whereas the reported symptom was that times where ten hours late. I’m kind of glad I didn’t notice this, though, as then I would have been really stumped, and may not have persevered past rebooting the PC.