Monday, January 23, 2006

Calculating time spans and ... a bug?

It's quite a while (exactly, one week - or seven days) ago since I wrote my last article and I was wondering if that was the longest period that I didn't write anything here on my blog. And whooops - this gave me right the idea what I could write about.

I store all my blog articles together with the date in a local MySQL database. Even though the service at blogger.com works fine, it always feels safer if there is a local copy of everything that I write, for the case if ...

So I extracted the ID and the dates, on which I wrote articles into a separate table which looks like this:
mysql> SELECT id, dt
-> FROM time_periods
-> LIMIT 10;
+----+------------+
| id | dt |
+----+------------+
| 1 | 2005-09-26 |
| 2 | 2005-09-27 |
| 3 | 2005-09-27 |
| 4 | 2005-09-27 |
| 5 | 2005-09-30 |
| 6 | 2005-09-30 |
| 7 | 2005-10-02 |
| 8 | 2005-10-03 |
| 9 | 2005-10-08 |
| 10 | 2005-10-12 |
+----+------------+
10 rows in set (0.00 sec)

I have put a primary key on the id column and a non unique index on the dt column, as we have to reference this column to itself and if there would be thousands or millions of records, it would make the query execution a lot faster.

The query to calculate the days between each date and the day when the last article was written, is not that difficult:
mysql> SELECT a.dt, max(b.dt) as max_b,
-> datediff(a.dt, max(b.dt)) as diff
-> FROM time_periods a INNER JOIN time_periods b
-> ON b.dt <= a.dt AND a.id != b.id
-> GROUP BY a.dt
-> HAVING max_b IS NOT NULL
-> LIMIT 10;
+------------+------------+------+
| dt | max_b | diff |
+------------+------------+------+
| 2005-09-27 | 2005-09-27 | 0 |
| 2005-09-30 | 2005-09-30 | 0 |
| 2005-10-02 | 2005-09-30 | 2 |
| 2005-10-03 | 2005-10-02 | 1 |
| 2005-10-08 | 2005-10-03 | 5 |
| 2005-10-12 | 2005-10-12 | 0 |
| 2005-10-14 | 2005-10-12 | 2 |
| 2005-10-20 | 2005-10-14 | 6 |
| 2005-10-24 | 2005-10-24 | 0 |
| 2005-10-25 | 2005-10-24 | 1 |
+------------+------------+------+
10 rows in set (0.01 sec)

The date column references to itself, based on the condition that the second date has to be less than or equal to the source date and the ID has to be different. To make sure that the first date isn't included (where there is no reference to an earlier written article), I have added that max_b should not be NULL.

However, I've found an interesting thing. If I change the query to
mysql> SELECT a.dt, max(b.dt) as max_b,
-> a.dt - max(b.dt) as diff
-> FROM time_periods a INNER JOIN time_periods b
-> ON b.dt <= a.dt AND a.id != b.id
-> GROUP BY a.dt
-> HAVING max_b IS NOT NULL
-> LIMIT 10;

(I have eliminated the datediff function to directly subtract the second date from the original date), I get this result:
+------------+------------+------+
| dt | max_b | diff |
+------------+------------+------+
| 2005-09-27 | 2005-09-27 | 0 |
| 2005-09-30 | 2005-09-30 | 0 |
| 2005-10-02 | 2005-09-30 | 72 |
| 2005-10-03 | 2005-10-02 | 1 |
| 2005-10-08 | 2005-10-03 | 5 |
| 2005-10-12 | 2005-10-12 | 0 |
| 2005-10-14 | 2005-10-12 | 2 |
| 2005-10-20 | 2005-10-14 | 6 |
| 2005-10-24 | 2005-10-24 | 0 |
| 2005-10-25 | 2005-10-24 | 1 |
+------------+------------+------+
10 rows in set (0.02 sec)

Note the value of 72 in the third row.

Is it a bug? I guess not, because MySQL seems to calculate 20,051,002 - 20,050,930 which results in 72. However, it's something that somebody might not expect so I would at least call it a "Gotcha". But note that it's essential to use datediff here instead of simply subtracting the values, because the results are not the same!

Last but not least - how do I get the maximum number of days that I didn't write any article? Look here:
mysql> SELECT max(diff) FROM
-> (SELECT a.dt, max(b.dt) as max_b,
-> datediff(a.dt, max(b.dt)) as diff
-> FROM time_periods a INNER JOIN time_periods b
-> ON b.dt <= a.dt AND a.id != b.id
-> GROUP BY a.dt
-> HAVING max_b IS NOT NULL) AS tmp;
+-----------+
| max(diff) |
+-----------+
| 7 |
+-----------+
1 row in set (0.01 sec)

I have put the original query into a derived table and asked for the maximum day of "differencies". Very straightforward and easy to see that seven days is really the maximum number of days that I didn't write anything ;-).

No comments: