Thursday, June 01, 2006

Filling table with prime numbers

First of all many thanks to Dean Swift, Carsten Pedersen, Kai Voigt and Kristian Köhntopp for providing me with this example and allowing me to blog about it.

This origins from a stored procedure exercise that a group of students did which ended up in an optimization competition. It's about a table that should be filled with prime numbers - up to a pre-defined bound - by a stored procedure.

So here's the basic solution:

mysql> DELIMITER //
mysql> CREATE DATABASE sieve //
Query OK, 1 row affected (0.00 sec)

mysql> USE sieve //
Database changed
mysql> CREATE TABLE sieve (
-> id INT PRIMARY KEY
-> ) //
Query OK, 0 rows affected (0.06 sec)

mysql> CREATE PROCEDURE sieve (max INT)
-> BEGIN
-> DECLARE l0 INT;
-> DECLARE l1 INT;
-> TRUNCATE sieve;
-> SET l0=2;
-> WHILE l0
-> INSERT INTO sieve (id) VALUES (l0);
-> SET l0=l0+1;
-> END WHILE;
-> SET l0=2;
-> WHILE l0
-> SET l1=l0*2; # delete from first multiple
-> WHILE l1
-> DELETE FROM sieve WHERE id=l1;
-> SET l1=l1+l0;
-> END WHILE;
-> SET l0=l0+1;
-> END WHILE;
-> END //
Query OK, 0 rows affected (0.00 sec)

Here's some further information, if you want to play with optimizing this stored procedure:

Our colleague, Philippe Campos, suggested removing the inner loop and replacing it with a modulo operator. (DELETE FROM sieve WHERE (id%l0)=0 AND id>l0) This increased speed. He then suggested batch inserts. This made it much faster. A student suggested batch insert of odd numbers to the memory storage engine. The former is cunning and the latter opens much scope for optimization beyond the algorithm of the stored procedure.

By what ratio can you improve the basic implementation? Do indexes help or hinder?


So what's your best solution?

Enjoy!

5 comments:

robinv said...

OK, how about this? I've not bothered with improving the population of the sieve but I've hoicked it out into a separate procedure to tidy things up.

When 'max' is 20000, it takes about 4.27 sec to the original's 11.75 sec on my machine.

-robin

DELIMITER //

CREATE PROCEDURE fill_sieve(max INT)
DETERMINISTIC
BEGIN
DECLARE counter INT DEFAULT 2;

TRUNCATE sieve;
WHILE counter < max DO
INSERT INTO sieve (id) VALUES (counter);
SET counter = counter + 1;
END WHILE;

END //

CREATE PROCEDURE sieve (max INT)
DETERMINISTIC
BEGIN
DECLARE counter INT DEFAULT 1;

CALL fill_sieve(max);

WHILE NOT ISNULL(counter) DO
SELECT MIN(id) INTO counter FROM sieve WHERE id > counter;
DELETE FROM sieve WHERE id > counter AND (id % counter) = 0;
END WHILE;

END //

DELIMITER ;

robinv said...

It also occurred to me that you only really need to loop through while the counter is < SQRT(max).

So adding that in to my previous solution:

DELIMITER //
CREATE PROCEDURE sieve3 (max INT)
DETERMINISTIC
BEGIN
DECLARE max_root INT DEFAULT CEIL(SQRT(max));
DECLARE prime INT DEFAULT 1;

CALL fill_sieve(max);

WHILE NOT ISNULL(prime) AND prime < max_root DO
SELECT MIN(id) INTO prime FROM sieve WHERE id > prime;
DELETE FROM sieve WHERE id > prime AND (id % prime) = 0;
END WHILE;

END //
DELIMITER ;

Timings:

mysql> call sieve1(20000); # original
Query OK, 0 rows affected (11.63 sec)

mysql> call sieve2(20000); # first try
Query OK, 0 rows affected (7.47 sec)

mysql> call sieve3(20000); # with sqrt
Query OK, 0 rows affected (1.62 sec)

robinv said...

Now the slow part is filling the table. That can be improved dramatically by using a doubling algorithm:

DELIMITER //
CREATE PROCEDURE fill_sieve(max INT)
DETERMINISTIC
BEGIN
DECLARE previous_max INT DEFAULT 1;

TRUNCATE sieve;

INSERT INTO sieve (id) VALUES (2);

WHILE previous_max <= max DO
INSERT INTO sieve (id) SELECT id + previous_max FROM sieve;
SET previous_max = previous_max * 2;
END WHILE;

DELETE FROM sieve WHERE id > max;

END //
DELIMITER ;

Using that method of filling brings the time to less than a second.

mysql> call sieve5(20000);
Query OK, 0 rows affected (0.63 sec)

Which can be improved further by using a memory table:

mysql> call sieve6(20000);
Query OK, 0 rows affected (0.24 sec)

-robin

tp diffenbach said...

Why not just use SQL to do this:


First, create a table that is just a range of numbers:

create table artificial_range (id int not null primary key auto_increment, idz int not null default 0);

insert into artificial_range(idz) values (0);

insert into artificial_range (idz) select idz from artificial_range;

Repeat the last statement, doubling the table size each time, until it reaches the length you want.

Now, assuming that your RBDMS returns a float type for integral division, just do this:

create view artifical_range_prime as
select * from artificial_range a
where not exists ( select * from artificial_range b where b.id != 1 and b.id < a.id and a.id/ b.id = floor(a.id / b.id));

Note that the predicate 'and b.id < a.id' is just to optimize the process a bit.

If your RBDMS does not return a float type for integral division, just cast one of the operand (a.id or b.id) to float prior to the divisions.

tp diffenbach said...

Oh, my bad, I forgot 1 is not prime.

We could also filter out multiples of two, but that's left as an exercise for the reader. ;)

This version will also factor out the division, possible casting, and
floor function, replacing all that with the modulus operation (if your RDBMS does not support operator %, it probably supports function mod():


select * from artificial_range a
where a.id != 1 and not exists ( select * from artificial_range b where b.id != 1 and b.id < a.id and a.id % b.id = 0 );