Thursday, November 10, 2005

Calculate uniqueness of text fields for setting indexes

Indexing long text fields can cost more than it helps. Indexes have to be maintained by the server, so an index over a full 255-character varchar field may create more work for the server than the benefit the index brings.

Fortunately, it's easy to limit the index length to a specified number of characters. We can do this with, for example, a KEY(field(15)) clause when adding the index in a CREATE TABLE, ALTER TABLE or CREATE INDEX statement. In that case, only the first 15 characters would be indexed.
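
For instance, on a hypothetical articles table with a title column (all the names here are just placeholders), a 15-character prefix index could be declared like this:

-- prefix index declared directly in CREATE TABLE
CREATE TABLE articles (
    id INT UNSIGNED NOT NULL PRIMARY KEY,
    title VARCHAR(255) NOT NULL,
    KEY idx_title (title(15))
);

-- or added to an existing table (either statement works):
-- CREATE INDEX idx_title ON articles (title(15));
-- ALTER TABLE articles ADD KEY idx_title (title(15));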

The question is: how long should the indexed prefix be? As a rule of thumb, we should choose a length that's just long enough to make most of the indexed contents unique. We probably won't reach 100 % uniqueness, but that's not necessary either. Indexing only the first one or two characters will probably be too little, because they will be the same for many records. So the next question is: what's a good value?

To make the decision a little easier, we can calculate the uniqueness with the following formula:
SELECT count(distinct left(field_name, indexed_length)) / 
count(*) * 100 FROM table_name;
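
For example, to check how unique the first 10 characters of a column are, the query could look like this (customers and email are just made-up names for illustration):

SELECT count(distinct left(email, 10)) /
count(*) * 100 AS uniqueness_pct FROM customers;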

The higher the value, the more unique the indexed contents are.

So I thought to myself: why not put this into a stored procedure that takes the table name, the column name and two boundaries, and shows the percentage of uniqueness for every prefix length between them? Here's how I did it: I create a temporary table inside the procedure, prepare an insert into ... select statement against it and run it in a while loop from the lower bound to the upper bound. Here it is:
DELIMITER $$

DROP PROCEDURE IF EXISTS getUniqueness $$
CREATE PROCEDURE getUniqueness (IN _table VARCHAR(255),
IN _column VARCHAR(255), IN _noFrom INT, IN _noTo INT)
BEGIN

-- one row per tested prefix length
drop temporary table if exists tt_uniqueness;
create temporary table tt_uniqueness
(noChar int unsigned not null, uniqueness decimal(10,2) not null);

-- the table and column names are only known at runtime,
-- so the statement has to be built dynamically
set @sql := concat('insert into tt_uniqueness ',
'select ?, cast(count(distinct left(',
_column,
', ?)) / count(*) * 100 as decimal(10,2)) from ',
_table);

prepare pSql from @sql;

set @count := _noFrom;

-- run the prepared statement once for every prefix length
while @count <= _noTo do

execute pSql using @count, @count;

set @count := @count + 1;

end while;

deallocate prepare pSql;

select * from tt_uniqueness;

drop temporary table if exists tt_uniqueness;

END$$

DELIMITER ;


Finally, let's take a look at how it works. I'll try it on the phpbb_posts_text table from a phpBB database, which holds 340 entries:

mysql> explain phpbb_posts_text;
+--------------+-----------------------+------+-----+---------+-------+
| Field        | Type                  | Null | Key | Default | Extra |
+--------------+-----------------------+------+-----+---------+-------+
| post_id      | mediumint(8) unsigned | NO   | PRI | 0       |       |
| bbcode_uid   | varchar(10)           | NO   |     |         |       |
| post_subject | varchar(60)           | YES  |     | NULL    |       |
| post_text    | text                  | YES  |     | NULL    |       |
+--------------+-----------------------+------+-----+---------+-------+
4 rows in set (0.00 sec)

mysql> select count(*) from phpbb_posts_text;
+----------+
| count(*) |
+----------+
|      340 |
+----------+
1 row in set (0.00 sec)

mysql> call getUniqueness('phpbb_posts_text', 'post_text', 1, 10);
+--------+------------+
| noChar | uniqueness |
+--------+------------+
|      1 |       7.94 |
|      2 |      27.65 |
|      3 |      41.47 |
|      4 |      51.18 |
|      5 |      59.41 |
|      6 |      67.94 |
|      7 |      71.47 |
|      8 |      78.82 |
|      9 |      84.12 |
|     10 |      85.00 |
+--------+------------+
10 rows in set (0.03 sec)

Query OK, 0 rows affected (0.03 sec)

mysql>

So we can see how the uniqueness rises as we choose longer index lengths. If we decide, for example, that we want a uniqueness of 80 % for the index, 9 characters would be a fine choice for the index length.
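
To put that into practice, adding the prefix index could look like the following statement (idx_post_text is just a name I made up; since post_text is a TEXT column, MySQL requires a prefix length here anyway):

ALTER TABLE phpbb_posts_text ADD KEY idx_post_text (post_text(9));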
