8/20/08

Query Madness (Or why I rarely post ;)

So while working on a Rails project yesterday I ran into an instance where I needed a quite complex query done and the result I was getting from the code was extremely slow. I figured it was time to optimize it via sql in order to save on processing time. My goal was to pair down two thousand queries down to two or three at the most and cut ten lines of code down to one or two. Easy right?

Wait until you see the setup.

So this is for a tagging system. I have three tables, let's call them items, tags, and items_tags. Here is the basic layout:


items: {id} [1,2,3,4]
tags: {id, name} [[1,"ring"],[2,"diamond"],[3,"pearl"],[4,"necklace"]]
items_tags: {item_id, tag_id} [[1,1],[1,2],[2,1],[2,2],[3,4],[3,2],[4,1],[4,2]]


So you can see that items 1 and 2 are diamond rings, whilst item 3 is a diamond necklace, and item 4 is a pearl ring. But how to build it into the sql logic ... Also bear in mind that we have to return whole rows since Rails will expect this via ActiveRecord.

For those who are actually creating these tables to follow along here is the proper sql to get you started:

CREATE DATABASE urban_test_cases;
USE urban_test_cases;

CREATE TABLE items (
id INT(11) NOT NULL AUTO_INCREMENT,
PRIMARY KEY(id)
);

CREATE TABLE tags (
id INT(11) NOT NULL AUTO_INCREMENT,
name VARCHAR(255) NOT NULL,
PRIMARY KEY(id)
);

CREATE TABLE items_tags (
item_id INT(11) NOT NULL,
tag_id INT(11) NOT NULL,
KEY(item_id),
KEY(tag_id)
);

INSERT INTO items VALUES(1);
INSERT INTO items VALUES(2);
INSERT INTO items VALUES(3);
INSERT INTO items VALUES(4);

INSERT INTO tags (name) VALUES("ring");
INSERT INTO tags (name) VALUES("diamond");
INSERT INTO tags (name) VALUES("pearl");
INSERT INTO tags (name) VALUES("necklace");

INSERT INTO items_tags (item_id,tag_id) VALUES(1,1);
INSERT INTO items_tags (item_id,tag_id) VALUES(1,2);
INSERT INTO items_tags (item_id,tag_id) VALUES(2,1);
INSERT INTO items_tags (item_id,tag_id) VALUES(2,2);
INSERT INTO items_tags (item_id,tag_id) VALUES(3,4);
INSERT INTO items_tags (item_id,tag_id) VALUES(3,2);
INSERT INTO items_tags (item_id,tag_id) VALUES(4,1);
INSERT INTO items_tags (item_id,tag_id) VALUES(4,3);


Let's run a few examples. Say I want to find all of the items with a tag name of "ring" the query would be fairly straight forward:


SELECT items.* FROM items
LEFT JOIN (
SELECT DISTINCT tag1.item_id FROM items_tags AS tag1
LEFT JOIN items_tags AS tag2 ON tag1.item_id = tag2.item_id
WHERE tag1.tag_id = (
SELECT id FROM tags
WHERE name = "ring"
)
) AS tagged ON items.id = tagged.item_id
WHERE items.id = tagged.item_id;


Viola!! If you had 1200 entries this would take you roughly 0.98 seconds to fulfill the queries. All in one query. The key is the sub queries. Well that's still 0.98 seconds though. Compared to a virtually straight 3 query finding the tag id, then finding the item ids in relation, and then pulling the items through a simple IN, 0.98 seconds ends up being 0.93 seconds longer. What happens when we add a second tag? And how would the query look? And how much time would it take? Is it going to be enough of a cost savings?

First think about the alternative and the goal. We want all items that match both tags. That means item comparisons which with two tags. That would mean we have the initial 3 queries:


SELECT DISTINCT id FROM tags WHERE name IN ('ring','diamond');
SELECT DISTINCT item_id FROM items_tags WHERE tag_id IN ('1','2');
SELECT DISTINCT * FROM items WHERE id IN ('1','2','3','4');


Which basically just dumped all of our items out to you. Then you would have 2 checks in code, 1 for each tag, for every item returned just to do a comparison in the code. That means looping through your arrays like so (in Ruby):


# I'm typing this out in the clearest manner possible for the readers while keeping this
# concise. I know I could shave off lines of code but am actively not doing that to make
# it more understandable. The same goes for my lack of use of inject. Not everyone is
# familiar or comfortable with inject so I'm ignoring it.
# --
# Assuming tags come in as an id separated by commas like "/items/tags/ring,diamond"

bench_times = [Time.now]

ids = []
tags = Tag.find(
:all,
:conditions => [
"name IN (?)",
params[:id].split(",").uniq
]
) #-> SELECT * FROM tags WHERE tags.name IN ('ring','diamond')

tags.each do |tag|
tag.item_ids.each do |item|
ids.push item
end
end

items = Item.find(
:all,
:conditions => [
"items.id IN (?)",
ids.uniq
]
)

ids.clear
tags_length = tags.length
items.each do |item|
tagged_count = 0
tags.each do |tag|
begin
tagged_count += 1 if item.tags.find(tag.id)
rescue ActiveRecord::RecordNotFound
end
end
ids.push item if tagged_count == tags_length
end

bench_times.push Time.now
puts "** Starting Time: #{bench_times[0]} **"
puts "** Ending Time: #{bench_times[1]} **"
puts "Total time spent: #{bench_times[1].to_f - bench_times[0].to_f} seconds"
@items = ids


That is of course as simple and readable as I could make it for everyone. As noted above I could make it more concise and use less lines of code. Now how long does that take to run? 22.78 seconds if you are again on the 1200 plus items db that I have.

That's a heck of a lot of time to process through a slew of tags. How much time can we shave off it with the method used earlier (modified only slightly - enough to fit other tags in it)?

Well let us run another example here:


SELECT DISTINCT items.* FROM items
LEFT JOIN (
SELECT DISTINCT t1.item_id FROM items_tags AS t1
JOIN items_tags AS t2 ON t1.item_id = t2.item_id
WHERE t1.tag_id = (
SELECT id FROM tags
WHERE name = 'ring'
)
AND t2.tag_id = (
SELECT id FROM tags
WHERE name = 'diamond'
)
) AS tagged ON items.id = tagged.item_id
WHERE items.id = tagged.item_id;


Hmm ... It only took 0.07 seconds to find those. That's a huge speed improvement! So here's the Ruby code to make that query work:


ids = []
incoming_tags = params[:id].split(",").uniq
first_tag = incoming_tags.shift

sql_joins = []
sql_conditions = ["WHERE t1.tag_id = (SELECT id FROM tags WHERE name = '#{first_tag}')"]

incoming_tags.each_index do |tag_index|
table = "t#{tag_index + 2}"
sql_joins.push "JOIN items_tags AS #{table} ON t1.item_id = #{table}.item_id"
sql_conditions.push "AND #{table}.tag_id = (SELECT id FROM tags WHERE name = '#{incoming_tags[tag_index]}')"
end

final_sql = "SELECT DISTINCT items.* FROM items LEFT JOIN (SELECT DISTINCT t1.item_id from items_tags as t1 #{sql_joins.join(" ")} #{sql_conditions.join(" ")}) AS tagged ON items.id = tagged.item_id WHERE items.id = tagged.item_id"

ids = Item.find_by_sql(final_sql).collect {|item| item.id}

@items = Item.paginate(
:page => params[:page],
:order => 'created_on DESC',
:conditions => ["items.id in (?)", ids.uniq],
:per_page => (params[:items_per_page] || 12)
)


There you go. Everything collated and sanitized in record time. Notice how the sql is dynamically built? I figured doing it in this way was the most expandable way. That proved to be true when I tried it with 3 and even 4 tags on my real production database.

I highly recommend putting this type of logic in your Item model itself and just pass the tags as a parameter. That will help keep your controller skinny and your model more extensible. Of course the another step would be to turn that sql statement into an sql procedure. One step at a time though ehh? ;)

1 Comments:

  • At August 21, 2008 9:31 PM , Blogger Sleazy said...

    Interesting structure. Pretty slick and a great improvement.

    I am a fan of MVC style development myself but have thus far focused mainly on the Zend framework. I'm a smidge of a php whore.
     

Links to this post:

Create a Link

<< Home