Thursday, September 25, 2008

MySQL full-text search optimization

A while ago we where developing a classified ads site for one of our most important clients. In this site, every ad belongs to a region (city, country) and a category (e.g. vehicles>buy/sell>cars).

When the visitor accesses the homepage, the region is determined from their IP address so only the ads for that region are displayed (the visitor can change the region) and every search is performed only for the ads on that region as well.

At first everything was working great, but when some regions started to have more than 500.000 ads, that’s when we started to get nervous. Every search on these regions took more than 30 seconds, which is obviously unacceptable. We tried everything: create a full-text field on the ads table, optimized the database queries, but could not bring the search time down to a reasonable amount. At this point I was considering fleeing the country, changing my name and closing all my email accounts so that our client couldn’t find me, and start a new life selling bananas in Brazil. I’m glad I didn’t do it (I think).

After a lot of research over the internet, we came across a miracle. That miracle is called Sphinx Search (http://www.sphinxsearch.com). This is an open-source solution for all your MySQL full-text search problems.

How does it work? You simply install the software on your server, create a configuration file for the table you want to perform the searches on and run the indexing process. This will create a dictionary file that allows you to perform searches in less than a second, on huge tables. You can also provide filters on any of the fields of the table as well as perform sorting on the result set. This software also provides a PHP API so that you can access it from your PHP pages and process the result set.

An important side note: since now all searches are performed on the data dictionary (which is created by a process than can take several minutes, so it’s advisable to run it only once or twice a day) and users can be adding new rows to the original table, the data dictionary will not have all the current data.

How can you solve this issue? Easy, you also create a delta dictionary file with the new rows and schedule a process to rebuild it many times a day (depending on how often new rows are added to your table). This delta dictionary is also included on the same configuration file as the master dictionary, but using a different query.

How do you combine both dictionaries? You can have a process that runs at midnight rebuilding the master dictionary with the overall query (e.g. SELECT * FROM ad) and another that runs every fifteen minutes rebuilding the delta dictionary with the new rows added that day. When the process runs again at midnight to rebuild the master index, the data from the delta dictionary will be incorporated to the master index and the delta index will be reset.

Now every search on the site is working great and we are even using it to obtain all the ads from a specific category without a keyword, because it also takes less than a second. It also allows you to extend the search filtering by fields, grouping and sorting.

Here’s an example configuration file for my classified ads site:


source ads {
type = mysql

sql_host = localhost
sql_user = user
sql_pass = password
sql_db = myDb
sql_port = 3306

sql_query = \
SELECT \
id, title, descr, \
catid, UNIX_TIMESTAMP(hw_added) as hw_added, \
UNIX_TIMESTAMP(exp_date) as exp_date \
FROM \
ad;

sql_attr_str2ordinal = v_title
sql_attr_uint = catid
sql_attr_timestamp = hw_added
sql_attr_timestamp = exp_date

sql_query_info = SELECT link_id FROM ec4_ad WHERE link_id=$id
}

source delta : ads
{
SELECT \
id, title, descr, \
catid, UNIX_TIMESTAMP(hw_added) as hw_added, \
UNIX_TIMESTAMP(exp_date) as exp_date \
FROM \
ad \
WHERE (TO_DAYS(hw_added) = TO_DAYS(NOW()) ;
}

index ads {
source = ads
path = /home/classifieds/sphinx/main/
# wordforms = /home/classifieds/wordforms.txt
# morphology = stem_en
min_word_len = 3
min_prefix_len = 0
min_infix_len = 3
}

index delta : ads {
source = delta
path = /home/classifieds/sphinx/delta/
}

indexer {
mem_limit = 256M
}

searchd {
port = 3312
log = /home/classifieds/searchd.log
query_log = /home/classifieds/query.log
pid_file = /home/classifieds/searchd.pid
}


I hope this will be useful for you and keeps you from fleeing the country!

8 comments:

Anonymous said...

Sphinx does have some limitations thou.. I started to play around with DBSite stuff. It's a Java based Full text search engine that connects to any SQL type DB.

Ben said...

Great article! I am really interested to know how much of a performance increase you got when using Sphinx.

Chris said...

How Sphinx deals with deleted records?
DBSight can support finding deleted database records.

Pablo said...

@ellisgl:
I think Sphinx is better than DBSight for a number of reasons. The most important ones are the fact that it's open source and that it provides PHP, Python, Ruby and pure C APIs, which makes it much more flexible.

@ben:
Thanks! I got a huge performance increase. Sphinx performs an average query in under 0.1 sec on 2-4 GB text collections.
For example, the search by keyword page on my site took like 40 seconds to load with search results and after using Sphinx started to load in 8 seconds.

@chris:
Sphinxs provides a function for updating the index from the API, which you can use to manage deleted records.
http://www.sphinxsearch.com/doc.html#api-func-updateatttributes

Anonymous said...

Does Sphinx support searching several different tables? For example, if I run several websites with different databases on the same machine, will I be able to have one search for every site with just one Sphinx installation?

Pablo said...

@dominik:

I belive that it's possible to have multiple databases on the same config file, because when you define the source (see source 'ads' on my example), you have to specify host, datasource, user and password.

I think that if you create a different source for each of your databases with their information on the config file, you could use one instance of Sphinx for all your databases. You just have to specify the source when you perform a search using the API.

More info from Sphinx forum:
http://www.sphinxsearch.com/forum/view.html?id=1255

I hope this helps. Thanks for stopping by!

NMMM.NU said...

thank you, the information was helpful.

Anonymous said...

Thank you for this. Very useful.

Does sphinx search add much load to the server?