Fulltext Indexing Wikipedia with Sphinx

September 15th, 2007

Earlier this year, I decided it would be cool to mirror Wikipedia. I successfully set up a local copy on my system, and it’s been just sitting there ever since. But lately, I’ve been interested in the fulltext indexing offered by various indexing engines, and Sphinx has looked especially tasty. So I figured I’d sit down and try it today.

I pointed it at my 16GB of Wikipedia text in my MySQL database like so:

sphinx.conf


source src1
{
  type        = mysql
  strip_html      = 0
  index_html_attrs  =
  sql_host      = localhost
  sql_user      = wikipedia
  sql_pass      = wikipedia
  sql_db        = wikidb
  sql_query_pre   =
  sql_query     = \
    SELECT old_id, old_text \
    FROM text
  sql_query_post    =
  sql_query_info    = SELECT * FROM text WHERE old_id=$id
}

Next, I set up the indexing section.


index wikipedia
{
  source      = src1
  path      = /nexus/rofl/sphinx/wikipedia.sphinx
  docinfo     = extern
  morphology      = none
  stopwords     =
  min_word_len    = 1
  charset_type    = utf-8
  min_prefix_len    = 0
  min_infix_len   = 0
}
index wikipediastemmed : wikipedia
{
  path      = /var/data/wikipediastemmed
  morphology    = stem_en
}
indexer
{
  mem_limit     = 512M
}

I left all the other options as default. Next, I turned on the indexing and waited for about 2.5 hours. Now, bear in mind that 2.5 hours isn’t all that long to index this much data, especially given the results I’m about to show you.
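To put that indexing time in perspective, here is a rough back-of-the-envelope throughput calculation. It is only a sketch: it assumes "16GB" means 16 * 1024 MB and that "about 2.5 hours" is exactly 2.5 hours, neither of which was measured precisely.

```ruby
# Rough indexing throughput from the figures quoted in the post.
# Assumptions (not measured): 16GB = 16 * 1024 MB, and the run took
# exactly 2.5 hours.
data_mb    = 16 * 1024          # total text size in MB
seconds    = 2.5 * 3600         # indexing time in seconds
throughput = data_mb / seconds  # MB indexed per second

puts format('%.2f MB/s', throughput)  # prints "1.82 MB/s"
```

So under those assumptions the indexer chewed through a bit under 2 MB of raw text per second.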

Now it’s time to test this out!



hank@rofl:/usr/local/etc$ time search endothermic
## ....................................................................................................
## ....................................................................................................
## ....................................................................................................
= Sterling D. | title = Cold Fire® is a Hot Fire Extinguisher | publisher =
Company press release | date = Nov. 28, 2003 | url= http://www.greaterthings.com/News/ColdFire/pr031122.html | accessdate = August 21, 2006}}</ref>
==References==
<references/>
== External links ==
* [http://www.firefreeze.com Fire Freeze Worldwide Inc.]

[[Category:Firefighting]]
        old_flags=utf-8
20. document=112594001, weight=1
        old_id=112594001
        old_text=#REDIRECT[[Endothermic]]
        old_flags=utf-8

words:
1. 'endothermic': 173 documents, 293 hits

real    0m0.831s
user    0m0.004s
sys     0m0.080s

hank@rofl:/usr/local/etc$ time search "hello & world" >/dev/null

real    0m0.659s
user    0m0.032s
sys     0m0.052s

Look at that time! 0.8 seconds to search 16GB of text!

Sphinx is indeed the master of the fulltexting.

I’m very impressed. I’m sure I will find a use for this soon.

Update: It’s actually faster.

Following the comment from Sphinx’s author below, I ran a searchd instance, which gets rid of all the overhead of searching from the command line.

Here are some results I got using the Ruby API that’s included with Sphinx:


irb(main):010:0> t = Time.now; s.query('(Single & mother) & !father'); puts Time.now - t
0.016864
=> nil

It only took 0.017 seconds to find all instances of single and mother without mention of father in Wikipedia’s database.
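A single timing like the one above can be noisy, so it may be worth repeating the query a few times. Below is a minimal sketch of a helper that does this with Ruby's standard Benchmark module. The name `time_runs` is made up for illustration, and the workload here is a stand-in so the snippet runs without a searchd instance; in the irb session above you would pass the `s.query(...)` call as the block instead.

```ruby
require 'benchmark'

# Hypothetical helper: runs the given block `runs` times and returns the
# fastest and the mean wall-clock time, in seconds.
def time_runs(runs = 5, &block)
  raise ArgumentError, 'runs must be positive' unless runs > 0
  raise ArgumentError, 'a block is required' unless block

  times = Array.new(runs) { Benchmark.realtime(&block) }
  { best: times.min, mean: times.sum / times.size }
end

# Stand-in workload (an assumption, not a Sphinx call). With the client from
# the post, this would be: time_runs(5) { s.query('(Single & mother) & !father') }
stats = time_runs(3) { (1..10_000).reduce(:+) }
puts format('best %.6fs, mean %.6fs', stats[:best], stats[:mean])
```

The first run is often slower than the rest while caches warm up, which is why the helper reports the best time alongside the mean.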

This is indeed impressive.



  1. September 15th, 2007 at 22:17 | #1

    If you can beat this into a MediaWiki extension, please put details on http://mediawiki.org/ and the mediawiki-l mailing list. The default MediaWiki MySQL full-text search is literally worse than useless; Wikimedia sites use a Lucene variant; more options would be most welcomed.

  2. September 15th, 2007 at 22:17 | #2

    The actual search should be faster than that – CLI search has a lot of preload overhead which is not there in production mode when using searchd (which preloads data only once at startup) – especially when it warms up.

  3. Paul Grinberg
    September 15th, 2007 at 22:17 | #3

    Just to second an earlier post, a MediaWiki extension to enable full text searching with Sphinx would be excellent.

  4. Paul Grinberg
    September 15th, 2007 at 22:17 | #4

    Just wanted to point out that I started work on integrating the Sphinx Search Engine into MediaWiki. See more at http://www.mediawiki.org/wiki/Extension:SphinxSearch . I still have quite a ways to go, so keep your eyes on that page.

  5. Hank
    September 15th, 2007 at 22:17 | #5

    @Andrew:
    Thanks for responding so quickly. I updated the post with my latest test using searchd per your suggestion. You were very right – I can’t believe the speed on this thing. Thanks for building it.

  6. Hank
    September 15th, 2007 at 22:17 | #6

    @Paul:
    Excellent plan! Thanks for the credit. I am pretty busy so I haven’t had time to pursue this on my own, but I’m glad you’re up to the task. Good luck!
