<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Schadenfreude &#187; sphinx</title>
	<atom:link href="http://www.ralree.com/tag/sphinx/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.ralree.com</link>
	<description>Malicious enjoyment derived from observing someone else's misfortune</description>
	<lastBuildDate>Thu, 09 Feb 2012 01:49:15 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Fulltext Indexing Wikipedia with Sphinx</title>
		<link>http://www.ralree.com/2007/09/15/fulltext-indexing-wikipedia-with-sphinx/</link>
		<comments>http://www.ralree.com/2007/09/15/fulltext-indexing-wikipedia-with-sphinx/#comments</comments>
		<pubDate>Sat, 15 Sep 2007 22:17:00 +0000</pubDate>
		<dc:creator>Erik</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[benchmark]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[sphinx]]></category>
		<category><![CDATA[sql]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">http://www.ralree.info/2007/10/13/fulltext-indexing-wikipedia-with-sphinx</guid>
		<description><![CDATA[Earlier this year, I decided it would be cool to mirror Wikipedia. I set up a local copy on my system, and it&#8217;s been just sitting there ever since. Lately, though, I&#8217;ve been interested in the fulltext indexing offered by various search engines, and Sphinx has looked especially tasty, so I figured I&#8217;d sit down and try it today. I pointed it at the 16GB of Wikipedia text in my MySQL database. [...]]]></description>
			<content:encoded><![CDATA[<p>Earlier this year, I decided it would be cool to mirror Wikipedia.  I set up a local copy on my system, and it&#8217;s been just sitting there ever since.  Lately, though, I&#8217;ve been interested in the fulltext indexing offered by various search engines, and <a href="http://www.sphinxsearch.com/">Sphinx</a> has looked especially tasty, so I figured I&#8217;d sit down and try it today.</p>
<p>            <span id="more-3411"></span></p>
<p>I pointed it at the 16GB of Wikipedia text in my MySQL database like so:</p>
<h2>sphinx.conf</h2>
<pre><code>
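source src1
{
  type              = mysql

  # Index the raw wikitext as-is; no HTML stripping
  strip_html        = 0
  index_html_attrs  =

  # MySQL connection for the local Wikipedia mirror
  sql_host          = localhost
  sql_user          = wikipedia
  sql_pass          = wikipedia
  sql_db            = wikidb

  # Main fetch query: document ID first, then the text to index
  sql_query_pre     =
  sql_query         = \
    SELECT old_id, old_text \
    FROM text
  sql_query_post    =

  # Used by the command-line search tool to display each match
  sql_query_info    = SELECT * FROM text WHERE old_id=$id
}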

</code></pre>
<h2>Next, I set up the indexing section.</h2>
<pre><code>
index wikipedia
{
  source          = src1
  path            = /nexus/rofl/sphinx/wikipedia.sphinx
  docinfo         = extern
  morphology      = none
  stopwords       =
  min_word_len    = 1
  charset_type    = utf-8
  min_prefix_len  = 0
  min_infix_len   = 0
}

# Inherits every setting from wikipedia, overriding only path and morphology
index wikipediastemmed : wikipedia
{
  path            = /var/data/wikipediastemmed
  morphology      = stem_en
}

indexer
{
  # RAM the indexer may use while building the index
  mem_limit       = 512M
}

</code></pre>
<p>I left all the other options at their defaults.  Next, I started the indexer and waited about <strong>2.5 hours</strong>.  Bear in mind that 2.5 hours isn&#8217;t all that long to index this much data, especially given the results I&#8217;m about to show you.</p>
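<p>A quick aside on the <code>wikipediastemmed</code> index defined above: with <code>morphology = stem_en</code>, different forms of a word are reduced to a common stem before indexing, so a search for one form can match the others.  The Ruby sketch below only illustrates that idea.  It is a toy suffix stripper, not the English stemmer Sphinx actually uses, and <code>toy_stem</code> and <code>SUFFIXES</code> are made-up names for this example.</p>

```ruby
# Toy suffix stripper. This is NOT the stemmer behind Sphinx's stem_en;
# it only shows the idea of collapsing word forms into one indexed term.
SUFFIXES = %w[ing ed es s].freeze

def toy_stem(word)
  w = word.downcase
  suffix = SUFFIXES.find { |s| w.end_with?(s) }
  return w if suffix.nil?

  stem = w.delete_suffix(suffix)
  # Leave very short words alone so "is" does not turn into "i".
  stem.length.between?(3, w.length) ? stem : w
end

forms = %w[indexing indexed indexes index]
stems = forms.map { |f| toy_stem(f) }
puts stems.uniq.inspect # prints ["index"]
```

<p>All four forms end up as the same term, which is roughly why a stemmed index returns more matches than the plain <code>wikipedia</code> index for the same query.</p>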
<h2>Now it&#8217;s time to test this out!</h2>
<pre><code>

hank@rofl:/usr/local/etc$ time search endothermic
## ....................................................................................................
## ....................................................................................................
## ....................................................................................................
= Sterling D. | title = Cold Fire® is a Hot Fire Extinguisher | publisher =
Company press release | date = Nov. 28, 2003 | url= http://www.greaterthings.com/News/ColdFire/pr031122.html | accessdate = August 21, 2006}}&lt;/ref&gt;
==References==
&lt;references/&gt;
== External links ==
* [http://www.firefreeze.com Fire Freeze Worldwide Inc.]

[[Category:Firefighting]]
        old_flags=utf-8
20. document=112594001, weight=1
        old_id=112594001
        old_text=#REDIRECT[[Endothermic]]
        old_flags=utf-8

words:
1. 'endothermic': 173 documents, 293 hits

real    0m0.831s
user    0m0.004s
sys     0m0.080s

hank@rofl:/usr/local/etc$ time search "hello &#038; world" &gt;/dev/null

real    0m0.659s
user    0m0.032s
sys     0m0.052s

</code></pre>
<h1>Look at that time!!  <strong>0.8 Seconds</strong> to search <strong>16GB of text</strong>!</h1>
<h2>Sphinx is indeed the master of the fulltexting.</h2>
<p>I&#8217;m very impressed.  I&#8217;m sure I will find a use for this soon.</p>
<h1>Update: It&#8217;s actually faster.</h1>
<p>Following the comment from Sphinx&#8217;s author below, I ran a <code>searchd</code> instance, which gets rid of all the overhead of searching from the command line.</p>
<p>Here are some results I got using the Ruby API that&#8217;s included with Sphinx:</p>
<pre><code>
irb(main):010:0&gt; t = Time.now; s.query('(Single &#038; mother) &#038; !father'); puts Time.now - t
0.016864
=&gt; nil
</code></pre>
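<p>The one-liner above subtracts two <code>Time.now</code> values.  If you want to repeat the measurement, a slightly sturdier variant is to read a monotonic clock, which is not affected by system clock adjustments.  This is only a sketch: <code>timed</code> and <code>fake_query</code> are hypothetical names, and <code>fake_query</code> merely stands in for the Sphinx client call so the snippet runs without a <code>searchd</code> instance.</p>

```ruby
# Runs the block and returns [result, elapsed_seconds],
# measured with a monotonic clock rather than wall-clock time.
def timed
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  [result, Process.clock_gettime(Process::CLOCK_MONOTONIC) - start]
end

# Hypothetical stand-in for a Sphinx client query; swap in the real call.
fake_query = lambda { |q| { query: q, matches: [] } }

result, seconds = timed { fake_query.call('endothermic') }
puts format('%s took %.6f seconds', result[:query], seconds)
```

<p>Replacing the body of the block with the real query call gives the same kind of number as the irb session.</p>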
<h2>It only took <strong>0.017 seconds</strong> to find every document in Wikipedia&#8217;s database that mentions both &#8220;single&#8221; and &#8220;mother&#8221; but not &#8220;father&#8221;.</h2>
<p>This is indeed impressive.</p>
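<p>In case the boolean syntax in that query is unfamiliar, here is a tiny Ruby model of what it asks for: documents containing both required words and none of the excluded ones.  It is a toy over three made-up sentences, not how Sphinx evaluates queries (Sphinx answers from its index instead of scanning text), and <code>DOCS</code>, <code>words</code> and <code>matches?</code> are names invented for the example.</p>

```ruby
# Toy model of the boolean query: single AND mother AND NOT father.
# The three documents are made up for illustration.
DOCS = [
  { id: 1, text: 'A single mother raising two children' },
  { id: 2, text: 'A single mother and the father of her children' },
  { id: 3, text: 'The mother of invention' }
].freeze

# Splits text into lowercase word tokens.
def words(text)
  text.downcase.scan(/[a-z0-9]+/)
end

# True when every required word is present and no excluded word is.
def matches?(text, required:, excluded:)
  terms = words(text)
  has_all = required.all? { |w| terms.include?(w) }
  has_none = excluded.none? { |w| terms.include?(w) }
  [has_all, has_none].all?
end

hits = DOCS.select do |d|
  matches?(d[:text], required: %w[single mother], excluded: %w[father])
end
puts hits.map { |d| d[:id] }.inspect # prints [1]
```

<p>Only the first document qualifies: the second mentions a father, and the third never says single.</p>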
]]></content:encoded>
			<wfw:commentRss>http://www.ralree.com/2007/09/15/fulltext-indexing-wikipedia-with-sphinx/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
	</channel>
</rss>

