<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Sanjay Kairam &#187; papers</title>
	<atom:link href="http://www.sanjaykairam.com/blog/tag/papers/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.sanjaykairam.com/blog</link>
	<description>Home Page and Blog (Commons Sense)</description>
	<lastBuildDate>Mon, 06 Sep 2010 23:00:00 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Anatomy of a Paper about a Large-Scale Social Search Engine</title>
		<link>http://www.sanjaykairam.com/blog/2010/02/anatomy-of-a-paper-about-a-large-scale-social-search-engine/</link>
		<comments>http://www.sanjaykairam.com/blog/2010/02/anatomy-of-a-paper-about-a-large-scale-social-search-engine/#comments</comments>
		<pubDate>Fri, 05 Feb 2010 21:43:22 +0000</pubDate>
		<dc:creator>skairam</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[aardvark]]></category>
		<category><![CDATA[google]]></category>
		<category><![CDATA[PageRank]]></category>
		<category><![CDATA[papers]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[social]]></category>
		<category><![CDATA[social search]]></category>
		<category><![CDATA[the mechanical zoo]]></category>
		<category><![CDATA[WWW]]></category>

		<guid isPermaLink="false">http://www.sanjaykairam.com/blog/?p=135</guid>
		<description><![CDATA[Earlier this week, the team at Aardvark unveiled a new paper "The Anatomy of a Large-Scale Social Search Engine" which will be presented in April at WWW 2010. It is inspired by and patterned after "The Anatomy of a Large-Scale Hypertextual Web Search Engine", which describes the PageRank algorithm that drives Google's search ranking system (and which, as Aardvark's blog points out, was also presented at WWW, 12 years ago). The paper, by Aardvark's Damon Horowitz and Stanford's Sep Kamvar, focuses mostly on the architecture of the Aardvark system, from the external representations with which users interact to the internal ranking algorithms on which the system runs. Below, I present a short summary of what they report, focusing on the elements I found most interesting.]]></description>
			<content:encoded><![CDATA[<p>Earlier this week, the team at Aardvark unveiled a new paper &#8220;<a title="Aardvark Blog - Anatomy of a Large-Scale Social Search Engine" href="http://blog.vark.com/?p=352" target="_blank">The Anatomy of a Large-Scale Social Search Engine</a>&#8221; which will be presented in April at <a title="WWW2010 - Home" href="http://www2010.org/www/" target="_blank">WWW 2010</a>. It is inspired by and patterned after &#8220;<a title="Stanford InfoLab - Google" href="http://infolab.stanford.edu/~backrub/google.html">The Anatomy of a Large-Scale Hypertextual Web Search Engine</a>&#8221;, which describes the <a title="Wikipedia - PageRank" href="http://en.wikipedia.org/wiki/PageRank" target="_blank">PageRank</a> algorithm that drives Google&#8217;s search ranking system (and which, as Aardvark&#8217;s blog points out, was also presented at WWW, 12 years ago).</p>
<p>The paper, by Aardvark&#8217;s Damon Horowitz and Stanford&#8217;s Sep Kamvar, focuses mostly on the architecture of the Aardvark system, from the external representations with which users interact to the internal ranking algorithms on which the system runs. Below, I present a short summary of what they report, focusing on the elements I found most interesting:</p>
<p><strong>The Basic Model</strong>: Aardvark&#8217;s scoring function is similar to PageRank in that both rely on two primary components that are considered somewhat independently: <em>relevance</em> and <em>quality</em>.</p>
<ul>
<li><em>Relevance</em> in the Aardvark model pertains to the probability that a particular user <em>i</em> can answer the given question <em>q</em>, based on the topics <em>t</em> identified in that question.</li>
<li><em>Quality</em> in the Aardvark model pertains to the overall probability that a user <em>i</em> can return a satisfactory answer to another user <em>j</em>, regardless of the question.</li>
</ul>
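<p>To make the two components concrete, here is a minimal Python sketch of how such a score could be put together. It assumes that relevance is a sum over question topics and that it is simply multiplied by quality; the function name <code>score</code>, the dictionary layout, the user names, and every number are illustrative assumptions of mine, not values from the paper:</p>
<pre>
```python
# Hypothetical sketch: combine topic-based relevance with asker-specific quality.
# All names and probabilities below are illustrative assumptions.

def score(topic_given_question, user_given_topic, quality):
    """Return relevance * quality for one candidate answerer.

    topic_given_question: dict mapping topic to p(t | q)
    user_given_topic: dict mapping topic to p(user can answer | t)
    quality: p(user gives the asker a satisfactory answer), question-independent
    """
    relevance = sum(
        p_topic * user_given_topic.get(topic, 0.0)
        for topic, p_topic in topic_given_question.items()
    )
    return relevance * quality


# A question that is mostly about restaurants and a little about travel.
question_topics = {"restaurants": 0.8, "travel": 0.2}

# Each candidate: (topic expertise, quality with respect to this asker).
candidates = {
    "alice": ({"restaurants": 0.9, "travel": 0.1}, 0.5),
    "bob": ({"restaurants": 0.3, "travel": 0.9}, 0.9),
}

ranked = sorted(
    candidates,
    key=lambda name: score(question_topics, candidates[name][0], candidates[name][1]),
    reverse=True,
)
print(ranked)
```
</pre>
<p>With these made-up numbers, bob ends up ranked ahead of alice even though alice is the better topical match, because his quality score for this asker is higher. That is the sense in which the two components are considered independently.</p>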
<p><strong>Indexing Topics:</strong> Aardvark computes the relevance score by calculating a distribution of knowledge over topics known by the user using the following sources (keyword-y sounding italicized terms are for convenience only and are not used in the paper):</p>
<ul>
<li><em>Explicit Prompting</em> at sign-up for three &#8220;starter&#8221; topics about which the user has expertise.</li>
<li><em>Social Prompting</em> of a user&#8217;s friends to provide topics about which they trust the user&#8217;s opinion.</li>
<li><em>Structured Parsing</em> of the online profile pages connected to Aardvark by the user (e.g. &#8220;Interests&#8221; on a Facebook profile).</li>
<li><em>Unstructured Parsing</em> of the user&#8217;s online homepage, blog, or status updates using a linear SVM to extract overall subject area and a named entity extractor to extract more specific topics.</li>
</ul>
<p><strong>Indexing Connections:</strong> Aardvark computes the quality score by building a set of weighted connections between users using characteristics ranging from social proximity to similarities in demographics or behavior, such as:</p>
<ul>
<li><em>Social Connections</em> either in the form of explicitly defined &#8220;friend&#8221; connections or implicit &#8220;network&#8221; connections, such as both being part of the Stanford network.</li>
<li><em>Demographic Similarity</em>, which likely includes age, gender, and location based on profile information collected by Aardvark.</li>
<li><em>Profile Similarity</em>, which seems to include similar movies and other items which might be listed on other profiles, such as Facebook.</li>
<li><em>Vocabulary Match</em>, which they explain with the example of &#8220;IM Shortcuts&#8221; (I assume this means it is based on the language you use to interact with Aardvark, but I am unsure).</li>
<li><em>Chattiness and Verbosity Match</em>, which relate to frequency and length of messages used when interacting with Aardvark.</li>
<li><em>Politeness Match</em>, which basically seems to mean whether or not users say &#8220;Thanks!&#8221;.</li>
<li><em>Speed Match</em>, which is a measure of responsiveness to other users.</li>
</ul>
<p><strong>Analyzing Questions:</strong> While all of the other components are pre-computed, this part is computed at question time (obviously). They utilize a number of classifiers to classify the question and then a set of mappers to map the question to a set of topics, noting that &#8220;the role of the Question Analyzer&#8230;is simply to learn enough about the question that it may be sent to appropriately interested and knowledgeable human answerers&#8221;. Here are the classifiers and topic mappers they list (with the names used in the paper):</p>
<ul>
<li><em>NonQuestionClassifier:</em> Determines if input is a valid question.</li>
<li><em>InappropriateQuestionClassifier:</em> Determines if input is obscene, spam, or otherwise unsuitable for asking.</li>
<li><em>TrivialQuestionClassifier:</em> Determines if input is a simple factual question (examples given: &#8220;What time is it now?&#8221;, &#8220;What is the weather?&#8221;). If so, the user gets an automatically generated answer via traditional web search.</li>
<li><em>LocationSensitiveClassifier:</em> Determines if the question contains location information; if it does, it passes that information along to the Routing Engine.</li>
</ul>
<ul>
<li><em>KeywordMatchTopicMapper:</em> Checks for string matches against user profile topics (the mapper attempts to classify meaningful vs. spurious matches).</li>
<li><em>TaxonomyTopicMapper:</em> Classifies question text using an SVM trained on an &#8220;annotated corpus of several million questions&#8221; (<strong>where did they find that?</strong>)</li>
<li><em>SalientTermTopicMapper:</em> Extracts salient phrases using a noun-phrase chunker and tf-idf and finds &#8220;semantically similar user topics&#8221;.</li>
<li><em>UserTagTopicMapper:</em> Utilizes tags explicitly provided by the asker or other answerers and maps them to user topics.</li>
</ul>
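<p>As a rough illustration of the tf-idf idea behind something like the <em>SalientTermTopicMapper</em>, here is a small Python sketch. It is not the paper&#8217;s implementation: it swaps the noun-phrase chunker for plain whitespace-separated words, uses a smoothed idf formula of my own choosing, and the function name <code>salient_terms</code> and the tiny corpus are hypothetical:</p>
<pre>
```python
import math
from collections import Counter


def salient_terms(question, corpus, top_n=3):
    """Rank the words of a question by tf-idf against a corpus of past questions."""
    docs = [set(doc.lower().split()) for doc in corpus]
    words = question.lower().split()
    term_counts = Counter(words)
    scores = {}
    for word, count in term_counts.items():
        doc_freq = sum(1 for doc in docs if word in doc)
        # Smoothed inverse document frequency, so unseen words never divide by zero.
        idf = math.log((1 + len(docs)) / (1 + doc_freq)) + 1.0
        scores[word] = (count / len(words)) * idf
    # Highest score first; ties are broken alphabetically to keep the order stable.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


# Hypothetical corpus of earlier questions (punctuation left out for simplicity).
corpus = [
    "what is a good movie",
    "what is a good book",
    "where is a good place to hike",
]
question = "what is a good sushi restaurant"
print(salient_terms(question, corpus, top_n=2))
```
</pre>
<p>Words that show up in nearly every past question (&#8220;is&#8221;, &#8220;a&#8221;, &#8220;good&#8221;) get a low weight, while words that are rare in the corpus float to the top and become candidates for matching against user topics.</p>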
<p>This description of the routing algorithm makes up the core of the paper. After some more description of how users interact with the system, the authors provide some interesting data collected over the past several months of use (from the beta launch in March 2009 until October 2009). Here&#8217;s a quick run-down of the more interesting facts that they presented:</p>
<ul>
<li><em>Strong User Growth: </em>As of October 2009, they reported 90,361 user accounts, and users appear to be remaining active (in the study period, over 1/2 the users actively generated content and over 2/3 of the users passively participated).</li>
</ul>
<div id="attachment_139" class="wp-caption aligncenter" style="width: 402px"><a href="http://www.sanjaykairam.com/blog/wp-content/uploads/2010/02/aardvarkusers.png"><img class="size-full wp-image-139" title="Aardvark User Growth" src="http://www.sanjaykairam.com/blog/wp-content/uploads/2010/02/aardvarkusers.png" alt="Aardvark User Growth" width="392" height="331" /></a><p class="wp-caption-text">User Growth on Aardvark (graph taken from the paper).</p></div>
<ul>
<li><em>Higher Query Contextualization:</em> Aardvark queries average 18.6 words in length while the average query length reported for web search is between 2.2 and 2.9 words (citing previous comparison and characterization studies). They further state that &#8220;98.1% of questions are unique&#8221;, though I am unsure as to how exact they are being about matching (I am sure the question &#8220;What&#8217;s a great restaurant in SF&#8221; has been asked 1000 times in different forms). In addition, they report from manual scoring of 1000 randomly selected questions that 64.7% of questions asked have a subjective element, with advice about travel, restaurants, and products being especially popular.</li>
<li><em>Fast, High-Quality Answers:</em> They report that 87.7% of questions received answers and 57.2% received an answer within 10 minutes. They report that 70.4% of answers receiving feedback are rated as &#8220;good&#8221; and only 15.5% are rated as &#8220;bad&#8221;. Interestingly, they observe a notable difference in feedback on answers from users within the asker&#8217;s social network (76% rated as good) and outside the asker&#8217;s network (68% rated as good).</li>
</ul>
<div id="attachment_138" class="wp-caption aligncenter" style="width: 503px"><a href="http://www.sanjaykairam.com/blog/wp-content/uploads/2010/02/aardvarkquestions.png"><img class="size-full wp-image-138" title="Aardvark Questions" src="http://www.sanjaykairam.com/blog/wp-content/uploads/2010/02/aardvarkquestions.png" alt="Aardvark Questions" width="493" height="229" /></a><p class="wp-caption-text">Questions on Aardvark (chart taken from the paper).</p></div>
<p>Overall, I really enjoyed reading this paper. After using Aardvark for over a year now, it was really interesting to get to peer inside and see how the system works, and a lot of great details were provided about the ranking engine.</p>
<p>One place where I feel that the authors missed the mark was in the cursory side-by-side evaluation which pitted Aardvark against Google on a set of 200 questions randomly selected from the Aardvark system. They report that 71.5% of the questions studied were answered successfully on Aardvark, while 70.5% of the questions were answered successfully on Google. This comparison seems mostly useless, as the questions, having been pulled from the Aardvark system in the first place, are ones that users chose to ask there precisely because they are better adapted to what is being called &#8216;social search&#8217;. This comparison left me wanting more investigation into two main questions.</p>
<p><em>&#8220;What makes a search engine &#8216;social&#8217; in the first place?&#8221;</em></p>
<p>The distinction between social and non-social is extremely murky, something Brynn and I discovered when working on our <a title="Sanjay Kairam - Cognitive Consequences of Social Search (PDF)" href="http://sanjaykairam.com/papers/evans-kairam-pirolli-inSubmission.pdf" target="_blank">Social Search paper</a>. It has been argued before (one small example <a title="Brynn Evans' Blog - Comment by Manas Tungare" href="http://brynnevans.com/blog/2009/01/30/why-social-search-wont-topple-google-anytime-soon/#comment-1933">here</a>) that Google&#8217;s PageRank algorithm is inherently social, as it aggregates information provided by people (links to one another) to rank results. However, it is clear that something is categorically different between Google and what people perceive to be &#8216;social search&#8217;. When it comes down to it, even though everyone is excited about <a title="Google Blog - Search is getting more social" href="http://googleblog.blogspot.com/2010/01/search-is-getting-more-social.html" target="_blank">Google&#8217;s forays into &#8220;Social Search&#8221;</a>, there&#8217;s nothing all that fundamentally different between Google indexing your blog and your tweets and Google indexing any other documents extant on the web.</p>
<p>To me, it seems that the key difference is really the change in the <strong>direction of interaction</strong>. While Google takes a query (question) and compares it against traces of discussion about that question from the past (web documents), systems perceived as &#8216;social&#8217; take a question and attempt to generate new answers in the future. This change in direction is what allows for the higher context that makes &#8216;social&#8217; search answers so much richer (at least for some questions). Perhaps we need a different word to define this phenomenon &#8211; &#8216;real-time search&#8217; seems to get at it more, but has its own problems. Perhaps something like &#8216;generative search&#8217;? I really don&#8217;t know.</p>
<p><em>&#8220;Why do we need a social search engine at all?&#8221;</em></p>
<p>This one seems like the best fodder for a follow-up study by Aardvark. While they do provide a rough breakdown of the types of questions asked on Aardvark (see pie chart above), I think that a comparison might have been much more interesting if they had looked at a variety of classes of user needs and had compared the relative efficacy of searching on Aardvark and a traditional search engine such as Google. It is clear that &#8216;social&#8217; will work much better for some needs and much worse for others, but up to this point, people who talk about social search always seem to use the same types of examples (travel, restaurants, and products, for instance). It would be great to get a clear idea over a wide range of needs and use cases where systems such as Aardvark can provide benefits over existing tools.</p>
<p>Anyways, for those of you interested in &#8216;social search&#8217; and search systems, I encourage you to read this paper and tell me your thoughts!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.sanjaykairam.com/blog/2010/02/anatomy-of-a-paper-about-a-large-scale-social-search-engine/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>
