This paper describes a query segmentation method for search engines supporting inverse lookup of words and phrases. Data mining in query logs and document corpora is used to produce segment candidates and to compute connexity measures. Candidates are considered in the context of the whole query, and a list of the most likely segmentations is generated, with each segment annotated with a connexity value. For each segmentation, a segmentation score is computed from the connexity values of its non-trivial segments; this score can be used as a sorting criterion for the segmentations. We also point to a relevance improvement in the query evaluation model by means of a proximity penalty.
Keywords: web search, query processing, data mining, query segmentation, query evaluation
Web search engines are rapidly emerging as the most important application of the World Wide Web. Several challenges arise when trying to make the web searchable [5].
Search engines like AllTheWeb [1], Google [3] and AltaVista [2] are usually based on a kernel supporting inverse lookups of words and phrases. On top of these kernel features, techniques such as detection of proper phrases (e.g. new york → ``new york'') and removal of stopwords or stop-phrases (e.g. how can i get information about X → X) are applied. These techniques reduce the query to a form that is more likely to express the topic(s) asked for, in a manner suitable for a word-based or phrase-based inverse lookup, and thus improve the precision of the search. For instance, a query like where can i find pizza hut in new york will most likely achieve better precision in a word/phrase match intersection when rewritten into a form like ``pizza hut'' ``new york''. Problems with this query rewriting occur when there are ambiguities among phrasing and anti-phrasing alternatives. For instance, the query free computer wallpaper downloads would be rewritten into ``free computer'' wallpaper downloads if phrasing were done by a leftmost-longest approach (a common approach, illustrated below), instead of the more natural free ``computer wallpaper'' downloads. In this paper we describe how we use data mining in query logs and document corpora to derive information that can be used to segment queries into words and phrases, each with a number indicating its connexity.
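To make the phrasing ambiguity concrete, here is a minimal sketch of a greedy leftmost-longest phrasing pass (the PHRASES dictionary, its weights and all names are ours, purely for illustration; they are not part of the described system):

```python
# Hypothetical phrase dictionary with illustrative weights
# (in this paper's terms, these would be connexity values).
PHRASES = {
    ("free", "computer"): 1.0,
    ("computer", "wallpaper"): 8.0,
}

def leftmost_longest(tokens):
    """Greedy leftmost-longest phrasing: at each position, take the
    longest known multi-token phrase starting there."""
    out, i = [], 0
    while i < len(tokens):
        for j in range(len(tokens), i, -1):
            if j - i > 1 and tuple(tokens[i:j]) in PHRASES:
                out.append(tuple(tokens[i:j]))
                i = j
                break
        else:
            out.append((tokens[i],))
            i += 1
    return out

tokens = "free computer wallpaper downloads".split()
print(leftmost_longest(tokens))
# [('free', 'computer'), ('wallpaper',), ('downloads',)]
# i.e. ``free computer'' wallpaper downloads -- the unnatural reading.
```

A segmentation driven by phrase weights rather than by position would prefer the higher-weighted ``computer wallpaper'' instead.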
Query logs yield highly interesting data that may be useful for various tasks. Whatever query-content-specific application is considered (statistical overviews, generation of related queries, or triggering of relevant flash-ins), the recognition of meaningful, non-splittable phrases in each multiword query remains one of its core prerequisites. A sequence $s = t_1 \ldots t_n$ (with $n \geq 2$) of query tokens (words, numbers, special symbols) is considered a potentially meaningful phrase if the following conditions hold: the sequence occurs sufficiently frequently in the query log, and the mutual information between its subsequences is sufficiently high.
Both conditions above make up the central notion for the segmentation of queries, the connexity of a sequence $s = t_1 \ldots t_n$, which is defined as the product of the global frequency of the segment and the mutual information between the longest but complete subsequences of $s$:

$\mathrm{connexity}(s) = \mathrm{freq}(s) \cdot \mathrm{MI}(t_1 \ldots t_{n-1},\; t_2 \ldots t_n)$

It is assumed that the connexity of a single token is equivalent to its frequency, i.e. $\mathrm{connexity}(t) = \mathrm{freq}(t)$.
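As a minimal sketch of this definition (the function and variable names are ours, and the pointwise form of the mutual information over query-log frequencies is our assumption about the exact MI formulation):

```python
import math

# freq maps a token sequence (tuple of tokens) to its global frequency
# in the query log; total is the overall number of observations.
def connexity(s, freq, total):
    """Connexity of a token sequence s = (t1, ..., tn): freq(s) times the
    mutual information between the two longest complete subsequences."""
    if len(s) == 1:
        # By assumption, connexity of a single token is its frequency.
        return freq.get(s, 0)
    left, right = s[:-1], s[1:]  # t1..t_{n-1} and t2..t_n
    p_s = freq.get(s, 0) / total
    p_l = freq.get(left, 0) / total
    p_r = freq.get(right, 0) / total
    if p_s == 0 or p_l == 0 or p_r == 0:
        return 0.0
    mi = math.log(p_s / (p_l * p_r))  # pointwise MI (assumed form)
    return freq[s] * mi
```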
The connexity values presented here are computed from a selected sample of our query logs with the following characteristics: approx. 600 million original query lines and 105 million lines in its normalized frequency list $F$. Most of the operations related to the computation of connexity are carried out on $F$ or on its representations. For the sake of a brief characterization of $F$, we split it into subsets according to the number of tokens in a line. Table 1 shows how many lines each subset of $F$ consists of and how many new segments (with $n \geq 2$, not seen in other subsets) are estimated in each subset.
[Table 1: number of lines and estimated new segments per $n$-token subset of $F$.]
The total number of $n$-token segments that appear in $F$ is estimated from these subset counts. A manageable subset of segments $D$ must be chosen to satisfy feasibility conditions on a single query processing host, i.e. a search node (as defined in [5]): the segment database must fit within the memory budget of the node, and lookups must be fast enough for query-time use. We have applied two rather rough rules of thumb for scaling down the number of segments when selecting $D$.
We have chosen to represent the database of segments and their connexities by means of finite-state devices as defined e.g. in [4]. Each segment is represented as an entry in the device and combined with a perfect hash method to achieve the mapping to its connexity. This twofold representation (segment strings + connexity values) consists of a 140MB string-representation part and a 50MB connexity-values part, which fulfils the feasibility condition for size. Thanks to a very efficient lookup (approx. 0.5M lookups/second), it is possible to perform about 25,000 segmentations per second on average queries. The time may vary depending on the complexity of the segmentation procedure, which we discuss in the next section.
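A minimal stand-in for this twofold representation (the real system uses a finite-state device with perfect hashing; here a sorted list of segment strings plays the role of the string part, and the position in it plays the role of the perfect hash into a parallel values array):

```python
from bisect import bisect_left

class SegmentStore:
    """Two-part store: segment strings (a sorted list, standing in for
    the finite-state device) plus a parallel array of connexity values
    (standing in for the perfect-hash mapping)."""
    def __init__(self, segment_connexities):
        items = sorted(segment_connexities.items())
        self.segments = [s for s, _ in items]     # "string part"
        self.connexities = [c for _, c in items]  # "values part"

    def lookup(self, segment):
        i = bisect_left(self.segments, segment)
        if i < len(self.segments) and self.segments[i] == segment:
            return self.connexities[i]
        return None  # unknown segment

store = SegmentStore({"new york": 9.2, "pizza hut": 7.5})
print(store.lookup("new york"))  # 9.2
```

As with the real structure, the sorted-list layout is compact and fast to query but cannot be updated entry by entry; it must be rebuilt as a whole.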
The compact segment representation has an important disadvantage with regard to updating the database: it is always constructed as a whole structure, with no possibility of updating a single entry. However, the whole database creation from $F$, i.e. collecting the segments, scaling down to $D$ and building its finite-state representation, is carried out within 3 hours on a single host. This allows for daily updates. All the above procedures have been applied to the frequency list of query logs $F$ and its derivatives. Another interesting source for fine-tuning the connexity values of segments is the textual part of web documents, the web corpus. The connexity of a segment can be modified by the mutual information values of its usage in documents as opposed to its usage in queries. However, our current experience is to give more weight to the second perspective, i.e. the user's expectation of segments as expressed in the query logs.
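One simple way to realize this weighting (the interpolation form and the weight $\alpha$ are our assumptions for illustration, not a formula from the system) is a convex combination of the two mutual-information estimates, where $\mathrm{MI}_Q$ is computed from the query logs and $\mathrm{MI}_D$ from the web corpus:

$$\mathrm{connexity}(s) \;=\; \mathrm{freq}_Q(s)\cdot\bigl(\alpha\,\mathrm{MI}_Q(s) + (1-\alpha)\,\mathrm{MI}_D(s)\bigr),\qquad \tfrac{1}{2} < \alpha \le 1,$$

with $\alpha$ above one half reflecting the stated preference for the query-log perspective.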
The database of segments and connexities may be applied to an arbitrary text, preferably a query, to split it into segments according to a segmentation procedure. The procedure matches all possible subsequences of the given tokenized and normalized query against the segments database and assigns a connexity value to each matched segment (segment valuation). Recognized non-overlapping segments make up a single segmentation $S$. The segmentation procedure computes $\mathrm{score}(S)$, i.e. a segmentation score, for each segmentation. For a given query, the computed segmentations are listed sorted by $\mathrm{score}(S)$, with the connexities of the segments given in square brackets.
The function $\mathrm{score}(S)$ computes the sum of the connexities of the non-trivial segments (those longer than one token) in $S$, as sketched below. This scoring gives us a sorting criterion and a measure of how far apart the particular segmentation alternatives are. In addition, we may keep the function parametrizable in order to make the scoring depend on, e.g., aspects related to the query contents.
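A minimal sketch of such a segmentation procedure (our own illustrative code, not the system's implementation): enumerate all decompositions of the token sequence into known multi-token segments and single tokens, and score each by summing the connexities of its non-trivial segments.

```python
def segmentations(tokens, connexity):
    """Yield all decompositions of tokens into non-overlapping segments;
    multi-token segments must appear in the connexity dictionary."""
    if not tokens:
        yield []
        return
    for j in range(1, len(tokens) + 1):
        seg = tuple(tokens[:j])
        if j == 1 or seg in connexity:
            for rest in segmentations(tokens[j:], connexity):
                yield [seg] + rest

def score(segmentation, connexity):
    # Sum connexities of the non-trivial (multi-token) segments only.
    return sum(connexity[s] for s in segmentation if len(s) > 1)

# Illustrative connexity values (not taken from the paper's data).
conn = {("computer", "wallpaper"): 8.0, ("free", "computer"): 1.0}
toks = "free computer wallpaper downloads".split()
best = max(segmentations(toks, conn), key=lambda s: score(s, conn))
print(best)  # [('free',), ('computer', 'wallpaper'), ('downloads',)]
```

The exhaustive enumeration is acceptable here only because queries are short; a production procedure would use the compact segment database and a dynamic-programming pass.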
The problem of query mining and segmentation appears to be self-referential: the better we are able to segment the queries, the more reliable the connexity values we collect for the resulting segments become. This leads to improved mining of the logs via iteration of the segmentation procedure. We can assume that most multiword queries should allow a complete segmentation, i.e. the whole query covered by segments longer than one token. This feature of a query is used for boosting the connexity values of its recognized segments, which unifies the two approaches; a sketch of such an iteration follows.
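A heavily simplified sketch of this iteration (the multiplicative boosting rule and its factor are our assumptions for illustration; `segment` is any best-scoring segmenter, e.g. the procedure from the previous sketch):

```python
def boost_connexities(query_log, connexity, segment, factor=1.1, rounds=3):
    """Iteratively re-segment the log with the current connexity values;
    whenever a query is completely covered by multi-token segments,
    boost those segments' connexity values."""
    for _ in range(rounds):
        updated = dict(connexity)
        for tokens in query_log:
            best = segment(tokens, connexity)
            if best and all(len(seg) > 1 for seg in best):
                # Complete segmentation: reward its segments.
                for seg in best:
                    updated[seg] = updated.get(seg, 0.0) * factor
        connexity = updated
    return connexity
```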
The essential goal of query segmentation is to integrate the segments' connexity into the general framework of query answering in the search engine. This goal becomes easier to achieve if we look at a segment's connexity in a query as the counterpart of a proximity value in documents. The proximity of two words $w_1$ and $w_2$, $\mathrm{prox}(w_1, w_2)$, can be described as a function of the number of tokens between them; the function typically takes higher values for lower distances between $w_1$ and $w_2$. Connexity values can be used to modify the shape of the $\mathrm{prox}$ function, e.g. by means of a proximity penalty for larger $w_1$-to-$w_2$ distances.
In the segmentations of a query such as msdn universal subscription, we see that the segment ``msdn universal'' receives a considerably higher connexity than ``universal subscription''. This can be interpreted as a higher proximity penalty for occurrences of msdn and universal, as opposed to a lower proximity penalty for a cooccurrence of universal and subscription. In general, the higher the static $\mathrm{connexity}(w_1\,w_2)$, the higher the proximity penalty for the cooccurrence of $w_1$ and $w_2$.
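A minimal sketch of a connexity-modified proximity function (the exact functional form is our assumption; the text above only states that higher static connexity should yield a higher penalty for larger distances):

```python
def prox(distance, connexity_w1w2, base=1.0):
    """Proximity score for two words at a given token distance in a
    document: decays with distance, and the decay is steepened by the
    static connexity of the word pair (assumed multiplicative form)."""
    return base / (1.0 + connexity_w1w2 * max(distance - 1, 0))

# Words of a high-connexity segment are penalized sharply when far apart:
print(prox(distance=5, connexity_w1w2=8.0))  # low proximity score
print(prox(distance=5, connexity_w1w2=0.5))  # much milder penalty
```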
We have presented some insights into our query segmentation methodology. The tests are based on a sample of query logs of our search engine [1]. We sketched our technology and gave exemplary analysis traces and numbers. Our current experience lets us conclude that proper segmentation of queries, especially finding in them the core of the user's inquiry and its weight/importance (i.e. connexity), brings a considerable quality gain for search. If integrated into an appropriate ranking formula, query segmentation and scoring delivers a valuable new block of information into the query answering pipeline. In general, one of the future issues is to integrate further means of modifying connexity values depending on various factors, such as features of the documents in which a segment occurs, e.g. their quality, popularity, etc.