Apache Lucene is an open source search software used by many tech giants, including Twitter and LinkedIn to power their sites' search features. It offers the ability to search for multiple words across multiple documents, otherwise known as a conjunction. How is Lucene able to quickly perform these conjunctions?
This is where the leapfrog algorithm comes in. The leapfrog algorithm is used to find word conjunctions across several documents. Before we dive into how this algorithm works, we should explain a few things first:
-
Lucene’s inverted index - Lucene makes use of an inverted index, which is a map of all the words and a list of the document IDs they reside in (we’ll call this list a posting list). We will refer to a posting list for a given word,
x, asp_xfrom now on. -
The
nextfunction - thenextfunction acts on a list of document IDs,next(p_x), and returns the value of the next index in that list. If the current index is undefined, thennext(p_x)will return the first element in that list. -
The
advancefunction - theadvancefunction is a general function that acts on any list and, given a value that matches the type within the list, will either return the first value that is greater than or equal to the provided value. In the context of a list of document IDs, given a listp_foo = [1, 12, 17, 25], ifadvance(p_foo, 12)is executed, then12will be returned since12exists in the list - this will be referred to as a match; but ifadvance(p_foo, 13)is executed, then17will be returned, since the first value found greater than13.
⚠️ NOTE: These functions are not necessarily representative of how they exist in code, but should give a rough idea of how they are implemented and are modified to make the explanation of this algorithm simpler.
Given this information, we can now explain how the leapfrog algorithm works:
- For each word in the conjunction, retrieve its posting list
- Sort all posting lists in ascending order of length
- Store the next element of the first posting list:
curr_doc_id = next(p_1) - For each posting list after
p_1, runadvance(p_x, doc_id) - Advance to the next posting list if a match is found - append the doc ID to a
hitslist ifadvance(p_x, doc_id)returns a match for all posting lists. Repeat from step 3. - If a match is not found, set
curr_doc_idtoadvance(p_1, last_doc_id), wherelast_doc_idis the doc ID returned by the last execution ofadvance. - Return the list of
hitsonce all elements inp_1have been exhausted.
To learn more about Lucene’s data structures and algorithms, you can watch the full talk given by Adrien Grand.