Interpolation search: Difference between revisions

Browse history interactively ← Previous editContent deleted Content addedVisual WikitextInline

Revision as of 04:30, 25 April 2010 editMichael Hardy (talk \| contribs)Administrators210,276 edits →Further Analysis: fixed incorrect capitals per WP:MOS ← Previous edit		Latest revision as of 12:27, 13 September 2024 edit undo158.248.77.209 (talk)No edit summary
(95 intermediate revisions by 68 users not shown)
Line 1:		Line 1:
			{{short description\|Searching algorithm}}
	{{Unreferenced\|date=April 2009}}
	{{~~Original~~ ~~research~~\|date=~~April~~ ~~2009~~}}		{{more citations needed\|date=September 2017}}
	'''Interpolation search''' (sometimes referred to as '''extrapolation search''') is an ] for ] for a given key value in an indexed array that has been ] by the values of the key. It parallels how humans search through a telephone book for a particular name, the key value by which the book's entries are ordered. In each search step it calculates where in the remaining ] the sought item might be based on the key values at the bounds of the search space and the value of the sought key, usually via a linear interpolation. The key value actually found at this estimated position is then compared to the key value being sought. If it is not equal, then depending on the comparison, the remaining search space is reduced to the part before or after the estimated position. Only if calculations on the size of differences between key values are sensible will this method work.

			'''Interpolation search''' is an ] for ] for a key in an array that has been ] by numerical values assigned to the keys (''key values''). It was first described by W. W. Peterson in 1957.<ref>{{cite journal\|author=W. W. Peterson\|year=1957\|title=Addressing for Random-Access Storage\|journal=IBM J. Res. Dev.\|volume=1\|issue=2\|pages=130–146\|doi=10.1147/rd.12.0130}}</ref> Interpolation search resembles the method by which people search a ] for a name (the key value by which the book's entries are ordered): in each step the algorithm calculates where in the remaining ] the sought item might be, based on the key values at the bounds of the search space and the value of the sought key, usually via a ]. The key value actually found at this estimated position is then compared to the key value being sought. If it is not equal, then depending on the comparison, the remaining search space is reduced to the part before or after the estimated position. This method will only work if calculations on the size of differences between key values are sensible.
	By comparison, the ] chooses always the middle of the remaining search space, discarding one half or the other, again depending on the comparison between the key value found at the estimated position and the key value sought. The remaining search space is reduced to the part before or after the estimated position. The ] uses equality only as it compares elements one-by-one from the start, ignoring any sorting.

			{{use dmy dates\|date=February 2022}}
			{{Infobox algorithm
			\|class=]
			\|image=Interpolation sort.gif
			\|caption=Visualization of the interpolation search algorithm in which 24 is the target value.
			\|data=]
			\|time=]
			\|space = ]
			\|best-time=]
			\|average-time=]<ref>{{cite web\|author=Simon Yuan\|title=Understanding The Complexity Of Interpolation Search, Seminar Advanced Algorithms and Data Structures\|url= https://cadmo.ethz.ch/education/lectures/HS18/SAADS/reports/17.pdf}}</ref>
			\|optimal=Yes
			}}

			By comparison, ] always chooses the middle of the remaining search space, discarding one half or the other, depending on the comparison between the key found at the estimated position and the key sought — it does not require numerical values for the keys, just a total order on them. The remaining search space is reduced to the part before or after the estimated position. The ] uses equality only as it compares elements one-by-one from the start, ignoring any sorting.

	On average the interpolation search makes about log(log(''n'')) comparisons (if the elements are uniformly distributed), where ''n'' is the number of elements to be searched. In the worst case (for instance where the numerical values of the keys increase exponentially) it can make up to ](''n'') comparisons.		On average the interpolation search makes about log(log(''n'')) comparisons (if the elements are uniformly distributed), where ''n'' is the number of elements to be searched. In the worst case (for instance where the numerical values of the keys increase exponentially) it can make up to ](''n'') comparisons.

	In ], interpolation is used to find an item near the one being searched for, then ] is used to find the exact item.		In interpolation-sequential search, interpolation is used to find an item near the one being searched for, then ] is used to find the exact item.

	==Performance==		==Performance==

	Using big-O notation, the performance of the interpolation algorithm can be shown to be O(log log N).<ref>Weiss, Mark Allen (2006). ''Data structures and problem solving using Java'', Pearson Addison Wesley</ref>		Using ], the performance of the interpolation algorithm on a data set of size ''n'' is ''O''(''n''); however under the assumption of a uniform distribution of the data on the linear scale used for interpolation, the performance can be shown to be ''O''(log log ''n'').<ref>Weiss, Mark Allen (2006). ''Data structures and problem solving using Java'', Pearson Addison Wesley</ref><ref>Armenakis, A. C., Garey, L. E., Gupta, R. D., An adaptation of a root finding method to searching ordered disk files, BIT Numerical Mathematics, Volume 25, Number 4 / December, 1985.</ref><ref>Sedgewick, Robert (1990), ''Algorithms in C'', Addison-Wesley</ref>

			Dynamic interpolation search extends the ''o''(log log ''n'') bound to other distributions and also supports ''O''(log ''n'') insertion and deletion.<ref name="b050">{{cite journal \| last=Mehlhorn \| first=Kurt \| last2=Tsakalidis \| first2=Athanasios \| title=Dynamic interpolation search \| journal=Journal of the ACM \| volume=40 \| issue=3 \| date=1993 \| issn=0004-5411 \| doi=10.1145/174130.174139 \| pages=621–634}}</ref><ref name="x493">{{cite book \| last=Andersson \| first=Arne \| last2=Mattsson \| first2=Christer \| title=Automata, Languages and Programming \| chapter=Dynamic interpolation search in o(log log n) time \| publisher=Springer Berlin Heidelberg \| publication-place=Berlin, Heidelberg \| volume=700 \| date=1993 \| isbn=978-3-540-56939-8 \| doi=10.1007/3-540-56939-1_58 \| page=15–27}}</ref>
	Practical performance of interpolation search depends on whether the reduced number of probes is outweighed by the more complicated calculations needed for each probe. It can be useful for locating a record in a large sorted file on disk, where each probe involves a disk seek and is much slower than the interpolation arithmetic.

			Practical performance of interpolation search depends on whether the reduced number of probes is outweighed by the more complicated calculations needed for each probe. It can be useful for locating a record in a large sorted file on disk, where each probe involves a disk seek and is much slower than the interpolation arithmetic.
	Index structures like ] also reduce the number of disk accesses, and are more often used to index on-disk data in part because they can index many types of data and can be updated online. Still, interpolation search may be useful when one is forced to search certain sorted but unindexed on-disk datasets.

			Index structures like ]s also reduce the number of disk accesses, and are more often used to index on-disk data in part because they can index many types of data and can be updated ]. Still, interpolation search may be useful when one is forced to search certain sorted but unindexed on-disk datasets.

	==Adaptation to different datasets==		==Adaptation to different datasets==
Line 21:		Line 37:
	When sort keys for a dataset are uniformly distributed numbers, linear interpolation is straightforward to implement and will find an index very near the sought value.		When sort keys for a dataset are uniformly distributed numbers, linear interpolation is straightforward to implement and will find an index very near the sought value.

	On the other hand, for a phone book sorted by name, the straightforward approach to interpolation search ~~doesn't~~ apply. The same high-level principles can still apply, though: one can estimate a name's position in the phone book using the relative frequencies of letters in names and use that as a probe location.		On the other hand, for a phone book sorted by name, the straightforward approach to interpolation search does not apply. The same high-level principles can still apply, though: one can estimate a name's position in the phone book using the relative frequencies of letters in names and use that as a probe location.

	Some interpolation search implementations may not work as expected when a run of equal key values exists. The simplest implementation of interpolation search won't necessarily select the first (or last) element of such a run~~. The interpolation calculation also must be written to avoid division by zero~~.		Some interpolation search implementations may not work as expected when a run of equal key values exists. The simplest implementation of interpolation search won't necessarily select the first (or last) element of such a run.

	==Book-based searching==		==Book-based searching==
	The conversion of names in a telephone book to some sort of number clearly will not provide numbers having a uniform distribution (except via immense effort such as sorting the names and calling them name #1, name #2, etc.) and further, it is well-known that some names are much more common than others (Smith, Jones,) Similarly with dictionaries, where there are many more words starting with some letters than others. Some publishers go to the effort of preparing marginal annotations or even cutting into the side of the pages to show markers for each letter so that at a glance a segmented interpolation can be performed.		The conversion of names in a telephone book to some sort of number clearly will not provide numbers having a uniform distribution (except via immense effort such as sorting the names and calling them name #1, name #2, etc.) and further, it is well known that some names are much more common than others (Smith, Jones,) Similarly with dictionaries, where there are many more words starting with some letters than others. Some publishers go to the effort of preparing marginal annotations or even cutting into the side of the pages to show markers for each letter so that at a glance a segmented interpolation can be performed.


	==Sample implementation==		==Sample implementation==

	The following ~~code example for the~~ ] is a simple implementation. At each stage it computes a probe position then as with the binary search, moves either the upper or lower bound in to define a smaller interval containing the sought value. Unlike the binary search which guarantees a halving of the interval's size with each stage, a misled interpolation may reduce/~~increase the mid index by only one, thus resulting in a worst~~-case efficiency of O(''n'').		The following ] code example is a simple implementation. At each stage it computes a probe position then as with the binary search, moves either the upper or lower bound in to define a smaller interval containing the sought value. Unlike the binary search which guarantees a halving of the interval's size with each stage, a misled interpolation may reduce/i-case efficiency of O(''n'').

	<~~source~~ lang="~~Java~~">		<syntaxhighlight lang="cpp" line="1">
			#include <cassert>
	public int interpolationSearch(int sortedArray, int toFind){
	// Returns index of toFind in sortedArray, or -1 if not found
	int low = 0;
	int high = sortedArray.length - 1;
	int mid;

	while (sortedArray <= toFind && sortedArray >= toFind) {
	mid = low + ((toFind - sortedArray) * (high - low)) / (sortedArray - sortedArray);

	if (sortedArray < toFind)
	low = mid + 1;
	else if (sortedArray > toFind) // Repetition of the comparison code is forced by syntax limitations.
	high = mid - 1;
	else
	return mid;
	}

	if (sortedArray == toFind)
	return low;
	else
	return -1; // Not found
	}

			/*
	</source>
			arr[low, high) is sorted, search the data "key" in this array,
	Notice that having probed the list at index ''mid'', for reasons of loop control administration, this code sets either ''high'' or ''low'' to be not ''mid'' but an adjacent index, which location is then probed during the next iteration. Since an adjacent entry's value will not be much different the interpolation calculation is not much improved by this one step adjustment, at the cost of an additional reference to distant memory such as disc.
			if "key" is found, return the corresponding index (NOT necessarily the highest possible index);
			if "key" is not found, just return low - 1

			How to verify that the algorithm is correct?
	Each iteration of the above code requires between five and six comparisons (the extra is due to the repetitions needed to distinguish the three states of < > and = via binary comparisons in the absence of a ]) plus some messy arithmetic, while the binary search algorithm can be written with one comparison per iteration and uses only trivial integer arithmetic. It would thereby search an array of a million elements with no more than twenty comparisons (involving accesses to slow memory where the array elements are stored); to beat that the interpolation search as written above would be allowed no more than three iterations.
			Proof:
			(finiteness: after one loop, the width of decreases strictly )

			Fist, high <--- high - 1
	==Further analysis==
			scenario 1. when low = high
	The above code is clearly derived from the binary search method and unfortunately the "inclusive bounds" version, despite being almost universally found in references, text books, and example code collections, is a poor version of the method. Much better can be done working from the "exclusive bounds" version. Consider the following algol-style pseudocode:
			scenario 2. when low < high, arr = arr
			scenario 3. when low < high, arr < arr, key < arr or key > arr
			scenario 4. when low < high, arr < arr, arr <= key <= arr

			Now let's analyze scenario 4:
	Given a collection of data symbolised by an array y(1:n), in sorted order, and a value V, find y(i) = V.
			Once entering the "while" loop, low <= middle <= high
	'''if''' n <= 0 '''then''' '''return'''(0); \|Evade vapid invocations.
	'''if''' (dyl:=y(1) - V) > 0 '''then''' '''return'''(0) \|V Precedes the first element?
	'''else''' '''if''' dyl = 0 '''then''' '''return'''(1); \|V Matches it?
	'''if''' (dyr:=y(n) - V) < 0 '''then''' '''return'''(-n) \|V Follows the last element?
	'''else''' '''if''' dyr = 0 '''then''' '''return'''(n); \|V Matches it?
	L:=1; R:=n; \|Exclusive bounds. The search span is elements L + 1 to R - 1.
	'''while''' (p:=R - L) > 1 '''do''' \|(R - L) = (number of elements + 1).
	p:='''Round'''(L - dyl*p/(dyr - dyl)); \|Linear interpolation to find x such that f(x) = V.
	'''if''' p >= R '''then''' p:=R - 1 \|Element R has already been inspected.
	'''else''' '''if''' p <= L '''then''' p:=L + 1; \|Likewise L. Ensures p is within the span.
	'''case'''(sign(d:=y(p) - V)) \|Compare the probed value against V, saving the difference in ''d''.
	-1: L:=p, dyl:=d; \|V > y(p), shift L up.
	0: '''return'''(p); \|V = y(p), success!
	+1: R:=p, dyr:=d; \|V < y(p), shift R down.
	'''otherwise'''; \|No other cases.
	'''wend'''; \|Continue searching with the reduced span.
	'''return'''(-L); \|If exhausted, the search failed.

	This works on the differences between ''V'' and the ''y'' values as this difference will be required for inspection and the interpolation, and emphasises that a value from the ''y'' collection should be extracted only once as it might be available only from slow storage, such as a disc file. The initial stage checks whether ''V'' is within the span of values offered by ''y'' by explicit coding, rather than attempt to encompass this check by the workings of the main search loop as in the first example. A linear interpolation requires two values from the ''y'' array anyway.

			let's analyse after one loop(if we don't return), whether "low > high" will occur
	Should ''V'' be within the span of ''y'', then the initialisation of ''L'' and ''R'' to 1 and ''n'' before the interpolation search begins acknowledges the condition "if ''V'' is to be found, it will be amongst elements ''L + 1'' to ''R - 1'' of the array ''y''", which is the style of the exclusive bounds version of the binary search. The ''while'' loop tests whether there are any such elements remaining, and with the differences ''dyl'' and ''dyr'' in hand, a probe position can be calculated by linear interpolation, rather than taking the mid-point of the index span as with the binary search.
			After one loop:
			case a1: "low" branch has been executed in this loop
			arr < key <= arr
			so we have middle < high
			so after this loop, we have
			low <= high
			case a2: "high" branch has been executed in this loop
			arr <= key < arr
			so we have low < middle
			so after this loop, we have
			low <= high

			so after one loop(if we don't return), we have "low <= high"
	Although the probe position will be within the span ''L'' to ''R'' because ''y(L) < V < y(R)'' the requirement is that it select one of the elements ''L + 1'' to ''R - 1'', thus the check, after which follows the reference to the new element ''y(p)'' and the saving of the difference ''y(p) - V'' for the three-way test that will adjust the bounds of the search, or, discover a match. Each iteration reduces the search span, but unlike the binary search method, the reduction may be only by one element rather than half the span. Nevertheless, it is certain to finish, provided that the selection of the probe position is always correct. The risk here is that there may be equal values in the ''y'' collection, and if it should happen that ''y(L) = y(R)'' the interpolation calculation will fail ignominiously.

	But this cannot happen. The initialisation ensures that ''y(L) < y(R)'' so the question becomes could it develop during the search? Evidently, at a new position, ''y(p)'' might be a part of a sequence of equal elements. If ''y(p) = V'' then the search ends without further ado. Otherwise, supposing that ''y(p) < V'' then ''L'' will take the value of ''p'' so the lower bound is now in the clump of equal values. The new interpolation selects an element (possibly ''L + 1'') and if ''y(p) > y(L)'' then there is no further difficulty. It cannot be less than ''y(L)'' but it may be equal. If so, it will also be less than ''V'' and so the lower bound will be shifted up again: the other bound's value will remain different. The corresponding argument ''mutatis mutandis'' applies for the other bound if it were the case that ''y(p) > V'' earlier.

			when we exit the "while" loop:
	Each iteration makes one reference to the ''y'' storage collection, but of course shuffles values about and engages in arithmetic much more complex than that needed for an iteration of the binary search. The time required to access elements of ''y'' may be so great as to overwhelm all other considerations. Performance assessment is difficult, because the behaviour depends not just on the number of elements to be searched, but also on their numerical value - their spacing and clumping. Comparisons between the two methods could be performed by assessing the running time of actual tests on a computer, but there then arise further questions as to the quality of the code generated by the compiler, etc.
			case b1: arr >= arr
			In the last loop, if "low" branch is executed, we know
			arr < k <= arr
			arr >= arr
			low <= high
			so we have
			arr < k <= arr = arr
			In the last loop, if "high" branch is executed, we know
			arr <= key < arr
			arr >= arr
			low <= high
			so we have
			arr = arr <= key < arr

			case b2: (arr < arr) && (arr > key):
	===Example runs===
			In the last loop, "low" must have been changed
	====Uniform distribution====
			so we have
	The example uses a test set of a dozen two-digit numbers (actually being the first twenty-four digits of pi), searching for the value 66 which is not present in that set.
			arr < key
	]
			so we have
	The black dots (with jagged line) show the twelve values of ''y'' (in sorted order, of course!) and the horizontal line marks the sought value ''V''. The successive colours red, magenta, green (yellow is too faint) and blue (or aqua) depict the interpolation line for the successive iterations, plus a vertical line from the intercept with the ''V'' line to the new probe position with a horizontal part for any rounding to an index position. Should however that position not be within the required span it is adjusted, as marked by the coloured dots. Depending on the sign of the comparison, either the upper or lower bound of the search span is replaced by the new position and the next interpolation will be performed on the reduced span.
			arr < key < arr

			case b3: (arr < arr) && (key > arr)
	In this example, although the first interpolation does very well, subsequent probes are all crammed against the left end of the working span while the upper bound of the span is not shifted. Unfortunately, a straight line applied to a curve will be consistently wrong within the span of its adjacent inflexion points.
			In the last loop, "high" must have been changed
			so we have
			key < arr
			so we have
			arr < arr < key < arr

			*/
	]
			template <typename T>
	With a larger number of values (more digits of pi; there are some duplicate pairs but still some pairs in 0-99 are missing) the deviations of the cumulative distribution from the diagonal line are much smaller and there are many more "crossovers". The linear interpolation search does much better, adjusting both bounds early on, and for this example, required just three iterations to search amongst a hundred points, whereas with twelve values to search four iterations were required.
			static Rank interpolation_search_v001(T* arr, const T& key, Rank low, Rank high)
			{
			high -= 1;
			int middle;
			int initial_low = low;

			while ((arr < arr) && (arr <= key) && (key <= arr)) {
	====Non-uniform distributions====
			middle = low + ((key - arr) * (high - low)) / (arr - arr);
	Instead of a few local deviations of the cumulative line from the diagonal, the deviation can be large and for a large part of the span, indeed the whole span. This can be seen by replacing the y-values of the example by ''exp(x)'' for x = 1 to 12. Despite all the attendant arithmetic, the result is no better than the sequential search, for each iteration reduces the search space by one element only! The successive interpolation positions are 1.0043, 2.0036, 3.0025, 4.0006, each of which rounds to the left bound and so is nudged up one. With a greater number of values to search, the interpolated positions are no better, being pushed even closer to the left bound.
	]
	Although that is a contrived extreme case, a more reasonable example is possible. Suppose for example a set of coordinates depicts the outline of a lake or an island, and it is desired to find the coordinate closest to a given position, or a line of latitude. The straightforward sequential search is discouraging with many points, so, the process would be started by preparing a sorted list of latitude values. Rather than employing an actual data set, for the purposes of an example, consider a circle (topologically equivalent to the coast of an island) with positions at one degree intervals and prepare a set of ''y'' values from 180(1 + sin(d2*pi/360)) for d = 1 to 360, and sort them.

			assert((low <= middle) && (middle <= high));
	]
	As can be seen, even though the positions are regularly spaced, their y-coordinates (or latitudes) are not, as whenever a sequence of points tends to run East-West (or horizontal) there will be many nearly-equal latitudes resulting. But despite the painfully large deviations between the interpolation lines and the curve, the search determined that 66 was not found in just four iterations, rather better than a binary search would have managed.

			if (arr < key)
	Although an equal element was not found (and is not to be expected for non-integer values of different provenance), the search identifies the index of the next lowest value in the ''y'' collection (here, element 101) so that the nearest positions can be scanned for locally. This is why the routine returns ''-L'' to indicate "Not found" rather than a constant such as zero which provides no hint as to locality.
			low = middle + 1;
			else if (key < arr)
			high = middle - 1;
			else
			return middle;
			}

			if (key == arr)
	===Non-linear interpolation===
			return low;
	Especially when confronted by a cumulative curve of a non-linear but apparently simple form, the regular errors of a linear interpolation prompt consideration of more complex methods. In special circumstances there may be a special function that could guide the search as in the idealistic "island" example where the shape is of course that of a sine wave and the ''arc sine'' function could be used, but actual coordinate collections will have no such smooth regularity.
			else
			return initial_low - 1;
			}

	For a general method the obvious escalation is to a quadratic curve, fitted to some selection of points, that hopefully will follow the curve nicely. Ah, but what selection of points? And as the search progresses, the selection should be changed in some judicious way, but, what way? If the whole curve were well-approximated by a quadratic curve, then that would be convenient, but aside from special cases it is unlikely that a quadratic curve will offer a good fit to anything beyond local segments of the whole span, and extrapolation outside the fit would be particularly risky. For the "island" example, a quadratic would fit well only to the first or second half, not at all well to the whole. A cubic might do well, in the form ''x = f(y)''.

			/*
	In the case of the linear interpolation, two points are needed and they can obviously be the ''y'' values associated with the ''L'' and ''R'' bounds; the chosen points will be modified as the search procedure modifies these bounds. For a quadratic a third point is needed, plus additional administrative effort to keep the collection current as the search proceeds. One possible scheme is to start by taking the middle of the span, and if that ''y'' element doesn't match V, proceed with the three-point interpolation to give a probe position ''q''. Based on the ''y(q)'' value, adjust the ''L'' and ''R'' bounds as before, but now there is additional work: the collection of points used for the interpolation must also be adjusted. If ''q'' is less than ''p'' then the point associated with the old ''R'' is discarded, otherwise the old ''L'', while if they are equal, no change. Thus the next iteration will go forward with point ''p'', ''q'' and either ''L'' or ''R'' or else retain the same points, but with different values for ''L'' or ''R''. This is not of course the only method, but it will do to be going on with. A dotted line joins the first and last point of the three corresponding to the linear interpolation, but the quadratic interpolation is shown by the continuous curves.
			search "key" in the sorted array arr[low, high),
	]
			return: the highest index i such that arr <= key
	The plot shows some rather surprising curves. Although the usual fitting exercise is in terms of ''y = f(x)'', the objective of the interpolation is to find some ''x''-value corresponding to ''y = V'' so the fit is in terms of ''x = f(y)''; the curves should be viewed as being rotated ninety degrees from the usual. The auxiliary plot for the third loop of the quadratic interpolation search example shows the details.
	]
	If the usual orientation were employed (the red curve, ''y = p(x)'') the parabola fits the points nicely but finding the value of ''x'' such that ''p(x) = V'' involves the classic solution of a quadratic equation, which requires a square root evaluation and produces two values, only one of which will be suitable if indeed there are any real solutions at all since there is no guarantee that the curve will intersect ''V'' as might be the case were ''V'' to have the value 60 instead. By interpolating with ''x = p(y)'' (the green line) finding the required ''x'' value is simply a matter of evaluating ''p(V)'', and given that the method works not with the ''y'' values directly but ''y - V'', the evaluation needed is of ''p(0)'' which is particularly easy.

			How to verify that the algorithm is correct?
	But alas, the green line is not at all close to the segmented straight line joining the points, so interpolations based on it will not be particularly good, though at other stages the interpolation involved a good fit. Even so, this search ended with three iterations as compared to the four of the linear interpolation, not that this is guaranteed.
			Proof:
			finiteness: after one loop, the width of decreases strictly

			Fist, high <---- high - 1
	Other rules might be used for the collection of interpolating points, such as maintaining a list of the probe positions in order- that is, starting with positions ''L, R, p'', move on to ''R p q1'' then ''p q1 q2'' and so on. It would also be possible to vary the order of the interpolation polynomial - since one value only, ''p(0)'' is required, Aitken's method would be appropriate. Yet another approach might be to have successive iterations alternate between the mid-point of the span and an interpolated position. In all cases, it will be necessary to forestall division by zero (if any two ''y''-values were equal) or confusedly taking the same point more than once. Enough.
			scenario 1. when low = high
			scenario 2. when low < high, key < arr or arr <= key
			scenario 3. when low < high, arr <= key < arr

			Now let's analyze scenario 3:
	==See also==
			Once entering the "while" loop, low <= middle < high
	*]
	*]
	*]
	*]

			when we exit the "while" loop:
	==External links==
			case a1: key < arr
	*
			so "low" is changed in the last loop, we know
	*
			arr <= key < arr
			case a2: arr <= key
			so "high" is changed in the last loop, we know
			key < arr, impossible

			conclusion: we should return "low - 1"
	] \| ]

			*/
			template <typename T>
			static Rank interpolation_search_v002(T* arr, const T& key, Rank low, Rank high) {
			high -= 1;
			assert(low <= high);
			Rank middle;

			if(key < arr) {
			return low - 1;
			}
			if(arr <= key) {
			return high;
			}

			// now low < high , arr <= key < arr
			while ((arr <= key) && (key < arr)) {
			middle = low + ((high - low) * (key - arr)) / (arr - arr);

			assert((low <= middle) && (middle < high));

			if(key < arr) {
			high = middle;
			} else {
			low = middle + 1;
			}
			}

			return low - 1;
			}


			</syntaxhighlight>
			Notice that having probed the list at index ''mid'', for reasons of loop control administration, this code sets either ''high'' or ''low'' to be not ''mid'' but an adjacent index, which location is then probed during the next iteration. Since an adjacent entry's value will not be much different, the interpolation calculation is not much improved by this one step adjustment, at the cost of an additional reference to distant memory such as disk.

			Each iteration of the above code requires between five and six comparisons (the extra is due to the repetitions needed to distinguish the three states of < > and = via binary comparisons in the absence of a ]) plus some messy arithmetic, while the ] can be written with one comparison per iteration and uses only trivial integer arithmetic. It would thereby search an array of a million elements with no more than twenty comparisons (involving accesses to slow memory where the array elements are stored); to beat that, the interpolation search, as written above, would be allowed no more than three iterations.

			==See also==
			* ]
			* ]
			* ],<ref>{{Cite journal \|last=Mohammed \|first=Adnan Saher \|last2=Amrahov \|first2=Şahin Emrah \|last3=Çelebi \|first3=Fatih V. \|date=2021-10-01 \|title=Interpolated binary search: An efficient hybrid search algorithm on ordered datasets \|url=https://www.sciencedirect.com/science/article/pii/S221509862100046X \|journal=Engineering Science and Technology\|language=en \|volume=24 \|issue=5 \|pages=1072–1079 \|doi=10.1016/j.jestch.2021.02.009 \|issn=2215-0986\|doi-access=free }}</ref> a hybrid of interpolated search and binary search
			* ]
			* ]
			* ]
			* ]
			* ], using the distribution of values for ] rather than searching

			==References==
	<references/>		<references/>

			==External links==
	]
			*
	]
			*
	]
			*

			{{DEFAULTSORT:Interpolation Search}}
			]

Latest revision as of 12:27, 13 September 2024

Searching algorithm

This article needs additional citations for verification. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed.
Find sources: "Interpolation search" – news · newspapers · books · scholar · JSTOR (September 2017) (Learn how and when to remove this message)

Interpolation search is an algorithm for searching for a key in an array that has been ordered by numerical values assigned to the keys (key values). It was first described by W. W. Peterson in 1957. Interpolation search resembles the method by which people search a telephone directory for a name (the key value by which the book's entries are ordered): in each step the algorithm calculates where in the remaining search space the sought item might be, based on the key values at the bounds of the search space and the value of the sought key, usually via a linear interpolation. The key value actually found at this estimated position is then compared to the key value being sought. If it is not equal, then depending on the comparison, the remaining search space is reduced to the part before or after the estimated position. This method will only work if calculations on the size of differences between key values are sensible.

Interpolation search
Visualization of the interpolation search algorithm in which 24 is the target value.
Class	Search algorithm
Data structure	Array
Worst-case performance	O(n)
Best-case performance	O(1)
Average performance	O(log(log(n)))
Worst-case space complexity	O(1)
Optimal	Yes

By comparison, binary search always chooses the middle of the remaining search space, discarding one half or the other, depending on the comparison between the key found at the estimated position and the key sought — it does not require numerical values for the keys, just a total order on them. The remaining search space is reduced to the part before or after the estimated position. The linear search uses equality only as it compares elements one-by-one from the start, ignoring any sorting.

On average the interpolation search makes about log(log(n)) comparisons (if the elements are uniformly distributed), where n is the number of elements to be searched. In the worst case (for instance where the numerical values of the keys increase exponentially) it can make up to O(n) comparisons.

In interpolation-sequential search, interpolation is used to find an item near the one being searched for, then linear search is used to find the exact item.

Performance

Using big-O notation, the performance of the interpolation algorithm on a data set of size n is O(n); however under the assumption of a uniform distribution of the data on the linear scale used for interpolation, the performance can be shown to be O(log log n).

Dynamic interpolation search extends the o(log log n) bound to other distributions and also supports O(log n) insertion and deletion.

Practical performance of interpolation search depends on whether the reduced number of probes is outweighed by the more complicated calculations needed for each probe. It can be useful for locating a record in a large sorted file on disk, where each probe involves a disk seek and is much slower than the interpolation arithmetic.

Index structures like B-trees also reduce the number of disk accesses, and are more often used to index on-disk data in part because they can index many types of data and can be updated online. Still, interpolation search may be useful when one is forced to search certain sorted but unindexed on-disk datasets.

Adaptation to different datasets

When sort keys for a dataset are uniformly distributed numbers, linear interpolation is straightforward to implement and will find an index very near the sought value.

On the other hand, for a phone book sorted by name, the straightforward approach to interpolation search does not apply. The same high-level principles can still apply, though: one can estimate a name's position in the phone book using the relative frequencies of letters in names and use that as a probe location.

Some interpolation search implementations may not work as expected when a run of equal key values exists. The simplest implementation of interpolation search won't necessarily select the first (or last) element of such a run.

Book-based searching

The conversion of names in a telephone book to some sort of number clearly will not provide numbers having a uniform distribution (except via immense effort such as sorting the names and calling them name #1, name #2, etc.) and further, it is well known that some names are much more common than others (Smith, Jones,) Similarly with dictionaries, where there are many more words starting with some letters than others. Some publishers go to the effort of preparing marginal annotations or even cutting into the side of the pages to show markers for each letter so that at a glance a segmented interpolation can be performed.

Sample implementation

The following C++ code example is a simple implementation. At each stage it computes a probe position then as with the binary search, moves either the upper or lower bound in to define a smaller interval containing the sought value. Unlike the binary search which guarantees a halving of the interval's size with each stage, a misled interpolation may reduce/i-case efficiency of O(n).

#include <cassert>
/*
arr[low, high) is sorted, search the data "key" in this array,
if "key" is found, return the corresponding index (NOT necessarily the highest possible index);
if "key" is not found, just return low - 1
How to verify that the algorithm is correct?
Proof:
(finiteness: after one loop, the width of  decreases strictly )
Fist, high <--- high - 1
scenario 1. when low = high
scenario 2. when low < high, arr = arr
scenario 3. when low < high, arr < arr, key < arr or key > arr
scenario 4. when low < high, arr < arr, arr <= key <= arr
Now let's analyze scenario 4:
Once entering the "while" loop,  low <= middle <= high
    let's analyse after one loop(if we don't return), whether "low > high" will occur
    After one loop:
        case a1: "low" branch has been executed in this loop
            arr < key <= arr
            so we have middle < high
            so after this loop, we have
            low <= high
        case a2: "high" branch has been executed in this loop
            arr <= key < arr
            so we have low < middle
            so after this loop, we have
            low <= high
    so after one loop(if we don't return), we have "low <= high"
    when we exit the "while" loop:
        case b1: arr >= arr
            In the last loop, if "low" branch is executed, we know
                arr < k <= arr
                arr >= arr
                low <= high
                so we have 
                arr < k <= arr = arr
            In the last loop, if "high" branch is executed, we know
                arr <= key < arr
                arr >= arr
                low <= high
                so we have
                arr = arr <= key < arr
        case b2: (arr < arr) && (arr > key):
            In the last loop, "low" must have been changed
            so we have
            arr < key
            so we have 
            arr < key < arr
        case b3: (arr < arr) && (key > arr)
            In the last loop, "high" must have been changed
            so we have
            key < arr
            so we have
            arr < arr < key < arr
*/
template <typename T>
static Rank interpolation_search_v001(T* arr, const T& key, Rank low, Rank high)
{
    high -= 1;
    int middle;
    int initial_low = low;
    while ((arr < arr) && (arr <= key) && (key <= arr)) {
        middle = low + ((key - arr) * (high - low)) / (arr - arr);
        assert((low <= middle) && (middle <= high));
        if (arr < key)
            low = middle + 1;
        else if (key < arr)
            high = middle - 1;
        else
            return middle;
    }
    if (key == arr)
        return low;
    else
        return initial_low - 1;
}
/*
search "key" in the sorted array arr[low, high),
return: the highest index i such that arr <= key
How to verify that the algorithm is correct?
Proof:
finiteness: after one loop, the width of  decreases strictly 
Fist, high <---- high - 1
scenario 1. when low = high
scenario 2. when low < high, key < arr or arr <= key
scenario 3. when low < high, arr <= key < arr
Now let's analyze scenario 3:
Once entering the "while" loop,  low <= middle < high
when we exit the "while" loop:
    case a1: key < arr
        so "low" is changed in the last loop, we know
        arr <= key < arr
    case a2: arr <= key
        so "high" is changed in the last loop, we know
        key < arr, impossible
conclusion: we should return "low - 1"
*/
template <typename T> 
static Rank interpolation_search_v002(T* arr, const T& key, Rank low, Rank high) {
    high -= 1;
    assert(low <= high);
    Rank middle;
    if(key < arr) {
        return low - 1;
    } 
    if(arr <= key) {
        return high;
    }
    // now low < high ,  arr <= key < arr
    while ((arr <= key) && (key < arr)) {
        middle = low + ((high - low) * (key - arr)) / (arr - arr);
        assert((low <= middle) && (middle < high));
        if(key < arr) {
            high = middle;
        } else {
            low = middle + 1;
        } 
    }
    return low - 1;
}

Notice that having probed the list at index mid, for reasons of loop control administration, this code sets either high or low to be not mid but an adjacent index, which location is then probed during the next iteration. Since an adjacent entry's value will not be much different, the interpolation calculation is not much improved by this one step adjustment, at the cost of an additional reference to distant memory such as disk.

Each iteration of the above code requires between five and six comparisons (the extra is due to the repetitions needed to distinguish the three states of < > and = via binary comparisons in the absence of a three-way comparison) plus some messy arithmetic, while the binary search algorithm can be written with one comparison per iteration and uses only trivial integer arithmetic. It would thereby search an array of a million elements with no more than twenty comparisons (involving accesses to slow memory where the array elements are stored); to beat that, the interpolation search, as written above, would be allowed no more than three iterations.

References

W. W. Peterson (1957). "Addressing for Random-Access Storage". IBM J. Res. Dev. 1 (2): 130–146. doi:10.1147/rd.12.0130.
Simon Yuan. "Understanding The Complexity Of Interpolation Search, Seminar Advanced Algorithms and Data Structures" (PDF).
Weiss, Mark Allen (2006). Data structures and problem solving using Java, Pearson Addison Wesley
Armenakis, A. C., Garey, L. E., Gupta, R. D., An adaptation of a root finding method to searching ordered disk files, BIT Numerical Mathematics, Volume 25, Number 4 / December, 1985.
Sedgewick, Robert (1990), Algorithms in C, Addison-Wesley
Mehlhorn, Kurt; Tsakalidis, Athanasios (1993). "Dynamic interpolation search". Journal of the ACM. 40 (3): 621–634. doi:10.1145/174130.174139. ISSN 0004-5411.
Andersson, Arne; Mattsson, Christer (1993). "Dynamic interpolation search in o(log log n) time". Automata, Languages and Programming. Vol. 700. Berlin, Heidelberg: Springer Berlin Heidelberg. p. 15–27. doi:10.1007/3-540-56939-1_58. ISBN 978-3-540-56939-8.
Mohammed, Adnan Saher; Amrahov, Şahin Emrah; Çelebi, Fatih V. (1 October 2021). "Interpolated binary search: An efficient hybrid search algorithm on ordered datasets". Engineering Science and Technology. 24 (5): 1072–1079. doi:10.1016/j.jestch.2021.02.009. ISSN 2215-0986.

External links

Category:

Search algorithms