| About   «  22.6. Perfect Hashing   ::   Contents   ::   23.2. Adversarial Lower Bounds Proofs  »

23.1. Finding the Maximum Value

23.1.1. Finding the Maximum Value

How can we find the \(i\) th largest value in a sorted list? Obviously we just go to the \(i\) th position. But what if we have an unsorted list? Can we do better than to sort it? If we are looking for the minimum or maximum value, certainly we can do better than sorting the list. Is this true for the second biggest value? For the median value? In later sections we will examine those questions. For this section, we will continue our examination of lower bounds proofs by reconsidering the simple problem of finding the maximum value in an unsorted list.

Here is a simple algorithm for finding the largest value.

// Return position of largest value in integer array A
static int largest(int[] A) {
  int currlarge = 0;             // Position of largest element seen
  for (int i=1; i<A.length; i++) // For each element
    if (A[currlarge] < A[i])     //   if A[i] is larger
       currlarge = i;            //     remember its position
  return currlarge;              // Return largest position
}
/** Return position of largest value in integer array A */
int largest(int A[], int size) {
  int currlarge = 0;             // Position of largest element seen
  for (int i=1; i<size; i++)     // For each element
    if (A[currlarge] < A[i])     //   if A[i] is larger
       currlarge = i;            //     remember its position
  return currlarge;              // Return largest position
}

Obviously this algorithm requires \(n\) comparisons. Is this optimal? It should be intuitively obvious that it is, but let us try to prove it. (Before reading further you might try writing down your own proof.)

This proof is clearly wrong, because the winner does not need to explicitly compare against all other elements to be recognized. For example, a standard single-elimination playoff sports tournament requires only \(n-1\) comparisons, and the winner does not play every opponent. So let’s try again.

This proof is sound. However, it will be useful later to abstract this by introducing the concept of posets. We can view the maximum-finding problem as starting with a poset where there are no known relationships, so every member of the collection is in its own separate DAG of one element.

What is the average cost of largest? Because it always does the same number of comparisons, clearly it must cost \(n-1\) comparisons. We can also consider the number of assignments that largest must do. Function largest might do an assignment on any iteration of the for loop.

Because this event does happen, or does not happen, if we are given no information about distribution we could guess that an assignment is made after each comparison with a probability of one half. But this is clearly wrong. In fact, largest does an assignment on the \(i\) th iteration if and only if A [\(i\)] is the biggest of the the first \(i\) elements. Assuming all permutations are equally likely, the probability of this being true is \(1/i\). Thus, the average number of assignments done is

\[1 + \sum_{i=2}^n \frac{1}{i} = \sum_{i=1}^n \frac{1}{i}\]

which is the Harmonic Series \({\cal H}_n\).

\[{\cal H}_n = \Theta(\log n).\]

More exactly, \({\cal H}_n\) is close to \(\log_e n\).

How “reliable” is this average? That is, how much will a given run of the program deviate from the mean cost? According to Chebysev’s Inequality, an observation will fall within two standard deviations of the mean at least 75% of the time. For Largest, the variance is

\[{\cal H}_n - \frac{\pi^2}{6} = \log_e n - \frac{\pi^2}{6}.\]

The standard deviation is thus about \(\sqrt{\log_e n}\). So, 75% of the observations are between \(\log_e n - 2\sqrt{\log_e n}\) and \(\log_e n + 2\sqrt{\log_e n}\). Is this a narrow spread or a wide spread? Compared to the mean value, this spread is pretty wide, meaning that the number of assignments varies widely from run to run of the program. Here are a few values:

  • For \(n = 100\), 75% of the values fall in the range \(4.605 \pm 4.30\) (so 25% are even further from the mean)

  • For \(n = 1000\), 75% of the values fall in the range \(6.908 \pm 5.26\).

  • For \(n = 1,000,000\), 75% of the values fall in the range \(13.816 \pm 7.43\).

23.1.2. Acknowledgement

This page borrows heavily from presentation in Section 3.2 of Compared to What? by Gregory J.E. Rawlins.

   «  22.6. Perfect Hashing   ::   Contents   ::   23.2. Adversarial Lower Bounds Proofs  »

Close Window