Programming: Median of medians



Programming: Median of medians

Quickselect is linear-time on average, but it can require quadratic time with poor pivot choices. This is because quickselect is a decrease and conquer algorithm, with each step taking O(n) time in the size of the remaining search set. If the search set decreases exponentially quickly in size (by a fixed proportion), this yields a geometric series times the O(n) factor of a single step, and thus linear overall time. However, if the search set decreases slowly in size, such as linearly (by a fixed number of elements, in the worst case only reducing by one element each time), then a linear sum of linear steps yields quadratic overall time. For example, the worst case occurs when pivoting on the smallest element at each step, such as applying quickselect for the maximum element to already sorted data and taking the first element as pivot each time.

If one instead consistently chooses "good" pivots, this is avoided and one always gets linear performance even in the worst case. A "good" pivot is one for which we can establish that a constant proportion of elements fall both below and above it, as then the search set decreases at least by a constant proportion at each step, hence exponentially quickly, and the overall time remains linear. The median is a good pivot – the best for sorting, and the best overall choice for selection – decreasing the search set by half at each step. Thus if one can compute the median in linear time, this only adds linear time to each step, and thus the overall complexity of the algorithm remains linear.

The median-of-medians algorithm does not actually compute the exact median, but computes an approximate median, namely a point that is guaranteed to be between the 30th and 70th percentiles . Thus the search set decreases by a fixed proportion at each step, namely at least 30% (so at most 70% left). Lastly, the overhead of computing the pivot consists of recursing in a set of size 20% the size of the original search set, plus a linear factor, so at linear cost at each step, the problem is reduced to 90% (20% and 70%) the original size, which is a fixed proportion smaller, and thus by induction the overall complexity is linear in size.


Median-of-medians algorithm:
  • Line up elements in groups of five (this number 5 is not important, it could be e.g. 7 without changing the algorithm much). Call each group S[i], with i ranging from 1 to n/5.
  • Find the median of each group. (Call this x[i]). This takes 6 comparisons per group, so 6n/5 total (it is linear time because we are taking medians of very small subsets).
  • Find the median of the x[i], using a recursive call to the algorithm. If we write a recurrence in which T(n) is the time to run the algorithm on a list of n items, this step takes time T(n/5). Let M be this median of medians.
  • Use M to partition the input and call the algorithm recursively on one of the partitions, just like in quickselect.
The chosen pivot is both less than and greater than half of the elements in the list of medians, which is around n/10 elements (½×n/5) for each half. Each of these elements is a median of 5, making it less than 2 other elements and greater than 2 other elements outside the block. Hence, the pivot is less than 3(n/10) elements outside the block, and greater than another 3(n/10) elements inside the block. Thus the chosen median splits the elements somewhere between 30%/70% and 70%/30%, which assures worst-case linear behavior of the algorithm. To visualize:

Read full article from Programming: Median of medians


No comments:

Post a Comment

Labels

Algorithm (219) Lucene (130) LeetCode (97) Database (36) Data Structure (33) text mining (28) Solr (27) java (27) Mathematical Algorithm (26) Difficult Algorithm (25) Logic Thinking (23) Puzzles (23) Bit Algorithms (22) Math (21) List (20) Dynamic Programming (19) Linux (19) Tree (18) Machine Learning (15) EPI (11) Queue (11) Smart Algorithm (11) Operating System (9) Java Basic (8) Recursive Algorithm (8) Stack (8) Eclipse (7) Scala (7) Tika (7) J2EE (6) Monitoring (6) Trie (6) Concurrency (5) Geometry Algorithm (5) Greedy Algorithm (5) Mahout (5) MySQL (5) xpost (5) C (4) Interview (4) Vi (4) regular expression (4) to-do (4) C++ (3) Chrome (3) Divide and Conquer (3) Graph Algorithm (3) Permutation (3) Powershell (3) Random (3) Segment Tree (3) UIMA (3) Union-Find (3) Video (3) Virtualization (3) Windows (3) XML (3) Advanced Data Structure (2) Android (2) Bash (2) Classic Algorithm (2) Debugging (2) Design Pattern (2) Google (2) Hadoop (2) Java Collections (2) Markov Chains (2) Probabilities (2) Shell (2) Site (2) Web Development (2) Workplace (2) angularjs (2) .Net (1) Amazon Interview (1) Android Studio (1) Array (1) Boilerpipe (1) Book Notes (1) ChromeOS (1) Chromebook (1) Codility (1) Desgin (1) Design (1) Divide and Conqure (1) GAE (1) Google Interview (1) Great Stuff (1) Hash (1) High Tech Companies (1) Improving (1) LifeTips (1) Maven (1) Network (1) Performance (1) Programming (1) Resources (1) Sampling (1) Sed (1) Smart Thinking (1) Sort (1) Spark (1) Stanford NLP (1) System Design (1) Trove (1) VIP (1) tools (1)

Popular Posts