================ Start Lecture #22 ================

4.7 Selection (officially skipped; unofficial comments follow)

Selection means the ability to find the kth smallest element. Sorting will do it, but there are faster (comparison-based) methods. One example problem is finding the median (N/2 th smallest).

It is not too hard (but not easy) to implement selection with linear expected time. The surprising and difficult result is that there is a version with linear worst-case time.

4.7.1 Prune-and-Search

The idea is to prune away parts of the set that cannot contain the desired element. This is easy to do as seen in the next algorithm. The less easy part is to show that it takes O(n) expected time. The hard part is to modify the algorithm so that it takes O(n) worst case time.

4.7.2 Randomized Quick-Select

Algorithm quickSelect(S,k)
   Input:  A sequence S of n elements and an integer k in [1,n]
   Output: The kth smallest element of S

   if n=1 the return the (only) element in S

   pick a random element x of X
   divide S into 3 sequences
      L, the elements of S that are less than x
      E, the elements of S that are equal to x
      G, the elements of S that are greater than x

   { Now we reduce the search to one of these three sets }

   if k≤|L|      then return quickSelect(L,k)
   if k>|L|+|E|  then return quickSelect(G,k-|L|+|E|
   return x  { We want an element in E; all are equal to x }

4.7.3 Analyzing Randomized Quick-Select (skipped)

4.8 Java Example: In-Place Quick-Sort (skipped)

Chapter 5 Fundamental Techniques

5.1 The Greedy Method

The greedy method is applied to maximization/minimization problems. The idea is to at each decision point choose the configuration that maximizes/minimizes the objective function so far. Clearly this does not lead to the global max/min for all problems, but it does for a number of problems.

This chapter does not make a good case for the greedy method. The method is used to solve simple variants of standard problems, but the the standard variants are not solved with the greedy method. There are better examples, for example the minimal spanning tree and shortest path graph problems. The two algorithms chosen for this section, fractional knapsack and task scheduling, were (presumably) chosen because they are simple and natural to solve with the greedy method.

5.1.1 The Fractional Knapsack Method

In the knapsack problem we have a knapsack of a fixed capacity (say W pounds) and different items i each with a given weight w_i and a given benefit b_i. We want to put items into the knapsack so as to maximize the benefit subject to the constraint that the sum of the weights must be less than W.

The knapsack problem is actually rather difficult in the normal case where one must either put an item in the knapsack or not. However, in this section, in order to illustrate greedy algorithms, we consider a much simpler variation in which we can take a portion, say x_i≤w_i, of an item and get a proportional part of the benefit. This is called the ``fractional knapsack problem'' since we can take a fraction of an item. (The more common knapsack problem is called the ``0-1 knapsack problem'' since we must either take all (1) or none (0) of an item).

More formally, for each item i we choose an amount x_i (0≤x_i≤w_i) that we will place in the knapsack. We are subject to the constraint that the sum of the x_i is no more than W since that is all the knapsack can hold.

We desire to maximize the total benefit. Since, for item i, we only put x_i in the knapsack, we don't get the full benefit. Specifically we get benefit (x_i/w_i)b_i.

But now this is easy!

Item i has benefit b_i and weighs w_i.
So its value (per pound) is v_i=b_i/w_i.
We make the greedy choice and pick the most valuable item and take all of it or as much as the knapsack can hold.
Then we move to the second most valuable and do the same.
This clearly is optimal since it never makes sense to leave over some valuable item to take some less valuable item.

Why doesn't this work for the normal knapsack problem when we must take all of an item or none of it?

Example: W=6, w₁=4, w₂=w₃=3, b₁=5, b₂=b₃=3.
So v₁=5/4, v₂=v₃=1.
Start with the most valuable item, number 1 and put it n the knapsack.
Knapsack can hold 6-4=2 more "pounds" but remaining items won't fit
So the total benefit carried is 5.
The right solution is to take items 2 and 3 for a total benefit of 6.
The difference between this item and fractional knapsack is that the knapsack can still hold 2/3 of item 2, but we can't take part of an item as we can in the fractional knapsack problem.

Algorithm

algorithm FractionalKnapsack(S,W):
   Input:  Set S of items i with weight wi and benefit bi all positive.
           Knapsack capacity W>0.
   Output: Amount xi of i that maximizes the total benefit without
           exceeding the capacity.

   for each i in S do
      xi ← 0        { for items not chosen in next phase }
      vi ← bi/wi    { the value of item i "per pound" }
   w ← W            { remaining capacity in knapsack }

   while w > 0 and S is not empty
      remove from S an item of maximal value   { greedy choice }
      xi ← min(wi,w)  { can't carry more than w more }
      w ← w-xi

Analysis

FractionalKnapsack has time complexity O(NlogN) where N is the number of items in S.

The book suggests assuming S is a heap-based priority queue and then the removal has complexity Θ(logN) so the up to N removals take O(NlogN). The rest of the algorithm is O(N).
Alternatively, S could be a sequence and we could begin FractionalKnapsack by sorting S with a Θ(NlogN) sort. Now the removal is simply removing the first element. If we use a circular list for S, the removal is O(1) so the algorithm is O(N). Including the sort we again have O(NlogN).

Homework: R-5.1

5.1.2 Task Scheduling

We again consider an easy variant of a well known, but difficult, optimization problem.

We have a set T of N tasks, each with a start time s_i and a finishing time f_i (s_i<f_i).
Each task must start at time s_i and will finish at time f_i.
Each task is executed on a machine M_j.
A machine can execute only one task at a time, but can start a task at the same time as the current task ends (red lines on the figure to the right).
Tasks are non-conflicting if they do not conflict, i.e., if f_i≤s_j or f_j≤s_i. For example, think of tasks as college classes.
The problem is to schedule all the tasks in T using the minimal number of machines.

In the figure there are 6 tasks, with start times and finishing times (1,3), (2,5), (2,6), (4,5), (5,8), (5,7). They are scheduled on three machines M1, M2, M3. Clearly 3 machines are needed as can be seen by looking at time 4.

Note that a good solution to this problem has three objectives that must be met.

We must generate a feasible solution. That is, all the tasks assigned to a single machine are non-conflicting. A schedule that does not satisfy this condition is invalid and an algorithm that produces invalid solutions is wrong.
We want to generate a feasible solution using the minimal number of machines. A schedule that uses more than the minimal number of machines might be called non-optimal or inefficient and an algorithm that produces such schedules might be called poor.
We want to generate optimal solutions in the (asymptotically) smallest amount of time possible. An algorithm that generates optimal solutions using more time might be called slow.

Let's illustrate the three objectives with following example consisting of four tasks having starting and stopping times (1,3), (6,8), (2,5), (4,7). It is easy to construct a wrong algorithm, for example

Algorithm wrongTaskSchedule(T)
   Input:  A set T of tasks, each with start time s_i and
           finishing time f_i (s_i≤f_i).
   Output: A schedule of the tasks.

   while T is not empty do
      remove from T the first task and call it i.
      schedule i on M₁

When applied to our 4-task example, the result is all four tasks assigned to machine 1. This is clearly infeasible since the last two tasks conflict.

It is also not hard to produce a poor algorithm, one that generates feasible, but non-optimal solutions.

Algorithm poorTaskSchedule(T):
   Input:  A set T of tasks, each with start time s_i and
   finishing time f_i (s_i≤f_i).
   Output: A feasible schedule of the tasks.

   m ← 0                         { current number of machines }
   while T is not empty do
      remove from T a task i
      m ← m+1
      schedule i on M_m

On the 4-task example, poorTaskSchedule puts each task on a different machine. That is certainly feasible, but is not optimal since the first and second task can go on one machine.

Hence it looks as though we should not put a task on a new machine if it can fit on an existing machine. That is certainly a greedy thing to do. Remember we are minimizing so being greedy really means being stingy. We minimize the number of machines at each step hoping that will give an overall minimum. Unfortunately, while better, this idea does not give optimal schedules. Let's call it mediocre.

Algorithm mediocreTaskSchedule(T):
   Input:  A set T of tasks, each with start time s_i and
   finishing time f_i (s_i≤f_i).
   Output: A feasible schedule of the tasks of T

   m ← 0                         { current number of machines }
   while T is not empty do
      remove from T a task i
      if there is an M_j having all tasks non-conflicting with i then
         schedule i on M_j
      else
         m ← m+1
         schedule i on M_m

When applied to our 4-task example, we get the first two tasks on one machine and the last two tasks on separate machines, for a total of 3 machines. However, a 2-machine schedule is possible, as we will shall soon see.

The needed new idea is to processes the tasks in order. Several orders would work, we shall order them by start time. If two tasks have the same start time, it doesn't matter which one is put ahead. For example, we could view the start and finish times as pairs and use lexicographical ordering. Alternatively, we could just sort on the first component.

finalAlgorithm

Algorithm taskSchedule(T):
   Input:  A set T of tasks, each with start time s_i and
   finishing time f_i (s_i≤f_i).
   Output: An optimal schedule of the tasks of T

   m ← 0                         { current number of machines }
   while T is not empty do
      remove from T a task i with smallest start time
      if there is an M_j having all tasks non-conflicting with i then
         schedule i on M_j
      else
         m ← m+1
         schedule i on M_m

When applied to our 4-task example, we do get a 2-machine solution: the middle two tasks on one machine and the others on a second. But is this the minimum? Certainly for this example it is the minimum; we already saw that a 1-machine solution is not feasible. But what about other examples?

Correctness (i.e. Minimality of m)

Assume the algorithm runs and declares m to be the minimum number of machines needed. We must show that m are really needed.

Consider the step when the algorithm increases m to its final value and assume the task under consideration is i.
At this point the current task conflicts with one (or more) task(s) in each of the m-1 machines currently used.
But all these tasks have start time no later than s_i since the tasks were processed in order of their start time.
Since they conflict with i, they each have finishing time after s_i.
Hence they all conflict with each other as well (consider time s_i).
Hence we really do need m machines.

OK taskSchedule is feasible and optimal. But is it slow? That actually depends on some details of the implementation and it does require a bit of cleverness to get a fast solution.

Complexity

Let N be the number of tasks in T. The book asserts that it is easy to see that the algorithm runs in time O(NlogN), but I don't think this is so easy. It is easy to see O(N²).

The while loop has N iterations.
If we initially sort the tasks based on their start time (a one time cost of Θ(NlogN)), the removal is constant time per iteration or Θ(N) time in total.
To figure out the condition part of the if is not trivial.
A simple method would be to compare the current task with all previously scheduled tasks, which shows that the iteration is O(N) and the algorithm is O(N²).
To get O(log(N)) for each iteration, keep the machines in a heap using as key the latest finishing time assigned to that machine. This tells you when that machine will be free (remember that all tasks assigned so far start no later than si, the current job's start time).
Check the min element of the heap. If it is free at s_i, then it is free forever starting at s_i. We now
1. Remove the machine from the heap (removeMin), Θ(logN).
2. Assign the current job to the removed machine, Θ(1).
3. Now this machine is free at f_i and we re-insert it into the heap, Θ(logN).
If it is not free at s_i, then no machine is free at s_i so
1. Increase m generating a new machine, Θ(1).
2. Assign i to the new machine m, Θ(1).
3. Insert machine m (which has key fi) into the heap, Θ(logN).