Wednesday, February 8, 2012

The Barbershop Load Distribution algorithm for Linux Kernel.

        The Barbershop Load Distribution algorithm for Linux kernel scheduler
                - Rakib Mullick <rakib.mullick at gmail dot com>
        ---------------------------------------------------------------------------------------------------------------


Introduction
-------------------
Here, I'm going to introduce an alternative load distribution algorithm for Linux kernel scheduler. This technique is named as "The Barbershop Load Distribution Algorigthm" or BLD for short and will be refered as BLD from here on. As it's name implies, it only tries to distribute the load properly by tracking lowest and highest loaded rq of the system. This technique never tries to balance the system load at idle context, which is done by the current scheduler. The motivation behind this technique is to distribute load properly amongst the CPUs in a way as if load balancer isn't needed and to make load distribution easier.


Naming Reason and it's origin:
--------------------------------------------------

Now, question might arrive that why I gave this name? Well, the answer is - it's been perceived from a real life scenario. In Operating System's literature, it's not new to use real life scenario to solve/describe a particular problem. So, here a barbershop scenario has been introduced to deal with load balancing problem. Let's assume, in a barbershop there are four barber and there is a manager who's responsible for placing each customer evenly as if each barber gets
even load and every customer has to wait less time. Now, think it from  scheduler's perspective, each "barber" is analogous with "CPU", "barbershopmanager" is analogous with kernel "scheduler" and "customer" is analogous with a "process". We can assume every barber has a queue and
analogous with CPU runqueue we usually have Linux kernel.

If we use the current technique of load balancing used in Linux kernel, it could be sum up like this:

  * Whenever a new customer comes, it finds the less busiest barber and places that customer to that barber's queue. To do this, manager needs to calculate each rq's load and compare with others (Assume schedule domain features off for now).(done at sched_fork for first time)

  * Again, when the real execution begins, means when the barber gets in action over a customer it again try to balance the system. It follows the same procedure as previous one, tries to find the less busiest queue. (sched_exec)

  * Then periodically, barbermanager needs to check the load of the system and if there's an imbalance it runs load balancing operation. This option is only when NO_HZ=n  (scheduler_tick balancing).

  * When a barber is about to go idle, barbermanager checks other barber's queue, if any barber has overloaded queue, then move those overloaded tasks from busiest one to the 'about to go idle' barber's queue.(idle_balance).

  * When a customer gets scheduled out and later on tries to wake up, again barbermanager calculates and compare the each queue load to get the lowest busiest queue.(try_to_wake_up)

Roughly above steps are taken, for keeping a Linux system balanced by kernel scheduler.

Whole scenario would've been much easier if barbermanager (scheduler) can track the system load. That means whenever a load change happens on the system, barbermanager will track the load and will try to keep the barber queue ordered. In this way, barbermanager will no need to calculate and compare queue's load periodically or huristically.

Now, let's have a draft of barbershop load distribution technique:

  * Whenever a new customer comes, barbermanager will pick the lowest busiest queue and will place the customer on that queue. (sched_fork)

  * Above steps will be repeated when the real execution begins (sched_exec) and at try_to_wake_up.

  * For load tracking barbermanager needs to calculate load and reorder the queue load list. That means, whenever a customer enter or exit (activate_task / deactivate_task) it's queue load has been tracked and load queue has been sorted. (Previously, it was assumed that, barbermanager has a display board, looking at the display board barbermanager finds which is the lowest loaded queue.)

Accumulating above steps, The Barbershop Load Distribution algorithm has been designed. I hope this will clarify the reason behind the naming and how it's came. The barbershop scenario has been adjusted to reflect the Linux kernel's load balancing scenario to make the algorithm more compatible and to make implementation easier.

Premitive Implementation:
-------------------------------------------
 First implementation of BLD, introduces the concept of 'display board', barbermanager will make quick decisions by looking at the display board. Display board will show the amount of load exists in a runqueue and the average load of the whole system. It was implemented using an structure called 'display_board' -
  
    struct display_board {
        unsigned long loads[NR_CPUS];
        unsigned long average_load, total_load;
    };

But, picking the lowest loaded CPU was depended on number of CPUs exist on a system - therefore O(n), where n = number of CPUs.

To avoid this O(n) - a simpler approach was required. So, linked list approch was introduced, binding runqueues based on load. And it shows that, it is possible to keep system balanced by simply load tracking and runqueue reordering.

Description
--------------------
  BLD is best described as a O(1) CPU picking technique. Which is done by reordering CPU runqueues based on runqueue loads. In other words, it keeps the scheduler aware of the load changes, which helps scheduler to keep runqueues in an order. This technique doesn't depend on scheduler ticks. The two most simple things in this technique are: load tracking and runqueue ordering; these are relatively simpler operations. Load tracking will be done whenever a load change happens on the system and based on this load change runqueue will be ordered.
So, if we have an ordered runqueue from lowest to highest, then picking the less (or even busiest) runqueue is easy. Scheduler can pick the lowest runqueue without calculation and comparison at the time of placing a task in a runqueue. And while trying to distribute load at sched_exec and sched_fork our best choice is to pick the lowest busiest runqueue of the system. And in this way, system remains balanced without doing any load balancing. At the time of try_to_wake_up picking the idlest runqueue is topmost priority but it has been done as per domain basis to utilize CPU cache properly and it's an area where more concentration is requires.

Implementation
--------------------------
  As described above, BLD needs to do two things:
    a) load tracking i.e load change;
    b) ordering runqueues

  a) track load: To track load we need to hook at activate_task() and deactivate_task(). At this particular point load tracking is done, which simply calculates the current rq load and updates it accordingly. Runqueue loads are found by simply calling weighted_cpuload(). While trying to keep balance at task wakeup, simply calculating weighted_cpuload() isn't enough and needs to accumulate sleeping tasks into account.

  b) ordering runqueues: Ordering runqueues are simple and done by linking runqueues based on load from lowest to highest. Ordering runqueues are done after a runqueue load has been updated and makes necessary comparision with loaded runqueues to make it ordered. This ordering operations doesn't depend on number of runqueues i.e CPUs and it's a constant time operation.

Runqueues are kept on a doubly linked list based on load we get from weighted_cpuload(), which makes easier to access less busiest queue and highest busiest queue. Maintaining runqueues with linked list are feasible for keeping it in order and maintaining large no of runqueues. This linked list's operations are protected with a single read-write spinlock.

O(1) CPU picking at sched_fork() and sched_exec()
-----------------------------------------------------------------------------------
 At this point we've an ordered runqueues, where we can access with help of runqueue head pointer. Current implementation doesn't seperates offline CPU from online CPU (when HOTPLUG is enabled), so it checks whether the first CPU of rq head is online or not. So, CPU picking is also a O(1) operation (with side effect of CPU getting offline, it's avoidable), when we are balancing from sched_fork and sched_exec. At this particular moment we're allowed to move
a task on any CPU of the system, cause we've zero cache footprint.

CPU picking at task wakeup:
------------------------------------------------
While picking a CPU at task waking, the lowest loaded CPU of a particular domain will be chosen. In this way, it tries to utilize CPU resources - this is a bit over causious technique and an area which needs more attention. Simply picking the lowest loaded runqueue of the system isn't appropriate in this case. When CPU affinity is set for a particular task, the lowest loaded CPU will be returned from task's affinity mask by finding out the lowest loaded CPU.

Remarks
---------------
  Current implementation requires a *lot* of cleanup and it's been messed up with #ifdef's. So, implementation is ugly and very basic. It needs to be tested in other architechtures. So, I'm fully unware how it'll *really* run on other architectures. So, if it hangs/crashes your system while testing don't blaim me too hard ;-). It also wasn't tested to make sure how it'll react upon different system settings available on Linux kernel, like - cgroups. Current locking schemes might be guilty of heavy lock contention for large systems, it could be reduced by introducing
more fine grained locking scheme. And more work is required for task waking up balancing. It shows good performance for kernel building sort of loads. But has some downfall for that sort of loads which is a 'single process creates various numbers of threads' (1:N). It shows better fairness - test was made by running "for i in `seq 1 5`; do tools/perf/perf bench sched pipe & done;wait" (found from previous lkml post by Ingo Molnar).

 Personally I was worried about CFS gets hurt by this load distribution technique, but looking at the fairness test (perf bench  sched pipe) shows that it improves fairness. Current scheduler is a result of years of dedicated effort of numbers of developers and a lot of scheduler modification has been done by introducing huristics and showing good results as per performance measuring tools.
But the case for BLD isn't same - it's introduces an approach which shows how cpu load can be distributed evenly without worrying too much about how many number of CPUs exists and so, putting negative impact on system performance isn't abnormal. So, rather simply looking at performance results, I'd suggest readers to think about "current load balancing approach" vs. "BLD load
distribution approach" (although BLD performance is not *bad* as per my testings). And finally, my writing skills are not that good, so there might be repetitive sentences, spelling mistakes etc. Please be patient if you started to feel disgusting. I try to explain all the details about this technique and if any  confusion, I think looking at the patch will give you a clear idea. Is there anything , that I'm missing ???

 So, please test it, if you can. Any comments, suggestions are highly expected. Implementation of BLD could be found at the following link's download section:


http://code.google.com/p/bld/

Bld versions are named after Linux kernel's release like: if a particular bld release is for kernel version 3.3-rc2 then, bld version is for 3.3-rc2 kernel is bld-3.3-rc2. So, find the appropriate version.


Thanks,
Rakib.

No comments:

Post a Comment