Xen Scheduling Analysis (3): Details of the Credit Scheduling Algorithm

Date: 2022-06-03 02:09:46

The previous article analyzed the four main steps of Xen's schedule() function.

(1) First, the credit-burning function:

static void burn_credits(struct csched_vcpu *svc, s_time_t now)
{
    s_time_t delta;
    uint64_t val;
    unsigned int credits;

    /* Assert svc is current */
    ASSERT( svc == CSCHED_VCPU(curr_on_cpu(svc->vcpu->processor)) );

    if ( (delta = now - svc->start_time) <= 0 )
        return;

    val = delta * CSCHED_CREDITS_PER_MSEC + svc->residual;
    svc->residual = do_div(val, MILLISECS(1));
    credits = val;
    ASSERT(credits == val); /* make sure we haven't truncated val */
    atomic_sub(credits, &svc->credit);
    svc->start_time += (credits * MILLISECS(1)) / CSCHED_CREDITS_PER_MSEC;
}

delta measures how long this VCPU has run since it was last charged: now minus svc->start_time, in nanoseconds.

val = delta * CSCHED_CREDITS_PER_MSEC + svc->residual;

This line converts that run time into credit units. CSCHED_CREDITS_PER_MSEC is 10, so val is roughly the run time in nanoseconds multiplied by 10. The added svc->residual is the sub-credit remainder carried over from the previous call, so rounding never loses run time.

do_div(val, MILLISECS(1)) then divides val by 1,000,000 (one millisecond expressed in nanoseconds), returning the remainder, which becomes the new residual, and leaving the quotient in val. The resulting credits value is therefore val / 1,000,000, i.e. 10 credits per millisecond of run time, and it is subtracted from svc->credit atomically.

Finally, start_time is advanced by exactly the amount of time that was charged, so the next invocation continues from where this one left off.

(2) Next, inserting a VCPU into the run queue

static inline void
__runq_insert(unsigned int cpu, struct csched_vcpu *svc)
{
    /* Get the run queue of the given cpu */
    const struct list_head * const runq = RUNQ(cpu);
    struct list_head *iter;

    BUG_ON( __vcpu_on_runq(svc) );
    BUG_ON( cpu != svc->vcpu->processor );
    /* Walk runq starting at runq->next; each iteration advances iter
     * until it wraps back around to runq (circular list) */
    list_for_each( iter, runq )
    {
        /* Recover the vcpu node embedded in this runq element */
        const struct csched_vcpu * const iter_svc = __runq_elem(iter);
        /* Is svc's priority higher than that of the vcpu at iter? */
        if ( svc->pri > iter_svc->pri )
            break;
    }

    /* If the vcpu yielded, try to put it behind one lower-priority
     * runnable vcpu if we can.  The next runq_sort will bring it forward
     * within 30ms if the queue is too long. */
    if ( test_bit(CSCHED_FLAG_VCPU_YIELD, &svc->flags)
         && __runq_elem(iter)->pri > CSCHED_PRI_IDLE )
    {
        iter = iter->next;

        /* Some sanity checks */
        BUG_ON(iter == runq);
    }
    /* Insert svc in front of the first vcpu whose priority is lower */
    list_add_tail(&svc->runq_elem, iter);
}

The run queue is a circular doubly linked list built with the list.h helpers taken from the Linux 2.6 kernel. Note that a list node carries no payload of its own, so __runq_elem() (a container_of-style wrapper) is needed to recover the csched_vcpu that embeds a given node.

The loop compares svc's priority against each iter node in turn and breaks at the first node whose priority is lower than svc's. list_add_tail(&svc->runq_elem, iter) then inserts svc immediately before that node (or at the tail of the queue if no such node exists, since iter then ends up back at the list head).

Insertion ordering depends only on priority; credit values are not compared.

(3) Load balancing

// Load balancing: returns the scheduling info of the vcpu to run next
static struct csched_vcpu *
csched_load_balance(struct csched_private *prv, int cpu,
    struct csched_vcpu *snext, bool_t *stolen)
{
    
    struct csched_vcpu *speer;
    cpumask_t workers;
    cpumask_t *online;
    int peer_cpu, peer_node, bstep;
    int node = cpu_to_node(cpu);

    BUG_ON( cpu != snext->vcpu->processor );
    online = cpupool_scheduler_cpumask(per_cpu(cpupool, cpu));

    /* If this CPU is going offline we shouldn't steal work. */
    if ( unlikely(!cpumask_test_cpu(cpu, online)) )
        goto out;

    if ( snext->pri == CSCHED_PRI_IDLE )
        SCHED_STAT_CRANK(load_balance_idle);
    else if ( snext->pri == CSCHED_PRI_TS_OVER )
        SCHED_STAT_CRANK(load_balance_over);
    else
        SCHED_STAT_CRANK(load_balance_other);

    /*
     * Let's look around for work to steal, taking both vcpu-affinity
     * and node-affinity into account. More specifically, we check all
     * the non-idle CPUs' runq, looking for:
     *  1. any node-affine work to steal first,
     *  2. if not finding anything, any vcpu-affine work to steal.
     */
    for_each_csched_balance_step( bstep )
    {
        /*
         * We peek at the non-idling CPUs in a node-wise fashion. In fact,
         * it is more likely that we find some node-affine work on our same
         * node, not to mention that migrating vcpus within the same node
         * could well expected to be cheaper than across-nodes (memory
         * stays local, there might be some node-wide cache[s], etc.).
         */
        peer_node = node;
        do
        {
            /* Find out what the !idle are in this node */
            cpumask_andnot(&workers, online, prv->idlers);
            cpumask_and(&workers, &workers, &node_to_cpumask(peer_node));
            cpumask_clear_cpu(cpu, &workers);

            peer_cpu = cpumask_first(&workers);
            if ( peer_cpu >= nr_cpu_ids )
                goto next_node;
            do
            {
                /*
                 * Get ahold of the scheduler lock for this peer CPU.
                 *
                 * Note: We don't spin on this lock but simply try it. Spinning
                 * could cause a deadlock if the peer CPU is also load
                 * balancing and trying to lock this CPU.
                 */
                spinlock_t *lock = pcpu_schedule_trylock(peer_cpu);

                if ( !lock )
                {
                    SCHED_STAT_CRANK(steal_trylock_failed);
                    peer_cpu = cpumask_cycle(peer_cpu, &workers);
                    continue;
                }

                /* Any work over there to steal? */
                speer = cpumask_test_cpu(peer_cpu, online) ?
                    csched_runq_steal(peer_cpu, cpu, snext->pri, bstep) : NULL;
                pcpu_schedule_unlock(lock, peer_cpu);

                /* As soon as one vcpu is found, balancing ends */
                if ( speer != NULL )
                {
                    *stolen = 1;
                    return speer;
                }

                peer_cpu = cpumask_cycle(peer_cpu, &workers);

            } while( peer_cpu != cpumask_first(&workers) );

 next_node:
            peer_node = cycle_node(peer_node, node_online_map);
        } while( peer_node != node );
    }

 out:
    /* Failed to find more important work elsewhere... */
    __runq_remove(snext);
    return snext;
}

This function is the most involved of the three, since it leans on the cpumask machinery. Setting the mask details aside, focus on the outer do/while loop: peer_node is advanced at the next_node label via cycle_node(), and the loop ends when peer_node wraps back around to node, so every online node is visited exactly once, starting with our own (where migration is cheapest). Within each node, every non-idle peer CPU's run queue is probed with speer = cpumask_test_cpu(peer_cpu, online) ? csched_runq_steal(peer_cpu, cpu, snext->pri, bstep) : NULL, guarded by a trylock so that two CPUs balancing against each other cannot deadlock. As soon as a vcpu is stolen, speer is returned and balancing stops; if nothing can be stolen anywhere, snext is removed from the run queue and scheduled as-is.