APICV

在qemu/kernel混合模拟的基础上,为了进一步优化中断模拟,硬件提出了APICV。

在上面一小节混合模拟的流程中,其实也会涉及到apicv的代码,但没有仔细考察其作用。现在我们把它拿出来,仔细考察。

硬件知识

既然是硬件的辅助功能,那自然是先学习下硬件都提供了些什么。这部分的内容主要在SDM Chapter 29 APIC VIRTUALIZATION AND VIRTUAL INTERRUPTS。

相关的VM-execution controls

24.6.8 Controls for APIC Virtualization

  • APIC-access address (64 bits). This field contains the physical address of the 4-KByte APIC-access page

  • Virtual-APIC address (64 bits). This field contains the physical address of the 4-KByte virtual-APIC page

Certain VM-execution controls enable the processor to virtualize certain accesses to the APIC-access page without a VM exit. In general, this virtualization causes these accesses to be made to the virtual-APIC page instead of the APIC-access page.

  • TPR threshold (32 bits). Bits 3:0 of this field determine the threshold below which bits 7:4 of VTPR (see Section 29.1.1) cannot fall.

  • EOI-exit bitmap (4 fields; 64 bits each). These fields are supported only on processors that support the 1- setting of the “virtual-interrupt delivery” VM-execution control. They are used to determine which virtualized writes to the APIC’s EOI register cause VM exits

  • Posted-interrupt notification vector (16 bits). Its low 8 bits contain the interrupt vector that is used to notify a logical processor that virtual interrupts have been posted. See Section 29.6 for more information on the use of this field.

  • Posted-interrupt descriptor address (64 bits).

24.4.2 Guest Non-Register State

  • Guest interrupt status (16 bits) This field is supported only on processors that support the 1-setting of the “virtual-interrupt delivery” VM-execution control.

  • Requesting virtual interrupt (RVI)(low byte)

  • Servicing virtual interrupt (SVI)(high byte)

29 APIC VIRTUALIZATION AND VIRTUAL INTERRUPTS

The following are the VM-execution controls relevant to APIC virtualization and virtual interrupts (see Section 24.6 for information about the locations of these controls):

  • Virtual-interrupt delivery. This controls enables the evaluation and delivery of pending virtual interrupts (Section 29.2). It also enables the emulation of writes (memory-mapped or MSR-based, as enabled) to the APIC registers that control interrupt prioritization.

  • Use TPR shadow. This control enables emulation of accesses to the APIC’s task-priority register (TPR) via CR8 (Section 29.3) and, if enabled, via the memory-mapped or MSR-based interfaces.

  • Virtualize APIC accesses. This control enables virtualization of memory-mapped accesses to the APIC (Section 29.4) by causing VM exits on accesses to a VMM-specified APIC-access page. Some of the other controls, if set, may cause some of these accesses to be emulated rather than causing VM exits.

  • Virtualize x2APIC mode. This control enables virtualization of MSR-based accesses to the APIC (Section 29.5).

  • APIC-register virtualization. This control allows memory-mapped and MSR-based reads of most APIC registers (as enabled) by satisfying them from the virtual-APIC page. It directs memory-mapped writes to the APIC-access page to the virtual-APIC page, following them by VM exits for VMM emulation.

  • Process posted interrupts. This control allows software to post virtual interrupts in a data structure and send a notification to another logical processor; upon receipt of the notification, the target processor will process the posted interrupts by copying them into the virtual-APIC page (Section 29.6).

Virtual APIC Page

29.1.1 Virtualized APIC Registers

  • Virtual task-priority register (VTPR): the 32-bit field located at offset 080H on the virtual-APIC page.

  • Virtual processor-priority register (VPPR): the 32-bit field located at offset 0A0H on the virtual-APIC page.

  • Virtual end-of-interrupt register (VEOI): the 32-bit field located at offset 0B0H on the virtual-APIC page.

  • Virtual interrupt-service register (VISR)

  • Virtual interrupt-request register (VIRR)

  • Virtual interrupt-command register (VICR_LO): the 32-bit field located at offset 300H on the virtual-APIC page

  • Virtual interrupt-command register (VICR_HI): the 32-bit field located at offset 310H on the virtual-APIC page.

图解相关寄存器

    vmcs
    +----------------------------------+
    |guest state area                  |
    |   +------------------------------+
    |   |guest non-register state      |
    |   |   +--------------------------+
    |   |   |Guest interrupt status    |
    |   |   |   +----------------------+
    |   |   |   |Requesting virtual    |
    |   |   |   |   interrupt (RVI).   |
    |   |   |   +----------------------+
    |   |   |   |Servicing virtual     |
    |   |   |   |   interrupt (RVI).   |
    |   |   |   |                      |
    |   |   +---+----------------------+
    |   |                              |
    |   +------------------------------+
    |                                  |
    |vm-execution control              |
    |   +------------------------------+
    |   |APIC-access address           |
    |   |                              |
    |   |                              |         4K Virtual-APIC page
    |   |Virtual-APIC address      ----|-------->+-------------------------+
    |   |                              |     080H|Virtual task-priority    |
    |   |                              |         |        register (VTPR)  |
    |   |                              |     0A0H|Vrtl processor-priority  |
    |   |                              |         |        register (VPPR)  |
    |   |                              |     0B0H|Virtual end-of-interrupt |
    |   |                              |         |        register (VEOI)  |
    |   |                              |         |Virtual interrupt-service|
    |   |                              |         |        register (VISR)  |
    |   |                              |         |Virtual interrupt-request|
    |   |                              |         |        register (VIRR)  |
    |   |                              |     300H|Virtual interrupt-command|
    |   |                              |         |        register(VICR_LO)|
    |   |                              |     310H|Virtual interrupt-command|
    |   |                              |         |        register(VICR_HO)|
    |   |                              |         |                         |
    |   |                              |         +-------------------------+
    |   |                              |
    |   |                              |
    |   |TPR threshold                 |
    |   |EOI-exit bitmap               |
    |   |Posted-interrupt notification |
    |   |        vector                |
    |   |                              |
    |   |                              |    64 byte descriptor
    |   |                              |    511              255              0
    |   |Posted-interrupt descriptor   |--->+----------------+----------------+
    |   |        address               |    |                |                |
    |   |                              |    |                |                |
    |   |                              |    +----------------+----------------+
    |   |Pin-Based VM-Execution Ctrls  |
    |   |    +-------------------------+
    |   |    |Process posted interrupts|
    |   |    |                         |
    |   |    +-------------------------+
    |   |                              |
    |   |Primary Processor-Based       |
    |   |   VM-Execution Controls      |
    |   |    +-------------------------+
    |   |    |Interrupt window exiting |
    |   |    |Use TPR shadow           |
    |   |    |                         |
    |   |    +-------------------------+
    |   |                              |
    |   |Secondary Processor-Based     |
    |   |   VM-Execution Controls      |
    |   |    +-------------------------+
    |   |    |Virtualize APIC access   |
    |   |    |Virtualize x2APIC mode   |
    |   |    |APIC-register virtual    |
    |   |    |Virtual-intr delivery    |
    |   |    |                         |
    |   |    |                         |
    |   |    +-------------------------+
    |   |                              |
    |   +------------------------------+
    |                                  |
    +----------------------------------+

虚拟APIC状态

29.1 VIRTUAL APIC STATE

硬件虚拟化通过Virtualize APIC accesses来指定一块4k内存,用于模拟APIC寄存器的访问和管理中断。

如果没有理解错的话,这个页面在vmx_vcpu_reset函数中设置。

vmx_vcpu_reset
   vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, __pa(vcpu->arch.apic->regs));

虚拟中断

29.2 EVALUATION AND DELIVERY OF VIRTUAL INTERRUPTS

虚拟中断包含了中断检测和中断发送。

对virtual-APIC page的操作会触发虚拟中断的检测,如果这个检测得到了一个虚拟中断,则会向guest发送一个中断且不导致guest退出。

虚拟中断检测

一下几种情况将触发虚拟中断的检测:

  • VM entry(Section 26.3.2.5)

  • TPR virtualization(Section 29.1.2)

  • EOI virtualization(Section 29.1.4)

  • self-IPI virtualization(Section 29.1.5)

  • posted-interrupt processing(Section 29.6)

检测的伪代码如下:

IF “interrupt-window exiting” is 0 AND
  RVI[7:4] > VPPR[7:4] (see Section 29.1.1 for definition of VPPR)
    THEN recognize a pending virtual interrupt;
ELSE
    do not recognize a pending virtual interrupt;
FI;

虚拟中断发送

虚拟中断发送会改变虚拟机中断状态(RVI/SVI),并且在vmx non-root 模式下发送一个中断,从而不导致虚拟机退出。

中断发送的伪代码如下:

Vector ← RVI; VISR[Vector] ← 1;
SVI ← Vector;
VPPR ← Vector & F0H; VIRR[Vector] ← 0;
IF any bits set in VIRR
  THEN RVI ← highest index of bit set in VIRR
  ELSE RVI ← 0; FI;
deliver interrupt with Vector through IDT;
cease recognition of any pending virtual interrupt;

看过了硬件提供的能力,我们来看看软件上是如何借助硬件的。

相关软件

检测APICV能力

在kvm代码中首先需要检测硬件是否支持apicv的功能,并作标示。

vmx_init
    kvm_init
        kvm_arch_hardware_setup
            hardware_setup
                if (!cpu_has_vmx_apicv()) {
                  enable_apicv = 0;
                  kvm_x86_ops->sync_pir_to_irr = NULL;
                }

可以看到在启动kvm模块的时候就需要检测apicv是否存在并标示enable_apicv。

那这个cpu_has_vmx_apicv又做了啥?

static inline bool cpu_has_vmx_apicv(void)
{
    return cpu_has_vmx_apic_register_virt() &&
        cpu_has_vmx_virtual_intr_delivery() &&
        cpu_has_vmx_posted_intr();
}

这个就和上文列出了硬件提供的内容匹配了。

何处使用

检测好了硬件的特性,现在就要看在什么地方使用了。当然,我们并没有直接使用enable_apicv这个值,而是把它赋值给了vcpu。

kvm_arch_vcpu_init
    vcpu->arch.apicv_active = kvm_x86_ops->get_enable_apicv(vcpu);
        return enable_apicv;

聪明的朋友一定已经想到,接下来就是找apicv_active这个值会在哪里判断,判断的地方就是使用apicv功能的地方了。

Posted Interrupt

Posted Interrupt作为一个比较重要的功能,我们单独拿出来研究。

概念

Posted-interrupt processing is a feature by which a processor processes the virtual interrupts by recording them as pending on the virtual-APIC page.

If the “external-interrupt exiting” VM-execution control is 1, any unmasked external interrupt causes a VM exit (see Section 25.2). If the “process posted interrupts” VM-execution control is also 1, this behavior is changed and the processor handles an external interrupt as follows.

  1. The local APIC is acknowledged; this provides the processor core with an interrupt vector, called here the physical vector.

  2. If the physical vector equals the posted-interrupt notification vector, the logical processor continues to the next step. Otherwise, a VM exit occurs as it would normally due to an external interrupt; the vector is saved in the VM-exit interruption-information field. 物理中断向量号等于 posted-interrupt notification vector才继续。

  3. The processor clears the outstanding-notification bit in the posted-interrupt descriptor. This is done atomically so as to leave the remainder of the descriptor unmodified (e.g., with a locked AND operation). 清ON位。

  4. The processor writes zero to the EOI register in the local APIC; this dismisses the interrupt with the posted- interrupt notification vector from the local APIC.清EOI。

  5. The logical processor performs a logical-OR of PIR into VIRR and clears PIR. No other agent can read or write a PIR bit (or group of bits) between the time it is read (to determine what to OR into VIRR) and when it is cleared. PIR->VIRR

  6. The logical processor sets RVI to be the maximum of the old value of RVI and the highest index of all bits that were set in PIR; if no bit was set in PIR, RVI is left unmodified. 计算得到RVI

  7. The logical processor evaluates pending virtual interrupts as described in Section 29.2.1.

简单来说就是原来运行是的虚拟机需要退出来处理中断,现在不退出了,宿主机上用一个特殊的中断将真正的中断注入到虚拟机。真正注入到虚拟机的中断号记录在 PIR (Posted Interrupt Requests)。

软件中断处理的代码

在之前的代码分析中我们也看到过,不过没有着重讲解。在发送中断的内核部分中,我们看到

kvm_irq_delivery_to_apic()
   kvm_irq_delivery_to_apic_fast()
   kvm_vector_to_index()
   kvm_get_vcpu()
   kvm_apic_set_irq()
       __apic_accept_irq()
           ...
           APIC_DM_FIXED
               kvm_lapic_set_vector
               kvm_lapic_clear_vector
               if (vcpu->arch.apicv_active)
                 kvm_x86_ops->deliver_posted_interrupt(vcpu, vector);
               else
                 kvm_lapic_set_irr
                 kvm_make_request(KVM_REQ_EVENT, vcpu)
                 kvm_vcpu_kick()
                 ...

对于APIC_DM_FIXED中断类型,如果apicv_active为真,则会采用Posted interrupt方式。

这个函数的工作在注释中写得很清楚了。我向着重讲的是pi_test_and_set_pir将真正的中断向量号写在了pir中。

/*
 * Send interrupt to vcpu via posted interrupt way.
 * 1. If target vcpu is running(non-root mode), send posted interrupt
 * notification to vcpu and hardware will sync PIR to vIRR atomically.
 * 2. If target vcpu isn't running(root mode), kick it to pick up the
 * interrupt from PIR in next vmentry.
 */
static void vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
{
    struct vcpu_vmx *vmx = to_vmx(vcpu);
    int r;

  ...

  if (pi_test_and_set_pir(vector, &vmx->pi_desc))
  return;

  /* If a previous notification has sent the IPI, nothing to do.  */
  if (pi_test_and_set_on(&vmx->pi_desc))
    return;

    if (!kvm_vcpu_trigger_posted_interrupt(vcpu, false))
        kvm_vcpu_kick(vcpu);
}

进一步打开kvm_vcpu_trigger_posted_interrupt,发生了什么呢?实际上是向目标vcpu所在的物理cpu上发送了vector为POSTED_INTR_VECTOR的一个中断。当然这个中断就叫Posted Interrupt。

static inline bool kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu,
                             bool nested)
{
#ifdef CONFIG_SMP
    int pi_vec = nested ? POSTED_INTR_NESTED_VECTOR : POSTED_INTR_VECTOR;

    if (vcpu->mode == IN_GUEST_MODE) {
        /*
         * The vector of interrupt to be delivered to vcpu had
         * been set in PIR before this function.
         *
         * Following cases will be reached in this block, and
         * we always send a notification event in all cases as
         * explained below.
         *
         * Case 1: vcpu keeps in non-root mode. Sending a
         * notification event posts the interrupt to vcpu.
         *
         * Case 2: vcpu exits to root mode and is still
         * runnable. PIR will be synced to vIRR before the
         * next vcpu entry. Sending a notification event in
         * this case has no effect, as vcpu is not in root
         * mode.
         *
         * Case 3: vcpu exits to root mode and is blocked.
         * vcpu_block() has already synced PIR to vIRR and
         * never blocks vcpu if vIRR is not cleared. Therefore,
         * a blocked vcpu here does not wait for any requested
         * interrupts in PIR, and sending a notification event
         * which has no effect is safe here.
         */

        apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), pi_vec);
        return true;
    }
#endif
    return false;
}

硬件中断处理的代码

在上面的代码中我们可以看到,如果vcpu在guest状态下才会通过post interrupt发送中断。否则还是走kvm_vcpu_kick。

但是对于硬件中断来说,中断发生时直接进入中断处理函数,而不会去判断vcpu状态。这要怎么处理呢?

暂时我能看到是当vcpu处于block状态时,会更换notification vector。也就是由另一个中断函数来响应这个事件。

     vcpu_block()
       pre_block = vmx_pre_block
          pi_pre_block
             new.nv = POSTED_INTR_WAKEUP_VECTOR
       kvm_vcpu_block()
       post_block = vmx_post_block
          pi_post_block = __pi_post_block
             new.nv = POSTED_INTR_VECTOR

而这个中断处理函数的内容是:

static void wakeup_handler(void)
{
    struct kvm_vcpu *vcpu;
    int cpu = smp_processor_id();

    spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
    list_for_each_entry(vcpu, &per_cpu(blocked_vcpu_on_cpu, cpu),
            blocked_vcpu_list) {
        struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);

        if (pi_test_on(pi_desc) == 1)
            kvm_vcpu_kick(vcpu);
    }
    spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
}

暂时还不是特别理解,待我以后好好研究。

Last updated