KVM MMU
Finally, I have reached my original purpose in learning SystemTap: investigating how the KVM MMU works.
Now let SystemTap show its power.
Investigate kvm_mmu_pages and mmu_page_path in mmu_zap_unsync_children
To use memory efficiently, KVM zaps unsync children. This is done in mmu_zap_unsync_children() with the help of two structures: kvm_mmu_pages and mmu_page_path.
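For reference, the two structures are defined in arch/x86/kvm/mmu.c and look roughly like this (paraphrased; exact field names and array sizes depend on the kernel version):

/* Paraphrased from arch/x86/kvm/mmu.c; details vary by kernel version. */
#define KVM_PAGE_ARRAY_NR 16

struct kvm_mmu_pages {
    struct mmu_page_and_offset {
        struct kvm_mmu_page *sp;    /* an unsync page, or an ancestor of one */
        unsigned int idx;           /* index of sp within its parent */
    } page[KVM_PAGE_ARRAY_NR];
    unsigned int nr;                /* number of valid entries in page[] */
};

struct mmu_page_path {
    struct kvm_mmu_page *parent[PT64_ROOT_MAX_LEVEL];  /* path up to the root */
    unsigned int idx[PT64_ROOT_MAX_LEVEL];
};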
With the help of SystemTap, we can take a close look at how they work.
Here is the SystemTap script:
global started
probe begin { started = 0 }

# Fires after mmu_unsync_walk() has filled 'pages' in mmu_zap_unsync_children()
# (the statement line number is specific to my kernel's mmu.c).
probe module("kvm").statement("*@arch/x86/kvm/mmu.c:2605")
{
    # Only dump interesting cases: at least two level-2 pages,
    # i.e. more than one sub-tree below the root.
    level_2 = 0
    for (i = 0; i < $pages->nr; i++) {
        if ($pages->page[i]->sp->role->level == 2)
            level_2++
    }
    printf("level2 %d\n", level_2);
    if (level_2 < 2)
        next;

    # Remember which parent we are dumping, so we can stop afterwards.
    started = $parent;

    printf("Dump pages after mmu_unsync_walk \n");
    printf("--------------------------------\n");
    printf("SP Index: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15\n");
    printf("SP Level:");
    for (i = 0; i < $pages->nr; i++) {
        printf("%3d", $pages->page[i]->sp->role->level);
    }
    printf("\n");
    for (i = 0; i < $pages->nr; i++) {
        printf("[%d]: %x\n", i, $pages->page[i]->sp);
    }
}
probe module("kvm*").function("mmu_zap_unsync_children").return
{
if (started == @entry($parent))
exit();
}
probe module("kvm*").statement("*@arch/x86/kvm/mmu.c:2609")
{
if (!started)
next;
printf("Dump parents after for_each_sp \n");
printf("--------------------------------\n");
printf("Current sp : [%d]%x\n", $i, $sp);
printf("Level in Parents:");
for (idx = 0; idx < 5; idx++) {
if (!$parents->parent[idx])
break;
printf("%3d", $parents->parent[idx]->role->level);
}
printf("\n");
printf("SP in Parents:\n")
for (idx = 0; idx < 5; idx++) {
if (!$parents->parent[idx])
break;
printf("[%d]: %x\n", idx, $parents->parent[idx]);
}
}
Here is the result:
Dump pages after mmu_unsync_walk
--------------------------------
SP Index: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
SP Level: 4 3 2 1 1 1 3 2 1
Dump parents after for_each_sp
--------------------------------
Current sp : [3]ffff8dd3c7fd3aa0
Level in Parents: 2 3 4
SP in Parents:
Dump parents after for_each_sp
--------------------------------
Current sp : [4]ffff8dd3b1152280
Level in Parents: 2 3 4
SP in Parents:
Dump parents after for_each_sp
--------------------------------
Current sp : [5]ffff8dd3b1152460
Level in Parents: 2 3 4
SP in Parents:
Dump parents after for_each_sp
--------------------------------
Current sp : [8]ffff8dd3b4b46f00
Level in Parents: 2 3 4
SP in Parents:
Do you notice anything in the output?
Dump pages after mmu_unsync_walk
--------------------------------
SP Index: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
SP Level: 4 3 2 1 1 1 3 2 1
First in "Dump pages after mmu_unsync_walk" part, we found there is a patten of page level in kvm_mmu_pages. When mmu_unsync_walk traverses the tree, it traverses to leaf (level 1) and then go to another subtree (level 3 in case it has). To be simple, this is a depth first traverse.
Dump parents after for_each_sp
--------------------------------
Current sp : [3]ffff8dd3c7fd3aa0
Level in Parents: 2 3 4
SP in Parents:
Then take a look at the "Dump parents after for_each_sp" part, which shows the effect of for_each_sp.
On each iteration of for_each_sp, mmu_page_path holds the path from the root down to one leaf node (the "SP in Parents" part), and that leaf (the "Current sp") is the page handed to the zap code.
That's interesting.
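Putting the two together, the caller looks roughly like this (a condensed paraphrase of mmu_zap_unsync_children(); the return value and some details are omitted, and they vary by kernel version). The two statements probed above presumably sit right after the mmu_unsync_walk() call and inside the for_each_sp() body.

/* Condensed paraphrase of mmu_zap_unsync_children(); not the literal source. */
static void zap_unsync_children(struct kvm *kvm, struct kvm_mmu_page *parent,
                                struct list_head *invalid_list)
{
    int i;
    struct kvm_mmu_page *sp;
    struct mmu_page_path parents;
    struct kvm_mmu_pages pages;

    while (mmu_unsync_walk(parent, &pages)) {      /* fill 'pages' depth-first */
        for_each_sp(pages, sp, parents, i) {       /* sp = next leaf, parents = its path */
            kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
            mmu_pages_clear_parents(&parents);     /* drop unsync bookkeeping up the path */
        }
    }
}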
Dump vcpu root_hpa
I was curious about the root_hpa of each vcpu in a VM, so I tried this example.
probe module("kvm*").function("vcpu_run")
{
kvm = $vcpu->kvm;
for (idx = 0; kvm->vcpus[idx]; idx++) {
printf("vcpu[%d] root_hpa %x\n",
kvm->vcpus[idx]->vcpu_id,
kvm->vcpus[idx]->arch->mmu->root_hpa);
}
exit();
}
Then I found that the root_hpa is not the same across vcpus.
The result looks like this:
vcpu[0] root_hpa 2f394000
vcpu[1] root_hpa 2dd82000
vcpu[2] root_hpa 29f71000
vcpu[3] root_hpa 2b044000
vcpu[4] root_hpa 30ba3000
vcpu[5] root_hpa 29f9c000
vcpu[6] root_hpa 2abd7000
vcpu[7] root_hpa 29cc7000
And on another run it looks like this:
vcpu[0] root_hpa a5c3f000
vcpu[1] root_hpa a1948000
vcpu[2] root_hpa 1004a0000
vcpu[3] root_hpa af305000
vcpu[4] root_hpa c9f83000
vcpu[5] root_hpa 15605a000
vcpu[6] root_hpa 1d4a51000
vcpu[7] root_hpa 12eea0000
This means the root_hpa changes over time. Originally, I thought it would stay the same for the lifetime of the vcpu.
Investigate ROOT_HPA Invalidation
Each mmu->root_hpa effectively tracks the guest's cr3: with each task switch in the guest, the guest loads a new cr3, and mmu->root_hpa changes accordingly.
Current KVM implements a cr3 cache in mmu->prev_roots[], so it first searches the cache to see whether a root for that cr3 is already there. If not, mmu->root_hpa is set to invalid.
Then, on the next vcpu_enter_guest(), kvm_mmu_load() is invoked (because mmu->root_hpa is invalid) to set up a root for the new cr3.
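As pseudocode, my mental model of that path looks roughly like this (the function names are approximate and differ across kernel versions; this is not the literal source):

/* Rough mental model of a guest cr3 switch; approximate names, not real code. */

/* 1. The guest task switch writes cr3, trapping into kvm_set_cr3(). */
if (!fast_cr3_switch(vcpu, new_cr3))    /* hit in mmu->prev_roots[]? just swap roots */
    kvm_mmu_free_roots(vcpu);           /* miss: mmu->root_hpa becomes INVALID_PAGE */

/* 2. Before re-entering the guest, vcpu_enter_guest() reloads the MMU. */
if (!VALID_PAGE(vcpu->arch.mmu->root_hpa))
    kvm_mmu_load(vcpu);                 /* build or find a root for the new cr3 */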
My understanding is that the more tasks run in the guest, the more often kvm_mmu_load() is invoked to set up a new root. Here is a SystemTap script to observe this.
global sum, mmu_load

probe module("kvm*").function("kvm_mmu_load")
{
    sum++
}

probe timer.s(1)
{
    mmu_load <<< sum;
    sum = 0;
}

probe timer.s(10)
{
    print(@hist_linear(mmu_load, 0, 100, 10));
    delete mmu_load
}
Every second, the script records how many times kvm_mmu_load() was called during that second, and every 10 seconds it dumps a histogram of these per-second counts.
The result is obvious.
When the guest is idle:
value |-------------------------------------------------- count
0 |@@@@@@@@@@ 10
10 | 0
20 | 0
which means task switches are not frequent.
When the guest is building a kernel:
value |-------------------------------------------------- count
90 | 0
100 | 0
>100 |@@@@@@@@@@ 10
which means the guest is heavily loaded.
I am left wondering: is this a heavy burden, and could we improve things at this point?