- Paolo Savini (Embecosm)
- Helene Chelin (Embecosm)
- Jeremy Bennett (Embecosm)
- Hugh O'Keeffe (Ashling)
- Nadim Shehayed (Ashling)
- Daniel Barboza (Ventana)
WP1:
- Prepare routine/nightly runs of benchmarks.
  - Tests are running.

Next steps (WP1):
- Prepare routine/nightly runs of benchmarks.
WP2:
- We have implemented an optimization of the `vext_ldst_us` helper function for vle8.v (see results below).
- Explore optimization through the usage of builtins like `__builtin_memcpy`.
  - Deferred to prioritize the optimization of the helper function first.
Next steps (WP2):
- Get SPEC CPU results for the optimization of the vle8.v instruction.
- Optimize the tail bytes of vle8.v.
- Add optimization for vse8.v.
- Explore optimization through the usage of builtins like `__builtin_memcpy` (see the sketch after this list).
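For context on the `__builtin_memcpy` idea, here is a minimal sketch (illustrative only, not the planned QEMU change; `load_u64` and `host_ptr` are hypothetical names): a fixed-size `__builtin_memcpy` into a local variable lets the compiler emit a single wide host load even for unaligned data, which is the effect we would want in the unit-stride helpers.

```c
#include <stdint.h>

/*
 * Illustrative sketch only (not the planned QEMU change): a
 * constant-size __builtin_memcpy is folded by the compiler into a
 * single 8-byte host load, and it is safe even when host_ptr is not
 * 8-byte aligned.
 */
static inline uint64_t load_u64(const void *host_ptr)
{
    uint64_t v;
    __builtin_memcpy(&v, host_ptr, sizeof(v));
    return v;
}
```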
Our current set of agreed priorities is taken from the Statement of Work. It has the following priorities, which trade off the functionality targeted against the architectures supported.
- vector load/store ops for x86_64 AVX
- vector load/store ops for AArch64/Neon
- vector integer ALU ops for x86_64 AVX
- vector load/store ops for Intel AVX10
For each of these there will be an analysis phase and an optimization phase, leading to the following set of work packages.
- WP0: Infrastructure
- WP1: Analysis of vector load/store ops on x86_64 AVX
- WP2: Optimization of vector load/store ops on x86_64 AVX
- WP3: Analysis of vector load/store ops on AArch64/Neon
- WP4: Optimization of vector load/store ops on AArch64/Neon
- WP5: Analysis of integer ALU ops on x86_64 AVX
- WP6: Optimization of integer ALU ops on x86_64 AVX
- WP7: Analysis of vector load/store ops on Intel AVX10
- WP8: Optimization of vector load/store ops on Intel AVX10
These priorities can be revised by agreement with RISE during the project.
We implemented an optimization of the helper function:
```diff
 static void
 vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
              vext_ldst_elem_fn *ldst_elem, uint32_t log2_esz, uint32_t evl,
              uintptr_t ra)
 {
     uint32_t i, k;
     uint32_t nf = vext_nf(desc);
     uint32_t max_elems = vext_max_elems(desc, log2_esz);
     uint32_t esz = 1 << log2_esz;
     VSTART_CHECK_EARLY_EXIT(env);
+    uint32_t mod = evl % 8;
+
     /* load bytes from guest memory */
-    for (i = env->vstart; i < evl; env->vstart = ++i) {
+    /* Main loop: load 8 single-byte elements at a time with one 8-byte load. */
+    for (i = env->vstart; i < (evl - mod); env->vstart = i) {
+        k = 0;
+        while (k < nf) {
+            target_ulong addr = base + ((i * nf + k) << log2_esz);
+            lde_d(env, adjust_addr(env, addr), i + k * max_elems, vd, ra);
+            k++;
+        }
+        i += 8;
+    }
+    env->vstart = i;
+    /* Tail loop: the remaining (evl % 8) elements, one element at a time. */
+    for (; i < evl; env->vstart = ++i) {
         k = 0;
         while (k < nf) {
             target_ulong addr = base + ((i * nf + k) << log2_esz);
             ldst_elem(env, adjust_addr(env, addr), i + k * max_elems, vd, ra);
             k++;
         }
     }
     env->vstart = 0;
     vext_set_tail_elems_1s(evl, vd, desc, nf, esz, max_elems);
 }
```
The change digests blocks of 8 bytes with single 8-byte loads and then processes the remaining bytes one by one. A further improvement will be to process the remaining bytes in chunks of 4 bytes and then 2 bytes where possible. The results for data sizes smaller than 8 bytes show that there is room for improvement there. Further benchmarking will show whether the gain obtained by optimizing the tail bytes exceeds the overhead.
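A minimal sketch of that tail handling (not yet implemented, and simplified to the plain unit-stride vle8.v case with nf == 1; `lde_w` and `lde_h` are assumed 4-byte and 2-byte load-element helpers analogous to `lde_d`) would replace the byte-by-byte tail loop with descending chunk sizes:

```c
/*
 * Sketch only, assuming nf == 1 (plain unit-stride vle8.v).
 * lde_w/lde_h are assumed 4-byte and 2-byte analogues of lde_d.
 * Since the main loop has already consumed all full 8-byte blocks,
 * at most one byte remains after one 4-byte and one 2-byte chunk.
 */
if (evl - i >= 4) {
    lde_w(env, adjust_addr(env, base + i), i, vd, ra); /* 4-byte chunk */
    i += 4;
}
if (evl - i >= 2) {
    lde_h(env, adjust_addr(env, base + i), i, vd, ra); /* 2-byte chunk */
    i += 2;
}
if (i < evl) {
    ldst_elem(env, adjust_addr(env, base + i), i, vd, ra); /* final byte */
    i++;
}
```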
Performance improvement ratio of the optimized version over the master branch. A value of 4 means that the optimized version runs 4 times faster than the unoptimized version (in terms of ns/instruction). We have hit a QEMU error when dealing with data sizes larger than 512 when VLEN is 1024; we are investigating this.
Size | VLEN=128,LMUL=1 | VLEN=1024,LMUL=8 |
---|---|---|
1 | 1.17 | 0.91 |
2 | 0.75 | 0.93 |
4 | 1.25 | 1.11 |
8 | 3.22 | 1.79 |
16 | 3.38 | 3.14 |
32 | 3.44 | 4.05 |
64 | 3.87 | 5.77 |
128 | 3.84 | 6.06 |
256 | 4.04 | 6.35 |
512 | 4.16 | 6.89 |
1,024 | 4.20 | |
2,048 | 4.10 | |
4,096 | 4.10 | |
These results are in ns/instruction; HC is the optimized branch, and the ratio column is master divided by HC (e.g. 108.88 / 26.50 ≈ 4.11 at length 8 for VLEN=128).
length | master (VLEN=128,LMUL=1) | HC | ratio | master (VLEN=1024,LMUL=8) | HC | ratio |
---|---|---|---|---|---|---|
1 | 21.13 | 71.49 | 0.30 | 21.12 | 21.41 | 0.99 |
2 | 33.44 | 43.11 | 0.78 | 33.50 | 33.02 | 1.01 |
4 | 62.16 | 72.75 | 0.85 | 62.06 | 58.31 | 1.06 |
8 | 108.88 | 26.50 | 4.11 | 109.00 | 22.00 | 4.95 |
16 | 209.88 | 38.88 | 5.40 | 209.88 | 34.00 | 6.17 |
32 | 210.50 | 55.00 | 3.83 | 396.25 | 72.00 | 5.50 |
64 | 209.00 | 52.00 | 4.02 | 771.50 | 131.50 | 5.87 |
128 | 207.00 | 48.00 | 4.31 | 1523.00 | 250.00 | 6.09 |
256 | 212.00 | 51.00 | 4.16 | 3012.00 | 481.00 | 6.26 |
512 | 210.00 | 50.00 | 4.20 | 5998.00 | 938.00 | 6.39 |
1,024 | 213.00 | 50.00 | 4.26 | 5995.00 | 939.00 | 6.38 |
2,048 | 207.00 | 36.00 | 5.75 | 6000.00 | 772.00 | 7.77 |
4,096 | 210.00 | 38.00 | 5.53 | 5996.00 | 772.00 | 7.77 |
8,192 | 208.00 | 34.00 | 6.12 | 5994.00 | 770.00 | 7.78 |
We haven't yet been able to collect all the results we need with SPEC CPU, and what we have so far will need more analysis, but you can see the progress here.
2024-06-05
- Paolo to check the behaviour of QEMU with tail bytes.
- Paolo to subscribe to the QEMU mailing list.
- Paolo to look at the patches from Max Chou.
2024-05-15
- Jeremy to look at the impact of masked vs unmasked and strided vs unit-stride vector operations.
  - Lower priority.
2024-05-08
- Jeremy to characterise QEMU floating-point performance and file it as a performance regression issue in the QEMU GitLab.
  - Low priority; deferred to prioritize the smoke tests work.
2024-05-01
- Paolo to review the generic issue from Palmer Dabbelt to identify ideas for optimization and benchmarks to reuse.
  - Reproduction deferred to prioritize the ARM analysis.
  - So far we have not seen the execution time difference reported in the issue; we need to check the context.
  - The bionic benchmarks may be a useful source of small benchmarks.
  - Taking the ARM example: it might be tricky to map each load/store to the right host operations, but that is the kind of optimization we are aiming at.
  - PAUSED
- Daniel to advise Paolo on best practice for preparing QEMU upstream submissions.
The risk register is held in a shared spreadsheet.
We will keep it updated continuously and report any changes each week.
There are no changes to the risk register this week.
Jeremy will be on vacation from the 7th to the 16th of June. Paolo will be on vacation from the 20th to the 24th of June.