Problem overview
At the time of writing this article, I am developing a new project named spiral. As a text editing library for the terminal, it handles input, Unicode processing and, most importantly, rendering. Efficient text editing requires efficient data structures, which is why the first versions of this project use a gap buffer.
However, while text manipulation is implemented extremely efficiently, the same can't be said about rendering a line in the terminal after every change. This is a consequence of the immediate mode UI design pattern (see Immediate mode UI for the reasoning behind this choice). Even if you insert just one character by pressing a single key in your favourite application built on spiral, the byte lands in the buffer quickly, but the application then has to iterate through every byte to render the line, on almost every keystroke. This inefficiency can overshadow the performance of a gap buffer.
On the other hand, a classic dynamic array allows for extremely easy rendering, because it is just a contiguous slice of items in memory, but its insertion operation is less efficient. So the main question to be answered in this article is:
What is going to be more efficient for spiral and similar use cases: the slow insertion but fast rendering of a dynamic array, or the fast insertion but slightly slower rendering (and added complexity) of a gap buffer?
Possible solutions
This is the basic sequence of operations in spiral's main loop (a code sketch follows the list):
- Insert a new character into the data structure.
- Clear terminal screen.
- Iterate through every character in the data structure and print it to the terminal.
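In code, the cycle could look roughly like this. This is a condensed sketch, not spiral's actual source: `readKey`, `terminal` and the buffer API are hypothetical names standing in for whichever data structure we choose.

```zig
// Hypothetical edit-render loop.
while (readKey()) |key| {
    buffer.insert(key); // 1. insert the new character
    terminal.clear(); // 2. clear the terminal screen
    for (0..buffer.length()) |i| // 3. iterate and print every character
        terminal.print(buffer.at(i));
}
```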
The data structure is yet to be decided, but there are three options that came to my mind while developing my library:
Using a gap buffer.
A data structure specifically designed for efficient text editing. Insertions near the movable gap (we'll call it the cursor in the context of text editing) are efficient. Iteration is a little complicated, because elements are stored in 2 separate slices: left and right of the cursor.
Using a dynamic array.
A data structure with an extremely simple layout in memory. Iteration is efficient, but insertions at arbitrary positions are complicated.
Using a gap buffer, but temporarily converting it to a dynamic array before rendering.
Taking the best of both worlds: the efficient insertions of a gap buffer and the efficient iteration of a dynamic array.
The conversion between the data structures happens by the following algorithm (sketched in code right after the list):
- Let n be the amount of elements in the right slice at this moment.
- Move the cursor n elements to the right.
- Use the left slice of the gap buffer the same way as a dynamic array.
- Move the cursor n elements back to the left.
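Here is a minimal sketch of that conversion, assuming a hypothetical gap buffer API (`moveCursorRight`, `moveCursorLeft`, the `left_len`/`right_len` fields and `render` are illustrative names, not spiral's actual ones):

```zig
// Sketch of the temporary gap-buffer-to-slice conversion.
// After moving the cursor fully to the right, the right slice is
// empty, so all elements sit contiguously in the left slice.
fn renderAsSlice(gb: *GapBuffer(u8)) void {
    const n = gb.right_len; // elements currently right of the cursor
    gb.moveCursorRight(n); // everything moves into the left slice
    render(gb.data[0..gb.left_len]); // iterate like a dynamic array
    gb.moveCursorLeft(n); // restore the original cursor position
}
```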
With our three contenders for spiral's data structure established, we can proceed to their analysis and comparison.
Theoretical analysis
First, we'll start with analyzing the time complexity of the operations through a series of mathematical calculations. We're going to count every common CPU operation, so the terms for our formulas will be:
- n — amount of bytes in a data structure;
- m — time of executing one memory operation, either reading or writing one byte;
- c — time of executing one arithmetic/conditional operation.
Every data structure dynamically allocates memory and reserves it for future element insertions. For the sake of simplicity, we're going to assume that the data structures always have unused capacity and thus never require new allocations.
Time complexity of insertion
For a gap buffer, insertion is extremely simple: store the new element right after the left slice. No additional calculation is required (not counting the length increment, which is common to all data structures), so the cost is a single memory write:

$$T_{\text{insert}} = m$$
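As a sketch (with hypothetical field names, not spiral's actual ones), the whole operation is one write into the gap:

```zig
// Gap buffer insertion sketch: the new element lands right after
// the left slice, i.e. in the first free cell of the gap.
fn insertLeft(gb: *GapBuffer(u8), item: u8) void {
    gb.data[gb.left_len] = item; // one memory write: cost m
    gb.left_len += 1; // length increment, common to all structures
}
```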
For a dynamic array, however, insertion is harder. Before inserting the element for the same easy m, we first need to move k elements one position forward. The main question is: how large is k?
Here is where our analysis branches into two different cases: the worst and the best.
- Worst case: the cursor is at the start of the line, so k = n.
- Best case: the cursor is at the end of the line, so k = 0.
In theory, every element needs to be read from its old position and then written to its new position. In POSIX programming, this operation is represented by the memmove function¹. That's why the time complexity of moving k elements is:

$$T_{\text{move}}(k) = 2mk$$

However, in practice, memmove-like functions often get compiled with the use of SIMD instructions, greatly parallelizing the processing of data. Let s be the SIMD parallelization coefficient; the time complexity of memmove then becomes:

$$T_{\text{move}}(k) = \frac{2mk}{s}$$
So, the best case for dynamic array insertion (k = 0) is:

$$T_{\text{best}} = m$$

And the worst case (k = n):

$$T_{\text{worst}} = m + \frac{2mn}{s}$$
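To get a feel for the difference, here is a worked example with illustrative numbers: n = 1024 elements and an assumed SIMD coefficient of s = 16 give

$$T_{\text{worst}} = m + \frac{2m \cdot 1024}{16} = 129m,$$

compared to $2049m$ for a purely scalar copy with $s = 1$.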
Time complexity of iteration
A dynamic array has straightforward iteration, identical to iterating a slice: just sequentially read every element in the data structure, at a total cost of $nm$.
For the gap buffer, the actual implementation of indexing needs to be analyzed. From the provided source code, let:
- l — amount of elements in the left slice;
- r — amount of elements in the right slice;
- i — order of the element in the gap buffer that is to be retrieved;
- q — total capacity of the gap buffer (written q here, since c already denotes the cost of one arithmetic/conditional operation).
The following constraint applies:

$$0 \le i < l + r \le q$$
The gap buffer is implemented with one contiguous slice, inhabited by elements from the left and right sides. That's why some additional calculations are required when retrieving the i-th element.

Important: i is not equal to the index of the element in the underlying data slice.
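A minimal sketch of such an indexing function, assuming the layout described above (field names are hypothetical: the left slice occupies the first l cells of the backing slice and the right slice occupies the last r cells):

```zig
// Retrieve the i-th element of the gap buffer.
fn at(gb: *const GapBuffer(u8), i: usize) u8 {
    if (i < gb.left_len) // one comparison
        return gb.data[i]; // element is in the left slice
    // Element is in the right slice: skip over the gap.
    // Three arithmetic operations: q - r, i - l, and the addition.
    return gb.data[gb.data.len - gb.right_len + (i - gb.left_len)];
}
```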
In the best case, $i < l$. This means the indexing operation consists of only one extra comparison per element:

$$T_{\text{at}} = m + c$$

In the worst case, $i \ge l$. This means the indexing operation consists of a comparison and then 3 arithmetic operations:

$$T_{\text{at}} = m + 4c$$

So, the best case for gap buffer iteration is:

$$T_{\text{best}} = n(m + c)$$

And the worst case is:

$$T_{\text{worst}} = n(m + 4c)$$
And don't forget about the third case: the temporary conversion from a gap buffer to a slice, which wraps a plain slice iteration in two cursor moves.

Best case (the cursor is already at the end, so nothing moves):

$$T_{\text{best}} = nm$$

Worst case (the cursor is at the start, so all n elements are moved there and back):

$$T_{\text{worst}} = nm + \frac{4mn}{s}$$
Total time complexity
Finally, we'll conclude our analysis by adding the time complexities of insertion and iteration together, which forms the total time complexity of a full edit-render cycle. Below are the results of the addition.

Data structure | Best case | Worst case |
---|---|---|
Gap buffer | $m + n(m + c)$ | $m + n(m + 4c)$ |
Dynamic array | $m + nm$ | $m + nm + \frac{2mn}{s}$ |
Gap buffer to slice | $m + nm$ | $m + nm + \frac{4mn}{s}$ |
As we can see from this table, gap buffer to slice never beats the dynamic array and loses to it in the worst case ($\frac{4mn}{s} > \frac{2mn}{s}$). Thus, we can exclude the third case from further research, as it is clearly not the most optimal option.

It was a bad idea to mix two data structures together in one algorithm.
Next, let's try to determine the conditions under which the remaining data structures perform equally well. To do that, we'll strip the terms common to both cases ($m + nm$) and keep only the overhead of each structure:

Data structure | Best case overhead | Worst case overhead |
---|---|---|
Gap buffer | $nc$ | $4nc$ |
Dynamic array | $0$ | $\frac{2mn}{s}$ |
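Setting the worst-case overheads equal gives the break-even point between the two structures (a direct consequence of the table above):

$$4nc = \frac{2mn}{s} \;\Longrightarrow\; s = \frac{m}{2c}$$

For any s above this threshold the dynamic array wins the worst case; below it, the gap buffer does.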
From our theoretical analysis, we can conclude the following:
- For the best case, the gap buffer will always have an overhead compared to the dynamic array, because $nc > 0$.
- For the worst case, the results are ambiguous. $m > c$, because retrieving values from memory is always slower than operating on registers on any CPU. If the memory is cached, we could also say that $m \approx 2c$². But there are also vector instructions, like SIMD, which can significantly speed up the dynamic array, potentially making it the better option.
Since plain calculations and theory don't give us a clear winner, our last option is a practical comparison of the data structures.
Practical benchmarking
Benchmarks should give us a definitive answer to the main question of this article and pick the winner between the dynamic array and the gap buffer. See the benchmark implementation for the source code (written in Zig) that was used to measure the performance of both data structures.
Benchmark implementation
const std = @import("std");

// `new_items` is the byte string being inserted (defined elsewhere
// in the benchmark source).
noinline fn gapBufferOperation(gb: *GapBuffer(u8)) void {
    // Insertion stage: store the new bytes at the cursor.
    gb.insertRightAssumeCapacity(new_items);
    // Iteration stage: read every element via the indexing function.
    for (0..gb.length()) |i|
        std.mem.doNotOptimizeAway(gb.at(i));
}

noinline fn arrayListOperation(al: *ArrayList(u8)) void {
    // Insertion stage: make room at index 0 and copy the new bytes in.
    @memcpy(al.addManyAtAssumeCapacity(0, new_items.len), new_items);
    // Iteration stage: read the contiguous slice directly.
    for (al.items) |v|
        std.mem.doNotOptimizeAway(v);
}
For the gap buffer, my own implementation was used. For the dynamic array, std.ArrayList from the Zig standard library was used.
Every operation consists of two stages:
- Inserting a new item, which can be an arbitrary string. The length of the string doesn't matter here: the cost of copying it into the buffer is the same for both data structures.
- Iterating through the whole data structure. Note the std.mem.doNotOptimizeAway calls, which prevent the compiler from omitting the value reads in the highest optimization modes (ReleaseFast and ReleaseSmall).
// Pre-allocate enough capacity up front so that neither structure
// ever reallocates inside the benchmark loop.
const n = 1024;
try gap_buffer.ensureUnusedCapacity(alloc, n * 2);
try array_list.ensureUnusedCapacity(alloc, n * 2);
The amount of bytes was chosen according to the needs of spiral, that is, command line editing. For a terminal window with 256 columns, 1024 bytes can fully contain a line consisting of the '🧐' emoji (4 bytes each in UTF-8), 2 lines of basic Greek letters like 'α' (2 bytes each), or 4 lines of English text and basic ASCII punctuation (1 byte each). 256 columns is a plausible terminal width on a desktop.
Space for the elements is pre-allocated before entering the loop, and more than n elements are allocated: that accounts for the additional elements being inserted at every iteration.

const step = std.atomic.cache_line;
var i: usize = 0;
while (i <= n) : (i += step) {
    // ...
}
Now this is the most interesting part: the increment of the loop is a value from the standard library.

Add this much padding or align to this boundary to avoid atomically-updated memory from forcing cache invalidations on near, but non-atomic, memory.

Data structures will perform at their best if their amount of elements is divisible by std.atomic.cache_line³. Otherwise, we get less smooth results and a jagged plot from benchmarking.
Benchmark execution
Benchmarks were executed on an AMD Ryzen 7 5800H CPU, which supports SIMD. Below are the results for the two optimization modes: ReleaseFast (optimizing for execution speed) and ReleaseSmall (minimizing the executable size).
The exact timings may differ on your CPU, but the ranking of the data structures shouldn't change if your CPU also supports SIMD (which is almost always the case). The results are clear:
- When optimizing for execution speed, ArrayList is significantly faster.
- When optimizing for executable size, ArrayList is a little slower than GapBuffer.
But before we move on to the conclusions, we can figure out what caused this difference by looking at the assembly of the executables.
Executable analysis
The executables were disassembled for both optimization modes. First, arrayListOperation compiled with ReleaseSmall:
push rbp
mov rbp, rsp
mov rcx, qword ptr [rdi + 0x8]
mov rax, qword ptr [rdi]
lea rdx, [rcx + 0x4]
mov qword ptr [rdi + 0x8], rdx
mov rdx, rcx
sub rdx, 0x1
jb 0x1002713 <main.arrayListOperation+0x2b>
mov sil, byte ptr [rax + rcx - 0x1]
mov byte ptr [rax + rcx + 0x3], sil
mov rcx, rdx
jmp 0x10026fe <main.arrayListOperation+0x16>
mov rax, qword ptr [rdi]
xor edx, edx
mov dword ptr [rax], 0x90a79ff0
mov rax, qword ptr [rdi]
mov rcx, qword ptr [rdi + 0x8]
cmp rcx, rdx
je 0x1002733 <main.arrayListOperation+0x4b>
mov sil, byte ptr [rax + rdx]
inc rdx
jmp 0x1002725 <main.arrayListOperation+0x3d>
pop rbp
ret
Plain scalar instructions in one loop, very simple. No vector instructions, so the SIMD parallelization coefficient (previously defined as s) is equal to 1.

Now the same function compiled with ReleaseFast:

push rbp
mov rbp, rsp
push rbx
push rax
mov rdx, qword ptr [rdi + 0x8]
mov rsi, qword ptr [rdi]
mov rbx, rdi
lea rax, [rdx + 0x4]
mov qword ptr [rdi + 0x8], rax
test rdx, rdx
je 0x1002b19 <main.arrayListOperation+0x29>
lea rdi, [rsi + 0x4]
call 0x10039a0 <memmove>
mov rsi, qword ptr [rbx]
mov dword ptr [rsi], 0x90a79ff0
; ... and many more ...
Same thing as with ReleaseSmall, but memmove is being called at line 13 of the listing; we'll disassemble it as well:

mov rax, rdi
cmp rdx, 0xf
ja 0x10039dc <memmove+0x3c>
cmp rdx, 0x3
ja 0x1003a4f <memmove+0xaf>
test rdx, rdx
je 0x1003a7b <memmove+0xdb>
mov rcx, rdx
shr rcx
movzx edi, byte ptr [rsi]
movzx r8d, byte ptr [rsi + rcx]
movzx esi, byte ptr [rsi + rdx - 0x1]
mov byte ptr [rax], dil
mov byte ptr [rax + rcx], r8b
mov byte ptr [rax + rdx - 0x1], sil
ret
push rbp
mov rbp, rsp
push r14
push rbx
sub rsp, 0x40
cmp rdx, 0x3f
ja 0x1003a7c <memmove+0xdc>
vmovups xmm0, xmmword ptr [rsi]
mov ecx, edx
shr ecx
lea rdi, [rdx - 0x10]
and ecx, 0x10
sub rdi, rcx
vmovaps xmmword ptr [rbp - 0x50], xmm0
vmovups xmm1, xmmword ptr [rsi + rcx]
vmovaps xmmword ptr [rbp - 0x20], xmm1
vmovups xmm1, xmmword ptr [rsi + rdi]
vmovaps xmmword ptr [rbp - 0x30], xmm1
vmovups xmm1, xmmword ptr [rsi + rdx - 0x10]
vmovaps xmmword ptr [rbp - 0x40], xmm1
vmovups xmmword ptr [rax], xmm0
vmovaps xmm0, xmmword ptr [rbp - 0x20]
vmovups xmmword ptr [rax + rcx], xmm0
vmovaps xmm0, xmmword ptr [rbp - 0x30]
vmovups xmmword ptr [rax + rdi], xmm0
vmovaps xmm0, xmmword ptr [rbp - 0x40]
vmovups xmmword ptr [rax + rdx - 0x10], xmm0
; ... and many more ...
xmm* are vector registers, and vmov* are vector instructions. Optimization for execution speed takes advantage of vectorization, so in this case, $s > 1$.
Going back to our theoretical analysis, we now have an explanation for both results: with ReleaseSmall, $s = 1$, so the dynamic array's worst-case overhead $2mn$ can slightly exceed the gap buffer's $4nc$ when $m > 2c$; with ReleaseFast, $s > 1$ shrinks that overhead to $\frac{2mn}{s} < 4nc$, which matches the benchmarked ranking.
Conclusion
In this research, we used both theoretical analysis and benchmarking to compare three data structures and algorithms that could be used in spiral's design. The theoretical analysis of the time complexity of every operation let us weed out the worst option. Practical benchmarking then allowed us to draw a reliable conclusion about the two remaining data structures.
So, in an immediate mode UI that redraws every line, a library that utilizes vectorization should use a dynamic array, while a library targeting architectures without SIMD support should use a gap buffer. In reality, however, the intended environment for spiral is terminals running on general purpose PCs and laptops, which nowadays always support vectorization. Not to mention that Linux distributions usually compile software with optimizations for execution speed, leveraging vectorization and greatly improving spiral's performance. That's why the dynamic array is the data structure of choice for my library.