Problem overview
At the time of writing this article, I am developing a new project named spiral. As a text editing library for the terminal, it handles input, Unicode processing and, most importantly, rendering. Efficient text editing requires efficient data structures, which is why the first versions of this project use a gap buffer.
However, while text manipulation is implemented extremely efficiently, the same can't be said about rendering a line in the terminal after every change. This is a consequence of the immediate mode UI design pattern (see Immediate mode UI for the reasoning behind this choice). Even if you insert just one character by pressing a single key in your favourite application built on spiral, the byte lands in the buffer quickly, but the application then has to iterate through every byte to render the line, on almost every keystroke. This inefficiency can overshadow the performance of a gap buffer.
On the other hand, a classic dynamic array allows for extremely easy rendering, because it is just a contiguous slice of items in memory, but its insertion operation is less efficient. So the main question to be answered in this article is:
What is going to be more efficient for spiral and similar use cases: the slow insertion but fast rendering of a dynamic array, or the fast insertion but slightly slower rendering (and added complexity) of a gap buffer?
Possible solutions
This is the basic sequence of operations in spiral's main loop (a code sketch follows the list):
- Insert a new character into the data structure.
- Clear terminal screen.
- Iterate through every character in the data structure and print it to the terminal.
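In code, the cycle could look roughly like this. This is a condensed sketch, not spiral's actual source: `readKey`, `terminal` and the buffer API are hypothetical names standing in for whichever data structure we choose.

```zig
// Hypothetical edit-render loop.
while (readKey()) |key| {
    buffer.insert(key); // 1. insert the new character
    terminal.clear(); // 2. clear the terminal screen
    for (0..buffer.length()) |i| // 3. iterate and print every character
        terminal.print(buffer.at(i));
}
```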
The data structure is yet to be decided, but there are three options that came to my mind while developing my library:
Using a gap buffer.
A data structure specifically designed for efficient text editing. Insertions near the movable gap (we'll call it the cursor in the context of text editing) are efficient. Iteration is a little complicated, because elements are stored in 2 separate slices: left and right of the cursor.
Using a dynamic array.
A data structure with an extremely simple layout in memory. Iteration is efficient, but insertions at arbitrary positions are complicated.
Using a gap buffer, but temporarily converting it to a dynamic array before rendering.
Taking the best of both worlds: the efficient insertions of a gap buffer and the efficient iteration of a dynamic array.
The conversion between the data structures happens by the following algorithm (sketched in code right after the list):
- Let n be the amount of elements in the right slice at this moment.
- Move the cursor n elements to the right.
- Use the left slice of the gap buffer the same way as a dynamic array.
- Move the cursor n elements back to the left.
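Here is a minimal sketch of that conversion, assuming a hypothetical gap buffer API (`moveCursorRight`, `moveCursorLeft`, the `left_len`/`right_len` fields and `render` are illustrative names, not spiral's actual ones):

```zig
// Sketch of the temporary gap-buffer-to-slice conversion.
// After moving the cursor fully to the right, the right slice is
// empty, so all elements sit contiguously in the left slice.
fn renderAsSlice(gb: *GapBuffer(u8)) void {
    const n = gb.right_len; // elements currently right of the cursor
    gb.moveCursorRight(n); // everything moves into the left slice
    render(gb.data[0..gb.left_len]); // iterate like a dynamic array
    gb.moveCursorLeft(n); // restore the original cursor position
}
```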
With our three contenders for spiral's data structure established, we can proceed to their analysis and comparison.
Theoretical analysis
First, we'll start with analyzing the time complexity of the operations through a series of mathematical calculations. We're going to count every common CPU operation, so the terms for our formulas will be:
- n — amount of bytes in a data structure;
- m — time of executing one memory operation, either reading or writing one byte;
- c — time of executing one arithmetic/conditional operation.
Every data structure dynamically allocates memory and reserves it for future element insertions. For the sake of simplicity, we're going to assume that the data structures always have unused capacity and thus never require new allocations.
Time complexity of insertion
For a gap buffer, insertion is extremely simple: store the new element right after the left slice. No additional calculation is required (not counting the length increment, which is common to all data structures), so the cost is a single memory write:

$$T_{\text{insert}} = m$$
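As a sketch (with hypothetical field names, not spiral's actual ones), the whole operation is one write into the gap:

```zig
// Gap buffer insertion sketch: the new element lands right after
// the left slice, i.e. in the first free cell of the gap.
fn insertLeft(gb: *GapBuffer(u8), item: u8) void {
    gb.data[gb.left_len] = item; // one memory write: cost m
    gb.left_len += 1; // length increment, common to all structures
}
```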
For a dynamic array, however, insertion is harder. Before inserting the element for the same easy m, we first need to move k elements one position forward. The main question is: how large is k?
Here is where our analysis branches into two different cases: the worst and the best.
- Worst case: the cursor is at the start of the line, so k = n.
- Best case: the cursor is at the end of the line, so k = 0.
In theory, every element needs to be read from its old position and then written to its new position. In POSIX programming, this operation is represented by the memmove function¹. That's why the time complexity of moving k elements is:

$$T_{\text{move}}(k) = 2mk$$

However, in practice, memmove-like functions often get compiled with the use of SIMD instructions, greatly parallelizing the processing of data. Let s be the SIMD parallelization coefficient; the time complexity of memmove then becomes:

$$T_{\text{move}}(k) = \frac{2mk}{s}$$
So, the best case for dynamic array insertion (k = 0) is:

$$T_{\text{best}} = m$$

And the worst case (k = n):

$$T_{\text{worst}} = m + \frac{2mn}{s}$$
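To get a feel for the difference, here is a worked example with illustrative numbers: n = 1024 elements and an assumed SIMD coefficient of s = 16 give

$$T_{\text{worst}} = m + \frac{2m \cdot 1024}{16} = 129m,$$

compared to $2049m$ for a purely scalar copy with $s = 1$.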
Time complexity of iteration
A dynamic array has straightforward iteration, identical to iterating a slice: just sequentially read every element in the data structure, at a total cost of $nm$.
For the gap buffer, the actual implementation of indexing needs to be analyzed. From the provided source code, let:
- l — amount of elements in the left slice;
- r — amount of elements in the right slice;
- i — order of the element in the gap buffer that is to be retrieved;
- q — total capacity of the gap buffer (written q here, since c already denotes the cost of one arithmetic/conditional operation).
The following constraint applies:

$$0 \le i < l + r \le q$$
The gap buffer is implemented with one contiguous slice, inhabited by elements from the left and right sides. That's why some additional calculations are required when retrieving the i-th element.

Important: i is not equal to the index of the element in the underlying data slice.
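A minimal sketch of such an indexing function, assuming the layout described above (field names are hypothetical: the left slice occupies the first l cells of the backing slice and the right slice occupies the last r cells):

```zig
// Retrieve the i-th element of the gap buffer.
fn at(gb: *const GapBuffer(u8), i: usize) u8 {
    if (i < gb.left_len) // one comparison
        return gb.data[i]; // element is in the left slice
    // Element is in the right slice: skip over the gap.
    // Three arithmetic operations: q - r, i - l, and the addition.
    return gb.data[gb.data.len - gb.right_len + (i - gb.left_len)];
}
```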
In the best case, $i < l$. This means the indexing operation consists of only one extra comparison per element:

$$T_{\text{at}} = m + c$$

In the worst case, $i \ge l$. This means the indexing operation consists of a comparison and then 3 arithmetic operations:

$$T_{\text{at}} = m + 4c$$

So, the best case for gap buffer iteration is:

$$T_{\text{best}} = n(m + c)$$

And the worst case is:

$$T_{\text{worst}} = n(m + 4c)$$
And don't forget about the third case: the temporary conversion from a gap buffer to a slice, which wraps a plain slice iteration in two cursor moves.

Best case (the cursor is already at the end, so nothing moves):

$$T_{\text{best}} = nm$$

Worst case (the cursor is at the start, so all n elements are moved there and back):

$$T_{\text{worst}} = nm + \frac{4mn}{s}$$
Total time complexity
Finally, we'll conclude our analysis by adding the time complexities of insertion and iteration together, which forms the total time complexity of a full edit-render cycle. Below are the results of the addition.

Data structure | Best case | Worst case |
---|---|---|
Gap buffer | $m + n(m + c)$ | $m + n(m + 4c)$ |
Dynamic array | $m + nm$ | $m + nm + \frac{2mn}{s}$ |
Gap buffer to slice | $m + nm$ | $m + nm + \frac{4mn}{s}$ |
As we can see from this table, gap buffer to slice never beats the dynamic array and loses to it in the worst case ($\frac{4mn}{s} > \frac{2mn}{s}$). Thus, we can exclude the third case from further research, as it is clearly not the most optimal option.

It was a bad idea to mix two data structures together in one algorithm.
Next, let's try to determine the conditions under which the remaining data structures perform equally well. To do that, we'll strip the terms common to both cases ($m + nm$) and keep only the overhead of each structure:

Data structure | Best case overhead | Worst case overhead |
---|---|---|
Gap buffer | $nc$ | $4nc$ |
Dynamic array | $0$ | $\frac{2mn}{s}$ |
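Setting the worst-case overheads equal gives the break-even point between the two structures (a direct consequence of the table above):

$$4nc = \frac{2mn}{s} \;\Longrightarrow\; s = \frac{m}{2c}$$

For any s above this threshold the dynamic array wins the worst case; below it, the gap buffer does.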
From our theoretical analysis, we can conclude the following:
- For the best case, the gap buffer will always have an overhead compared to the dynamic array, because $nc > 0$.
- For the worst case, the results are ambiguous. $m > c$, because retrieving values from memory is always slower than operating on registers on any CPU. If the memory is cached, we could also say that $m \approx 2c$². But there are also vector instructions, like SIMD, which can significantly speed up the dynamic array, potentially making it the better option.
Since plain calculations and theory don't give us a clear winner, our last option is a practical comparison of the data structures.
Practical benchmarking
Benchmarks should give us a definitive answer to the main question of this article and pick the winner between the dynamic array and the gap buffer. See the benchmark implementation for the source code (written in Zig) that was used to measure the performance of both data structures.
Benchmark implementation
const std = @import("std");

// `new_items` is the byte string being inserted (defined elsewhere
// in the benchmark source).
noinline fn gapBufferOperation(gb: *GapBuffer(u8)) void {
    // Insertion stage: store the new bytes at the cursor.
    gb.insertRightAssumeCapacity(new_items);
    // Iteration stage: read every element via the indexing function.
    for (0..gb.length()) |i|
        std.mem.doNotOptimizeAway(gb.at(i));
}

noinline fn arrayListOperation(al: *ArrayList(u8)) void {
    // Insertion stage: make room at index 0 and copy the new bytes in.
    @memcpy(al.addManyAtAssumeCapacity(0, new_items.len), new_items);
    // Iteration stage: read the contiguous slice directly.
    for (al.items) |v|
        std.mem.doNotOptimizeAway(v);
}
For the gap buffer, my own implementation was used. For the dynamic array, std.ArrayList from the Zig standard library was used.
Every operation consists of two stages:
- Inserting a new item, which can be an arbitrary string. The length of the string doesn't matter here: the cost of copying it into the buffer is the same for both data structures.
- Iterating through the whole data structure. Note the std.mem.doNotOptimizeAway calls, which prevent the compiler from omitting the value reads in the highest optimization modes (ReleaseFast and ReleaseSmall).
// Pre-allocate enough capacity up front so that neither structure
// ever reallocates inside the benchmark loop.
const n = 1024;
try gap_buffer.ensureUnusedCapacity(alloc, n * 2);
try array_list.ensureUnusedCapacity(alloc, n * 2);
The amount of bytes was chosen according to the needs of spiral, that is, command line editing. For a terminal window with 256 columns, 1024 bytes can fully contain a line consisting of the '🧐' emoji (4 bytes each in UTF-8), 2 lines of basic Greek letters like 'α' (2 bytes each), or 4 lines of English text and basic ASCII punctuation (1 byte each). 256 columns is a plausible terminal width on a desktop.
Space for the elements is pre-allocated before entering the loop, and more than n elements are allocated: that accounts for the additional elements being inserted at every iteration.

const step = std.atomic.cache_line;
var i: usize = 0;
while (i <= n) : (i += step) {
    // ...
}
Now this is the most interesting part: the increment of the loop is a value from the standard library.

Add this much padding or align to this boundary to avoid atomically-updated memory from forcing cache invalidations on near, but non-atomic, memory.

Data structures will perform at their best if their amount of elements is divisible by std.atomic.cache_line³. Otherwise, we get less smooth results and a jagged plot from benchmarking.
Benchmark execution
Benchmarks were executed on an AMD Ryzen 7 5800H CPU, which supports SIMD. Below are the results for the two optimization modes: ReleaseFast (optimizing for execution speed) and ReleaseSmall (minimizing the executable size).
The exact timings may differ on your CPU, but the ranking of the data structures shouldn't change if your CPU also supports SIMD (which is almost always the case). The results are clear:
- When optimizing for execution speed, ArrayList is significantly faster.
- When optimizing for executable size, ArrayList is a little slower than GapBuffer.
But before we move on to the conclusions, we can figure out what caused this difference by looking at the assembly of the executables.
Executable analysis
The executables were disassembled for both optimization modes. First, arrayListOperation compiled with ReleaseSmall:
push rbp
mov rbp, rsp
mov rcx, qword ptr [rdi + 0x8]
mov rax, qword ptr [rdi]
lea rdx, [rcx + 0x4]
mov qword ptr [rdi + 0x8], rdx
mov rdx, rcx
sub rdx, 0x1
jb 0x1002713 <main.arrayListOperation+0x2b>
mov sil, byte ptr [rax + rcx - 0x1]
mov byte ptr [rax + rcx + 0x3], sil
mov rcx, rdx
jmp 0x10026fe <main.arrayListOperation+0x16>
mov rax, qword ptr [rdi]
xor edx, edx
mov dword ptr [rax], 0x90a79ff0
mov rax, qword ptr [rdi]
mov rcx, qword ptr [rdi + 0x8]
cmp rcx, rdx
je 0x1002733 <main.arrayListOperation+0x4b>
mov sil, byte ptr [rax + rdx]
inc rdx
jmp 0x1002725 <main.arrayListOperation+0x3d>
pop rbp
ret
Plain scalar instructions in one loop, very simple. No vector instructions, so the SIMD parallelization coefficient (previously defined as s) is equal to 1.

Now the same function compiled with ReleaseFast:

push rbp
mov rbp, rsp
push rbx
push rax
mov rdx, qword ptr [rdi + 0x8]
mov rsi, qword ptr [rdi]
mov rbx, rdi
lea rax, [rdx + 0x4]
mov qword ptr [rdi + 0x8], rax
test rdx, rdx
je 0x1002b19 <main.arrayListOperation+0x29>
lea rdi, [rsi + 0x4]
call 0x10039a0 <memmove>
mov rsi, qword ptr [rbx]
mov dword ptr [rsi], 0x90a79ff0
; ... and many more ...
Same thing as with ReleaseSmall, but memmove is being called at line 13 of the listing; we'll disassemble it as well:

mov rax, rdi
cmp rdx, 0xf
ja 0x10039dc <memmove+0x3c>
cmp rdx, 0x3
ja 0x1003a4f <memmove+0xaf>
test rdx, rdx
je 0x1003a7b <memmove+0xdb>
mov rcx, rdx
shr rcx
movzx edi, byte ptr [rsi]
movzx r8d, byte ptr [rsi + rcx]
movzx esi, byte ptr [rsi + rdx - 0x1]
mov byte ptr [rax], dil
mov byte ptr [rax + rcx], r8b
mov byte ptr [rax + rdx - 0x1], sil
ret
push rbp
mov rbp, rsp
push r14
push rbx
sub rsp, 0x40
cmp rdx, 0x3f
ja 0x1003a7c <memmove+0xdc>
vmovups xmm0, xmmword ptr [rsi]
mov ecx, edx
shr ecx
lea rdi, [rdx - 0x10]
and ecx, 0x10
sub rdi, rcx
vmovaps xmmword ptr [rbp - 0x50], xmm0
vmovups xmm1, xmmword ptr [rsi + rcx]
vmovaps xmmword ptr [rbp - 0x20], xmm1
vmovups xmm1, xmmword ptr [rsi + rdi]
vmovaps xmmword ptr [rbp - 0x30], xmm1
vmovups xmm1, xmmword ptr [rsi + rdx - 0x10]
vmovaps xmmword ptr [rbp - 0x40], xmm1
vmovups xmmword ptr [rax], xmm0
vmovaps xmm0, xmmword ptr [rbp - 0x20]
vmovups xmmword ptr [rax + rcx], xmm0
vmovaps xmm0, xmmword ptr [rbp - 0x30]
vmovups xmmword ptr [rax + rdi], xmm0
vmovaps xmm0, xmmword ptr [rbp - 0x40]
vmovups xmmword ptr [rax + rdx - 0x10], xmm0
; ... and many more ...
xmm* are vector registers, and vmov* are vector instructions. Optimization for execution speed takes advantage of vectorization, so in this case, $s > 1$.
Going back to our theoretical analysis, we now have an explanation for both results: with ReleaseSmall, $s = 1$, so the dynamic array's worst-case overhead $2mn$ can slightly exceed the gap buffer's $4nc$ when $m > 2c$; with ReleaseFast, $s > 1$ shrinks that overhead to $\frac{2mn}{s} < 4nc$, which matches the benchmarked ranking.
Conclusion
In this research, we used both theoretical analysis and benchmarking to compare three data structures and algorithms that could be used in spiral's design. The theoretical analysis of the time complexity of every operation let us weed out the worst option. Practical benchmarking then allowed us to draw a reliable conclusion about the two remaining data structures.
So, in an immediate mode UI that redraws every line, a library that utilizes vectorization should use a dynamic array, while a library targeting architectures without SIMD support should use a gap buffer. In reality, however, the intended environment for spiral is terminals running on general purpose PCs and laptops, which nowadays always support vectorization. Not to mention that Linux distributions usually compile software with optimizations for execution speed, leveraging vectorization and greatly improving spiral's performance. That's why the dynamic array is the data structure of choice for my library.