# Benchmarks
To ensure that the program meets the performance requirements described in the Requirements document, we run the scripts described below to record the following metrics:
- how long the program takes to run;
- how much memory the `ebk-find` program itself uses; and
- how much data is sent to (and presumably loaded into memory by) the fuzzy finder.
## Benchmarking scripts
The benchmarking scripts can be run by calling:

```sh
make benchmarks
```
This will update the tables `stdout.tsv` and `time+mem.tsv` in this directory and produce the images `stdout.svg`, `time.svg`, and `memory.svg`.
### `benchmark_stdout.py`

The first version of `ebk-find` (v0.1.0) just printed the book data to stdout. This script uses that version to record the amount of data sent to the fuzzy finder. This figure must be added to the amount of memory consumed by the `ebk-find` Python process to ensure that the peak memory usage remains below the required threshold.
- The commit tagged v0.1.0 is checked out into a temporary directory using `git worktree` and installed into a virtualenv.
- The `temp_library.py` script is used to generate artificial libraries in a range of sizes up to 50 000.
- `ebk-find` v0.1.0 is run against each library in turn, and the number of bytes in the output is counted with `wc` (see the sketch after this list).
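A minimal sketch of that measurement loop, assuming a `--library` flag and counting stdout in-process rather than shelling out to `wc` (both details are illustrative, not the script's actual code):

```python
import subprocess
import sys
from pathlib import Path

def count_stdout_bytes(ebk_find: str, library: Path) -> int:
    """Run ebk-find against one library and count the bytes it prints.

    Counting len(result.stdout) is equivalent to piping through `wc -c`.
    """
    result = subprocess.run(
        [ebk_find, "--library", str(library)],  # flag name is illustrative
        stdout=subprocess.PIPE,
        check=True,
    )
    return len(result.stdout)

if __name__ == "__main__":
    # Usage: python sketch.py LIBRARY_DIR [LIBRARY_DIR ...]
    for library in map(Path, sys.argv[1:]):
        print(library, count_stdout_bytes("ebk-find", library), sep="\t")
```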
### `benchmark_time+mem.py`

This script is similar to `benchmark_stdout.py`, but it measures the performance of every minor version of `ebk-find`. For each combination of version and library size, it records the amount of time taken and the peak memory usage required to run the equivalent of `ebk-find -n -s 'tail -n 1'` against the given library.
We use `tail -n 1` as the selection command to ensure that the data for the entire library is piped to the subprocess; it only exits after having read its entire input.
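As a standalone illustration (not part of the benchmark scripts): `tail -n 1` cannot know which line is last until it sees EOF, so the producer is forced to write the whole dataset into the pipe.

```python
import subprocess

# tail must read all 100 000 lines before it can know which one is last.
lines = "".join(f"book {i}\n" for i in range(100_000))
result = subprocess.run(["tail", "-n", "1"], input=lines, text=True,
                        capture_output=True, check=True)
print(result.stdout, end="")  # -> "book 99999"
```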
- The `temp_library.py` script is used to generate artificial libraries in a range of sizes up to 50 000.
- Each minor version of `ebk-find` is installed into a virtualenv as described for `benchmark_stdout.py`.
- Each version of `ebk-find` is run against each library in turn (the arguments for the command being adjusted based on the options available and/or required for that version of the program). The call to `ebk-find` is wrapped in a call to `/usr/bin/time`, which records the elapsed real time and the maximum RSS of the process (see the sketch after this list).
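A sketch of the timing wrapper, assuming GNU `/usr/bin/time` (the `%e` and `%M` format specifiers are GNU extensions; the helper name and the parsing are illustrative, not the script's actual code):

```python
import subprocess

def time_and_peak_rss(cmd: list[str]) -> tuple[float, int]:
    """Return (elapsed seconds, max RSS in kilobytes) for a command.

    GNU time writes its report to stderr; %e is the elapsed real time in
    seconds and %M is the maximum resident set size in kilobytes.
    """
    result = subprocess.run(
        ["/usr/bin/time", "-f", "%e %M", *cmd],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        text=True,
        check=True,
    )
    # The report is the last line of stderr, after any program output.
    elapsed, max_rss = result.stderr.strip().splitlines()[-1].split()
    return float(elapsed), int(max_rss)

# Flags are illustrative; real arguments vary between ebk-find versions.
print(time_and_peak_rss(["ebk-find", "-n", "-s", "tail -n 1"]))
```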
### `temp_library.py`

This script generates a single artificial library of the requested size. The generated library contains the full directory tree expected by `ebk-find`, with a minimal but valid `metadata.opf` file and an empty `.epub` file in each leaf directory.
- In a new temporary directory, an author subdirectory is created.
- Inside that directory, *n* book directories are created, where *n* is the requested library size.
- Inside each book directory, a `metadata.opf` file is created from a template file. The template is populated by inserting the author field; a unique title field (so that book entries can be distinguished by the finder program); and a dummy subject tag field, which is padded such that the total file size is representative of that of OPF files found in a real Calibre library.
- An empty `.epub` file is also created in each book directory (see the sketch after this list).
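A sketch of the generation loop; the template below is one plausible minimal OPF, and the author name, directory layout, and target file size are all illustrative rather than the script's actual values:

```python
import tempfile
from pathlib import Path

# A plausible minimal metadata.opf template; the real template may differ.
OPF_TEMPLATE = """<?xml version="1.0" encoding="utf-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:creator>{author}</dc:creator>
    <dc:title>{title}</dc:title>
    <dc:subject>{padding}</dc:subject>
  </metadata>
</package>
"""

def make_library(n: int, opf_size: int = 1200) -> Path:
    """Create an artificial library of n books in a temporary directory."""
    root = Path(tempfile.mkdtemp(prefix="ebk-bench-"))
    author_dir = root / "Test Author"
    for i in range(n):
        title = f"Book {i:05d}"  # unique, so the finder can tell books apart
        book_dir = author_dir / title
        book_dir.mkdir(parents=True)
        # Pad the dummy subject so the file size matches a realistic OPF.
        base = OPF_TEMPLATE.format(author="Test Author", title=title, padding="")
        padding = "x" * max(0, opf_size - len(base))
        (book_dir / "metadata.opf").write_text(
            OPF_TEMPLATE.format(author="Test Author", title=title, padding=padding)
        )
        (book_dir / f"{title}.epub").touch()  # empty placeholder book file
    return root
```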
### `plot_bench_*.py`

These scripts plot the tables produced by the `benchmark_*.py` scripts for visual inspection.
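A minimal sketch of one such plot, assuming `time+mem.tsv` has `version`, `library_size`, and `time_s` columns (the real column names may differ):

```python
import csv
from collections import defaultdict

import matplotlib.pyplot as plt

# Group (library size, time) points by version; column names are assumed.
series: dict[str, list[tuple[int, float]]] = defaultdict(list)
with open("time+mem.tsv", newline="") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        series[row["version"]].append(
            (int(row["library_size"]), float(row["time_s"]))
        )

fig, ax = plt.subplots()
for version, points in sorted(series.items()):
    points.sort()
    ax.plot(*zip(*points), marker="o", label=version)
ax.set_xscale("log")  # library sizes span orders of magnitude
ax.set_xlabel("library size (books)")
ax.set_ylabel("time (s)")
ax.legend()
fig.savefig("time.svg")
```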
## Results
### Time
A library of 2500 books can be searched in less than 300 ms, and a library of 50 000 books can be searched in less than 5 s.
The time taken increases more or less linearly with library size (this is hard to see in the plot because only the x axis uses a log scale; refer to the raw data).
The time requirements are fairly similar across versions.
### Memory usage
Memory usage increases linearly with library size, but it remains below 100 MiB for a library of 50 000 books.
It isn’t very clear from the plot, but the memory used by v0.3.0 and v0.4.2 was nearly identical.
The jump in memory usage between v0.2.0 and v0.3.0 occurred because the book data started to be retained in a list in the Python process, so that the user could be presented with just the metadata to search; the path to the book file is then looked up afterwards based on their selection.
### Data sent to fuzzy finder
The amount of data sent to the finder subprocess increases linearly with library size. It is about 6 MiB for a library of 50 000 books.
## Conclusions
For most users, the loading of their library into the fuzzy finder should feel instantaneous (< 300 ms for 2500 books).
The maximum memory required for `ebk-find`, including the finder subprocess, is just over 100 MiB for a library of 50 000 books: 95.2 MiB for `ebk-find` + 6.5 MiB for `fzf` + 6.3 MiB of data passed between them = 108.0 MiB.
For a library of 2500 books, it would take about 22.0 + 6.5 + 0.3 = 28.8 MiB.