# Benchmarks
To ensure that the program meets the performance requirements described in the Requirements document, we run the scripts described below to record the following metrics:
- how long the program takes to run;
- how much memory the `ebk-find` program itself uses; and
- how much data is sent to (and presumably loaded into memory by) the fuzzy finder.
## Benchmarking scripts
The benchmarking scripts can be run by calling:

```sh
make benchmarks
```
This will update the tables `stdout.tsv` and `time+mem.tsv` in this directory and produce the images `stdout.svg`, `time.svg`, and `memory.svg`.
### `benchmark_stdout.py`

The first version of `ebk-find` (v0.1.0) just printed the book data to stdout. This script uses that version to record the amount of data sent to the fuzzy finder. This figure must be added to the amount of memory consumed by the `ebk-find` Python process to ensure that the peak memory usage remains below the required threshold.
- The commit tagged v0.1.0 is checked out into a temporary directory using `git worktree` and installed into a virtualenv.
- The `temp_library.py` script is used to generate artificial libraries in a range of sizes up to 50 000.
- `ebk-find` v0.1.0 is run against each library in turn, and the number of bytes in the output is counted with `wc` (see the sketch after this list).
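A minimal sketch of that measurement loop, assuming a `--library` flag and counting stdout in-process rather than shelling out to `wc` (both details are illustrative, not the script's actual code):

```python
import subprocess
import sys
from pathlib import Path

def count_stdout_bytes(ebk_find: str, library: Path) -> int:
    """Run ebk-find against one library and count the bytes it prints.

    Counting len(result.stdout) is equivalent to piping through `wc -c`.
    """
    result = subprocess.run(
        [ebk_find, "--library", str(library)],  # flag name is illustrative
        stdout=subprocess.PIPE,
        check=True,
    )
    return len(result.stdout)

if __name__ == "__main__":
    # Usage: python sketch.py LIBRARY_DIR [LIBRARY_DIR ...]
    for library in map(Path, sys.argv[1:]):
        print(library, count_stdout_bytes("ebk-find", library), sep="\t")
```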
### `benchmark_time+mem.py`

This script is similar to `benchmark_stdout.py`, but it measures the performance of every minor version of `ebk-find`. For each combination of version and library size, it records the amount of time taken and the peak memory usage required to run the equivalent of `ebk-find -n -s 'tail -n 1'` against the given library.
We use `tail -n 1` as the selection command to ensure that the data for the entire library is piped to the subprocess; it only exits after having read its entire input.
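As a standalone illustration (not part of the benchmark scripts): `tail -n 1` cannot know which line is last until it sees EOF, so the producer is forced to write the whole dataset into the pipe.

```python
import subprocess

# tail must read all 100 000 lines before it can know which one is last.
lines = "".join(f"book {i}\n" for i in range(100_000))
result = subprocess.run(["tail", "-n", "1"], input=lines, text=True,
                        capture_output=True, check=True)
print(result.stdout, end="")  # -> "book 99999"
```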
- The `temp_library.py` script is used to generate artificial libraries in a range of sizes up to 50 000.
- Each minor version of `ebk-find` is installed into a virtualenv as described for `benchmark_stdout.py`.
- Each version of `ebk-find` is run against each library in turn (the arguments for the command being adjusted based on the options available and/or required for that version of the program). The call to `ebk-find` is wrapped in a call to `/usr/bin/time`, which records the elapsed real time and the maximum RSS of the process (see the sketch after this list).
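A sketch of the timing wrapper, assuming GNU `/usr/bin/time` (the `%e` and `%M` format specifiers are GNU extensions; the helper name and the parsing are illustrative, not the script's actual code):

```python
import subprocess

def time_and_peak_rss(cmd: list[str]) -> tuple[float, int]:
    """Return (elapsed seconds, max RSS in kilobytes) for a command.

    GNU time writes its report to stderr; %e is the elapsed real time in
    seconds and %M is the maximum resident set size in kilobytes.
    """
    result = subprocess.run(
        ["/usr/bin/time", "-f", "%e %M", *cmd],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        text=True,
        check=True,
    )
    # The report is the last line of stderr, after any program output.
    elapsed, max_rss = result.stderr.strip().splitlines()[-1].split()
    return float(elapsed), int(max_rss)

# Flags are illustrative; real arguments vary between ebk-find versions.
print(time_and_peak_rss(["ebk-find", "-n", "-s", "tail -n 1"]))
```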
### `temp_library.py`

This script generates a single artificial library of the requested size. The generated library contains the full directory tree expected by `ebk-find`, with a minimal but valid `metadata.opf` file and an empty `.epub` file in each leaf directory.
- In a new temporary directory, an author subdirectory is created.
- Inside that directory, *n* book directories are created, where *n* is the requested library size.
- Inside each book directory, a `metadata.opf` file is created from a template file. The template is populated by inserting the author field; a unique title field (so that book entries can be distinguished by the finder program); and a dummy subject tag field, which is padded such that the total file size is representative of that of OPF files found in a real Calibre library.
- An empty `.epub` file is also created in each book directory (see the sketch after this list).
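A sketch of the generation loop; the template below is one plausible minimal OPF, and the author name, directory layout, and target file size are all illustrative rather than the script's actual values:

```python
import tempfile
from pathlib import Path

# A plausible minimal metadata.opf template; the real template may differ.
OPF_TEMPLATE = """<?xml version="1.0" encoding="utf-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:creator>{author}</dc:creator>
    <dc:title>{title}</dc:title>
    <dc:subject>{padding}</dc:subject>
  </metadata>
</package>
"""

def make_library(n: int, opf_size: int = 1200) -> Path:
    """Create an artificial library of n books in a temporary directory."""
    root = Path(tempfile.mkdtemp(prefix="ebk-bench-"))
    author_dir = root / "Test Author"
    for i in range(n):
        title = f"Book {i:05d}"  # unique, so the finder can tell books apart
        book_dir = author_dir / title
        book_dir.mkdir(parents=True)
        # Pad the dummy subject so the file size matches a realistic OPF.
        base = OPF_TEMPLATE.format(author="Test Author", title=title, padding="")
        padding = "x" * max(0, opf_size - len(base))
        (book_dir / "metadata.opf").write_text(
            OPF_TEMPLATE.format(author="Test Author", title=title, padding=padding)
        )
        (book_dir / f"{title}.epub").touch()  # empty placeholder book file
    return root
```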
### `plot_bench_*.py`

These scripts plot the tables produced by the `benchmark_*.py` scripts for visual inspection.
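A minimal sketch of one such plot, assuming `time+mem.tsv` has `version`, `library_size`, and `time_s` columns (the real column names may differ):

```python
import csv
from collections import defaultdict

import matplotlib.pyplot as plt

# Group (library size, time) points by version; column names are assumed.
series: dict[str, list[tuple[int, float]]] = defaultdict(list)
with open("time+mem.tsv", newline="") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        series[row["version"]].append(
            (int(row["library_size"]), float(row["time_s"]))
        )

fig, ax = plt.subplots()
for version, points in sorted(series.items()):
    points.sort()
    ax.plot(*zip(*points), marker="o", label=version)
ax.set_xscale("log")  # library sizes span orders of magnitude
ax.set_xlabel("library size (books)")
ax.set_ylabel("time (s)")
ax.legend()
fig.savefig("time.svg")
```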
## Results
### Time
A library of 2500 books can be searched in less than 300 ms, and a library of 50 000 books can be searched in less than 5 s.
The time taken increases more or less linearly with library size (this is hard to see in the plot because only the x axis uses a log scale; refer to the raw data).
The time requirements are fairly similar across versions.
### Memory usage
Memory usage increases linearly with library size, but it remains below 100 MiB for a library of 50 000 books.
It isn’t very clear from the plot, but the memory used by v0.3.0 and v0.4.2 was nearly identical.
The jump in memory usage between v0.2.0 and v0.3.0 occurred because the book data started to be retained in a list in the Python process, so that the user could be presented with just the metadata to search; the path to the book file is then looked up afterwards based on their selection.
### Data sent to fuzzy finder
The amount of data sent to the finder subprocess increases linearly with library size. It is about 6 MiB for a library of 50 000 books.
## Conclusions
For most users, the loading of their library into the fuzzy finder should feel instantaneous (< 300 ms for 2500 books).
The maximum memory required for `ebk-find`, including the finder subprocess, is just over 100 MiB for a library of 50 000 books: 95.2 MiB for `ebk-find` + 6.5 MiB for `fzf` + 6.3 MiB of data passed between them = 108.0 MiB.
For a library of 2500 books, it would take about 22.0 + 6.5 + 0.3 = 28.8 MiB.