Elasticity

Wednesday, June 18, 2025

Simple Hashing with AF_ALG

My previous post demonstrated how the Linux kernel's AF_ALG socket type can be used with io_uring for fast hashing.

I wasn't very happy with the complexity of the io_uring based implementation, so I set out to write something much simpler with plain syscalls. What came out was just over 100 lines of boring, uncomplicated C. Better still, using splice() for copy offload means it performs very similarly to the io_uring based implementation on my systems.

I've published the BSD-3-Clause licensed source at https://github.com/ddiss/splice-digest, with the main snippets below:

#define SPLICE_MAX (1024 * 1024)
...
int main(int argc, char *argv[])
{
...
        struct sockaddr_alg sa = {
                .salg_family = AF_ALG,
                .salg_type = "hash",
        };
...
        infd = open(infile, O_RDONLY);
...
        sfd = socket(AF_ALG, SOCK_SEQPACKET, 0);
...
        memcpy(sa.salg_name, alg, alg_len + 1);
        if (bind(sfd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
...
        }

        outfd = accept(sfd, NULL, 0);
...
        if (pipe(pipefds) < 0)
                err(-1, "pipe");
...
        for (insize = st.st_size; insize; insize -= got) {
                size_t tryl = (insize < SPLICE_MAX ? insize : SPLICE_MAX);

                l = splice(infd, NULL, pipefds[1], NULL, tryl, 0);
...
                got = splice(pipefds[0], NULL, outfd, NULL, l, SPLICE_F_MORE);
...
        }

        fprintf(stdout, "Spliced %s(%s): ", alg, infile);
        print_hash_result(outfd);
        putc('\n', stdout);
}

Hopefully this convinces a few people to drop those external crypto libraries in favour of Linux AF_ALG.

Wednesday, October 9, 2024

Speedreader's Digest: Hashing Data on Linux using AF_ALG

(I seem to have written this in the tone of an advertisement for household cleaning products. Sorry.)

Do you need to quickly hash (e.g. SHA-1) content on Linux? Want to avoid linking against bloated crypto libraries? There's no need to roll your own; use the Linux kernel's AF_ALG functionality! It's fast, supports many hash algorithms, and plumbs easily into existing network or disk I/O pipelines.
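To check which hash algorithms a given kernel currently has registered, /proc/crypto can be inspected. The following is a minimal shell sketch, assuming the usual "key : value" layout of /proc/crypto, where hash implementations report a type of "shash" or "ahash". list_kernel_hashes is a hypothetical helper name, and the list may not be exhaustive, as further algorithms can be provided by modules that aren't loaded yet.

```shell
#!/bin/sh
# List the hash algorithm names currently registered with the kernel
# crypto API. Assumption: /proc/crypto uses "key : value" lines and hash
# implementations report a type of "shash" or "ahash".
list_kernel_hashes() {
    awk -F' *: *' '
        $1 == "name" { name = $2 }
        $1 == "type" && ($2 == "shash" || $2 == "ahash") { print name }
    ' "${1:-/proc/crypto}" | sort -u
}

# Only query the live system when /proc/crypto is readable.
if [ -r /proc/crypto ]; then
    list_kernel_hashes
fi
```

A name printed here (e.g. sha256) is the kind of string that would be copied into salg_name before bind().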

Look at these benchmark results, comparing io_uring kdigest and openssl performance (elapsed seconds as reported by perf stat across 5 runs; lower is better):

kdigest

FILE SIZE	512 bytes	4096	65536	1M	16M	32M
md5	0.0010984 +-1.44%	0.0008265 +-2.65%	0.0009556 +-1.52%	0.0024098 +-0.35%	0.0266533 +-0.06%	0.0521402 +-0.19%
sha1	0.0011012 +-1.59%	0.0009430 +-2.21%	0.0009173 +-1.89%	0.0019097 +-1.29%	0.0186466 +-0.18%	0.0361400 +-0.13%
sha224	0.0010983 +-1.29%	0.0010970 +-1.72%	0.0010350 +-1.32%	0.0036209 +-0.42%	0.0425834 +-0.06%	0.0841893 +-0.06%
sha256	0.0010996 +-1.34%	0.0011159 +-1.74%	0.0010299 +-1.09%	0.0036085 +-0.40%	0.0426466 +-0.12%	0.0840805 +-0.03%
sha384	0.0011094 +-1.58%	0.0011204 +-1.48%	0.0009555 +-2.96%	0.0027775 +-0.58%	0.0296736 +-0.27%	0.058271 +-0.27%
sha512	0.0010909 +-1.71%	0.0010763 +-3.21%	0.0009746 +-1.77%	0.0027719 +-0.74%	0.0297744 +-0.26%	0.058239 +-0.25%

openssl 3.1.4-3.2

FILE SIZE	512 bytes	4096	65536	1M	16M	32M
md5	0.0039263 +-0.81%	0.0029834 +-1.69%	0.0030833 +-1.28%	0.0044167 +-0.87%	0.0286969 +-0.20%	0.0536009 +-0.16%
sha1	0.0039302 +-1.16%	0.0029809 +-1.62%	0.0030169 +-0.99%	0.0040051 +-1.04%	0.0220672 +-0.29%	0.0414711 +-0.19%
sha224	0.0039211 +-0.99%	0.0039417 +-0.95%	0.0031360 +-1.43%	0.0055392 +-0.69%	0.0408525 +-0.18%	0.078564 +-0.20%
sha256	0.0039277 +-0.60%	0.0039653 +-0.97%	0.0031659 +-1.26%	0.0055284 +-0.76%	0.0408774 +-0.11%	0.0788206 +-0.10%
sha384	0.0039370 +-1.13%	0.0039494 +-0.87%	0.0030840 +-1.26%	0.0047742 +-0.81%	0.029442 +-0.35%	0.056091 +-0.24%
sha512	0.0039456 +-1.02%	0.0039779 +-1.09%	0.0030739 +-1.26%	0.0047586 +-0.55%	0.0294435 +-0.20%	0.056350 +-0.31%
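To put the two tables side by side, the ratio of the openssl time to the kdigest time can be computed for any cell. This is only a small sketch, assuming the table values are elapsed seconds from perf stat; speedup is a hypothetical helper name, and the three sample pairs are copied from the tables above.

```shell
#!/bin/sh
# Ratio of two elapsed times: values above 1 mean kdigest was faster,
# values below 1 mean openssl was faster.
speedup() {
    awk -v openssl="$1" -v kdigest="$2" \
        'BEGIN { printf "%.2f\n", openssl / kdigest }'
}

speedup 0.0039263 0.0010984   # md5, 512 bytes
speedup 0.0414711 0.0361400   # sha1, 32M
speedup 0.0788206 0.0840805   # sha256, 32M
```

For these samples the ratios come out at roughly 3.57, 1.15 and 0.94, so the advantage is largest for small files, while openssl is slightly ahead for sha256 at 32M.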

Benchmark System

    Linux Kernel: openSUSE Tumbleweed 6.11.0-1-default
    CPU: Intel(R) Xeon(R) CPU E3-1260L v5 @ 2.90GHz
    Thread(s) per core:   2
    Core(s) per socket:   4
    Socket(s):            1
    RAM: 64GB

Benchmark Script

for size in $((32 * 1024 * 1024)) $((16 * 1024 * 1024)) $((1024 * 1024)) \
            $((64 * 1024)) $((4 * 1024)) 512; do
    dd if=/dev/urandom of="${size}.data" bs="$size" count=1 || break
    echo "==== hashing file of size $size ===="
    for i in md5 sha1 sha224 sha256 sha384 sha512; do
        # prime cache
        cat "${size}.data" > /dev/null
        perf stat --null -r 5 --table \
             openssl "$i" "${size}.data" \
             >/dev/null 2>openssl.${size}.${i}.perf
        perf stat --null -r 5 --table \
             ~/liburing/examples/kdigest "$i" "${size}.data" \
             >/dev/null 2>kdigest.${size}.${i}.perf
    done
done

Tuesday, November 29, 2022

Btrfs Seed Devices for A/B System Updates

Sunflower seedling - Creative Commons Attribution-Share Alike 3.0 Unported

A/B system updates, as described here, provide a way for an operating system (OS) to seamlessly update from an old version to a new version, while ensuring that any failure in the upgrade process will allow for fallback to the known-working old version of the OS.

Typically A/B updates are implemented using separate old and new filesystem images, atop separate, equally sized disk partitions. However, modern copy-on-write filesystems offer some more performant and space efficient possibilities, as described below.

A/B Updates Using Btrfs Subvolume Snapshots

Linux's Btrfs filesystem provides support for snapshots at a subvolume level, which can be used for A/B system updates. A typical procedure would be:

The current OS version is running atop an old read-only subvolume
When an update is available, the old subvolume is cloned as a writeable snapshot under a newly created path within the filesystem
The upgrade is written to the new snapshot subvolume path (e.g. via btrfs receive)
The new snapshot is configured as the default subvolume, causing it to be mounted on next boot
If any issues are encountered during or post update, any default subvolume change is reverted, the old OS version is booted and the new subvolume is subsequently discarded
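The steps above could be sketched in shell roughly as follows. This is an illustration rather than a tested update tool: the subvolume paths and the ab_* function names are hypothetical, the update step itself is left as a placeholder, and commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run by default: commands are printed rather than executed.
# Set DRY_RUN=0 (as root) to really run them.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical subvolume paths, chosen for illustration only.
OLD=${OLD:-/mnt/top/os-v1}
NEW=${NEW:-/mnt/top/os-v2}

ab_prepare() {
    # Clone the old subvolume as a writeable snapshot (no -r flag).
    run btrfs subvolume snapshot "$OLD" "$NEW"
}

ab_commit() {
    # Make the updated snapshot the default subvolume for the next boot.
    run btrfs subvolume set-default "$NEW"
}

ab_rollback() {
    # Revert the default subvolume and discard the failed update.
    run btrfs subvolume set-default "$OLD"
    run btrfs subvolume delete "$NEW"
}

ab_prepare
# ... write the update into "$NEW" here ...
ab_commit
```

ab_rollback would only be called if the new version misbehaves during or after the update.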

This procedure works well; it's space efficient, allows for as many old versions to be retained as desired and also doesn't require any specific block device partitioning scheme. Given these benefits, it's unsurprising that SUSE uses a similar approach to provide Transactional Update functionality. However, there are still some minor caveats:

Currently Btrfs only provides atomic snapshots for single subvolumes, meaning that the above procedure shouldn't be used if an OS update modifies multiple subvolumes
The update procedure must be aware of the new subvolume path to target for I/O

An alternative may be to create a read-only snapshot before upgrading in-place, similar to snapper based rollback

A/B Updates Using Btrfs Seed Devices

Btrfs seed devices offer copy-on-write support at a block device level, which can also be used to provide A/B system updates, with fallback between new and old block devices instead of subvolumes.

The following seed device example requires two or more separate block devices (or partitions), with one acting as a read-only seed device and one a read-write "sprout" device.

The currently running OS version is backed by an old block device, flagged as a read-only seed via
```
btrfstune -S 1 /dev/old_block_dev
```
When an update is available, the new writeable "sprout" device is added to the Btrfs filesystem via
```
btrfs device add /dev/new_block_device /
```
The filesystem is remounted read-write
The update is written in-place, with Btrfs ensuring that all update I/O is written to the newly added block device
The new block device is flagged for the bootloader as the default boot device
If any issues are encountered during or post update, any default boot device change is reverted and the new block device can be discarded

The previous OS version remains untouched on the old device for fallback

Once the new OS version is deemed stable, the old seed device should be removed from the filesystem, which will cause dependent data from the old device to be merged into the new
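Put together, the seed device flow might look like the sketch below. It reuses the placeholder device names from the steps above, the seed_* function names are hypothetical, and it assumes btrfstune is run while the filesystem is not mounted. Commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run by default: commands are printed rather than executed.
# Set DRY_RUN=0 (as root) to really run them.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Placeholder device names, as used in the steps above.
OLD_DEV=${OLD_DEV:-/dev/old_block_dev}
NEW_DEV=${NEW_DEV:-/dev/new_block_device}

seed_flag_old() {
    # Flag the old device as a read-only seed (filesystem not mounted).
    run btrfstune -S 1 "$OLD_DEV"
}

seed_begin_update() {
    # Add the writeable sprout device, then allow writes to the filesystem.
    run btrfs device add "$NEW_DEV" /
    run mount -o remount,rw /
}

seed_finalize() {
    # Once the new version is deemed stable, drop the seed device; data
    # still needed from it is copied across to the new device.
    run btrfs device remove "$OLD_DEV" /
}

seed_flag_old
seed_begin_update
# ... apply the OS update in-place here ...
```

seed_finalize is deliberately not called at the end, since the old device is the fallback until the new version has proven itself.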

This seed device approach removes some of the constraints of the Btrfs subvolume approach, namely:

The update procedure can atomically apply changes across multiple subvolumes, with seed-device rollback safely reverting all subvolume changes made
After read-write remount, the update process can perform I/O to the running system in-place, without any specific knowledge of the seed device usage or underlying filesystem

This functionality may be attractive for Linux distributions, particularly if adding A/B update support to an existing update process with little filesystem integration. However, there remain a number of trade-offs to consider:

Seed devices are significantly less space efficient compared to snapshot based A/B updates

Each block device must have sufficient capacity to store the OS

I/O performed when the old seed device is removed from the updated filesystem is a significant overhead and is avoided with snapshot based A/B updates

Btrfs at least provides some compensation for this by verifying data checksums

Btrfs seed device support appears somewhat niche compared to regular subvolume snapshots, so it likely receives less filesystem test focus

A/B Updates Using Copy On Write Virtual Block devices

2024-10-10 update: Interestingly, since I wrote this article a couple of years ago, Android has moved from a fully provisioned, partition based A/B approach to now using device-mapper, which provides layered, block level copy-on-write A/B updates.

Conclusions

Btrfs subvolume snapshots and seed devices can both be used to provide seamless and reliable A/B system updates. Snapshot based updates offer more efficient storage and CPU resource utilization, so should likely be considered the optimal choice for implementers.

Seed device based updates are a viable alternative, particularly for multi-subvolume updates, but implementers should carefully consider the described trade-offs.

Animated gif of a sunflower seed sprouting - Creative Commons Attribution-Share Alike 4.0 International

Thanks

Linux Btrfs developers
My employer, SUSE
Wikipedia users anon and Naturenow, for publishing the image and animation under open licenses

Changelog

2024-10-10: add "A/B Updates Using Copy On Write Virtual Block devices" note

Saturday, April 21, 2018

Samsung Android Full Device Backup with TWRP

Warning

Following these instructions, correctly or incorrectly, may leave you with a completely broken or bricked device. Furthermore, flashing your device may void your warranty - Samsung uses eFuses to permanently flag occurrences of a device running non-Samsung software, such as TWRP.
I take no responsibility for what may come of using these instructions.

With the warning out of the way, I will say that I tested this process with the following environment:

Android Device: Samsung Galaxy S3 (i9300)
TWRP: 3.2.1-0
Desktop OS: openSUSE Leap 42.3

Flashing and Booting into Recovery

Download the official TWRP image for your device, and corresponding PGP signature

https://dl.twrp.me

Use gpg to verify your TWRP image
Download and install Heimdall on your Linux or Windows PC
Boot your Samsung device into Download Mode

Simultaneously hold the Volume-down + Home/Bixby + Power buttons

Using Heimdall on your desktop, flash the TWRP image to your device's recovery partition:

heimdall flash --no-reboot --RECOVERY <recovery.img>
Wait for Heimdall to output "RECOVERY upload successful"

From Download Mode, boot your Samsung device into TWRP

Simultaneously hold the Volume-up + Home/Bixby + Power buttons
If you accidentally boot into regular Android, then you'll likely have to boot into Download Mode and reflash, as regular boot restores the recovery partition to its default contents
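The gpg verification step above can be sketched as a small helper. This assumes that the TWRP signing key has already been imported into your keyring and that the signature is a detached one; verify_image and the file names in the usage note are hypothetical.

```shell
#!/bin/sh
# Verify a downloaded image against its detached PGP signature.
verify_image() {
    img=$1
    sig=$2
    if [ ! -f "$img" ] || [ ! -f "$sig" ]; then
        echo "missing image or signature file" >&2
        return 1
    fi
    # Succeeds only if the signature matches the image and the signing
    # key is present in your keyring.
    gpg --verify "$sig" "$img"
}
```

For example, verify_image twrp.img twrp.img.asc should return success before the image is passed to heimdall.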

Exposing the Device as USB Mass Storage

Unmount all partitions:

From the TWRP main menu, select Mount, then uncheck all partitions

Bring up a shell

From the TWRP main menu, select Advanced -> Terminal
adb shell could be used instead here, but the adb connection from the desktop to the device will be lost when all USB roles are disabled

Determine which block device you wish to back up

# cat /etc/fstab

In my case (i9300), all data is stored on /dev/block/mmcblk0 partitions

Check the current state of the TWRP USB gadget

# cat /sys/devices/virtual/android_usb/android0/functions
mtp,adb

Configure a read-only USB Mass Storage gadget

# echo 1 > /sys/devices/virtual/android_usb/android0/f_mass_storage/lun0/ro
# echo /dev/block/mmcblk0 > /sys/devices/virtual/android_usb/android0/f_mass_storage/lun0/file

Disable all USB roles

# echo 0 > /sys/devices/virtual/android_usb/android0/enable

Enable the Mass Storage gadget USB role

# echo mass_storage,adb > /sys/devices/virtual/android_usb/android0/functions
# echo 1 > /sys/devices/virtual/android_usb/android0/enable

If not already done, connect the device to your desktop or laptop

The attached device should appear as regular USB storage

Backup

Any Linux, Windows or macOS program capable of fully backing up a USB storage device should be usable from this point. The procedure below uses the dd command on Linux.

From your computer, determine which USB storage device to back up

ddiss@desktop:~> lsscsi
...
[2:0:0:0]    disk    SAMSUNG  File-Stor Gadget 0001  /dev/sdb

As root, start copying the data from the device

ddiss@desktop:~> sudo dd if=/dev/sdb of=/home/ddiss/samsung_backup.img bs=1M

dd will take a long time to complete, depending on the size of your device, USB connection speed, etc.
Once completed, unplug your Android device and reboot it
The image file can be compressed
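As one possible way of compressing the image, the sketch below uses gzip and then checks that the compressed copy decompresses back to identical data. compress_image is a hypothetical helper name, and any other compressor could be swapped in.

```shell
#!/bin/sh
# Compress an image alongside the original and verify the round trip.
compress_image() {
    img=$1
    gzip -c "$img" > "$img.gz" || return 1
    # Confirm that decompressing reproduces the original byte for byte.
    gzip -dc "$img.gz" | cmp -s - "$img"
}
```

For example, compress_image /home/ddiss/samsung_backup.img leaves samsung_backup.img.gz next to the original, which can then be removed once the check has succeeded.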

With the image now obtained, you could mount it on your desktop, or restore it to the device at a later date. I'll hopefully get around to writing separate posts for both in future.

Monday, January 29, 2018

Building Ceph master with C++17 support on openSUSE Leap 42.3

Ceph now requires C++17 support, which is available with modern compilers such as gcc-7. openSUSE Leap 42.3, my current OS of choice, includes gcc-7. However, it's not used by default.

Using gcc-7 for the Ceph build is a simple matter of:

> sudo zypper in gcc7-c++
> CC=gcc-7 CXX=/usr/bin/g++-7 ./do_cmake.sh ...
> cd build && make -j$(nproc)

Monday, July 3, 2017

Multipath Failover Simulation with QEMU

While working on a Ceph OSD multipath issue, I came across a helpful post from Dan Horák on how to simulate a multipath device under QEMU.

qemu-kvm ... -device virtio-scsi-pci,id=scsi \
  -drive if=none,id=hda,file=<path>,cache=none,format=raw,serial=MPIO \
  -device scsi-hd,drive=hda \
  -drive if=none,id=hdb,file=<path>,cache=none,format=raw,serial=MPIO \
  -device scsi-hd,drive=hdb

<path> should be replaced with a file or device path (the same for each)
serial= specifies the SCSI logical unit serial number

This attaches two virtual SCSI devices to the VM, both of which are backed by the same file and share the same SCSI logical unit identifier.
Once booted, the SCSI devices for each corresponding path appear as sda and sdb, which are then detected as multipath enabled and subsequently mapped as dm-0:

         Starting Device-Mapper Multipath Device Controller...
[  OK  ] Started Device-Mapper Multipath Device Controller.
...
[    1.329668] device-mapper: multipath service-time: version 0.3.0 loaded
...
rapido1:/# multipath -ll
0QEMU_QEMU_HARDDISK_MPIO dm-0 QEMU,QEMU HARDDISK
size=2.0G features='1 retain_attached_hw_handler' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=1 status=active
| `- 0:0:0:0 sda 8:0  active ready running
`-+- policy='service-time 0' prio=1 status=enabled
  `- 0:0:1:0 sdb 8:16 active ready running

QEMU additionally allows for virtual device hot(un)plug at runtime, which can be done from the QEMU monitor CLI (accessed via ctrl-a c) using the drive_del command. This can be used to trigger a multipath failover event:

rapido1:/# mkfs.xfs /dev/dm-0
meta-data=/dev/dm-0              isize=256    agcount=4, agsize=131072 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=0        finobt=0, sparse=0
data     =                       bsize=4096   blocks=524288, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
rapido1:/# mount /dev/dm-0 /mnt/
[   96.846919] XFS (dm-0): Mounting V4 Filesystem
[   96.851383] XFS (dm-0): Ending clean mount

rapido1:/# QEMU 2.6.2 monitor - type 'help' for more information
(qemu) drive_del hda
(qemu) 

rapido1:/# echo io-to-trigger-path-failure > /mnt/failover-trigger
[  190.926579] sd 0:0:0:0: [sda] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
[  190.926588] sd 0:0:0:0: [sda] tag#0 Sense Key : 0x2 [current] 
[  190.926589] sd 0:0:0:0: [sda] tag#0 ASC=0x3a ASCQ=0x0 
[  190.926590] sd 0:0:0:0: [sda] tag#0 CDB: opcode=0x28 28 00 00 00 00 02 00 00 01 00
[  190.926591] blk_update_request: I/O error, dev sda, sector 2
[  190.926597] device-mapper: multipath: Failing path 8:0.

rapido1:/# multipath -ll
0QEMU_QEMU_HARDDISK_MPIO dm-0 QEMU,QEMU HARDDISK
size=2.0G features='1 retain_attached_hw_handler' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=0 status=enabled
| `- 0:0:0:0 sda 8:0  failed faulty running
`-+- policy='service-time 0' prio=1 status=active
  `- 0:0:1:0 sdb 8:16 active ready  running

The above procedure demonstrates cable-pull simulation while the broken path is used by the mounted dm-0 device. The subsequent I/O failure triggers multipath failover to the remaining good path.

I've added this functionality to Rapido (pull-request) so that multipath failover can be performed in a couple of minutes directly from kernel source. I encourage you to give it a try for yourself!

Friday, June 9, 2017

Rapido: Quick Kernel Testing From Source (Video)

I presented a short talk at the 2017 openSUSE Conference on Linux kernel testing using Rapido.

There were many other interesting talks during the conference, all of which can be viewed on the oSC 2017 media site.
A video of my presentation is embedded below.

Many thanks to the organisers and sponsors for putting on a great event.