Notes on DRM Format Modifiers

Over the past few weeks, I have been working for Collabora on plumbing DRM format modifier support across a number of components in the graphics stack. This post documents the work and the related consequences/implications.

the need for modifiers

Until recently, the FourCC format code of a buffer provided a nearly complete description of its contents. But as hardware design has advanced, laying out buffers in non-traditional (non-linear) fashions has become more important for performance. A given buffer layout can prove better suited to particular use-cases than others, such as tiled formats that improve locality of access and enable fast local buffering.

Alternatively, a buffer may hold compressed data that requires external metadata to decompress before the buffer is used. Just using the FourCC format code then falls short of conveying the complete information about how a buffer is placed in memory, especially when buffers are to be shared across processes or even IP blocks. Moreover, newer hardware generations may add further usage capabilities for a layout.

Intel Buffer Tiling Layouts
Memory organization per tile for two of the buffer tiling layouts supported by Intel hardware. An XMAJOR 4KB tile is stored as a 32×8 WxH array of 16-byte data (row order), while a YMAJOR 4KB tile is laid out as an 8×32 WxH array of 16-byte data (column order). CC-BY-ND from Intel Graphics PRM Vol 5: Memory Views

As an example, modern Intel hardware can use multiple layout types for performance, depending on memory access patterns. The Linear layout places data row-adjacent, making the buffer suited to scanline-order traversal. In contrast, the Y-Tiled layout splits and packs pixels in memory so that geometrically nearby pixels fall into the same cache line, reducing misses for x- and y-adjacent pixel data when sampling – but it couldn’t be used for scan-out buffers before the Skylake generation due to hardware constraints. Skylake also allows single-sampled buffer data to be compressed in hardware before being passed around to cut down bus bandwidth, with an extra auxiliary buffer containing the compression metadata. The Intel Memory Views Reference Manual describes these and more layout orders in depth.
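To make the tiling idea concrete, here is a toy sketch of the address arithmetic for an X-tiled surface, assuming the PRM's X-major geometry (4KB tiles of 512 bytes × 8 rows, row-major inside a tile, tiles laid out row-major across the surface). The helper names are mine; real drivers use far more elaborate, hardware-validated versions of this.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset of pixel byte (x, y) in a plain linear layout with the
 * given stride (bytes per row). */
static uint64_t linear_offset(uint32_t x, uint32_t y, uint32_t stride)
{
    return (uint64_t)y * stride + x;
}

/* Toy byte-offset computation for an X-tiled surface: 4KB tiles of
 * 512 bytes x 8 rows, row-major within a tile, tiles in row-major
 * order across the surface. `stride` must be a multiple of 512.
 * Illustrative only, not driver code. */
static uint64_t xtile_offset(uint32_t x, uint32_t y, uint32_t stride)
{
    uint32_t tiles_per_row = stride / 512;
    uint32_t tile_col = x / 512, tile_row = y / 8;
    uint32_t tile_index = tile_row * tiles_per_row + tile_col;
    return (uint64_t)tile_index * 4096 + (y % 8) * 512 + (x % 512);
}
```

Note the locality effect the text describes: within a tile, y-adjacent bytes sit only 512 bytes apart instead of a full stride apart, so a small 2D neighbourhood stays within one 4KB tile.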

Besides access-pattern related optimizations, a similar need for layout communication arises when propagating buffers through the multimedia stack. The msm vidc video decoder (present on some Qualcomm SoCs) arranges decoded frames in a tiled variant of the NV12 FourCC format, NV12_64Z32. With a modifier associated with the buffer, the GPU driver can use that information to program the texture samplers accordingly – as is the case with a3xx, using the hardware path and avoiding an explicit detiling pass.

Buffer Layout for NV12 Format with Tiled Modifier
The data is laid out as 64×32 WxH tiles similar to NV12, but the tiles appear in a zigzag order instead of linear.

To ease this situation, an extra 64-bit ‘modifier’ field was added to DRM’s kernel modesetting API to carry this vendor-specific buffer layout information. The modifier codes are defined in the drm_fourcc.h header file.
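The encoding drm_fourcc.h uses is simple: an 8-bit vendor ID in the top byte and a vendor-specific value in the low 56 bits, packed by the fourcc_mod_code() macro. A sketch of the same arithmetic as a plain C function (the vendor constants below mirror the header; e.g. I915_FORMAT_MOD_X_TILED is fourcc_mod_code(INTEL, 1)):

```c
#include <assert.h>
#include <stdint.h>

/* Vendor IDs as defined in drm_fourcc.h. */
#define DRM_FORMAT_MOD_VENDOR_NONE  0x00
#define DRM_FORMAT_MOD_VENDOR_INTEL 0x01

/* Mirrors the fourcc_mod_code() macro from drm_fourcc.h: 8-bit vendor
 * ID in the top byte, vendor-specific value in the low 56 bits. */
static uint64_t fourcc_mod_code(uint64_t vendor, uint64_t val)
{
    return (vendor << 56) | (val & 0x00ffffffffffffffULL);
}
```

A linear buffer is simply vendor NONE with value 0, i.e. modifier 0.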

current state of affairs across components

With the AddFB2 ioctl, DRM lets userspace attach a modifier to the buffers it imports. On the userspace side, libdrm support is also planned to allow probing kernel support and the scanout-ability of a given buffer layout, now representable as a fourcc+modifier combination. A modifier-aware GetPlane ioctl analog, ‘GetPlane2’, is under consideration. With recent patches from Intel, GBM becomes capable of allocating modifier-abiding surfaces too, for Wayland compositors and the X11 server to render to.

Collabora and others recently published an EGL extension, EGL_EXT_image_dma_buf_import_modifiers, which makes it possible to create EGLImages out of dmabuf buffers carrying a format modifier via eglCreateImageKHR; these can then be bound as external textures for the hardware to render into and sample from. It also introduces format and modifier query entrypoints that ease buffer-constraint negotiation: knowing the GL capabilities beforehand avoids the trial-and-error guesswork and bailout scenarios compositors can otherwise run into.
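One mechanical detail worth noting: EGL attributes are 32-bit EGLints, so the 64-bit modifier is passed as two halves via EGL_DMA_BUF_PLANE0_MODIFIER_LO_EXT and EGL_DMA_BUF_PLANE0_MODIFIER_HI_EXT. A minimal sketch of the split (helper names are mine), with the corresponding attrib-list shape shown in comments since it needs a live EGL display to run:

```c
#include <assert.h>
#include <stdint.h>

/* The 64-bit modifier is split into 32-bit low/high halves for the
 * EGL_DMA_BUF_PLANE0_MODIFIER_{LO,HI}_EXT attributes. */
static uint32_t modifier_lo(uint64_t modifier) { return (uint32_t)(modifier & 0xffffffffu); }
static uint32_t modifier_hi(uint64_t modifier) { return (uint32_t)(modifier >> 32); }

/* The attrib list for eglCreateImageKHR(dpy, EGL_NO_CONTEXT,
 * EGL_LINUX_DMA_BUF_EXT, NULL, attribs) would then look roughly like:
 *
 *   EGLint attribs[] = {
 *       EGL_WIDTH, width,
 *       EGL_HEIGHT, height,
 *       EGL_LINUX_DRM_FOURCC_EXT, fourcc,
 *       EGL_DMA_BUF_PLANE0_FD_EXT, dmabuf_fd,
 *       EGL_DMA_BUF_PLANE0_OFFSET_EXT, 0,
 *       EGL_DMA_BUF_PLANE0_PITCH_EXT, stride,
 *       EGL_DMA_BUF_PLANE0_MODIFIER_LO_EXT, modifier_lo(modifier),
 *       EGL_DMA_BUF_PLANE0_MODIFIER_HI_EXT, modifier_hi(modifier),
 *       EGL_NONE,
 *   };
 */
```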

Mesa provides an implementation for the extension, along with driver support on Skylake to import and sample compressed color buffers. The patchset discussions can be found at mesa-dev: ver1 ver2.

The Wayland protocol has been extended to allow the compositor to advertise platform supported format modifiers to its client applications, with Weston supporting this.

With the full end-to-end client-to-display pipeline now supporting tiled and compressed modes, users can transparently benefit from the reduced memory bandwidth requirements.

further reads

Some GPUs find the 64-bit modifier too restrictive and require more storage to convey layout and related metadata. AMDGPU associates 256 bytes of information with each texture buffer to describe its layout.

To standardize buffer allocation and cross-platform sharing, the Unix Device Memory Allocation project is being discussed.

Thanks to Google for sponsoring a large part of this work as part of ChromeOS development.


Freedreno on Android: Current Status

Freedreno now supports dma-buf passing instead of GEM flinks on kitkat-x86, but with a catch: enabling freedreno’s userspace bo cache in libdrm (bo_reuse=1) results in multiple GEM_SUBMIT calls on the same bo, and the GPU hangchecks after some activity. HACK: setting bo_reuse=0 runs Android smoothly.

While fighting some userspace GEM leaks that showed up with the bo cache disabled, I started looking into adding a minigbm (used by Chrome OS) backend for msm. The idea was to let GBM deal with the hardware specifics and keep gralloc platform-independent. Another priority was doing away with GEM names in favour of fd passing. With bo_reuse=1, fd_bo_del() calls would segfault during fd_cache_cleanup(), causing SurfaceFlinger to restart.

After watching `/sys/kernel/debug/dri/0/gem`, counting allocs and frees, and scratching my head for a while, I discovered that the GEM leaks lay in AOSP’s default Home launcher; moving to CyanogenMod’s Trebuchet launcher made the leaks go away.

Making gralloc_drm switch to dma-buf fds, I also got to learn how Mesa’s EGL implementation works, and wrote a couple of patches ([1] and [2]) implementing the __DRI_IMAGE_LOADER DRI extension, which helps EGL allocate textures without dealing in flink names. These let a __DRIimage be tied to an ANativeWindowBuffer and used for textures.

One possible reason for the fd_cache_cleanup() segfault was an old (3.19) kernel missing some related fix, so we decided to switch to 4.2 instead. I had to make some changes to resurrect the Android logger driver, since it was deprecated after KitKat + 3.19. I learned about the kernel’s asynchronous IO support, and forward-ported logger to be compatible with the kiocb API changes. This didn’t help the segfault, but fixed a screen flicker issue on the sides.

I traced the bo life-cycles through drm_{prime,gem}.c and libdrm, and finally found the problem – importing a bo already present in the cache would corrupt the cache bucket, causing any following bo_del()s to crash. The fix turned out to be simple to write, but hard to track down.

It’s been an incredible learning experience where I got to explore a great number of modules up close, that come together to interact and form the display subsystem. I’m excited to stick around, improve support and dig up more items to work on. Thanks Rob, Emil, #freedreno, #android-x86 and #linaro for all the support!

Up next:
I’m looking into the `*ERROR* handle X at index Y already on submit list` errors and hangchecks that pop up with bo_reuse=1. I also need to figure out how to get some games running (dexopt won’t let me install .apks :( ).

Initial Support for Freedreno on Android

I now have Android-x86 running Freedreno on my IFC6410 \o/!

Freedreno on Android

The setup:

on an IFC6410 (A320 GPU) with an ancient CRT

The gralloc module carries two responsibilities: providing an interface to /dev/graphics/fb0 for creating FrameBufferNativeWindow objects, and handling buffer requests that come in through a singleton system-wide GraphicBufferAllocator object. Each request is stamped with a usage flag to help gralloc decide how to serve it.

Current setup uses the default eglSwapBuffers() based HWC, i.e. all-GPU composition for now:

$ dumpsys SurfaceFlinger
numHwLayers=3, flags=00000000
type    |  handle  |   hints  |   flags  | tr | blend |  format  |          source crop            |           frame           name
GLES | 2a2a31e0 | 00000000 | 00000000 | 00 | 00100 | 00000005 | [      0,     13,   1024,    744] | [    0,   13, 1024,  744]
GLES | 2a060ad8 | 00000000 | 00000000 | 00 | 00105 | 00000005 | [      0,      0,   1024,     13] | [    0,    0, 1024,   13] StatusBar
GLES | 2a0614d8 | 00000000 | 00000000 | 00 | 00105 | 00000005 | [      0,      0,   1024,     24] | [    0,  744, 1024,  768] NavigationBar

Shortlog of issues faced:

I started with getting bare kitkat-x86 up on the IFC6410 with an ARM build setup. Fast-forward through multiple tries figuring out the build system and setting up the board-specific init scripts, and I added Mesa to the build. The default HWC doesn’t implement blank(), which took some time to realize while fixing a segfault. With all the parts in place, I had the boot logo, though it ran through two boot flows due to a race condition. Disabling GPU authentication was another hack that needed to go in.

Finally, Skia caused trouble while starting the UI. This turned out to be a simple fix in drm_gralloc, but had me looking everywhere through the sources, mistaking it for a NEON-specific issue in Skia.


Currently, Android only renders a few windows before SurfaceFlinger crashes in gralloc::unregisterBuffer() and restarts. Also, dexopt dies, causing app installation to fail, so there aren’t many apps to play around with.


Up next:
  • Add a KMS based Hardware Composer.
  • GBM for drm_gralloc.

Here is my trello board:

Thanks Rob, Emil, #freedreno, #android-x86 and #linaro!

Android-x86 on IFC6410

Documenting how to get Android-x86 up on an IFC6410.


0. Install repo.
1. Get the source. Shallow syncing to save space (full source >30G):

mkdir android-x86
cd android-x86
repo init --depth=1 -u <manifest-url> -b kitkat-x86 -g default,arm,pdk,-darwin
repo sync -j16

2. Add local_manifests with drm_gralloc and mesa at the right location.


Compiling the kernel:

EDIT: apply the kernel-* patches over 4.2 (integration-linux-qcomlt) to get logger and disable GPU auth.

git clone <kernel-repo-url>
cd kernel
git checkout -t origin/integration-linux-qcomlt
make ARCH=arm CROSS_COMPILE=~/devel/android-x86/prebuilts/gcc/linux-x86/arm/arm-eabi-4.6/bin/arm-eabi- qcom_defconfig
make ARCH=arm CROSS_COMPILE=~/devel/android-x86/prebuilts/gcc/linux-x86/arm/arm-eabi-4.6/bin/arm-eabi- -j16 zImage dtbs

The zImage needs to have a binary blob prepended to it to boot with the LK bootloader on IFC6410.

cat fixup.bin arch/arm/boot/zImage arch/arm/boot/dts/qcom-apq8064-ifc6410.dtb > zImage-dtb

Building Android-x86 requires JDK (see build requirements). I used JDK 7 on openSUSE, and needed a bunch of patches to get a successful build:

Single Patch:

This may be helpful when applying the first patch, since repo doesn’t support diff style patching.

EDIT: use repo diff instead. The repo-discuss thread (!topic/repo-discuss/43juvD1qGIQ) should work for applying it.

The Qualcomm specific init scripts (not packaged with Android-x86) are available here.

git clone <qcom-android-repo-url> ~/devel/qcom-android

cd android-x86/system/core

git apply ~/devel/qcom-android/android-x86-avoid-parentheses-from-fdt.patch

EDIT: ^ repo diff above contains these changes.

Building the source:

cd android-x86
source build/
lunch mini_armv7a_neon-userdebug
make -j8 TARGET_PREBUILT_KERNEL=<path-to-zImage-dtb> libGLES_mesa gralloc.drm

This prepares the necessary images at `android-x86/out/target/product/armv7-a-neon`.

Preparing boot.img:

Extract the qualcomm-specific init scripts (for fstab partitions) into `android-x86/out/target/product/armv7-a-neon/root` for the ramdisk. These are taken from Linaro’s android-ifc6410.

cd android-x86/out/target/product/armv7-a-neon/root
mv ~/devel/qcom-android/init.* .
mv ~/devel/qcom-android/fstab.* .

# prepare initrd
find . | cpio -o -H newc | gzip > ../qcom-ramdisk.img

# make boot.img (= zImage + ramdisk)
# from armv7-a-neon:
abootimg --create qcom-boot.img -k <path-to-zImage-dtb> -r qcom-ramdisk.img -f ~/devel/qcom-android/bootimg.cfg

Preparing system_ext4.img

EDIT: run the following from out/target/product/armv7-a-neon/.

The system.img created isn’t ext4 formatted (shows as dBase IV DBT!), and init refuses to mount it as /system : (. Repacking:

dd if=/dev/zero of=system_ext4.img bs=4K count=100000  # large enough
mkfs.ext4 system_ext4.img
tune2fs -c0 -i0 system_ext4.img
mkdir system-ext4
mount -o loop system_ext4.img system-ext4/
cp -v -r -p system/* system-ext4/
umount system-ext4/



Flashing images:

Boot IFC6410 into fastboot mode by holding SW4 during power up.

fastboot flash boot qcom-boot.img
fastboot flash system system_ext4.img
fastboot continue

Reboot with a UART cable attached to get serial console.

Freedreno on Android – Overview

I’m working on adding Freedreno support to Android this summer with the Foundation! This post documents the technical specifics of what I’ll be doing.

Android abstracts its hardware interfaces behind device-specific Hardware Abstraction Layers (HALs) which can be customized by vendors. The compositor, called SurfaceFlinger, uses the Hardware Composer HAL to:
1) decide whether a layer should be processed through OpenGL/the GPU (HWC_FRAMEBUFFER) or the SoC display controller (HWC_OVERLAY) during prepare(),
2) handle VSYNCs through vsync(), and
3) select displays and provide modesetting.
The OpenGL composition pathway requires libEGL and libGLES to be present. In the absence of a HWC, all composition uses this pathway.

Buffer allocations happen through the gralloc HAL, which uses an in-kernel memory manager to back its alloc() and free() calls, allocating suitable buffers depending on the requested usage type. HWC and gralloc API documentation is available in hwcomposer.h and gralloc.h.
A KMS-based HWC would reduce GPU dependency and improve performance by handling composition through the display controller.
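To illustrate how those usage flags steer an allocation decision, here is a toy sketch in plain C. The flag values and the policy are made up for illustration; the real flags and their values live in Android's gralloc.h, and a real gralloc consults the hardware's actual constraints.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins for gralloc usage flags (values are made up;
 * see Android's gralloc.h for the real ones). */
#define USAGE_HW_FB      (1u << 0) /* scanned out by the display controller */
#define USAGE_HW_TEXTURE (1u << 1) /* sampled by the GPU */
#define USAGE_SW_READ    (1u << 2) /* mapped and read by the CPU */

enum layout { LAYOUT_LINEAR, LAYOUT_TILED };

/* Toy policy: CPU access and scanout want linear buffers; GPU-only
 * textures can take a tiled layout. */
static enum layout pick_layout(uint32_t usage)
{
    if (usage & (USAGE_SW_READ | USAGE_HW_FB))
        return LAYOUT_LINEAR;
    if (usage & USAGE_HW_TEXTURE)
        return LAYOUT_TILED;
    return LAYOUT_LINEAR;
}
```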

This project targets providing a functional Android graphics path using DRM/KMS based gralloc and HWC with atomic pageflips using Freedreno/Mesa EGL/GLES running on upstream Android or Android-x86 on an IFC6410. A stretch goal is to have Freedreno support with an Android distro (CyanogenMod).

Implementation Plans:
I am starting with testing and fixing the existing bits, then proceeding to interface the HWC with the KMS APIs to use the SoC display controller.

The libEGL and libGLES requirements would be provided using Mesa, similar to Android-x86. drm_gralloc support for Freedreno has been developed, but remains largely untested. The existing reference HWC implementation just uses eglSwapBuffers (i.e. GPU composition).

The initial task is assembling these untested parts to run on the IFC6410. After fixing the issues discovered, we would have Freedreno running on Android, but *without* any modesetting support in place.

The project would then proceed to implementing the HWC with the KMS APIs. A reference implementation exists using userspace fences (sw_sync), but atomic modesetting support in the MSM kernel can provide a way to do away with these.

Android Graphics Stack Requirements
Freedreno drm_gralloc in Android-x86

Porting UEFI to BeagleBoneBlack: Technical Details I

I’m adding a BeagleBoneBlack port to the Tianocore/EDK2 UEFI implementation. This post details the implementation specifics of the port so far.

About the hardware:
BeagleBoneBlack is a low-cost embedded board built around an ARM-based TI AM335x SoC. It supports Linux, with Android, Ubuntu and Ångström ports already available. It comes pre-loaded with MLO and U-Boot images on its eMMC, which can be flashed with custom binaries. Bootup can also be done from a partitioned sdcard, or by transferring binaries over UART (ymodem) or USB (TFTP). The boot flow is presented here.

The Tianocore Project / Build System

The EDK2 framework provides an implementation of the UEFI specifications. It has its own customizable Python-based build system driven by configuration details provided through build meta-files. The build setup is described in this document.

(TL;DR: the build tool parses the INF, DSC and DEC files of each package, which describe its dependencies, its exports, and the back-end library implementations it uses. This makes EDK2 highly modular, supporting all kinds of hardware platforms. It generates a Firmware Volume image for each section in the Flash Description File (FDF) and packs them into a flash binary addressed as specified in the FDF. The DSC specifies which library class in code maps to which implementation, and the INF records a module’s exports and imports; if these don’t match, the build simply fails.)
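To give a feel for what those meta-files look like, here is an illustrative sketch of a library module INF and the DSC line that selects it. All names and paths here are made up for illustration; real EDK2 INFs also carry fields like FILE_GUID and VERSION_STRING.

```
# Illustrative module INF (names are made up):
[Defines]
  BASE_NAME      = MySerialPortLib
  MODULE_TYPE    = BASE
  LIBRARY_CLASS  = SerialPortLib    # the library class this instance implements

[Sources]
  SerialPortLib.c

[Packages]
  MdePkg/MdePkg.dec                 # declares the SerialPortLib class

# ...and the platform DSC picks which instance backs the class:
# [LibraryClasses]
#   SerialPortLib|MyPlatformPkg/Library/MySerialPortLib/MySerialPortLib.inf
```

If a module imports a class that no INF in the DSC provides, the build fails exactly as described above.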


I started out with an attempt to write a bare-metal binary that prints a few letters over UART, to get the hang of low-level development. Here‘s a great guide to bare-metal basics on ARM. All the required hardware has to be initialized in the binary before use, and running C requires setting up an execution environment that provides a stack and handles the placement of segments in memory. Since U-Boot already handles that in its SPL phase, I wrote a standalone application that could be called by U-Boot instead.

The BeagleBoneBlackPkg is derived from the ArmPlatformPkg. I began by echoing the “second stage” steps mentioned here – implementing the available libraries and performing platform-specific tasks – as I intended to take over boot from U-Boot/MLO. This also spared me the IRQ and memory initializations.

I’m using the PrePeiCore (and not Sec) module’s entry point to load the image. It builds the PPIs necessary to pass control over to PEI and calls PeiMain.

Running the FD: The build generates an .Fd image that is used to boot the device. The MLO binary I’m using is built to look for and launch a file named ‘u-boot.img’ on the MMC (there’s a CONFIG_ macro somewhere in U-Boot to change this), so I just rename the FD to u-boot.img before flashing it.

UEFI over BeagleBone Black: Notes

The Tianocore/EDK2 project provides an opensource implementation of the UEFI specification. It has its own Python-scripted build system that supports configuring the build parameters on the go using build metadata files. These files decide which library instances a package requires, which instance implementation is to be used, what interfaces it exports, the compiler specifications for the package, and the generated image’s flash layout.

I am working to bring up UEFI support on a BeagleBone Black. Currently, I am using U-Boot’s SPL to call the UEFI image (by placing the generated .Fd on an MMC as “u-boot.img”), which, in turn, will provide the UEFI console and kernel loading functionality.

Since the SPL does the memory, stack and IRQ initialization, the SEC/PEI phases have little work to do. Given the BBB’s SoC (an AM335x), all multicore and AArch64 code can be safely removed from the package. The UART is 16550-compatible, so it can be written to by implementing SerialPortLib accordingly. Console services become available only after the EFI_BOOT_SERVICES table has been populated, which requires completion of the DXE phase.

A U-Boot Independent Standalone Application

U-Boot allows you to load your own applications from its console. The application already has the hardware interfaces available for use (U-Boot has initialized them), so not everything needs to be brought up from scratch.

It comes with a sample hello_world program at u-boot/examples/standalone/hello_world.c, which prints to the console. It depends on U-Boot interfaces, but by tracing back through the source code, it can easily be rewritten to have nothing to do with the U-Boot API.

In the end, hello_world.c:printf()’s job is to write the characters to UART’s address. Implementing this on a BeagleBone Black is pretty easy:

The ARM AM335x TRM lists the address offsets of all registers available on the processor. UART0_BASE is defined at 0x44E09000. Memory-mapped registers need to be accessed through volatile pointers to prevent the compiler from optimizing the accesses away.

Here’s the code:

/* a simple independent U-Boot loadable application */
/* use U-Boot's example makefile to compile */
#define UART0_BASE 0x44E09000
#define REG(x) (*((volatile unsigned int *)(x)))

void main(void)
{
	int c = 'x';
	REG(UART0_BASE + 0) = c;	/* write to UART0's transmit register */
}

Compile it similarly to U-Boot’s examples to get a .bin.

To execute it, you can either: 1) copy the SREC over serial, 2) set up TFTP, or 3) put it on an sdcard.

The load_address below is an env var specifying where the application will be loaded; it can be changed when building U-Boot. For the current build, find it from the console using:

U-Boot# printenv

U-Boot# fatload mmc 0 <load_address> <application.bin>

The application’s entry-point address can be found in the objdump output. See this if you have larger applications.
Launching it:

U-Boot# go <entry_point_address>

If you plan to write a purely standalone binary, you are required to initialize the hardware manually and provide a functioning C execution environment. That means handling the placement and relocation of the text and data segments, allocation and zeroing of the BSS segment, and placement of the stack. See this and this.
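One of those C-environment chores can be shown in isolation: clearing the BSS so uninitialized globals start out as zero. In a real startup file the start/end addresses come from linker-script symbols (commonly named something like __bss_start__ / __bss_end__); here they are plain parameters so the loop itself is testable.

```c
#include <assert.h>

/* Zero the BSS region [start, end). In a bare-metal startup routine
 * this runs before main(), with the bounds supplied by the linker
 * script; here they are parameters for illustration. */
static void zero_bss(unsigned char *start, unsigned char *end)
{
    while (start < end)
        *start++ = 0;
}
```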

Transferring a .bin from openSUSE to U-Boot, or, Rightly Configuring TFTP on openSUSE

After spending a *lot* of time figuring out why I could not transfer a standalone binary to u-boot running on my BeagleBone Black, I finally discovered it was a firewall issue. This post is to save anyone in the future from suffering the same nightmare as I just went through on openSUSE 13.1.

The Problem:

I needed to put a .bin on my BBB which has U-Boot. The available options are:

Transferring the S-Record

SREC is a hex encoding of the binary data generated on compilation. To load it, U-Boot provides the `loads` command at its console: you just pass the ASCII-hex data from the .srec to the serial console (see this). The problem is that the sending speed must be acceptable to the U-Boot console; it gave me a `data abort` message and my board reset.
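For reference, each S-record line carries its own checksum: the ones' complement of the low byte of the sum of the count, address and data bytes. A small sketch of that arithmetic (function name is mine) can be handy for sanity-checking an .srec line before blaming the transfer:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Motorola S-record checksum: ones' complement of the low byte of the
 * sum of the count byte, the address bytes and the data bytes. The
 * address is passed as 32 bits here; records with shorter address
 * fields just contribute fewer (non-zero) bytes to the sum. */
static uint8_t srec_checksum(uint8_t count, uint32_t addr,
                             const uint8_t *data, size_t len)
{
    uint32_t sum = count;
    sum += (addr >> 24) & 0xff;
    sum += (addr >> 16) & 0xff;
    sum += (addr >> 8) & 0xff;
    sum += addr & 0xff;
    for (size_t i = 0; i < len; i++)
        sum += data[i];
    return (uint8_t)(~sum);
}
```

The classic header record "S00600004844521B" (count 0x06, address 0x0000, data "HDR") checks out with checksum 0x1B.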

Using TFTP

The better option: TFTP. Set up static IPs for the host and the board (set the env vars ipaddr and serverip in U-Boot) and call tftp. It gave me this:

U-Boot# tftp
link up on port 0, speed 100, full duplex
Using cpsw device
TFTP from server; our IP address is
Filename 'hello_world.bin'.
Load address: 0x80200000
Loading: T T T T T T T T T T T T T T T 

(T = Timeout.)


TFTP uses UDP port 69 for transfers. I needed to explicitly check “Open port in firewall” in the TFTP server configuration in YaST, and add port 69 under Firewall->Allowed Services->UDP Ports.


X-Loader / MLO With a Custom Payload

X-Loader (MLO; u-boot/common/spl/spl.c), responsible for eMMC initialization and executing the U-Boot binary, first parses the image header (u-boot/common/spl/spl.c:spl_parse_image_header), which effectively does this:

if (magic number of header == IH_MAGIC) {
  set spl_image to the detected image
} else {
  fill in spl_image assuming u-boot.bin
}
call spl_image->entry_point()

IH_MAGIC is set to 0x27051956, the default magic number used by mkimage when creating an image. Such an image can also be launched from the u-boot command line. By default, the SPL assumes a `uImage` payload and, if one is not found, tries to launch u-boot.
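A sketch of the check itself: the legacy uImage header stores its magic big-endian in the first four bytes, so detecting a uImage amounts to a byte-order-safe compare against IH_MAGIC (helper names below are mine; the real logic lives in spl_parse_image_header()).

```c
#include <assert.h>
#include <stdint.h>

/* The legacy uImage magic, as defined in U-Boot's image.h. */
#define IH_MAGIC 0x27051956u

/* Read a big-endian 32-bit value, as stored in the uImage header. */
static uint32_t be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | p[3];
}

/* Does this buffer start with a legacy uImage header? */
static int is_uimage(const uint8_t *header)
{
    return be32(header) == IH_MAGIC;
}
```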