BlackBerry 10 – Native algorithm optimization

When looking for performance it’s important to understand what is going on at the compiler level and how the compiler is optimising our code. For example, enabling the assembly output will allow us to see what is exactly generating and have a greater understanding of why some parts of the code execute faster or the reason why it’s not performing as it should.
BlackBerry 10 uses the QCC compiler (more information here: QCC). Enabling the right flags will generate a .s file of the generated assembly:
qcc_flags2
qcc_flags3
Once we got the flags enabled we can start optimising the code. Let’s take as example a YUV2RGB converter and optimise it for BlackBerry 10 devices (used in this app http://appworld.blackberry.com/webstore/content/21198027/) . This converter can be used to read from camera preview buffers and convert it to RGB to save it as an image or show it on the screen. YUV is a format the encodes luma & chroma independently:
yuv_format
To make this example simpler, assume the following values (hardcoded from a camera preview buffer size)

int src_height = 768;
int src_width = 1024;
int src_stride = 1024;
int uvStart = 1024 * 768; // offset to the UV data (after all the Y data)

This is our first approach:

void inline renderFrame(unsigned char *src, char *dst, int rect[4], int frame, int dst_stride) {
    int i, j;
    for (i = 0; i < src_height; i++) {
        for(j = 0; j < src_width; j++) {
            int y_rpos = i * src_stride + j;
            int uv_rpos = uvStart + ((i/2) * src_stride) + (j/2) * 2;
            int wpos = i * dst_stride + j * 4;
            float y = src[y_rpos     ];
            float u = src[uv_rpos    ];
            float v = src[uv_rpos + 1];
            int r = clip((int) ((y - 16) * 1.164                     + 1.596 * (v - 128)));
            int g = clip((int) ((y - 16) * 1.164 - 0.391 * (u - 128) - 0.813 * (v - 128)));
            int b = clip((int) ((y - 16) * 1.164 + 2.018 * (u - 128)));
            dst[wpos    ] = b;
            dst[wpos + 1] = g;
            dst[wpos + 2] = r;
            dst[wpos + 3] = 0xff;
        }
    }
}

As with this YUV format a pair of Y values share the same UV values we can generate 2 pixels with just 2 Y and 1 single UV pair, generating 2 pixels per iteration.

float y0 = src[y_rpos    ];
float y1 = src[y_rpos + 1];
float u = src[uv_rpos    ];
float v = src[uv_rpos + 1];
int r0 = clip((int) ((y0 - 16) * 1.164                     + 1.596 * (v - 128)));
int g0 = clip((int) ((y0 - 16) * 1.164 - 0.391 * (u - 128) - 0.813 * (v - 128)));
int b0 = clip((int) ((y0 - 16) * 1.164 + 2.018 * (u - 128)));
int r1 = clip((int) ((y1 - 16) * 1.164                     + 1.596 * (v - 128)));
int g1 = clip((int) ((y1 - 16) * 1.164 - 0.391 * (u - 128) - 0.813 * (v - 128)));
int b1 = clip((int) ((y1 - 16) * 1.164 + 2.018 * (u - 128)));
dst[wpos    ] = b0;
dst[wpos + 1] = g0;
dst[wpos + 2] = r0;
dst[wpos + 3] = 0xff;
dst[wpos + 4] = b1;
dst[wpos + 5] = g1;
dst[wpos + 6] = r1;
dst[wpos + 7] = 0xff;

So far so good, but there are many arithmetic operations that can be shared and there is no need to do them twice

float y0 = src[y_rpos    ] - 16;
float y1 = src[y_rpos + 1] - 16;
float u = src[uv_rpos    ] - 128;
float v = src[uv_rpos + 1] - 128;
float y0v = y0 * 1.164;
float y1v = y1 * 1.164;
float chromaR = 1.596 * v;
float chromaG = -0.391 * u - 0.813 * v;
float chromaB = 2.018 * u;
int r0 = clip((int) (y0v + chromaR));
int g0 = clip((int) (y0v + chromaG));
int b0 = clip((int) (y0v + chromaB));
int r1 = clip((int) (y1v + chromaR));
int g1 = clip((int) (y1v + chromaG));
int b1 = clip((int) (y1v + chromaB));
...

and we’ve improved 14 multiplications and 20 additions/subtractions to 6 multiplications and 12 additions/substractions. Pretty sure our CPU would like that although we’re far from done. There are many values that doesn’t change in every pixel so we can move those to the external loop.

void inline renderFrame(unsigned char *src, char *dst, int rect[4], int frame, int dst_stride) {
    int i, j;
    for (i = 0; i < src_height; i++) {
        int y_rpos = i * src_stride;
        int uv_rpos = uvStart + ((i/2) * src_stride);
        int wpos = i * dst_stride;
        for(j = 0; j < src_width; j++) {
            float y0 = src[y_rpos    ] - 16;
            float y1 = src[y_rpos + 1] - 16;
            float u = src[uv_rpos    ] - 128;
            float v = src[uv_rpos + 1] - 128;
            float y0v = y0 * 1.164;
            float y1v = y1 * 1.164;
            float chromaR = 1.596 * v;
            float chromaG = -0.391 * u - 0.813 * v;
            float chromaB = 2.018 * u;
            int r0 = clip((int) (y0v + chromaR));
            int g0 = clip((int) (y0v + chromaG));
            int b0 = clip((int) (y0v + chromaB));
            int r1 = clip((int) (y1v + chromaR));
            int g1 = clip((int) (y1v + chromaG));
            int b1 = clip((int) (y1v + chromaB));
            dst[wpos    ] = b0;
            dst[wpos + 1] = g0;
            dst[wpos + 2] = r0;
            dst[wpos + 3] = 0xff;
            dst[wpos + 4] = b1;
            dst[wpos + 5] = g1;
            dst[wpos + 6] = r1;
            dst[wpos + 7] = 0xff;
            wpos += 8;
            y_rpos += 2;
            uv_rpos += 2;
        }
    }
}

As some CPU doesn’t contain hardware accelerated floating point processors, floating point operations are emulated in software. In others, plain integer operations are usually faster. We can simulate floating point arithmetics with plain integers using Fixed point arithmetics

int y0v = y0 * 298;   // 1.164 * 256
int y1v = y1 * 298;   // 1.164 * 256
int chromaR = 408 * v;                // 1.596 * 256
int chromaG = -100 * u - 208 * v;     // - 0.391 * 256, 0.813 * 256
int chromaB = 517 * u;                // 2.018 * 256
int r0 = clip((y0v + chromaR) >> 8);  // >> 8 is equivalent to dividing by 256
int g0 = clip((y0v + chromaG) >> 8);
int b0 = clip((y0v + chromaB) >> 8);
int r1 = clip((y1v + chromaR) >> 8);
int g1 = clip((y1v + chromaG) >> 8);
int b1 = clip((y1v + chromaB) >> 8);

As YUV values are always 0 <= YUV <= 255 some operations can be already recalculated in small arrays

void precalc() {
    int i;
    for(i = 0; i > 8;
        factorRV[i] = ( 408 * (i - 128)) >> 8;
        factorGU[i] = (-100 * (i - 128)) >> 8;
        factorGV[i] = (-208 * (i - 128)) >> 8;
        factorBU[i] = ( 517 * (i - 128)) >> 8;
        clipVals[i] = min(max(i - 300, 0), 255);
    }
}

We also precalculated the clipping values, so we remove the comparisons in the inner loop.
No we can use these precalc tables in our code:

int y0 = factorY[src[y_rpos    ]];
int y1 = factorY[src[y_rpos + 1]];
int chromaR = factorRV[v];
int chromaG = factorGU[u] + factorGV[v];
int chromaB = factorBU[v];
int r0 = clipVals[y0 + chromaR + 300];
int g0 = clipVals[y0 + chromaG + 300];
int b0 = clipVals[y0 + chromaB + 300];
int r1 = clipVals[y1 + chromaR + 300];
int g1 = clipVals[y1 + chromaG + 300];
int b1 = clipVals[y1 + chromaB + 300];

We introduced a 300 offset into the array indexes to avoid problems when values are negative. We can avoid that index offset by doing the following:

int *clipVals_ = &clipVals[300];
...
int r0 = clipVals_[y0 + chromaR];
int g0 = clipVals_[y0 + chromaG];
int b0 = clipVals_[y0 + chromaB];
int r1 = clipVals_[y1 + chromaR];
int g1 = clipVals_[y1 + chromaG];
int b1 = clipVals_[y1 + chromaB];

We could write pixels in one single operation instead of using 4 write operations (one per single byte). We’ll create different precalc tables to store values for the R, G and B components with shift masks and alpha transparency already in place.

for(i = 0; i < 1024; i++) {
    clipVals[i]  = min(max(i - 300, 0), 255);
    clipValsR[i] = 0xFF000000 | (min(max(i - 300, 0), 255) << 16);
    clipValsG[i] = min(max(i - 300, 0), 255) << 8;
    clipValsV[i] = min(max(i - 300, 0), 255);
}
void inline renderFrame(unsigned char *src, char *dst, int rect[4], int frame, int dst_stride) {
    int i, j;
    int *clipValsR_ = &clipValsR[300];
    int *clipValsG_ = &clipValsG[300];
    int *clipValsB_ = &clipValsB[300];
    int *dsti = (int *) dst;
    for (i = 0; i < src_height; i++) {
        int y_rpos = i * src_stride;
        int uv_rpos = uvStart + ((i/2) * src_stride);
        int wpos = i * dst_stride;
        for(j = 0; j < src_width; j++) {
            int u = src[uv_rpos    ];
            int v = src[uv_rpos + 1];
            int y0 = factorY[src[y_rpos    ]];
            int y1 = factorY[src[y_rpos + 1]];
            int chromaR = factorRV[u];
            int chromaG = factorGU[u] + factorGV[v];
            int chromaB = factorBU[u];
            dst[wpos    ] = clipValsR_[y0 + chromaR] | clipValsG_[y0 + chromaG] | clipValsB_[y0 + chromaB];
            dst[wpos + 1] = clipValsR_[y1 + chromaR] | clipValsG_[y1 + chromaG] | clipValsB_[y1 + chromaB];
            wpos += 2;
            y_rpos += 2;
            uv_rpos += 2;
        }
    }
}

Execution time
Small graph showing the execution time of the same function with different optimisations. At the end we achieved around 600% performance increase.
exec_time
There are still quite a lot of things than can be done, stay tuned for the second part of this post!

Outcome BlackBerry 10 presentation

Last thursday I did an introduction to BlackBerry10 for Android developers. I focused most on the current limitations and possibilities of the Android Runtime of BlackBerry10 and I’ve also used some slides explaining how to convert a binary .apk file to a BlackBerry10 .bar file and sign using command line tools. There are, of course, easier tools (eclipse plugin, online repackager) but those don’t really need an explanation. At the end there is also a very quick introduction to QML/Qt and native vs cascades development.
bb_android
Something we learned about the event is people is really tired after 4 days of Mobile World Congress and gets a bit lazy if it’s raining pretty heavily. Luckily the event was live-streamed and there is a recording of the whole presentation.
Find below some links of the event:
Pictures of the event at Google+
Slides (.pdf)
Video capture of the whole presentation
Please stay tuned for the next BlackBerry Group meeting in early April.
http://www.bbdevbcn.org

Presentation about BlackBerry 10

Next week during Mobile World Congress I am going to do a presentation about BlackBerry 10 together with Jorge del Casar, BlackBerry Developer Evangelist in Spain. The event is organized by GDG Barcelona and will be held in the Facultat Informàtica Barcelona
[googlemaps https://maps.google.es/maps?ie=UTF8&cid=17558281092145204566&q=Facultat+d’Inform%C3%A0tica+de+Barcelona&gl=ES&hl=en&t=m&ll=41.389415,2.113323&spn=0.005634,0.00912&z=16&iwloc=A&output=embed&w=425&h=350]
If you’re a developer it’s a good opportunity to get to know BlackBerry 10 as we will speak about both html5 and native development.
Here is the link to the event:
https://plus.google.com/u/0/events/ckc7u8j5b138o6rfas9l8hkb0hk
More information:
https://developer.blackberry.com/