BlackBerry 10 – Native algorithm optimization

When looking for performance it’s important to understand what is going on at the compiler level and how the compiler is optimising our code. For example, enabling the assembly output will allow us to see what is exactly generating and have a greater understanding of why some parts of the code execute faster or the reason why it’s not performing as it should.
BlackBerry 10 uses the QCC compiler (more information here: QCC). Enabling the right flags will generate a .s file of the generated assembly:
qcc_flags2
qcc_flags3
Once we got the flags enabled we can start optimising the code. Let’s take as example a YUV2RGB converter and optimise it for BlackBerry 10 devices (used in this app http://appworld.blackberry.com/webstore/content/21198027/) . This converter can be used to read from camera preview buffers and convert it to RGB to save it as an image or show it on the screen. YUV is a format the encodes luma & chroma independently:
yuv_format
To make this example simpler, assume the following values (hardcoded from a camera preview buffer size)

int src_height = 768;
int src_width = 1024;
int src_stride = 1024;
int uvStart = 1024 * 768; // offset to the UV data (after all the Y data)

This is our first approach:

void inline renderFrame(unsigned char *src, char *dst, int rect[4], int frame, int dst_stride) {
    int i, j;
    for (i = 0; i < src_height; i++) {
        for(j = 0; j < src_width; j++) {
            int y_rpos = i * src_stride + j;
            int uv_rpos = uvStart + ((i/2) * src_stride) + (j/2) * 2;
            int wpos = i * dst_stride + j * 4;
            float y = src[y_rpos     ];
            float u = src[uv_rpos    ];
            float v = src[uv_rpos + 1];
            int r = clip((int) ((y - 16) * 1.164                     + 1.596 * (v - 128)));
            int g = clip((int) ((y - 16) * 1.164 - 0.391 * (u - 128) - 0.813 * (v - 128)));
            int b = clip((int) ((y - 16) * 1.164 + 2.018 * (u - 128)));
            dst[wpos    ] = b;
            dst[wpos + 1] = g;
            dst[wpos + 2] = r;
            dst[wpos + 3] = 0xff;
        }
    }
}

As with this YUV format a pair of Y values share the same UV values we can generate 2 pixels with just 2 Y and 1 single UV pair, generating 2 pixels per iteration.

float y0 = src[y_rpos    ];
float y1 = src[y_rpos + 1];
float u = src[uv_rpos    ];
float v = src[uv_rpos + 1];
int r0 = clip((int) ((y0 - 16) * 1.164                     + 1.596 * (v - 128)));
int g0 = clip((int) ((y0 - 16) * 1.164 - 0.391 * (u - 128) - 0.813 * (v - 128)));
int b0 = clip((int) ((y0 - 16) * 1.164 + 2.018 * (u - 128)));
int r1 = clip((int) ((y1 - 16) * 1.164                     + 1.596 * (v - 128)));
int g1 = clip((int) ((y1 - 16) * 1.164 - 0.391 * (u - 128) - 0.813 * (v - 128)));
int b1 = clip((int) ((y1 - 16) * 1.164 + 2.018 * (u - 128)));
dst[wpos    ] = b0;
dst[wpos + 1] = g0;
dst[wpos + 2] = r0;
dst[wpos + 3] = 0xff;
dst[wpos + 4] = b1;
dst[wpos + 5] = g1;
dst[wpos + 6] = r1;
dst[wpos + 7] = 0xff;

So far so good, but there are many arithmetic operations that can be shared and there is no need to do them twice

float y0 = src[y_rpos    ] - 16;
float y1 = src[y_rpos + 1] - 16;
float u = src[uv_rpos    ] - 128;
float v = src[uv_rpos + 1] - 128;
float y0v = y0 * 1.164;
float y1v = y1 * 1.164;
float chromaR = 1.596 * v;
float chromaG = -0.391 * u - 0.813 * v;
float chromaB = 2.018 * u;
int r0 = clip((int) (y0v + chromaR));
int g0 = clip((int) (y0v + chromaG));
int b0 = clip((int) (y0v + chromaB));
int r1 = clip((int) (y1v + chromaR));
int g1 = clip((int) (y1v + chromaG));
int b1 = clip((int) (y1v + chromaB));
...

and we’ve improved 14 multiplications and 20 additions/subtractions to 6 multiplications and 12 additions/substractions. Pretty sure our CPU would like that although we’re far from done. There are many values that doesn’t change in every pixel so we can move those to the external loop.

void inline renderFrame(unsigned char *src, char *dst, int rect[4], int frame, int dst_stride) {
    int i, j;
    for (i = 0; i < src_height; i++) {
        int y_rpos = i * src_stride;
        int uv_rpos = uvStart + ((i/2) * src_stride);
        int wpos = i * dst_stride;
        for(j = 0; j < src_width; j++) {
            float y0 = src[y_rpos    ] - 16;
            float y1 = src[y_rpos + 1] - 16;
            float u = src[uv_rpos    ] - 128;
            float v = src[uv_rpos + 1] - 128;
            float y0v = y0 * 1.164;
            float y1v = y1 * 1.164;
            float chromaR = 1.596 * v;
            float chromaG = -0.391 * u - 0.813 * v;
            float chromaB = 2.018 * u;
            int r0 = clip((int) (y0v + chromaR));
            int g0 = clip((int) (y0v + chromaG));
            int b0 = clip((int) (y0v + chromaB));
            int r1 = clip((int) (y1v + chromaR));
            int g1 = clip((int) (y1v + chromaG));
            int b1 = clip((int) (y1v + chromaB));
            dst[wpos    ] = b0;
            dst[wpos + 1] = g0;
            dst[wpos + 2] = r0;
            dst[wpos + 3] = 0xff;
            dst[wpos + 4] = b1;
            dst[wpos + 5] = g1;
            dst[wpos + 6] = r1;
            dst[wpos + 7] = 0xff;
            wpos += 8;
            y_rpos += 2;
            uv_rpos += 2;
        }
    }
}

As some CPU doesn’t contain hardware accelerated floating point processors, floating point operations are emulated in software. In others, plain integer operations are usually faster. We can simulate floating point arithmetics with plain integers using Fixed point arithmetics

int y0v = y0 * 298;   // 1.164 * 256
int y1v = y1 * 298;   // 1.164 * 256
int chromaR = 408 * v;                // 1.596 * 256
int chromaG = -100 * u - 208 * v;     // - 0.391 * 256, 0.813 * 256
int chromaB = 517 * u;                // 2.018 * 256
int r0 = clip((y0v + chromaR) >> 8);  // >> 8 is equivalent to dividing by 256
int g0 = clip((y0v + chromaG) >> 8);
int b0 = clip((y0v + chromaB) >> 8);
int r1 = clip((y1v + chromaR) >> 8);
int g1 = clip((y1v + chromaG) >> 8);
int b1 = clip((y1v + chromaB) >> 8);

As YUV values are always 0 <= YUV <= 255 some operations can be already recalculated in small arrays

void precalc() {
    int i;
    for(i = 0; i > 8;
        factorRV[i] = ( 408 * (i - 128)) >> 8;
        factorGU[i] = (-100 * (i - 128)) >> 8;
        factorGV[i] = (-208 * (i - 128)) >> 8;
        factorBU[i] = ( 517 * (i - 128)) >> 8;
        clipVals[i] = min(max(i - 300, 0), 255);
    }
}

We also precalculated the clipping values, so we remove the comparisons in the inner loop.
No we can use these precalc tables in our code:

int y0 = factorY[src[y_rpos    ]];
int y1 = factorY[src[y_rpos + 1]];
int chromaR = factorRV[v];
int chromaG = factorGU[u] + factorGV[v];
int chromaB = factorBU[v];
int r0 = clipVals[y0 + chromaR + 300];
int g0 = clipVals[y0 + chromaG + 300];
int b0 = clipVals[y0 + chromaB + 300];
int r1 = clipVals[y1 + chromaR + 300];
int g1 = clipVals[y1 + chromaG + 300];
int b1 = clipVals[y1 + chromaB + 300];

We introduced a 300 offset into the array indexes to avoid problems when values are negative. We can avoid that index offset by doing the following:

int *clipVals_ = &clipVals[300];
...
int r0 = clipVals_[y0 + chromaR];
int g0 = clipVals_[y0 + chromaG];
int b0 = clipVals_[y0 + chromaB];
int r1 = clipVals_[y1 + chromaR];
int g1 = clipVals_[y1 + chromaG];
int b1 = clipVals_[y1 + chromaB];

We could write pixels in one single operation instead of using 4 write operations (one per single byte). We’ll create different precalc tables to store values for the R, G and B components with shift masks and alpha transparency already in place.

for(i = 0; i < 1024; i++) {
    clipVals[i]  = min(max(i - 300, 0), 255);
    clipValsR[i] = 0xFF000000 | (min(max(i - 300, 0), 255) << 16);
    clipValsG[i] = min(max(i - 300, 0), 255) << 8;
    clipValsV[i] = min(max(i - 300, 0), 255);
}
void inline renderFrame(unsigned char *src, char *dst, int rect[4], int frame, int dst_stride) {
    int i, j;
    int *clipValsR_ = &clipValsR[300];
    int *clipValsG_ = &clipValsG[300];
    int *clipValsB_ = &clipValsB[300];
    int *dsti = (int *) dst;
    for (i = 0; i < src_height; i++) {
        int y_rpos = i * src_stride;
        int uv_rpos = uvStart + ((i/2) * src_stride);
        int wpos = i * dst_stride;
        for(j = 0; j < src_width; j++) {
            int u = src[uv_rpos    ];
            int v = src[uv_rpos + 1];
            int y0 = factorY[src[y_rpos    ]];
            int y1 = factorY[src[y_rpos + 1]];
            int chromaR = factorRV[u];
            int chromaG = factorGU[u] + factorGV[v];
            int chromaB = factorBU[u];
            dst[wpos    ] = clipValsR_[y0 + chromaR] | clipValsG_[y0 + chromaG] | clipValsB_[y0 + chromaB];
            dst[wpos + 1] = clipValsR_[y1 + chromaR] | clipValsG_[y1 + chromaG] | clipValsB_[y1 + chromaB];
            wpos += 2;
            y_rpos += 2;
            uv_rpos += 2;
        }
    }
}

Execution time
Small graph showing the execution time of the same function with different optimisations. At the end we achieved around 600% performance increase.
exec_time
There are still quite a lot of things than can be done, stay tuned for the second part of this post!

Outcome BlackBerry 10 presentation

Last thursday I did an introduction to BlackBerry10 for Android developers. I focused most on the current limitations and possibilities of the Android Runtime of BlackBerry10 and I’ve also used some slides explaining how to convert a binary .apk file to a BlackBerry10 .bar file and sign using command line tools. There are, of course, easier tools (eclipse plugin, online repackager) but those don’t really need an explanation. At the end there is also a very quick introduction to QML/Qt and native vs cascades development.
bb_android
Something we learned about the event is people is really tired after 4 days of Mobile World Congress and gets a bit lazy if it’s raining pretty heavily. Luckily the event was live-streamed and there is a recording of the whole presentation.
Find below some links of the event:
Pictures of the event at Google+
Slides (.pdf)
Video capture of the whole presentation
Please stay tuned for the next BlackBerry Group meeting in early April.
http://www.bbdevbcn.org

Més aparicions a TV!

El passat dia 16 Canal Blau TV va parlar de l’aposta de Vilanova pels smartphones. Dins d’aquesta “petita” notícia va fer una refèrencia a Service2Media i a la petita aplicació que vaig desenvolupar per a llegir les notícies de l’ajuntament de Vilanova publicades al seu RSS Feed.
Aquí teniu el vídeo. Ho podeu veure a partir del minut 12:25
http://blog.rafols.org/player_flv_maxi.swf
Link al video original a TDT Garraf
Per a descarregar l’aplicatiu, només per a Android: https://market.android.com/details?id=com.oocit.ajuntamentvilanova o fent servir el següent codi:

També vaig sortir breument en el capítol 91 de La Malla tendències. Aquí us deixo el tros que parla sobre smartphones:
http://blog.rafols.org/player_flv_maxi.swf
Link al video original a TDT Garraf

Barcelona BlackBerry Developer's Group

This last weekend we got the official confirmation from Research In Motion about our proposal to create an official, RIM-Sponsored, Barcelona BlackBerry Developer’s Group. And it’s no longer a proposal, it’s a reality!! Now is time to get as much developers interested and involved in the project and try to do a first meeting as soon as possible.
Meetings will be held in Neàpolis. They are an active part of the group and will give us support with infrastructure and communication.
Location:
[googlemaps https://maps.google.com/maps?f=q&source=s_q&hl=es&geocode=&q=Ne%C3%A0polis,+Vilanova+i+la+Geltr%C3%BA,+Barcelona,+es&aq=t&sll=35.101934,-95.712891&sspn=58.25614,74.179688&ie=UTF8&hq=Ne%C3%A0polis,&hnear=Vilanova+i+la+Geltr%C3%BA,+Barcelona,+Catalu%C3%B1a,+Espa%C3%B1a&cid=13198224724882845218&ll=41.22557,1.735497&spn=0.014525,0.024848&z=15&iwloc=A&output=embed&w=580&h=450]
Our intention is to do a first inaugural meeting in mid June. Please join the facebook group or send an email to info@bbdevbcn.com if you would like to assist.
web: http://www.bbdevbcn.com
facebook: http://www.facebook.com/pages/Barcelona-BlackBerry-Developers-Group/214492638575349?ref=ts
twitter: @bbdevbcn

Fuga de cervells a la vilanovina..

Des de fa uns dies estic intentant posar Vilanova al mapa… En alguns temes la veritat és que ja hi es i a més hi destaca.. però per exemple, amb el terreny de les aplicacions mòbils hi havia com un buit… Afortunadament no estic sol i ja hi ha gent que ha fet moltíssima feina per a potenciar les noves tecnologies aquí a casa.
Un dels meus motius principals per tornar d’Holanda era de montar una oficina de desenvolupament a Barcelona, ja que per raons d’idioma necessitem gent que sàpiga parlar català i castellà per donar suport als nostres clients espanyols (ves per on en aquest país parlant anglès no vas enlloc…) i tot comentant-ho amb en Conrad Rovira (d’A16 Sistemes informàtics) em va proposar la idea de mirar de situar l’oficina aquí a Vilanova i precisament a Neàpolis. A mi em va semblar una idea força interessant ja que a Vilanova tenim una “fuga de cervells” en petita escala. És el que te viure al costat d’una gran capital com Barcelona.
Després de reunir-me amb la gent de Neàpolis i amb l’Ajuntament de Vilanova, els vaig explicar el nostre projecte i entre tots vam fer una proposta, que vaig defensar amb els meus caps holandesos, per a situar l’oficina a Neàpolis, que començaria amb poca gent i mica en mica aniria creixent. A l’equip de direcció holandès no els va semblar malament la idea i el principal problema era, llavors, trobar o bé gent vilanovina o bé gent que estigués disposada a venir a Vilanova (allò que fem molts vilanovins d’agafar el tren a les 6.49am costa molt que passi al revés).
Ara doncs que aquest detall està arreglat només ens falta l’últim pas per a posar-hi l’oficina. Espero que ho pugui confirmar molt aviat!
Mentrestant, a nivell personal, també he estat ocupat intentant despertar aquesta curiositat per les tecnologies mòbils. Si que tenim tenim tècnics perfectament vàlids, com per exemple l’associació empresarial G15, però ens cal que els altres sectors (els que generen contingut) es donguin compte de quines ventatges i quins beneficis en podrien treure de tenir la seva marca a les butxaques d’un nombre cada cop més gran d’usuaris. Sent aquests usuaris consumidors més agressius que els “simples” usuaris d’internet (I aviat inclús en major nombre).
Com a exemples i basant-me en les idees del opendata vaig fer una aplicació molt senzilla per al Diari de Vilanova sobre terminals Android. Aquí teniu la notícia al mateix Diari:
Diari Vilanova
i el link a l’Android Market: https://market.android.com/details?id=com.oocit.diaridevilanova
Des del mateix mòbil podeu anar a la icona de Market i buscar “Vilanova” per exemple.
També, seguint la mateixa línia, vaig fer un petit lector de les notícies publicades per l’Ajuntament de Vilanova:
Android:
https://market.android.com/details?id=com.oocit.ajuntamentvilanova
Windows Phone 7:
http://redirect.zune.net/External/LaunchZuneProtocol.aspx?pathuri=navigate%3FphoneAppID%3D97cc22d3-6b7b-e011-986b-78e7d1fa76f8
Com que necessita el software Zune de Microsoft, també incloc una captura:
ajvil_wp7
Aviat les dues aplicacions estaran disponibles també per a dispositius BlackBerry i iPhone!
M’agradaria donar un parell de tocs d’atenció des d’aquí.. el primer és per a l’Ajuntament.. estaria bé que obríssiu més dades estil com té la Generalitat, així molta més gent podria fer coses (tant per mòbils com per projectes web, …) i també una mica a les universitats.. Com pot ser, realment, que enginyers a punt de treure’s el títol tinguin un nivell d’anglès pèssim o literalment inexistent? Per què l’anglès no és una prioritat, per exemple, en les carreres tècniques on la major part de documentació, exemples i eines estan en anglès? M’ha costat suor i llàgrimes fer entendre als meus caps holandesos com pot ser que els enginyers de casa, en major part, no parlin anglès. Senzillament no ho entenen… i, la veritat, jo tampoc.