BlackBerry 10 – Native algorithm optimization

When looking for performance it’s important to understand what is going on at the compiler level and how the compiler is optimising our code. For example, enabling the assembly output will allow us to see what is exactly generating and have a greater understanding of why some parts of the code execute faster or the reason why it’s not performing as it should.

BlackBerry 10 uses the QCC compiler (more information here: QCC). Enabling the right flags will generate a .s file of the generated assembly:

qcc_flags2

qcc_flags3

Once we got the flags enabled we can start optimising the code. Let’s take as example a YUV2RGB converter and optimise it for BlackBerry 10 devices (used in this app http://appworld.blackberry.com/webstore/content/21198027/) . This converter can be used to read from camera preview buffers and convert it to RGB to save it as an image or show it on the screen. YUV is a format the encodes luma & chroma independently:

yuv_format

To make this example simpler, assume the following values (hardcoded from a camera preview buffer size)

int src_height = 768;
int src_width = 1024;
int src_stride = 1024;

int uvStart = 1024 * 768; // offset to the UV data (after all the Y data)

This is our first approach:

void inline renderFrame(unsigned char *src, char *dst, int rect[4], int frame, int dst_stride) {
    int i, j;
    for (i = 0; i < src_height; i++) {
        for(j = 0; j < src_width; j++) {
            int y_rpos = i * src_stride + j;
            int uv_rpos = uvStart + ((i/2) * src_stride) + (j/2) * 2;
            int wpos = i * dst_stride + j * 4;

            float y = src[y_rpos     ];
            float u = src[uv_rpos    ];
            float v = src[uv_rpos + 1];

            int r = clip((int) ((y - 16) * 1.164                     + 1.596 * (v - 128)));
            int g = clip((int) ((y - 16) * 1.164 - 0.391 * (u - 128) - 0.813 * (v - 128)));
            int b = clip((int) ((y - 16) * 1.164 + 2.018 * (u - 128)));

            dst[wpos    ] = b;
            dst[wpos + 1] = g;
            dst[wpos + 2] = r;
            dst[wpos + 3] = 0xff;
        }
    }
}

As with this YUV format a pair of Y values share the same UV values we can generate 2 pixels with just 2 Y and 1 single UV pair, generating 2 pixels per iteration.

float y0 = src[y_rpos    ];
float y1 = src[y_rpos + 1];
float u = src[uv_rpos    ];
float v = src[uv_rpos + 1];

int r0 = clip((int) ((y0 - 16) * 1.164                     + 1.596 * (v - 128)));
int g0 = clip((int) ((y0 - 16) * 1.164 - 0.391 * (u - 128) - 0.813 * (v - 128)));
int b0 = clip((int) ((y0 - 16) * 1.164 + 2.018 * (u - 128)));

int r1 = clip((int) ((y1 - 16) * 1.164                     + 1.596 * (v - 128)));
int g1 = clip((int) ((y1 - 16) * 1.164 - 0.391 * (u - 128) - 0.813 * (v - 128)));
int b1 = clip((int) ((y1 - 16) * 1.164 + 2.018 * (u - 128)));

dst[wpos    ] = b0;
dst[wpos + 1] = g0;
dst[wpos + 2] = r0;
dst[wpos + 3] = 0xff;

dst[wpos + 4] = b1;
dst[wpos + 5] = g1;
dst[wpos + 6] = r1;
dst[wpos + 7] = 0xff;

So far so good, but there are many arithmetic operations that can be shared and there is no need to do them twice

float y0 = src[y_rpos    ] - 16;
float y1 = src[y_rpos + 1] - 16;
float u = src[uv_rpos    ] - 128;
float v = src[uv_rpos + 1] - 128;

float y0v = y0 * 1.164;
float y1v = y1 * 1.164;

float chromaR = 1.596 * v;
float chromaG = -0.391 * u - 0.813 * v;
float chromaB = 2.018 * u;

int r0 = clip((int) (y0v + chromaR));
int g0 = clip((int) (y0v + chromaG));
int b0 = clip((int) (y0v + chromaB));

int r1 = clip((int) (y1v + chromaR));
int g1 = clip((int) (y1v + chromaG));
int b1 = clip((int) (y1v + chromaB));
...

and we’ve improved 14 multiplications and 20 additions/subtractions to 6 multiplications and 12 additions/substractions. Pretty sure our CPU would like that although we’re far from done. There are many values that doesn’t change in every pixel so we can move those to the external loop.

void inline renderFrame(unsigned char *src, char *dst, int rect[4], int frame, int dst_stride) {
    int i, j;
    for (i = 0; i < src_height; i++) {
        int y_rpos = i * src_stride;
        int uv_rpos = uvStart + ((i/2) * src_stride);
        int wpos = i * dst_stride;

        for(j = 0; j < src_width; j++) {
            float y0 = src[y_rpos    ] - 16;
            float y1 = src[y_rpos + 1] - 16;
            float u = src[uv_rpos    ] - 128;
            float v = src[uv_rpos + 1] - 128;

            float y0v = y0 * 1.164;
            float y1v = y1 * 1.164;

            float chromaR = 1.596 * v;
            float chromaG = -0.391 * u - 0.813 * v;
            float chromaB = 2.018 * u;

            int r0 = clip((int) (y0v + chromaR));
            int g0 = clip((int) (y0v + chromaG));
            int b0 = clip((int) (y0v + chromaB));

            int r1 = clip((int) (y1v + chromaR));
            int g1 = clip((int) (y1v + chromaG));
            int b1 = clip((int) (y1v + chromaB));

            dst[wpos    ] = b0;
            dst[wpos + 1] = g0;
            dst[wpos + 2] = r0;
            dst[wpos + 3] = 0xff;

            dst[wpos + 4] = b1;
            dst[wpos + 5] = g1;
            dst[wpos + 6] = r1;
            dst[wpos + 7] = 0xff;

            wpos += 8;
            y_rpos += 2;
            uv_rpos += 2;
        }
    }
}

As some CPU doesn’t contain hardware accelerated floating point processors, floating point operations are emulated in software. In others, plain integer operations are usually faster. We can simulate floating point arithmetics with plain integers using Fixed point arithmetics

int y0v = y0 * 298;   // 1.164 * 256
int y1v = y1 * 298;   // 1.164 * 256

int chromaR = 408 * v;                // 1.596 * 256
int chromaG = -100 * u - 208 * v;     // - 0.391 * 256, 0.813 * 256
int chromaB = 517 * u;                // 2.018 * 256

int r0 = clip((y0v + chromaR) >> 8);  // >> 8 is equivalent to dividing by 256
int g0 = clip((y0v + chromaG) >> 8);
int b0 = clip((y0v + chromaB) >> 8);

int r1 = clip((y1v + chromaR) >> 8);
int g1 = clip((y1v + chromaG) >> 8);
int b1 = clip((y1v + chromaB) >> 8);

As YUV values are always 0 <= YUV <= 255 some operations can be already recalculated in small arrays

void precalc() {
    int i;
    for(i = 0; i > 8;
        factorRV[i] = ( 408 * (i - 128)) >> 8;
        factorGU[i] = (-100 * (i - 128)) >> 8;
        factorGV[i] = (-208 * (i - 128)) >> 8;
        factorBU[i] = ( 517 * (i - 128)) >> 8;

        clipVals[i] = min(max(i - 300, 0), 255);
    }
}

We also precalculated the clipping values, so we remove the comparisons in the inner loop.

No we can use these precalc tables in our code:

int y0 = factorY[src[y_rpos    ]];
int y1 = factorY[src[y_rpos + 1]];

int chromaR = factorRV[v];
int chromaG = factorGU[u] + factorGV[v];
int chromaB = factorBU[v];

int r0 = clipVals[y0 + chromaR + 300];
int g0 = clipVals[y0 + chromaG + 300];
int b0 = clipVals[y0 + chromaB + 300];

int r1 = clipVals[y1 + chromaR + 300];
int g1 = clipVals[y1 + chromaG + 300];
int b1 = clipVals[y1 + chromaB + 300];

We introduced a 300 offset into the array indexes to avoid problems when values are negative. We can avoid that index offset by doing the following:

int *clipVals_ = &clipVals[300];

...

int r0 = clipVals_[y0 + chromaR];
int g0 = clipVals_[y0 + chromaG];
int b0 = clipVals_[y0 + chromaB];

int r1 = clipVals_[y1 + chromaR];
int g1 = clipVals_[y1 + chromaG];
int b1 = clipVals_[y1 + chromaB];

We could write pixels in one single operation instead of using 4 write operations (one per single byte). We’ll create different precalc tables to store values for the R, G and B components with shift masks and alpha transparency already in place.

for(i = 0; i < 1024; i++) {
    clipVals[i]  = min(max(i - 300, 0), 255);
    clipValsR[i] = 0xFF000000 | (min(max(i - 300, 0), 255) << 16);
    clipValsG[i] = min(max(i - 300, 0), 255) << 8;
    clipValsV[i] = min(max(i - 300, 0), 255);
}

void inline renderFrame(unsigned char *src, char *dst, int rect[4], int frame, int dst_stride) {
    int i, j;

    int *clipValsR_ = &clipValsR[300];
    int *clipValsG_ = &clipValsG[300];
    int *clipValsB_ = &clipValsB[300];
    int *dsti = (int *) dst;

    for (i = 0; i < src_height; i++) {
        int y_rpos = i * src_stride;
        int uv_rpos = uvStart + ((i/2) * src_stride);
        int wpos = i * dst_stride;

        for(j = 0; j < src_width; j++) {
            int u = src[uv_rpos    ];
            int v = src[uv_rpos + 1];

            int y0 = factorY[src[y_rpos    ]];
            int y1 = factorY[src[y_rpos + 1]];

            int chromaR = factorRV[u];
            int chromaG = factorGU[u] + factorGV[v];
            int chromaB = factorBU[u];

            dst[wpos    ] = clipValsR_[y0 + chromaR] | clipValsG_[y0 + chromaG] | clipValsB_[y0 + chromaB];
            dst[wpos + 1] = clipValsR_[y1 + chromaR] | clipValsG_[y1 + chromaG] | clipValsB_[y1 + chromaB];

            wpos += 2;
            y_rpos += 2;
            uv_rpos += 2;
        }
    }
}

Execution time

Small graph showing the execution time of the same function with different optimisations. At the end we achieved around 600% performance increase.

exec_time

There are still quite a lot of things than can be done, stay tuned for the second part of this post!

JS1k 2014 entries

Last year I said I had a very ambitious idea for the js1k 2013 competition but I didn’t had the time and I thought to leave it for the following year. This year, once again, I could only work on my entry the day before the deadline, so I decided to do something else and start from scratch.

I saw this animated gif long time ago (which, by the way, I don’t know the original source or author)

So I thought it would be good to make a 1k version of it.

To be honest I’m not very happy with the result, I’m pretty sure it could be done better & nicer. Anyway I submitted it to the competition and here is the result:

http://js1k.com/2014-dragons/demo/1973

Find below the full source code or check the details page on js1k

//values are just trial & error
w=window
f=w.innerHeight/w.innerWidth
w=1343
h=w*f

_=Math
C=_.cos
S=_.sin
R=_.random

F=512
Z=2200
P=6.28

l=[]
A=[]
B=[]

L=54
p=P/L
Y=N=0

//generate lamp posts in a circle on both sides and pointing to the center
df='000100110010000001011010010110110010010011011010'.split('')
for(i=0;i=3;k--) {
    x=k*F*S(p*i)
    z=k*F*C(p*i)
    x2=(k+v)*F*S(p*i)
    z2=(k+v)*F*C(p*i)

    for(j=0;j9&&j&2){
        xv=x2
        zv=z2
      }
      A[N++]=xv+df[j*3  ]*10
      A[N++]=-df[j*3+1]*512
      A[N++]=zv+df[j*3+2]*8
    }
    i+=.5
    v=-v
  }
}

//store where lamp posts end
K=N

//generate random lines
k=0
for(i=L*8;i--;) {
  l[k++]=R()*P
  l[k++]=R()
  l[k++]=-R()*128-64
  l[k++]=R()*.2+.2;
}

N+=L*96;	//8*12

setInterval(function(){
  Y+=.004
  b.style.background='#000'

  //clear screen
  a.width=w
  a.height=h

  //generate trail lines & move them
  j = K
  k=i=0;
  for(;i<l*8;i++) {
    d=l[k++]
    p=l[k++]
    y=l[k++]
    e=l[k++]
    r=(3.1+p*.8)*F

    v=4*S(y+i)
    g=[0,v,v,0]
    for(v=0;v<4;v++) {
      x=r*S(d)
      z=r*C(d)

      A[j++]=x
      A[j++]=y+g[v]
      A[j++]=z
      if(v&1) d+=e
    }
    l[k-4]+=.02

    // // move lines in both directions
    // if(i= K - trail lines, half red #f22, half white #fff
  k=i=0
  for(;i=K) {
      if(i*12=2) {
        c.lineTo(B[k],B[k+1]);
      }
      k+=3
    }
    c.fill()

    //draw lamp post light after drawing the last top horizontal quad
    //lamp posts are formed of 4 quads:
    //0 & 1 vertical
    //2 & 3 top horizontal
    //avoid drawing lights on trail lights
    if(i % 4 == 3 && i*12<k) {
      c.fillStyle="#fff"
      c.beginPath()
      c.arc(B[k-3], B[k-2], (2-B[k-1])*13,0,P)
      c.fill()
    }
  }
}, 20)

Code is compressed under 1k thanks to Google Closure Compiler and RegPack v3

As I wasn’t really happy with the result I did another quick entry based on the old good daCube2 colors & idea:

Find below the result:

http://js1k.com/2014-dragons/demo/1969

and the source code or check the details page on js1k:

w=window
h=w.innerHeight
w=w.innerWidth

_=Math
mC=_.cos
mS=_.sin
R=_.random

F=512
Z=2048

N=L=X=Y=0

A=[]
B=[]
C=[]
D=[]

addCube=function(s) {
  k=N/3
  j='000100010110001101011111'.split('')
  for(i=0;i<24;i++) {
    A[N++]=(j[i]-.5)*2*s
  }

  l='011332200445514662577367'.split('')
  for(i=0;i<24;i++) {
    D[L++]=k+(l[i]|0)
  }
}

for(z=200;z--;){
  addCube(100+z*5)
}
anim=.004

setInterval(function(){
  Y+=anim*R()+anim
  X+=anim*R()+anim
  y=Y
  x=X
  //X+=.01
  b.style.background='#444'

  //clear screen
  a.width=w
  a.height=h

  k=0
  //dy rotation
  for(i=N;i--;) {
    if(k%24*3==0) y-=anim*2
    C[k+2]=A[k+2]*mC(y)-A[k]*mS(y)
    C[k  ]=A[k+2]*mS(y)+A[k]*mC(y)
    k+=3
  }

  k=0
  //dx rotation
  for(i=N;i--;) {
    if(k%24*3==0) x+=anim*2
    B[k+1]=A[k+1]*mC(x)-C[k+2]*mS(x)
    B[k+2]=A[k+1]*mS(x)+C[k+2]*mC(x)
    k+=3
  }

  //transform 3d-2d
  k=p=0
  for(i=N;i--;) {
    C[k++]=F*C[p  ]/(C[p+2]+Z)+w/2
    C[k++]=F*B[p+1]/(C[p+2]+Z)+h/2
    p+=3
  }


  p=0
  for(i=0;i<l/24;i++) {
    c.strokeStyle="#fff"
    c.globalAlpha=(i/(L/24))
    c.beginPath()
    for(k=24;k--;){
      j=D[p++]
      c.moveTo(C[j*2],C[j*2+1]);
      j=D[p++]
      c.lineTo(C[j*2],C[j*2+1]);
    }
    c.stroke()
  }
}, 20)

BcnDevCon Presentation – Improving Java & Dalvik Code Performance

Last week I did a presentation at BcnDevCon about improving Java Code Performance. The focus of the presentation was showing some examples of compiled java sources and evaluate the performance impact of different ways of looping, string concatenation or using Java 1.5 features as autoboxing or foreach loops. According to java the performance optimizations are always left to the JVM, but we will see we can do many things to improve our code performance by knowing how the compiler works.

Some of these examples are also shown on Davlvik bytecode and performance tests are executed on both a computer and an android device. Even if Dalvik is register based and standard java bytecode is stack base in general terms what works for standard java can be also applied for Android apps.

On future posts I will explain in more detail the performance graphs and other topics I didn’t had time to include in the presentation.

Screen Shot 2013-11-15 at 00.51.53
Slides

What your mom didn't tell you about autoboxing…

Autoboxing is a nice feature added to Java 5 to avoid writing boxing code to add, for example, primitive data types to a Collection as you can only add the appropriate wrapper class (Integer for int, Long for long, …). I’ve recently seen the autoboxing feature being widely used all over the place so I decided to write this post.

Oracle itself warns that “It is plenty fast enough for occasional use, but it would be folly to use it in a performance critical inner loop.” and “An Integer is not a substitute for an int; autoboxing and unboxing blur the distinction between primitive types and reference types, but they do not eliminate it.“. So that basically means do not use autoboxing on portions of code where performance counts.

To illustrate the problem I’ve written an example:


public class Autoboxing {
// method using "accidentally" autoboxing
public void method1() {
long t = System.currentTimeMillis();

Long total = 0l;
for(Integer i = 0; i < 10000000; i++) {
total += i;
}

System.out.println(" total method1: " + total + " time:" + (System.currentTimeMillis() - t));
}

// method not using autoboxing
public void method2() {
long t = System.currentTimeMillis();

long total = 0;
for(int i = 0; i < 10000000; i++) {
total += i;
}

System.out.println(" total method2: " + total + " time:" + (System.currentTimeMillis() - t));
}

public static void main(String[] args) {
Autoboxing abox = new Autoboxing();
abox.method1();
abox.method2();
}
}

The first method method1 uses some autoboxing (the Integer and the Long). The second method method2 only uses primitive data types. Both methods iterate 1.000.000 times and when they finish they print the total and the amount of milliseconds since they started.

The bytecode of the for loop of both methods already indicates, subtly, that method2 will be more efficient:

Bytecode for loop method1:

9: iconst_0
10: invokestatic #4; //Method java/lang/Integer.valueOf:(I)Ljava/lang/Integer;
13: astore 4
15: aload 4
17: invokevirtual #5; //Method java/lang/Integer.intValue:()I
20: ldc #6; //int 10000000
22: if_icmpge 65
25: aload_3
26: invokevirtual #7; //Method java/lang/Long.longValue:()J
29: aload 4
31: invokevirtual #5; //Method java/lang/Integer.intValue:()I
34: i2l
35: ladd
36: invokestatic #3; //Method java/lang/Long.valueOf:(J)Ljava/lang/Long;
39: astore_3
40: aload 4
42: astore 5
44: aload 4
46: invokevirtual #5; //Method java/lang/Integer.intValue:()I
49: iconst_1
50: iadd
51: invokestatic #4; //Method java/lang/Integer.valueOf:(I)Ljava/lang/Integer;
54: dup
55: astore 4
57: astore 6
59: aload 5
61: pop
62: goto 15

Bytecode for loop method2:

4: lconst_0
5: lstore_3
6: iconst_0
7: istore 5
9: iload 5
11: ldc #6; //int 10000000
13: if_icmpge 28
16: lload_3
17: iload 5
19: i2l
20: ladd
21: lstore_3
22: iinc 5, 1
25: goto 9

In terms of timing, if we execute this class we get the following result:


rrafols$ java Autoboxing
total method1: 49999995000000 time:110
total method2: 49999995000000 time:9

So method2 takes only 8.18% of the total time of method1. Pretty big difference huh?

What happens if we run the same without enabling the JIT (Just in Time compiler)? Most mobile devices have very limited or non-existent JIT.


rrafols$ java -Xint Autoboxing
total method1: 49999995000000 time:7387
total method2: 49999995000000 time:141

Method2 takes only 1.91% of the total time of method1. Not only that, imagine we’re processing something that we have to show back to the user, there is a huge impact between having the user to wait for ~8 seconds or to wait for 150ms.

Please use common sense and be extra careful when using code you are actually not controlling or the compiler is generating for you.

Aparició en un episodi de Gentilicis

Ara fa unes setmanes l’equip del programa Gentilicis va estar per Vilanova gravant un nou episodi. En aquest episodi entrevisten a Toni Albà i parlen sobre el pasat, present i futur de la ciutat amb altres vilanovins. Concretament parlen amb Anton Font, fundador dels Joglars; Albert Tubau, de l’ADEG i amb mi com a vilanoví (o geltrunenc en el seu defecte) i representant de Service2Media a la ciutat.

Feu click a la imatge per veure el vídeo del programa:

My js1k entry

I was thinking to take part into the javascript 1k competition (here) but as I didn’t had too much free time and never really did that much in javascript targeting at 1k, I discarded the idea. Until I got a very ambitious idea the last day before the deadline.. how appropriate.. So I started trying few things and, to be honest, they didn’t work as I expected.. so I saved my ambitious idea for the next competition 😉

At the end, as I was already programming in Javascript I did a waving catalan flag in roughly 1k without paying attention to optimization. Pretty simple but enough for being my first time on a 1k javascript competition. Anyway, after doing some basic optimizations I still got a couple hundred bytes left so I’ve decided to add some extras. Check the demo for more details 🙂

Click the image below for the demo:
Screen Shot 2013-04-01 at 1.11.46 AM

Just in case anyone is interested here is the fully uncompressed code with few comments.

k = 0
w=window
h=w.innerHeight
w=w.innerWidth
h2=h/2
W=H=40

setInterval(function(){
    // clear screen
    c.width=w
    c.height=h

    o = 0;
    p = []

    //Generate flag points. Before size-optimizing the code they were generated only once and stored in an external array.
    //To save one loop from the code they're generated every single frame.. oh well..
    for(j = 0; j < H; j++) {
        for(i = 0; i < W; i++) {
            x = 1.2 * i * w / W +8 * Math.sin((j + k) * 0.15)
            y = 1.2 * j * h / H + 8 * Math.sin((i + k) * 0.3)
            z = 512 + 32*Math.sin((i+j+k) * 0.2)

            p[o] = (512 * x) / z
            p[o+1] = (512 * y) / z
            p[o+2] = z;
            o+=3
        }
    }

    // draw the flag
    o = 0
    for(j = 0; j < H-4; j++) {
        for(i = 0; i < W-1; i++) {
            // compute cross product to compute light factor
            // this can be probably done in waaaay less bytes
            l = []
            l[0] = p[o+3] - p[o]
            l[1] = p[o+4] - p[o+1]
            l[2] = p[o+5] - p[o+2]

            m = []
            m[0] = p[o+W*3] - p[o]
            m[1] = p[o+W*3+1] - p[o+1]
            m[2] = p[o+W*3+2] - p[o+2]

            n =[]
            n[0] = l[1]*m[2] - l[2]*m[1]
            n[1] = l[2]*m[0] - l[0]*m[2]
            n[2] = l[0]*m[1] - l[1]*m[0]

            n = n[2] / Math.sqrt(n[0] * n[0] + n[1] * n[1] + n[2] * n[2]);

            // square it to make results more dramatic
            n *= n;

            // choose flag color (red/yellow) to render the different color stripes depending on the vertical coordinate
            g = 221
            b = 9
            if(((j / 4) % 2|0) == 0){
                r = 253
            } else {
                r = g
                g = b
            }

            // apply the lightning factor and convert it back to integer using "|0"
            r=r*n|0
            g=g*n|0
            b=b*n|0

            a.fillStyle="rgba("+r+","+g+","+b+",1)"
            a.strokeStyle=a.fillStyle

            // actual render of the quad. Fill & stroke to polish (antialiasing, subpixel,..)
            a.beginPath()
            a.lineTo(p[o], p[o + 1])
            a.lineTo(p[o+3], p[o + 4])
            a.lineTo(p[o + 3 + W*3], p[o + 4 + W*3])
            a.lineTo(p[o + W*3], p[o + 1 + W*3])
            a.fill()
            a.stroke()
            o+=3
        }
        o+=3
    }

    // draw the "nyan cat" like trail
    w4=4*w/6
    w1=w/10
    cols=["#5E96FF","#387DFF","#0F47AF","#0D3E99","#0B347F","#08265B"]
    for(j = 0; j < 32; j++) {
        for(i = 0; i< 6;i++) {
            a.fillStyle=cols[i]
            a.strokeStyle=a.fillStyle
            wb=w4 / 32
            hb=w1 / 6
            x=j * wb
            y=h2 + (i-3) *hb - hb*Math.sin(k*0.2 + j *0.6)
            a.fillRect(x,y,wb,hb)
            a.strokeRect(x,y,wb,hb)
        }
    }

    // draw the star
    a.fillStyle="#ffffff"
    a.strokeStyle=a.fillStyle
    a.beginPath()
    for(i = 0; i < 5; i++) {
        s = Math.sin(k * 0.2)*32
        l = w1 * Math.sin(i* 1.256) + s/4
        m = w1 * -Math.sin(i* 1.256 + 1.57) + s
        n = w/25 * Math.sin(i* 1.256 + 0.628)
        o = w/25 * -Math.sin(i* 1.256 + 2.19) + s
        a.lineTo(w4 + l, h2 + m)
        a.lineTo(w4 + n, h2 + o)
    }
    a.fill()
    a.stroke()
    k+=0.5
},20);

And the compressed result thanks to Google Closure Compiler and JSCrush:

k=wJdow;h&HV;w&Wid^;h2=h/2;H=W=4setInterval(function_{c.wid^;c.hV=h;o=p=[]j<hui<wx=~i*w/W+815*(j!y=~j*h/H+83*`!zK2+Z`+j!]Kx/z,1]Ky/z,2]=z,o=ji=cols[i,wb4/Z,hb1/6,x=j*wb,y=N`-3)*hb-hbk+6*j,;="#ffffff";;;5>is=Zkl1*O/4,m1*-+~57O,nY+628oY-+2.19Ow4+l,Nm)w4+n,No);_;_;k+=5},20);GJ(+X,a.lJeTo(a.strokefor`=~256*i[2]	[0][1]a.fillX=Rect(x,y,wb,hb)],++)3*W*;for(j=a.begJPa^_=w),p[o*n|0,0;2*F #;i]-o+=3!+k)$])&.Jner@*mC=GMa^.sJinK=51L+",Nh2+O)+sQ=[U;jVeightXStyleY/25*Z32^th_()`(i~1.0.';for(Y in $='~`_^ZYXVUQONLKJGC@&$!	')with(_.split($[Y]))_=join(pop());eval(_)