A Random Sample of

LLVM Passes

Nick Desaulniers - Sept 10/11 22

A Catalogue of Optimizing Transformations - Allen, Cook '71

  • Inlining
  • Loop Unrolling / Fusion / Unswitching
  • Common Subexpression Elimination (CSE)
  • Code Motion (CM)
  • Constant folding + propagation
  • Dead Code Elimination (DCE)
  • Strength Reduction
  • Instruction Scheduling
  • Register Allocation
  • Peephole Optimization

LLVM Passes

  • Passes come in Module, CGSCC, Function, or Loop variants.
  • llvm/lib/Passes/PassRegistry.def
  • -mllvm -debug-pass=Structure can tell you which pass operates on which level.
  • Analysis passes and transform passes.
    • Analyses like dominance or scaler evolution can be expensive to compute; reuse if nothing changes.
  • Passes are grouped into Pipelines.
  • Passes can run on LLVM IR, SDAG, or MIR.

LLVM Passes

Transform machinery is likely distinct from policy.

llvm/lib/Transform/Utils/ contains machinery for making changes.

llvm/lib/Transform/*/ contains the policy decisions in passes.

PassBuilder

Check out these methods to get a sense of how optimization passes are ordered.

  • PassBuilder::buildPerModuleDefaultPipeline
  • PassBuilder::buildModuleSimplificationPipeline
  • PassBuilder::buildModuleOptimizationPipeline

Clang Pass Pipeline

  • -O2 (c51a12d598e9) (aarch64-linux-gnu)
  • Prefer -print-before-all to -print-after-all. -print-after-all doesn't print anything if the pass made no changes.

$ clang -mllvm -print-before-all --target=aarch64-linux-gnu \
  -O2 -c /tmp/x.c 2>&1 | grep Dump
...
*** IR Dump Before CoroEarlyPass on [module] ***
*** IR Dump Before LowerExpectIntrinsicPass on foo ***
*** IR Dump Before SimplifyCFGPass on foo ***
*** IR Dump Before SROAPass on foo ***
*** IR Dump Before EarlyCSEPass on foo ***
...
$ !! | wc -l
205
$ clang ... | grep Dump | sort -u | wc -l
168

Caveat Emptor

These slides are non-exhaustive.

Passes generally aren't commutative.

The examples shown here are barely the tip of the iceberg.

Simplify the Control Flow Graph (CFG)


  goto foo;
foo:

// poof!

Scalar Replacement of Aggregates (SROA)


struct foo my_foo {
  int x, y;
} = {
  .x = 42,
  .y = 10
};
return my_foo.x;

int x = 42;
int y = 10;
return x;

SROA invokes promoteMemoryToRegister, ie. the core of mem2reg.

Common Subexpression Elimination (CSE)

Tries to match literally identical subexpressions.


int a = b * c + d;
int e = b * c * e;

int bc = b * c;
int a = bc + d;
int e = bc * e;

Interprocedural Sparse Conditional Constant Propagation (IPSCCP)


static int foo(int x) {
  return x;
}
void bar (void) { foo(42); }

static int foo(int x) {
  return 42;
}
void bar (void) { foo(42); }

static int baz(void) {
  return 42;
}
int quux (void) {
  return baz();
}

static int baz(void) {
  return 42;
}
int quux (void) {
  return 42;
}

Dead Argument Elimination


int foo(int x) {
  return 42;
}

int foo(void) {
  return 42;
}

Combine redundant instructions aka instcombine


y = x + 1;
z = y + 1;

z = x + 2;

if (strlen(x) != 0)

if (*x)

Function Inliner


void foo (int);
void bar (void) {
  int x = 42;
  foo(x);
}
void baz (void) {
  bar();
}

void foo (int);
void bar (void) {
  int x = 42;
  foo(x);
}
void baz (void) {
  {
    int x = 42;
    foo(x);
  }
}

Jump Threading


if (x)
  y = 42;

if (y < 0)
  foo();
else
  bar();

if (x) {
  y = 42;
  bar();
  goto end;
}

if (y < 0)
  foo();
else
  bar();

end:;

Tail Call Elimination


void baz (void) {
    foo();
}

.globl baz
.type baz,@function
foo:
  stp x29, x30, [sp, #-16]!
  mov x29, sp
  bl foo
  ldp x29, x30, [sp], #16
  ret

.globl baz
.type baz,@function
foo:
  b foo

Canonicalize natural loops (Loop Simplify)

Ensure that loops are in a canonical form ("Loop Simplify Form")

  • To put something in canonical form means to reduce all possible representations to one, to simplify checks optimizations need to do.
  • Single non-critical entry edge from outside to header.
  • Exit blocks dominated by loop header.
  • Loops have 1 backedge.

Loop Simplify example

Loops have 1 backedge.


for (...) {
  if (x)
    continue;
  if (y)
    continue;
  ...
}

for (...) {
  if (x)
    goto end;
  if (y)
    goto end;
  ...
  end:;
}

MergedLoadStoreMotionPass


if (...) {
  int x = *y;
  x = foo(x);
  *y = x;
} else {
  int z = *y;
  z = bar(z);
  *y = z;
}

int xz = *y;
if (...)
  xz = foo(xz);
else
  xz = bar(xz);
*y = xz;

Global Value Numbering (GVN)

Unlike CSE, tries to compute equivalency of subexpressions. Requires SSA form.


int w = 3;
int x = 3;
int y = x + 4;
int z = w + 4;

int w = 3;
int x = w;
int y = w + 4;
int z = y;

Dead Code Elimination (DCE)


goto foo;
expensive();
foo:;

goto foo;
foo:;

LLVM has Aggressive, Bit-tracking, and Global DCE variants.

Correlated Value Propagation


if (x)
  foo(x ? b : c);

if (x)
  foo(b);

memcpy optimization


struct foo { ... } a, b;
memcpy(&a, &b, sizeof a);

a = b;

a[0] = 0;
a[1] = 0;
a[2] = 0;
a[3] = 0;

memset(a, 0, 4 * sizeof a[0]);

Dead Store Elimination (DSE)


{
  int x = 42;
  foo(x);
  ...
  x = 0;
}

{
  int x = 42;
  foo(x);
  ...
}

Global Variable Optimizer (globalopt)


int x = 4 + 5;

// only written to
static int y;

static int z;
int foo (void) {
  z = bar();
  return z;
}

// No callers
void baz(void) {}

int x = 9;

// poof! y

void foo (void) {
  int z = bar();
  return z;
}

// poof! baz

EliminateAvailableExternallyPass


__attribute__((gnu_inline))
extern inline int foo(void) {
    return 42;
}

// After inlining
// poof! foo

// -Wextern-initializer
extern int y = 42;
int x (void) {
    return y + 1;
}

int y;
int x (void) {
    return y + 1;
}

Float 2 Int


void foo(int x) {
  if (x < 1.1f)
    bar();
}

void foo(int x) {
  if (x < 1)
    bar();
}

LowerConstantIntrinsicsPass


__builtin_constant_p();
__builtin_object_size( \
    foo, type);

// 0 or 1;
// -1, 0, or min/max size
// of foo;

LoopDistributePass

Setup for loop vectorizor.


void foo (int *a) {
  // Assume 4x32b SIMD.
  // Unaligned loads.
  for (int i = 0; i != 5; ++i)
    ++a[i];
}

void foo (int *a) {
  // Assume 4x32b SIMD.
  // Unaligned loads.
  for (int i = 0; i != 4; ++i)
    ++a[i];
  ++a[5];
}

How might this code have to change if the target doesn't support unaligned loads/stores and had a few more iterations?

Loop Vectorize


void foo (int *a) {
  for (int i = 0; i != 4; ++i)
    ++a[i];
}

#include <arm_neon.h>

void foo (int32x4_t *a) {
  *a += 1;
}

TODO: fix increment/decrement on vectors

LoopLoadEliminationPass


for (int i = 0; i < 100; ++i)
  a[i+1] = a[i] + b[i];

for (int i = 0, tmp = a[i];
     i < 100; ++i) {
  a[i + 1] = tmp = tmp + b[i];
}

Many more loop passes

  • Superword Level Parallelism (SLP) Vectorizer
  • Vector Combine
  • Unroll and Jam
  • Fusion
  • Unswitching
  • Versioning

DivRemPairsPass

Division and remainder can often times be computed together. Looks for matching denominators and tries to move operations closer together.

ConstantMergePass


static const int x =
  0xdeadbeef;
static const int y =
  0xdeadbeef;
int foo (void) { return x; }
int bar (void) { baz(y); }

static const int xy =
  0xdeadbeef;
int foo (void) { return xy; }
int bar (void) { baz(xy); }

Expand large div/rem


__int128 div (__int128 a,
              __int128 b) {
  return a / b;
}

__int128 div (__int128 a,
              __int128 b) {
  // super complicated loop
}

Target may be incapable of large div/rem.

Expand Atomic instructions

Target may need libcall for _Atomic operation.

Merge contiguous icmps


// int a [200];
// int b [200];
if (a[10] == b[20] &&
    a[11] == b[21]
    a[12] == b[22]
    a[13] == b[23])
  goto foo;

// int a [200];
// int b [200];
if (memcmp(a + 10, b + 20,
    4 * sizeof(a[0])))
  goto foo;

Expand memcmp to loads+cmp

Take advantage of target specifics.


// int a [200];
// int b [200];
if (memcmp(a + 10, b + 20,
    4 * sizeof(a[0])))
  goto foo;

// long a [100];
// long b [100];
if (a[10] == b[10] &&
    a[11] == b[11])
  goto foo;

Constant Hoist


if (x)
  return 0xdeadbeefdeadbeaf;
if (y)
  return 0xdeadbeefdeadbeaf;

unsigned long z =
  0xdeadbeefdeadbeaf;
if (x)
  return z;
if (y)
  return z;

Some constants are too large to encode in one instruction. Reuse (when possible) rather than rebuild repeatedly.

Type Promotion


for (char i = 0; i < 50; ++i)

for (int i = 0; i < 50; ++i)

We have to be careful to prove that operation would produce similar results on both types. (Overflow could be a problem.) Instruction selection will later do legalization, but this pass is better suited for cyclic code (inter-block).

ISEL

Requires (multiple) separate talks.

ISEL does do many optimizations, too.

Tail Duplication


goto *(v ? &&x : &&y);
x:
  k = 5;
  goto z;
y:
  k = 42;
  goto z;
z:
  foo();
w:

goto *(v ? &&x : &&y);
x:
  k = 5;
  foo();
  goto w;
y:
  k = 42;
  foo();
  goto w;
z: // unreachable!
  foo();
w:

There's pre-RA ("early") and post-RA variants.

If Conversion


goto *(v ? &&x : &&y);
x:
  k = 5;
  foo();
  goto w;
y:
  k = 42;
  foo();
  goto w;
w:

x = v ? 5 : 42;
foo();

Peephole


mov w0, #1
add w1, w1, w0

add w1, w1, #0

Pre-RA Scheduling

Regalloc

Requires multiple separate talks.

Depending on which register allocator is being used, may involve multiple passes.

Prolog Epilog Insertion (PEI)

Expand Psuedos

Post-RA Scheduling

Block Placement