GPU Compute Shaders part 2 — fluids

30 min readFeb 3, 2022

It is the second part of the mini compute shader serie. See part1 on more technical stuff and Gray Scott simulation published before.

This tutorial is about implementation of standard Lattice Boltzmann Method in compute shaders.

Realtime simulation of 3D fluid flow described in this tutorial article.

Intro

I assume that you already know what Lattice Boltzmann Method is and you’re able to write solver for CPU. If not, I recommend wonderful websites and books about it (just Google for M. Sukop, S. Succi or Z. Guo and take one of them). In short: this is the method that allows to solve (approximate in some compressibility limit) Navier-Stokes equations without need to solve Poisson equation. The algorithm may be seen as very similar to (previously introduced) Gray-Scott algorithm. We are working on two copies of the system (A and B) and perform ping-pong operation of local relaxation and non-local transfer of some data to nearest neighbours. Is that serious method for CFD and have some applications? Well… yes. In this tutorial we will work on 3D simulation. Everyone is showing 2D always (which is the simplest and this is why). But my point is not to convince you that LBM is simple but that shaders can do the job even in 3D, so let’s go with something more fancy here.

Fluid simulation introduced in this tutorial.

Starting from scratch

We are starting with empty Open Frameworks project and will fill out the classes and write shaders. Be prepared for fast forward and get back to part1 if you don’t get some technical details.

— — — — — — — — — — — — — — — — — ofApp.h
#pragma once
#include “ofMain.h”class ofApp : public ofBaseApp
{
 public:
 void setup();
 void update();
 void draw();
}; 
 — — — — — — — — — — — — — — — — — ofApp.h (end) — — — — — — — — — — — — — — — — — ofApp.cpp
#include “ofApp.h”
#include “ofConstants.h”void ofApp::setup()
{
}void ofApp::update()
{
}void ofApp::draw()
{
}
 — — — — — — — — — — — — — — — — — ofApp.cpp (end)

We will use 3D q=15 velocity variant of the LBM method (called D3Q15). First we will setup all buffers and solver data on CPU, namely — f0/f1 for velocity distribution function, F (integer) for flags denoting solid/fluid nodes, U,V and W for velocity components. Moreover, we need some basic constants like lattice velocity vectors and inverse (inv) tab. We will setup some weights (w[q]). All this is pretty standard in LBM world and I will just put them above ofApp class first. Side note: constants will be later copied (hard coded) in shader which is ugly and bad but it works for me in this simple program — you may correct this and send some of them to shader as uniforms, I was just too lazy to do it.

— — — — — — — — — — — — — — — — — ofApp.h
#pragma once
#include “ofMain.h”const int WIDTH = 1280;  // window width
const int HEIGHT = 720;  // window height
const int NX = 200; // mesh size x
const int NY = 70;
const int NZ = 70;class ofApp : public ofBaseApp
{
 public:
 void setup();
 void update();
 void draw();
}; 
 — — — — — — — — — — — — — — — — — ofApp.h (end) — — — — — — — — — — — — — — — — — ofApp.cpp
#include “ofApp.h”
#include “ofConstants.h”const int q = 15;
float f0cpu[NX*NY*NZ*q]={0};
float f1cpu[NX*NY*NZ*q]={0};
int Fcpu[NX*NY*NZ]={0};
float Ucpu[NX*NY*NZ]={0};
float Vcpu[NX*NY*NZ]={0};
float Wcpu[NX*NY*NZ]={0};
const int ex[q] = {0,-1 , 0, 0,-1,-1,-1,-1, 1, 0, 0, 1, 1, 1, 1};
const int ey[q] = {0, 0 ,-1, 0,-1,-1, 1, 1, 0, 1, 0, 1, 1,-1,-1};
const int ez[q] = {0, 0 , 0,-1,-1, 1,-1, 1, 0, 0, 1, 1,-1, 1,-1};
const int inv[q] = {0,8,9,10,11,12,13,14,1,2,3,4,5,6,7};
const float z=2.0/9.0, l=1.0/72.0, s=1.0/9.0;
const float w[q]={z,s,s,s,l,l,l,l,s,s,s,l,l,l,l};
#define C_FLD 1
#define C_BND 0void ofApp::setup()
{
}
void ofApp::update()
{
}
void ofApp::draw()
{
}
 — — — — — — — — — — — — — — — — — ofApp.cpp (end)

These CPU data must be transferred to GPU via shader buffer objects. Thus, we will now define them in ofApp.h and allocate in ofApp.cpp. It’s quite a lot of code, but believe me, it’s nothing facy, just some array initializations for initialization of LBM and buffer copying from CPU to GPU. I also add periodic function which is helpful in setting nodes for obstacle spheres. Note that we use 1D tables and map the index (idx) to code multidimensional tables in them. This way life is a bit easier (LBM works on MULTI dimensional tables, see shader comment below).

— — — — — — — — — — — — — — — — — ofApp.h
#pragma once
#include “ofMain.h”const int WIDTH = 1280;  // window width
const int HEIGHT = 720;  // window height
const int NX = 200; // mesh size x
const int NY = 70;
const int NZ = 70;class ofApp : public ofBaseApp
{
 public:
 void setup();
 void update();
 void draw();ofBufferObject f0, f1, F, U, V, W;}; 
 — — — — — — — — — — — — — — — — — ofApp.h (end) — — — — — — — — — — — — — — — — — ofApp.cpp
#include “ofApp.h”
#include “ofConstants.h”const int q = 15;
float f0cpu[NX*NY*NZ*q]={0};
float f1cpu[NX*NY*NZ*q]={0};
int Fcpu[NX*NY*NZ]={0};
float Ucpu[NX*NY*NZ]={0};
float Vcpu[NX*NY*NZ]={0};
float Wcpu[NX*NY*NZ]={0};
const int ex[q] = {0,-1 , 0, 0,-1,-1,-1,-1, 1, 0, 0, 1, 1, 1, 1};
const int ey[q] = {0, 0 ,-1, 0,-1,-1, 1, 1, 0, 1, 0, 1, 1,-1,-1};
const int ez[q] = {0, 0 , 0,-1,-1, 1,-1, 1, 0, 0, 1, 1,-1, 1,-1};
const int inv[q] = {0,8,9,10,11,12,13,14,1,2,3,4,5,6,7};
const float z=2.0/9.0, l=1.0/72.0, s=1.0/9.0;
const float w[q]={z,s,s,s,l,l,l,l,s,s,s,l,l,l,l};
#define C_FLD 1
#define C_BND 0int per(int x, int nx)  // periodic bnd's
{
 if(x<0) x += nx;
  else 
 if(x>nx-1) x -= nx;
 return x;
}void ofApp::setup()
{
 // initialize density function and fluid flags on CPU
 for(int x=0; x<NX; x++)
 for(int y=0; y<NY; y++)
 for(int z=0; z<NZ; z++)
 {
  int idx = x+y*NX+z*NX*NY;
  for(int k=0; k<q; k++)
   f0cpu[k + q*(idx)] = w[k];
   Fcpu[idx] = 0;  // all sites empty (fluid)
 }
 
 // put some obstacles (spheres)
 srand(time(NULL));
 #define S (NY/4)
 for(int i=0; i<1; i++)  
 {
  int px = NX * (rand()/float(RAND_MAX));
  int py = NY * (rand()/float(RAND_MAX));
  int pz = NZ * (rand()/float(RAND_MAX));
  
  for(int x=px-S; x<px+S; x++)
  for(int y=py-S; y<py+S; y++)
  for(int z=pz-S; z<pz+S; z++)
  if((x-px)*(x-px) + (y-py)*(y-py) + (z-pz)*(z-pz) < S*S)
  { 
   int kx = per(x,NX);
   int ky = per(y,NY);
   int kz = per(z,NZ);
   int idx = kx+ky*NX+kz*NX*NY;
   Fcpu[idx] = 1;
  }
 }// allocate GPU buffers
 f0.allocate(NX*NY*NZ*q*sizeof(float),f0cpu,GL_STATIC_DRAW);
 f1.allocate(NX*NY*NZ*q*sizeof(float),f1cpu,GL_STATIC_DRAW);
 F.allocate(NX*NY*NZ*sizeof(int),Fcpu,GL_STATIC_DRAW);
 U.allocate(NX*NY*NZ*sizeof(float),Ucpu,GL_STATIC_DRAW);
 V.allocate(NX*NY*NZ*sizeof(float),Vcpu,GL_STATIC_DRAW);
 W.allocate(NX*NY*NZ*sizeof(float),Wcpu,GL_STATIC_DRAW);// binding
 f0.bindBase(GL_SHADER_STORAGE_BUFFER, 0);
 f1.bindBase(GL_SHADER_STORAGE_BUFFER, 1);
 F.bindBase(GL_SHADER_STORAGE_BUFFER, 2);
 U.bindBase(GL_SHADER_STORAGE_BUFFER, 3);
 V.bindBase(GL_SHADER_STORAGE_BUFFER, 4);
 W.bindBase(GL_SHADER_STORAGE_BUFFER, 5);
}void ofApp::update()
{
}void ofApp::draw()
{
}
 — — — — — — — — — — — — — — — — — ofApp.cpp (end)

The LBM compute shader

Now, when buffers are binded, we need to create a proper compute shader program. The compute shader purpose will be to do pingpong from buffer f0 to f1 (velocity distribution function) and do two steps — propagation and relaxation towards equilibrium. We start with empty shader:

— — — — — — — — — — — — — — — — — computeshader.cs
#version 440
void main() 
{
}
— — — — — — — — — — — — — — — — — computeshader.cs (end)

Add LBM constants as in ofApp.cpp file (again, yes, you can and should use uniforms to update these and keep the same in both files).

— — — — — — — — — — — — — — — — — computeshader.cs
#version 440
#define tau 0.91//0.6
#define omega (1.0/tau) // viscosity, etc.
const int q = 15;
const int ex[q] = {0,-1 , 0, 0,-1,-1,-1,-1, 1, 0, 0, 1, 1, 1, 1};
const int ey[q] = {0, 0 ,-1, 0,-1,-1, 1, 1, 0, 1, 0, 1, 1,-1,-1};
const int ez[q] = {0, 0 , 0,-1,-1, 1,-1, 1, 0, 0, 1, 1,-1, 1,-1};
const int inv[q] = {0,8,9,10,11,12,13,14,1,2,3,4,5,6,7};
const float z=2.0/9.0, l=1.0/72.0, s=1.0/9.0;
const float w[q]={z,s,s,s,l,l,l,l,s,s,s,l,l,l,l};
#define C_FLD 0
#define C_BND 1
void main() 
{
}
— — — — — — — — — — — — — — — — — computeshader.cs (end)

Next, we need to update our compute shader with all needed data and buffers that were binded from CPU side (in the order we assumed there). Additionally, here we will specify the size of working groups (I use ¹⁰³ but you may experiment with other values, just don’t get over your GPU limit).

— — — — — — — — — — — — — — — — — computeshader.cs
#version 440
#define tau 0.91//0.6
#define omega (1.0/tau) // viscosity, etc.
const int q = 15;
const int ex[q] = {0,-1 , 0, 0,-1,-1,-1,-1, 1, 0, 0, 1, 1, 1, 1};
const int ey[q] = {0, 0 ,-1, 0,-1,-1, 1, 1, 0, 1, 0, 1, 1,-1,-1};
const int ez[q] = {0, 0 , 0,-1,-1, 1,-1, 1, 0, 0, 1, 1,-1, 1,-1};
const int inv[q] = {0,8,9,10,11,12,13,14,1,2,3,4,5,6,7};
const float z=2.0/9.0, l=1.0/72.0, s=1.0/9.0;
const float w[q]={z,s,s,s,l,l,l,l,s,s,s,l,l,l,l};
#define C_FLD 0
#define C_BND 1layout( binding=0 ) buffer df0 { float f0[  ]; };
layout( binding=1 ) buffer df1 { float f1[  ]; };
layout( binding=2 ) buffer dcF { int F[  ]; };
layout( binding=3 ) buffer dcU { float U[  ]; };
layout( binding=4 ) buffer dcV { float V[  ]; };
layout( binding=5 ) buffer dcW { float W[  ]; };
layout( local_size_x = 10, local_size_y = 10, local_size_z = 10 ) in;void main() 
{
}
— — — — — — — — — — — — — — — — — computeshader.cs (end)

From here we start the fun part, just add per() periodic helpful function, define external force (devF) and get thread number from gl_GlobalInvocationID:

— — — — — — — — — — — — — — — — — computeshader.cs
#version 440
#define tau 0.91//0.6
#define omega (1.0/tau) // viscosity, etc.
const int q = 15;
const int ex[q] = {0,-1 , 0, 0,-1,-1,-1,-1, 1, 0, 0, 1, 1, 1, 1};
const int ey[q] = {0, 0 ,-1, 0,-1,-1, 1, 1, 0, 1, 0, 1, 1,-1,-1};
const int ez[q] = {0, 0 , 0,-1,-1, 1,-1, 1, 0, 0, 1, 1,-1, 1,-1};
const int inv[q] = {0,8,9,10,11,12,13,14,1,2,3,4,5,6,7};
const float z=2.0/9.0, l=1.0/72.0, s=1.0/9.0;
const float w[q]={z,s,s,s,l,l,l,l,s,s,s,l,l,l,l};
#define C_FLD 0
#define C_BND 1layout( binding=0 ) buffer df0 { float f0[  ]; };
layout( binding=1 ) buffer df1 { float f1[  ]; };
layout( binding=2 ) buffer dcF { int F[  ]; };
layout( binding=3 ) buffer dcU { float U[  ]; };
layout( binding=4 ) buffer dcV { float V[  ]; };
layout( binding=5 ) buffer dcW { float W[  ]; };
layout( local_size_x = 10, local_size_y = 10, local_size_z = 10 ) in;void main() 
{
 int NX=200;
 int NY=70;
 int NZ=70;
 float devFx = 0.0002;
 float devFy = 0.0;
 float devFz = 0.0;int i = int(gl_GlobalInvocationID.x);
 int j = int(gl_GlobalInvocationID.y);
 int k = int(gl_GlobalInvocationID.z);
 int idx = i+j*NX+k*NX*NY;}
— — — — — — — — — — — — — — — — — computeshader.cs (end)

Above, idx will be an index used in mapping my 18d tables (yes, we will use 18 dimensional table here!) and 3d tables to 1d.

The main LBM algorithm

In the next step, we will implement the actual LBM algorithm. The algorithm starts with checking if we are in the FLD node and then performes 3 main steps: 1- computing of macroscopic variables (velocity and density), adding forces, and 2- doing relaxation and propagation steps. Again I assume that you know what these steps are, if not — just copy that piece of code for the moment. I use BGK approximation and code two steps in one expression doing just quick inversion if the outcoming node is boundary (no-slip bounce-back condition). Feel free to experiment with “tau” value on top of the shader (and change the one in ofApp.cpp accordingly) to get different properties of the flow. That’s being said — let’s paste the solver into our shader:

— — — — — — — — — — — — — — — — — computeshader.cs
#version 440
#define tau 0.91//0.6
#define omega (1.0/tau) // viscosity, etc.
const int q = 15;
const int ex[q] = {0,-1 , 0, 0,-1,-1,-1,-1, 1, 0, 0, 1, 1, 1, 1};
const int ey[q] = {0, 0 ,-1, 0,-1,-1, 1, 1, 0, 1, 0, 1, 1,-1,-1};
const int ez[q] = {0, 0 , 0,-1,-1, 1,-1, 1, 0, 0, 1, 1,-1, 1,-1};
const int inv[q] = {0,8,9,10,11,12,13,14,1,2,3,4,5,6,7};
const float z=2.0/9.0, l=1.0/72.0, s=1.0/9.0;
const float w[q]={z,s,s,s,l,l,l,l,s,s,s,l,l,l,l};
#define C_FLD 0
#define C_BND 1layout( binding=0 ) buffer df0 { float f0[  ]; };
layout( binding=1 ) buffer df1 { float f1[  ]; };
layout( binding=2 ) buffer dcF { int F[  ]; };
layout( binding=3 ) buffer dcU { float U[  ]; };
layout( binding=4 ) buffer dcV { float V[  ]; };
layout( binding=5 ) buffer dcW { float W[  ]; };
layout( local_size_x = 10, local_size_y = 10, local_size_z = 10 ) in;void main() 
{
 int NX=200;
 int NY=70;
 int NZ=70;
 float devFx = 0.0002;
 float devFy = 0.0;
 float devFz = 0.0;int i = int(gl_GlobalInvocationID.x);
 int j = int(gl_GlobalInvocationID.y);
 int k = int(gl_GlobalInvocationID.z);
 int idx = i+j*NX+k*NX*NY;if( F[ idx ] == C_FLD )  // if the node is fluid
 { 
  for(int m=0; m<q; m++)   // calculate density and velocity loop
  {
   rho = rho + f0[idx*q+m];
   ux  = ux + f0[idx*q+m]*ex[m];
   uy  = uy + f0[idx*q+m]*ey[m];
   uz = uz + f0[idx*q+m]*ez[m];
  }
  ux /= rho;
  uy /= rho;
  uz /= rho;
  U[ idx ] = ux;  // store velocity
  V[ idx ] = uy;
  W[ idx ] = uz;
  ux = ux + 0.5 * devFx; // update velocity with force
  uy = uy + 0.5 * devFy;
  uz = uz + 0.5 * devFz;
  for(int m=0; m<q; m++)  // collision + streaming  
  {                       // (the main solver is here, really)
   int ip = i+ex[m];
   int jp = j+ey[m];
   int kp = k+ez[m];ip=per(ip,NX-1); // periodic neighbour...
   jp=per(jp,NY-1);
   kp=per(kp,NZ-1);int idxp = ip+jp*NX+kp*NX*NY; //... and his indexif( F[ idxp ] == C_BND ) // if outcomming is boundary node
    f1[ idx*q + inv[m] ]    = (1-omega) * f0[idx*q+m] + omega * w[m]    
    * rho * (1.0 - (3.0/2.0) * (ux*ux + uy*uy * uz*uz)+ 3.0 * (ex[m]  
    * ux + ey[m]*uy + ez[m]*uz)+ (9.0/2.0) * (ex[m] * ux + ey[m]*uy 
    + ez[m]*uz) * (ex[m] * ux + ey[m]*uy + ez[m]*uz));
   else
    f1[ idxp*q + m ] = (1-omega) * f0[idx*q+m] + omega * w[m] * rho 
    * (1.0 - (3.0/2.0) * (ux*ux + uy*uy * uz*uz)+ 3.0 * (ex[m] * ux 
    + ey[m]*uy + ez[m]*uz)+ (9.0/2.0) * (ex[m] * ux + ey[m]*uy + 
    ez[m]*uz) * (ex[m] * ux + ey[m]*uy + ez[m]*uz));
  }
 }}
— — — — — — — — — — — — — — — — — computeshader.cs (end)

What are these long equations at the end? These two lines are the main core of the LBM algorithm for BGK approximation and single phase flow. Yes, it is amazing that the whole solver is kept in single line of code (actually copied to reverse on boundaries but in fact it is the same).

Dispatch Compute our LBM solver

Ok, that is it, the solver shader is done. Now we just need to load the shader, link it and run (DispatchCompute) it in order to have it executed on GPU and produce our velocity field (u,v,w) which we will use for visualization. We will now leave the computeshader.cs and get back to our CPU programme in which these three (define, load, and execute) will be done.

— — — — — — — — — — — — — — — — — ofApp.h
#pragma once
#include “ofMain.h”const int WIDTH = 1280;  // window width
const int HEIGHT = 720;  // window height
const int NX = 200; // mesh size x
const int NY = 70;
const int NZ = 70;class ofApp : public ofBaseApp
{
 public:
 void setup();
 void update();
 void draw();ofShader shader;ofBufferObject f0, f1, F, U, V, W;}; 
— — — — — — — — — — — — — — — — ofApp.h (end) 
— — — — — — — — — — — — — — — — — ofApp.cpp
#include “ofApp.h”
#include “ofConstants.h”const int q = 15;
float f0cpu[NX*NY*NZ*q]={0};
float f1cpu[NX*NY*NZ*q]={0};
int Fcpu[NX*NY*NZ]={0};
float Ucpu[NX*NY*NZ]={0};
float Vcpu[NX*NY*NZ]={0};
float Wcpu[NX*NY*NZ]={0};
const int ex[q] = {0,-1 , 0, 0,-1,-1,-1,-1, 1, 0, 0, 1, 1, 1, 1};
const int ey[q] = {0, 0 ,-1, 0,-1,-1, 1, 1, 0, 1, 0, 1, 1,-1,-1};
const int ez[q] = {0, 0 , 0,-1,-1, 1,-1, 1, 0, 0, 1, 1,-1, 1,-1};
const int inv[q] = {0,8,9,10,11,12,13,14,1,2,3,4,5,6,7};
const float z=2.0/9.0, l=1.0/72.0, s=1.0/9.0;
const float w[q]={z,s,s,s,l,l,l,l,s,s,s,l,l,l,l};
#define C_FLD 1
#define C_BND 0int per(int x, int nx)  // periodic bnd's
{
 if(x<0) x += nx;
  else 
 if(x>nx-1) x -= nx;
 return x;
}void ofApp::setup()
{shader.setupShaderFromFile(GL_COMPUTE_SHADER,"computeshader.cs");
 shader.linkProgram();// initialize density function and fluid flags on CPU
 for(int x=0; x<NX; x++)
 for(int y=0; y<NY; y++)
 for(int z=0; z<NZ; z++)
 {
  int idx = x+y*NX+z*NX*NY;
  for(int k=0; k<q; k++)
   f0cpu[k + q*(idx)] = w[k];
   Fcpu[idx] = 0;  // all sites empty (fluid)
 }
 
 // put some obstacles (spheres)
 srand(time(NULL));
 #define S (NY/4)
 for(int i=0; i<1; i++)  
 {
  int px = NX * (rand()/float(RAND_MAX));
  int py = NY * (rand()/float(RAND_MAX));
  int pz = NZ * (rand()/float(RAND_MAX));
  
  for(int x=px-S; x<px+S; x++)
  for(int y=py-S; y<py+S; y++)
  for(int z=pz-S; z<pz+S; z++)
  if((x-px)*(x-px) + (y-py)*(y-py) + (z-pz)*(z-pz) < S*S)
  { 
   int kx = per(x,NX);
   int ky = per(y,NY);
   int kz = per(z,NZ);
   int idx = kx+ky*NX+kz*NX*NY;
   Fcpu[idx] = 1;
  }
 }// allocate GPU buffers
 f0.allocate(NX*NY*NZ*q*sizeof(float),f0cpu,GL_STATIC_DRAW);
 f1.allocate(NX*NY*NZ*q*sizeof(float),f1cpu,GL_STATIC_DRAW);
 F.allocate(NX*NY*NZ*sizeof(int),Fcpu,GL_STATIC_DRAW);
 U.allocate(NX*NY*NZ*sizeof(float),Ucpu,GL_STATIC_DRAW);
 V.allocate(NX*NY*NZ*sizeof(float),Vcpu,GL_STATIC_DRAW);
 W.allocate(NX*NY*NZ*sizeof(float),Wcpu,GL_STATIC_DRAW);// binding
 f0.bindBase(GL_SHADER_STORAGE_BUFFER, 0);
 f1.bindBase(GL_SHADER_STORAGE_BUFFER, 1);
 F.bindBase(GL_SHADER_STORAGE_BUFFER, 2);
 U.bindBase(GL_SHADER_STORAGE_BUFFER, 3);
 V.bindBase(GL_SHADER_STORAGE_BUFFER, 4);
 W.bindBase(GL_SHADER_STORAGE_BUFFER, 5);
}void ofApp::update()
{// bindings (ping pong)
 static int c=0;
 f0.bindBase(GL_SHADER_STORAGE_BUFFER, c);
 f1.bindBase(GL_SHADER_STORAGE_BUFFER, 1-c);
 c=1-c;
 
 // run shader (LBM)
 shader.begin();
 shader.dispatchCompute(NX/10, NY/10, NZ/10);
 shader.end();
}void ofApp::draw()
{
}
 — — — — — — — — — — — — — — — — — ofApp.cpp (end)

OK, we are done. Now you should be able to compile the code and run it. The resulting velocity will be stored on GPU in U/V/W tables and distribution function tables f0 and f1.

Visualization

Our program, however is not interesting for end-user yet as it doesn’t show anything. What we need is to visualize the flow. Ekhm.. what? 3D? Visualization of fluids? Bulk fluid with velocity in entire space without free surface? How to visualize this is not an easy task with one solution. One can make some cut and show 2D crossection plane, one may do some fancy volume rendering of velocity data. What I suggest is to allocate thousands of particles, put them on GPU, advect on top of the flow (massless particles are advected without inertia) and then plot them directly from Vertex Buffer Object (which is fast). OK, let’s do it.

Defining particles

The particle will be an obect with position (vec4) and color (ofFloatColor here). We will define the structure for single particle within ofApp.h class now. Moreover, we will add so-called Vertex Buffer Object (vbo) to later on use it for fast rendering. Our particles will be stored (initially) on cpu in standard STD vector. Initial positions, allocation on GPU and binding is done now in ofApp.cpp as well (see below). NUMP will keep the number of particles.

— — — — — — — — — — — — — — — — — ofApp.h
#pragma once
#include “ofMain.h”const int WIDTH = 1280;  // window width
const int HEIGHT = 720;  // window height
const int NX = 200; // mesh size x
const int NY = 70;
const int NZ = 70;
const int NUMP = 550000; // number of particlesclass ofApp : public ofBaseApp
{
 public:
 void setup();
 void update();
 void draw();ofShader shader;ofBufferObject f0, f1, F, U, V, W;// particles
  struct Particle        // single particle
  {
   ofVec4f pos;
   ofFloatColor col;
  };ofVbo vbo;             // Vertex Buffer Objectvector<Particle> particles;   // array on CPUofBufferObject particlesBuffer; // GPU buffer}; 
— — — — — — — — — — — — — — — — ofApp.h (end) 
— — — — — — — — — — — — — — — — — ofApp.cpp
#include “ofApp.h”
#include “ofConstants.h”const int q = 15;
float f0cpu[NX*NY*NZ*q]={0};
float f1cpu[NX*NY*NZ*q]={0};
int Fcpu[NX*NY*NZ]={0};
float Ucpu[NX*NY*NZ]={0};
float Vcpu[NX*NY*NZ]={0};
float Wcpu[NX*NY*NZ]={0};
const int ex[q] = {0,-1 , 0, 0,-1,-1,-1,-1, 1, 0, 0, 1, 1, 1, 1};
const int ey[q] = {0, 0 ,-1, 0,-1,-1, 1, 1, 0, 1, 0, 1, 1,-1,-1};
const int ez[q] = {0, 0 , 0,-1,-1, 1,-1, 1, 0, 0, 1, 1,-1, 1,-1};
const int inv[q] = {0,8,9,10,11,12,13,14,1,2,3,4,5,6,7};
const float z=2.0/9.0, l=1.0/72.0, s=1.0/9.0;
const float w[q]={z,s,s,s,l,l,l,l,s,s,s,l,l,l,l};
#define C_FLD 1
#define C_BND 0int per(int x, int nx)  // periodic bnd's
{
 if(x<0) x += nx;
  else 
 if(x>nx-1) x -= nx;
 return x;
}void ofApp::setup()
{shader.setupShaderFromFile(GL_COMPUTE_SHADER,"computeshader.cs");
 shader.linkProgram();// initialize density function and fluid flags on CPU
 for(int x=0; x<NX; x++)
 for(int y=0; y<NY; y++)
 for(int z=0; z<NZ; z++)
 {
  int idx = x+y*NX+z*NX*NY;
  for(int k=0; k<q; k++)
   f0cpu[k + q*(idx)] = w[k];
   Fcpu[idx] = 0;  // all sites empty (fluid)
 }
 
 // put some obstacles (spheres)
 srand(time(NULL));
 #define S (NY/4)
 for(int i=0; i<1; i++)  
 {
  int px = NX * (rand()/float(RAND_MAX));
  int py = NY * (rand()/float(RAND_MAX));
  int pz = NZ * (rand()/float(RAND_MAX));
  
  for(int x=px-S; x<px+S; x++)
  for(int y=py-S; y<py+S; y++)
  for(int z=pz-S; z<pz+S; z++)
  if((x-px)*(x-px) + (y-py)*(y-py) + (z-pz)*(z-pz) < S*S)
  { 
   int kx = per(x,NX);
   int ky = per(y,NY);
   int kz = per(z,NZ);
   int idx = kx+ky*NX+kz*NX*NY;
   Fcpu[idx] = 1;
  }
 }// allocate GPU buffers
 f0.allocate(NX*NY*NZ*q*sizeof(float),f0cpu,GL_STATIC_DRAW);
 f1.allocate(NX*NY*NZ*q*sizeof(float),f1cpu,GL_STATIC_DRAW);
 F.allocate(NX*NY*NZ*sizeof(int),Fcpu,GL_STATIC_DRAW);
 U.allocate(NX*NY*NZ*sizeof(float),Ucpu,GL_STATIC_DRAW);
 V.allocate(NX*NY*NZ*sizeof(float),Vcpu,GL_STATIC_DRAW);
 W.allocate(NX*NY*NZ*sizeof(float),Wcpu,GL_STATIC_DRAW);// binding
 f0.bindBase(GL_SHADER_STORAGE_BUFFER, 0);
 f1.bindBase(GL_SHADER_STORAGE_BUFFER, 1);
 F.bindBase(GL_SHADER_STORAGE_BUFFER, 2);
 U.bindBase(GL_SHADER_STORAGE_BUFFER, 3);
 V.bindBase(GL_SHADER_STORAGE_BUFFER, 4);
 W.bindBase(GL_SHADER_STORAGE_BUFFER, 5);// particles, initialize positions on CPU
 particles.resize(NUMP);
 for(auto & p: particles)
 {
  int idx;
  do
  {
   p.pos.x = ofRandom(0,NX);
   p.pos.y = ofRandom(0,NY);
   p.pos.z = ofRandom(0,NZ);
   p.pos.w = 1;
   idx = p.pos.x+p.pos.y*NX+p.pos.z*NX*NY;
  } while(Fcpu[idx] == 1);
 }// allocate on GPU
 particlesBuffer.allocate(particles,GL_DYNAMIC_DRAW);
 vbo.setVertexBuffer(particlesBuffer,4,sizeof(Particle));
vbo.setColorBuffer(particlesBuffer,sizeof(Particle),sizeof(ofVec4f));
 // binding to use in particle compute shader particlesBuffer.bindBase(GL_SHADER_STORAGE_BUFFER, 6);
}void ofApp::update()
{// bindings (ping pong)
 static int c=0;
 f0.bindBase(GL_SHADER_STORAGE_BUFFER, c);
 f1.bindBase(GL_SHADER_STORAGE_BUFFER, 1-c);
 c=1-c;
 
 // run shader (LBM)
 shader.begin();
 shader.dispatchCompute(NX/10, NY/10, NZ/10);
 shader.end();
}void ofApp::draw()
{
}
 — — — — — — — — — — — — — — — — — ofApp.cpp (end)

Now, once we have our particles in a buffer we may implement the second shader today — it is the particles.cs shader which advects particles on top of velocity fields. Surprisingly (or not) we don’t have to bind all buffers again, it will see the same bindings we already set up for LBM compute shader. The particles shader will use color and per functions as before. We will have some “random” function that imitate random number generator. I add some trivial way of avoiding obstacles (just move particles to random positions at the beginning of the system). To move particles we use simple integration:

px = px + u*dt
py = py + v*dt
pz = pz + w*dt

At the end we colour particles using velocity dependent mapping with some logarithmic amplification (and sorry, but I hardcoded this one night so you have to decode if those values I put there have any sense — anyway they look nice for me in 3D which was not that easy to get…). The shader finishes with updating p at [idx] index with new position and color. So input is the particle array and output is the same array but with updated positions. Now, the whole shader comes at once.

— — — — — — — — — — — — — — — — — particles.cs
# version 440#define C_FLD 0
#define C_BND 1// palette
vec4 color(float t)
{
 float coltab[] = {0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0, 0.00, 0.10, 0.20};//!!
 vec4 col;
 col.r = coltab[0] + coltab[3] * cos(2 * 3.1416 * (coltab[6] * t + coltab[9]));
 col.g = coltab[1] + coltab[4] * cos(2 * 3.1416 * (coltab[7] * t + coltab[10]));
 col.b = coltab[2] + coltab[5] * cos(2 * 3.1416 * (coltab[8] * t + coltab[11]));
 col.a = 1;
 return col;
}struct Particle
{
 vec4 pos;
 vec4 col;
};layout(binding = 2) buffer dcF { int F [ ]; };
layout(binding = 3) buffer dcU { float U [ ]; };
layout(binding = 4) buffer dcV { float V [ ]; };
layout(binding = 5) buffer dcW { float W [ ]; };
layout( local_size_x = 1000, local_size_y = 1, local_size_z = 1 ) in;
layout( binding = 6) buffer particle { Particle p[]; };float per(float x, int nx) // periodic bnd’s
{
 if(x<0) x = nx-1;
 else 
 if(x>nx-1) x = 0;
 return x;
}float random (vec2 uv)
{
 return fract(sin(dot(uv,vec2(12.9898,78.233)))*43758.5453123);
}void main()
{
 int NX = 200;
 int NY = 70;
 int NZ = 70;
 int idx = int(gl_GlobalInvocationID.x);
 float px = p[idx].pos.x;
 float py = p[idx].pos.y;
 float pz = p[idx].pos.z;
 int i = int(px);
 int j = int(py);
 int k = int(pz);
 int idxF = i+j*NX+k*NX*NY;float dt = 10.0;
 if(F[idxF] != 1)
 {
 px = px + U[idxF] * dt;  // we move particle here
 py = py + V[idxF] * dt;
 pz = pz + W[idxF] * dt;
 }
 else
 {
 py=NY*random(vec2(px,pz));
 pz=NZ*random(vec2(py,px));
 px=0;
 }if(px>NX-1)
 {
 py=NY*random(vec2(px,pz));
 pz=NZ*random(vec2(py,px));
 px=0;
 }px = per(px, NX);
 py = per(py, NY);
 pz = per(pz, NZ);float u = U[idxF];
 float v = V[idxF];
 float w = W[idxF];// colours//vec4 col = color(30.0 * (u * u + v * v));
 //vec4 col = vec4(120.0 * (u * u + v * v+ w * w));
 vec4 col = color(10 * (u * u+ v * v));
 col.a = 4*(u * u+ v * v);
 col.a = pow(log(1+col.a)/log(2),0.9);
 //col.a=0;
 if(F[idxF] == 1)
 {
 col = vec4(0.3,0.3,0.5,0.0);
 }// update particle array on GPU
 p[idx].pos.x = px;
 p[idx].pos.y = py;
 p[idx].pos.z = pz;
 p[idx].col = col;
}— — — — — — — — — — — — — — — — — particles.cs (end)

To make things complete we will now update ofApp.h and ofApp.cpp in order to run particles shader. Let us do this now:

— — — — — — — — — — — — — — — — — ofApp.h
#pragma once
#include “ofMain.h”const int WIDTH = 1280;  // window width
const int HEIGHT = 720;  // window height
const int NX = 200; // mesh size x
const int NY = 70;
const int NZ = 70;
const int NUMP = 550000; // number of particlesclass ofApp : public ofBaseApp
{
 public:
 void setup();
 void update();
 void draw();ofShader shader, shaderParticles;ofBufferObject f0, f1, F, U, V, W;// particles
  struct Particle        // single particle
  {
   ofVec4f pos;
   ofFloatColor col;
  };ofVbo vbo;             // Vertex Buffer Objectvector<Particle> particles;   // array on CPUofBufferObject particlesBuffer; // GPU buffer}; 
— — — — — — — — — — — — — — — — ofApp.h (end) 
— — — — — — — — — — — — — — — — — ofApp.cpp
#include “ofApp.h”
#include “ofConstants.h”const int q = 15;
float f0cpu[NX*NY*NZ*q]={0};
float f1cpu[NX*NY*NZ*q]={0};
int Fcpu[NX*NY*NZ]={0};
float Ucpu[NX*NY*NZ]={0};
float Vcpu[NX*NY*NZ]={0};
float Wcpu[NX*NY*NZ]={0};
const int ex[q] = {0,-1 , 0, 0,-1,-1,-1,-1, 1, 0, 0, 1, 1, 1, 1};
const int ey[q] = {0, 0 ,-1, 0,-1,-1, 1, 1, 0, 1, 0, 1, 1,-1,-1};
const int ez[q] = {0, 0 , 0,-1,-1, 1,-1, 1, 0, 0, 1, 1,-1, 1,-1};
const int inv[q] = {0,8,9,10,11,12,13,14,1,2,3,4,5,6,7};
const float z=2.0/9.0, l=1.0/72.0, s=1.0/9.0;
const float w[q]={z,s,s,s,l,l,l,l,s,s,s,l,l,l,l};
#define C_FLD 1
#define C_BND 0int per(int x, int nx)  // periodic bnd's
{
 if(x<0) x += nx;
  else 
 if(x>nx-1) x -= nx;
 return x;
}void ofApp::setup()
{
 shader.setupShaderFromFile(GL_COMPUTE_SHADER,"computeshader.cs");
 shader.linkProgram();
shaderParticles.setupShaderFromFile(GL_COMPUTE_SHADER,"particles.cs");
 shaderParticles.linkProgram();// initialize density function and fluid flags on CPU
 for(int x=0; x<NX; x++)
 for(int y=0; y<NY; y++)
 for(int z=0; z<NZ; z++)
 {
  int idx = x+y*NX+z*NX*NY;
  for(int k=0; k<q; k++)
   f0cpu[k + q*(idx)] = w[k];
   Fcpu[idx] = 0;  // all sites empty (fluid)
 }
 
 // put some obstacles (spheres)
 srand(time(NULL));
 #define S (NY/4)
 for(int i=0; i<1; i++)  
 {
  int px = NX * (rand()/float(RAND_MAX));
  int py = NY * (rand()/float(RAND_MAX));
  int pz = NZ * (rand()/float(RAND_MAX));
  
  for(int x=px-S; x<px+S; x++)
  for(int y=py-S; y<py+S; y++)
  for(int z=pz-S; z<pz+S; z++)
  if((x-px)*(x-px) + (y-py)*(y-py) + (z-pz)*(z-pz) < S*S)
  { 
   int kx = per(x,NX);
   int ky = per(y,NY);
   int kz = per(z,NZ);
   int idx = kx+ky*NX+kz*NX*NY;
   Fcpu[idx] = 1;
  }
 }// allocate GPU buffers
 f0.allocate(NX*NY*NZ*q*sizeof(float),f0cpu,GL_STATIC_DRAW);
 f1.allocate(NX*NY*NZ*q*sizeof(float),f1cpu,GL_STATIC_DRAW);
 F.allocate(NX*NY*NZ*sizeof(int),Fcpu,GL_STATIC_DRAW);
 U.allocate(NX*NY*NZ*sizeof(float),Ucpu,GL_STATIC_DRAW);
 V.allocate(NX*NY*NZ*sizeof(float),Vcpu,GL_STATIC_DRAW);
 W.allocate(NX*NY*NZ*sizeof(float),Wcpu,GL_STATIC_DRAW);// binding
 f0.bindBase(GL_SHADER_STORAGE_BUFFER, 0);
 f1.bindBase(GL_SHADER_STORAGE_BUFFER, 1);
 F.bindBase(GL_SHADER_STORAGE_BUFFER, 2);
 U.bindBase(GL_SHADER_STORAGE_BUFFER, 3);
 V.bindBase(GL_SHADER_STORAGE_BUFFER, 4);
 W.bindBase(GL_SHADER_STORAGE_BUFFER, 5);// particles, initialize positions on CPU
 particles.resize(NUMP);
 for(auto & p: particles)
 {
  int idx;
  do
  {
   p.pos.x = ofRandom(0,NX);
   p.pos.y = ofRandom(0,NY);
   p.pos.z = ofRandom(0,NZ);
   p.pos.w = 1;
   idx = p.pos.x+p.pos.y*NX+p.pos.z*NX*NY;
  } while(Fcpu[idx] == 1);
 }// allocate on GPU
 particlesBuffer.allocate(particles,GL_DYNAMIC_DRAW);
 vbo.setVertexBuffer(particlesBuffer,4,sizeof(Particle));
vbo.setColorBuffer(particlesBuffer,sizeof(Particle),sizeof(ofVec4f));
 // binding to use in particle compute shader particlesBuffer.bindBase(GL_SHADER_STORAGE_BUFFER, 6);
}void ofApp::update()
{// bindings (ping pong)
 static int c=0;
 f0.bindBase(GL_SHADER_STORAGE_BUFFER, c);
 f1.bindBase(GL_SHADER_STORAGE_BUFFER, 1-c);
 c=1-c;
 
 // run shader (LBM)
 shader.begin();
 shader.dispatchCompute(NX/10, NY/10, NZ/10);
 shader.end();shaderParticles.begin(); // run particles shader
 shaderParticles.dispatchCompute(NUMP/1000, 1, 1);
 shaderParticles.end();
}void ofApp::draw()
{
}
 — — — — — — — — — — — — — — — — — ofApp.cpp (end)

Drawing particles

In order to draw particles that stay on GPU we need to get back to ofApp.cpp and ofApp.h and define 3D world in OpenFrameworks. We will use EasyCam and ofLight objects that gives us everything for free. Drawing particles from VBO (we attached our particle buffer to vbo before) is easy if someone else done that for us and counted all strides and stuff (I mean you have to properly define what is the size of single element and what should be skipped from element to elemtent if drawn from VBO). However, our situation is simple (particles).

So, let us add camera and draw all the particles. Moreover I will add one light and draw obstacles (no-slip nodes) using just a bunch of small cubes (this is quick solution and may be done better and faster, but works for me if number of no-slip nodes is low + I draw those that are neighbours of fluid nodes only).

— — — — — — — — — — — — — — — — — ofApp.h
#pragma once
#include “ofMain.h”const int WIDTH = 1280;  // window width
const int HEIGHT = 720;  // window height
const int NX = 200; // mesh size x
const int NY = 70;
const int NZ = 70;
const int NUMP = 550000; // number of particlesclass ofApp : public ofBaseApp
{
 public:
 void setup();
 void update();
 void draw();ofShader shader, shaderParticles;ofBufferObject f0, f1, F, U, V, W;// particles
  struct Particle        // single particle
  {
   ofVec4f pos;
   ofFloatColor col;
  };ofVbo vbo;             // Vertex Buffer Objectvector<Particle> particles;   // array on CPU
 ofBufferObject particlesBuffer; // GPU buffer// visualization
 ofLight light;
 ofEasyCam cam;}; 
— — — — — — — — — — — — — — — — ofApp.h (end) 
— — — — — — — — — — — — — — — — — ofApp.cpp
#include “ofApp.h”
#include “ofConstants.h”const int q = 15;
float f0cpu[NX*NY*NZ*q]={0};
float f1cpu[NX*NY*NZ*q]={0};
int Fcpu[NX*NY*NZ]={0};
float Ucpu[NX*NY*NZ]={0};
float Vcpu[NX*NY*NZ]={0};
float Wcpu[NX*NY*NZ]={0};
const int ex[q] = {0,-1 , 0, 0,-1,-1,-1,-1, 1, 0, 0, 1, 1, 1, 1};
const int ey[q] = {0, 0 ,-1, 0,-1,-1, 1, 1, 0, 1, 0, 1, 1,-1,-1};
const int ez[q] = {0, 0 , 0,-1,-1, 1,-1, 1, 0, 0, 1, 1,-1, 1,-1};
const int inv[q] = {0,8,9,10,11,12,13,14,1,2,3,4,5,6,7};
const float z=2.0/9.0, l=1.0/72.0, s=1.0/9.0;
const float w[q]={z,s,s,s,l,l,l,l,s,s,s,l,l,l,l};
#define C_FLD 1
#define C_BND 0int per(int x, int nx)  // periodic bnd's
{
 if(x<0) x += nx;
  else 
 if(x>nx-1) x -= nx;
 return x;
}void ofApp::setup()
{
 shader.setupShaderFromFile(GL_COMPUTE_SHADER,"computeshader.cs");
 shader.linkProgram();
shaderParticles.setupShaderFromFile(GL_COMPUTE_SHADER,"particles.cs");
 shaderParticles.linkProgram();// initialize density function and fluid flags on CPU
 for(int x=0; x<NX; x++)
 for(int y=0; y<NY; y++)
 for(int z=0; z<NZ; z++)
 {
  int idx = x+y*NX+z*NX*NY;
  for(int k=0; k<q; k++)
   f0cpu[k + q*(idx)] = w[k];
   Fcpu[idx] = 0;  // all sites empty (fluid)
 }
 
 // put some obstacles (spheres)
 srand(time(NULL));
 #define S (NY/4)
 for(int i=0; i<1; i++)  
 {
  int px = NX * (rand()/float(RAND_MAX));
  int py = NY * (rand()/float(RAND_MAX));
  int pz = NZ * (rand()/float(RAND_MAX));
  
  for(int x=px-S; x<px+S; x++)
  for(int y=py-S; y<py+S; y++)
  for(int z=pz-S; z<pz+S; z++)
  if((x-px)*(x-px) + (y-py)*(y-py) + (z-pz)*(z-pz) < S*S)
  { 
   int kx = per(x,NX);
   int ky = per(y,NY);
   int kz = per(z,NZ);
   int idx = kx+ky*NX+kz*NX*NY;
   Fcpu[idx] = 1;
  }
 }// allocate GPU buffers
 f0.allocate(NX*NY*NZ*q*sizeof(float),f0cpu,GL_STATIC_DRAW);
 f1.allocate(NX*NY*NZ*q*sizeof(float),f1cpu,GL_STATIC_DRAW);
 F.allocate(NX*NY*NZ*sizeof(int),Fcpu,GL_STATIC_DRAW);
 U.allocate(NX*NY*NZ*sizeof(float),Ucpu,GL_STATIC_DRAW);
 V.allocate(NX*NY*NZ*sizeof(float),Vcpu,GL_STATIC_DRAW);
 W.allocate(NX*NY*NZ*sizeof(float),Wcpu,GL_STATIC_DRAW);// binding
 f0.bindBase(GL_SHADER_STORAGE_BUFFER, 0);
 f1.bindBase(GL_SHADER_STORAGE_BUFFER, 1);
 F.bindBase(GL_SHADER_STORAGE_BUFFER, 2);
 U.bindBase(GL_SHADER_STORAGE_BUFFER, 3);
 V.bindBase(GL_SHADER_STORAGE_BUFFER, 4);
 W.bindBase(GL_SHADER_STORAGE_BUFFER, 5);// particles, initialize positions on CPU
 particles.resize(NUMP);
 for(auto & p: particles)
 {
  int idx;
  do
  {
   p.pos.x = ofRandom(0,NX);
   p.pos.y = ofRandom(0,NY);
   p.pos.z = ofRandom(0,NZ);
   p.pos.w = 1;
   idx = p.pos.x+p.pos.y*NX+p.pos.z*NX*NY;
  } while(Fcpu[idx] == 1);
 }// allocate on GPU
 particlesBuffer.allocate(particles,GL_DYNAMIC_DRAW);
 vbo.setVertexBuffer(particlesBuffer,4,sizeof(Particle));
vbo.setColorBuffer(particlesBuffer,sizeof(Particle),sizeof(ofVec4f));
 // binding to use in particle compute shader particlesBuffer.bindBase(GL_SHADER_STORAGE_BUFFER, 6);// properties of our visualization (camera etc.)
 cam.setDrag(0.6);
 cam.setDistance(NX/2);
 ofEnableAntiAliasing();
 ofEnableAlphaBlending();
}void ofApp::update()
{// bindings (ping pong)
 static int c=0;
 f0.bindBase(GL_SHADER_STORAGE_BUFFER, c);
 f1.bindBase(GL_SHADER_STORAGE_BUFFER, 1-c);
 c=1-c;
 
 // run shader (LBM)
 shader.begin();
 shader.dispatchCompute(NX/10, NY/10, NZ/10);
 shader.end();shaderParticles.begin(); // run particles shader
 shaderParticles.dispatchCompute(NUMP/1000, 1, 1);
 shaderParticles.end();
}void ofApp::draw()  
{// here we draw the stuff
 ofSetBackgroundAuto(false);  
 ofClear(ofColor(0,0,0));cam.begin();
 glPointSize(4);
 ofTranslate(-NX/2, -NY/2, -NZ/2);
 ofEnableBlendMode(OF_BLENDMODE_ADD);
 vbo.draw(GL_POINTS,0,particles.size());   // draw particles
 
                                          // draw obstacles
 ofBoxPrimitive box;
 box.set(1.0);
 ofSetColor(ofColor(54,12,1,120));
 for(int x=1; x<NX-1; x++)
 for(int y=1; y<NY-1; y++)
 for(int z=1; z<NZ-1; z++)
 if(Fcpu[x+y*NX+z*NX*NY]==1 && 
  (
   Fcpu[x+1+y*NX+z*NX*NY]==0 ||
   Fcpu[x-1+y*NX+z*NX*NY]==0 ||
   Fcpu[x+(y+1)*NX+z*NX*NY]==0 ||
   Fcpu[x+(y-1)*NX+z*NX*NY]==0 ||
   Fcpu[x+y*NX+(z+1)*NX*NY]==0 ||
   Fcpu[x+y*NX+(z-1)*NX*NY]==0 
   )
  )
 {
  box.setPosition(x,y,z);
  box.draw();
 }ofDisableBlendMode();
 cam.end();}
 — — — — — — — — — — — — — — — — — ofApp.cpp (end)

Interactivity and complete code

Keep in mind this is all realtime. That means user may interact with fluid. I didn’t put much into this, just to make things interesting I give you a hind, simple function in which user can add an obstacle in random position just by pressing any key. The complete code (includeing not mentioned before the main OF file!) is now here:

— — — — — — — — — — — — — — — — — main.cpp#include “ofMain.h”
#include “ofApp.h”int main( )
{
 ofGLWindowSettings settings;
 settings.setGLVersion(4,5);
 #ifdef __linux__ 
 settings.setSize(WIDTH,HEIGHT);
 #elif _WIN32
 settings.width=WIDTH;
 settings.height=HEIGHT;
 #else
 #endif
 ofSetDataPathRoot(“../”);
 ofCreateWindow(settings);
 ofRunApp(new ofApp());
}
— — — — — — — — — — — — — — — — — main.cpp (end)— — — — — — — — — — — — — — — — — ofApp.h
#pragma once
#include “ofMain.h”const int WIDTH = 1280;  // window width
const int HEIGHT = 720;  // window height
const int NX = 200; // mesh size x
const int NY = 70;
const int NZ = 70;
const int NUMP = 550000; // number of particlesclass ofApp : public ofBaseApp
{
 public:
 void setup();
 void update();
 void draw();
 void keyPressed(int key);ofShader shader, shaderParticles;ofBufferObject f0, f1, F, U, V, W;// particles
  struct Particle        // single particle
  {
   ofVec4f pos;
   ofFloatColor col;
  };ofVbo vbo;             // Vertex Buffer Objectvector<Particle> particles;   // array on CPU
 ofBufferObject particlesBuffer; // GPU buffer// visualization
 ofLight light;
 ofEasyCam cam;}; 
— — — — — — — — — — — — — — — — ofApp.h (end) 
— — — — — — — — — — — — — — — — — ofApp.cpp
#include “ofApp.h”
#include “ofConstants.h”const int q = 15;
float f0cpu[NX*NY*NZ*q]={0};
float f1cpu[NX*NY*NZ*q]={0};
int Fcpu[NX*NY*NZ]={0};
float Ucpu[NX*NY*NZ]={0};
float Vcpu[NX*NY*NZ]={0};
float Wcpu[NX*NY*NZ]={0};
const int ex[q] = {0,-1 , 0, 0,-1,-1,-1,-1, 1, 0, 0, 1, 1, 1, 1};
const int ey[q] = {0, 0 ,-1, 0,-1,-1, 1, 1, 0, 1, 0, 1, 1,-1,-1};
const int ez[q] = {0, 0 , 0,-1,-1, 1,-1, 1, 0, 0, 1, 1,-1, 1,-1};
const int inv[q] = {0,8,9,10,11,12,13,14,1,2,3,4,5,6,7};
const float z=2.0/9.0, l=1.0/72.0, s=1.0/9.0;
const float w[q]={z,s,s,s,l,l,l,l,s,s,s,l,l,l,l};
#define C_FLD 1
#define C_BND 0int per(int x, int nx)  // periodic bnd's
{
 if(x<0) x += nx;
  else 
 if(x>nx-1) x -= nx;
 return x;
}void ofApp::setup()
{
 shader.setupShaderFromFile(GL_COMPUTE_SHADER,"computeshader.cs");
 shader.linkProgram();
shaderParticles.setupShaderFromFile(GL_COMPUTE_SHADER,"particles.cs");
 shaderParticles.linkProgram();// initialize density function and fluid flags on CPU
 for(int x=0; x<NX; x++)
 for(int y=0; y<NY; y++)
 for(int z=0; z<NZ; z++)
 {
  int idx = x+y*NX+z*NX*NY;
  for(int k=0; k<q; k++)
   f0cpu[k + q*(idx)] = w[k];
   Fcpu[idx] = 0;  // all sites empty (fluid)
 }
 
 // put some obstacles (spheres)
 srand(time(NULL));
 #define S (NY/4)
 for(int i=0; i<1; i++)  
 {
  int px = NX * (rand()/float(RAND_MAX));
  int py = NY * (rand()/float(RAND_MAX));
  int pz = NZ * (rand()/float(RAND_MAX));
  
  for(int x=px-S; x<px+S; x++)
  for(int y=py-S; y<py+S; y++)
  for(int z=pz-S; z<pz+S; z++)
  if((x-px)*(x-px) + (y-py)*(y-py) + (z-pz)*(z-pz) < S*S)
  { 
   int kx = per(x,NX);
   int ky = per(y,NY);
   int kz = per(z,NZ);
   int idx = kx+ky*NX+kz*NX*NY;
   Fcpu[idx] = 1;
  }
 }// allocate GPU buffers
 f0.allocate(NX*NY*NZ*q*sizeof(float),f0cpu,GL_STATIC_DRAW);
 f1.allocate(NX*NY*NZ*q*sizeof(float),f1cpu,GL_STATIC_DRAW);
 F.allocate(NX*NY*NZ*sizeof(int),Fcpu,GL_STATIC_DRAW);
 U.allocate(NX*NY*NZ*sizeof(float),Ucpu,GL_STATIC_DRAW);
 V.allocate(NX*NY*NZ*sizeof(float),Vcpu,GL_STATIC_DRAW);
 W.allocate(NX*NY*NZ*sizeof(float),Wcpu,GL_STATIC_DRAW);// binding
 f0.bindBase(GL_SHADER_STORAGE_BUFFER, 0);
 f1.bindBase(GL_SHADER_STORAGE_BUFFER, 1);
 F.bindBase(GL_SHADER_STORAGE_BUFFER, 2);
 U.bindBase(GL_SHADER_STORAGE_BUFFER, 3);
 V.bindBase(GL_SHADER_STORAGE_BUFFER, 4);
 W.bindBase(GL_SHADER_STORAGE_BUFFER, 5);// particles, initialize positions on CPU
 particles.resize(NUMP);
 for(auto & p: particles)
 {
  int idx;
  do
  {
   p.pos.x = ofRandom(0,NX);
   p.pos.y = ofRandom(0,NY);
   p.pos.z = ofRandom(0,NZ);
   p.pos.w = 1;
   idx = p.pos.x+p.pos.y*NX+p.pos.z*NX*NY;
  } while(Fcpu[idx] == 1);
 }// allocate on GPU
 particlesBuffer.allocate(particles,GL_DYNAMIC_DRAW);
 vbo.setVertexBuffer(particlesBuffer,4,sizeof(Particle));
vbo.setColorBuffer(particlesBuffer,sizeof(Particle),sizeof(ofVec4f));
 // binding to use in particle compute shader particlesBuffer.bindBase(GL_SHADER_STORAGE_BUFFER, 6);// properties of our visualization (camera etc.)
 cam.setDrag(0.6);
 cam.setDistance(NX/2);
 ofEnableAntiAliasing();
 ofEnableAlphaBlending();
}void ofApp::update()
{// bindings (ping pong)
 static int c=0;
 f0.bindBase(GL_SHADER_STORAGE_BUFFER, c);
 f1.bindBase(GL_SHADER_STORAGE_BUFFER, 1-c);
 c=1-c;
 
 // run shader (LBM)
 shader.begin();
 shader.dispatchCompute(NX/10, NY/10, NZ/10);
 shader.end();shaderParticles.begin(); // run particles shader
 shaderParticles.dispatchCompute(NUMP/1000, 1, 1);
 shaderParticles.end();
}void ofApp::draw()  
{// here we draw the stuff
 ofSetBackgroundAuto(false);  
 ofClear(ofColor(0,0,0));cam.begin();
 glPointSize(4);
 ofTranslate(-NX/2, -NY/2, -NZ/2);
 ofEnableBlendMode(OF_BLENDMODE_ADD);
 vbo.draw(GL_POINTS,0,particles.size());   // draw particles
 
                                          // draw obstacles
 ofBoxPrimitive box;
 box.set(1.0);
 ofSetColor(ofColor(54,12,1,120));
 for(int x=1; x<NX-1; x++)
 for(int y=1; y<NY-1; y++)
 for(int z=1; z<NZ-1; z++)
 if(Fcpu[x+y*NX+z*NX*NY]==1 && 
  (
   Fcpu[x+1+y*NX+z*NX*NY]==0 ||
   Fcpu[x-1+y*NX+z*NX*NY]==0 ||
   Fcpu[x+(y+1)*NX+z*NX*NY]==0 ||
   Fcpu[x+(y-1)*NX+z*NX*NY]==0 ||
   Fcpu[x+y*NX+(z+1)*NX*NY]==0 ||
   Fcpu[x+y*NX+(z-1)*NX*NY]==0 
   )
  )
 {
  box.setPosition(x,y,z);
  box.draw();
 }ofDisableBlendMode();
 cam.end();}void ofApp::keyPressed(int key)
{  
 cout<<key<<endl;
 #define S (NY/4)
 //for(int i=0; i<75; i++)
 //{
  int px = NX * (rand()/float(RAND_MAX));
  int py = NY * (rand()/float(RAND_MAX));
  int pz = NZ * (rand()/float(RAND_MAX));
  
  for(int x=px-2*S; x<px+2*S; x++)
  for(int y=py-2*S; y<py+2*S; y++)
  for(int z=pz-2*S; z<pz+2*S; z++)
  if((x-px)*(x-px) + (y-py)*(y-py) + (z-pz)*(z-pz) < S*S)
  { 
   int kx = per(x,NX);
   int ky = per(y,NY);
   int kz = per(z,NZ);
   int idx = kx+ky*NX+kz*NX*NY;
   Fcpu[idx] = 1;
  }
 //}
 F.setData(NX*NY*NZ*sizeof(int),Fcpu,GL_DYNAMIC_DRAW);
}
 — — — — — — — — — — — — — — — — — ofApp.cpp (end)— — — — — — — — — — — — — — — — — particles.cs
# version 440#define C_FLD 0
#define C_BND 1// palette
vec4 color(float t)
{
 float coltab[] = {0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0, 0.00, 0.10, 0.20};//!!
 vec4 col;
 col.r = coltab[0] + coltab[3] * cos(2 * 3.1416 * (coltab[6] * t + coltab[9]));
 col.g = coltab[1] + coltab[4] * cos(2 * 3.1416 * (coltab[7] * t + coltab[10]));
 col.b = coltab[2] + coltab[5] * cos(2 * 3.1416 * (coltab[8] * t + coltab[11]));
 col.a = 1;
 return col;
}struct Particle
{
 vec4 pos;
 vec4 col;
};layout(binding = 2) buffer dcF { int F [ ]; };
layout(binding = 3) buffer dcU { float U [ ]; };
layout(binding = 4) buffer dcV { float V [ ]; };
layout(binding = 5) buffer dcW { float W [ ]; };
layout( local_size_x = 1000, local_size_y = 1, local_size_z = 1 ) in;
layout( binding = 6) buffer particle { Particle p[]; };float per(float x, int nx) // periodic bnd’s
{
 if(x<0) x = nx-1;
 else 
 if(x>nx-1) x = 0;
 return x;
}float random (vec2 uv)
{
 return fract(sin(dot(uv,vec2(12.9898,78.233)))*43758.5453123);
}void main()
{
 int NX = 200;
 int NY = 70;
 int NZ = 70;
 int idx = int(gl_GlobalInvocationID.x);
 float px = p[idx].pos.x;
 float py = p[idx].pos.y;
 float pz = p[idx].pos.z;
 int i = int(px);
 int j = int(py);
 int k = int(pz);
 int idxF = i+j*NX+k*NX*NY;float dt = 10.0;
 if(F[idxF] != 1)
 {
 px = px + U[idxF] * dt;  // we move particle here
 py = py + V[idxF] * dt;
 pz = pz + W[idxF] * dt;
 }
 else
 {
 py=NY*random(vec2(px,pz));
 pz=NZ*random(vec2(py,px));
 px=0;
 }if(px>NX-1)
 {
 py=NY*random(vec2(px,pz));
 pz=NZ*random(vec2(py,px));
 px=0;
 }px = per(px, NX);
 py = per(py, NY);
 pz = per(pz, NZ);float u = U[idxF];
 float v = V[idxF];
 float w = W[idxF];// colours//vec4 col = color(30.0 * (u * u + v * v));
 //vec4 col = vec4(120.0 * (u * u + v * v+ w * w));
 vec4 col = color(10 * (u * u+ v * v));
 col.a = 4*(u * u+ v * v);
 col.a = pow(log(1+col.a)/log(2),0.9);
 //col.a=0;
 if(F[idxF] == 1)
 {
 col = vec4(0.3,0.3,0.5,0.0);
 }// update particle array on GPU
 p[idx].pos.x = px;
 p[idx].pos.y = py;
 p[idx].pos.z = pz;
 p[idx].col = col;
}— — — — — — — — — — — — — — — — — particles.cs (end)— — — — — — — — — — — — — — — — — computeshader.cs
#version 440
#define tau 0.91//0.6
#define omega (1.0/tau) // viscosity, etc.
const int q = 15;
const int ex[q] = {0,-1 , 0, 0,-1,-1,-1,-1, 1, 0, 0, 1, 1, 1, 1};
const int ey[q] = {0, 0 ,-1, 0,-1,-1, 1, 1, 0, 1, 0, 1, 1,-1,-1};
const int ez[q] = {0, 0 , 0,-1,-1, 1,-1, 1, 0, 0, 1, 1,-1, 1,-1};
const int inv[q] = {0,8,9,10,11,12,13,14,1,2,3,4,5,6,7};
const float z=2.0/9.0, l=1.0/72.0, s=1.0/9.0;
const float w[q]={z,s,s,s,l,l,l,l,s,s,s,l,l,l,l};
#define C_FLD 0
#define C_BND 1layout( binding=0 ) buffer df0 { float f0[  ]; };
layout( binding=1 ) buffer df1 { float f1[  ]; };
layout( binding=2 ) buffer dcF { int F[  ]; };
layout( binding=3 ) buffer dcU { float U[  ]; };
layout( binding=4 ) buffer dcV { float V[  ]; };
layout( binding=5 ) buffer dcW { float W[  ]; };
layout( local_size_x = 10, local_size_y = 10, local_size_z = 10 ) in;void main() 
{
 int NX=200;
 int NY=70;
 int NZ=70;
 float devFx = 0.0002;
 float devFy = 0.0;
 float devFz = 0.0;int i = int(gl_GlobalInvocationID.x);
 int j = int(gl_GlobalInvocationID.y);
 int k = int(gl_GlobalInvocationID.z);
 int idx = i+j*NX+k*NX*NY;if( F[ idx ] == C_FLD )  // if the node is fluid
 { 
  for(int m=0; m<q; m++)   // calculate density and velocity loop
  {
   rho = rho + f0[idx*q+m];
   ux  = ux + f0[idx*q+m]*ex[m];
   uy  = uy + f0[idx*q+m]*ey[m];
   uz = uz + f0[idx*q+m]*ez[m];
  }
  ux /= rho;
  uy /= rho;
  uz /= rho;
  U[ idx ] = ux;  // store velocity
  V[ idx ] = uy;
  W[ idx ] = uz;
  ux = ux + 0.5 * devFx; // update velocity with force
  uy = uy + 0.5 * devFy;
  uz = uz + 0.5 * devFz;
  for(int m=0; m<q; m++)  // collision + streaming  
  {                       // (the main solver is here, really)
   int ip = i+ex[m];
   int jp = j+ey[m];
   int kp = k+ez[m];ip=per(ip,NX-1); // periodic neighbour...
   jp=per(jp,NY-1);
   kp=per(kp,NZ-1);int idxp = ip+jp*NX+kp*NX*NY; //... and his indexif( F[ idxp ] == C_BND ) // if outcomming is boundary node
    f1[ idx*q + inv[m] ]    = (1-omega) * f0[idx*q+m] + omega * w[m]    
    * rho * (1.0 - (3.0/2.0) * (ux*ux + uy*uy * uz*uz)+ 3.0 * (ex[m]  
    * ux + ey[m]*uy + ez[m]*uz)+ (9.0/2.0) * (ex[m] * ux + ey[m]*uy 
    + ez[m]*uz) * (ex[m] * ux + ey[m]*uy + ez[m]*uz));
   else
    f1[ idxp*q + m ] = (1-omega) * f0[idx*q+m] + omega * w[m] * rho 
    * (1.0 - (3.0/2.0) * (ux*ux + uy*uy * uz*uz)+ 3.0 * (ex[m] * ux 
    + ey[m]*uy + ez[m]*uz)+ (9.0/2.0) * (ex[m] * ux + ey[m]*uy + 
    ez[m]*uz) * (ex[m] * ux + ey[m]*uy + ez[m]*uz));
  }
 }}
— — — — — — — — — — — — — — — — — computeshader.cs (end)

Done, and the result is here (realtime capture!).
Sorry for the music, I was testing YouTube shorts application here ;)

This was done for LBM in Kraków 2022, Have fun!

Maciej Matyka, 2022–02–03

GPU Compute Shaders part 2 — fluids

Intro

Starting from scratch

Written by Maciej Matyka