• XSS.stack #1 – the first literary journal from the forum's users

spectre inside cuda shared memory pool?

drpalpatine

(L3) cache
User
Registration: 04.08.2021
Messages: 260
Solutions: 1
Reactions: 108
Guarantor deals: 2
Deposit: 0.0001
CUDA programmers on this forum --> I am trying to port Spectre into CUDA kernel shared memory

the problem is how retpoline affects the accuracy here
--> also, for example, whether the secret value is absent or present in the cache --> the timing measurement changes? but how? I cannot understand

--> how do you know that timing difference < threshold_value == success or failure? --> and how often does it happen that noise in the timing measurements or contention for shared resources ruins it?


can one use cudaOccupancyMaxPotentialBlockSize and nvprof? what settings?


C:
unsigned char secret = 0;
unsigned char target[256];
unsigned char table[256 * 4096];

__global__ void Spectre_attack(unsigned char* target, unsigned char* table, int offset)
{
    // block-local copy of the leaked byte (shadows the host-side global)
    __shared__ unsigned char secret;

    if (threadIdx.x == 0) {
        secret = target[offset];
    }
    __syncthreads();

    // probe: lines are 4096 bytes apart, one line per possible secret value
    unsigned char value = table[secret * 4096 + threadIdx.x];
    // TODO: continue Spectre on GPU -- parallel probing + shared memory????
    (void)value;  // keep the probe access from being optimized away
}

#include <x86intrin.h>  // _mm_clflush, __rdtscp (x86 host side)

int main()
{
    //  init --> target buffer + table....
    //  NOTE: target/table must actually live in device memory
    //  (cudaMalloc + cudaMemcpy) before the kernel can touch them
    int offset = 0;                      // which byte of target to leak
    unsigned long long threshold = 150;  // cycles; must be calibrated per machine

    _mm_clflush(&secret);

    // gpu kernel launch params
    dim3 grid_dim(1, 1, 1);
    dim3 block_dim(256, 1, 1);

    // run --> measure timing difference
    for (int i = 0; i < 1000; i++) {
        Spectre_attack<<<grid_dim, block_dim>>>(target, table, offset);
        cudaDeviceSynchronize();
        unsigned int junk;
        unsigned long long start_time = __rdtscp(&junk);
        // read --> probe the line for the guessed secret value
        unsigned long long end_time = __rdtscp(&junk);
        if (end_time - start_time < threshold) {
            // fast access --> that secret value was cached
        }
    }

    return 0;
}
 
Last edited:
can one use cudaOccupancyMaxPotentialBlockSize and nvprof? what settings?
C:
cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, my_kernel, 0, array_size);

how to select the correct block size and other parameters? --> the idea is to get the max threads/block without kernel launch failures (already x10000 failures((( and other nonsense
 
now --> let's talk about optimizing and making the Spectre attack (or any timing side-channel attack) succeed on a CUDA GPU

at the end you can see --> a large time window + noise != Spectre
I learned something today -->
there is a nice tool --> the occupancy calculator in the CUDA toolkit (deprecated; there is the new Nsight Compute --> we will discuss it next)
docs.nvidia.com/cuda/cuda-occupancy-calculator/
access it through --> the nvcc compiler or the CUDA Visual Profiler (nvvp)

--> compile the kernel with --ptxas-options=-v
Code:
nvcc -arch=sm_86 --ptxas-options=-v kernel4.cu -o kernel4
--> the output I get from the compiler --> number of registers + shared memory used by the kernel
--> run the occupancy calculator on the PTX assembly code
Code:
nv-nsight-cu-cli --query-metrics -k kernel4 --ptx-file kernel4.ptx
it will print info about the kernel's max potential occupancy for different block sizes + other configs)))

Code:
    Block size: 32 Occupancy: 0.833 Registers/thread: 32 Shared memory/thread: 0 B Block shared memory: 0 B
    Block size: 64 Occupancy: 0.833 Registers/thread: 32 Shared memory/thread: 0 B Block shared memory: 0 B
    Block size: 96 Occupancy: 0.805 Registers/thread: 32 Shared memory/thread: 0 B Block shared memory: 0 B
    Block size: 128 Occupancy: 0.768 Registers/thread: 32 Shared memory/thread: 0 B Block shared memory: 0 B
    Block size: 160 Occupancy: 0.744 Registers/thread: 32 Shared memory/thread: 0 B Block shared memory: 0 B
    Block size: 192 Occupancy: 0.722 Registers/thread: 32 Shared memory/thread: 0 B Block shared memory: 0 B
    Block size: 224 Occupancy: 0.704 Registers/thread: 32 Shared memory/thread: 0 B Block shared memory: 0 B
    Block size: 256 Occupancy: 0.689 Registers/thread: 32 Shared memory/thread: 0 B Block shared memory: 0 B
 
Last edited:
first, to optimize our attack --> we need to optimize the CUDA kernel
several ideas --> thread divergence --> different threads in a single block execute different code paths --> inefficient resource usage --> this bastard is the reason we need a correct block size, one that is a multiple of the warp size (normally 32 threads on NVIDIA GPUs)

another important thing --> occupancy
what is that?
the ratio of active warps to the max number of warps that can execute simultaneously on the GPU
more active warps --> better latency hiding --> good performance

but there is a bastard that caused me problems --> why not increase the block size even more for best attack performance? --> kernel failure
why? --> limited resources like shared memory and registers
so we need a nice balance for the best block size --> cudaOccupancyMaxPotentialBlockSize

we use Nsight Compute --> developer.nvidia.com/nsight-compute
--> first, profile the full kernel
Code:
ncu --set full --kernel kernel4 app4
it will generate metrics about the kernel like these:

Code:
==PROF== Profiling application: app4
==PROF== Profiling command line: app4
==PROF== Generating CUDA Unified Memory profiling result
==PROF== Generating CUDA Kernel Statistics profiling result
==PROF== Generating CUDA Memory Transactions profiling result
==PROF== Generating CUDA Trace API profiling result
==PROF== Generating CUDA Kernel Metrics profiling result
==PROF== Generating CUDA Event Metrics profiling result
==PROF== Generating CUDA CUPTI API Trace profiling result
==PROF== Generating CUDA CUPTI API Trace Correlation profiling result
==PROF== Generating CUDA CUPTI Activity Trace profiling result
==PROF== Generating CUDA CUPTI Activity Trace Correlation profiling result
==PROF== Generating CUDA CUPTI GPU Trace profiling result
==PROF== Generating CUDA CUPTI GPU Trace Correlation profiling result
==PROF== Generating CUDA CUPTI Instantaneous Event Metrics profiling result
==PROF== Generating CUDA CUPTI OpenACC Metrics profiling result
==PROF== Generating CUDA CUPTI OpenACC Trace profiling result
==PROF== Generating CUDA CUPTI OpenACC Trace Correlation profiling result
==PROF== Generating CUDA CUPTI PC Sampling profiling result
==PROF== Generating CUDA CUPTI PC Sampling Correlation profiling result
==PROF== Generating CUDA CUPTI Metrics profiling result
==PROF== Generating CUDA CUPTI Metrics Correlation profiling result
==PROF== Generating CUDA Profiling API Trace profiling result
==PROF== Generating CUDA Profiling API Trace Correlation profiling result
==PROF== Generating CUDA Application Timeline Trace profiling result
==PROF== Generating CUDA Stream API Trace profiling result
==PROF== Generating CUDA Stream API Trace Correlation profiling result
==PROF== Generating CUDA nvJPEG profiling result
==PROF== Generating CUDA NVTX profiling result
==PROF== Generating CUDA Hardware Counters profiling result
==PROF== Generating CUDA Ipc profiling result
==PROF== Generating CUDA Callbacks profiling result
==PROF== Generating CUDA Resource Information profiling result
==PROF== Generating CUDA Event Tracing profiling result
==PROF== Generating CUDA CDP profiling result
==PROF== Generating CUDA I/O profiling result
==PROF== Generating CUDA NvSci profiling result
==PROF== Generating CUDA nvToolsExt profiling result
==PROF== Generating CUDA RAPIDS profiling result
==PROF== Generating CUDA DCGM profiling result
==PROF== Generating CUDA Runtime API Trace profiling result
==PROF== Generating CUDA Runtime API Trace Correlation profiling result
==PROF== Generating CUDA Visual Profiler profiling result
==PROF== Profiling result saved to ./app4.ncu-rep


--> let's take simple Nsight Systems --> and profile a CUDA kernel
a simple kernel --> it applies sine and cosine to each element of an input array --> we launch the kernel with a single block of 256 threads
C:
__global__ void kernel4(float* input, float* output, int size)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) {
        output[idx] = sinf(input[idx]) * cosf(input[idx]);  // single-precision variants
    }
}

int main()
{
    const int array_size = 1024;
    float host_input[1024];
    for (int i = 0; i < array_size; i++) host_input[i] = (float)i;

    float* input_array;
    float* output_array;
    cudaMalloc(&input_array, array_size * sizeof(float));
    cudaMalloc(&output_array, array_size * sizeof(float));

    // the host-to-device copy that shows up in the profile
    cudaMemcpy(input_array, host_input, array_size * sizeof(float), cudaMemcpyHostToDevice);

    // one block of 256 threads --> only the first 256 elements are touched;
    // the occupancy-based launch below fixes the grid size
    kernel4<<<1, 256>>>(input_array, output_array, array_size);
    cudaDeviceSynchronize();

    cudaFree(input_array);
    cudaFree(output_array);
    return 0;
}



--> next we get the max potential block size for our kernel
C:
int minGridSize, blockSize;
// last two args: dynamic shared memory per block (0) and an upper limit on the block size
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, kernel4, 0, array_size);

the function returns --> the max block size that can be used for this kernel on our device --> and sets the min grid size accordingly



--> we launch our kernel
C:
kernel4<<<(array_size + blockSize - 1) / blockSize, blockSize>>>(input_array, output_array, array_size);
 
finally --> the results

Code:
-------------------------- NVTX ---------------------------
Time(%)      Time     Calls       Avg       Min       Max  Name
--------------------------------------------------------------
  98.06%  65.210us         1  65.210us  65.210us  65.210us  [CUDA memcpy HtoD]
   1.94%  1.2920us         1  1.2920us  1.2920us  1.2920us  kernel4


we get the % of time spent in each function + total time + number of calls + average/min/max time per func))

we see 98.06% of the time --> goes to the CUDA memory copy operation (specifically the host-to-device copy)
total time in this operation = 65.210 microseconds

the remaining time (1.94%) is in another function = 1.2920 microseconds per call



finally, our analysis shows --> the memory copy from host to device is the performance bastard
 
now we will improve the performance
since cudaMemcpy() is our bastard, man
idea 1 --> async memory transfer --> cudaMemcpyAsync()
idea 2 --> pinned memory --> it can improve HtoD transfer performance by reducing overhead in the memory transfer
idea 3 --> increase threads/block, but that is dangerous --> another kernel failure will fly into our ass --> so we will not try such a revolution for now

C:
    const int array_size = 1024;
    float *h_input, *h_output;   // pinned host buffers
    float *d_input, *d_output;   // device buffers

    // alloc --> pinned memory (cudaMallocHost takes only pointer + size;
    // use cudaHostAlloc if you need flags like cudaHostAllocDefault)
    cudaMallocHost(&h_input, array_size * sizeof(float));
    cudaMallocHost(&h_output, array_size * sizeof(float));
    cudaMalloc(&d_input, array_size * sizeof(float));
    cudaMalloc(&d_output, array_size * sizeof(float));

    for (int i = 0; i < array_size; i++) {
        h_input[i] = i;
    }

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    // cuda stream --> transfer HtoD async (pinned source, device destination)
    cudaMemcpyAsync(d_input, h_input, array_size * sizeof(float), cudaMemcpyHostToDevice, stream);

    kernel5<<<1, 256, 0, stream>>>(d_input, d_output, array_size);

    cudaMemcpyAsync(h_output, d_output, array_size * sizeof(float), cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_input);
    cudaFree(d_output);
    cudaFreeHost(h_input);
    cudaFreeHost(h_output);
}


Code:
48.02% 6.6100us 1 6.6100us 6.6100us 6.6100us [CUDA memcpy HtoD]
33.99% 4.6800us 1 4.6800us 4.6800us 4.6800us [CUDA memcpy DtoH]
17.99% 2.4800us 1 2.4800us 2.4800us 2.4800us kernel5

total time --> was 65.210 microseconds --> now 13.77))))
 
cool, but why is such optimization important? --> garbage?

no, not at all --> very important
--> noise --> it is the bastard for timing side-channel attacks (Spectre attack) --> you can only extract the sensitive info when the noise stays below threshold_value --> only then can you extract sensitive data accurately + reliably


in short --> GPU exploitation --> we saw how pinned memory (avoids the overhead of copying data to/from pageable memory) + async transfer (we execute instructions even while data transfers between CPU <--> GPU) help

the time window + noise decreased by ~5 times for our sine + cosine on a vector computation))

i will cover more cool techniques in GPU kernels in more articles
cache attack --> other than the timing side channel --> fill the cache with specific bytes --> measure the time taken for the GPU to access them
instruction-level speculation, etc.
we can combine such techniques with simple memory compression before transfer to the GPU --> but carefully balanced so it does not increase the time

EDIT --> sorry for the bad formatting --> I did not plan to write this in article format
 
Last edited:

