• XSS.stack #1 – the first literary journal from the forum's users

spectre inside cuda shared memory pool?

drpalpatine

(L3) cache
User
Registration: 04.08.2021
Messages: 260
Solutions: 1
Reactions: 108
Guarantor deals: 2
Deposit: 0.0001
CUDA programmers on this forum --> I am trying to port Spectre into CUDA kernel shared memory

the problem is how retpoline affects the accuracy here
--> also, for example, whether the secret value is absent or present in the cache --> the timing measurement changes? but how? I cannot understand

--> how do you know that timing difference < threshold_value == success or failure? --> and how often does it happen that noise in the timing measurements or contention for shared resources ruins it?


can one use cudaOccupancyMaxPotentialBlockSize and nvprof? what settings?


C:
unsigned char secret = 0;
unsigned char target[256];
unsigned char table[256 * 4096];

__global__ void Spectre_attack(unsigned char* target, unsigned char* table, int offset)
{
    // block-local copy of the leaked byte (shadows the host-side global)
    __shared__ unsigned char secret;

    if (threadIdx.x == 0) {
        secret = target[offset];
    }
    __syncthreads();

    // probe: lines are 4096 bytes apart, one line per possible secret value
    unsigned char value = table[secret * 4096 + threadIdx.x];
    // TODO: continue Spectre on GPU -- parallel probing + shared memory????
    (void)value;  // keep the probe access from being optimized away
}

#include <x86intrin.h>  // _mm_clflush, __rdtscp (x86 host side)

int main()
{
    //  init --> target buffer + table....
    //  NOTE: target/table must actually live in device memory
    //  (cudaMalloc + cudaMemcpy) before the kernel can touch them
    int offset = 0;                      // which byte of target to leak
    unsigned long long threshold = 150;  // cycles; must be calibrated per machine

    _mm_clflush(&secret);

    // gpu kernel launch params
    dim3 grid_dim(1, 1, 1);
    dim3 block_dim(256, 1, 1);

    // run --> measure timing difference
    for (int i = 0; i < 1000; i++) {
        Spectre_attack<<<grid_dim, block_dim>>>(target, table, offset);
        cudaDeviceSynchronize();
        unsigned int junk;
        unsigned long long start_time = __rdtscp(&junk);
        // read --> probe the line for the guessed secret value
        unsigned long long end_time = __rdtscp(&junk);
        if (end_time - start_time < threshold) {
            // fast access --> that secret value was cached
        }
    }

    return 0;
}
 
Last edited:
can one use cudaOccupancyMaxPotentialBlockSize and nvprof? what settings?
C:
cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, my_kernel, 0, array_size);

how to select the correct block size and other parameters? --> the idea is to get the max threads/block without kernel launch failures (already x10000 failures((( and other nonsense
 
now --> let's talk about optimizing and making the Spectre attack (or any timing side-channel attack) succeed on a CUDA GPU

at the end you can see --> a large time window + noise != Spectre
I learned something today -->
there is a nice tool --> the occupancy calculator in the CUDA toolkit (deprecated; there is the new Nsight Compute --> we will discuss it next)
docs.nvidia.com/cuda/cuda-occupancy-calculator/
access it through --> the nvcc compiler or the CUDA Visual Profiler (nvvp)

--> compile the kernel with --ptxas-options=-v
Code:
nvcc -arch=sm_86 --ptxas-options=-v kernel4.cu -o kernel4
--> the output I get from the compiler --> number of registers + shared memory used by the kernel
--> run the occupancy calculator on the PTX assembly code
Code:
nv-nsight-cu-cli --query-metrics -k kernel4 --ptx-file kernel4.ptx
it will print info about the kernel's max potential occupancy for different block sizes + other configs)))

Code:
    Block size: 32 Occupancy: 0.833 Registers/thread: 32 Shared memory/thread: 0 B Block shared memory: 0 B
    Block size: 64 Occupancy: 0.833 Registers/thread: 32 Shared memory/thread: 0 B Block shared memory: 0 B
    Block size: 96 Occupancy: 0.805 Registers/thread: 32 Shared memory/thread: 0 B Block shared memory: 0 B
    Block size: 128 Occupancy: 0.768 Registers/thread: 32 Shared memory/thread: 0 B Block shared memory: 0 B
    Block size: 160 Occupancy: 0.744 Registers/thread: 32 Shared memory/thread: 0 B Block shared memory: 0 B
    Block size: 192 Occupancy: 0.722 Registers/thread: 32 Shared memory/thread: 0 B Block shared memory: 0 B
    Block size: 224 Occupancy: 0.704 Registers/thread: 32 Shared memory/thread: 0 B Block shared memory: 0 B
    Block size: 256 Occupancy: 0.689 Registers/thread: 32 Shared memory/thread: 0 B Block shared memory: 0 B
 
Last edited:
first, to optimize our attack --> we need to optimize the CUDA kernel
several ideas --> thread divergence --> different threads in a single block execute different code paths --> inefficient resource usage --> this bastard is the reason we need a correct block size, one that is a multiple of the warp size (normally 32 threads on NVIDIA GPUs)

another important thing --> occupancy
what is that?
the ratio of active warps to the max number of warps that can execute simultaneously on the GPU
more active warps --> better latency hiding --> good performance

but there is a bastard that caused me problems --> why not increase the block size even more for best attack performance? --> kernel failure
why? --> limited resources like shared memory and registers
so we need a nice balance for the best block size --> cudaOccupancyMaxPotentialBlockSize

we use Nsight Compute --> developer.nvidia.com/nsight-compute
--> first, profile the full kernel
Code:
ncu --set full --kernel kernel4 app4
it will generate metrics about the kernel like these:

Code:
==PROF== Profiling application: app4
==PROF== Profiling command line: app4
==PROF== Generating CUDA Unified Memory profiling result
==PROF== Generating CUDA Kernel Statistics profiling result
==PROF== Generating CUDA Memory Transactions profiling result
==PROF== Generating CUDA Trace API profiling result
==PROF== Generating CUDA Kernel Metrics profiling result
==PROF== Generating CUDA Event Metrics profiling result
==PROF== Generating CUDA CUPTI API Trace profiling result
==PROF== Generating CUDA CUPTI API Trace Correlation profiling result
==PROF== Generating CUDA CUPTI Activity Trace profiling result
==PROF== Generating CUDA CUPTI Activity Trace Correlation profiling result
==PROF== Generating CUDA CUPTI GPU Trace profiling result
==PROF== Generating CUDA CUPTI GPU Trace Correlation profiling result
==PROF== Generating CUDA CUPTI Instantaneous Event Metrics profiling result
==PROF== Generating CUDA CUPTI OpenACC Metrics profiling result
==PROF== Generating CUDA CUPTI OpenACC Trace profiling result
==PROF== Generating CUDA CUPTI OpenACC Trace Correlation profiling result
==PROF== Generating CUDA CUPTI PC Sampling profiling result
==PROF== Generating CUDA CUPTI PC Sampling Correlation profiling result
==PROF== Generating CUDA CUPTI Metrics profiling result
==PROF== Generating CUDA CUPTI Metrics Correlation profiling result
==PROF== Generating CUDA Profiling API Trace profiling result
==PROF== Generating CUDA Profiling API Trace Correlation profiling result
==PROF== Generating CUDA Application Timeline Trace profiling result
==PROF== Generating CUDA Stream API Trace profiling result
==PROF== Generating CUDA Stream API Trace Correlation profiling result
==PROF== Generating CUDA nvJPEG profiling result
==PROF== Generating CUDA NVTX profiling result
==PROF== Generating CUDA Hardware Counters profiling result
==PROF== Generating CUDA Ipc profiling result
==PROF== Generating CUDA Callbacks profiling result
==PROF== Generating CUDA Resource Information profiling result
==PROF== Generating CUDA Event Tracing profiling result
==PROF== Generating CUDA CDP profiling result
==PROF== Generating CUDA I/O profiling result
==PROF== Generating CUDA NvSci profiling result
==PROF== Generating CUDA nvToolsExt profiling result
==PROF== Generating CUDA RAPIDS profiling result
==PROF== Generating CUDA DCGM profiling result
==PROF== Generating CUDA Runtime API Trace profiling result
==PROF== Generating CUDA Runtime API Trace Correlation profiling result
==PROF== Generating CUDA Visual Profiler profiling result
==PROF== Profiling result saved to ./app4.ncu-rep


--> let's take simple Nsight Systems --> and profile a CUDA kernel
a simple kernel --> it applies sine and cosine to each element of an input array --> we launch the kernel with a single block of 256 threads
C:
__global__ void kernel4(float* input, float* output, int size)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) {
        output[idx] = sinf(input[idx]) * cosf(input[idx]);  // single-precision variants
    }
}

int main()
{
    const int array_size = 1024;
    float host_input[1024];
    for (int i = 0; i < array_size; i++) host_input[i] = (float)i;

    float* input_array;
    float* output_array;
    cudaMalloc(&input_array, array_size * sizeof(float));
    cudaMalloc(&output_array, array_size * sizeof(float));

    // the host-to-device copy that shows up in the profile
    cudaMemcpy(input_array, host_input, array_size * sizeof(float), cudaMemcpyHostToDevice);

    // one block of 256 threads --> only the first 256 elements are touched;
    // the occupancy-based launch below fixes the grid size
    kernel4<<<1, 256>>>(input_array, output_array, array_size);
    cudaDeviceSynchronize();

    cudaFree(input_array);
    cudaFree(output_array);
    return 0;
}



--> next we get the max potential block size for our kernel
C:
int minGridSize, blockSize;
// last two args: dynamic shared memory per block (0) and an upper limit on the block size
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, kernel4, 0, array_size);

the function returns --> the max block size that can be used for this kernel on our device --> and sets the min grid size accordingly



--> we launch our kernel
C:
kernel4<<<(array_size + blockSize - 1) / blockSize, blockSize>>>(input_array, output_array, array_size);
 
finally --> the results

Code:
-------------------------- NVTX ---------------------------
Time(%)      Time     Calls       Avg       Min       Max  Name
--------------------------------------------------------------
  98.06%  65.210us         1  65.210us  65.210us  65.210us  [CUDA memcpy HtoD]
   1.94%  1.2920us         1  1.2920us  1.2920us  1.2920us  kernel4


we get the % of time spent in each function + total time + number of calls + average/min/max time per func))

we see 98.06% of the time --> goes to the CUDA memory copy operation (specifically the host-to-device copy)
total time in this operation = 65.210 microseconds

the remaining time (1.94%) is in another function = 1.2920 microseconds per call



finally, our analysis shows --> the memory copy from host to device is the performance bastard
 
now we will improve the performance
since cudaMemcpy() is our bastard, man
idea 1 --> async memory transfer --> cudaMemcpyAsync()
idea 2 --> pinned memory --> it can improve HtoD transfer performance by reducing overhead in the memory transfer
idea 3 --> increase threads/block, but that is dangerous --> another kernel failure will fly into our ass --> so we will not try such a revolution for now

C:
    const int array_size = 1024;
    float *h_input, *h_output;   // pinned host buffers
    float *d_input, *d_output;   // device buffers

    // alloc --> pinned memory (cudaMallocHost takes only pointer + size;
    // use cudaHostAlloc if you need flags like cudaHostAllocDefault)
    cudaMallocHost(&h_input, array_size * sizeof(float));
    cudaMallocHost(&h_output, array_size * sizeof(float));
    cudaMalloc(&d_input, array_size * sizeof(float));
    cudaMalloc(&d_output, array_size * sizeof(float));

    for (int i = 0; i < array_size; i++) {
        h_input[i] = i;
    }

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    // cuda stream --> transfer HtoD async (pinned source, device destination)
    cudaMemcpyAsync(d_input, h_input, array_size * sizeof(float), cudaMemcpyHostToDevice, stream);

    kernel5<<<1, 256, 0, stream>>>(d_input, d_output, array_size);

    cudaMemcpyAsync(h_output, d_output, array_size * sizeof(float), cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_input);
    cudaFree(d_output);
    cudaFreeHost(h_input);
    cudaFreeHost(h_output);
}


Code:
48.02% 6.6100us 1 6.6100us 6.6100us 6.6100us [CUDA memcpy HtoD]
33.99% 4.6800us 1 4.6800us 4.6800us 4.6800us [CUDA memcpy DtoH]
17.99% 2.4800us 1 2.4800us 2.4800us 2.4800us kernel5

total time --> was 65.210 microseconds --> now 13.77))))
 
cool, but why is such optimization important? --> garbage?

no, not at all --> very important
--> noise --> it is the bastard for timing side-channel attacks (Spectre attack) --> you can only extract the sensitive info when the noise stays below threshold_value --> only then can you extract sensitive data accurately + reliably


in short --> GPU exploitation --> we saw how pinned memory (avoids the overhead of copying data to/from pageable memory) + async transfer (we execute instructions even while data transfers between CPU <--> GPU) help

the time window + noise decreased by ~5 times for our sine + cosine on a vector computation))

i will cover more cool techniques in GPU kernels in more articles
cache attack --> other than the timing side channel --> fill the cache with specific bytes --> measure the time taken for the GPU to access them
instruction-level speculation, etc.
we can combine such techniques with simple memory compression before transfer to the GPU --> but carefully balanced so it does not increase the time

EDIT --> sorry for the bad formatting --> I did not plan to write this in article format
 
Last edited:

