python-catalin: cuda

Showing posts with label cuda. Show all posts

Wednesday, June 5, 2024

News : JAX in 100 Seconds by Fireship!

JAX is a Python library similar to NumPy for scientific computing and linear algebra, but designed to run on accelerators like Cuda-based GPUs and Google's TPUs.

Friday, April 19, 2024

Python 3.10.12 : Colab quantum circuits with qiskit - part 046.

I've added another introductory example to my GitHub repository with Google Colab notebooks on how to use quantum circuits with the Python package called qiskit and the IBM Quantum Platform.

I used the IBM Quantum Platform and it provides an A.P.I symbol so that it can be used with the source code.

You can find this notebook at this catafest_061.ipynb repo file.

Saturday, February 17, 2024

Python 3.10.12 : Few example for CUDA and NVCC - part 044.

NVCC use CUDA C/C++ source code and allows developers to write high-performance GPU-accelerated applications by leveraging the power of NVIDIA GPUs for parallel processing tasks.

Today I test some simple examples with this tool on Google Colab using the nvcc4jupyter python package.

You need to install it with the pip and know how to use the CUDA C/C++ source code, or use the basic example from documentation.

pip install nvcc4jupyter

I change some source code because is need to install this library and I don't have time to learn and test.

But this will allow me to test better, because on my desktop I don't have a good hardware.

This is the source I change and I cut the source code linked on error_handling.h.

This is the changed source code , you can see more on my GitHub repo for Google Colab ! !

#include 
//#include "error_handling.h"

const int DSIZE = 4096;
const int block_size = 256;

// vector add kernel: C = A + B
__global__ void vadd(const float *A, const float *B, float *C, int ds){
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < ds) {
        C[idx] = A[idx] + B[idx];
    }
}

int main(){
    float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C;

    // allocate space for vectors in host memory
    h_A = new float[DSIZE];
    h_B = new float[DSIZE];
    h_C = new float[DSIZE];

    // initialize vectors in host memory to random values (except for the
    // result vector whose values do not matter as they will be overwritten)
    for (int i = 0; i < DSIZE; i++) {
        h_A[i] = rand()/(float)RAND_MAX;
        h_B[i] = rand()/(float)RAND_MAX;
    }

    // allocate space for vectors in device memory
    cudaMalloc(&d_A, DSIZE*sizeof(float));
    cudaMalloc(&d_B, DSIZE*sizeof(float));
    cudaMalloc(&d_C, DSIZE*sizeof(float));
    //cudaCheckErrors("cudaMalloc failure"); // error checking

    // copy vectors A and B from host to device:
    cudaMemcpy(d_A, h_A, DSIZE*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, DSIZE*sizeof(float), cudaMemcpyHostToDevice);
    //cudaCheckErrors("cudaMemcpy H2D failure");

    // launch the vector adding kernel
    vadd<<<(DSIZE+block_size-1)/block_size, block_size>>>(d_A, d_B, d_C, DSIZE);
    //cudaCheckErrors("kernel launch failure");

    // wait for the kernel to finish execution
    cudaDeviceSynchronize();
    //cudaCheckErrors("kernel execution failure");

    cudaMemcpy(h_C, d_C, DSIZE*sizeof(float), cudaMemcpyDeviceToHost);
    //cudaCheckErrors("cudaMemcpy D2H failure");

    printf("A[0] = %f\n", h_A[0]);
    printf("B[0] = %f\n", h_B[0]);
    printf("C[0] = %f\n", h_C[0]);
    return 0;
}

This is result ...

A[0] = 0.840188
B[0] = 0.394383
C[0] = 0.000000

python-catalin

analitics

Pages