
APPENDIX A

Installing Microsoft Visual Studio

Figure 1 Install Visual Studio

Figure 2 Setup Preparation

Figure 3 Installation Path

At this step, specify where Visual Studio will be installed. Then click Next and wait for the installation process to finish.

Figure 4 Visual Studio Component Installation

Figure 5 Restart the Computer

Figure 6 Installation Process After the Computer Restarts

Figure 7 Installation Process Complete

Installing MPICH2

Figure 8 MPICH2 Installation

Figure 9 Installation Process and Finishing the Setup

Follow the Next prompts after the setup window appears until the installation path window is shown, then specify where MPICH2 will be installed. Click Next to start the installation, wait until it finishes, then click Finish.

Figure 10 Installing smpd and Validating MPI

To run the MPI functions that will be integrated with Visual Studio, the MPI service must be activated. After installing MPICH2, install the smpd service with the command smpd -install, then start it with smpd -start, and check it with mpiexec -validate; if this reports success, the MPI service is running.

Configuring MPI in Visual Studio

Figure 11 Additional Include Directories.

Right-click the project in the Solution Explorer and choose Properties. Under Configuration Properties, expand C/C++ and select General, then in the Additional Include Directories field enter the path of the MPI include folder so that the MPI headers can be found by the compiler.

Figure 12 Additional Library Directories.

Figure 13 Additional Dependencies.

Expand the Linker menu and select General; in Additional Library Directories enter the path of the MPI lib folder so that mpi.lib, which is declared under Additional Dependencies in the Linker > Input submenu, can be found when the application is compiled and run.
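Once these paths are set, a minimal MPI program should compile and link cleanly. The sketch below is not part of the original appendix; it only verifies that mpi.h and mpi.lib are picked up by the settings described above.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int version, subversion;
    MPI_Init(&argc, &argv);                      /* start the MPI runtime */
    MPI_Get_version(&version, &subversion);      /* report the MPI standard version */
    printf("MPI %d.%d initialised successfully\n", version, subversion);
    MPI_Finalize();
    return 0;
}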

Configuring the Cluster Connection

Firewall Configuration

The firewall on each user's computer must be opened so that MPI connections sent from the cluster computers are not blocked by the other machines.

Figure 14 Finding the Firewall through the search box.

Figure 15 Advanced Security Firewall.

Figure 16 Firewall Properties.

Then set the Firewall State to Off, so that the inbound and outbound connection rules do not block MPI traffic when data is sent to or received from the cluster.

IP and User Credential Configuration

Figure 17 Searching for the Network and Sharing Center.

Figure 18 Network and Sharing Center.

Choose Change adapter settings, then right-click Local Area Connection and choose Properties.

Figure 19 Local Area Connection Properties.

Figure 20 IPv4 Properties.

Configure each user PC in the same way and give each PC its own IP address. In this project the first PC uses IP 192.168.62.10 and the second PC uses IP 192.168.62.11.

Figure 3.19 User accounts on the host (PC 1) and the client (PC 2).

The user name and the password on PC 1 and PC 2 must be identical so that PC 2 is detected when MPI is executed and MPI can transfer data between PC 1 and PC 2.

Configuring Component Services

In the Start menu search box, type dcomcnfg.exe and press Enter. Select Component Services, open the Computers folder, then right-click My Computer and choose Properties.

Figure 21 Component Services.

Figure 22 COM Security limits in the My Computer Properties dialog.

Click the COM Security tab and choose Edit Limits. Here the user connections to the main computer are configured so that the PC's security gives Allow status to the users that connect to the main computer. First, add the user that will be granted permission to access the main computer.

Figure 23 Select Users search dialog.

Click Advanced to open the dialog for choosing the kind of user that will be added to the permissions.

Figure 24 Advanced Select Users dialog.

Click Find Now to list the available users, select Everyone, and click OK.

Figure 25 Edit Permissions for the selected user.

Under Access Permission and Launch and Activation Permission, tick the Allow box for every option for the Everyone user, then click OK and close Component Services.

Connection Test and MPI Application Execution

Figure 26 Ping Test

Use the ping command followed by the IP address of a cluster computer to check that the cluster connection is up.

Figure 27 Running MPI from the Command Prompt

An application implemented with MPI is run from the command prompt with the following commands:

Local: mpirun -np 2 file.exe

The number 2 in this command sets how many processes are simulated virtually on the local host; it can be replaced with any power of two (2^n).

Cluster: mpirun -np 2 -host host1,host2 file.exe

This is the same as the local case, with -host added followed by the name of each host computer; the number of hosts and the number of processes must again be a power of two (2^n).
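As a quick check that the commands above really distribute work across the cluster, a small test program (a sketch, not taken from the original appendix) can print the rank and host name of every process:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
    MPI_Get_processor_name(name, &len);     /* host the process runs on */
    printf("Process %d of %d running on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}

When run with the cluster command, each host name should appear in the output.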

Figure 28 Task Manager on a Cluster Computer

Make sure that, while the MPI application is running, the CPU usage on the cluster computer shows processing activity. This indicates that data is being processed on the cluster computer.

Setting Up Nvidia Nsight

The first step in using Nvidia Nsight is to have Visual Studio already installed on the user's PC, so that when the NVIDIA Toolkit is installed the Nsight templates are integrated into Visual Studio's New Project dialog and can be used directly. After the installation succeeds, check whether the GPU hardware supports programming and running CUDA.

Figure 29 NVIDIA Installer summary after installing the Toolkit.

Figure 30 Searching the code samples to test the GPU.

To find out whether the GPU installed in the PC supports CUDA, open the NVIDIA CUDA Samples browser, search for the keyword particles, and click Run on the Smoke Particles sample.

Figure 31 Smoke Particles code sample.

If the smoke render appears, the GPU supports CUDA.
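The same check can also be done programmatically. The sketch below (not from the original appendix) uses the CUDA runtime API to report the first detected GPU and its compute capability:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);   /* how many CUDA devices exist */
    if (err != cudaSuccess || count == 0) {
        printf("No CUDA-capable GPU detected\n");
        return 1;
    }
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);              /* query the first device */
    printf("GPU 0: %s, compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}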

Figure 32 CUDA templates integrated into Visual Studio.

After the installation finishes, the installation summary lists the CUDA and Nsight features and components that have been integrated into Visual Studio and onto the user's PC, and Visual Studio now provides an integrated CUDA Runtime project template.

Figure 33 CUDA path in the environment variables.

In the Environment Variables dialog, reached from My Computer Properties via Advanced system settings, make sure there is a CUDA path entry pointing to the locations of the CUDA bin, include, and library folders, so that CUDA programs can be compiled and executed by the user.
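Assuming the CUDA path is set correctly, a minimal kernel launch such as the hypothetical sketch below should build as a CUDA Runtime project and run without errors:

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void fill(int *out)
{
    out[threadIdx.x] = threadIdx.x;   /* each thread writes its own index */
}

int main(void)
{
    int host[4] = {0}, *dev = NULL;
    cudaMalloc(&dev, 4 * sizeof(int));
    fill<<<1, 4>>>(dev);              /* launch one block of four threads */
    cudaMemcpy(host, dev, 4 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    printf("Kernel result: %d %d %d %d (last error: %s)\n",
           host[0], host[1], host[2], host[3],
           cudaGetErrorString(cudaGetLastError()));
    return 0;
}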

Executing a CUDA Application

When a CUDA application is executed, verify that the GPU is actually doing work by using GPU-Z or CUDA-Z; these tools show the user the readings from the GPU processor's sensors.

Figure 34 Executing a CUDA application

Figure 35 GPU sensors while idle and while executing a program

APPENDIX B

Source Code CPU Computing

Sorting

#include <stdio.h>

#include <conio.h>

#include <stdlib.h>

#include <iostream>

#include <windows.h>

void quicksort(float [10],int,int);

int main()

{

LARGE_INTEGER frequency;

LARGE_INTEGER t1,t2;

double elapsedTime;

QueryPerformanceFrequency(&frequency);

int size,i;

float *x;

float aa = 100.0;

printf("Enter size of the array: ");

scanf("%d",&size);

x = (float *)malloc( (size+1)*sizeof(float) );

for(i=0;i<size;i++)

{

x[i]=((float)rand()/(float)(RAND_MAX)) * aa;

}

QueryPerformanceCounter(&t1);

quicksort(x,0,size-1);

QueryPerformanceCounter(&t2);

elapsedTime = (t2.QuadPart - t1.QuadPart)*1000.0/

frequency.QuadPart;

printf("\n\n%f ms\n",elapsedTime);

system("pause");

return 0;

}

void quicksort(float x[],int first,int last)

{

int pivot,j,i;

float temp;

if(first<last)

{

pivot=first;

i=first;

j=last;

while(i<j)

{

while(x[i]<=x[pivot]&&i<last)

i++;

while(x[j]>x[pivot])

j--;


if(i<j)

{

temp=x[i];

x[i]=x[j];

x[j]=temp;

}

}

temp=x[pivot];

x[pivot]=x[j];

x[j]=temp;

quicksort(x,first,j-1);

quicksort(x,j+1,last);

}

}

Binary Search

#include <stdio.h>

#include <conio.h>

#include <stdlib.h>

#include <iostream>

#include <windows.h>

int main()

{

LARGE_INTEGER frequency;

LARGE_INTEGER t1,t2;

double elapsedTime;

int c,n;

int first, last, middle;

float search;

double *array;

float c2=1.25;

printf("number of elements\n");

scanf("%d",&n);

array = (double *)malloc((n+1) * sizeof(double));

//printf("Enter %d integers\n", n);

QueryPerformanceFrequency(&frequency);

for ( c = 0 ; c < n ; c++ )

{

array[c]=c2;

c2=c2+1.25;

}

printf("\nvalue to find\n");

scanf("%f",&search);

first = 0;

last = n - 1;

middle = (first+last)/2;

QueryPerformanceCounter(&t1);

while( first <= last )

{

if ( array[middle] < search ){

first = middle + 1;}

else if ( array[middle] == search ){

printf("%f found at location %d.\n", search, middle+1);


break;}

else

{

last = middle - 1;

}

middle = (first + last)/2;

}

if ( first > last )

{ printf("Not found! %d is not present in the list.\n",

search); }

QueryPerformanceCounter(&t2);

elapsedTime = (t2.QuadPart - t1.QuadPart)*1000.0/

frequency.QuadPart;

printf("\n\n\n%f ms\n",elapsedTime);

system("pause");

return 0;

}

Matrix Multiplication

#include <stdio.h>

#include <conio.h>

#include <stdlib.h>

#include <iostream>

#include <windows.h>

int main()

{ //FLOATING

int i, j, k;

double **mat1, **mat2, **res;

long n;

float aa = 5.0;

LARGE_INTEGER frequency;

LARGE_INTEGER t1,t2;

double elapsedTime;

// get the order of the matrix from the user

printf("Size of matrix:");

scanf("%d", &n);

QueryPerformanceFrequency(&frequency);

// dyamically allocate memory to store elements

mat1 = (double **)malloc(sizeof(double *) * n);

mat2 = (double **)malloc(sizeof(double *) * n);

res = (double **) malloc(sizeof(double *) * n);

for (i = 0; i < n; i++)

{

mat1[i] = (double *)malloc(sizeof(double) * n);

mat2[i] = (double *)malloc(sizeof(double) * n);

res[i] = (double *)malloc(sizeof(double) * n);

}

// get the input matrix

printf("\n");

for (i = 0; i < n; i++) {

for (j = 0; j < n; j++) {


//mat1[i][j] = rand() % 10 +1;

mat1[i][j] =

((float)rand()/(float)(RAND_MAX)) * aa;

}

}

printf("matrix 1:\n");

for(int aa=0; aa<n ; aa++)

{

for(int bb=0; bb<n ;bb++)

{

printf("%.2f ",mat1[aa][bb]);

}

printf("\n");

}

printf("\n");

// get the input for second matrix from the user

printf("matrix 2:\n");

for (i = 0; i < n; i++)

{

for (j = 0; j < n; j++)

{

//mat2[i][j] = rand() % 10 +1;

mat2[i][j]=((float)rand()/(float)(RAND_MAX)) * aa;

}

}

for(int aa=0; aa<n ; aa++)

{

for(int bb=0; bb<n ;bb++)

{

printf("%.2f ",mat2[aa][bb]);

}

printf("\n");

}

QueryPerformanceCounter(&t1);

// multiply first and second matrix

for (i = 0; i < n; i++) {

for (j = 0; j < n; j++) {

*(*(res + i) + j) = 0;

for (k = 0; k < n; k++) {

*(*(res + i) + j) = *(*(res + i) + j) +

(*(*(mat1 + i) + k) * *(*(mat2 + k) + j));

}

}

}

QueryPerformanceCounter(&t2);

elapsedTime = (t2.QuadPart - t1.QuadPart)*1000.0/

frequency.QuadPart;

printf("\n\n\n%f ms\n",elapsedTime);

// print the result

printf("\nResult :\n");

for (i = 0; i < n; i++) {

for (j = 0; j < n; j++) {

printf("%.2f ", *(*(res + i) + j));

}

printf("\n");


}

free(mat1);

free(mat2);

free(res);

system("pause");

return 0;

}

Gauss Jordan Elimination

#include <stdio.h>

#include <conio.h>

#include <stdlib.h>

#include <iostream>

#include <windows.h>

#include <math.h>

#include <malloc.h>

#include <windows.h>

int main()

{

int i, j, n;

double **a, *b, *x;

LARGE_INTEGER frequency;

LARGE_INTEGER t1,t2;

double elapsedTime;

void gauss_jordan(int n, double **a, double *b, double *x);

printf("\nNumber of equations: ");

scanf("%d", &n);

float aa = 10.0;

QueryPerformanceFrequency(&frequency);

x = (double *)malloc( (n+1)*sizeof(double) );

b = (double *)malloc( (n+1)*sizeof(double) );

a = (double **)malloc( (n+1)*sizeof(double *) );

for(i = 1; i <= n; i++)

a[i] = (double *)malloc( (n+1)*sizeof(double) );

for(i = 1; i <= n; i++)

{

for(j = 1; j <= n; j++)

{

//a[i][j]=rand()%10 + 1;

a[i][j]=((float)rand()/(float)(RAND_MAX)) * aa;

}

//b[i]=rand()%10 + 1;

b[i]=((float)rand()/(float)(RAND_MAX)) * aa;

}

for(int aa = 1 ; aa<=n ; aa++)

{


for(int bb = 1 ; bb<=n ; bb++)

{

printf("%.1f ",a[aa][bb]);

}

printf(" %.1f ",b[aa]);

printf("\n");

}

printf("\n\n");

QueryPerformanceCounter(&t1);

gauss_jordan(n, a, b, x);

QueryPerformanceCounter(&t2);

elapsedTime = (t2.QuadPart - t1.QuadPart)*1000.0/

frequency.QuadPart;

printf("\n\n\n%f ms\n",elapsedTime);

printf("\nSolution\n");

printf("------------------------------------------------\n");

printf("x = (");

for(i = 1; i <= n-1; i++) printf("%lf, ", x[i]);

printf("%lf)\n\n", x[n]);

system("pause");

return(0);

}

void gauss_jordan(int n, double **a, double *b, double *x)

{

int i, j, k;

int p;

double factor;

double big, dummy;

for(k = 1; k <= n; k++)

{

// pivoting

if(k < n)

{

p = k;

big = fabs(a[k][k]);

for(i = k+1; i <= n; i++)

{

if(big < fabs(a[i][k]))

{

big = fabs(a[i][k]);

p = i;

}

}

if(p != k)

{

for(j = 1; j <= n; j++)

{

dummy = a[p][j];


a[p][j] = a[k][j];

a[k][j] = dummy;

}

dummy = b[p];

b[p] = b[k];

b[k] = dummy;

}

}

// Gauss-Jordan elimination

factor = a[k][k];

for(j = 1; j <= n; j++) a[k][j] /= factor;

b[k] /= factor;

for(i = 1; i <= n; i++)

{

if(i == k) continue;

factor = a[i][k];

for(j = 1; j <= n; j++) a[i][j] -=

a[k][j]*factor;

b[i] -= b[k]*factor;

}

}

for(i = 1; i <= n; i++) x[i] = b[i];

return;

}

Source Code GPU Computing

Sorting

#include "cuda_runtime.h"

#include "device_launch_parameters.h"

#include <iostream>

#include <windows.h>

using namespace std;

#include <cuda.h>

#include <stdio.h>

#include <stdlib.h>

#include <conio.h>

#include <cuda_runtime_api.h>

//#define NUM 8

__device__ inline void swap(float & a, float & b)

{

float tmp = a;

a = b;

b = tmp;

}
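// Bitonic sort kernel: one thread block stages the whole array in shared memory,
// one thread per element; the array length is assumed to be a power of two.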


__global__ void bitonicSort(float * values, float N)

{

extern __shared__ float shared[];

const unsigned int tid = threadIdx.x;

shared[tid] = values[tid];

for (unsigned int k = 2; k <= N; k *= 2)

{

for (unsigned int j = k / 2; j>0; j /= 2)

{

unsigned int ixj = tid ^ j;

if (ixj > tid)

{

if ((tid & k) == 0)

{

if (shared[tid] > shared[ixj])

{

swap(shared[tid], shared[ixj]);

}

}

else

{

if (shared[tid] < shared[ixj])

{

swap(shared[tid], shared[ixj]);

}

}

}

}

}

values[tid] = shared[tid];

}

int main(void)

{

cudaEvent_t start, stop;

float time;

float * dvalues;

float * values;

int NUM;

float aa = 5.0;

scanf("%d",&NUM);

values = (float *)malloc( (NUM+1)*sizeof(float) );

size_t size = NUM * sizeof(float);

for(int i = 0; i < NUM; i++)

{

//values[i]=rand()%10 + 1;

values[i] = ((float)rand()/(float)(RAND_MAX)) * aa;

}

/*printf("\n nilai awal: ");

for (int i=0; i<NUM; i++) printf(" %i",values[i]); */

cudaMalloc((void **)&dvalues,size);


cudaMemcpy(dvalues, values, size , cudaMemcpyHostToDevice);

cudaEventCreate(&start);

cudaEventCreate(&stop);

cudaEventRecord(start,0);

bitonicSort<<<1, NUM, size >>>(dvalues,NUM);

cudaEventRecord(stop,0);

cudaEventSynchronize(stop);

cudaEventElapsedTime(&time, start, stop);

cudaMemcpy(values, dvalues, size, cudaMemcpyDeviceToHost);

cudaFree(dvalues);

/*printf("\n hasil pengurutan: ");

for (int i=0; i<NUM; i++) printf(" %i",values[i]);*/

printf("%f ms\n",time);

printf("\n");

system("pause");

}

Binary Search

#include "cuda_runtime.h"

#include "device_launch_parameters.h"

#include <stdio.h>

#include <conio.h>

#include <stdlib.h>

#include <iostream>

#include <windows.h>

#include <assert.h>

__device__ int get_index_to_check(int thread, int num_threads, int

set_size, int offset) {

return (((set_size + num_threads) / num_threads) * thread) +

offset;

}

__global__ void p_ary_search(float search, int array_length, float

*arr, int *ret_val ) {

const int num_threads = blockDim.x * gridDim.x;

const int thread = blockIdx.x * blockDim.x + threadIdx.x;

int set_size = array_length;

while(set_size != 0){

int offset = ret_val[1];

int index_to_check = get_index_to_check(thread,

num_threads, set_size, offset);

if (index_to_check < array_length){

int next_index_to_check =

get_index_to_check(thread + 1, num_threads, set_size, offset);

if (next_index_to_check >= array_length){

next_index_to_check = array_length - 1;

}

if (search > arr[index_to_check] && (search <

arr[next_index_to_check])) {

ret_val[1] = index_to_check;

}


else if (search == arr[index_to_check]) {

ret_val[0] = index_to_check;

}

}

set_size = set_size / num_threads;

}

}

float chop_position(float search, float *search_array, int

array_length)

{

float time;

cudaEvent_t start, stop;

int array_size = array_length * sizeof(float);

if (array_size == 0) return -1;

float *dev_arr;

cudaMalloc((void**)&dev_arr, array_size);

cudaMemcpy(dev_arr, search_array, array_size,

cudaMemcpyHostToDevice);

int *ret_val = (int*)malloc(sizeof(int) * 2);

ret_val[0] = -1; // return value

ret_val[1] = 0; // offset

array_length = array_length % 2 == 0 ? array_length :

array_length - 1; // array size

int *dev_ret_val;

cudaMalloc((void**)&dev_ret_val, sizeof(int) * 2);

cudaMemcpy(dev_ret_val, ret_val, sizeof(int) * 2,

cudaMemcpyHostToDevice);

// Launch kernel

cudaEventCreate(&start);

cudaEventCreate(&stop);

cudaEventRecord(start,0);

p_ary_search<<<16, 64>>>(search, array_length, dev_arr,

dev_ret_val);

cudaEventRecord(stop,0);

cudaEventSynchronize(stop);

cudaEventElapsedTime(&time, start, stop);

// Get results

cudaMemcpy(ret_val, dev_ret_val, 2 * sizeof(int),

cudaMemcpyDeviceToHost);

int ret = ret_val[0];

printf("\nFound %i\n",ret_val[1]);

printf("\nElapsed Time : %f ms",time);

// Free memory on device

cudaFree(dev_arr);

cudaFree(dev_ret_val);


free(ret_val);

return ret;

}

static float * build_array(int length) {

float *ret_val = (float*)malloc(length * sizeof(float));

for (int i = 0; i < length; i++)

{

ret_val[i] = (i * 2 + 0.5) - 1;

//ret_val[i] = i;

printf("%.2f ",ret_val[i]);

}

return ret_val;

}

static void test_array(int length, float search, float index) {

printf("Length %i Search %.2f\n", length, search);

assert(index == chop_position(search, build_array(length),

length) && "test_small_array()");

}

static void test_arrays() {

int length;

float search;

scanf("%d",&length);

scanf("%f",&search);

test_array(length, search, -1);

}

int main(){

test_arrays();

system("pause");

}

Matrix Multiplication

#include "cuda_runtime.h"

#include "device_launch_parameters.h"

#include <iostream>

#include <windows.h>

using namespace std;

#include <cuda.h>

#include <stdio.h>

#include <stdlib.h>

#include <conio.h>


#include <cuda_runtime_api.h>

#define BLOCK_SIZE 100
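// gpuMM: naive matrix multiplication C = A * B for N x N matrices;
// each thread computes one element of C.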

__global__ void gpuMM(float *A, float *B, float *C, int N)

{

int row = blockIdx.y*blockDim.y + threadIdx.y;

int col = blockIdx.x*blockDim.x + threadIdx.x;

float sum = 0.f;

for (int n = 0; n < N; ++n)

sum += A[row*N+n]*B[n*N+col];

C[row*N+col] = sum;

}

int main(int argc, char *argv[])

{

LARGE_INTEGER frequency;

LARGE_INTEGER t1,t2;

double elapsedTime;

int N,K,L;

awal:

scanf("%d",&L);

if(L < 1000)

{

printf("Input must be greater than 1000\n");

goto awal;

}

K = L/100;

N = K*BLOCK_SIZE;

float time;

cudaEvent_t start, stop;

float *hA,*hB,*hC;

hA = new float[N*N];

hB = new float[N*N];

hC = new float[N*N];

float aa=5.0;

for (int j=0; j<N; j++){

for (int i=0; i<N; i++){

hA[j*N+i] = ((float)rand()/(float)(RAND_MAX)) * aa;

hB[j*N+i] = ((float)rand()/(float)(RAND_MAX)) *

aa;

}

}

int size = N*N*sizeof(float); // size of the memory in bytes

float *dA,*dB,*dC;

cudaMalloc(&dA,size);

cudaMalloc(&dB,size);

cudaMalloc(&dC,size);

dim3 threadBlock(BLOCK_SIZE,BLOCK_SIZE);

dim3 grid(K,K);


// Copy matrices from the host to device

cudaMemcpy(dA,hA,size,cudaMemcpyHostToDevice);

cudaMemcpy(dB,hB,size,cudaMemcpyHostToDevice);

//Execute the matrix multiplication kernel

cudaEventCreate(&start);

cudaEventCreate(&stop);

cudaEventRecord(start,0);

gpuMM<<<grid,threadBlock>>>(dA,dB,dC,N);

cudaEventRecord(stop,0);

cudaEventSynchronize(stop);

cudaEventElapsedTime(&time, start, stop);

float *C;

C = new float[N*N];

cudaMemcpy(C,dC,size,cudaMemcpyDeviceToHost);

cudaFree(dA);

cudaFree(dB);

cudaFree(dC);

printf("%f ms\n",time);

system("pause");

}

Gauss Jordan Elimination

main.cpp

#include<stdio.h>

#include<conio.h>

#include<stdlib.h>

#include "Common.h"

int main(int argc , char **argv)

{

float *a_h = NULL ;

float *b_h = NULL ;

float *result , sum ,rvalue ;

int numvar ,j ;

float aa = 5.0;

numvar = 0;

scanf("%d",&numvar);

a_h = (float*)malloc(sizeof(float)*numvar*(numvar+1));

b_h = (float*)malloc(sizeof(float)*numvar*(numvar+1));

int ii=0;

for(int i = 1; i <= numvar; i++)

{

for(int i = 1; i <= numvar+1; i++)


{

//a_h[ii]=rand()%10 + 1;

a_h[ii]=((float)rand()/(float)(RAND_MAX)) * aa;

ii++;

}

}

//Calling device function to copy data to device

DeviceFunc(a_h , numvar , b_h);

//Showing the data

printf("\n\n");

/*for(int i =0 ; i< numvar ;i++)

{

for(int j =0 ; j< numvar+1; j++)

{

printf("%.2f ",b_h[i*(numvar+1) + j]);

}

printf("\n");

} */

//Using Back substitution method

result = (float*)malloc(sizeof(float)*(numvar));

for(int i = 0; i< numvar;i++)

{

result[i] = 1.0;

}

for(int i=numvar-1 ; i>=0 ; i--)

{

sum = 0.0 ;

for( j=numvar-1 ; j>i ;j--)

{

sum = sum + result[j]*b_h[i*(numvar+1) + j];

}

rvalue = b_h[i*(numvar+1) + numvar] - sum ;

result[i] = rvalue / b_h[i *(numvar+1) + j];

}

//Tampil hasil

/*for(int i =0;i<numvar;i++)

{

printf("[X%d] = %+f\n", i ,result[i]);

}*/

_getch();

return 0;

}

DeviceFunc.cu


#include <cuda.h>

#include "Common.h"

#include "cuda_runtime.h"

#include "device_launch_parameters.h"

#include <stdio.h>

#include <conio.h>

#include <stdlib.h>

#include <iostream>

#include <windows.h>

__global__ void Kernel(float *, float * ,int );

void DeviceFunc(float *temp_h , int numvar , float *temp1_h)

{

float time;

float *a_d , *b_d;

LARGE_INTEGER frequency;

LARGE_INTEGER t1,t2;

double elapsedTime;

cudaEvent_t start, stop;

//Memory allocation on the device

cudaMalloc(&a_d,sizeof(float)*(numvar)*(numvar+1));

cudaMalloc(&b_d,sizeof(float)*(numvar)*(numvar+1));

//Copying data to device from host

cudaMemcpy(a_d, temp_h,

sizeof(float)*numvar*(numvar+1),cudaMemcpyHostToDevice);

//Defining size of Thread Block

dim3 dimBlock(numvar+1,numvar,1);

dim3 dimGrid(1,1,1);

//Kernel call

cudaEventCreate(&start);

cudaEventCreate(&stop);

cudaEventRecord(start,0);

Kernel<<<dimGrid , dimBlock>>>(a_d , b_d , numvar);

cudaEventRecord(stop,0);

cudaEventSynchronize(stop);

cudaEventElapsedTime(&time, start, stop);

//Coping data to host from device

cudaMemcpy(temp1_h, b_d, sizeof(float)*numvar*(numvar+1), cudaMemcpyDeviceToHost);

//Deallocating memory on the device

cudaFree(a_d);

cudaFree(b_d);

printf("%f ms\n",time);

}

Kernel.cu

#include <cuda.h>

#include "Common.h"


__global__ void Kernel(float *a_d , float *b_d ,int size)

{

int idx = threadIdx.x ;

int idy = threadIdx.y ;

//int width = size ;

//int height = size ;

//Allocating memory in the share memory of the device

__shared__ float temp[16][16];

//Copying the data to the shared memory

temp[idy][idx] = a_d[(idy * (size+1)) + idx] ;

for(int i =1 ; i<size ;i++)

{

if((idy + i) < size)

{

float var1 =(-1)*( temp[i-1][i-1]/temp[i+idy][i-1]);

temp[i+idy][idx] = temp[i-1][idx] +((var1) *

(temp[i+idy ][idx]));

}

}

b_d[idy*(size+1) + idx] = temp[idy][idx];

}

Common.h

#ifndef __Common_H

#define __Common_H

void getvalue(float ** ,int *);

void DeviceFunc(float * , int , float *);

#endif

Source Code Cluster Computing

Sorting

#include <stdio.h>

#include <stdlib.h>

#include <mpi.h>

#define DEBUG

#define ROOT 0

#define ISPOWER2(x) (!((x)&((x)-1)))

float *merge(float array1[], float array2[], float size) {

float *result = (float *)malloc(2*size*sizeof(float));

int i=0, j=0, k=0;

while ((i < size) && (j < size))

result[k++] = (array1[i] <= array2[j])? array1[i++] : array2[j++];

while (i < size)


result[k++] = array1[i++];

while (j < size)

result[k++] = array2[j++];

return result;

}

float sorted(float array[], float size) {

int i;

for (i=1; i<size; i++)

if (array[i-1] > array[i])

return 0;

return 1;

}

int compare(const void *p1, const void *p2) {

return *(float *)p1 - *(float *)p2;

}
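// Parallel merge sort: each process sorts its local block with qsort, then the
// sorted blocks are merged pairwise up a binary tree of processes (hence the
// power-of-two check on the number of processes below).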

int main(int argc, char** argv) {

int i, b=1, npes, myrank;

long datasize;

int localsize;

float *localdata, *otherdata, *data = NULL;

int active = 1;

MPI_Status status;

double start, finish, p, s;

MPI_Init(&argc, &argv);

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

MPI_Comm_size(MPI_COMM_WORLD, &npes);

datasize = strtol(argv[1], NULL, 10);

if (!ISPOWER2(npes)) {

if (myrank == ROOT) printf("Processor number must be power of

two.\n");

return MPI_Finalize();

}

if (datasize%npes != 0) {

if (myrank == ROOT) printf("Datasize must be divisible by

processor number.\n");

return MPI_Finalize();

}

if (myrank == ROOT) {

data = (float *)malloc(datasize * sizeof(float));

for (i = 0; i < datasize; i++)

data[i] = rand()%99 + 1;

}

start = MPI_Wtime();

localsize = datasize / npes;

localdata = (float *) malloc(localsize * sizeof(float));

MPI_Scatter(data, localsize, MPI_FLOAT, localdata, localsize,

MPI_FLOAT,

ROOT, MPI_COMM_WORLD);

qsort(localdata, localsize, sizeof(float), compare);


while (b < npes) {

if (active) {

if ((myrank/b)%2 == 1) {

MPI_Send(localdata, b * localsize, MPI_FLOAT, myrank - b, 1,

MPI_COMM_WORLD);

free(localdata);

active = 0;

} else {

otherdata = (float *) malloc(b * localsize * sizeof(float));

MPI_Recv(otherdata, b * localsize, MPI_FLOAT, myrank + b, 1,

MPI_COMM_WORLD, &status);

localdata = merge(localdata, otherdata, b * localsize);

free(otherdata);

}

}

b <<= 1;

}

finish = MPI_Wtime();

if (myrank == ROOT) {

#ifdef DEBUG

if (sorted(localdata, npes*localsize)) {

printf("\nParallel sorting succeed.\n\n");

} else {

printf("\nParallel sorting failed.\n\n");

}

#endif

free(localdata);

p = finish - start;

printf(" Parallel : %.8f\n", p);

/*start = MPI_Wtime();

qsort(data, datasize, sizeof(float), compare);

finish = MPI_Wtime();*/

free(data);

}

return MPI_Finalize();

}

Binary Search

#include "mpi.h"

#include <iostream>

#include <math.h>

using namespace std;

int main(int argc,char **argv)

{

const int Master = 0;

const int Tag_Size = 1;

const int Tag_Data= 2;


const int Tag_Max=3;

double max;

double MaxInAll;

int MyId, P;

double* A;

int ArrSize, Target;

int n, Start;

int i, x;

int Source, dest, Tag;

int WorkersDone = 0 ;

double start, finish, p;

MPI_Status RecvStatus;

MPI_Init(&argc, &argv);

MPI_Comm_rank (MPI_COMM_WORLD, &MyId);

MPI_Comm_size (MPI_COMM_WORLD, &P);

start = MPI_Wtime();

//start working..

if (MyId == Master)

{


cout<<"This is the master process on "<<P<<" Processes\n";

MaxInAll=0;

int GlobIndx;

cout<<"Enter the number of elements you want to

generate..";

cin>> ArrSize;


A = new double[ArrSize];

srand ( P ); /* initialize random seed: */

for ( i= 0; i<ArrSize; i++)

{

A[i] = i+1.25;

}

n = ArrSize/(P-1);

for( i = 1; i < P; i++)

{

dest = i;

if (i == P-1)

n = ArrSize - (n*(P-2));

Tag = Tag_Size;

MPI_Send(&n, 1, MPI_DOUBLE, dest, Tag,

MPI_COMM_WORLD);

Tag = Tag_Data;


Start = (i - 1) * ( ArrSize/(P-1) );

MPI_Send(A+Start, n, MPI_DOUBLE, dest, Tag,

MPI_COMM_WORLD);

}

WorkersDone = 0;

int MaxIndex = 0;

while (WorkersDone < P-1 )

{

MPI_Recv(&x, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,

MPI_COMM_WORLD, &RecvStatus);

Source = RecvStatus.MPI_SOURCE;

Tag = RecvStatus.MPI_TAG;

if (Tag == Tag_Max)

{

GlobIndx = (Source - 1)*(ArrSize/(P-1) ) + x;

if ( A[GlobIndx] > MaxInAll)

{

MaxInAll = A[GlobIndx];

MaxIndex = GlobIndx;

}

WorkersDone++;

}

}

if(WorkersDone==P-1)

cout << "Process "<<Source<<" found the max of the

array "<< MaxInAll<<" at index "<<MaxIndex;

delete [] A;

}

else

{

max=0;

cout<<"Process "<<MyId<<" is alive...\n";

Source = Master;

Tag = Tag_Size;

MPI_Recv(&n, 1, MPI_DOUBLE, Source, Tag, MPI_COMM_WORLD,

&RecvStatus);

A = new double[n];

Tag = Tag_Data;

MPI_Recv(A, n, MPI_DOUBLE, Source, Tag, MPI_COMM_WORLD,

&RecvStatus);

cout<<"Process "<<MyId<< "Received "<<n<<" data

elements\n";

int max_i;

i = 0;

while (i<n )

{

if ( A[i] > max )

{

max=A[i];

max_i=i;

}

i++;


}

dest = Master;

Tag = Tag_Max;

cout<<"Process "<<MyId<< " has max equals "<<max<<endl;

MPI_Send(&max_i, 1, MPI_DOUBLE, dest, Tag,

MPI_COMM_WORLD);

delete [] A;

}

finish = MPI_Wtime();

if (MyId == 0)

{

p = finish - start;

printf(" Parallel : %.8f\n", p);

}

MPI_Finalize();

return 0;

}

Matrix Multiplication

#include <stdio.h>

#include "mpi.h"

#define N 5000 /* number of rows and columns in matrix */

MPI_Status status;

double a[N][N],b[N][N],c[N][N];

int main(int argc, char **argv)

{

double start, finish, p;

int numtasks, taskid, numworkers, source, dest, rows, offset, i, j, k, remainPart, originalRows;

//struct timeval start, stop;

MPI_Init(&argc, &argv);

MPI_Comm_rank(MPI_COMM_WORLD, &taskid);

MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

numworkers = numtasks-1;

start = MPI_Wtime();

if (taskid == 0) {

for (i=0; i<N; i++) {

for (j=0; j<N; j++) {

a[i][j]= 1.25;

b[i][j]= 2.25;

}


}

//gettimeofday(&start, 0);

/* send matrix data to the worker tasks */

rows = N/numworkers;

offset = 0;

remainPart = N%numworkers;

for (dest=1; dest<=numworkers; dest++)

{

if (remainPart > 0)

{

originalRows = rows;

++rows;

remainPart--;

MPI_Send(&offset, 1, MPI_INT, dest, 1, MPI_COMM_WORLD);

MPI_Send(&rows, 1, MPI_INT, dest, 1, MPI_COMM_WORLD);

MPI_Send(&a[offset][0], rows*N, MPI_DOUBLE,dest,1,

MPI_COMM_WORLD);

MPI_Send(&b, N*N, MPI_DOUBLE, dest, 1, MPI_COMM_WORLD);

offset = offset + rows;

rows = originalRows;

}

else

{

MPI_Send(&offset, 1, MPI_INT, dest, 1, MPI_COMM_WORLD);

MPI_Send(&rows, 1, MPI_INT, dest, 1, MPI_COMM_WORLD);

MPI_Send(&a[offset][0], rows*N, MPI_DOUBLE,dest,1,

MPI_COMM_WORLD);

MPI_Send(&b, N*N, MPI_DOUBLE, dest, 1, MPI_COMM_WORLD);

offset = offset + rows;

}

}

/* wait for results from all worker tasks */

for (i=1; i<=numworkers; i++)

{

source = i;

MPI_Recv(&offset, 1, MPI_INT, source, 2, MPI_COMM_WORLD,

&status);

MPI_Recv(&rows, 1, MPI_INT, source, 2, MPI_COMM_WORLD,

&status);

MPI_Recv(&c[offset][0], rows*N, MPI_DOUBLE, source, 2,

MPI_COMM_WORLD, &status);

}

}

if (taskid > 0) {

source = 0;

MPI_Recv(&offset, 1, MPI_INT, source, 1, MPI_COMM_WORLD,

&status);

MPI_Recv(&rows, 1, MPI_INT, source, 1, MPI_COMM_WORLD,

&status);

MPI_Recv(&a, rows*N, MPI_DOUBLE, source, 1, MPI_COMM_WORLD,

&status);

MPI_Recv(&b, N*N, MPI_DOUBLE, source, 1, MPI_COMM_WORLD,

&status);


/* Matrix multiplication */

for (k=0; k<N; k++)

for (i=0; i<rows; i++) {

c[i][k] = 0.0;

for (j=0; j<N; j++)

c[i][k] = c[i][k] + a[i][j] * b[j][k];

}

MPI_Send(&offset, 1, MPI_INT, 0, 2, MPI_COMM_WORLD);

MPI_Send(&rows, 1, MPI_INT, 0, 2, MPI_COMM_WORLD);

MPI_Send(&c, rows*N, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD);

}

finish = MPI_Wtime();

if (taskid == 0)

{

p = finish - start;

printf(" Parallel : %.8f\n", p);

}

MPI_Finalize();

}

Gauss Jordan Elimination

#include <stdlib.h>

#include <stdio.h>

#include <iostream>

#include "mpi.h"

double serial_gaussian( double *A, double *b, double *y, int n )

{

int i, j, k;

double tstart = MPI_Wtime();

for( k=0; k<n; k++ ) {

for( j=k+1; j<n; j++ ) {

if( A[k*n+k] != 0)

A[k*n+j] = A[k*n+j] / A[k*n+k];

else

A[k*n+j] = 0;

}

if( A[k*n+k] != 0 )

y[k] = b[k] / A[k*n+k];

else

y[k] = 0.0;


A[k*n+k] = 1.0;

for( i=k+1; i<n; i++ ) {

for( j=k+1; j<n; j++ )

A[i*n+j] -= A[i*n+k] * A[k*n+j];

b[i] -= A[i*n+k] * y[k];

A[i*n+k] = 0.0;

}

}

return tstart;

}

void print_equations( double *A, double *y, int n )

{

int i, j;

for( i=0; i<n; i++ ) {

for( j=0; j<n; j++ ) {

if( A[i*n+j] != 0 ) {

std::cout << A[i*n+j] << "x" << j;

if( j<n-1 ) std::cout << " + ";

}

else

std::cout << " ";

}

std::cout << " = " << y[i] << std::endl;

}

}

int main( int argc, char *argv[] )

{

double *A, *b, *y, *a, *tmp, *final_y; // var decls

int i, j, n, row, r;

double tstart, tfinish, TotalTime; // timing decls

float aa = 5.0;

if( argc < 2 ) {

std::cout << "Usage\n";

std::cout << " Arg1 = number of equations / unkowns\n";

return -1;

}

n = atoi(argv[1]);

A = new double[n*n]; // space for matricies

b = new double[n];

y = new double[n];

for( i=0; i<n; i++ ) { // creates a matrix of random

b[i] = 0.0;

for( j=0; j<n; j++ ) {

r = ((float)rand()/(float)(RAND_MAX)) * aa;

A[i*n+j] = r;

b[i] += j*r;


}

}

MPI_Init (&argc,&argv); // Initialize MPI

MPI_Comm com = MPI_COMM_WORLD;

int size,rank; // Get rank/size info

MPI_Comm_size(com,&size);

MPI_Comm_rank(com,&rank);

int manager = (rank == 0);

if (size == 1)

tstart = serial_gaussian ( A, b, y, n);

else

{

if ( ( n % size ) != 0 )

{

std::cout << "Unknowns must be multiple of processors." <<

std::endl;

return -1;

}

int np = (int) n/size;

a = new double[n*np];

tmp = new double[n*np];

if ( manager )

{

tstart = MPI_Wtime();

final_y = new double[n];

}

MPI_Scatter(A,n*np,MPI_DOUBLE,a,n*np,MPI_DOUBLE,0,com);

for ( i=0; i < (rank*np); i++ )

{

MPI_Bcast(tmp,n,MPI_DOUBLE,i/np,com);

MPI_Bcast(&(y[i]),1,MPI_DOUBLE,i/np,com);

for (row=0; row<np; row++)

{

for ( j=i+1; j<n; j++ )

a[row*n+j] = a[row*n+j] - a[row*n+i]*tmp[j];

b[rank*np+row] = b[rank*np+row] - a[row*n+i]*y[i];

a[row*n+i] = 0;

}

}

for (row=0; row<np; row++)

{

for ( j=rank*np+row+1; j < n ; j++ )


{

a[row*n+j] = a[row*n+j] / a[row*n+np*rank+row];

}

y[rank*np+row] = b[rank*np+row] / a[row*n+rank*np+row];

a[row*n+rank*np+row] = 1;

for ( i=0; i<n ; i++ )

tmp[i] = a[row*n+i];

MPI_Bcast (tmp,n,MPI_DOUBLE,rank,com);

MPI_Bcast (&(y[rank*np+row]),1,MPI_DOUBLE,rank,com);

for ( i=row+1; i<np; i++)

{

for ( j=rank*np+row+1; j<n; j++ )

a[i*n+j] = a[i*n+j] - a[i*n+row+rank*np]*tmp[j];

b[rank*np+i] = b[rank*np+i] -

a[i*n+row+rank*np]*y[rank*np+row];

a[i*n+row+rank*np] = 0;

}

}

for (i=(rank+1)*np ; i<n ; i++)

{

MPI_Bcast (tmp,n,MPI_DOUBLE,i/np,com);

MPI_Bcast (&(y[i]),1,MPI_DOUBLE,i/np,com);

}

MPI_Barrier(com);

MPI_Gather(a,n*np,MPI_DOUBLE,A,n*np,MPI_DOUBLE,0,com);

MPI_Gather(&(y[rank*np]),np,MPI_DOUBLE,final_y,np,MPI_DOUBLE,0,com);

y = final_y;

}

if (manager || (size==1) )

{

tfinish = MPI_Wtime();

TotalTime = tfinish - tstart;

printf("%f",TotalTime);

std::cout << std::endl;

}

MPI_Finalize();

}