CSc-387 Parallel Processing CHAPTER 11 : IMAGE PROCESSING ALGORITHMS ======================================== 11.1. Low-level Image Processing -------------------------------- - Gray scale: 0 (black) - 255 (white), pixels are represented in 8 bits - Color images: (R)ed, (G)reen, (B)lue 8 bits each. 24 bits total or pixel value may be an index to a look-up table which stores colors - assume that the stored image uses a coordinate system as shown in: (* T Figure 11.1 *) - Low-level image operations: embarrassingly parallel e.g. noise reduction, edge detection, object labelling, template matching Computational Requirements: 1024x1024 pixels ==> 2^20 (~10^6) = 1 MBytes -------------------------- suppose each pixel must be operated once, using a computer with speed = 10^{-8} sec/oper, it takes 10 ms to process 2^20 pixels. In real time, we need to process 60-85 frames/second (1 frame in 12-16 ms). Typically, many high-complexity operations must be performed on the image. Therefore, computational requirements are very high and often special purpose hardware is used. POINT PROCESSING: ----------------- embarrassingly parallel. THRESHOLDING: if (Xi < threshold) Xi=0; else Xi=1 CONTRAST STRETCHING: map the interval (x_l, x_h) =====> (X_L, X_H) X_L ............................ X_H \ \ / \ \ / \ \ / \ \ / x_l ....x_i.... x_h x_i = (x_i - x_l)*[(X_H - X_L)/(x_h - x_l)] + X_L GREY LEVEL REDUCTION: reduce the number of bits to represent the pixels in order to reduce storage requirements 11.3 HISTOGRAM ============== a global operation (all pixels are used) mainly used before a thresholding operation (* T Figure 11.2 *) for(i=0; i < height_max; i++) for(j=0; j < width_max; j++) hist[p[i][j]]++ ; To do it in parallel, processors locally update their histogram arrays and then perform a Vector Global Sum (global sum for each vector entry). or send it to a master processor for a global update. 11.4 SMOOTHING, SHARPENING, and NOISE REDUCTION =============================================== - local operations which require the value of neighboring pixels - Smoothing suppresses large fluctuations in intensity by reducing the high frequency content - Sharpening accentuates the transitions, enhancing the detail. Two ways: reduce low-freq. content or accentuate changes through differentiation - Noise reduction: suppresses noise through averaging and median filtering Smoothing: ---------- new value for x4 (x4') is obtained by taking the average of 9 pixels in the 3x3 region around x4: (* T Figure 11.3 *) new value for x4 can also be obtained by finding the MEDIAN of the pixels. This obviously requires sorting 9 values. Efficient approximate sorting strategies (i.e. shearsort) can be employed. Parallel Code: suppose one processor is assigned for each pixel. -------------- Then, each PE perfroms 4 communications with its L, R, U, and D neighbors to get neighboring pixel values: (* T Figure 11.4 *) (* T Figure 11.5 *) After the values are obtained, new pixel values can be computed. This is an embarrassingly parallel application and it is suitable for synchronous computation in lock-step mode (SIMD type). Weighted Masks -------------- Operations described so far can be generalized in the form of a "weighed mask operation" as shown in: (* T Figure 11.7 *) For example, mask for computing the mean is: (* T Figure 11.8 *) Mask for noise reduction: (* T Figure 11.9 *) A High-pass sharpening filter mask: (* T Figure 11.10 *) =================== 11.5 EDGE DETECTION =================== Edge Detection using differentiation: (* T Figure 11.11 *) Two important parameters to determine: (* T Figure 11.12 *) Gradient (magnitude) = Df = SQRT[(df/dx)^2 + (df/dy)^2] Gradient Direction = Theta(x,y) = Tan^{-1}[(df/dy) / (df/dx)] - Gradient (magnitude) can be approximated to |df/dx| + |df/dy| Edge Detection Masks ===================== For discrete functions, the derivatives are approximated by differences: df/dx = (x5 - x3) df/dy = (x7 - x1) Gradient = Df = |x5 - x3| + |x7 - x1| Prewitt Operator: ----------------- Better results can be obtained by using more pixel values. For example: (* T Figure 11.13 *) df/dy = (x6 - x0) + (x7 - x1) + (x8 - x2) df/dx = (x2 - x0) + (x5 - x3) + (x8 - x6) Df = |df/dy| + |df/dx| Sobel Operator: --------------- (* T Figure 11.14 *) df/dy = (x6 + 2x7 + x8) - (x0 + 2x1 + x2) df/dx = (x2 + 2x5 + x8) - (x0 + 2x3 + x6) Edge Detection Example (with Sobel operator): (* T Figure 11.15 *) Laplace Operator: second-order derivative which can be approximated to: ----------------- (* T Figure 11.16 *) (* T Figure 11.17 *) D^2 f = 4.x4 - (x1+x3+x5+x7) Notice that this computation will yield to zero for a constant change in gray level across the mask. - Usually, a threshold is applied after the edge operators to get either black or white pixels: (* T Figure 11.18 *) shows the result of Laplace Operator. 11.6 THE HOUGH TRANSFORM ======================== - Edge detection described so far identifies the pixels which might be on an edge, but it does not tell much about how they connect to each other to make an edge. Hough transform tries to do exactly that. Line equation: y = ax + b b = -xa + y - A single pixel at (x1, y1) has an infinite number of lines that could pass through it. - If a and b are discretized, a finite set of lines will exist. (* T Figure 11.19 *) - There will be a common point (S, T) in the parameter space (a-b) into which all the points that lie on a specific line in the x-y space will map. These pixels may also map onto other points in (a-b) space, but (S, T) will get the most hits. If we keep a count of these hits, then those (a, b) values with most points mapped onto will be the best candidates for edges in the image. - However, in order to make this method to work with vertical/horizontal lines, we need to rewrite the line equation in polar coordinates: y = ax + b ============> (polar) r = xcos(A) + ysin(A) where r is the perpendicular distance from the origin to the line, and (A) is the angle between r and x-axis. (* T Figure 11.20 *) Implementation: --------------- - r is always positive, (A) is between 0 - 360 degrees - The parameter space is divided into small rectangular regions. r is divided into increments of 5 and (A) increments of 10 degrees. An accumulator is provided for each rectangular region: (* T Figure 11.22 *) - Accumulators of those regions that a pixel maps into are incremented. Therefore, the computation time will depend upon the number of regions. - If there are n pixels and k discrete values of (A) are tried, then complexity is O(kn) - To reduce the computational complexity, we might make k smaller. If only the Gradient Angle at that pixel is used, then k=1. In this case, time complexity for n pixels is O(n). SEQUENTIAL CODE --------------- for(x=0; x < Xmax; x++) for(y=0; y < Ymax; y++) { sobel(x, y, dx, dy); magnitude = grad_mag(dx, dy); if(magnitude > treshold) { Theta = grad_dir(dx, dy); /* atan2() function */ Theta = Theta_quantize(Theta); r = x * cos(Theta) + y * sin(Theta); r = r_quantize(r); acc[r][Theta]++; /* increment accumulator */ append(r, Theta, x, y); /* append point to line */ } } - In the end, the most likely lines are those with the most pixels, and accumulators with local maxima are chosen. PARALLEL CODE ------------- Given P: number of processors N*N : Image size M1: number of discrete intervals used for r M2: number of discrete intervals used for Theta If [ (M1 > M2) and (P < N) and (P < M1) ] Then a PARALLEL ALGORITHM with time complexity O(N^2/P + M1*M2*logM1) can be developed to perform Hough Transform on the given image: - Computation for each accumulator is independent of the others. Image could be sliced into strips and assigned onto processors. Each PE locally updates the accumulators corresponding to the pixels in its own region. When done, accumulators must be merged. Creative solutions could be adapted for the final merge step. - One simple solution is that a Master PE could collect all the results and add them up to form a Global Accumulator Matrix. - Or PEs may use a Binary Tree approach for merging the local matrices. This takes O(M1*M2*logP) where M1 and M2 are the number of discrete intervals used for r and (Theta) respectively. Hence M1*M2 is the size of the Accumulator Matrix. After the matrices are merged, entries must be analyzed to find the local maximas. This can also be done through sorting which takes O(M1*M2*logM1*M2) = O(M1*M2*(logM1 + logM2)) - How about a special mergeANDsort operation in which processors try to sort their accumulator values? Whenever they come across the same accumulator index, they reduce them into one value, and participate in the sort as one value instead of two separate values. - Indeed, this can be a nice term project for those who are interested! 11.7.2 DISCRETE FOURIER TRANSFORM (DFT) ======================================= Given N real discrete values x0, x1, ..., x_(N-1), DFT is computed as: X_k = 1/N * SUM_{j=0}^{N-1} xj * e^{-2*PI*ijk/N} for k = 0,1,...N-1 Inverse DFT is: x_k = SUM_{j=0}^{N-1} Xj * e^{2*PI*ijk/N} for k = 0,1,...N-1 - X_k s are complex values whereas xk's are real. - Note: by definition, e^{iA} = cosA + i*sinA - Sometimes the DFT is given without the scale factor 1/N ONE APPLICATION: Earlier, Frequency Filtering was achieved with weighted masks, which can be described by the CONVOLUTION operation: h(j, k) = g(j, k) * f(j, k) where g(j, k) describes the weighted mask and f(j, k) describes the image. The CONVOLUTION of two functions can be obtained by taking the Fourier transforms of each function (H(j,k), G(j,k), F(j,k)) and multiplying them: H(j, k) = G(j, k) * F(j, k) (* T Figure 11.24 *) - Doing it this way does not reduce the time complexity but working in Frequency domain has other advantages, e.g. eliminating Very High and Very Low frequencies, etc. IMPLEMENTATION of DFT: ====================== - Sequential Time Complexity for computing DFT of N values: O(N^2) w = e^{-2*PI*i/N}; for (k=0; k < N; k++) /* for every point */ X[k] = 0; for (j=0; j < N; j++) /* compute summation */ X[k] = X[k} + w^{j*k} * x[j] ; - Note that the summation step requires complex number arithmetic. Since w^{(j-1)*k}*w^k = w^{j*k}, the code could be rewritten as: w = e^{-2*PI*i/N}; for (k=0; k < N; k++) { /* for every point */ X[k] = 0; a = 1; for (j=0; j < N; j++) { /* compute summation */ X[k] = X[k} + a * x[j] ; a = a * w^k ; } } PARALLEL IMPLEMENTATION of DFT: -------------------------------- - N DFTs and N processors. Each PE computes one DFT. Pi computes Xi - Each PE needs a copy of N discrete values x0, x1, ..., x_(N-1) - Therefore, Tpar = O(N) with N processors - If P << N, then each PE computes N/P DFTs. FAST FOURIER TRANSFORM (FFT) ============================ - FFT is a fast method of obtaining DFT - has a time complexity of O(NlogN) instead of O(N^2) - In the following, it will be assumed that N is a power of 2. - It can be shown that, DFT for Xk can be rewritten as: Xk = 1/2 [ X_even + w^k * X_odd ] for k = 0,1, ..., N-1 X_{k+N/2} = 1/2 [ X_even + w^{k+N/2} * X_odd ] = 1/2 [ X_even - w^k * X_odd ] (because w^{N/2} = -1) where X_even is the N/2 point DFT of the numbers with even indices (x0, x2, x4, .....) and X_odd is the N/2 point DFT of the numbers with odd indices (x1, x3, x5, .....). (* T Figure 11.28 *) - Using the above formula in a recursive manner, we can further represent X_even and X_odd in terms of 2 summations each. This yields a recursive divide-and-conquer approach to computing FFT: (* T Figure 11.29 *) (* T Figure 11.30 *) (* T Figure 11.31 *) SEQUENTAL IMPLEMENTATION of FFT: -------------------------------- There are logN phases in the computation of FFT (see Figure 11.31) each phase taking O(N) time. Therefore, Tseq = O(NlogN) PARALLEL IMPLEMENTATION of FFT: -------------------------------- - N FFTs and N processors. Each PE computes one FFT. - Pi computes Xi - Unlike the DFT algorithm, PEs do not need to access all of the N discrete values x0, x1, ..., x_(N-1) at once. At each step PEs need only two values to compute (one stored locally another one received from the neighbor processor which differs in only one bit. - Processors run a Binary Exchange Algorithm in which they interact (send/receive) with those processors who differ in exacly one bit. At each phase the differing bit changes moving from left to right. If P=N then Tpar = O(logN) (* Linear speedup *) - If P << N, then each PE computes N/P FFTs. After logP steps, they can find the data they need locally. In this case, Tpar = O(NlogN/P) (* Linear speedup *)