Real-time polarization-sensitive optical coherence tomography data processing with parallel computing

With the increase of the A-line speed of optical coherence tomography (OCT) systems, real-time processing of acquired data has become a bottleneck. The shared-memory parallel computing technique is used to process OCT data in real time. The real-time processing power of a quad-core personal computer (PC) is analyzed. It is shown that the quad-core PC could provide real-time OCT data processing ability of more than 80K A-lines per second. A real-time, fiber-based, swept source polarization-sensitive OCT system with 20K A-line speed is demonstrated with this technique. The real-time 2D and 3D polarization-sensitive imaging of chicken muscle and pig tendon is also demonstrated.


Introduction
Optical coherence tomography (OCT) is a prominent biomedical tissue imaging technique that provides noninvasive, micrometer-resolution measurements [1]. Polarization-sensitive OCT (PSOCT)  is a functional extension of OCT that can measure the depth-resolved birefringent characteristics of biological tissues such as collagen, cartilage, and muscle. OCT and PSOCT have evolved from the earlier time domain to the current Fourier domain system because of the high signal-to-noise ratio and high imaging speed of the Fourier domain system [23,24].
OCT systems with more than 100K A-line speeds (where K denotes 2 10 and A-line speed is axial scan lines per second) have been demonstrated [25,26]. With such a fast A-line speed, the data processing becomes the bottleneck of the system. Compared with standard nonpolarization-sensitive OCT, the fiber-based PSOCT system needs to acquire four times the amount of data for the same size picture. In addition, a time-consuming algorithm is needed to calculate the depth-dependent birefringent information. The most common way to solve this problem is to acquire the data and save it on a hard disk, then process the acquired data afterward, or to process and display a portion of the acquired data in real time [7,8,27]. However, real-time processing for all the acquired data is preferred and necessary in clinical imaging, especially for applications such as endoscopy and ophthalmology.
Hardware-based parallel processing schemes, such as digital signal processors or fieldprogrammablegate-array-based schemes are the most common methods to provide real-time OCT data processing [28][29][30][31]. However, additional hardware inclusion in the system increases the system complexity and cost. In addition, the programming and debugging can be very tedious for the added hardware.
Multiple processing units have become popular on personal computers. Most personal computers arrive prebuilt with multiple processing units such as multicore CPUs, multicore video processors, and multicore audio processors. With these multiple processing units, highperformance parallel computing is available on personal computers without the inclusion of additional hardware. In this paper, we demonstrate real-time data processing and display of a fiber-based swept source PSOCT system with a multicore CPU computer and shared memory parallel computing technique. Using the multicore CPU and shared memory parallel computing technique, the OCT software processing speed was increased with minimum modification of the original program. With our current Quad-Core Intel Xeon X5355 2.66GHz workstation, real-time processing of standard non-polarization-sensitive OCT data as fast as 80 K A-lines per second is possible. Real-time processing of a 20K A-line swept source fiber-based PSOCT system is demonstrated.

A. Parallel Computing
The interest in parallel computing can be traced back to the 1950s. It is based on the principle that large problems can be divided into small ones, which are executed concurrently. Parallel computers can be classified into two categories [32]. The first category is the shared memory system, where a number of processors are connected to the main memory. An example is a multicore personal computer. Another category is the distributed memory system, where each processing unit has its own memory and all these processing units are connected by a network. Examples include clusters and grids. The speed-up of parallel computing can be described by Amdahl's law [33]: (1) where S is the speed-up, P is the ratio of the parallel part to the whole task, and N is the number of processing units. In order to make full use of the processors, the parallel part of the whole task must be as high as possible. If the whole task can be parallelized (P = 1), the speed-up is linearly proportional to the number of processing units.
In our OCT configuration, a Dell Precision 690 workstation with a quad-core Intel Xeon X5355 at 2.66GHz was used as the data acquisition and processing computer. The operating system was Microsoft Windows XP Pro SP3, and the software developing platform was Microsoft Visual Studio 2005. The shared-memory parallel computing technique was used in our software. Parallel computing was implemented via openMP [34]. OpenMP provides a simple and flexible interface for developing the parallel application, and only a few application programming interfaces are needed to upgrade our serial processing OCT software to the parallel real-time processing software.
In order to understand the data processing capability of our computer for OCT applications, we tested the processing speed of our current configuration. The 1024-point 32 bit (single precision) and 64 bit (double precision) complex number fast Fourier transform (FFT) was used for the test, and the FFT algorithm was FFTW [35]. Time consumption for 12,000 1024-point, complex number FFTs was measured with different threads (cores) in parallel computing, and the time consumption for each single 1024-point complex number FFT was calculated accordingly.When all four cores were used in the parallel computing, it would take about 1.37 µs for one 1024-point complex number FFT(32 bit), and the speed was comparable with the result of a digital signal processor system [30]. However, the programming and debugging for our software is much easier than that for the digital-signal-processor-based system. The speed-up curves for our test are shown in Fig. 1. Figure 1 shows that introduction of the parallel computing in our quad-core system was able to provide a speed-up of as much as 3.5 times. The time consumption of a 64 bit FFT is almost two times that of a 32 bit FFT. The accuracy of the 32 bit FFT compared with the 64 bit FFT was also tested. The difference between results from the 32 bit FFT and the 64 bit FFT was found to be less than 10 −5 . This margin of error is acceptable for OCT applications. We will adopt the 32 bit FFT for the following OCT applications.

B. Software Architecture
The software structure is shown in Fig. 2. The OCT data acquisition and processing software was written with Microsoft Visual C++. Multithreading and parallel computing techniques were used. A total of four threads were in the software: (1) the controlling thread, (2) data acquisition and fetching thread, (3) data processing thread, and (4) image display thread. The controlling thread was the main thread that controlled the whole process and all the other threads. The data acquisition and fetch thread was in charge of data acquisition and fetching the original data into the main memory. The data processing thread read the acquired data from the memory and processed the data. Data processing involved calibration, dispersion compensation, and time domain intensity calculation [30]. Once a frame was processed, the image was displayed by the display thread. Parallel computing was implemented in the data processing thread. In the software architecture, the controlling thread and the display thread did not take much of the CPU resources. The data acquisition thread and data processing thread took most of the CPU resources. Based on the quad-core PC configuration, there were a total of three cores involved in the parallel computing data processing thread.
A 20K A-line speed fiber-based swept source system [30] was used to test the software. The software could provide real-time acquisition and processing ability for the 20K A-line system. The software was able to acquire and process 45K A-lines in real-time by using a function generator as a signal source. Because of the limitation of the data streaming speed of the PCI digitizer, a faster A-line speed using a function generator could not be tested. However, by calculating the time consumption for data of a single A-line, it was found that processing of a single A-line would take approximately 12 µs, which corresponded to an A-line speed of about 80 K.
For fiber-based PSOCT systems, the acquired data volume is two times that of standard nonpolarization-sensitive OCT system if an alternative A-line polarization modulation scheme is used [7][8][9]13], and it is four times that of standard nonpolarization-sensitive OCT system if a single A-line polarization modulation scheme is used [17,19]. In addition, the phase retardation calculation algorithm is also very time consuming. Real-time processing is more challenging than standard nonpolarization-sensitive OCT imaging. Park et al. have shown multifunction systems [7,8] that were able to process and display a portion of the acquired data in real time. In this paper, we will show a real-time processing and display 20K A-line fiberbased PSOCT system using the alternative A-line polarization modulation scheme.

C. Fiber-Based Swept Source PSOCT System
A schematic of the PSOCT system is shown in Fig. 3. The laser source is a swept source laser with a maximum scanning rate of 20K (HSL 2000, Santec Corporation, Aichi, Japan). The laser generates polarized light with a center wavelength of 1310 nm and a bandwidth of 100 nm. A polarizer was oriented 45° with respect to the amplitude modulator crystal (4104, New Focus, San Jose, California) to make a polarization modulator. An amplified square waveform with a frequency of 10 kHz was used to drive the modulator so that alternating linear and circular polarization states for consecutive A-lines were obtained [6][7][8]. A fiber polarization controller in front of the polarizer was used to control the polarization state so that the laser power after the modulator is maximum. A 1 × 2 beam splitter behind the modulator split 20% of the laser power into the reference arm and 80% of the laser power into the sample arm. In the reference arm, a polarization controller and a linear polarizer were used to ensure that the reference beam had a constant polarization state and constant amplitude. Two circulators were used in the reference arm and the sample arm to guide the sample and reference light into a 2 × 2 polarization maintenance (PM) coupler. Two fiber polarization beam splitters (PBSs) were used to separate the fast and slow axes signals and send them to two balanced detectors. A high speed digitizer (PCI 5122, National Instruments, Austin, Texas) was used to digitize the electrical signal from the two balanced detectors at 33 M samples/s. The acquisition and processing software was introduced in the previous section, and the architecture of the software was the same as shown in Fig. 2. However, for the PSOCT version of the software, both analogy input channels of the digitizer were used in the acquisition and fetching thread. The algorithm in the processing thread was more complex. Figure 4 shows the flow for processing the PSOCT data. The data acquired by the two channels of the digitizer in alternative A-lines are marked as LH, LV, CH, and CV in the figure. They represent the detected horizontal (LH) or vertical polarization (LV) signal for linear polarization state input and the detected horizontal (CH) or vertical polarization (CV) signal for circular polarization state input. These signals are resampled to linear k space based on the linear interpolation method. The resampled signals are marked as LH′, LV′, CH′, and CV′ in Fig. 4. The FFT was used to convert the resampled data to the complex depth encoded signal (SLH, SLV, SCH, SCV in Fig. 4). The complex depth encoded signal was used to get the Stokes parameters for linear polarization input (LI, LQ, LU, LV) and circular polarization input (CI, CQ, CU, CV). Then the intensity and phase retardation images were obtained. The intensity images were obtained from the average of LI and CI. The phase retardation was calculated according to the method in [5,6].
In the processing thread, three parallel sections were used to process the data in real time. The three parallel sections processed the data in different A-lines. For example, if there are 1024 A-lines in one frame, the first parallel section will process the first 341 A-lines (from the 1st to the 341st A-line), the second parallel section will process the second 341 Alines (from the 342nd to the 682nd A-line) and the third section will process the last 342 A-lines. All variables for each parallel section were made private variables that were used by that section only. Because different parallel sections processes different A-lines and private variables are used in each section, false sharing is avoided. The efficiency of the parallelism can be seen from Fig. 5. The time to process one single effective A-line is around 48.7 µs. This means that our system can process up to a 40 kHz Aline PSOCT system in real time when the alternative Aline polarization modulation is used.

Results
Chicken breast muscle was measured ex vivo by our fiber-based, real-time PSOCT system. Figure 6 shows the intensity (left-hand side) and phase retardation (right-hand side) images. The phase retardation image is indicated in the gray scale from black (0°) to white (180°). These images were 256 (width) ×400 (height) pixels for both intensity and phase retardation images. The images were acquired and processed in real time at 40 frames/s. The periodic band structure in the axial direction is clearly seen in the phase retardation images, which shows the strong birefringent features of the chicken breast muscle. Figure 7 shows the 3D reconstruction images for a piece of pig tendon. 2D images were acquired, processed and displayed in real time at 20 frames/s (with 512 effective A-lines per frame). The image area is 4 mm × 2 mm with 512 × 400 pixels. Figure 7(a) shows the reconstructed 3D intensity images, and Fig. 7(b) shows the reconstructed 3D phase retardation images. The volume size is 400 × 512 × 100. Before doing 3D rendering, the tissue surface was found. The maximum intensity location along one A-line was taken as tissue surface. By sixth-order polynomial fitting of these data along the B scan, the smoothed tissue surface was obtained. Then the images were imported into a commercially available rendering software (Amira 4.1.0, Mercury Computer System, Chelmsford, Massachusetts). The images were filtered by median filter. Then the Voltex module in the Amira software was used to do the 3D rendering.

Conclusions
In conclusion, we have analyzed the real-time processing power of a quad-core PC and developed a real-time fiber-based PSOCT system. By using the parallel computing technique, real-time OCT data processing of an 80K A-line system is possible for our system. Real-time acquisition and processing of PSOCT data at A-line speed of 20K was demonstrated. The realtime processing ability could be easily improved by inclusion of more computing cores. With our current configuration (four-core CPU), only three cores were used for data processing. Another core was used for data acquisition and the other operations. If a PC with two quadcore CPUs is used, seven cores can be used for data processing, and the speed-up will be more than six times. The speed-up is enough for real-time data processing of most fast OCT systems. With emerging low-cost multicore parallel computing techniques, such as the GPU technique [36], a low-cost supercomputer with hundreds of cores is possible, and the processing power of a personal PC can be increased hundreds of times.  Software architecture for OCT and PSOCT systems. Three threads are shown here (the controlling thread is not shown). They are the data acquisition and fetching thread, processing thread, and displaying thread. The processing thread has three parallel processing sections that will process the data for different A lines.  Data processing flow for the PSOCT. LH (LV) is the acquired horizontally (vertically) polarized component of the interference signal for linear polarization input. CH (CV) is the acquired horizontally (vertically) polarized component of the interference signal for circular polarization input. LH′, LV′, CH′, and CV′ are the frequency resampled data for LH, LV, CH, and CV. SLH, SLV, SCH, and SCV are the fast Fourier transform of LH′, LV′, CH′, and CV ′. From SLH and SLV, we can get the Stokes parameters (LI, LQ, LU, and LV) for linear polarization input. Accordingly, we can get the stokes parameters (CI, CQ, CU, and CV) for circular polarization input. The intensity images were obtained from the average of LI and CI. The phase retardation is calculated according to the method in [5,6]. Time consumption for processing one single A-line of PSOCT data.  3D polarization sensitive images for pig tendon ex vivo: (a) intensity image, (b) phase retardation image.