Considerations and Justifications for Choice
When training the deep learning model (see Deep Learning (Multiclass Classification with CNN) for more information), generating a synthetic dataset was helpful for increasing the size of the training and validation dataset to improve model performance.
Later, even after we moved away from deep learning models, which removed the need for large volumes of training data, we realised that the synthetic dataset could still be used to quantify model performance. The application of generated datasets on different cell counting model iterations is discussed here.
i. Dataset V1
Some of the 173 real images captured with the microscope from yeast samples were used for the dataset after image preprocessing. Each 40 x 40 piece was then manually labelled with the number of cells it contained, and the labels were collated in a table saved as a CSV file.
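As a rough illustration of how an image can be cut into 40 x 40 pieces for labelling, the sketch below tiles an array with NumPy. The function name `split_into_tiles`, the choice to drop incomplete edge tiles, and the dummy image size are all assumptions made for this example; the actual preprocessing used for Dataset V1 may differ.

```python
import numpy as np


def split_into_tiles(image, tile=40):
    """Split an H x W (x C) image into non-overlapping tile x tile pieces.

    Rows and columns that do not fill a whole tile are dropped
    (an assumption for this sketch; padding would be another option).
    """
    h, w = image.shape[:2]
    tiles = []
    for top in range(0, h - tile + 1, tile):
        for left in range(0, w - tile + 1, tile):
            tiles.append(image[top:top + tile, left:left + tile])
    return tiles


# Dummy stand-in for a preprocessed microscope image (100 rows, 130 columns).
demo = np.zeros((100, 130, 3), dtype=np.uint8)
pieces = split_into_tiles(demo)
print(len(pieces), pieces[0].shape)
```

Each returned piece could then be saved and given one row in the label CSV.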
—————————————————————
Identified Issues with Previous Dataset (Critique):
- Some of the cells in the real images are blurry or partially cropped and may affect classification performance.
- Since the boxes are drawn somewhat small compared to the size of the cell, there are a lot of edge effects, and many cells are cut off.
- Hence, whether some boxes represent a whole cell is ambiguous and will limit model performance.
- The use of a synthetic dataset enables control over such factors and helps to improve the quality of the training dataset, as we can generate images that are not cropped.
- Nevertheless, the problem of cut-off cells would still exist during deployment.
Modifications Made (Redesign):
- Control the training data so that images with different cell counts appear in even proportions
- Select only clearly visible cells to form the synthetic dataset
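One way to keep the proportions even is to decide the cell count of every image up front, instead of drawing each count independently at random. The sketch below is a minimal illustration of that idea; the helper name `balanced_cell_counts`, the class range 0 to 4 and the number of images per class are hypothetical values chosen for the example.

```python
import random
from collections import Counter


def balanced_cell_counts(classes, images_per_class, seed=None):
    """Return a shuffled list of cell counts in which every class
    appears exactly images_per_class times."""
    counts = [c for c in classes for _ in range(images_per_class)]
    random.Random(seed).shuffle(counts)
    return counts


# Hypothetical setup: classes 0 to 4 cells, 3 images for each class.
counts = balanced_cell_counts(classes=range(0, 5), images_per_class=3, seed=0)
print(Counter(counts))
```

Each entry of `counts` would then be passed to the generator as the number of cells wanted for one image.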
ii. Dataset V2.1 (Synthetic Dataset)
We modified a Kaggle program that generates synthetic cell images [1]. The images above are samples of these synthesised images. The number of cells wanted in each image can be specified, and the program randomly selects from the 22 prepared template cells (cropped from real microscope images). Each selected cell is inserted onto an empty background image, and this is repeated for as many cells as required. Since the inserted cells are derived from real images, they resemble actual cells very closely.
Modifications Made to Original Program:
- The original program was refactored to follow the Don't Repeat Yourself (DRY) principle by abstracting the cell insertion into a reusable function:
insert_cell(cell_type, bground, x_coordinates, y_coordinates, counting)
- The original program did not account for overlapping cells when generating synthetic images, so cells could be placed at overlapping coordinates. This would confuse classification training, as cells would appear distorted or merged.
This was solved by storing all available x and y coordinates in two separate sets. A set is an unordered collection of unique, immutable items. A coordinate is chosen at random for the next cell, and a check confirms that every coordinate spanned by the cell's width and height is still available. For instance, if the coordinates (25, 56) are chosen for a cell 15 wide and 30 tall, the sets are checked for the x-coordinates 25, 26, 27, ..., 40 and the y-coordinates 56, 57, 58, ..., 86. If any of these numbers are missing, they have already been taken by a previous cell and inserting here would cause an overlap, so a new pair of coordinates is chosen at random. This repeats until coordinates that do not cause an overlap are found. The cell is then inserted there, and its coordinates are removed from the sets. Any later insertions choose from the updated sets, and the process repeats.
x = random.choice(tuple(x_coordinates))  # choose from set of coordinates
y = random.choice(tuple(y_coordinates))
h = shape[0]
w = shape[1]
# check if coordinates are available or occupied by previous cells
cur_x = set(range(x, x + w + 1))
cur_y = set(range(y, y + h + 1))
# keep looping if coordinates chosen are not satisfactory and overlap with existing cells
while not (cur_x.issubset(x_coordinates) and cur_y.issubset(y_coordinates)):
    x = random.choice(tuple(x_coordinates))
    y = random.choice(tuple(y_coordinates))
    cur_x = set(range(x, x + w + 1))
    cur_y = set(range(y, y + h + 1))
x_coordinates -= cur_x  # remove coordinates from set of remaining coordinates
y_coordinates -= cur_y
- The original program did not label images with cell counts. We added code to create a pandas dataframe and appended rows with cell counts for every image synthesised. The dataframe is then exported and saved as a CSV file.
Since cells and the image are resizable, we can also customise the image dimensions should we use other models with different input image sizes.
Libraries Used
import pandas as pd
import numpy as np
import os
import cv2
import matplotlib.pyplot as plt
from skimage.io import imread, imshow
from skimage.transform import resize
import random
Preparing Template Cells and Background by Cropping from Real Images
######### Crop out Template Cells ############
number = 3
# take samples from original images to generate unique cell types
sample_path = '/content/gdrive/MyDrive/CY2003_MnT/synthetic_dataset/original_images/'
sample = cv2.imread(sample_path + str(number) + '.png')
plt.imshow(sample)
# known coordinates for cropping exact cell (find by viewing image with axis grids)
y = 2
x = 33
h = 7
w = 9
# use numpy slicing to execute the crop
img = sample[y:y+h, x:x+w]
cv2.imwrite(sample_path + f'cropped/{number}.png', img)
plt.imshow(img)

######### Crop out Background ############
sample_path = '/content/gdrive/MyDrive/CY2003_MnT/synthetic_dataset/original_images/'
fname = '1.png'
# known coordinates for background
y = 40
x = 38
h = 50
w = 50
image = cv2.imread(sample_path + fname)
plt.imshow(image)
bground = image[y:y+h, x:x+w].copy()
plt.imshow(bground)
cv2.imwrite(sample_path + 'background_' + fname, bground)
Cell Insertion Function
def insert_cell(cell_type, bground, x_coordinates, y_coordinates, counting):
    """
    Insert a cell of the specified cell type into the background image and
    remove the cell's coordinates from the sets of coordinates for tracking.
    @param cell_type: cell type from given template cells,
    @param bground: image background for cells to be pasted on,
    @param x_coordinates: set of x coordinates that have not been occupied by previous cells inserted,
    @param y_coordinates: set of y coordinates that have not been occupied by previous cells inserted,
    @param counting: counter incremented on every retry (local only, not returned)
    @return bground: updated background with cell inserted
    @return x_coordinates: updated set of x coordinates with current cell x-coordinates removed
    @return y_coordinates: updated set of y coordinates with current cell y-coordinates removed
    """
    cell = cv2.imread(f'/content/gdrive/MyDrive/CY2003_MnT/synthetic_dataset/original_images/cropped/{cell_type}.png')
    # add a random rotation to the cell (0, 1 or 2 quarter turns)
    cell = np.rot90(cell, k=np.random.randint(0, 3))
    shape = cell.shape
    x = random.choice(tuple(x_coordinates))
    y = random.choice(tuple(y_coordinates))
    h = shape[0]
    w = shape[1]
    # check if coordinates are available or occupied by previous cells
    cur_x = set(range(x, x + w + 1))
    cur_y = set(range(y, y + h + 1))
    # keep choosing new coordinates until the cell does not overlap existing cells
    while not (cur_x.issubset(x_coordinates) and cur_y.issubset(y_coordinates)):
        x = random.choice(tuple(x_coordinates))
        y = random.choice(tuple(y_coordinates))
        cur_x = set(range(x, x + w + 1))
        cur_y = set(range(y, y + h + 1))
        counting += 1
    # remove the coordinates used by this cell from the sets of remaining coordinates
    x_coordinates -= cur_x
    y_coordinates -= cur_y
    # paste the cell onto the background
    bground[y:y+h, x:x+w] = cell
    return bground, x_coordinates, y_coordinates
Synthetic Dataset Program
"""
Variables for customisation
job_name : CSV and part of image name
csv_dir: Where CSV will be saved
"""
job_name = '8julywebcam'  # what csv will be named
num_images_wanted = 8
min_cells_on_image = 60
max_cells_on_image = 100
# set max x and y to prevent cells from extending outside the background image
max_x = 1500
max_y = 1500
# store filename and cell count for csv making
filename = []
n_cells = []
csv_dir = '/content/gdrive/MyDrive/CY2003_MnT/synthetic_dataset/synthesised/'
# ==============================
for i in range(0, num_images_wanted):
    # randomly choose the number of cells to put in the image
    num_cells_on_image = np.random.randint(min_cells_on_image, max_cells_on_image+1)
    # Name the image.
    # The number of cells is included in the file name.
    image_name = job_name + '_' + str(i) + '_' + str(num_cells_on_image) + '.png'
    # =========================
    # 1. Create the background
    # =========================
    path = '/content/gdrive/MyDrive/CY2003_MnT/synthetic_dataset/original_images/background_1.png'
    # read the image
    bground_comb = cv2.imread(path)
    # add random rotation to the background
    num_k = np.random.randint(0, 3)
    bground_comb = np.rot90(bground_comb, k=num_k)
    # resize the background to match what we want
    bground_comb = cv2.resize(bground_comb, (1600, 1600))
    # ===============================
    # 2. Add cells to the background
    # ===============================
    # store coordinates to handle overlap
    x_coordinates = {i for i in range(0, max_x)}
    y_coordinates = {i for i in range(0, max_y)}
    for j in range(0, num_cells_on_image):
        path = '/content/gdrive/MyDrive/CY2003_MnT/synthetic_dataset/original_images/background_1.png'
        # read the image
        bground = cv2.imread(path)
        # add rotation to the background
        bground = np.rot90(bground, k=num_k)
        # resize the background to match what we want
        bground = cv2.resize(bground, (1600, 1600))
        # randomly choose a type of cell to add to the image
        cell_type = np.random.randint(1, 11+1)
        # insert cell template in image and update set with leftover coordinates
        bground, x_coordinates, y_coordinates = insert_cell(cell_type, bground, x_coordinates, y_coordinates, j+1)
        bground_comb = np.maximum(bground_comb, bground)
    plt.imshow(bground_comb)
    path = '/content/gdrive/MyDrive/CY2003_MnT/synthetic_dataset/synthesised/8july_webcam/' + image_name
    filename.append(image_name)
    n_cells.append(str(num_cells_on_image))
    #bground_comb = cv2.resize(bground_comb, (74, 74))
    cv2.imwrite(path, bground_comb)
df = pd.DataFrame({
    'filename': filename,
    'n_cells': n_cells
})
print(df)
print(csv_dir + '{}.csv'.format(job_name))
df.to_csv(csv_dir + '{}.csv'.format(job_name), index=False)
iii. Dataset V2.2 (Synthetic Dataset)
Identified Issues with Previous Dataset (Critique):
- Storing available coordinates in sets means each coordinate value can be used only once. Because a value is removed from its set after one cell insertion, no two cells can share a row (y-coordinate value) or a column (x-coordinate value). This limits cell generation since fewer cells can be packed into an image.
- Using the synthetic dataset to gauge cell counting performance is not reliable because the dataset does not mock real conditions well: it does not account for unfocused cells or uneven lighting. Cell counting performance is therefore estimated inaccurately.
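The first issue can be reproduced in a few lines. The sketch below mirrors the set-based check of Dataset V2.1 on a small 100 x 100 canvas; the helper names `fits` and `reserve` and the cell sizes are made up for this illustration and are not part of the original program.

```python
def fits(x, y, w, h, free_x, free_y):
    """Set-based check: every column and row the cell would span
    must still be present in the sets of free coordinates."""
    return (set(range(x, x + w + 1)) <= free_x
            and set(range(y, y + h + 1)) <= free_y)


def reserve(x, y, w, h, free_x, free_y):
    """Remove the cell's columns and rows from the free sets (in place)."""
    free_x -= set(range(x, x + w + 1))
    free_y -= set(range(y, y + h + 1))


free_x = set(range(100))
free_y = set(range(100))

# First cell: 10 x 10 placed at (0, 0).
reserve(0, 0, 10, 10, free_x, free_y)

# A second cell at (50, 0) would not touch the first one,
# but it shares rows 0 to 10, which are no longer in free_y.
same_row = fits(50, 0, 10, 10, free_x, free_y)

# A cell at (50, 50) shares neither rows nor columns with the first cell.
diagonal = fits(50, 50, 10, 10, free_x, free_y)

print(same_row, diagonal)
```

The placement at (50, 0) is rejected even though no pixels overlap, which is why only a limited number of cells fits into one image with this approach.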
Modifications Made (Redesign):
- Instead of storing coordinates in sets, use an H x W array so that cells can share row or column coordinates as long as they do not overlap. More cells can now be packed into the same image.
- Randomly apply a blur filter to inserted cells to mock unfocused cells. After testing median blurring and Gaussian blurring with different kernel sizes, a median blur filter with a kernel size of 9 was selected as it mocks unfocused cells the best.
- Optimise the code and reduce the number of lines.
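The array-based occupancy check from the first point can be sketched with NumPy alone. This is a minimal illustration, not the project code: the helper name `try_place`, the boolean mask and the 100 x 100 canvas are assumptions chosen to keep the example small.

```python
import numpy as np


def try_place(mask, x, y, w, h):
    """Reserve the h x w rectangle at (x, y) if none of its pixels is taken.

    Returns True when the rectangle was free and has now been marked.
    """
    region = mask[y:y + h, x:x + w]  # basic slicing gives a view into mask
    if region.any():
        return False
    region[...] = True  # writing to the view marks the pixels in mask
    return True


mask = np.zeros((100, 100), dtype=bool)

first = try_place(mask, 0, 0, 10, 10)      # empty canvas, succeeds
same_row = try_place(mask, 50, 0, 10, 10)  # same rows, different columns
overlap = try_place(mask, 5, 5, 10, 10)    # intersects the first cell

print(first, same_row, overlap, int(mask.sum()))
```

Because occupancy is tracked per pixel, a cell is only rejected when it would actually cover an existing cell.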
Libraries Used
import os
import numpy as np
import matplotlib.pyplot as plt
import cv2
import copy
Improved Cell Insertion Function
def new_insert(N_min, N_max, dim=(1600, 1600)):
    """
    N_min and N_max: min and max no. of cells
    dim: image size; use square sizes, since the mask reads dim as (rows, columns)
         while cv2.resize reads it as (width, height)
    Returns the number of cells actually inserted and the image.
    """
    mask = np.zeros(dim)  # 1 where a cell has already been inserted
    N_cells = np.random.randint(N_min, N_max + 1)
    bg_path = '/content/gdrive/MyDrive/CY2003_MnT/synthetic_dataset/original_images/background_1.png'
    image = cv2.imread(bg_path)
    image = np.rot90(image, k=np.random.randint(1, 4))
    image = cv2.resize(image, dim)
    bg = copy.deepcopy(image)
    c = 0          # cells inserted so far
    attempt_c = 0  # consecutive failed attempts
    MAX_ATTEMPTS = 25
    while c < N_cells:
        # choose from 11 cell types
        cell = cv2.imread(f'/content/gdrive/MyDrive/CY2003_MnT/synthetic_dataset/original_images/cropped/{np.random.randint(1,12)}.png')
        # randomly blur the cell -- larger kernel size, larger extent of blurring
        if np.random.randint(0, 2) and attempt_c < 2:
            cell = cv2.medianBlur(cell, 9)
        h, w = cell.shape[:2]
        test_x = np.random.randint(0, dim[1] - w)
        test_y = np.random.randint(0, dim[0] - h)
        # insert only if no pixel of the target region is occupied
        if np.sum(mask[test_y:test_y + h, test_x:test_x + w]) == 0:
            c += 1
            attempt_c = 0
            mask[test_y:test_y + h, test_x:test_x + w] = 1
            # blend against the top-left patch of the background, as in the original code
            image[test_y:test_y + h, test_x:test_x + w] = np.maximum(cell, bg[:h, :w])
        else:
            attempt_c += 1  # count as failed attempt
            if attempt_c >= MAX_ATTEMPTS:  # avoid infinite loop if unsuccessful for ___ times
                break
    # return the count actually inserted so the label stays correct if the loop ended early
    return c, image
Synthetic Dataset Generation Program
job_name = "20july_webcam"
N_images = 10
for i in range(N_images):
    _N, test_image = new_insert(100, 500, (600, 600))
    cv2.imwrite(os.path.join(f'/content/gdrive/MyDrive/CY2003_MnT/synthetic_dataset/synthesised/{job_name}/medianBlur9', f"{job_name}_{i}_{_N}.png"), test_image)
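Since the cell count is written into each file name, the ground truth can be read back when the images are later used to assess a counting model. The sketch below shows one possible way to do this; the helper names, the example file names, the predicted counts and the use of mean absolute error as the metric are all assumptions for illustration, not results or methods from this project.

```python
from pathlib import Path


def count_from_filename(path):
    """Read the cell count from a name of the form
    '<job_name>_<index>_<count>.png'. The job name may itself contain
    underscores, so only the last underscore is used for splitting."""
    return int(Path(path).stem.rsplit("_", 1)[1])


def mean_absolute_error(true_counts, predicted_counts):
    """Average absolute difference between true and predicted counts."""
    pairs = list(zip(true_counts, predicted_counts))
    return sum(abs(t - p) for t, p in pairs) / len(pairs)


# Made-up file names and model outputs, for illustration only.
files = ["20july_webcam_0_137.png", "20july_webcam_1_402.png"]
truth = [count_from_filename(f) for f in files]
predicted = [130, 410]  # outputs of a hypothetical counting model

print(truth, mean_absolute_error(truth, predicted))
```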
References
[1] vbookshelf, “Synthetic cell images and masks for cell segmentation and counting,” Kaggle, 2019. [Online]. Available: https://www.kaggle.com/vbookshelf/synthetic-cell-images-and-masks-bbbc005-v1. [Accessed: 12-Jul-2021].