Synthetic Dataset

Considerations and Justifications for Choice

When training the deep learning model (see Deep Learning (Multiclass Classification with CNN) for more information), generating a synthetic dataset helped to increase the size of the training and validation datasets and so improve model performance.

Subsequently, even after we moved away from deep learning models, which meant that large volumes of training data were no longer needed, we realised that the synthetic dataset could still be used to quantify model performance. The application of generated datasets to different cell counting model iterations is discussed here.

i. Dataset V1

Some of the 173 real images of yeast samples taken with the microscope were used for the dataset after image preprocessing. After preprocessing, each 40 x 40 piece was manually labelled with the number of cells inside it. The labels were collated in a table and saved as a CSV file.

Image of Yeast Sample under the Microscope before Preprocessing
Snapshot of CSV rows for Labelled Dataset Preparation
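
The step of cutting a preprocessed image into 40 x 40 pieces can be sketched as below. This is only an illustration: it assumes the pieces are non-overlapping and measured in pixels, and that leftover rows or columns that do not fill a whole piece are dropped. The helper name split_into_tiles and the blank demo image are hypothetical and not part of the original program.

```python
import numpy as np

def split_into_tiles(image, tile=40):
    """Split an H x W (x C) image into non-overlapping tile x tile pieces.

    Assumption: rows/columns that do not fill a whole tile are discarded.
    """
    h, w = image.shape[:2]
    tiles = []
    for r in range(h // tile):
        for c in range(w // tile):
            tiles.append(image[r * tile:(r + 1) * tile, c * tile:(c + 1) * tile])
    return tiles

# Hypothetical 120 x 200 greyscale image standing in for a preprocessed microscope image
demo = np.zeros((120, 200), dtype=np.uint8)
tiles = split_into_tiles(demo)
print(len(tiles), tiles[0].shape)  # 15 (40, 40)
```

Each piece returned here would then be labelled by hand with its cell count, as described above.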

—————————————————————

Identified Issues with Previous Dataset (Critique):

  • Some of the cells in the real images are blurry or partially cropped and may affect classification performance.
  • Since the boxes are drawn somewhat small compared to the size of the cell, there are a lot of edge effects, and many cells are cut off.
  • Hence, whether some boxes represent a whole cell is ambiguous and will limit model performance.
  • The use of a synthetic dataset enables control over such factors and helps to improve the quality of the training dataset, as we can generate images that are not cropped.
  • Nevertheless, the problem of cut-off cells would still exist during deployment. 

Modifications Made (Redesign):

  • Control the training data so that the model is fed an even proportion of images for each cell count
  • Select only clear cells to form the synthetic dataset
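
One way to obtain an even proportion of images per cell count is to decide the target counts up front and shuffle them, rather than drawing each count at random. The sketch below assumes this approach; balanced_counts, its parameters and the example range of 0 to 4 cells are hypothetical and do not come from the original program.

```python
import random

def balanced_counts(min_cells, max_cells, images_per_count, seed=None):
    """Return a shuffled list of target cell counts with the same number
    of images for every count between min_cells and max_cells inclusive."""
    counts = [n for n in range(min_cells, max_cells + 1)
              for _ in range(images_per_count)]
    random.Random(seed).shuffle(counts)
    return counts

# Hypothetical example: 3 images for each cell count from 0 to 4
targets = balanced_counts(0, 4, images_per_count=3, seed=0)
print(len(targets))                              # 15
print([targets.count(n) for n in range(0, 5)])   # [3, 3, 3, 3, 3]
```

Each entry of targets would then be passed to the generator as the number of cells to insert into one image.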

ii. Dataset V2.1 (Synthetic Dataset)

We modified a program from Kaggle that generates synthetic cells [1]. The images above are samples of these synthesised images. The number of cells wanted in each image can be specified, and the program randomly selects from the 22 prepared template cells (cropped from real microscope images). The selected cell is then inserted onto an empty background image, and this is repeated for as many cells as required. Since the inserted cells are derived from real images, they resemble actual cells very closely.

Modifications Made to Original Program:

  • The original program was further optimised to adhere to the Don't Repeat Yourself (DRY) principle by abstracting the cell insertion into a reusable function, insert_cell(cell_type, bground, x_coordinates, y_coordinates, counting).
  • The original program did not account for overlapping cells when generating synthetic images, so cells could be placed at overlapping coordinates. This would confuse classification training, as cells would appear distorted or merged.

This was solved by storing all available x and y coordinates in two respective sets. A set is an unordered collection of items. Every set element is unique (no duplicates) and must be immutable (cannot be changed).

A coordinate is chosen at random for the next cell to be inserted. For instance, for a cell 15 wide and 30 high, the coordinates (25, 56) are chosen. The program then checks whether every coordinate spanned by the width and height of the cell is still available. In this case, the sets are checked for the x-coordinates 25, 26, 27, ..., 40 and the y-coordinates 56, 57, 58, ..., 86. If any of these numbers is missing, it has already been taken by a previous cell and inserting here could result in an overlap. A new pair of coordinates is then chosen at random, and this repeats until coordinates that do not result in an overlap are found. Once found, these coordinates are used for the insertion, and the coordinates spanned by the cell are removed from the sets. Subsequent cell insertions choose from the updated sets, and the process repeats.

x = random.choice(tuple(x_coordinates))  # choose from set of coordinates
y = random.choice(tuple(y_coordinates))
h = shape[0]
w = shape[1]

# check if coordinates are available or occupied by previous cells
cur_x = {i for i in range(x, x+w+1)}
cur_y = {i for i in range(y, y+h+1)}

# keep looping if coordinates chosen are not satisfactory and overlap with existing cells
while not (cur_x.issubset(x_coordinates) and cur_y.issubset(y_coordinates)):
    x = random.choice(tuple(x_coordinates))
    y = random.choice(tuple(y_coordinates))
    cur_x = {i for i in range(x, x+w+1)}
    cur_y = {i for i in range(y, y+h+1)}

x_coordinates -= cur_x  # remove coordinates from set of remaining coordinates
y_coordinates -= cur_y
  • The original program did not label images with cell counts. We added code to create a pandas dataframe and appended rows with cell counts for every image synthesised. The dataframe is then exported and saved as a CSV file.

Since cells and the image are resizable, we can also customise the image dimensions should we use other models with different input image sizes.

Libraries Used
import pandas as pd
import numpy as np
import os
import cv2
import matplotlib.pyplot as plt
from skimage.io import imread, imshow
from skimage.transform import resize
import random
Preparing Template Cells and Background by Cropping from Real Images
######### Crop out Template Cells ############
number = 3
# take samples from original images to generate unique cell types 
sample_path = '/content/gdrive/MyDrive/CY2003_MnT/synthetic_dataset/original_images/'
sample = cv2.imread(sample_path + str(number) + '.png')
plt.imshow(sample)
 
# known coordinates for cropping exact cell (find by viewing image with axis grids)
y=2
x=33
h=7
w=9
 
# use numpy slicing to execute the crop
img = sample[y:y+h, x:x+w]
cv2.imwrite(sample_path + f'cropped/{str(number)}.png', img)
plt.imshow(img)
######### Crop out Background ############
sample_path = '/content/gdrive/MyDrive/CY2003_MnT/synthetic_dataset/original_images/'
fname = '1.png'
 
# known coordinates for background
y = 40
x = 38
h = 50
w = 50
 
image = cv2.imread(sample_path + fname)
plt.imshow(image)
bground = image[y:y+h, x:x+w].copy()
plt.imshow(bground)
 
cv2.imwrite(sample_path + 'background_' +  fname, bground)
Cell Insertion Function
def insert_cell(cell_type, bground, x_coordinates, y_coordinates, counting):
    """
    Insert a cell of the specified cell type into the background image and remove
    the cell's coordinates from the sets of available coordinates.

    @param cell_type: cell type from given template cells,
    @param bground: image background for cells to be pasted on,
    @param x_coordinates: set of x coordinates that have not been occupied by previous cells inserted,
    @param y_coordinates: set of y coordinates that have not been occupied by previous cells inserted,
    @param counting: running number of the cell being inserted (not used inside the function)
    @return bground: updated background with cell inserted
    @return x_coordinates: updated set of x coordinates with current cell x-coordinates removed
    @return y_coordinates: updated set of y coordinates with current cell y-coordinates removed
    """

    cell = cv2.imread(f'/content/gdrive/MyDrive/CY2003_MnT/synthetic_dataset/original_images/cropped/{str(cell_type)}.png')

    # add a random rotation to the cell (k is 0, 1 or 2 quarter turns)
    cell = np.rot90(cell, k=np.random.randint(0, 3))

    # take the shape after rotation, since rotating swaps height and width
    h, w = cell.shape[:2]

    x = random.choice(tuple(x_coordinates))
    y = random.choice(tuple(y_coordinates))

    # check if coordinates are available or occupied by previous cells
    cur_x = set(range(x, x + w + 1))
    cur_y = set(range(y, y + h + 1))

    # keep choosing until the cell fits; there is no retry limit,
    # so this can loop indefinitely once the image is too full
    while not (cur_x.issubset(x_coordinates) and cur_y.issubset(y_coordinates)):
        x = random.choice(tuple(x_coordinates))
        y = random.choice(tuple(y_coordinates))
        cur_x = set(range(x, x + w + 1))
        cur_y = set(range(y, y + h + 1))

    # remove the coordinates now occupied by this cell (the sets are updated in place)
    x_coordinates -= cur_x
    y_coordinates -= cur_y

    # paste the cell onto the background
    bground[y:y+h, x:x+w] = cell
    return bground, x_coordinates, y_coordinates
Synthetic Dataset Program
""" 
Variables for customisation
job_name : CSV and part of image name
csv_dir: Where CSV will be saved
"""
job_name = '8julywebcam' # what csv will be named
 
num_images_wanted = 8
min_cells_on_image = 60
max_cells_on_image = 100
 
# set max x and y to prevent cells from extending outside the background image
max_x = 1500
max_y = 1500
 
# store filename and cell count for csv making
filename = []
n_cells = []
csv_dir = '/content/gdrive/MyDrive/CY2003_MnT/synthetic_dataset/synthesised/'
 
# ==============================
 
for i in range(0, num_images_wanted):
    # randomly choose the number of cells to put in the image
    num_cells_on_image = np.random.randint(min_cells_on_image, max_cells_on_image+1)
 
 
    # Name the image.
    # The number of cells is included in the file name.
    image_name = job_name + '_' + str(i) + '_'  + str(num_cells_on_image) + '.png'
 
 
    # =========================
    # 1. Create the background
    # =========================
 
    path = '/content/gdrive/MyDrive/CY2003_MnT/synthetic_dataset/original_images/background_1.png'
 
    # read the image
    bground_comb = cv2.imread(path)
 
    # add random rotation to the background
    num_k = np.random.randint(0,3)
    bground_comb = np.rot90(bground_comb, k=num_k)
 
    # resize the background to match what we want
    bground_comb = cv2.resize(bground_comb, (1600, 1600))
 
 
    # ===============================
    # 2. Add cells to the background
    # ===============================
    # store coordinates to handle overlap
    x_coordinates = {i for i in range(0,max_x)}
    y_coordinates = {i for i in range(0, max_y)}
 
    for j in range(0, num_cells_on_image):
 
        path = '/content/gdrive/MyDrive/CY2003_MnT/synthetic_dataset/original_images/background_1.png'
 
        # read the image
        bground = cv2.imread(path)
        # add rotation to the background
        bground = np.rot90(bground, k=num_k)
        # resize the background to match what we want
        bground = cv2.resize(bground, (1600, 1600))
 
 
        # randomly choose a type of cell to add to the image
        cell_type = np.random.randint(1,11+1)
 
        # insert cell template in image and update set with leftover coordinates
        bground, x_coordinates, y_coordinates = insert_cell(cell_type, bground, x_coordinates, y_coordinates,j+1)
 
        bground_comb = np.maximum(bground_comb, bground)
 
    plt.imshow(bground_comb)
    path = '/content/gdrive/MyDrive/CY2003_MnT/synthetic_dataset/synthesised/8july_webcam/' + image_name
    filename.append(image_name)
    n_cells.append(str(num_cells_on_image))
    #bground_comb = cv2.resize(bground_comb, (74, 74))
    cv2.imwrite(path, bground_comb)
 
df = pd.DataFrame({
     'filename': filename,
     'n_cells': n_cells
    })
print(df)
print(csv_dir+ '{}.csv'.format(job_name))
df.to_csv(csv_dir+ '{}.csv'.format(job_name), index = False)

 

iii. Dataset V2.2 (Synthetic Dataset)

Identified Issues with Previous Dataset (Critique):

  • Storing available coordinates in sets means that each coordinate value can be used only once. Cells cannot share the same row (y-coordinate value) or the same column (x-coordinate value), since the coordinate is removed from the set after one cell insertion. This limits cell generation since fewer cells can be packed together.
  • Using the synthetic dataset to gauge cell counting performance is not reliable as the dataset does not mock real conditions well. It does not account for unfocused cells and uneven lighting. This causes cell counting performance to be inaccurately estimated.
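
The first issue above can be reproduced with a few lines of plain Python. The sketch below mirrors the V2.1 subset check on a hypothetical 100 x 100 canvas; the helper name fits and the example positions are illustrative only.

```python
def fits(x, y, w, h, free_x, free_y):
    """V2.1-style check: every column and every row spanned by the cell
    must still be present in the sets of free coordinates."""
    return (set(range(x, x + w + 1)) <= free_x and
            set(range(y, y + h + 1)) <= free_y)

# Hypothetical 100 x 100 canvas
free_x, free_y = set(range(100)), set(range(100))

# First cell: 15 wide, 30 high, placed at (25, 56)
first = fits(25, 56, 15, 30, free_x, free_y)
free_x -= set(range(25, 41))
free_y -= set(range(56, 87))

# Second cell is far to the right but spans some of the same rows:
# no pixels overlap, yet it is rejected because those rows are gone
same_rows = fits(70, 60, 15, 20, free_x, free_y)

# Third cell uses rows and columns that are both untouched: accepted
fresh = fits(70, 0, 15, 20, free_x, free_y)

print(first, same_rows, fresh)  # True False True
```

After one insertion, a whole band of rows and a whole band of columns become unusable, which is why only a sparse number of cells fit on each image.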

Modifications Made (Redesign):

  • Instead of storing coordinates in sets, use an H x W array to track occupied pixels, which allows cells to share row or column coordinates. More cells can now be packed into the same image.
  • Randomly apply a blur filter to inserted cells to mock unfocused cells. After testing median blurring and Gaussian blurring with different kernel sizes, a median blur filter with a kernel size of 9 was selected as it mocks unfocused cells the best.
  • Optimised the code and reduced the number of lines.
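
The H x W array idea can be illustrated in isolation as follows. This is a minimal sketch on a hypothetical 100 x 100 canvas, using a boolean NumPy array as the occupancy mask; try_place and the example positions are illustrative names and values, not the project's code.

```python
import numpy as np

def try_place(mask, x, y, w, h):
    """Mark the w x h rectangle at (x, y) as occupied if it is entirely free.
    Returns True when the cell was placed, False when it would overlap."""
    region = mask[y:y + h, x:x + w]   # a view into mask, not a copy
    if region.any():
        return False
    region[...] = True                # writes through to mask
    return True

mask = np.zeros((100, 100), dtype=bool)   # (height, width)

a = try_place(mask, 25, 56, 15, 30)  # first cell
b = try_place(mask, 70, 60, 15, 20)  # shares rows with the first cell, no pixel overlap
c = try_place(mask, 30, 70, 15, 20)  # genuinely overlaps the first cell

print(a, b, c)  # True True False
```

Only true pixel overlaps are rejected, so cells can sit side by side in the same rows or columns.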
 V2.1 Synthetic Image Dataset (Sparse number of cells inserted using Sets)
 V2.2 Synthetic Dataset (more cells can be inserted and packed more closely)
 V2.2 Synthetic Dataset with Random Blurring of Cells to Mock Unfocused Cells

 

Libraries Used
import os
import numpy as np
import matplotlib.pyplot as plt
import cv2
import copy
Improved Cell Insertion Function
def new_insert(N_min, N_max, dim = (1600, 1600)):
  """
  N_min and N_max: min and max no. of cells to try to insert
  dim: (height, width) of the generated image
  Returns the no. of cells actually inserted and the generated image.
  """

  mask = np.zeros(dim) # 1 where a pixel is already occupied by a cell

  N_cells = np.random.randint(N_min, N_max + 1)

  bg_path = '/content/gdrive/MyDrive/CY2003_MnT/synthetic_dataset/original_images/background_1.png'

  image = cv2.imread(bg_path)

  image = np.rot90(image, k = np.random.randint(1,4))

  image = cv2.resize(image, (dim[1], dim[0])) # cv2.resize expects (width, height)

  bg = copy.deepcopy(image)

  c = 0
  attempt_c = 0
  MAX_ATTEMPTS = 25

  while c < N_cells:

    cell = cv2.imread(f'/content/gdrive/MyDrive/CY2003_MnT/synthetic_dataset/original_images/cropped/{np.random.randint(1,12)}.png') # choose from 11 cell types

    # randomly blur the cell -- larger kernel size, larger extent of blurring
    if (np.random.randint(0,2)) and attempt_c < 2:
      cell = cv2.medianBlur(cell,9)

    test_x = np.random.randint(0, dim[1] - cell.shape[1])
    test_y = np.random.randint(0, dim[0] - cell.shape[0])

    # the position is free only if no pixel under the cell is marked in the mask
    if np.sum(mask[test_y: test_y + cell.shape[0], test_x: test_x + cell.shape[1]]) == 0:
      c += 1
      attempt_c = 0

      mask[test_y: test_y + cell.shape[0], test_x: test_x + cell.shape[1]] = 1
      # blend with the background patch at the insertion point
      image[test_y: test_y + cell.shape[0], test_x: test_x + cell.shape[1]] = np.maximum(cell, bg[test_y: test_y + cell.shape[0], test_x: test_x + cell.shape[1]])
    else:
      attempt_c += 1 # count as failed attempt
    if attempt_c >= MAX_ATTEMPTS: # avoid infinite loop after MAX_ATTEMPTS consecutive failures
      break
  return c, image # c can be smaller than N_cells if the loop stopped early
Synthetic Dataset Generation Program
job_name = "20july_webcam"
N_images = 10
out_dir = f'/content/gdrive/MyDrive/CY2003_MnT/synthetic_dataset/synthesised/{job_name}/medianBlur9'
os.makedirs(out_dir, exist_ok=True) # cv2.imwrite does not create missing folders

for i in range(N_images):
  _N, test_image = new_insert(100, 500, (600,600))

  cv2.imwrite(os.path.join(out_dir, f"{job_name}_{i}_{_N}.png"), test_image)
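
Because the cell count is written into each file name, the synthetic images can be used to score a counting model without a separate label file. The sketch below assumes mean absolute error as the metric, which is a choice made here for illustration and is not specified in the source. The helper names and the predicted values are hypothetical.

```python
import os

def true_count(filename):
    """Read the cell count from a name like '<job_name>_<index>_<count>.png'.
    The count is taken as the text after the last underscore."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    return int(stem.rsplit('_', 1)[1])

def mean_absolute_error(filenames, predicted):
    """Average absolute difference between labelled and predicted counts."""
    errors = [abs(true_count(f) - p) for f, p in zip(filenames, predicted)]
    return sum(errors) / len(errors)

files = ['20july_webcam_0_120.png', '20july_webcam_1_305.png']
predicted = [118, 311]  # hypothetical outputs of a cell counting model

print(mean_absolute_error(files, predicted))  # 4.0
```

Splitting from the right keeps the parsing correct even though the job name itself contains an underscore.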

   


References

[1] vbookshelf, “Synthetic cell images and masks for cell segmentation and counting,” Kaggle, 2019. [Online]. Available: https://www.kaggle.com/vbookshelf/synthetic-cell-images-and-masks-bbbc005-v1. [Accessed: 12-Jul-2021].
