Deep Learning (Multiclass Classification with CNN)

Considerations and Justifications for Choice

After conducting preliminary research on common cell-counting models [1][2], we settled on experimenting with deep learning first.

We decided to use a classification approach. By cropping each microscope image into multiple pieces, the number of cells in each piece falls between 0 and 5. The model is trained to predict the cell count for each piece, and these counts are summed to obtain the total number of cells for the whole microscope image.
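As a rough sketch of this tiling-and-summing idea (the function below is illustrative only; its name, the tile size, and the trained model passed in are assumptions rather than our final code):

import numpy as np
import cv2

def count_cells(image, model, tile=50, out_size=(40, 40)):
    """Illustrative sketch: crop a greyscale image into tile x tile pieces,
    classify each piece, and sum the predicted per-piece counts."""
    h, w = image.shape[:2]
    total = 0
    for y in range(0, h - tile + 1, tile):
        for x in range(0, w - tile + 1, tile):
            piece = image[y:y + tile, x:x + tile]
            piece = cv2.resize(piece, out_size).astype("float32") / 255.0
            piece = piece.reshape(1, out_size[0], out_size[1], 1)   # batch of one greyscale image
            total += int(np.argmax(model.predict(piece)))           # predicted class = cell count
    return total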

 

Pipeline

1. Preprocessing
Image Preprocessing of Cropped 50 x 50 Pieces
  1. Splitting of the 1500 x 1500 Image into Multiple 50 x 50 Pieces
  2. Convert Image to Grayscale
    1. Colour is not relevant to the model, as yeast cells have distinct features even in grayscale that enable detection. Hence, colour is not required to identify a cell.
    2. Although there could be some colour information due to chromatic aberration from the lens, the chromatic aberration mostly occurs along the circular edge of the image, and there is significant distortion here.
    3. Since we are cropping a square from the centre of the image, the colour information can be safely discarded. 
  3. Apply Median Blurring Filter for Artefact Removal / Image Smoothing
    1. The median of all the pixels under the kernel area is derived and the central element is replaced with this median value. This is highly effective against salt-and-pepper noise in an image. Its kernel size should be a positive odd integer [3].
  4. Invert Image and Apply Intensity Threshold (Threshold of 55)
    1. After inversion, pixels that were very bright in the original image fall below the threshold and are set to zero, removing bright regions that could otherwise interfere with the results.
  5. Resize to 40 x 40 Pieces
    1. The 40×40 size is somewhat arbitrary. Since this was a prototype, we anticipated that we would have to rescale the images in the future anyway.
    2. For reference, the classic MNIST handwriting dataset uses 28×28 pixels, so 40×40 is in the same order of magnitude.
  6. Normalise
    1. The pixel values of the whole image are normalised to the range [0, 1]. This is standard practice for machine learning inputs.
import cv2

def preprocess(image_path, radius=750, threshold=55):
    """
    Performs pre-defined pre-processing steps for microscope photos taken on June 3 by Yingyue.

    :param      image_path:  The image file path to jpg
    :type       image_path:  a string with the relative / absolute image path
    :param      radius: draws a square of 2r * 2r around the centre of the image to crop, default of 750
    :param      threshold: greyscale intensity threshold (since cells are front lit, cell membranes are dark)

    Returns a np array (float32, values in [0, 1]) that is cropped at the centre (2*radius by 2*radius),
    filtered, and resized to 40 x 40.
    """

    img = cv2.imread(image_path, 0)  # read as greyscale
    _height, _width = img.shape

    # crop a (2*radius by 2*radius) square from the centre of the image
    img_c = img[_height//2 - radius:_height//2 + radius, _width//2 - radius:_width//2 + radius]

    img_c = cv2.medianBlur(img_c, 3)  # median blur to remove salt-and-pepper noise

    img_c = ~img_c  # invert

    img_c[img_c < threshold] = 0  # removes all the bright regions in the original image

    img_c = cv2.resize(img_c, dsize=(40, 40), interpolation=cv2.INTER_CUBIC)  # resize to 40 x 40

    img_c = cv2.normalize(img_c, None, alpha=0, beta=1, norm_type=cv2.NORM_MINMAX, dtype=cv2.CV_32F)  # normalise to [0, 1]

    return img_c

2. Training, Validation, and Test Dataset Preparation

i. Dataset V1

A subset of the 173 real microscope images of yeast samples was used for the dataset after image preprocessing was conducted. After preprocessing, each 40 x 40 piece was manually labelled with the number of cells it contained, and the labels were collated in a table saved as a CSV file.
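For reference, the label table can be loaded back as a dataframe as shown below (the column names filename and n_cells match the CSVs used later in this report; the file path is illustrative):

import pandas as pd

labels = pd.read_csv('training_data/3n4_50x50.csv')   # columns: filename, n_cells
print(labels.head())                                  # one row per labelled 40 x 40 piece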

Image of Yeast Sample under the Microscope before Preprocessing
Snapshot of CSV rows for Labelled Dataset Preparation

—————————————————————

Identified Issues with Previous Dataset (Critique):

  • Some of the cells in the real images are blurry or partially cropped and may affect classification performance.
  • Since the boxes are drawn somewhat small compared to the size of the cell, there are a lot of edge effects, and many cells are cut off.
  • Hence, whether some boxes represent a whole cell is ambiguous and will limit model performance.
  • The use of a synthetic dataset enables control over such factors and helps to improve the quality of the training dataset, as we can generate images that are not cropped.
  • Nevertheless, the problem of cut-off cells would still exist during deployment. 

Modifications Made (Redesign):

  • Control the training data fed to the model so that images with different cell counts appear in even proportions
  • Select only clearly visible cells to build the synthetic dataset

ii. Dataset V2.1 (Synthetic Dataset)

We modified a program from Kaggle which generates synthetic cells [5]. The images above are samples of these synthesised images. The number of cells wanted for each image could be specified, and the program would randomly select from any of the 22 prepared template cells (cropped from real microscope images). The selected cell would then be inserted onto an empty background image, and this would be repeated for as many cells as required. Since the inserted cells are derived from real images, they resemble actual cells very closely.

Modifications Made to Original Program:

  • The original program was further optimised to adhere to the Don't Repeat Yourself (DRY) principle by abstracting the cell insertion into a reusable function insert_cell(cell_type, bground, x_coordinates, y_coordinates, counting)
  • The original program did not account for overlapping cells when generating synthetic images, so cells could be placed at overlapping coordinates. This would confuse the classification training, as overlapping cells appear distorted or merged.

This was solved by storing all available x and y coordinates in respective sets. A set is an unordered collection of unique, immutable items. A coordinate is randomly chosen for the next cell to be inserted; for instance, for a cell of 15 x 30 pixels, the coordinates (25, 56) might be chosen. The sets are then checked to see whether all coordinates spanning the width and height of the cell to be inserted are still available; in this case, x-coordinates 25, 26, 27, 28, 29, …, 40 and y-coordinates 56, 57, 58, 59, …, 86. If any of these values are missing, the coordinates have already been occupied by previous cells and the insertion would overlap, so a new set of coordinates is randomly selected, repeating until coordinates that do not overlap are found. Once non-overlapping coordinates are found, they are used for the insertion, and the occupied coordinates are removed from the sets. Subsequent cell insertions then choose from the updated sets, and the process repeats.

x = random.choice(tuple(x_coordinates))  # choose from set of available coordinates
y = random.choice(tuple(y_coordinates))
h = shape[0]
w = shape[1]

# check if coordinates are available or occupied by previous cells
cur_x = {i for i in range(x, x + w + 1)}
cur_y = {i for i in range(y, y + h + 1)}

# keep looping while the chosen coordinates overlap with existing cells
while not (cur_x.issubset(x_coordinates) and cur_y.issubset(y_coordinates)):
    x = random.choice(tuple(x_coordinates))
    y = random.choice(tuple(y_coordinates))
    cur_x = {i for i in range(x, x + w + 1)}
    cur_y = {i for i in range(y, y + h + 1)}

x_coordinates -= cur_x  # remove occupied coordinates from the sets of remaining coordinates
y_coordinates -= cur_y
  • The original program did not label images with cell counts. We added code to create a pandas dataframe, appended a row with the cell count for every image synthesised, and exported the dataframe as a CSV file.

Since cells and the image are resizable, we can also customise the image dimensions should we use other models with different input image sizes.

Libraries Used
import pandas as pd
import numpy as np
import os
import cv2
import matplotlib.pyplot as plt
from skimage.io import imread, imshow
from skimage.transform import resize
import random
Preparing of Template Cells and Background by Cropping from Real Images
######### Crop out Template Cells ############
number = 3
# take samples from original images to generate unique cell types 
sample_path = '/content/gdrive/MyDrive/CY2003_MnT/synthetic_dataset/original_images/'
sample = cv2.imread(sample_path + str(number) + '.png')
plt.imshow(sample)
 
# known coordinates for cropping exact cell (find by viewing image with axis grids)
y=2
x=33
h=7
w=9
 
# use numpy slicing to execute the crop
img = sample[y:y+h, x:x+w]
cv2.imwrite(sample_path + f'cropped/{str(number)}.png', img)
plt.imshow(img)

######### Crop out Background ############
sample_path = '/content/gdrive/MyDrive/CY2003_MnT/synthetic_dataset/original_images/'
fname = '1.png'
 
# known coordinates for background
y = 40
x = 38
h = 50
w = 50
 
image = cv2.imread(sample_path + fname)
plt.imshow(image)
bground = image[y:y+h, x:x+w].copy()
plt.imshow(bground)
 
cv2.imwrite(sample_path + 'background_' +  fname, bground)
Cell Insertion Function
def insert_cell(cell_type, bground, x_coordinates, y_coordinates, counting):
            """ 
            Insert a cell of specified cell type into background image and removes cell coordinates from sets of coordinates for tracking.
 
            @param cell_type: cell type from given template cells, 
            @param bground: image background for cells to be pasted on,
            @param x_coordinates: set of x coordinates that have not been occupied by previous cells inserted,
            @param y_coordinates: set of y coordinates that have not been occupied by previous cells inserted
            @return bground: updated background with cell inserted
            @return x_coordinate: updated set of x coordinates with current cell x-coordinates removed
            @return y_coordinate: updated set of y coordinates with current cell y-coordinates removed
            """
 
            cell = cv2.imread(f'/content/gdrive/MyDrive/CY2003_MnT/synthetic_dataset/original_images/cropped/{str(cell_type)}.png')
 
            # add a random rotation to the cell
            cell = np.rot90(cell, k=np.random.randint(0,3))
 
            shape = cell.shape
 
            x= random.choice(tuple(x_coordinates))
            y= random.choice(tuple(y_coordinates))
            h=shape[0]
            w=shape[1]
 
            # check if coordinates are available or occupied by previous cells
            cur_x = {i for i in range(x, x+w+1)}
            cur_y = {i for i in range(y, y+h+1)}
 
            while not (cur_x.issubset(x_coordinates) and cur_y.issubset(y_coordinates)):
              x= random.choice(tuple(x_coordinates))
              y= random.choice(tuple(y_coordinates))
              cur_x = {i for i in range(x, x+w+1)}
              cur_y = {i for i in range(y, y+h+1)}
            
            counting += 1
            x_coordinates -= cur_x
            y_coordinates -= cur_y
 
            bground[y:y+h, x:x+w] = 0
            bground[y:y+h, x:x+w] = cell
            return bground, x_coordinates, y_coordinates
Synthetic Dataset Program
""" 
Variables for customisation
job_name : CSV and part of image name
csv_dir: Where CSV will be saved
"""
job_name = '8julywebcam' # what csv will be named
 
num_images_wanted = 8
min_cells_on_image = 60
max_cells_on_image = 100
 
# set max x and y to prevent cells from extending outside the background image
max_x = 1500
max_y = 1500
 
# store filename and cell count for csv making
filename = []
n_cells = []
csv_dir = '/content/gdrive/MyDrive/CY2003_MnT/synthetic_dataset/synthesised/'
 
# ==============================
 
for i in range(0, num_images_wanted):
    # randomly choose the number of cells to put in the image
    num_cells_on_image = np.random.randint(min_cells_on_image, max_cells_on_image+1)
 
 
    # Name the image.
    # The number of cells is included in the file name.
    image_name = job_name + '_' + str(i) + '_'  + str(num_cells_on_image) + '.png'
 
 
    # =========================
    # 1. Create the background
    # =========================
 
    path = '/content/gdrive/MyDrive/CY2003_MnT/synthetic_dataset/original_images/background_1.png'
 
    # read the image
    bground_comb = cv2.imread(path)
 
    # add random rotation to the background
    num_k = np.random.randint(0,3)
    bground_comb = np.rot90(bground_comb, k=num_k)
 
    # resize the background to match what we want
    bground_comb = cv2.resize(bground_comb, (1600, 1600))
 
 
    # ===============================
    # 2. Add cells to the background
    # ===============================
    # store coordinates to handle overlap
    x_coordinates = {i for i in range(0,max_x)}
    y_coordinates = {i for i in range(0, max_y)}
 
    for j in range(0, num_cells_on_image):
 
        path = '/content/gdrive/MyDrive/CY2003_MnT/synthetic_dataset/original_images/background_1.png'
 
        # read the image
        bground = cv2.imread(path)
        # add rotation to the background
        bground = np.rot90(bground, k=num_k)
        # resize the background to match what we want
        bground = cv2.resize(bground, (1600, 1600))
 
 
        # randomly choose a type of cell to add to the image
        cell_type = np.random.randint(1,11+1)
 
        # insert cell template in image and update set with leftover coordinates
        bground, x_coordinates, y_coordinates = insert_cell(cell_type, bground, x_coordinates, y_coordinates,j+1)
 
        bground_comb = np.maximum(bground_comb, bground)
 
    plt.imshow(bground_comb)
    path = '/content/gdrive/MyDrive/CY2003_MnT/synthetic_dataset/synthesised/8july_webcam/' + image_name
    filename.append(image_name)
    n_cells.append(str(num_cells_on_image))
    #bground_comb = cv2.resize(bground_comb, (74, 74))
    cv2.imwrite(path, bground_comb)
 
df = pd.DataFrame({
     'filename': filename,
     'n_cells': n_cells
    })
print(df)
print(csv_dir+ '{}.csv'.format(job_name))
df.to_csv(csv_dir+ '{}.csv'.format(job_name), index = False)

 

iii. Dataset V2.2 (Synthetic Dataset)

Identified Issues with Previous Dataset (Critique):

  • Storing available coordinates in sets means each coordinate value can be used only once. Because the occupied x- and y-ranges are removed from the sets after an insertion, later cells cannot share a row (y-coordinate) or a column (x-coordinate) with any earlier cell. This limits cell generation, since fewer cells can be packed together.
  • Using the synthetic dataset to gauge cell-counting performance is not reliable, as the dataset does not mimic real conditions well: it does not account for unfocused cells or uneven lighting. This causes cell-counting performance to be inaccurately estimated.

Modifications Made (Redesign):

  • Instead of storing coordinates in sets, use an H x W occupancy array so that cells can share row or column coordinates. More cells can now be packed into the same image.
  • Randomly apply a blur filter to inserted cells to mimic unfocused cells. After testing median and Gaussian blurring with different kernel sizes, a median blur with a kernel size of 9 was selected as it mimics unfocused cells best.
  • Optimised the code and reduced the number of lines.
 V2.1 Synthetic Image Dataset (Sparse number of cells inserted using Sets)
 V2.2 Synthetic Dataset (more cells can be inserted and packed more closely)
 V2.2 Synthetic Dataset with Random Blurring of Cells to Mock Unfocused Cells

 

Libraries Used
import numpy as np
import matplotlib.pyplot as plt
import cv2
import copy
import os  # needed for os.path.join when saving the synthesised images
Improved Cell Insertion Function
def new_insert(N_min, N_max, dim = (1600, 1600)):
  """
  N_min and N_max: max no. of cells
  """
 
  mask = np.zeros(dim)
 
  N_cells = np.random.randint(N_min, N_max + 1)
 
  bg_path = '/content/gdrive/MyDrive/CY2003_MnT/synthetic_dataset/original_images/background_1.png'
 
  image = cv2.imread(bg_path)
 
  image = np.rot90(image, k = np.random.randint(1,4))
 
  image = cv2.resize(image, dim)
 
  bg = copy.deepcopy(image)
 
  c = 0
  attempt_c = 0
  MAX_ATTEMPTS = 25
 
  while c < N_cells:
 
    cell = cv2.imread(f'/content/gdrive/MyDrive/CY2003_MnT/synthetic_dataset/original_images/cropped/{np.random.randint(1,12)}.png') # choose from 11 cell types
 
    # randomly blur the cell to mimic an unfocused cell -- a larger kernel size gives stronger blurring
    # (blurring is skipped once there have been repeated failed placement attempts)
    if (np.random.randint(0,2)) and attempt_c < 2:
      cell = cv2.medianBlur(cell, 9)
 
    test_x = np.random.randint(0, dim[1] - cell.shape[1]) 
    test_y = np.random.randint(0, dim[0] - cell.shape[0])
 
    if np.sum(mask[test_y: test_y + cell.shape[0], test_x: test_x + cell.shape[1]]) == 0: 
      c += 1
      attempt_c = 0
 
      mask[test_y: test_y + cell.shape[0], test_x: test_x + cell.shape[1]] = 1  # mark this region as occupied
      # paste the cell, blended with a clean background patch via an element-wise maximum
      image[test_y: test_y + cell.shape[0], test_x: test_x + cell.shape[1]] = np.maximum(cell, bg[:cell.shape[0], :cell.shape[1]])
    else:
      attempt_c += 1 # count as failed attempt
    if attempt_c >= MAX_ATTEMPTS: # avoid infinite loop if unsuccessful for ___ times
      break
  return N_cells, image
Synthetic Dataset Generation Program
job_name = "20july_webcam"
N_images = 10
 
for i in range(N_images):
  _N, test_image = new_insert(100, 500, (600,600))
 
  cv2.imwrite(os.path.join(f'/content/gdrive/MyDrive/CY2003_MnT/synthetic_dataset/synthesised/{job_name}/medianBlur9', f"{job_name}_{i}_{_N}.png"), test_image)

3. HDF5 Conversion

Initially, we loaded the images and the CSVs containing cell count labels as NumPy arrays, with X being the array of images and y being the array of cell counts, which also serve as the class labels.

Afterwards, instead of feeding the deep learning model NumPy arrays for the training and validation datasets (as we previously did), we opted to pack the cell images and their labels into a single HDF5 file. HDF5 (Hierarchical Data Format version 5) is a file format designed to store large amounts of data. Loading the data as individual arrays requires reading thousands of files one by one, which takes noticeably longer than reading a single HDF5 file into memory. Hence, HDF5 is more efficient for loading datasets when training and validating deep learning models.

Version 1: Loading Images and CSVs as arrays and making class labels 
import pandas as pd
import cv2
import numpy as np
import os
from sklearn.preprocessing import label_binarize
 
label_file_path =  '/content/gdrive/MyDrive/CY2003_MnT/training_data/3n4_50x50.csv'
training_data_folder = '/content/gdrive/MyDrive/CY2003_MnT/training_data/3_50x50/'
 
label_df = pd.read_csv(label_file_path)
 
X = [] # directly loading into memory
y = label_df.n_cells.to_numpy()
 
y = label_binarize(y, classes=[0,1, 2, 3, 4, 5]) # cell counts as classes
 
for index, row in label_df.iterrows(): 
 
 
  label_df.at[index, 'file_root'] = row.filename.split("/")[-1]
 
 
  _img_path = os.path.join(training_data_folder, label_df.at[index, 'file_root'])
 
  print(_img_path)
 
  _img = cv2.imread(_img_path, 0) # convert image to grayscale since model reads grayscale images for input
 
  _img = _img.reshape(_img.shape + (1,))
 
  # do rescaling and normalisation as a batch using keras 
 
  X.append(_img)
 
X = np.array(X)
X = X.astype('float64')
 
Version 2: Compressing Images and CSVs into HDF5s
import h5py
import pandas as pd
import cv2
import numpy as np
import os
from pathlib import Path
 
def make_hdf5():

    df = pd.read_csv('training_data/3n4_50x50.csv')  # labels: filename, n_cells

    output_hdf5 = "training_data/3n4.hdf5"

    all_files = [str(p) for p in Path("training_data/3_50x50").glob("*.png")]

    hf = h5py.File(output_hdf5, 'w')

    for filename in all_files:
        _filename = filename.split("/")[-1]

        # only keep pieces labelled with 2 cells or fewer
        if df.loc[df['filename'] == _filename, 'n_cells'].iloc[0] > 2:
            continue

        _img = cv2.imread(filename, 0)  # read as greyscale

        _img = _img.astype("float32") / 255.0  # normalise pixel values to [0, 1]
        _img = _img.reshape(_img.shape + (1,))  # add a channel dimension

        hf.create_dataset(_filename, (40, 40, 1), data=_img)  # one dataset per image, keyed by filename

    hf.close()
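As a quick sanity check (not part of the original pipeline), the datasets written by make_hdf5() can be read back by filename, which is also how the DataGenerator below retrieves them:

import h5py
import numpy as np

with h5py.File("training_data/3n4.hdf5", "r") as hf:
    names = list(hf.keys())                 # one dataset per image, keyed by filename
    print(len(names), "images stored")
    img = np.array(hf[names[0]])            # shape (40, 40, 1), values in [0, 1]
    print(names[0], img.shape, img.min(), img.max())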

4. Data Augmentation

Deep learning models rely heavily on large amounts of data to avoid overfitting. Overfitting is not ideal, as it indicates that the model, instead of learning features that help it predict and generalise well to other datasets, has memorised the training dataset to obtain better performance. Data augmentation helps to improve the performance of deep learning models by enhancing the size and quality of training datasets so that limited datasets become artificially diverse, making deep learning models less prone to overfitting [3]. This is accomplished using geometric transformations, colour space augmentations, kernel filters, mixing images, random erasing, feature space augmentation, and adversarial training, amongst others [3]. In our case, we opted to rotate each image randomly by 90, 180, or 270 degrees and to add random noise. Additionally, we randomly flip images left-to-right and upside down. The augmentation code was not obtained from online sources but written after reading the documentation of the libraries used.
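A compact illustration of these operations on a single image array is shown below (a standalone sketch; the version actually used lives in the DataGenerator's __augment method further down):

import numpy as np

def augment(data):
    """Sketch of the augmentations described above: random rotation, noise, and flips."""
    data = np.rot90(data, k=np.random.randint(1, 4))     # rotate by 90, 180 or 270 degrees
    data = data + np.random.rand(*data.shape) * 0.2      # add small uniform random noise
    if np.random.randint(2):
        data = np.fliplr(data)                           # flip left-to-right
    if np.random.randint(2):
        data = np.flipud(data)                           # flip upside down
    return data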

 

5. Model Architecture

The model we tried out has a similar structure to other models in the literature. Since we are working with 2D image data, the 2D convolution layers each apply a specified number of kernels (16, 32, 64, etc.) to their input. The convolved output (after application of each kernel) forms the input for the next layer. The weights of these kernels are adjusted during the training process.

Max pooling is a technique used to reduce the size of the problem space by taking the maximum value of a localised region, acting as a form of downsampling. It is reported to work well in the literature.

Dropout is a layer that randomly drops connections between nodes during training. This forces the model to rely on more ‘general’ features and is known to reduce overfitting.
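For intuition, here is a toy NumPy example of both operations on a 4 x 4 array (purely illustrative; Keras performs the equivalent inside its MaxPooling2D and Dropout layers):

import numpy as np

a = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 2, 9, 1],
              [3, 4, 5, 6]], dtype=float)

# 2 x 2 max pooling: keep the maximum of each non-overlapping 2 x 2 block
pooled = a.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)                      # [[6. 4.]
                                   #  [7. 9.]]

# dropout with rate 0.5: randomly zero roughly half the values during training,
# scaling the survivors by 1 / (1 - rate) so the expected activation is unchanged
keep = np.random.rand(*a.shape) >= 0.5
print(a * keep / 0.5)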

 

Model Architecture

Model Architecture for Toy Model

 

Libraries Used
# imports and file locations
import pandas as pd
import cv2
import numpy as np
import matplotlib.pyplot as plt
import os
from pathlib import Path
 
# heavier imports
 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
import tensorflow as tf
from tensorflow.keras import layers, regularizers  # regularizers is used in the model definition below
import tensorflow.keras as keras
Partitioning HDF5 into Training, Validation and Test Dataset
import h5py
 
hdf5_path = '/content/gdrive/MyDrive/CY2003_MnT/synthetic_dataset/21jun_300x0n1n2n3n4cells.hdf5'
label_file_path =  '/content/gdrive/MyDrive/CY2003_MnT/synthetic_dataset/synthesised/20jun_300x0n1n2n3n4cells.csv'
 
label_df = pd.read_csv(label_file_path)
 
# convert csv file to labels
labels = {}
for index, row in label_df.iterrows():
  labels[row.filename] = row.n_cells

all_labels = list(labels.keys())  # the image IDs (filenames) to be split

training, test = train_test_split(all_labels, test_size = 0.2, random_state = 42)
training, validation = train_test_split(training, test_size = 0.2, random_state = 42)

# partition the IDs into training, validation, and test
partition = {"training": training, "test": test, "validation": validation}
 
Data Augmentation using DataGenerator
 
class DataGenerator(keras.utils.Sequence):
    'Generates data for Keras'
    def __init__(self, file_name, list_IDs, labels, batch_size=32, n_channels=1,
                 n_classes=5, shuffle=True, augmentation = False, dim = (40, 40, 1)):
        'Initialization'
        self.dim = dim
        self.batch_size = batch_size
        self.labels = labels
        self.list_IDs = list_IDs
        self.n_channels = n_channels
        self.n_classes = n_classes
        self.shuffle = shuffle
        self.on_epoch_end()
        self.augmentation = augmentation
        self.file_name = file_name
 
    def __len__(self):
        'Denotes the number of batches per epoch'
 
        if self.augmentation:
          return 2 * int(np.floor(len(self.list_IDs) / self.batch_size))
        else:
          return int(np.floor(len(self.list_IDs) / self.batch_size))
 
    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
 
        if self.augmentation:
          #indexes = self.indexes[index*self.batch_size//2:(index+1)*self.batch_size//2]
          indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        else:
          indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
 
        # Find list of IDs
        list_IDs_temp = [self.list_IDs[k] for k in indexes]
 
        # Generate data
        X, y = self.__data_generation(list_IDs_temp)
 
        return X, y
 
    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.list_IDs))
        if self.shuffle == True:
            np.random.shuffle(self.indexes)
 
    def __augment(self, data):

      # randomly rotate by 90, 180 or 270 degrees
      data = np.rot90(data, np.random.randint(1, 4))

      # randomly add noise
      data += np.random.rand(*data.shape) * 0.2

      # randomly flip
      if np.random.randint(2):
        data = np.fliplr(data)  # flip image array left to right
      if np.random.randint(2):
        data = np.flipud(data)  # reverse image array along axis 0

      data = np.absolute(data)  # just in case any operation makes values negative

      return data
 
    def __data_generation(self, list_IDs_temp):
        'Generates data containing batch_size samples' # X : (n_samples, *dim, n_channels)
        # Initialization
        X = np.empty((self.batch_size, *self.dim))
        y = np.empty((self.batch_size), dtype=int)
 
        f = h5py.File(self.file_name)
 
        # Generate data
        for i, ID in enumerate(list_IDs_temp):
            # Store sample
            #X[i,] = np.load('' + ID + '.npy')
            data = np.array(f.get(str(ID)))
 
            #print(ID, np.max(data))
 
            data *= 255.0
 
 
            if self.augmentation:
              if i <= self.batch_size // 2:
                pass
                # do nothing
              else:
                data = self.__augment(data)
 
            else:
              # do nothing
              pass
 
            X[i,] = data
            #assert not np.isnan(data)
 
            # Store class
            y[i] = self.labels[ID]
 
        return X, y
Specifying Parameters for Data Generators
params = {
    'dim' : (40, 40, 1),
    'batch_size': 32,
    'n_classes': 5,
    'n_channels':1,
    'shuffle': True,
    'augmentation' : True,
}
 
training_generator = DataGenerator(hdf5_path, partition['training'], labels, **params)
validation_generator = DataGenerator(hdf5_path, partition['validation'], labels, **params)
Defining and Compiling Deep Learning Model
model = tf.keras.Sequential([   
  layers.Conv2D(16, 11, input_shape = (40,40,1), kernel_regularizer=regularizers.l2(1e-4)),  # 16 filters of size 11 x 11, with L2 regularisation
  layers.BatchNormalization(),
  layers.Activation("tanh"),
  layers.MaxPooling2D(pool_size = (2,2)),
  layers.Dropout(0.5),
  layers.Conv2D(32, 7, kernel_regularizer=regularizers.l2(1e-4)),  # 32 filters of size 7 x 7, with L2 regularisation
  layers.BatchNormalization(),
  layers.Activation("tanh"),
  layers.MaxPooling2D(pool_size = (2,2)),
  layers.Dropout(0.5),
  layers.Flatten(),
  layers.Dense(5, activation = "softmax", )
])
 
 
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate = 0.001, name = 'help'),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
Training the Model with Checkpoint Callbacks
checkpoint_filepath = '/content/gdrive/MyDrive/CY2003_MnT/mlstuff/models/21jun_synthetic_augmentation'
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(filepath = checkpoint_filepath, save_weights_only=False, monitor = 'val_accuracy', mode = 'max', save_best_only=True)
 
r = model.fit_generator(generator = training_generator,
                        validation_data = validation_generator, epochs = 50, callbacks = [model_checkpoint_callback])
Obtain Accuracy and Loss for Every Epoch Graphs
job_name = "21jun_synthetic_dataset"
print(r.history['loss'])
print(r.history['val_loss'])
plt.plot(r.history['loss'], label='train loss')
plt.plot(r.history['val_loss'], label='val loss')
plt.legend()
plt.savefig('/content/gdrive/MyDrive/CY2003_MnT/mlstuff/plots/{}_loss.png'.format(job_name))
plt.show()
plt.clf()
 
# plot the accuracy
plt.plot(r.history['accuracy'], label='train acc')
plt.plot(r.history['val_accuracy'], label='val acc')
plt.legend()
plt.savefig('/content/gdrive/MyDrive/CY2003_MnT/mlstuff/plots/{}_accuracy.png'.format(job_name))
plt.show()
 
model.save('/content/gdrive/MyDrive/CY2003_MnT/mlstuff/models/{}.h5'.format(job_name))
 

Load Model and Do Prediction and Plot for Image Visualisation

model = tf.keras.models.load_model('/content/gdrive/MyDrive/CY2003_MnT/mlstuff/models/21jun_synthetic_augmentation')
 
# numpy array as input
 
# first load prediction set
prediction_path = '/content/gdrive/MyDrive/CY2003_MnT/training_data/predictionset/'
answers = []
 
a_df = pd.read_csv('/content/gdrive/MyDrive/CY2003_MnT/training_data/3n4_50x50_removedlabels.csv')
prediction_set = []
 
for f in os.listdir(prediction_path):
  _img = cv2.imread(os.path.join(prediction_path, f), 0)
  _img = cv2.resize(_img, dsize=(40, 40), interpolation=cv2.INTER_CUBIC)
  _img = _img.astype("float32") / np.max(img)
  _img = _img.reshape(_img.shape + (1,))
 
  prediction_set.append(_img)
 
prediction_set = np.array(prediction_set)
results = model.predict(prediction_set)
 
# then load answers
prediction_filenames = [f for f in os.listdir(prediction_path)]
df = a_df[a_df['filename'].isin(prediction_filenames)]
 
# make prediction
prediction = np.argmax(results, axis =1)
 
# plot picture with actual answer and prediction
w = 10
h = 10
fig = plt.figure(figsize=(9, 13))
columns = 4
rows = 5
 
ax = []
count = 0
for i in range(columns*rows):
    img = plt.imread(prediction_path + prediction_filenames[count])
    # create subplot and append to ax
    ax.append( fig.add_subplot(rows, columns, i+1) )
    ax[-1].set_title("Actual: " + str(df.loc[df['filename'] == prediction_filenames[count], 'n_cells'].iloc[0]) + " Predicted: " + prediction[count].astype('str'))  # set title
    count += 1
    #ax[-1].set_axis_off() # remove axis
    plt.imshow(img)
 
plt.tight_layout(pad=0.75)
plt.show()

 

6. Performance with Prediction Dataset

Developing the machine learning model was a difficult and time-consuming process. In hindsight, most of our problems were attributable to a few factors: (1) our model involved cropping, (2) the input labels were poorly distributed, and (3) the input images did not contain many features for the model to converge on in the first place.

Firstly, since our model takes in small input images and attempts to classify the image as having {0,1,2,3…} cells, this means that we have to crop the original image and rescale accordingly such that the model accepts it as input. This introduces a lot of edge effects: the finer the cropping, the higher the likelihood that there were cells along the crop line. This introduces significant ambiguities. For example, in the image below, would it be classified as 0 or 1? These small design choices would affect what the CNN model can actually learn from the image.

 

At the same time, the input labels were not well distributed, which took a while to realise. Initially, our model was reporting high accuracy on its training and validation sets. Pictured below are the accuracy and loss plots from our model; accuracy reports the percentage of correct predictions, while the loss reflects the penalty applied to the model.

 

 

During training, we often use the metrics of validation accuracy and test accuracy to evaluate the performance of the model. Initially, we observed validation accuracies of around 0.90. This seemed great, until we realised that when we applied the model to our test set, it would consistently predict 0 cells.

It turned out that our input training data mostly consisted of images with either 0 or 1 cells. Hence, for each batch the model trained on, it learnt to guess that every image had 0 cells, since most of the training data had 0 cells anyway.
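One quick diagnostic that would have surfaced this earlier (a hindsight sketch, not code we ran at the time) is to inspect the label distribution before training:

import pandas as pd

label_df = pd.read_csv('training_data/3n4_50x50.csv')      # columns: filename, n_cells
counts = label_df['n_cells'].value_counts().sort_index()
print(counts)                                               # absolute counts per class
print((counts / counts.sum()).round(3))                     # class proportions (in our case, heavily skewed towards 0 and 1)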

After this happened, we started to look into synthesising images directly, since this would mean that we could generate a good spread of training data. At the same time, we also implemented a DataGenerator, such that our model would automatically apply operations such as rotation or skew to increase the effective size of the training data.

However, by this point, we were already exploring other non-machine-learning solutions that would be easier to implement. This was because we ran into a lot of difficulties with the numerical stability of our model, which would often return ‘NaN’ (not a number) as a result of trying to divide by zero. The things we tried included simplifying the model, lowering the learning rate, modifying the last activation function to discourage division by zero, batch normalisation, and L1 and L2 regularisation.

What actually helped was simply changing the way in which the labels were encoded.
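We do not reproduce the exact change here, but for context, the two common label encodings look like this, and the loss function must match the encoding (integer labels pair with sparse_categorical_crossentropy, one-hot labels with categorical_crossentropy):

import numpy as np
from sklearn.preprocessing import label_binarize

y = np.array([0, 2, 1, 0, 3])

# integer-encoded labels: use with loss='sparse_categorical_crossentropy'
print(y)

# one-hot encoded labels: use with loss='categorical_crossentropy'
print(label_binarize(y, classes=[0, 1, 2, 3, 4, 5]))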

In conclusion, we learnt a lot about the difficulty of using machine learning models. We have demonstrated that it is possible to build such a model, but due to time constraints and the technical difficulty of developing a machine learning model, we eventually settled on more traditional image-processing methods.

In the future, it would be interesting to apply multiple models at the same time, which would help to cross-validate the results of each model, and allow us to better quantify the certainty of the prediction.


References

[1]  Y. Kong et al., “Automated yeast cells segmentation and counting using a parallel U-Net based two-stage framework,” OSA Continuum, vol. 3, no. 4, p. 982, 2020.

[2] T. Falk et al., “U-Net: deep learning for cell counting, detection, and morphometry,” Nat. Methods, vol. 16, no. 1, pp. 67–70, 2019.

[3] C. Shorten and T. M. Khoshgoftaar, “A survey on image data augmentation for deep learning,” J. Big Data, vol. 6, no. 1, 2019.
