Code Refactoring

Since we now have multiple nodes which can read/write to SD cards (only one of the hand soldered prototypes worked well, the other had signal integrity issues), we can finally write received data from the client to the SD card.

Current Software Architecture

For the sake of simplicity, received packets were handled entirely in the ESP-NOW ISR. This worked fine while MtftpClient did not actually write contents to a file (but rather just logged metadata).

However, once MtftpClient started writing to the SD card, the write operations took a substantial amount of time, leading to severe packet loss (as the ESP-NOW ISR took a long time to exit). The obvious solution is to push received data to a ring buffer and have MtftpClient::loop() read from it and perform the actual SD card writes. While this works, it exposes yet another issue: since the server transmits data packets as fast as it can read from its SD card, the client cannot write data to its own SD card fast enough, and the ring buffer overflows.

Rather than bolting on flow control, we can refactor the client to handle incoming packets differently – all received packets are pushed to a ring buffer, then MtftpClient::loop() parses incoming packets and writes them to the SD card as seen below.

Refactored Architecture

Refactoring the packet handling on the client in this way solves two problems. Firstly, we no longer perform long operations in the ISR. Secondly, if we make the ring buffer larger than the maximum amount of data transferred in a window (plus a bit of overhead for the ring buffer itself to manage the items), we can buffer an entire window's worth of packets. MtftpClient::loop() only sends an ACK once an entire window has been processed successfully, which only happens once all the writes are done.
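The hand-off between the receive callback and MtftpClient::loop() can be sketched roughly as follows. This is a minimal single-producer/single-consumer ring buffer written for illustration, not the ring buffer the project actually uses; PacketRing, Packet and kMaxPacketLen are hypothetical names, and the 250-byte slot size is simply the ESP-NOW payload limit.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical slot size: ESP-NOW payloads are at most 250 bytes.
constexpr std::size_t kMaxPacketLen = 250;

struct Packet {
    std::uint8_t len = 0;
    std::uint8_t data[kMaxPacketLen] = {};
};

// Single-producer / single-consumer ring buffer (illustrative only).
// The receive callback is the only producer, loop() the only consumer.
template <std::size_t Capacity>
class PacketRing {
public:
    // Called from the receive callback: copy the packet and return quickly.
    // Returns false (packet dropped) when the ring is full or len is too big.
    bool push(const std::uint8_t* data, std::size_t len) {
        assert(data != nullptr);
        if (len > kMaxPacketLen) return false;
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t next = (head + 1) % (Capacity + 1);
        if (next == tail_.load(std::memory_order_acquire)) return false;  // full
        slots_[head].len = static_cast<std::uint8_t>(len);
        std::memcpy(slots_[head].data, data, len);
        head_.store(next, std::memory_order_release);
        return true;
    }

    // Called from loop(): fetch the oldest packet, if there is one.
    bool pop(Packet& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire)) return false;  // empty
        out = slots_[tail];
        tail_.store((tail + 1) % (Capacity + 1), std::memory_order_release);
        return true;
    }

private:
    // One slot is always left unused so that "full" and "empty" differ.
    std::array<Packet, Capacity + 1> slots_{};
    std::atomic<std::size_t> head_{0};
    std::atomic<std::size_t> tail_{0};
};
```

With Capacity chosen to be at least the number of packets in a window, push() should never have to drop a packet while the client is still writing the previous blocks to the SD card.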

This also doubles up as flow control:

  1. The server will send an entire window at once
  2. The client buffers the entire window in memory, requesting retransmission of missing blocks if necessary
  3. The client slowly reads from the ring buffer and writes data to the SD card
  4. The client acknowledges the window once all data has been processed
  5. The server continues transmission (i.e. go back to Step 1)
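The client side of these steps can be sketched as a small bookkeeping class. This is only an illustration under assumptions: WindowTracker and its methods are hypothetical names, not part of MtftpClient, and blocks are numbered from 0 within a window.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-window bookkeeping on the client (illustrative only).
class WindowTracker {
public:
    explicit WindowTracker(std::size_t window_size)
        : received_(window_size, false), written_(0) {}

    // Step 2: a block of the current window has arrived.
    void onBlock(std::size_t block_no) {
        if (block_no < received_.size()) received_[block_no] = true;
    }

    // Step 2: blocks that still need to be requested again from the server.
    std::vector<std::size_t> missingBlocks() const {
        std::vector<std::size_t> missing;
        for (std::size_t i = 0; i < received_.size(); ++i) {
            if (!received_[i]) missing.push_back(i);
        }
        return missing;
    }

    // Step 3: called once per block after its SD card write has finished.
    void onBlockWritten() {
        assert(written_ < received_.size());
        ++written_;
    }

    // Step 4: the ACK may only go out once every block has been written.
    bool readyToAck() const {
        return missingBlocks().empty() && written_ == received_.size();
    }

    // Step 5: after the ACK is sent, start tracking the next window.
    void nextWindow() {
        received_.assign(received_.size(), false);
        written_ = 0;
    }

private:
    std::vector<bool> received_;
    std::size_t written_;
};
```

Because the server does not start the next window until it sees the ACK, holding the ACK back until readyToAck() is true is what throttles the server to the speed of the client's SD card.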

With this implemented, we can fully test the communication between the client and server. A test with two nodes next to each other transferring a 1 MB file was successful: hashing the transferred file and comparing the result with the hash of the original confirmed that the two are identical.

However, in a realistic environment, there will be interference (whether from Wi-Fi networks, or even from our UAV remote control equipment), which will cause packets to be lost (but not corrupted, since the frame check sequence ensures that received data is valid). This is a problem especially when key signaling packets (e.g. ACK) are lost and one side is left waiting for a packet that will never arrive. Although the current timeout functionality handles these cases, relying entirely on the timeout to recover from missing packets affects throughput severely. Certain lost packets (e.g. the server’s SYNC response) can be detected by the other party and handled immediately without having to wait for the timeout.

Identifying these cases and handling them separately should improve throughput.

Separately, multiple bugs were fixed:

  1. Race condition between onTimeout call and loop
  2. The counter used to track buffered ESP-NOW packets was observed to underflow, which locked up the entire system since the count then never fell below MAX_BUFFERED_TX. This was fixed by using a counting semaphore instead of a plain variable.
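To illustrate why a counting semaphore avoids the underflow in the second fix, here is a minimal non-blocking sketch. It is a portable stand-in rather than the semaphore used on the nodes (presumably one provided by the RTOS); TxSlots and its methods are hypothetical names. The key property is that a surplus release is rejected instead of wrapping the count around.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Minimal non-blocking counting semaphore tracking free TX slots
// (illustrative stand-in, not the project's implementation).
class TxSlots {
public:
    explicit TxSlots(std::uint32_t max) : max_(max), free_(max) {
        assert(max > 0);
    }

    // Before queuing a packet: claim a slot, or report that none is free.
    bool tryAcquire() {
        std::uint32_t cur = free_.load();
        while (cur > 0) {
            // On failure, cur is reloaded and the bound is checked again.
            if (free_.compare_exchange_weak(cur, cur - 1)) return true;
        }
        return false;
    }

    // From the send-complete callback: hand a slot back.
    // An extra release is refused, so the count can never leave [0, max].
    bool release() {
        std::uint32_t cur = free_.load();
        while (cur < max_) {
            if (free_.compare_exchange_weak(cur, cur + 1)) return true;
        }
        return false;
    }

    std::uint32_t freeSlots() const { return free_.load(); }

private:
    const std::uint32_t max_;
    std::atomic<std::uint32_t> free_;
};
```

In this sketch the sender would construct TxSlots with MAX_BUFFERED_TX, call tryAcquire() before handing a packet to ESP-NOW, and call release() when the send callback fires. A duplicate or unexpected callback then results in a failed release() rather than a corrupted count.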

On our most recent test flight last Sunday, we strapped the collector to a UAV and monitored the ground node. Although none of the nodes locked up this time, the data written to the SD card by the collector was corrupted. More debugging is needed to identify and fix this issue.

Edit: The data corruption observed in data written by the collector has been fixed. file_offset is incremented after every write of an in-order, non-buffered block, but writing the buffer to file after handling missing blocks did not increment file_offset.
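The shape of the fix can be sketched as follows, under assumptions: BlockWriter, writeBlock and flushBuffer are hypothetical names, and a std::vector stands in for the file on the SD card. The point is simply that both write paths must advance file_offset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the client's file handling (illustrative only).
class BlockWriter {
public:
    // Path 1: an in-order, non-buffered block is written straight to the file.
    void writeBlock(const std::vector<std::uint8_t>& block) {
        writeAt(file_offset_, block);
        file_offset_ += block.size();
    }

    // Path 2: data that was buffered while missing blocks were re-requested
    // is written to the file in one go.
    void flushBuffer(const std::vector<std::uint8_t>& buffer) {
        writeAt(file_offset_, buffer);
        file_offset_ += buffer.size();  // the increment this path was missing
    }

    std::size_t fileOffset() const { return file_offset_; }
    const std::vector<std::uint8_t>& file() const { return file_; }

private:
    void writeAt(std::size_t offset, const std::vector<std::uint8_t>& bytes) {
        // Writes are sequential, so a gap here would mean a stale offset.
        assert(offset <= file_.size());
        if (file_.size() < offset + bytes.size()) {
            file_.resize(offset + bytes.size());
        }
        for (std::size_t i = 0; i < bytes.size(); ++i) {
            file_[offset + i] = bytes[i];
        }
    }

    std::vector<std::uint8_t> file_;
    std::size_t file_offset_ = 0;
};
```

Without the increment in flushBuffer(), the next in-order block would be written at the old offset and overwrite the data that had just been flushed.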
