The OpenCL kernel execution time between sending and receiving a packet was observed to be between 200 and 300 microseconds. The high latency in this case is due to the OpenCL function calls and the overhead of traversing two PCIe connections.
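As a rough illustration of how such a round-trip number can be obtained, the sketch below times a send/receive kernel pair from the host. It is a generic PyOpenCL example, not OCT's actual benchmark (which uses the Xilinx OpenCL/XRT flow); the kernel source file and the kernel names tx_kernel and rx_kernel are placeholders.

```python
# Generic PyOpenCL sketch for timing a host-initiated send/receive round trip.
# Kernel names (tx_kernel, rx_kernel) and the .cl file are placeholders; OCT's
# actual benchmark uses the Xilinx OpenCL/XRT flow, which this only approximates.
import time
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
prog = cl.Program(ctx, open("udp_benchmark.cl").read()).build()

payload = np.zeros(1024, dtype=np.uint8)
buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR,
                hostbuf=payload)

t0 = time.perf_counter()
prog.tx_kernel(queue, (1,), (1,), buf)   # enqueue the packet-send kernel
prog.rx_kernel(queue, (1,), (1,), buf)   # enqueue the packet-receive kernel
queue.finish()                           # block until both kernels complete
t1 = time.perf_counter()
print(f"round trip: {(t1 - t0) * 1e6:.1f} us")
```

Because each enqueue goes through the OpenCL runtime and crosses the PCIe bus, the measured time is dominated by host-side overhead rather than time on the wire.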
Based on these experiments, we expect to see a 100x improvement in communication latency between two directly connected FPGAs, which will open up a range of new applications for FPGAs in the cloud.  This also argues for disaggregating FPGAs from host computers and treating the FPGA as a first-class citizen in the cloud.
OCT documentation is available on GitHub, where we host getting-started tutorials for both MOC and CloudLab.  These tutorials demonstrate the workflow for stand-alone and network-attached accelerator development and deployment from the ground up.

Security in the Cloud

The bare metal approach used in OCT gives the research community maximum freedom for system experimentation and evaluation, but it also carries certain security risks.  Provisioning bare metal servers gives users access to all components of the system, which means they can also compromise, by accident or on purpose, the firmware of the system.  Such modifications can impact the security of the system, and subsequent users would then work on a compromised system without being aware of it.  OCT uses the mechanisms of Elastic Secure Infrastructure (ESI) \cite{others2018} and its attestation service to provide an uncompromised system and make it available to an experimenter.  Currently, ESI provides this service only for the servers that house the FPGAs in OCT.  To ensure that a new user receives an uncompromised FPGA with the start of every new lease of a bare metal system, we enforce a procedure that is automatically performed at the startup of the host server.  This procedure puts the FPGA into a known state, so that no information accidentally or intentionally left behind by an earlier user continues to reside on it.  An official Xilinx Runtime (XRT) is installed, as is the hardware shell.  The shell is reinstalled only if the new user wishes to change its version; in all cases, all previous user logic is removed.  In addition, the network to which the FPGAs are directly connected is isolated, guaranteeing that a networked FPGA cannot inject spurious packets into a production network.
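The following sketch illustrates the flavor of this startup cleanup using XRT's command-line tools.  It is not OCT's actual procedure, which is driven by ESI; the tool flags, device address, and shell name shown here are assumptions and vary across XRT releases.

```python
# Illustrative sketch of an FPGA cleanup sequence at host startup; not OCT's
# actual ESI-driven procedure. The xbutil/xbmgmt flags are from recent XRT
# releases and may differ by version; the BDF and shell name are placeholders.
import subprocess

BDF = "0000:03:00.0"   # placeholder PCIe address of the U280

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Reset the device, clearing any user logic left in the dynamic region.
run(["xbutil", "reset", "--device", BDF, "--force"])
# 2. Reflash the base shell only when the new user requests a different
#    version; otherwise the installed shell is kept.
run(["xbmgmt", "program", "--device", BDF, "--base",
     "--image", "xilinx_u280_gen3x16_xdma_base_1"])   # shell name is a placeholder
# 3. Validate that the card is in a known-good state before handing it over.
run(["xbutil", "validate", "--device", BDF])
```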

Example Application: Machine Learning on FPGAs in the Cloud

We provide sample applications, including the FINN framework for machine learning.  FINN \cite{Blott_2018} is developed and maintained by Xilinx Research Labs to explore deep neural network inference on FPGAs.  The FINN compiler creates dataflow architectures that can be parallelized across and within the layers of a neural network, and transforms the dataflow architecture into a bitfile that can be run on FPGA hardware.
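As a sketch of this flow, FINN's public build_dataflow interface can take a quantized ONNX model down to a bitfile.  The model filename, clock period, and throughput target below are illustrative, not an OCT-specific configuration.

```python
# Minimal sketch of driving the FINN compiler to a bitfile via its public
# build_dataflow interface. The ONNX file and performance targets are
# illustrative placeholders.
import finn.builder.build_dataflow as build
import finn.builder.build_dataflow_config as build_cfg

cfg = build_cfg.DataflowBuildConfig(
    output_dir="build_mobilenet_u280",
    synth_clk_period_ns=5.0,            # 200 MHz target clock
    target_fps=3000,                    # throughput goal used to fold layers
    board="U280",                       # Alveo U280, as used in OCT
    shell_flow_type=build_cfg.ShellFlowType.VITIS_ALVEO,
    generate_outputs=[build_cfg.DataflowOutputType.BITFILE],
)
# Compile a quantized network (placeholder filename) down to an .xclbin.
build.build_dataflow_cfg("mobilenet_w4a4.onnx", cfg)
```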
With the resources available in OCT, we are particularly interested in implementing network-attached FINN accelerators split across multiple FPGAs for convolutional neural networks such as MobileNet and ResNet, whose partitioning is discussed in \cite{alonso2021elastic}.  Figure 2 shows such an arrangement, in which MobileNet is implemented with three accelerators mapped to two FPGAs.  Two of the accelerators function stand-alone, while the third, which contains all the communication required between the FPGAs, is split between two Xilinx U280s.  The two halves of this accelerator are connected using the network infrastructure of the UDP stack, which enables communication between them via the switch.
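A minimal sketch of the host-side setup for such a split is shown below, assuming a PYNQ-style Alveo overlay and a UDP stack kernel in the style of Xilinx's open-source VNx network layer.  The kernel name, register names, and addresses are assumptions and will differ in the actual OCT design.

```python
# Illustrative sketch (not OCT's actual code): point each half of the split
# accelerator's UDP stack at the other FPGA so they can talk via the switch.
# The kernel name "networklayer_0" and its register names are assumptions
# borrowed from the style of Xilinx's VNx network layer; IPs/ports are placeholders.
import ipaddress
import pynq

def configure_udp(xclbin, dev_idx, local_ip, remote_ip, port):
    device = pynq.Device.devices[dev_idx]
    ol = pynq.Overlay(xclbin, device=device)
    nl = ol.networklayer_0.register_map          # UDP stack control registers
    nl.ip_address = int(ipaddress.IPv4Address(local_ip))
    # One socket-table entry: the peer FPGA's address and the shared port.
    nl.udp_theirIP_offset = int(ipaddress.IPv4Address(remote_ip))
    nl.udp_theirPort_offset = port
    nl.udp_myPort_offset = port
    return ol

# First half of the accelerator on FPGA 0, second half on FPGA 1.
configure_udp("accel_half0.xclbin", 0, "10.1.212.101", "10.1.212.102", 62781)
configure_udp("accel_half1.xclbin", 1, "10.1.212.102", "10.1.212.101", 62781)
```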