Further discussion

We have compared qNEHVI and U-NSGA-III on both synthetic and real-world benchmarks, varying experimental parameters such as dimensionality and batch size that materials scientists face when implementing closed-loop optimisation in HTE. Our results suggest that qNEHVI is extremely sample-efficient in approaching the PF and maximising HV gain, but fails to exploit the front thereafter. In contrast, U-NSGA-III follows a more consistent optimisation trajectory and better exploits the PF, maintaining a larger set of near-optimal solutions.
We thus make the case for MOEAs in materials experimentation, not only computational design. We further argue that such implementations are best suited to mildly discontinuous objective spaces (as can occur in structural problems such as alloys), where small changes in the inputs can cause the outputs to vary wildly and an evolutionary strategy can navigate with better resolution. This is consistent with work by Liang Q. et al \cite{Liang_2021} on single-objective optimisation, which noted that having “multiple well-performing candidates allows one to not only observe regions in design space that frequently yield high-performing samples but also have backup options for further evaluation should the most optimal candidate fail in subsequent evaluations”.
Furthermore, MOEAs scale better in computational cost in high-dimensional, high-throughput contexts, where they can converge while maintaining both diversity and feasibility. HV-based MOBOs such as qNEHVI scale poorly to high-dimensional and many-objective problems owing to the cost of computing HV. Depending on the HTE set-up, the ML component may not be able to leverage powerful cluster computing for computationally intensive problems/models; MOEAs with lower computational overhead, such as U-NSGA-III, are a better choice in such scenarios. With advancements in HTE set-ups allowing automation and parallel sampling, we expect research groups to leverage higher-throughput systems with short turnarounds. This makes MOEAs much more practical for exploring complex search spaces when paired with larger evaluation budgets of $10^3$ to $10^4$ data points.
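To make the HV cost argument concrete, the sketch below computes the exact hypervolume of a bi-objective (minimisation) front by sweeping sorted points; the function name and the example front are illustrative only, not taken from our experiments. Exact HV is only this cheap for $M=2$; the cost of exact computation grows steeply with the number of objectives, which is the scaling problem noted above.

```python
def hv2d(front, ref):
    """Exact hypervolume of a 2-D minimisation front w.r.t. a reference point.

    `front` is a list of (f1, f2) points assumed mutually non-dominated;
    `ref` bounds the dominated region.
    """
    # Sort by f1 ascending; for a non-dominated set, f2 is then descending.
    pts = sorted(front)
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        # Each point contributes one rectangular slice of the dominated region.
        hv += (ref[0] - f1) * (prev_f2 - f2)
        prev_f2 = f2
    return hv

# Three non-dominated points against reference point (4, 4):
print(hv2d([(1, 3), (2, 2), (3, 1)], (4, 4)))  # → 6.0
```

For $M > 2$ this slicing trick no longer applies and dedicated algorithms are needed, which is why HV-based acquisition functions such as qNEHVI become expensive in many-objective settings.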
The choice of batch size, balancing optimisation performance against the number of experimental cycles, is also important. Empirically, our results suggest that a smaller batch size of around 4 is ideal for the limited evaluation budget of 192 points, although larger batch sizes are preferable for more complex problems (for example, those with disconnected regions in objective space or local optima).
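As a back-of-envelope illustration of this trade-off (the budget of 192 points is from our experiments; the helper itself is only a sketch), the number of sequential experimental cycles is the budget divided by the batch size, so quadrupling the batch quarters the number of synthesis-and-characterisation rounds:

```python
def n_cycles(budget, batch_size):
    """Sequential experimental cycles needed to spend an evaluation budget."""
    # Each cycle proposes and evaluates `batch_size` candidates in parallel.
    return budget // batch_size

budget = 192
for q in (2, 4, 8, 16):
    print(f"batch size {q:2d} -> {n_cycles(budget, q):2d} cycles")
```

A batch size of 4 thus requires 48 cycles of the 192-point budget, while 16 requires only 12, at the cost of fewer opportunities for the optimiser to adapt between batches.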
A caveat of our work is that the synthetic problems we chose are generalisations of bi-objective spaces with specific Pareto geometries that may not translate well to real-life experimentation, especially for many-objective ($M>3$) problems. Newer benchmarks with greater difficulty and complex geometries/PFs \cite{Fan_2020} are tailored towards challenging MOEAs with massive evaluation budgets of up to $10^7$ total observations. An example is MW5 from the MW test suite, which has narrow, tunnel-like feasible regions that are practically impossible for GPs to model, causing MOBOs to fail to converge. Indeed, R. W. Epps et al noted that it is “difficult to impose complex structure on the GPs, which often simply encode continuity, smoothness, or periodicity” \cite{Epps_2020}. We refer readers to other publications that study the differences between surrogate models in BO \cite{Liang_2021,Lim_2021,Yan_2021}, as well as AI techniques that scale MOBOs to higher-dimensional spaces \cite{Moriconi_2020,eriksson2021high}.
Furthermore, materials experimentation is usually afflicted by real-world imperfections and deviations during synthesis, as well as uncertainty from the resolution of characterisation equipment. For example, MacLeod et al noted “the tendency of drop-casted samples to exhibit a wide range of downwards deviations in the apparent conductivity due to the poor sample morphology” \cite{MacLeod_2022}. Such noise causes measured objective values to deviate from the ‘true’ ground truth and, although its magnitude is often unclear, it is an unavoidable aspect of optimisation that should be tackled \cite{Koch_2015,Horn_2017}. In SI 6, we compare qNEHVI and U-NSGA-III under varying amounts of white noise on the outputs.
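The noise model used in SI 6 can be mimicked with a simple wrapper that perturbs each output with zero-mean Gaussian (white) noise; the function names, the toy bi-objective function, and the $\sigma$ value below are illustrative only:

```python
import random

def with_white_noise(objective, sigma, seed=None):
    """Wrap a multi-output objective so each output gains N(0, sigma^2) noise."""
    rng = random.Random(seed)

    def noisy(x):
        # Perturb every objective value independently with white noise.
        return [f + rng.gauss(0.0, sigma) for f in objective(x)]

    return noisy

# Illustrative bi-objective function (not one of our benchmarks):
def f(x):
    return [x ** 2, (x - 2) ** 2]

noisy_f = with_white_noise(f, sigma=0.1, seed=0)
print(noisy_f(1.0))  # each output perturbed around the true value [1.0, 1.0]
```

Because the noise is zero-mean, repeated evaluations average back towards the ground truth, but any single batch the optimiser sees is displaced from it, which is precisely what stresses the two algorithms differently in SI 6.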