Edinburgh Research Explorer Evolution and Impact of High Content Imaging

The field of high content imaging has steadily evolved and expanded substantially across many industry and academic research institutions since it was first described in the early 1990s. High content imaging refers to the automated acquisition and analysis of microscopic images from a variety of biological sample types. Integration of high content imaging microscopes with multiwell plate handling robotics enables high content imaging to be performed at scale and support medium- to high-throughput screening of pharmacological, genetic and diverse environmental perturbations upon complex biological systems ranging from 2D cell cultures to 3D tissue organoids to small model organisms. In this perspective article the authors provide a collective view on the following key discussion points relevant to the evolution of high content imaging:


Introduction
High content imaging encompasses and integrates the research disciplines of cell biology, photonics, laboratory automation and image analysis to robustly interrogate the phenotypes of individual cells, multicellular tissue samples and small model organisms at scale. The field of high content imaging (HCI), also known as high content screening (HCS), was inspired and evolved from flow cytometry and digital imaging microscopy technologies which enable multiplex labelling of biomarkers on a cell-by-cell basis [1]. In 1997, Cellomics Inc., one of the pioneers of HCI, developed the first fully integrated HCI platform (ArrayScan) for HCS applications [2]. The ArrayScan and subsequent HCI platforms from other groups provided end-to-end hardware and software solutions for automating image acquisition, image processing, image analysis, image archiving, and image visualisation (Fig. 1).
These developments revolutionised microscopic analysis of cells and supported a shift away from subjective reporting and manual quantification of observations to fully quantitative cell biology, permitting an accelerated approach to new knowledge generation. Thus, similarly to advances in next generation sequencing technology, automated HCI contributes to the modern era of hypothesis-free "discovery science" complementing more traditional hypothesis-driven research paradigms. The significant efficiency gains and accelerated discovery of potential new therapeutic targets, chemical starting points and early prediction of toxicity provided by HCI were a major incentive supporting early adoption of the technology by the pharmaceutical industry. Academic groups have subsequently contributed powerful and accessible image analysis software and machine learning applications to enable deep phenotyping of cell biology (e.g. therapeutic mechanism-of-action) as well as increased biological sample complexity. Together in close partnership, academia and industry have made rapid progress in collecting and analysing HCI data.
Initial HCS was typically performed in 2-dimensional cell cultures formatted in 96- or 384-well multiwell plates using platform proprietary predefined image analysis algorithms capable of extracting one to a few quantitative measurements per condition. Subsequent evolution of both commercial and general-purpose open source image analysis software packages including Definiens [3], CellProfiler [4] and Advanced Cell Classifier [5], allowed non-experts in image analysis to create sophisticated bespoke algorithms tailored towards complex phenotype quantification. This rapid evolution of free, general purpose software provided an important alternative and complementary approach to proprietary high throughput screening technologies which emerged as the predominant drug discovery engine of the biopharmaceutical industry in the early 1990s.
As a result of sequencing the human genome and subsequent advances in understanding disease at the genetic level, the pharmaceutical industry invested heavily in target-directed high throughput screening technologies in what was perceived as a new era of rapid and efficient discovery of highly selective and potentially personalised drug candidates. While many high throughput screening campaigns and modern target-led drug discovery strategies have produced remarkable successes in delivering effective medicines, high attrition rates in late stage clinical development prevail [6]. Advances in next generation sequencing (NGS) have revealed remarkable molecular heterogeneity within and between patients and adaptation in underlying disease mechanisms for many disease types that evade treatment. These findings starkly highlight the challenges in predicting which drugs and which drug targets will translate into clinically meaningful efficacy for many complex disease areas. It has become increasingly apparent that genes and proteins function as parts of integrated signalling pathway networks which contribute to extensive compensatory capacities and plasticity in cell fate. In contrast to traditional high throughput screening assays that measure the activity of single readouts in targets of interest, HCS promises broad attainment of deeper phenotypic knowledge about the effects of therapeutics, capturing complex signals like interacting signalling pathways, transcription factor dynamics, and polypharmacology [7][8][9]. Thus the advent of HCS represents an important evolution in the drug discovery paradigm from reductionist target biology to more systems level understanding of cell phenotype.
Faithfully modelling human disease in medium- to high-throughput assays is perhaps the most significant challenge in drug discovery. However, recent technological breakthroughs in human induced pluripotent stem cells (iPSC), creation of genetically well-defined models of disease through CRISPR-Cas9 gene editing, derivation of primary human cells, and advances in 3-dimensional (3D) in vitro biology techniques are converging towards accurately recapitulating specific segments of disease pathology in screening formats [10] (Fig. 2). Multicellular co-culture, microphysiological systems (MPS), 3D microtissue spheroid, and organoid models have been gaining in popularity and better represent the complex tissue architecture found in vivo. Moreover, such models are particularly well suited for the latest high content imaging platforms which provide spatial resolution in X, Y, and Z dimensions [11][12][13][14] (Fig. 2). Recent advances in 3D bioprinting and miniaturisation of microfluidic devices further improve assay reproducibility and support the generation of homogeneous multicellular 2D and 3D models in standard microtiter plate screening formats [15,16] (Fig. 2).
The development of more disease-relevant models plays to the strength of academic research centers with deep understanding of disease biology and access to clinical specimens, which represent fruitful areas for academic-industry collaboration in which to maximise impact.
Analytical challenges scale alongside assays that capture increasing amounts of broad (many measurements) and deep (high number of multiparametric features) data. Early efforts to deal with data on this scale relied upon expert knowledge to craft tools for signal extraction and analysis. Advances in the application of Artificial Intelligence/Machine Learning (AI/ML) technology coupled with significant improvements in computational power have found numerous applications in the drug discovery process [17,18], which has impacted every area of the drug discovery pipeline from target identification to clinical trials [19]. This revolution has been driven by significant improvements in computational power, with consumer grade graphics cards delivering teraflops of performance coupled with the open nature of communities supporting these efforts and releasing powerful open source toolkits [20]. Consequently, AI/ML techniques are routinely applied to high content imaging data to classify cell phenotypes and predict therapeutic mechanism-of-action [21,22,23]. Artificial neural networks (ANNs) and deep learning are growing areas of interest in biological image analysis [24] and are being applied to a variety of prediction tasks. Convolutional Neural Networks (CNNs) have been particularly impactful for image analysis, enabling deep architectures [25] and techniques to be applied in unsupervised [26,27] and supervised settings for small molecule mechanism-of-action prediction [28,29].
The term "phenomics" was first coined to describe the comprehensive study of phenotypes [30], which provides functional context to genomic, transcriptomic and proteomics data. Integration of high content imaging data with other multiomics data types using AI/ML is required to provide a systems biology level understanding of cell phenotype and therapeutic mechanism-of-action. Multidisciplinary research collaborations across academic and industry sectors are contributing to a new era of "phenomics drug discovery" where HCI is core to the development of more disease relevant and mechanistically informative drug discovery. Below we discuss the evolution and impact of HCI from both academic and industry perspectives. We describe the significant advances in HCI analysis and the development and application of multiparametric phenotypic profiling which has revolutionized the field. We describe the important role of image data repositories and image data sharing standards to further advance the HCI field and we provide our future perspectives on the next phase of HCI hardware and software evolution.
Fig. 1. Evolution of High Content Imaging hardware and software solutions, including open-source software for raw image analysis and secondary data analysis has contributed to increased adoption and variety of HCI applications. These developments include more sophisticated analysis of biological samples of increasing complexity including co-cultures and 3D models and integration of high content imaging with other single cell multiomics technologies.

Evolution and impact of high content imaging: an academic perspective
The academic sector's considerable strengths in biology and data science are contributing significantly to the evolution of HCI. Various academic research groups have gained deep knowledge of specific areas of human disease biology from several years and even decades of intense and focused research effort. In addition, academic research centres with close links to the clinic and ready access to human volunteer and/or patient samples have provided a major contribution to the development and application of patient derived cell and tissue models for translational research. For example, development of co-culture protocols [31] has enabled production of large amounts of conditionally reprogrammed cells from accessible biopsy specimens, from both healthy tissues and tumor samples, that have subsequently been applied in HCI studies [32]. Early adoption of HCI capabilities in academia pushed the boundaries of sample complexity, incorporating complex tissue samples such as coeliac tissue biopsies into primary screening assays [33]. Following many years of academic research efforts, the generation and culture of 3D organoid tissues is now routine [34] and can be performed at scale for high throughput screening and HCI applications [11,12]. The adaptation of fresh human tissue samples for in vitro cell culture, ex vivo tissue slice and organoid translational research applications overcomes many of the disadvantages of using transformed cell lines for drug discovery [34,35]. While primary human and patient-derived ex vivo models are of high value, the relevant tissue is, in many cases, difficult to obtain, or available only after the patient's death (e.g. heart, brain, and healthy liver). A major breakthrough in the ability to develop tissue specific cell-based disease models, including patient-derived cell assays at scale, has been achieved through the development of human induced pluripotent stem cell (iPSC) technology [36].
Protocols to derive iPSCs were first developed in academia and academia continues to be an important source of iPSC models, including specialised iPSC lines with various gene editing strategies to create genetically well-defined models of disease and "normal" gene corrected counterparts. Many human iPSC derived cell models have been adapted for high throughput and high content imaging formats [37][38][39]. Additionally, approaches to recapitulate disease biology with gene editing in otherwise normal tissue derived iPSC models to specifically map genotype to phenotype and simulate disease trajectories are in early stages of academic development [40]. However, maintaining human iPSC cell lines and optimising differentiation protocols at scale with high levels of consistency for screening applications requires significant resources and close adherence to standard operating procedures typically found in industry or core academic research facilities.
Data science is a rapidly emerging field in biomedical research which has been driven by the need to manage, integrate and interpret big data sets generated through advances in technology platforms such as next generation sequencing, proteomics, digital pathology and high content imaging. Academia has played a major role in developing data science by bringing together different disciplines from the numerical sciences (mathematics, statistics, computer science, bioinformatics) and other diverse fields such as astronomy and engineering to solve similar data-related problems, such as scalable data processing and data visualisation. This collaborative effort has led to many critical open source software libraries that are essential to data-driven initiatives, as well as the development of pioneering AI/ML approaches that reach across domains, and commonly into biomedical research applications. Early data science contributions to HCI from academia included the provision of open source image analysis solutions (e.g. CellProfiler [4]) that were readily compatible with automated analysis across large numbers of images. Additional early open source software developments included tools that allowed biologists to apply machine learning based classification of cell phenotypes from images such as CellClassifier [41], Advanced Cell Classifier [5], ilastik [42] and KNIME [43]. The provision of both commercial software (e.g. Definiens) and open source tools, combined with rapid advances in computing power and multiparametric imaging, facilitated faster and more reliable analyses that revealed more nuanced insights and brought forth a new era of high content imaging known as "phenotypic profiling". We are currently living in this era, where data science and AI/ML are rapidly improving insights and all intermediate steps from experimental design to data acquisition and data processing.
Fig. 2. Recent advances in human iPSC technology, patient-derived models, 3D bioprinting, 3D tissue organoids, CRISPR-Cas9 gene-editing and novel microfluidic devices are converging with the latest advances in high content imaging to produce more disease-relevant and mechanistically-informative in vitro models for drug discovery and basic research. Further integration of high content imaging data with orthogonal multiomics datasets and emerging AI/ML solutions is contributing to the new field of "phenomics".
Further evolution of phenotypic profiling methods in academia included development of the Cell Painting assay: a relatively low cost multiplex assay that "paints the cell" with multiple fluorescent dyes to obtain quantitative phenotypic profiles of cell morphology without the need for specific antibody labelling or genetically engineered probes [44,45]. The canonical Cell Painting assay multiplexes six fluorescent dyes, imaged in five spectral channels, to reveal eight broadly relevant cellular components or organelles. The Cell Painting assay is flexible, limited only by fluorescent channel overlap, and the profiles generated are robust across a number of microscope setting changes as measured by percent replicating metrics [46]. The Cell Painting assay can be adapted to specific screening conditions by, for example, replacing a canonical stain with a dye targeted to a specific biological entity of interest, such as adding a stain for lipid droplets in a screen for metabolic disease treatments [47]. Studies have also shown that brightfield imaging may contain just as much if not more detail than the unbiased Cell Painting stains [48,49], but this performance strength may be experiment specific [46]. In the future, we may modify the Cell Painting panel to flexibly focus on specifically known biological markers while allowing the brightfield channel to drive unbiased cell morphology analyses. In a pilot study of bioactive compounds, the Cell Painting assay detected a range of cellular phenotypes and the multiparametric phenotypic profiles were used to cluster compounds with similar annotated protein targets or chemical structure [45]. A recent study to evaluate if human genes can be functionally annotated using the Cell Painting assay demonstrated that 50% of the 220 genes tested yielded detectable morphological profiles which group into biologically meaningful gene clusters consistent with known functional annotation [50].
The JUMP-CP consortium (Joint Undertaking for Morphological Profiling-Cell Painting) led by the Broad Institute at MIT and Harvard and including several pharmaceutical industry partners aims to create a large public Cell Painting dataset of over 136,000 genetic and chemical perturbations [51]. It is anticipated that this public resource will catalyse new drug discovery programs across both academia and industry by enabling the prediction of compounds' mode of action and toxicity, characterising disease phenotypes and uncovering new therapeutic target biology.
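The profile-matching idea behind these Cell Painting resources can be illustrated with a minimal sketch. It assumes each perturbation has already been reduced to a short numeric profile, and it ranks annotated reference profiles by cosine similarity to a query profile. The function names, labels and numbers below are hypothetical and chosen only for illustration; they are not taken from any of the cited studies, and published workflows typically add normalisation, feature selection and replicate aggregation before this step.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero profile carries no direction to compare
    return dot / (norm_a * norm_b)


def rank_reference_matches(query_profile, reference_profiles):
    """Rank annotated reference profiles by similarity to a query profile.

    reference_profiles maps an annotation label to a feature vector.
    Returns (similarity, label) pairs, most similar first.
    """
    scored = [
        (cosine_similarity(query_profile, profile), label)
        for label, profile in reference_profiles.items()
    ]
    return sorted(scored, reverse=True)


# Toy, made-up profiles with four morphological features each.
reference = {
    "tubulin_inhibitor": [2.1, -0.3, 1.8, 0.2],
    "dna_damage": [-0.5, 2.4, 0.1, 1.9],
    "dmso_control": [0.1, 0.0, -0.1, 0.05],
}
query = [1.9, -0.1, 1.5, 0.4]

ranked = rank_reference_matches(query, reference)
best_score, best_label = ranked[0]
print(best_label, round(best_score, 3))
```

Cosine similarity is used here because it compares the direction of a profile rather than its magnitude; correlation-based or learned similarity measures are equally plausible choices.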
Academic research funding has generally been directed towards answering hypothesis-driven research questions and hypothesis-free applications have historically been dismissed by funding panels as "fishing expeditions". However, with the abundant successes that sprang from the human genome project [52], hypothesis-free research has demonstrated significant value in basic and translational academic research and yielded a new discipline of "discovery science". HCI assays can be designed to both test specific hypotheses and enable robust hypothesis generation with discovery science. Early adopters in academia employed HCI assays to identify compounds which control centrosome duplication [53]. Other groups combined HCI with siRNA screens to support early functional genomic screens [54,55] to reveal new biology and therapeutic targets; an approach which has now been widely adopted across academia and industry using the latest generation of arrayed CRISPR libraries [56]. In summary, academia has played a major role in the evolution of HCI capabilities and has also been a beneficiary of the overall evolution of HCI technology as a powerful tool for new knowledge creation. The academic sector is also strongly positioned to play an important role in the next evolution of HCI through continued development of hardware, software and data analysis solutions and through contributing novel biological capabilities and applications, especially applied to rare diseases. As discussed further below, these developments will benefit from strong academic-industry collaboration and partnerships.

Evolution and impact of high content imaging: an industry perspective
HCI is applied as a mainstay in the pharmaceutical industry across the discovery pipeline, from target identification through to candidate selection.
Two-dimensional culture multicolour imaging assays remain the assay category of primary impact in industry as gauged through assay abundance, application in progressing discovery programs, and influence on decision making. Largely this is due to simplicity of setup and direct biological relevance of the readout from the target-specific probes used. Yet even within these assays, methodological evolution is evident through increased frequency of use of primary cells, co-cultures, or iPSC-derived models. The trend to utilise more physiologically relevant cell models early in the discovery pipeline has been achieved through scalable cell factories for iPSC line generation and differentiation, in addition to simplified isolation procedures and commercial availability of primary cell sources. Indeed, this change typifies industry priorities by internalizing academic protocols to streamline methods and achieve robustness for routine, scalable implementation.
In industry, safety and efficacy profiling has been a primary beneficiary of the adoption of HCI, especially when coupled with complex in vitro models. HCI in organoids in particular has shown application in screens for efficacy, lead identification, and safety [11,57]. Deploying these HCI methods aims to reduce high attrition rates of candidate therapeutics. A meta-analysis of 2003-2011 clinical phase data indicated only 10% of candidates entering clinical trials resulted in FDA approvals, with efficacy or safety cited as predominant reasons in first review response [58]. The probability of launch statistics are reflected within more recent analyses and as a sub-category, Phase II failure, in particular, has been identified as 79% attributable to safety and efficacy [59]. To provide more predictive safety assessment in early discovery, HCI has been adopted in predictive toxicology, with industry groups overwhelmingly classifying the technology as a current or near-term game-changer [60]. In particular, the use of HCI in complex models has enabled more accurate toxicological assessment in vitro, such as the use of MPS for detecting hepato- and renal-active compounds [61]. HCI also provides readouts in MPS liver models labelled with general cell health, lipid and bile canalicular markers to provide evidence of different mechanisms of hepatotoxicity [62]. The economic benefit of full integration of such assays is placed at billions of dollars per annum [62], yet it is worth recognising that safety profiling is applicable to tens of compounds rather than hundreds, so it needs to be carefully integrated into the discovery pipeline, such as at candidate selection.
For higher throughput methods that can be utilised in early discovery, spheroid systems show improved predictivity over 2D sandwich and monolayer cultures [63]; however, despite some assay systems using HCI to provide whole-spheroid intensity or volume readouts, more often than not, a simpler biochemical readout for cytotoxicity supersedes the collection of HCI data [64]. An expected revolution in the outlook of 3D HCI data use is likely to come from adoption of next-generation hardware; most notably light sheet microscopy adaptations that provide plate-based imaging facility, reduced photodamage, and improved acquisition speeds for suitable z-sampling to enable accurate measurements of 3D substructure [65,66]. The reader is referred to the Hardware section of this article for further discussion on light sheet and options available. Other HCI methods are under exploration to improve predictions of drug toxicity, which are particularly effective when augmented with omics data sources [38,67]. A notable academic-industry partnership, the Omics for Assessing Signatures for Integrated Safety (OASIS) consortium, is leading efforts in hepatotoxicity prediction involving image profiling and exemplifies a cross-sector effort with many additional benefits including rich annotation sources of in vivo studies against anonymized compounds, assay standardization, and generation of large, well annotated image databases that can be leveraged by AI/ML methods.
The category of assays relating to image-based morphological profiling is at a stage of maturity where these assays are now being practically applied in industry discovery efforts. Recursion Pharmaceuticals are driving the largest-scale application of image-based profiling, having placed Cell Painting at the centre of their Phenomics strategy for hit identification and progressing five assets to clinical trials in 2022. Recursion releases large annotated image reference sets (e.g., RxRx3), which are composed of millions of images of tens of thousands of unique perturbations. RxRx3, alongside the JUMP consortium and OASIS, provides support for innovative method development and benchmarking of computational approaches, which has had a large impact in extending the utility of Cell Painting data [68] (Fig. 3).
To maximise the capacity of these large datasets as inference systems, the HCI field requires understanding of the image acquisition landmarks (i.e. perturbation replicates, sample size, plate distribution, incubation time, etc.) and analytical methods by which newly collected datasets might integrate seamlessly and demonstrate sufficient statistical similarity in order to reliably annotate perturbation cell states. An underappreciated but critical aspect when generating data to build or query these large datasets relates to careful quality control in wet-lab procedures. Indeed, for this reason industry settings are optimizing generation of high quality profiles by adopting standardised wet-lab approaches to align with the latest academic protocols (e.g., JUMP Cell Painting), full process automation, and incorporating nuisance compound sets [69,70]. Furthermore, industry is improving data quality by systematically addressing batch effects through adoption of effective quality controls including cell line stratification, plate position randomisation, cell counts, staining distribution, and image quality control measures such as focus scoring.
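One of the quality controls listed above, plate position randomisation, can be sketched in a few lines. The example assumes a 384-well plate (16 rows by 24 columns) and a simple, fully random assignment of samples to wells with a fixed seed so that the layout can be regenerated. The function name `randomised_plate_layout` and the sample labels are hypothetical; real screening layouts often add constraints, such as spreading controls across the plate or avoiding edge wells, that this sketch does not attempt.

```python
import random
import string


def randomised_plate_layout(perturbations, n_rows=16, n_cols=24, seed=0):
    """Assign each perturbation replicate to a randomly chosen well.

    Returns a dict mapping well names such as "A01" to perturbation labels.
    The default dimensions correspond to a 384-well plate.
    """
    wells = [
        f"{string.ascii_uppercase[row]}{col + 1:02d}"
        for row in range(n_rows)
        for col in range(n_cols)
    ]
    if len(perturbations) > len(wells):
        raise ValueError("more perturbations than wells on the plate")
    rng = random.Random(seed)  # fixed seed keeps the layout reproducible
    chosen_wells = rng.sample(wells, len(perturbations))  # no well used twice
    return dict(zip(chosen_wells, perturbations))


# Hypothetical plate: 10 compounds in quadruplicate plus 8 DMSO control wells.
samples = [f"compound_{i}" for i in range(1, 11) for _ in range(4)] + ["DMSO"] * 8
layout = randomised_plate_layout(samples, seed=42)

for well in sorted(layout)[:5]:
    print(well, layout[well])
```

Recording the seed alongside the plate metadata means the same layout can be rebuilt later when tracing a suspected positional or batch effect.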
Image profiling exemplifies the inextricable integration of HCI within industry, as it has improved drug discovery pipelines by expediting target identification through to medicinal chemistry, providing an opportunity for in silico drug prediction, and serving as a cornerstone for AI-driven drug discovery.

Evolution of high content image analysis
In an image analysis experiment, a data scientist outlines, or segments, objects of interest, such as cells, in order to extract numerical descriptors suitable for downstream statistical and machine learning analyses. While there is no one-size-fits-all pipeline for all imaging datasets, we are converging on a canonical image processing pipeline (Fig. 4) [71][72][73]. Evolving organically, the pipeline includes image quality control, image correction, cell segmentation, cell feature extraction, and batch effect correction. Since the mid-2000s, various methods have been developed to perform each step in this pipeline, but one of the most common approaches uses CellProfiler, which is a user-friendly, flexible tool that facilitates image data processing with a dynamic plugin system to incorporate and improve various pipeline steps [74]. CellProfiler allows automated image analysis and object segmentation using intensity thresholding and watershed-based methods. In addition to image segmentation, CellProfiler orchestrates the pipelines, and decoupling each of these steps has led to independent optimization and many analysis improvements.
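To make the idea of threshold-based object segmentation concrete, the following is a minimal sketch using only the Python standard library. It picks a global threshold with Otsu's method and then labels 4-connected foreground regions as objects. This is an illustration of the general technique, not CellProfiler's implementation: it omits the watershed step that is typically used to split touching cells, and the function names and toy image are assumptions made for the example.

```python
from collections import deque


def otsu_threshold(pixels):
    """Return the threshold that maximises between-class variance (Otsu)."""
    values = sorted(set(pixels))
    total = len(pixels)
    best_t, best_var = values[0], -1.0
    for t in values[:-1]:  # the largest value would leave no foreground
        background = [p for p in pixels if p <= t]
        foreground = [p for p in pixels if p > t]
        w0, w1 = len(background) / total, len(foreground) / total
        mu0 = sum(background) / len(background)
        mu1 = sum(foreground) / len(foreground)
        between_var = w0 * w1 * (mu0 - mu1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def label_objects(image, threshold):
    """Label 4-connected regions of pixels brighter than the threshold."""
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    n_objects = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] <= threshold or labels[r][c]:
                continue  # background pixel, or already part of an object
            n_objects += 1
            labels[r][c] = n_objects
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill of one object
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and image[ny][nx] > threshold
                            and not labels[ny][nx]):
                        labels[ny][nx] = n_objects
                        queue.append((ny, nx))
    return labels, n_objects


# Toy image: dim background with two bright, well-separated "nuclei".
image = [
    [10, 12, 11, 10, 10, 11],
    [11, 90, 95, 10, 12, 10],
    [10, 92, 88, 11, 10, 10],
    [12, 10, 11, 10, 85, 91],
    [10, 11, 10, 12, 89, 94],
]
threshold = otsu_threshold([p for row in image for p in row])
labels, n_objects = label_objects(image, threshold)

# Object area in pixels, a simple example of a per-object feature.
areas = {}
for row in labels:
    for lab in row:
        if lab:
            areas[lab] = areas.get(lab, 0) + 1
print(threshold, n_objects, areas)
```

The per-object areas computed at the end stand in for the much larger set of numerical descriptors that a feature extraction step would produce for each segmented cell.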
Fig. 3. The Cell Painting assay utilises a collection of fluorescent dyes to label multiple subcellular compartments, image analysis algorithms can then measure multiple features in each of these compartments to create a phenotypic fingerprint for every cell before and after compound treatment. Compound or genetic-induced phenotypic fingerprints can be interrogated by multivariate statistics or machine learning models to classify cell phenotypes and predict compound mechanism-of-action, toxicity and activity across other assays and model systems. Consortia and/or public datasets which are exploiting Cell Painting data include: JUMP-CP (https://jump-cellpainting.broadinstitute.org/); RxRx3 (https://www.rxrx.ai/) and OASIS (omics for assessing signatures for integrated safety consortia).
As a first step after image acquisition, image quality control can be a manual and laborious process that is user subjective. Efforts to automatically flag cells based on poor focus and debris using machine learning and simulated data have reduced manual requirements thus increasing throughput and confidence in biological findings [75,76]. Next, image correction, which adjusts technical artefacts based on image capture, is an often overlooked step that is growing in appreciation and importance [77,78]. The most common adjustment is illumination correction (IC), which adjusts for uneven lighting induced by the microscope; most often a phenomenon called vignetting, which causes the edges of the field of view to be darker than the centre [78]. There are currently several different methods that adjust for illumination [77][78][79], and different microscopy approaches may require unique solutions (e.g. modelling live cell imaging for increased photobleaching over time) [79]. While increasing in importance, the field currently lacks approaches to systematically identify if illumination correction is needed, if it was successfully applied, or the extent to which it impacts biological findings. Furthermore, in multiplex imaging applications, stains can have overlapping emission wavelengths resulting in bleed through across spectral channels, which is particularly important to adjust for when measuring colocalization between structures. However, efforts to compensate for this bleed through are also in early days and require more methods, software development and statistical benchmarking [80]. Following image quality control and adjustments, data scientists extract high-dimensional cell biology features, which describe various phenotypes, cell states, and technical artefacts. There are existing tools to extract so-called hand-engineered features, which are based on classical computer vision algorithms [81][82][83][84]. There are also emerging solutions, based on deep learning, which promise to learn more informative morphology features [35,85,86,87,88,89,90,91]. While deep representation learning is a hot topic, it is still yet to be seen if these features will supplant the more interpretable hand-engineered features that the computer vision community has developed over decades. Additional methods, based on batch effect correction are also becoming increasingly important as data size increases, and it is unclear at what stage to perform batch correction either as an image-processing step or post feature extraction [92].
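The illumination correction step described above can be sketched as follows. This is one simple retrospective approach, assumed here for illustration rather than taken from the cited methods: estimate a shared illumination function as the per-pixel median across many images from the same channel and plate, rescale it so its brightest pixel equals 1, and divide each image by it. The function names, the synthetic vignetting pattern and the uniform toy images are all hypothetical.

```python
from statistics import median


def estimate_illumination(images):
    """Estimate a shared illumination function from a stack of images.

    Takes the per-pixel median across images and rescales the result so
    that the brightest position has a gain of 1.
    """
    rows, cols = len(images[0]), len(images[0][0])
    illum = [
        [median(img[r][c] for img in images) for c in range(cols)]
        for r in range(rows)
    ]
    peak = max(max(row) for row in illum)
    return [[value / peak for value in row] for row in illum]


def correct_illumination(image, illum, eps=1e-6):
    """Divide an image by the illumination function, pixel by pixel."""
    return [
        [pixel / max(gain, eps) for pixel, gain in zip(img_row, illum_row)]
        for img_row, illum_row in zip(image, illum)
    ]


# Synthetic vignetting: the centre is fully lit, edges and corners are darker.
vignette = [
    [0.5, 0.8, 0.5],
    [0.8, 1.0, 0.8],
    [0.5, 0.8, 0.5],
]
# Three toy fields with uniform true intensities, each dimmed by the vignette.
true_levels = (100.0, 120.0, 140.0)
images = [[[level * g for g in row] for row in vignette] for level in true_levels]

illum = estimate_illumination(images)
corrected = correct_illumination(images[0], illum)
print(corrected)
```

In this toy case the corrected first image is flat again at its true intensity. With real data the estimate is only as good as the assumption that all images share the same lighting pattern and that cells fall at varied positions across fields, which is why the text notes the lack of systematic checks on whether correction was needed or successful.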
Cell segmentation is a particularly challenging and important step because of the huge variability in imaging equipment, imaging modalities and fluorescence markers, and it has therefore received a lot of research attention. It is clear that different segmentation algorithms impact how data scientists identify objects [93], but the extent to which segmentation impacts biological insights is yet to be determined. In the early days of HCI, most segmentation methods were based on manual thresholding, which is time consuming and error-prone. Localization of object centers can also be achieved using the minimum spread square loss function [94]. Today, many machine learning methods have emerged to automate segmentation of HCS data. Most segmentation models are inspired by a U-Net architecture that utilizes downsampling encoding layers followed by upsampling decoding layers to segment the image [95]. U-Net also includes skip connections between the encoder and decoder to preserve the spatial information from the input image, which improves segmentation accuracy. For example, the popular models StarDist [96], CellPose [97] and DeepCell [98] are modified U-Net architectures that were trained on huge datasets and are now capable of segmenting a wide variety of image data with minimal or even no training. Furthermore, software tools such as ICY [99], QuPath [100], and ilastik [42] offer the user more flexibility to train their own algorithms. These approaches often require significant user input for challenging datasets with high confluence or heterogeneous cytoplasm, which makes generalization to other datasets difficult. However, once trained on a particular dataset they can be improved through fine-tuning. In summary, deep learning can learn the important features required for accurate cell segmentation directly from raw images and can handle heterogeneous imaging data capturing various staining and imaging modalities.
It remains a rich research area with many groups proposing new approaches, both generic and cell type specific, which extends beyond high content imaging to other data modalities such as electron microscopy, pathology, and spatial transcriptomics [95,97,[101][102][103].
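For contrast with the learned models discussed above, the threshold-based segmentation used in early HCI can be sketched in a few lines. This is an illustrative baseline under stated assumptions, not a reconstruction of any cited method: the threshold is chosen by hand, objects are defined as 4-connected regions of above-threshold pixels, and the function name `threshold_segment` and toy image are hypothetical.

```python
from collections import deque


def threshold_segment(image, threshold):
    """Label 4-connected regions of pixels brighter than `threshold`.

    image: list of equal-length rows of intensities.
    Returns (labels, n_objects), where labels holds 0 for background
    and 1..n_objects for each detected object.
    """
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    n_objects = 0
    for r in range(rows):
        for c in range(cols):
            # Skip background pixels and pixels already assigned to an object.
            if image[r][c] <= threshold or labels[r][c]:
                continue
            n_objects += 1
            labels[r][c] = n_objects
            queue = deque([(r, c)])
            # Breadth-first flood fill over 4-connected neighbours.
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and image[ny][nx] > threshold
                            and labels[ny][nx] == 0):
                        labels[ny][nx] = n_objects
                        queue.append((ny, nx))
    return labels, n_objects


# Hypothetical toy image with two bright objects on a dark background.
toy_image = [
    [0, 0, 0, 0, 0, 0],
    [0, 9, 8, 0, 0, 0],
    [0, 7, 9, 0, 6, 0],
    [0, 0, 0, 0, 7, 0],
    [0, 0, 0, 0, 0, 0],
]
labels, n_objects = threshold_segment(toy_image, threshold=5)
```

The sketch also shows why this approach is fragile: a single global threshold has to be re-tuned whenever staining intensity or illumination changes, and touching cells merge into one object, which is the kind of failure that learned models aim to avoid.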
Due to the rapid advances in imaging technologies, we are able to capture different biological scales that consist of highly variable structures, from organelles and molecules to organoid and vascular morphology. To date, bespoke methods are often required to segment these images [104]. However, we anticipate that new deep learning models will be able to segment various types of objects. For example, the recent release of the Segment Anything Model (SAM) by Meta [105], which is not restricted to biological imaging, could be a very promising direction: within a few clicks we were able to obtain very accurate segmentation of challenging tissue image data (Fig. 5).
Other emerging tools include Segment Everything Everywhere All At Once (SEEM), a segmentation model presented by an academic-industry partnership between the University of Wisconsin-Madison and Microsoft [106]. These large models based on transformer architectures offer zero-shot learning for a variety of generalized tasks, including cell segmentation, and may represent the next generation of segmentation approaches able to immediately generalize to diverse datasets. In the coming months and years, our field will continue stress testing and fine-tuning these approaches in their application to HCS segmentation.

Evolution of high content data analysis pipelines towards multiparametric and phenotypic profiling applications
In 2004, Perlman et al. published a landmark paper describing the ability of multiparametric high content phenotypic measurements to derive compound fingerprints, which showed that compounds with similar mechanisms of action induced similar cell morphologies [17]. Early examples from academia and industry explored the use of machine learning classifiers to predict mechanism-of-action of phenotypic hits by comparing the similarity of their high content phenotypic profiles with a reference library of well-annotated compounds [22,23]. Further development of high content phenotypic profiling assays combined with multivariate statistics and machine learning led several academic groups and academic-industry collaborations to further demonstrate the utility of image-based phenotypic profiling in discriminating phenotypes [22,23,107-109]. These initial screens generated biological insights from terabytes of imaging data and were no doubt important and useful applications.
Fig. 4. Standard HCI experimental pipeline. After experimental design (A) scientists perform wet lab work to acquire high content cell images, which then requires several canonical image analysis steps. Cell segmentation is optional, but will allow single-cell profiling downstream. After image featurization, (B) scientists perform all the image-based profiling steps, to prepare data for downstream analyses. (C) This full pipeline must be orchestrated by reproducible software tools to ensure data provenance and to enable benchmarking. Both wet lab and dry lab biologists must be included in all processes from experimental design to results interpretation.
However, the analysis pipelines and software infrastructure to handle this early data deluge were underdeveloped. This kicked off a scientific arms race between data collection and data analysis, and the cycle continues today as larger and larger datasets are continuously generated that outpace our ability to fully analyze them. As we screen more and more drugs, we are also being humbled to learn that finding effective drugs with HCS is difficult. We are therefore pairing increased data collection with concurrent improvements in phenotypic profiling methods and software, with the expectation that better infrastructure and computational approaches will yield higher value screens. Novel hit calling or ranking methods that can be applied to Cell Painting profiles are of particular interest. Existing metrics, for example scalar projection, have been successfully used to rank profiles against on- and off-perturbation phenotypes [110]. However, many of these measures have limitations and the field will benefit from increased collaboration with experts in mathematical disciplines to build improved approaches to similarity ranking.
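As an illustration of the scalar projection idea mentioned above, the sketch below ranks compound profiles by how far they move along the direction from an off-perturbation (e.g. negative control) profile to an on-perturbation (e.g. positive control) profile. This is a generic textbook formulation and may differ in detail from the one used in [110]; the function name, profile values and compound names are hypothetical.

```python
import math


def scalar_projection(profile, off_profile, on_profile):
    """Length of (profile - off) along the unit vector pointing from off to on.

    Returns 0 for a profile identical to the off-perturbation phenotype and
    the off-to-on distance for one identical to the on-perturbation phenotype.
    """
    direction = [on - off for on, off in zip(on_profile, off_profile)]
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        raise ValueError("on- and off-perturbation profiles are identical")
    shifted = [p - off for p, off in zip(profile, off_profile)]
    return sum(s * d for s, d in zip(shifted, direction)) / norm


# Hypothetical three-feature profiles.
off_profile = [0.0, 0.0, 0.0]
on_profile = [3.0, 4.0, 0.0]
compounds = {
    "compound_A": [3.0, 4.0, 0.0],   # matches the on-perturbation phenotype
    "compound_B": [1.5, 2.0, 7.0],   # halfway along, plus an unrelated change
    "compound_C": [-4.0, 3.0, 1.0],  # orthogonal to the on/off direction
}
scores = {
    name: scalar_projection(profile, off_profile, on_profile)
    for name, profile in compounds.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Note that compound_B scores 2.5 despite a large change in the third feature, because the projection ignores everything orthogonal to the on/off direction. This is one example of the kind of limitation that motivates improved similarity measures.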
Our increased ability to analyse high content data, and thus derive insights from phenotypic profiling applications, involves advancing method development at each step in the processing and analysis pipelines. After extracting thousands of features from each image or cell object, a data scientist must apply a bioinformatics pipeline to process these features, preparing them for downstream discovery. Much like with image analysis, this data analysis strategy has evolved organically within academic and industry labs [71]. Scientists and engineers have developed specific software tools for HCS data processing, including Pycytominer [73], BioProfiling.jl [72], Phenonaut [111], and StratoMineR [112], which use either open source or closed source strategies. Canonically, the bioinformatics steps include single-cell processing to aggregate features within each well, metadata annotation, normalisation, feature selection, and consensus signature discovery (Fig. 4B). If images were not flagged in the image analysis steps, these pipelines can also filter single cells to remove incorrect segmentations, debris, out of focus cells, or other issues that may confound results. HCS routinely measures millions to billions of single cells, which makes single cell analysis extremely challenging, as current data analysis infrastructure scales poorly to this many data points. Therefore, the aggregation step, while inherently losing single-cell heterogeneity information, is currently required. The aggregation step may also remove the need for additional quality control filtering, as individual outlier cells contribute little when features are transformed to their median value. Nevertheless, while the promise of microscopy is its inherent single cell nature, we will only realize this promise in HCS by developing new scalable software and methods to leverage single cell information, which are currently being developed [113].
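The aggregation and normalisation steps described above can be sketched as follows. This is a minimal illustration and does not reproduce the API of Pycytominer or any other cited tool; it assumes single-cell features arrive as (well, feature vector) pairs, that wells are aggregated by the per-feature median, and that profiles are standardised against control wells. All names and values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median, stdev


def aggregate_by_well(cells):
    """Collapse single-cell feature vectors to one median profile per well.

    cells: iterable of (well_id, [feature values]) pairs.
    """
    per_well = defaultdict(list)
    for well, features in cells:
        per_well[well].append(features)
    # zip(*rows) iterates over features (columns) across all cells in a well.
    return {
        well: [median(column) for column in zip(*rows)]
        for well, rows in per_well.items()
    }


def normalise_to_controls(profiles, control_wells):
    """Standardise each feature using the mean and standard deviation of the
    control wells (requires at least two control wells)."""
    controls = [profiles[w] for w in control_wells]
    centres = [mean(col) for col in zip(*controls)]
    scales = [stdev(col) for col in zip(*controls)]
    return {
        well: [(v - c) / s if s > 0 else 0.0
               for v, c, s in zip(values, centres, scales)]
        for well, values in profiles.items()
    }


# Hypothetical single-cell data: (well, [area, intensity]).
cells = [
    ("A01", [100, 10]), ("A01", [120, 12]), ("A01", [500, 11]),  # 500 = debris-like outlier
    ("A02", [140, 13]), ("A02", [160, 15]),
    ("B01", [200, 20]), ("B01", [220, 22]), ("B01", [210, 90]),
]
profiles = aggregate_by_well(cells)
normalised = normalise_to_controls(profiles, control_wells=["A01", "A02"])
```

In well A01 the outlier cell with area 500 leaves the median area at 120, which illustrates why median aggregation can reduce the need for extra single-cell quality control. Production pipelines typically use many control wells per plate and often prefer robust scale estimates; the two-well example here is only for brevity.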
A decade ago, building a method and demonstrating utility was good enough for high impact. While not universally the case now, it is much more impactful to release methods with usable, well documented, version controlled software. This software facilitates method application, development, and benchmarking and lives on to be further developed well beyond method publication. We have seen this success with CellProfiler [4] and Fiji [114] and now napari [115], as software evolves much faster than print. In addition to software for methodology, we also require software for reproducible orchestration of data analysis pipelines. CellProfiler serves as an image analysis orchestration engine, but is one that requires extensive biological expertise and experience with manual parameter toggling. Pipelining software such as Snakemake [116], Workflow Description Language (WDL) [117], or Nextflow [118] will enable rapid pipeline development, repurposing, and benchmarking, but there are currently no orchestration engines tailored specifically to full HCS data analysis pipelines (Fig. 4C).
Benchmarking each individual pipeline step is also incredibly challenging and requires standard benchmarks the field agrees upon. It is possible to test each individual step in isolation, but until we test how each step impacts the overall HCS end goal of quantifying phenotype, it is difficult to determine and compare performance. To date, most large-scale screens analyse their data with bespoke pipelines that are presented separately from the data, with questionable reproducibility. Software like AnnData [119] facilitates scalable data handling for specific scientific ecosystems, but the emerging paradigm is for language agnostic data types based on the Apache Software ecosystem to maximise programming language cross-compatibility and cloud computing [120]. Another important step in analysing HCS data is the ability to effectively link features to images in visualizations. Effective data visualization plays an important role in communicating results and facilitating biological interpretation. General-purpose tools, such as heatmaps or t-SNE plots, do not fully capture the structural nature of cell image data, and few tools have emerged to tackle these domain-specific challenges. For example, PhenoPlot [121] and the subsequent ShapoGraphy [122] generate pictorial quantitative representations of data using glyph visualisation. These tools map image-derived measurements to cell-like structures, which enables easier interpretation (Fig. 6). The user can design their own structure in a web based app depending on the parameters they are extracting from the images. Other tools that allow data interaction in a graphical user interface include Mineotaur [123], Facetto [124], and Loon [125]. These tools link quantitative data points with raw image data to identify key trends in cell image features.
Data analysis advances developed for HCS applications are extending into cell biology studies more broadly, as the increasing volume of experiments is generating datasets that outpace our ability to analyse everything by eye [126]. Accelerating this life cycle, publicly-available repositories dedicated to large-scale imaging data are coming on the scene (see section: The role and evolution of image data repositories and sharing standards), which increases the pace of method and software development and increases our agility and pace in fully leveraging HCS data to discover effective treatments.

The role of data integration and multiomics
The opportunity in integrating imaging and omics data is that the information gained from integration will be greater than the sum of its parts. Omics platforms have been deployed, but integration remains an active area of research that will benefit from further academic-industry collaboration and the use of new frameworks for analysis. Multiomics refers to the collection and use of readouts from multiple omics technologies, aiming to capture the complete state of (macro-)molecules present in the measured biological substrate [127]. Practically, this refers to genomics, transcriptomics, proteomics, metabolomics [128], and commonly includes phenomics derived from HCI/HCS and other data sources which might provide complementary information on cell states. AI/ML literature uses the more general term "Multiview" to refer to using multiple representations or "views" of an underlying system state capturing information across different timescales, resolutions and with different batch effects and biases. It has been demonstrated multiple times that combining omics data to derive cellular state adds unique complementary information, enhancing performance in prediction tasks [129,130]. As noted earlier, roughly half of studied genes produce a detectable morphological change [50,131], and therefore it is often seen as the job of complementary omics to fill these blindspots. As well as providing a more complete detection of cellular responses, Joyce and Bernhard [132] document a pairwise matrix of omics views and resolve the potential biological insights achievable for all pairs including enzyme annotations, regulatory complexes, binding sites, gene regulatory networks, and functional annotations. Whilst powerful, complementary views are costly to generate in terms of time, reagents and data analysis requirements. This has typically limited multiomics to the bookends of the drug discovery pipeline.
At the beginning of the pipeline, target validation makes heavy use of multiomics [133], applying multiple techniques to ensure perturbation of a given target leads to the desired therapeutic response. At the opposite end of the pipeline, candidate validation aims to collect extensive information on the most promising treatments, ensuring they are hitting the correct targets with little or no off-target effects and low toxicity [134,135]. Traditionally, the middle of the pipeline, where throughput is critical, has not been augmented by the collection of multiomics data. However, this is now changing as speed of data acquisition and analysis increases, assays improve, automation increases in scale, and costs decrease. Scientists are collecting more imaging data alongside other omics technologies, which are transcending the traditional single readout high throughput screening setups. With continued development of assay technologies, increased multiomics throughout the discovery pipeline would allow many benefits, such as earlier triage of problematic treatments, prioritisation of the most promising agents, and the overall ability to make better informed decisions during the discovery process. Thus, HCI is a prime candidate for further integration into the entire drug discovery pipeline.
Currently, science is embracing multidisciplinary applications of techniques from diverse fields, including information theory, computer science, machine learning and statistics, for integration of different omics views, aiming to improve predictions over single omics/single view applications. Whilst many approaches exist, they can all be placed into three broad categories as defined by Rappoport [136]: early-, mid- and late-stage integration. Early integration combines different views in a pre-processing step isolated from any prediction task. Examples of this include simple concatenation of features across views [130], while more involved transformations align correlated features across views [137]. Mid-integration refers to techniques in which feedback from the prediction task is used to direct the transformation and joint embedding of views in a shared phenotypic space [138]. Finally, late integration covers approaches which carry out predictions on individual views and then unify predictions using a mechanism like consensus scoring [139,140]. Another example of late integration is correlating phenotypes discovered in images to other omic data measuring comparable samples, which can be applicable to single and bulk datasets [141]. Extensive reviews of multiomics integration techniques may be found in the literature [136,142,143].
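The difference between early and late integration can be sketched with toy data. The example below is illustrative only: early integration is shown as plain feature concatenation, and late integration as an unweighted mean of per-view prediction scores, which is just one simple form of consensus scoring. Mid-integration is omitted because it requires a trained model. All function names, sample identifiers and values are hypothetical.

```python
def shared_samples(views):
    """Sample identifiers present in every view, in sorted order."""
    return sorted(set.intersection(*(set(view) for view in views)))


def early_integration(views):
    """Concatenate per-sample feature vectors across views, before any
    prediction task is run."""
    return {
        sample: [x for view in views for x in view[sample]]
        for sample in shared_samples(views)
    }


def late_integration(view_scores):
    """Average prediction scores obtained independently from each view
    (a simple unweighted consensus)."""
    return {
        sample: sum(scores[sample] for scores in view_scores) / len(view_scores)
        for sample in shared_samples(view_scores)
    }


# Hypothetical feature views: imaging has two features, transcriptomics one.
imaging = {"cpd1": [0.2, 1.5], "cpd2": [0.9, 0.1]}
transcriptomics = {"cpd1": [3.0], "cpd2": [1.0], "cpd3": [2.0]}
joint_features = early_integration([imaging, transcriptomics])

# Hypothetical per-view prediction scores (e.g. probability of activity).
imaging_scores = {"cpd1": 0.75, "cpd2": 0.25}
transcriptomics_scores = {"cpd1": 0.5, "cpd2": 0.25, "cpd3": 0.5}
consensus = late_integration([imaging_scores, transcriptomics_scores])
```

In both cases cpd3 is dropped because it was not measured in the imaging view, a reminder that integration is limited to samples with matched data across views.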
With such a wide choice of integration approaches available in the literature, there is no clear 'best' choice, and one cannot be determined without a comprehensive benchmark matching the omics views available to the researcher, the experimental setup and the prediction task. Integration techniques that work well for certain omics views are unlikely to be performant across different pairs, triplets or higher order groups. A benchmark prioritising powerful omics to pair with phenomics and a selection of prediction tasks would be a valuable asset to all within the HCI community. While compound mode of action prediction is arguably the most prioritised task in HCI and drug discovery, large public datasets are only now becoming available to benchmark AI/ML techniques applied to HCS data [51,68]. Critically for multiomics integration, benchmarking using HCI requires pairing with omics views from different sources, which will likely increase batch effects. While a common integration task aims for better predictions, one view may also be used to predict another, with examples demonstrating the use of small molecule structure [144,145] and HCI [146,147] to predict proteomics profiles. The use of fast, low cost technologies like HCI to make predictions in this manner allows application of well supported tools and databases outside of the collected omics technology ecosystem, allowing application over more of the drug discovery pipeline.

The role and evolution of image data repositories and sharing standards
In HCS experiments, scientists collect a large amount of data, which makes tracking both the data itself and information about the data challenging. This information about data, also known as metadata, is critical for reproducibility, especially given the high costs and low success rates of these experiments. This need for reproducibility is even more critical given the current reproducibility crisis in many fields. In our era where experiments fail to replicate more often than not [148], we must continue to report as much metadata as possible, using standardized identifiers [149]. Recently, researchers have proposed the Recommended Metadata for Biological Images (REMBI), which aims to provide metadata guidelines for diverse microscopy communities, begin discussion about standardising metadata identifiers, and promote reuse of microscopy datasets [150]. Over the past five years, more emphasis has been placed on sharing these large datasets, which has increased the importance of tracking metadata information and reproducibility. These efforts all fall under the umbrella of FAIR research, which aims to make data Findable, Accessible, Interoperable, and Reusable [151]. Ultimately, the results and utility of the initial HCS are validated downstream in the efficacy of the hits that were prioritised for follow-up characterization. Therefore, the scale and lag time to validation make assessing experimental reproducibility especially challenging. Conversely, the reproducibility of the computational analyses is easier to track since it can be directly tested; however, this requires providing version-controlled code for the full pipeline and version-controlled computational environments.
Metadata standards and improved file types are important for the growth of data sharing and reuse, but the heterogeneity of microscopy data makes it challenging to keep track of all the different items. These include microscopy parameters, cell culturing conditions, assay materials, perturbation and other treatment details, amongst others. Emerging standards for these identifiers are improving computational analyses and machine readability to facilitate data reuse. Ten years ago, such standards did not exist, which made most data non-interoperable or required significant effort to convert metadata to the same language. Furthermore, the evolution of file types is ongoing, with TIFFs being the de facto file type shared in the past. While they contain images alongside important metadata, they can be slow to load and are not optimised for cloud computing. Emerging file types that fit the needs of academia and industry in HCI are actively being developed, such as OME-Zarr [152]. File types are also evolving for intermediate data types in table format, moving from CSV to database standards and more performant columnar formats from the Apache Arrow ecosystem (e.g., Parquet) that are now being tested and implemented. It is crucial to consider interoperability when designing new file types to ensure that they are widely adoptable by the scientific community.
Data sharing resources are now equipped to handle large high-content datasets, enabling researchers to share and access a vast amount of data with ease. One such resource is the Image Data Resource (IDR), a growing service that currently stores over 100 studies totalling about 400 TB of reference cell and tissue imaging data [153]. IDR can version control your data and provide a digital object identifier (DOI), allowing anyone to use and cite the data deposited. There are several other microscopy data sharing resources available, including EMPIAR [154,155], BioImage Archive [156], and Cell Image Library [157], but IDR is the primary third-party host for HCS data. More recently, the Registry of Open Data on AWS (RODA) has begun storing large HCS datasets, with the JUMP Cell Painting consortium data currently hosted there. There are also several in-house data sharing repositories, such as the Allen Institute for Cell Science Explorer, Recursion RxRx [158], Broad Bioimage Benchmark Collection [159], and various others operated by academic institutions. While there are many advantages to self-hosted data repositories, they tend to place less emphasis on metadata standards and identifiers, and enforcing compliance can be more challenging. There remains a need for further standardization and interoperability between these repositories and sharing platforms. As mentioned earlier, the heterogeneity of microscopy data makes it challenging to develop and implement standardised metadata and file types. Continued efforts to develop and improve these standards will be crucial in promoting efficient and widespread sharing of high-content microscopy data. However, with the growth of these data sharing resources, the possibility of microscopy data reuse and secondary analyses is growing, which will lead to increased use in validating experiments and testing model generalizability.
In addition to sharing raw images, it is also helpful to share other high-value intermediate data types, such as illumination corrected images, embeddings, and single-cell and bulk feature extraction methods, which may be shared in other repositories. Wilson et al. describe more considerations for sharing microscopy images [154].
The field of microscopy data sharing is still in its early stages, but there has been significant progress made over the past decade and the evolution of these repositories and standards will likely continue to be shaped by advances in technology and changes in research practices. For example, the growing use of AI/ML in image analysis may require new approaches to data sharing, data management, and a heightened emphasis on standardization and reproducibility. One important challenge is ensuring that the data used to train machine learning models is consistent and well-documented. If the data are not standardized, then the models will not be generalizable to other datasets, leading to poor performance and limited impact. Another challenge is the interpretability of machine learning models in the context of microscopy data. While these models can often achieve impressive results, it is important to understand how they are making their predictions, especially in the context of drug discovery where a false positive or negative could have significant consequences. Efforts to develop interpretable machine learning models and standards for reporting their performance and predictions will be crucial in ensuring their utility in this field. Therefore, it will be important to continually evaluate and adapt image data repositories and sharing standards to ensure they meet the needs of the scientific community. As we continue to share and analyse high-content microscopy data, we will accelerate the pace of drug discovery, ultimately leading to faster, more cost-effective cures for a wide range of diseases.

Future perspective of high content imaging hardware and software
Hardware: Following a sustained period of improvements to commercial HCI instruments over several decades, accelerated technology development over the past 5 years has delivered significant advances in capability. These advances include: improved 3D imaging, as exemplified by the Opera Phenix (PerkinElmer), ImageXpress Confocal HT.ai (Molecular Devices) and CellInsight CX7 LZR Pro (ThermoFisher) platforms; unprecedented speed, such as that demonstrated by the Endeavor GT (Araceli); and new capabilities that combine single cell imaging and sample picking for multiomics analysis, provided by the Single Cellome™ System SS2000 (Yokogawa) (Fig. 1). Other advances include development of bespoke HCI platforms specifically designed for small model organisms such as zebrafish (VAST BioImager™, Union Biometrica) [160].
Despite these significant advances, the development and implementation of HCI infrastructure has often struggled to keep pace with the increasing diversity and complexity of a new generation of more sophisticated 3D models and microfluidic devices [161,162]. Thus, a number of HCI challenges and gaps remain, including acquisition and analysis of 3D models with single cell and subcellular resolution at depth to explore the heterogeneity of complex 3D multicellular structures in both fixed samples and longitudinal monitoring in live cell assays [163]. Multiphoton microscopy is currently the most powerful technique for realising single cell fluorescence imaging and segmentation at depth; however, conventional multiphoton microscopy is too slow for high throughput screening applications across sufficient sample numbers. Research efforts are underway to increase the speed of multiphoton microscopy through parallelized signal acquisition using multiple laser beams or time-gated camera detection systems, which have demonstrated proof-of-concept including in automated multiwell plate formats [164-166]. Other advances in HCI hardware development from academia include adaptation of "light sheet" microscopic imaging for multiwell plates [167]. Light sheet fluorescence microscopy (LSFM) uses a thin sheet of light to excite only fluorophores within a single planar volume in front of the objective [168]. Light sheet microscopy therefore provides true optical sectioning capability, facilitating 3D imaging with reduced photobleaching and phototoxicity of the sample. Oblique Plane Microscopy (OPM) is a "fast light sheet" microscopy technique that uses a single high numerical aperture microscope objective to both illuminate a tilted plane within the specimen and to collect fluorescence from the tilted illuminated plane [169,170].
As OPM is compatible with a conventional microscope, it can be used to image conventionally mounted specimens on coverslips, tissue culture dishes or standard multiwell plates. The OPM has demonstrated its rapid multiwell plate imaging capability to image 3D responses of tumour spheroids to glucose over time and to map and quantify cell morphological plasticity in 3D [65,171]. Alternative platforms which are commercially available include the Lattice Light Sheet 7 from Zeiss which delivers an easy-to-use automated light sheet instrument suitable for multiple sample carriers including multiwell plates.
The development of multidisciplinary consortia encompassing expertise in photonics, automated microscopy, image analysis and biology to deliver open source hardware and software solutions that can be exploited by both commercial and academic institutions is well placed to accelerate the evolution and adoption of high content imaging. MACH3CANCER (advancing Microscopy to Accelerate understanding of Complexity and Heterogeneity of 3D Cancer) is one such consortium, funded by Cancer Research UK (https://mach3cancer.org/). The tools and resources developed by MACH3CANCER, which will be shared with the community, include new single cell-resolved open source high content analysis platforms (openHCA) specifically designed for 3D cell cultures and organoids. The open source approach will enable other laboratories to replicate these capabilities and ensure that the openHCA instrumentation can be easily upgraded to new functionality (with no barriers due to proprietary hardware or software) and be integrated with other HCA capabilities including commercial platforms. Another academic-industry consortium is the Transformative Imaging for Quantitative Biology (TIQBio) partnership, which aims to provide advances in high resolution imaging of 3D models in an unperturbed manner.
Software: As more complex multiparametric data analysis approaches have evolved for HCI, analysis pipelines may include techniques ranging from 'traditional' algorithms spanning statistics, signal processing and information theory to techniques from AI/ML including classification, regression, pooling, sampling, normalisation, batch correction, imputation, transformation, and application of generative methods (see section "Evolution of high content data analysis pipelines towards multiparametric and phenotypic profiling applications" for more details). Closed-source proprietary tools, such as BIOVIA Dassault Systèmes' Pipeline Pilot, PerkinElmer's HC profiler (powered by TIBCO Spotfire®), and HCS StratoMineR [112], exist and allow non-expert users to apply pipelines including such techniques. However, the significant contributions of academia to the field highlight the cutting- and often bleeding-edge nature of research being carried out within these institutions, requiring unhindered access to source code, intermediate data and the ability to migrate analysis to a variety of compute platforms unrestricted by licensing and access controls. Critically, integration of new techniques from the literature is rapidly enabled by the drive towards open access publishing and adherence to FAIR principles [151], resulting in source code for new techniques often being readily available in public repositories. Rapid iteration and integration of techniques is massively helped in an open-source software ecosystem, where contributions from many experts in their fields may flow into one well structured and well managed software package for unrestricted use, evaluation, and improvement by the community. Continued evolution of high content analysis includes the need to integrate HCI with multiple data types (see section "The role of data integration and multiomics") [130].
Data integration workflows for multiomics data take many forms across academia and industry and efforts with limited resources can easily fall short of data integration best practices, with additional data and processing requirements dramatically increasing pipeline complexity upon combination and processing of high content imaging, proteomics, metabolomics and other omics data. Open source initiatives such as Phenonaut [111] aim to standardise multiomics and single omics workflows operating on feature data, but will only be successful in establishing standards and driving the field forward in an open, collaborative, and FAIR way with community involvement addressing new methods, benchmarking, and best practices.

Current gaps and recommendations
While significant advances in HCI analysis have evolved to extract multiple features and classify a broad variety of cell phenotypes in 2D cultures at single cell level, the majority of phenotypic profiling studies are performed at the aggregated whole-well or cell-population level. This is particularly true for 3D model systems where high content phenotypic profiling at the single cell level at sufficient imaging depths remains to be fully realised. Improvements in imaging hardware for 3D resolution and software solutions for handling single cell level data are required to study the heterogeneity of cellular response and distinct cell types in more complex 2D co-culture and 3D cell models. Such developments will support the next evolution of high content phenotypic profiling applications using more complex and physiologically relevant model systems. The evolution towards open source HCI hardware could deliver new HCI instruments and upgrades at reduced costs, supporting expansion of the technology beyond core screening facilities and increased adoption of robust quantitative cell biology across industry and academic research groups. The development and evolution of HCI hardware may also support a paradigm-shift in the development of screening applications beyond standard multiwell plate formats toward more bespoke and physiologically relevant 3D models and microfluidic devices.
The field of single cell technology is advancing rapidly. The evolution of HCI in parallel with other single cell technologies, including the integration of image-based single cell phenotypic classification with cell picking and collection to feed into single cell transcriptomics and proteomics profiling, is being realised with new platforms such as the Yokogawa SS2000, Beacon® optofluidic system, and Sartorius CellCelector platforms. However, this is an area that requires additional investment to deploy across research programs and user groups, to maximise its potential to embrace the heterogeneity in cell phenotypes within biological samples, and to more comprehensively explore cell state transitions at an integrated phenotypic, transcriptomic and post-translational pathway level.
One of the benefits of HCI and image-based phenotypic profiling is its low cost when applied at scale, relative to genomic, transcriptomic, and proteomic profiling technologies. This cost disparity, however, limits multiomics integration, resulting in gaps in methods for data integration and analysis. Further investment in the generation of orthogonal high throughput transcriptomic and proteomic profiling technologies and datasets, which can be paired with appropriately matched HCI data, would support benchmarking of different data integration approaches. Investment in the provision of publicly available multiomic datasets, across platforms and at different scales, for integration with HCI datasets is required to evolve the field of quantitative biology and HCI applications.
A concerted move away from the development of bespoke image analysis pipelines that are kept separate from the data, and towards the integration of language-agnostic data types associated with the raw image data, will maximise cross-comparison and quality assessment of analysis approaches, promote adoption across multiple research programs and institutions, and thus increase scalability.

Conclusion
Academic and industry contributions to the field of HCI have been complementary and synergistic. Consortia activities and collaborations that provide precompetitive tools and datasets have been crucial to the continued evolution of the field, and include new applications and broadly adopted strategies for precise classification of cell phenotypes and prediction of biological mechanism-of-action and in vivo toxicology. These developments have contributed to significant evolution of HCI technology and applications to answer fundamental research questions, perform discovery science, and improve decision making across discrete stages of the drug discovery process.
Further interdisciplinary collaboration between data science, biological assay development and HCI hardware development will continue to evolve the field towards delivering higher quality and more quantitative solutions, which also support more disease relevant and mechanistically informative drug discovery applications. The three decades since the inception of HCI have witnessed substantial improvements in technology and applications, which have stimulated and continue to stimulate increased adoption of HCI across research institutions in the academic and industry sectors. The future of HCI looks bright: the replacement of manual microscopic imaging and analysis with automated solutions undoubtedly provides a step change in the robustness of biomolecular imaging and functional biology studies, which become less prone to the bias and artefacts that arise from low sample throughput. This step change, together with the availability of a greater variety of HCI platforms and open source software and data analysis solutions, will ensure that the trajectory of increased HCI adoption continues. Rapidly developing technologies from the fields of 3D biology, CRISPR gene-editing, human iPSC, multiomics, single cell technologies and AI/ML are converging with HCI, contributing to a new era of cell biology and drug discovery (Fig. 2). Together these developments promise to contribute significant benefits across multiple research areas including data science, cell biology, chemical biology and healthcare, which in turn will support continued investment in HCI and further enhance the significant impact that HCI has on biomolecular and biomedical research.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.