Posters

Posters Abstracts

A general framework for evaluating cross-platform concordance in genomic studies

Timothy J. Peters1, Terence P. Speed2 and Susan J. Clark3

1Immunogenomics Laboratory, Garvan Institute of Medical Research, Darlinghurst, NSW 2010, Australia
2Bioinformatics Division, Walter and Eliza Hall Institute, Parkville, VIC 3052, Australia
3Epigenetics Laboratory, Garvan Institute of Medical Research, Darlinghurst, NSW 2010, Australia

The reproducibility of scientific results from multiple sources is critical to the establishment of scientific doctrine. However, when characterising various genomic features (transcript/gene abundances, methylation levels, allele frequencies and the like), all measurements from any given technology are estimates and thus will retain some degree of error. Hence defining a “gold standard” process is dangerous, since all subsequent measurement comparisons will be biased towards that standard.

In the absence of a “gold standard” we instead empirically assess the precision and sensitivity of a large suite of genomic technologies via a consensus modelling method called the row-linear model. This method is an application of the American Society for Testing and Materials Standard E691 for assessing interlaboratory precision and sources of variability across multiple testing sites. We analyse a series of transcriptomic and DNA methylation datasets spanning both sequencing and array technologies, allowing a direct per-technology, per-locus comparison of sensitivity and precision across all common loci.

We implement and showcase a number of applications of the row-linear model, including direct comparisons of the sensitivity and precision of these platforms. Our findings demonstrate the utility of the row-linear model in evincing varying levels of concordance between measurements on these platforms, serving as a process for identifying reproducibility caveats in studies where cross-platform validation is performed.
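A minimal sketch of the core idea in Python (toy data; a simplified reading of the row-linear fit, not the published implementation): each platform's measurements are regressed against the across-platform consensus per locus, so the fitted slope reflects sensitivity and the residual spread reflects precision.

import numpy as np

# Toy example: 3 platforms measuring the same 200 loci on a log scale (hypothetical data).
rng = np.random.default_rng(0)
truth = rng.normal(8, 2, size=200)
platforms = np.vstack([1.0 * truth, 0.8 * truth + 1.0, 1.2 * truth - 1.5])
platforms += rng.normal(0.0, [[0.2], [0.5], [0.3]], size=platforms.shape)

consensus = platforms.mean(axis=0)  # per-locus consensus across platforms
for i, y in enumerate(platforms):
    slope, intercept = np.polyfit(consensus, y, deg=1)
    resid_sd = np.std(y - (slope * consensus + intercept))
    print(f"platform {i}: sensitivity (slope) = {slope:.2f}, precision (residual SD) = {resid_sd:.2f}")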

Further reading:
Peters, T. J., French, H. J., Bradford, S. T., Pidsley, R., Stirzaker, C., Varinli, H., Nair, S., Qu, W., Song, J., Giles, K.A., Statham, A.L., Speirs, H., Speed, T.P. and Clark, S.J. In press. Evaluation of cross-platform and interlaboratory concordance via consensus modelling of genomic measurements. Bioinformatics. https://doi.org/10.1093/bioinformatics/bty675

Preserving value in large scale data analytics through selective re-computation: the ReComp framework in the P4NU Turing Institute initiative for preventive healthcare


Paolo Missier, Jacek Cała
Newcastle University School of Computing
Newcastle upon Tyne, United Kingdom

Many e-science tools and infrastructures exist that address the problem of how to make experimental science reproducible [1][2][3][4]. In contrast, little thought has gone into why and when an experimental process should be reproduced. Such questions are, however, dominant in Data Science, including AI, where computations are resource-intensive and expensive while the outcomes have a relatively short life span. For instance, AI models require periodic re-training (which in some cases can be incremental) when substantial changes occur in the profile of the training set that was used for learning the model. More generally, we have observed [5] that the value of knowledge assets generated by analytics processes decays over time, as a consequence of changes in input datasets, external data sources, libraries, and system dependencies. For Big Data problems, deciding when refreshing past outcomes becomes beneficial requires non-trivial strategies, as complete re-computation in reaction to every change is expensive and inefficient, and many changes have little impact on current results. Assuming that the process is reproducible in the first place, it is therefore natural to ask when, and to what extent, re-computation should take place. To address these questions, we have developed ReComp1, a generic meta-process that takes as input a process P, the history of P’s executions including outcomes O, and changes C. ReComp tries to determine (i) the extent of the impact of C on current outcomes, and (ii) the fragments of P that need to be re-executed in order to refresh O. Although ReComp is generic, the impact estimation functions are necessarily data- and process-specific. In this talk we present prior results for controlling reproducible pipelines for genomics (variant calling and interpretation) [6], and report on work in progress on ReComp impact functions for specific machine learning models.

1Recomp.org.uk
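As an informal illustration only (hypothetical names; not the ReComp implementation), the selective re-computation decision can be sketched as a loop that scores the impact of the changes C on each past outcome and re-executes only the affected fragments of P when the estimated impact exceeds a threshold.

# Hypothetical sketch of a selective re-computation loop; impact_fn and reexecute_fn are
# placeholders for the data- and process-specific functions discussed in the abstract.
def recomp(history, changes, impact_fn, reexecute_fn, threshold=0.1):
    """history: iterable of (outcome_id, outcome, provenance); changes: list of change records."""
    refreshed = {}
    for outcome_id, outcome, provenance in history:
        # (i) estimate the extent of the impact of the changes on this outcome
        impact = max(impact_fn(change, outcome, provenance) for change in changes)
        if impact >= threshold:
            # (ii) re-execute only the process fragments whose provenance depends on the changes
            fragments = [step for step in provenance if step.depends_on_any(changes)]
            refreshed[outcome_id] = reexecute_fn(fragments)
    return refreshed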



References
[1] V. Stodden, F. Leisch, and R. D. Peng, Implementing reproducible research. CRC Press, 2014.
[2] L. C. Burgess et al., “Alan Turing Institute Symposium on Reproducibility for Data-Intensive Research – Final Report,” 2016.
[3] G. K. Sandve, A. Nekrutenko, J. Taylor, and E. Hovig, “Ten Simple Rules for Reproducible Computational Research,” PLoS Comput. Biol., vol. 9, no. 10, p. e1003285, 2013.
[4] F. S. Chirigati, M. Troyer, D. Shasha, and J. Freire, “A Computational Reproducibility Benchmark,” IEEE Data Eng. Bull., vol. 36, no. 4, pp. 54–59, 2013.
[5] P. Missier, J. Cala, and M. Rathi, “Preserving the value of large scale data analytics over time through selective re-computation,” in Procs. 31st British International Conference on Databases - BICOD, 2017.
[6] J. Cała and P. Missier, “Selective and Recurring Re-computation of Big Data Analytics Tasks: Insights from a Genomics Case Study,” Big Data Res., vol. 13, pp. 76–94, Sep. 2018.

pISA-tree - being FAIR on the institutional level

Andrej Blejec1, Špela Baebler1, Anna Coll Rius1, Marko Petek1, Živa Ramšak1, Maja Zagorščak1 and Kristina Gruden1

1National Institute of Biology, Ljubljana, Slovenia

In recent years we have experienced a paradigm shift towards open access to scientific data. The FAIR principles (Wilkinson MD et al., 2016) are promoted as the basis of inter-institutional collaboration. However, before being FAIR on the higher level, one also needs to be FAIR on the lower, intra-institutional and research-group, levels. Too often, project information resources are scattered across diverse data storage infrastructure in a more or less chaotic way that may be practical in the short term but imposes tremendous difficulties in finding and reusing the information at any later stage. In addition, uploading the information onto data-sharing FAIR platforms can be too demanding and often discourages researchers from sharing their data. Here we present pISA-tree, a data management solution that can easily support FAIR data management and contribute to the reproducibility of research and analyses. Our aim was to provide a straightforward system that helps and motivates researchers to better organize their data. pISA-tree is a set of batch files without any dependencies, based on mapping the research information layer onto a standardized directory tree organization; it is in compliance with the ISA framework (isacommons, 2018) and encourages users to record metadata at an early stage of research preparation. This brings benefits in improving the experimental design and in finding the necessary information at any stage of the actual wet-lab experiment. Moreover, multilevel statistical and computational analyses can be prepared in a fully reproducible way. Since information is structured in a standardized way and augmented by metadata, connections to FAIR-based data management and cloud platforms can be established; this is especially true for platforms based on the ISA framework. pISA-tree is freely available on GitHub (https://github.com/NIB-SI/pISA).
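A minimal sketch of the underlying idea (hypothetical layer names and metadata file names; pISA-tree itself is implemented as dependency-free batch files): each project/Investigation/Study/Assay layer becomes a directory carrying its own metadata file.

from pathlib import Path

# Hypothetical sketch: build an ISA-style directory tree with a metadata file per layer.
def create_layer(parent, name, metadata):
    layer = Path(parent) / name
    layer.mkdir(parents=True, exist_ok=True)
    with open(layer / "_metadata.TXT", "w") as fh:
        for key, value in metadata.items():
            fh.write(f"{key}:\t{value}\n")
    return layer

project = create_layer(".", "_p_Demo", {"Description": "demo project"})
investigation = create_layer(project, "_I_StressResponse", {"Title": "plant stress response"})
study = create_layer(investigation, "_S_Infection", {"Start date": "2019-01-10"})
create_layer(study, "_A_RNA-seq", {"Assay type": "transcriptomics"})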


References
1. Wilkinson, M.D. et al. (2016) The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3, 160018. http://dx.doi.org/10.1038/sdata.2016.18
2. The ISA commons. https://www.isacommons.org/

Systematic assessment of genomic biomarkers for anticancer drug sensitivity prediction in patient derived xenografts
Arvind Singh Mer1,2, Wail Ba-alawi1,2, Petr Smirnov1,2, Ming-Sound Tsao1,2, David Cescon1, Anna Goldenberg3,5,6, Benjamin Haibe-Kains1,2,3,5,7 *


1 Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
2 Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada
3 Hospital for Sick Children, Toronto, Ontario, Canada
4 Département de management et technologie, Université du Québec à Montréal, Canada
5 Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
6 Vector Institute, Toronto, Ontario, Canada
7 Ontario Institute for Cancer Research, Toronto, Ontario, Canada

One of the key challenges in cancer precision medicine is finding robust biomarkers of drug response. Patient-derived xenografts (PDXs) have emerged as reliable preclinical models since they better recapitulate tumor response to chemo- and targeted therapies. However, the lack of standard tools poses a challenge for the analysis of PDXs with molecular and pharmacological profiles. Efficient storage, access and analysis are key to realizing the full potential of PDX pharmacogenomic data. We have developed Xeva (XEnograft Visualization & Analysis), an open-source software package for processing, visualization and integrative analysis of a compendium of in vivo pharmacogenomic datasets. The Xeva package follows the PDX minimum information (PDX-MI) standards and can handle both replicate-based and 1x1x1 experimental designs. We used Xeva to characterize the variability of gene expression and pathway activity across passages. We found that only a few genes and pathways have passage-specific alterations (median intraclass correlation of 0.53 for genes and positive enrichment scores for 92.5% of pathways). Activity of the mRNA 3'-end processing and of the elongation arrest and recovery pathways was strongly affected by model passaging. We leveraged our platform to link drug response to the pathways whose activity is consistent across passages by mining the Novartis PDX Encyclopedia (PDXE) data containing 1,075 PDXs. We identified 87 pathways significantly associated with response to 51 drugs (FDR < 5%), including associations between erlotinib response and signaling-by-EGFR-in-cancer pathways, and between MAP kinase activation and binimetinib response. We also found novel biomarkers based on gene expression, copy number aberrations (CNAs) and mutations predictive of drug response (concordance index > 0.60; FDR < 0.05). Xeva provides a flexible platform for integrative analysis of preclinical in vivo pharmacogenomics data to identify biomarkers predictive of drug response, a major step toward precision oncology.
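Xeva itself is distributed as an R/Bioconductor package; as a language-agnostic illustration of the passage-stability analysis, a one-way intraclass correlation for a single gene measured across passages can be sketched as follows (toy data; the exact estimator used in the study may differ).

import numpy as np

def icc_oneway(x):
    """One-way random-effects ICC(1,1) for a (PDX models x passages) matrix of one gene."""
    n, k = x.shape
    grand = x.mean()
    ms_between = k * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    ms_within = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Toy expression values: 20 PDX models measured at 3 passages for one gene.
rng = np.random.default_rng(1)
model_effect = rng.normal(0, 1, size=(20, 1))
expr = 7 + model_effect + rng.normal(0, 0.8, size=(20, 3))
print(f"ICC across passages: {icc_oneway(expr):.2f}")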

A manufacturing method for reference materials for NGS proficiency testing of somatic variants
Jongsuk Chung1,3, Yeon Jeong Kim1, Eunjeong Cho1, Ki-Wook Lee1,2, Taeseob Lee1,2, DongHyeong Sung2, Donghyun Park1, Dae-Soon Son1, Woongyang Park1,2,3, *


1Samsung Genome Institute, Samsung Medical Center, Seoul, Republic of Korea, 06351
2Department of Health Sciences and Technology, Samsung Advanced Institute for Health Sciences & Technology, Sungkyunkwan University, Seoul, Republic of Korea, 06351
3Department of Molecular Cell Biology, School of Medicine, Sungkyunkwan University, Suwon, Republic of Korea, 16419


NGS is used not only for genome investigation but also for the clinical diagnosis of disease-associated genetic alterations. Many laboratories are trying to develop their own NGS diagnostic methods, with various processes, for cancer diagnosis. It is therefore important to measure the performance of variant detection during development and maintenance, or for monitoring by regulatory authorities. A few reference materials exist, such as NA12878 from NIST and other commercial products, but they have limitations. NA12878 only provides information about germline variants at approximately 50% variant allele frequency (VAF). In addition, in commercial products the types and positions of the variants are not controllable. In some cases cell-line mixtures are used; however, it is hard to include variants of interest at the desired VAF levels. We therefore developed a manufacturing method for reference materials that contain variants of interest at the desired VAF levels. The basic principle is to mix 1 kb fragments, each with a variant of interest (68 targets from FDA-approved companion diagnostics) located in the middle, with DNA extracted from a breast cancer cell line (SKBR3), at a mixture ratio calculated for each variant position. Fragments with variants were made by ultra oligomer synthesis and PCR mutagenesis, and validated with Sanger sequencing and ddPCR. With this method we made three reference materials, each carrying 20 variants at 5–50% VAF. In addition, we prepared and tested the proficiency test process with three NGS laboratories in Korea. All laboratories detected the 20 variants in reference sample #1, and two of the three laboratories each missed one of the 20 variants in reference samples #2 and #3. The correlation between the detected VAFs and the average VAFs was over 0.98. Our method is expected to help regulatory authorities make their own reference materials and developers measure the performance of their processes.
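A small worked example of the mixture arithmetic (a sketch under simplifying assumptions: diploid background, no local copy-number correction for SKBR3): the spike-in mass of a 1 kb variant fragment is chosen so that its molar contribution at the locus yields the target VAF.

AVOGADRO = 6.022e23
BP_MASS_G = 650 / AVOGADRO  # average mass of one base pair in grams

def spikein_mass_ng(target_vaf, background_ng, fragment_bp=1000, genome_bp=3.2e9, locus_copies=2):
    genome_mass_g = genome_bp * BP_MASS_G
    background_locus_copies = (background_ng * 1e-9 / genome_mass_g) * locus_copies
    # VAF = variant copies / (variant copies + background copies) -> solve for variant copies
    variant_copies = target_vaf * background_locus_copies / (1 - target_vaf)
    return variant_copies * fragment_bp * BP_MASS_G * 1e9

for vaf in (0.05, 0.20, 0.50):
    print(f"target VAF {vaf:.0%}: spike {spikein_mass_ng(vaf, background_ng=100):.2e} ng of fragment")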

With CHARME towards Standardization - Making the Case to train Communication and Sharing of Data amongst Researchers in Life Science Research
Susanne Hollmann*1,2, Domenica D’Elia3, Kristina Gruden4, Marcus Frohme5

1SB Science Management UG (haftungsbeschränkt), 12163 Berlin, Germany
2Research Centre for Plant Genomics and Systems Biology, Potsdam University, Potsdam 14476, Germany
3CNR - Institute for Biomedical Technologies, Bari, Italy
4Department of Biotechnology and Systems Biology, National Institute of Biology, Ljubljana, Slovenia
5Technical University of Applied Sciences, 15745 Wildau, Germany


Modern high-throughput methods for the analysis of genes and metabolic products and their interactions offer new opportunities to gain comprehensive information on life processes. The data and knowledge generated open diverse application possibilities with enormous innovation potential. Open Science (OS) describes the ongoing transitions in the way research is performed, i.e. how researchers collaborate, knowledge is shared, and science is organised. It is driven by digital technologies, globalisation, the enlargement of the scientific community and the need to address societal challenges. The aim of OS is to make research results more accessible to all societal actors, which contributes to better and more efficient science, as well as to innovation in the public and private sectors. Skills in generating, but also in properly annotating, data for further integration and analysis are needed: data need to be made computer-readable and interoperable to allow integration with existing knowledge. To achieve this, we need common standards and standard operating procedures as well as workflows that enable the combination of data across standards. Currently, there is a lack of experts who understand these principles and possess knowledge of the relevant tools. This is a major barrier hindering the implementation of the FAIR principles and the reusability of data, and it is mainly due to insufficient and unequal education of the scientists producing life science data, which is inherently varied, complex in nature, and large in volume. Due to the interdisciplinary nature of life science research, education within this field faces numerous hurdles, including institutional barriers, lack of local availability of required expertise, and lack of appropriate teaching material and proper adaptation of curricula. Within the COST Action CHARME we work together to harmonise existing formats and develop new solutions, with a focus on the development of uniform cross-border training.

Same data - different results. Performance of protein sequence comparison tools in case of low complexity regions
Jarnot Patryk1, Grynberg Marcin2, Gruca Aleksandra1

1Institute of Informatics, Silesian University of Technology, Poland
2Institute of Biochemistry and Biophysics, Polish Academy of Sciences, Poland

Low complexity regions (LCRs) are characterised by a low diversity of amino acids. These regions are abundant in proteins and occur more often than expected by chance. LCRs are poorly investigated, although some research projects show that they play an important role in protein functions and structures. There are plenty of methods for searching for similar protein sequences. However, these methods are designed for homology searches and use only high complexity regions (HCRs), while masking the low complexity parts of protein sequences.

In our research we compared results from the following methods: BLAST, HHblits and CD-HIT. These methods are popular and widely used by scientists. We analysed whether the selected tools are as good at searching for similar LCRs as for HCRs. To generate results we retrieved both LCRs and HCRs from the UniProt/Swiss-Prot database. The LCR outcome was used to investigate whether there is overlap among the methods, whereas the HCRs were used as a control set to check that our workflow is correct and the results are consistent with theory. To calculate the results we generated custom databases of both LCRs and HCRs. In the case of HHblits we generated Hidden Markov Model profiles based on BLAST results.

Our results are quantitative and qualitative. To get quantitative results we created pairs of similar sequences and presented them as the overlap between each method, investigating only similar pairs derived from different families. Additionally, we analysed randomly selected similar pairs from each method to better understand their advantages and disadvantages. The HCR results from the control set show that the workflow is correct and consistent with theory. However, there is almost no overlap between the results of the canonical tools applied to the LCR dataset; we therefore conclude that current methods are not effective enough for searching for similar low complexity regions.
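A minimal sketch of the overlap computation (hypothetical identifiers): the similar pairs reported by each tool are treated as sets of unordered region pairs and compared by intersection and Jaccard index.

# Hypothetical sketch: overlap of similar-pair sets reported by different tools.
def as_pairs(hits):
    """hits: iterable of (query_region, subject_region) -> set of unordered pairs."""
    return {frozenset(p) for p in hits if p[0] != p[1]}

blast = as_pairs([("P1_lcr3", "P7_lcr1"), ("P2_lcr1", "P9_lcr2")])
hhblits = as_pairs([("P1_lcr3", "P7_lcr1"), ("P4_lcr2", "P5_lcr1")])
cdhit = as_pairs([("P2_lcr1", "P9_lcr2")])

methods = {"BLAST": blast, "HHblits": hhblits, "CD-HIT": cdhit}
names = list(methods)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        inter, union = methods[a] & methods[b], methods[a] | methods[b]
        print(f"{a} vs {b}: {len(inter)} shared pairs, Jaccard = {len(inter) / len(union):.2f}")
print("pairs shared by all methods:", set.intersection(*methods.values()))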

Quality control reference database for NGS panel sequencing
Donghyeong Seonga, Jongsuk Chungb, Ki-Wook Leeb, Taeseob Leeb, Byung-Suk Kimb, Byoung-Kee Yicd, Woongyang Parkb, Dae-Soon Sonb

aDepartment of Health Sciences and Technology, SAIHST, Sungkyunkwan University, Seoul 06351, Korea
bSamsung Genome Institute, Samsung Medical Center, Seoul 06351, Korea
cSmart Healthcare & Device Research Center, Samsung Medical Center, Seoul 06351, Korea
dDepartment of Digital Health, SAIHST, Sungkyunkwan University, Seoul 06351, Korea

Next-generation sequencing (NGS) technology has been rapidly adopted in clinical practice, and governments have established regulatory frameworks to ensure the accuracy and reliability of NGS-based testing. In Korea, the National Health Insurance Service (NHIS) includes NGS panel testing in its insurance coverage and the Ministry of Food and Drug Safety (MFDS) has provided a clinical laboratory accreditation program since 2017. Despite the importance and necessity of clinical NGS testing, there are many barriers in industry, such as the performance evaluation of NGS panel sequencing. Due to the technical characteristics of NGS, the definitions of performance and its evaluation methods are diverse and cannot be applied uniformly. For a reliable performance evaluation, not only the method itself but also the cost of the dozens to hundreds of experiments required is a great burden on a clinical laboratory. In this study, we developed a database that integrates quality-related data generated from CancerSCAN of the Samsung Genome Institute. The database contains various performance evaluation data and comparative analysis data, covering DNA extraction kits, library manufacturing conditions, etc. For a comprehensive view of quality parameters, we collected and selected about fifty quality-related items covering the whole data production process, for example specimen type, DNA quality and quantity, sequencing instrument, total reads, mean coverage, and uniformity. In addition, we mapped some of them to standard terminologies, such as HGNC (HUGO Gene Nomenclature Committee) for genes and Cellosaurus for cell lines. The database could support the establishment of clinical laboratories by releasing raw sequencing data, and could be used by pipeline developers to implement their ideas without generating expensive data. It could also provide reference data for performance comparisons to reagent vendors.

P4@NU: Exploring the role of digital and genetic biomarkers to learn personalized predictive models of metabolic diseases
Benjamin Lam1, Paolo Missier1, Michael Catt2

1Newcastle University - School of Computing
2Newcastle University - National Innovation Centre for Ageing Newcastle upon Tyne, United Kingdom

Medical practice is evolving, away from the inefficient detect-and-cure approach and towards a Preventive, Predictive, Personalised and Participative (P4) vision that focuses on extending the individual wellness state, particularly that of ageing individuals [1], and is underpinned by “Big Health Data” such as genomics and personal wearables [2]. P4 medicine has the potential not only to vastly improve people’s quality of life, but also to significantly reduce healthcare costs and improve its efficiency.

The P4@NU project (P4 at Newcastle University) aims to develop predictive models of transitions from wellness to disease states using digital and genetic biomarkers. The main challenge is to extract viable biomarkers from raw data, including activity traces from self-monitoring wearables, and use them to develop models that are able to anticipate disease such that preventive interventions can still be effective. Our hypothesis is that signals for detecting the early onset of chronic diseases can be found by appropriately combining multiple kinds of biomarkers.

UK Biobank [3] is our initial and key data resource, consisting of a cohort of 500,000 individuals characterised by genotype, phenotype and, for 100,000 individuals, also free-living accelerometer activity data. Our initial focus is on using UK Biobank data to learn models to predict two cardiometabolic diseases: Type-2 Diabetes (T2D) and Cardiovascular Disease (CVD). These diseases are typically associated with insufficient activity and sleep and are preventable given lifestyle changes [4].

Our initial investigation has focused on extracting high-level features from activity traces and learning traditional models (random forests and SVM) to predict a simple disease state. This has given us a baseline for more complex studies that will include genotyping data (selected SNPs) and will be based on deep learning techniques.
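A minimal sketch of such a baseline under stated assumptions (synthetic features standing in for UK Biobank-derived activity summaries; not the project's actual pipeline):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for per-participant activity features and a binary disease label (e.g. T2D).
rng = np.random.default_rng(42)
n = 1000
X = np.column_stack([
    rng.normal(28, 8, n),  # mean acceleration (milli-g)
    rng.normal(9, 2, n),   # sedentary hours per day
    rng.normal(7, 1, n),   # sleep hours per day
])
risk = np.clip(0.05 - 0.002 * (X[:, 0] - 28) + 0.01 * (X[:, 1] - 9), 0.01, 0.9)
y = (rng.random(n) < risk).astype(int)

# Random forest baseline; an SVM baseline can be cross-validated in exactly the same way.
clf = RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=0)
auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(f"baseline 5-fold AUC: {auc.mean():.2f} +/- {auc.std():.2f}")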


References
[1] L. Hood and S. H. Friend, “Predictive, personalized, preventive, participatory (P4) cancer medicine,” Nat. Rev. Clin. Oncol., vol. 8, no. 3, p. 184, 2011.
[2] N. D. Price et al., “A wellness study of 108 individuals using personal, dense, dynamic data clouds,” Nat. Biotechnol., vol. 35, p. 747, Jul. 2017.
[3] C. Bycroft et al., “The UK Biobank resource with deep phenotyping and genomic data,” Nature, vol. 562, no. 7726, pp. 203–209, 2018.
[4] S. Cassidy, J. Y. Chau, M. Catt, et al., “Cross-sectional study of diet, physical activity, television viewing and sleep duration in 233 110 adults from the UK Biobank; the behavioural phenotype of cardiovascular disease and type 2 diabetes,” BMJ Open, 2016.

PRI – Pattern Recognition of Immune cells
Yen Hoang1, Stefanie Gryzik1, Detlef Groth2, Ria Baumgrass1

1German Rheumatism Research Center (DRFZ), Berlin, Germany
2University of Potsdam, Germany

There is a pressing need to cope with multi-parametric flow cytometry data, since modern flow cytometers can detect up to 30 markers simultaneously. With each additional marker the number of potential cell subpopulations increases exponentially. Therefore, it is impossible to distinguish all correlations in high-dimensional data solely from a series of conventional approaches such as biaxial dot plots and Boolean gates. Although several advanced visualization techniques have been developed, they are limited either in single-cell resolution, in the detection of rare cell subpopulations, or in the comparability of different groups of samples.

With our approach, the biaxial dot plot is partitioned into quadrant bins, resulting in a grid of bins. This plot can then be extended with statistical information about a third marker by calculating, e.g., the mean fluorescence intensity or the frequency of marker 3 producers for each bin and displaying the bin in a colour-coded manner. Many other statistics can be applied within our visualization.
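As an illustration of the binning step (a sketch with simulated events, using SciPy's generic 2-D binned statistics rather than the PRI implementation):

import numpy as np
from scipy.stats import binned_statistic_2d

# Simulated fluorescence intensities of three markers for 50,000 cells.
rng = np.random.default_rng(7)
m1, m2 = rng.lognormal(2, 0.6, 50_000), rng.lognormal(2, 0.7, 50_000)
m3 = 0.3 * m1 + rng.lognormal(1, 0.5, 50_000)

# Bin the biaxial marker1 x marker2 plot and summarize marker 3 per bin: mean intensity ...
mean_m3, xedges, yedges, _ = binned_statistic_2d(m1, m2, m3, statistic="mean", bins=50)
# ... and frequency of "marker 3 producers" (events above an arbitrary threshold) per bin.
producers = (m3 > 10).astype(float)
freq_m3, _, _, _ = binned_statistic_2d(m1, m2, producers, statistic="mean", bins=[xedges, yedges])
print(mean_m3.shape, float(np.nanmax(freq_m3)))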

This method can be used to create an overview of n*(n-1)*(n-2) different “triploT” combinations, in which multiparametric correlations can be visualized without losing single-cell resolution. Notably, our approach is automatable and completely reproducible, and it helps us to discover meaningful cell subpopulations. PRI is a simple but effective method which opens up whole new possibilities for analyzing and visualizing flow cytometry data. It will soon be accessible to the academic community, as we are currently working on a web-based interface.

Morphological ICA: Exploiting across subjects covariance and independent component analysis to define volume based networks
Riya Paul, Michael Czisch, Philipp G. Sämann, Bertram Müller-Myhsok

Max Planck Institute of Psychiatry, Munich, Germany

Introduction: Structural connectivity is usually investigated using fiber tracking techniques, yet there also exists a highly systematic covariance structure of voxel-wise volumetric information. Earlier reports have highlighted parallels between such macrostructural connectivity and resting-state fMRI networks, pointing out that neurodegenerative diseases target such networks. This leads to the idea of constructing a structural MRI (sMRI), volume-based atlas by independent component analysis (ICA), potentially useful for dimensionality reduction and fingerprinting of volumetric deficits.

Methods: We used an open-access sMRI dataset (‘IXI’, N=563, 20-86 years, 55.6% female). Unified segmentation and the iterative DARTEL technique were used for optimal inter-subject alignment; the resulting modulated grey matter (GM) maps (isometric 1.5 mm) were smoothed (FWHM 6x6x6 mm3) and residualized against age, age², sex and intracranial volume. A good decomposition should allow a good reconstruction of the data while being spatially localized, i.e. sparse in voxel space. Such a setting can be formalized in a dictionary learning (DL) optimization framework that combines a sparsity-inducing penalty with a reconstruction loss. One advantage is model-order determination by eliminating components that could be generated by Gaussian noise. DL was compared with MELODIC as a standard ICA tool.
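A compact sketch of the DL decomposition under stated assumptions (a small random matrix in place of the voxel-by-subject GM data; scikit-learn's mini-batch dictionary learning rather than the exact solver used here): arranging the data as voxels x subjects makes the sparse codes voxel-wise component maps.

import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

# Toy stand-in for residualized grey-matter data: 5,000 "voxels" x 60 "subjects".
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 60))

# alpha is the sparsity-inducing penalty traded off against the reconstruction loss.
dl = MiniBatchDictionaryLearning(n_components=20, alpha=1.0, batch_size=256, random_state=0)
codes = dl.fit_transform(X)   # (n_voxels, n_components): sparse, spatially compact maps
atoms = dl.components_        # (n_components, n_subjects): subject loadings
print(f"voxel-space sparsity of the component maps: {(codes == 0).mean():.1%}")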

Results: MELODIC returned 84 components (54% of total variance) (Figure 1a), groupable into cerebellar, visual, default mode like, medial-prefrontally centered, (pre-)motor and cingulate/insula-dominated components. DL delivered 80 components that appeared qualitatively more spatially compact (Figure 1b). Dictionary learning with very low sparsity also delivered 80 components, explaining about 52% of the total variance (high sparsity: 79 components, 38% explained variance) (Figure 1c).

Conclusions: We replicate the finding that voxel-wise GM maps hold structured covariance information, with networks similar to co-activation or resting-state networks. DL seems to provide promising, sparse and robust component solutions. We will use jackknife analyses to quantify component stability, and ICASSO to pre-estimate the dimensionality of the ICA.

References
1. Seeley, W. W. (2009), ‘Neurodegenerative Diseases Target Large-Scale Human Brain Networks’, Neuron, vol. 62, pp. 42–52.
2. Mensch, A. (2016), ‘Compressed online dictionary learning for fast resting-state fMRI decomposition’, IEEE 13th International Symposium on Biomedical Imaging (ISBI), pp. 1282–1285, doi:10.1109/ISBI.2016.7493501.
3. Beckmann, C. F. (2004), ‘Probabilistic Independent Component Analysis for Functional Magnetic Resonance Imaging’, IEEE Transactions on Medical Imaging, vol. 23, pp. 137–152.



Figure 1: a) MELODIC based structural covariance components emerging from N=563 healthy subjects: (i) cerebellar, (ii) default-mode-like, (iii) visual and (iv) executive/ACC/DLPFC/insula components, b) DL approach with very low (3) and low (8) sparsity control parameter, the latter delivering less fragmented components.

Processing pipeline calibration and confounder removal for accurate and reproducible patient profiling
D P Kreil1 and P P Łabaj1-3

1Chair of Bioinformatics, Boku University Vienna
2Małopolska Centre of Biotechnology UJ, Kraków, Poland
3Austrian Academy of Sciences, Vienna


Recent studies show that >70% of researchers have failed to reproduce the results of other scientists’ analyses, and >50% have failed to reproduce their own results. Every data modality that we can measure adds novel sources of variation and artefacts that need to be characterized and removed. We show that it is necessary, and possible, to remove hidden confounding factors in modern assay data that are not yet captured by more traditional analysis approaches.

As a prerequisite for reproducible analysis results, systematic models of data from novel platforms allow the correct identification of new types of confounders despite the rapid technological advances in the field. We consider the systematic measurement of gene activity by expression profiling, where the routine collection of genome-scale data has recently become possible also in the clinic. The latest developments now allow higher-resolution profiling, for instance the discrimination of alternative gene transcripts. These, not genes, are in fact relevant for functional analysis; profiling gene activity at the level of alternative transcripts is thus a cornerstone of functionally relevant analyses.

In general, public repositories already hold raw data allowing transcript-level analyses, both from next-generation sequencing and from high-density microarrays. However, the majority of studies still focus on the simpler gene-level analysis. We show here that transcript-level interpretation of high-resolution profiles remains non-trivial, and that the reproducibility of results from current state-of-the-art approaches is drastically lower than at the gene level. We can show that the application of modern factor analysis still allows hidden confounders to be removed and thereby improves reproducibility. In particular, this approach benefits from community-standard reference sets for inter-laboratory calibration.

Choppy: an integrated pipeline management platform for efficiently achieving computational reproducibility in omics
Jingcheng Yang1, Yechao Huang1, Jun Shang1, Zhaojie Xia1, Luyao Ren1, Ying Yu1, Yuanting Zheng1, Li Guo2, Leming Shi1

1Center for Pharmacogenomics, School of Life Sciences, Fudan University, Shanghai 200433, China
2State Key Laboratory of Multiphase Complex Systems, Institute of Process Engineering, Chinese Academy of Sciences, Beijing 100190, China.


Reproducibility is a fundamental hallmark of good science, and reproducible data analysis plays a critical role in achieving reproducibility in omics research due to its overwhelming complexity. To make omics data analysis results independently reproducible by a third party, it is essential to keep track of the details of each step of the omics data analysis process. Therefore, we developed the Choppy platform for storing, organizing, processing, analyzing, and sharing such omics big data, along with the computer code, the computing environment in which the code was applied to the data, and the related documentation, so that omics data analysis can readily be made reproducible, traceable, and scalable. Currently, several pipelines based on community-wide best practices for the analysis and automated processing of omics data from whole-genome sequencing, whole-exome-seq, target-seq, methylation, ChIP-seq, RNA-seq, and miRNA-seq have been distributed as Choppy Apps in the Choppy Store (http://choppy.3steps.cn).
Choppy has the following advantages:
1. Choppy uses the Choppy Store as an open warehouse to share Choppy Apps, which are implemented in the Common Workflow Language (CWL) or the Workflow Description Language (WDL), together with the user’s own project metadata.
2. Based on the Conda and pip package managers, Choppy is an auto-builder for Docker images: it can automatically build Docker images from the user’s specification of custom code and dependencies.
3. Choppy includes a report generator that easily creates interactive web reports using Python code. It is based on a Markdown renderer and a plugin mechanism that supports several dynamic plotting tools, such as MultiQC, Bokeh, Plotly, Shiny, BioJS, and other JavaScript libraries. It builds comprehensive static HTML pages that can be hosted on GitHub, Amazon S3, AliCloud OSS, or other sites.
4. As a cloud platform, Choppy can be easily deployed on AliCloud, AWS, Azure, and GCP. More cloud platforms will be supported soon.

Source code for Choppy is freely available at http://choppy.3steps.cn/go-choppy/choppy-cli. All documentation is hosted at http://docs.3steps.cn.

Why all Principal Component Analyses (PCA) in genetics are likely wrong?
E. Elhaik1

1Department of Animal and Plant Sciences, University of Sheffield, Sheffield, UK


The problem of irreproducibility that plagues science slows our progress, wastes our resources, and endangers lives. Addressing this problem requires multi-layered investigations starting with the “low hanging fruits” – the commonly used approaches and methods that are inherently irreproducible. Principal component analysis (PCA) is one of the most popular statistical methods in population genetics, used to identify structure in the distribution of genetic variation. The study of Price et al. (2006), which introduced the Eigenstrat package that calculates PCA, has been cited ~7,000 times, with other PCA packages for genetics (e.g., SNPRelate) trailing with hundreds of citations. PCA results from GWAS are an integral part of mega-databases like MR-BASE and are reported in all major studies. A few studies highlighted the limitations of PCA and criticized the assumptions it makes; however, they were largely ignored, likely because: A) their approach failed to illustrate the magnitude of the problem and B) they did not offer an alternative. Here, we argue that PCA’s accuracy for genetic applications was never proven and that its high popularity is mainly due to its limitations, i.e., the ability to manipulate the results in the desired direction, which creates a major problem of reproducibility and decreases the power of genetic studies. To prove our arguments, we developed a model that compares the true genetic distances to those produced by PCA and measures the bias. We will show that all the common PCA applications in the genetic literature produce false conclusions and demonstrate how to generate almost any desirable outcome. We will further present examples using real genetic data. Finally, we will propose an alternative approach to measure genetic distances. Our findings demonstrate that PCA is unsuitable for genetic studies and that a large corpus of the genetic literature should be reevaluated.
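A minimal sketch of the kind of comparison such a model formalizes (toy genotypes; not the authors' model): distances between samples in the full genotype space are compared with distances in the top-two-PC space to quantify the distortion, here under unbalanced sampling.

import numpy as np
from sklearn.decomposition import PCA
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

# Toy genotype matrix: three populations of unequal size, 5,000 SNPs coded 0/1/2.
rng = np.random.default_rng(3)
freqs = np.clip(rng.beta(2, 2, size=(3, 5000)), 0.05, 0.95)
sizes = [200, 80, 20]  # unbalanced sampling, a known driver of PCA distortion
G = np.vstack([rng.binomial(2, freqs[k], size=(n, 5000)) for k, n in enumerate(sizes)])

true_dist = pdist(G)                                    # distances in genotype space
pc_dist = pdist(PCA(n_components=2).fit_transform(G))   # distances in the top-two-PC space
rho = spearmanr(true_dist, pc_dist)[0]
print(f"rank agreement between genotype-space and PC-space distances: rho = {rho:.2f}")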

maTE: Discovering Expressed MicroRNA - Target Interactions
Malik Yousef1*, Loai Abddallah2, and Jens Allmer3

1Department of Community Information Systems, Zefat Academic College, Zefat, 13206, Israel. e-mail: malik.yousef@gmail.com
2Department of Information Systems, The Max Stern Yezreel Valley Academic College, Israel. e-mail: Loai1984@gmail.com
3Hochschule Ruhr West, University of Applied Sciences Institute for Measurement Engineering and Sensor Technology Medical Informatics and Bioinformatics, Mülheim an der Ruhr, Germany. e-mail: jens@allmer.de

Motivation: Disease is often manifested via changes in transcript and protein abundance. MicroRNAs (miRNAs) are instrumental in regulating protein abundance and may measurably influence transcript levels as well. MicroRNAs often target more than one mRNA (for human the average is three) and mRNAs are often targeted by more than one miRNA (for the genes considered in this study, the average is also three). Given a set of dysregulated transcripts, it is difficult to determine the minimal set of causative miRNAs. We present a novel approach, maTE, based on machine learning, which integrates miRNA target genes with gene expression data. maTE depends on the availability of a sufficient number of patient and control samples. The samples are used to train classifiers to accurately classify the samples on a per-miRNA basis, and a combined classifier is built from multiple miRNAs to improve separation. Results: The aim of the study is to find the set of miRNAs causing regulation of their target genes that best explains the difference between groups (e.g. cancer vs. control). maTE provides a list of significant groups of genes where each group is targeted by a specific miRNA. For the datasets used in this study, maTE generally achieves an accuracy well above 80%. It is of note that when the accuracy is much lower (e.g. ~50%), the set of miRNAs provided is likely not causative of the difference in expression. This new approach of integrating miRNA regulation with expression data yields powerful results and is independent of external labels and training data. Thereby, it opens up new avenues for exploring miRNA regulation and may pave the way for the development of miRNA-based biomarkers and drugs. Availability: The KNIME workflow implementing maTE is available at Bioinformatics online.
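A minimal sketch of the per-miRNA grouping idea under stated assumptions (synthetic data and a hypothetical miRNA-to-target map; maTE itself is distributed as a KNIME workflow).

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic expression matrix (80 samples x 300 genes), case/control labels, and a
# hypothetical map from each miRNA to the column indices of its target genes.
rng = np.random.default_rng(5)
X = rng.normal(size=(80, 300))
y = np.repeat([0, 1], 40)
X[y == 1, :10] += 1.0  # targets of "miR-A" are dysregulated in the cases
mirna_targets = {"miR-A": range(0, 10), "miR-B": range(10, 20), "miR-C": range(20, 30)}

# Score each miRNA by how well its target genes alone separate the two groups.
scores = {}
for mirna, targets in mirna_targets.items():
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    scores[mirna] = cross_val_score(clf, X[:, list(targets)], y, cv=5).mean()

for mirna, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{mirna}: per-group accuracy {acc:.2f}")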

Modeling of gene expression time series with the latent time-evolving graphical lasso
Veronica Tozzo1, Annalisa Barla1

1Department of Informatics, Bioengineering, Robotics and System Engineering Universita’ degli Studi di Genova, Via Dodecaneso 35, 16146 Genova, Italy

Biomedical research is leaning more and more towards the automatic analysis of thousands of data points gathered under specific biological or physical conditions. Especially when dealing with -omics data, models able to capture the complexity of, and interactions among, variables are a primary necessity. Many statistical methods that perform reverse engineering on data have been proposed. These methods learn, from real-world measurements, graphs of connections in which the entities (e.g. genes) are the nodes and their connections (e.g. regulatory impulses) are modelled as edges. Lately, research on these techniques has moved towards more complex methods able to capture the evolution of complex systems over time as well as their latent factors or their multi-level interactions. These methods, in particular the single-layer ones, have already proved to be effective and useful in the analysis of -omics data, especially given the reproducibility of results, a key point in the understanding of a phenomenon. We propose a model, the latent-variable time-varying graphical lasso (LTGL), for the analysis of gene expression time series. LTGL is able to capture changes over time in the gene co-expression network as well as to provide indications of the number of latent variables acting within the system and of the points at which they change. For this reason, LTGL is particularly suited to the study of systems subject to perturbation, e.g. the response to a drug. In the presence of prior knowledge on the system, LTGL could be extended to provide clearer information on the latent variables, going in the direction of a time-varying two-layer network. Given the reproducibility of the results and the statistical nature of the analysis, we argue that the use of this type of method is essential for disentangling the underlying biological mechanisms.
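As a simplified illustration (a sketch that fits an independent sparse graphical model per time window with scikit-learn; LTGL itself additionally couples consecutive networks over time and models latent variables):

import numpy as np
from sklearn.covariance import GraphicalLassoCV

# Toy time series: 5 time windows, each with 40 samples of 15 "genes".
rng = np.random.default_rng(11)
windows = [rng.multivariate_normal(np.zeros(15), np.eye(15), size=40) for _ in range(5)]

networks = []
for t, X in enumerate(windows):
    precision = GraphicalLassoCV().fit(X).precision_
    edges = (np.abs(precision) > 1e-3) & ~np.eye(15, dtype=bool)
    networks.append(edges)
    print(f"window {t}: {edges.sum() // 2} inferred co-expression edges")

# Edge changes between consecutive windows hint at time points where the network is rewired.
print("edge changes:", [int((networks[t] ^ networks[t + 1]).sum() // 2) for t in range(4)])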

Integrated analysis of cancer cell lines and tumor samples guides cell line selection for cancer research: a landscape study
Xin Shao, Jie Liao and Xiaohui Fan

College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang, P. R. China, 310058

Cancer cell lines are widely used model systems in a variety of important research areas, from the elucidation of carcinogenesis to drug evaluation, due to obvious advantages in availability and cost. However, increasing evidence has indicated that phenotypic responses vary between tissue samples and cancer cell lines, which is related to the intrinsic heterogeneity of tumors, e.g. in mutations, copy number variations and gene expression. Therefore, two linked questions arise: whether a cancer can be studied using a single cancer cell line, and how to find the proper one. Accumulating data on the transcriptional profiles of tumor samples are now available for systematic analysis, as are data for cancer cell lines through several well-known projects. Here, a comprehensive comparison of tumor samples with cancer cell lines on genomic characteristics is carried out. First, high concordance of the genomic signatures of cancer cell lines was observed among the NCI60 Project, the Cancer Cell Line Encyclopedia and the COSMIC Cell Lines Project. We observed a few cancer cell lines transcriptionally correlated with the corresponding tissue samples across 22 types; however, this does not hold for a large part of them. Thus, a computational method is proposed to prioritize cell lines that resemble the corresponding cancer type based on gene expression pattern. Validation experiments on the predicted top cell lines with the closest resemblance to a certain tumor group were conducted using FDA-approved or phase III/IV drugs for breast, prostate and thyroid carcinoma. Additionally, a web tool is under development to allow on-the-fly evaluation, which offers good insight for drug screening or mechanism research. Our study bridges the gap between tumors and cell lines and presents a helpful guide to selecting the most suitable cell line model for cancer studies.
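A minimal sketch of the prioritization idea under stated assumptions (toy expression values and hypothetical identifiers; the published method may use different gene sets, transformations and correlation measures): cell lines are ranked by the correlation of their expression profiles with the median profile of the tumor type of interest.

import numpy as np
import pandas as pd
from scipy.stats import spearmanr

# Toy log-expression data over a shared gene set (hypothetical names).
rng = np.random.default_rng(8)
genes = [f"g{i}" for i in range(2000)]
tumors = pd.DataFrame(rng.normal(5, 2, size=(40, 2000)), columns=genes)  # 40 tumor samples
cell_lines = pd.DataFrame(rng.normal(5, 2, size=(10, 2000)), columns=genes,
                          index=[f"CL{i}" for i in range(10)])

tumor_centroid = tumors.median(axis=0)
ranking = {cl: spearmanr(profile, tumor_centroid)[0] for cl, profile in cell_lines.iterrows()}
for cl, rho in sorted(ranking.items(), key=lambda kv: -kv[1])[:3]:
    print(f"{cl}: correlation with the tumor centroid = {rho:.2f}")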

Laying the Foundation for Reproducible Machine Learning: Systematic Cross-sectional and Longitudinal Dataset Comparisons
Colin Birkenbihl, Marc Jacobs, Martin Hofmann-Apitius

Fraunhofer Institute for Algorithms and Scientific Computing SCAI. Email: colin.birkenbihl@scai.fraunhofer.de

To achieve reproducible results, the process of building a reliable machine learning model involves its validation on a test dataset that is independent from the data the model was trained on. Adequate model validation is of utmost importance to prove that the model generalizes well, and is thereby applicable, beyond the training data. Finding a combination of datasets that can potentially serve the purpose of training and testing data is no trivial task. The main assumptions to be met by the test data are that all features used for training the model are available and that they show value ranges comparable to the ones exhibited in the training data.

We developed DataComp, a free open-source Python package for systematic, domain-independent dataset comparisons. It serves as an investigative toolbox to assess differences at the feature level across multiple datasets. A broad variety of statistical comparison methods has been implemented that can be applied cross-sectionally and longitudinally.

For example, the software allows users to assess the feature overlap between the compared datasets, to identify which and how many features significantly deviate between them, and to evaluate the presence of batch effects. Multiple comprehensive visualization techniques are supported that can be used to investigate data differences and illustrate comparison results. On top of that, conducting such comparisons will inherently increase interoperability between the compared datasets.
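As an illustration of such a feature-level comparison (a generic sketch, not the DataComp API): shared numeric features of two datasets are compared with a nonparametric test and corrected for multiple testing.

import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu

def compare_datasets(df_a, df_b, alpha=0.05):
    shared = [c for c in df_a.columns if c in df_b.columns]
    pvals = {c: mannwhitneyu(df_a[c].dropna(), df_b[c].dropna()).pvalue for c in shared}
    ordered = sorted(pvals, key=pvals.get)  # Benjamini-Hochberg correction
    k = 0
    for rank, c in enumerate(ordered, start=1):
        if pvals[c] <= alpha * rank / len(ordered):
            k = rank
    return shared, ordered[:k]

# Hypothetical clinical tables standing in for a training and an independent test cohort.
rng = np.random.default_rng(2)
train = pd.DataFrame({"age": rng.normal(70, 8, 300), "mmse": rng.normal(24, 3, 300)})
test = pd.DataFrame({"age": rng.normal(74, 8, 200), "mmse": rng.normal(24, 3, 200)})
shared, deviating = compare_datasets(train, test)
print(f"{len(shared)} shared features; significantly deviating: {deviating}")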

We showcase the functionalities on clinical Alzheimer’s disease patient data.

The presented software empowers data analysts to identify significantly different or comparable dataset combinations. The latter are good candidates to be used in machine learning approaches as training and independent test data. Thereby, DataComp facilitates reproducible and reliable machine learning.

Context graph for biomedical research data: A FAIR and open approach towards reproducible research in Medicine
Jens Dörpinghaus, Marc Jacobs, Martin Hofmann-Apitius

Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, Sankt Augustin, Germany. Email: jens.doerpinghaus@scai.fraunhofer.de

Hypothesis generation and knowledge discovery on biomedical data are widely used in medical research and digital health. In addition, the massive amounts of available data build the basis for a multitude of predictive-medicine Machine Learning (ML) and AI approaches. We claim that in order to achieve reproducible research in predictive medicine we need standardized and FAIR [1] context graphs for biomedical research.

Here we present a novel approach that annotates research data with context information. The result is a knowledge graph representation of the data, the context graph. It contains computable statement representations (e.g. RDF or BEL). This graph allows research data records from different sources to be compared, as well as the selection of relevant data sets using graph-theoretical algorithms. It thereby constitutes a first step towards a reproducible research paradigm with regard to big data in biomedical research. It is possible to retrieve data by context (cohort size, settings, results) and by content. For example, this system may answer questions like “Give me a clinical trial to reproduce my results or to apply my model”.
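A minimal sketch of such a computable context annotation using RDF (hypothetical namespace and identifiers; the production system stores the graph in a graph database behind SCAIView):

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/context/")  # hypothetical namespace
g = Graph()

trial = EX["trial/NCT0000000"]                 # hypothetical study identifier
g.add((trial, RDF.type, EX.ClinicalTrial))
g.add((trial, EX.cohortSize, Literal(250, datatype=XSD.integer)))
g.add((trial, EX.indication, Literal("Alzheimer's disease")))
g.add((trial, EX.reportsOutcome, EX["statement/cognitive-decline-slowed"]))

# Context queries ("give me a trial with a cohort of at least 200 to reproduce my results")
# can then be expressed over such a graph, e.g. with SPARQL.
print(g.serialize(format="turtle"))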

We will present a first proof-of-concept on biomedical research data. We applied text mining and knowledge extraction tools to data from PubMed [2] (29 million abstracts) and PMC (4 million full-text articles). The data itself is available with SCAIView ([3], www.scaiview.com). The resulting context graph is stored in a graph database; searches and data extraction can be done using the SCAIView web frontend. We have already demonstrated that this system is capable of answering semantic questions [4, 5].

We will discuss problems for standardization of context data and technologies from the semantic web. Graph algorithms as well as highly efficient technologies for storing different data types are a key issue. In addition, we will state the importance of context data for reproducible AI approaches for predictive medicine.


[1] Wilkinson, Mark D., et al. "The FAIR Guiding Principles for scientific data management and stewardship." Scientific data 3 (2016).
[2] N. R. Coordinators, “Database resources of the national center for biotechnology information,” Nucleic acids research, vol. 45, no. Database issue, p. D12, 2017.
[3] J. Dörpinghaus, J. Klein, J. Darms, S. Madan, M. Jacobs, Scaiview – a semantic search engine for biomedical research utilizing a microservice architecture, in: Proceedings of the Posters and Demos Track of the 14th International Conference on Semantic Systems - SEMANTiCS2018, 2018.
[4] J. Dörpinghaus, S. Schaaf, M. Jacobs, Soft document clustering using a novel graph covering approach, BioData Mining 11 (1) (2018) 11.
[5] J. Dörpinghaus, J. Darms, M. Jacobs, What was the question? A systematization of information retrieval and NLP problems. In: 2018 Federated Conference on Computer Science and Information Systems (FedCSIS), IEEE, 2018.

SPARSim Single Cell: a count data simulator for scRNA-seq data
Giacomo Baruzzo1, Ilaria Patuzzi1,2, and Barbara Di Camillo1,*

1Department of Information Engineering, University of Padova, Padova, Italy. 2Microbial Ecology Unit, Istituto Zooprofilattico Sperimentale delle Venezie, Padova, Italy. * Corresponding author

Single cell RNA sequencing (scRNA-seq) has revolutionized transcriptomic studies, offering new possibilities to address biological and medical questions. However, the bioinformatics analysis of this new kind of data is challenging. Indeed, scRNA-seq count data show many differences compared to bulk RNA-seq count data, making the application of existing RNA-seq preprocessing/analysis methods not straightforward or even inappropriate. Despite considerable efforts in developing new computational methods, several comparative studies highlight that no computational method clearly outperforms the others, making the definition of best practices and standard analysis pipelines a major issue.

To help the development of more robust and reliable scRNA-seq preprocessing/analysis methods, the availability of simulated data plays a pivotal role. However, only a few scRNA-seq count data simulators are available, often showing poor or undemonstrated similarity to real data.

In this work we present SPARSim, a scRNA-seq count data simulator based on a Gamma-Multivariate Hypergeometric model. In particular, the novel Multivariate Hypergeometric modelling allows the generation of simulated count matrices that closely match the sparsity (including dropout events) observed in real scRNA-seq count data. SPARSim provides great input flexibility to the end user, allowing them to specify the simulation parameters or, alternatively, to estimate them from real data. In addition, SPARSim includes functions to simulate spike-ins, batch effects and bimodal genes. Similar to the assessment framework proposed in Zappia et al. (2017), we assess SPARSim in its ability to simulate count data that resemble real data in terms of count intensity, variability and sparsity. We test SPARSim on 6 real scRNA-seq datasets, describing more than 30 experimental conditions and covering a wide range of biological (species and cell types) and technical (platform/protocol) scenarios. Across all the tested datasets, the assessment results show that SPARSim performs comparably to or better than one of the most widely used scRNA-seq count data simulators, Splat (Zappia et al., 2017).
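As a conceptual illustration of the Gamma-multivariate hypergeometric idea (a simplified sketch with made-up parameters; SPARSim itself is an R package with a richer parameterization): gene-level intensities are drawn from Gamma distributions, and each cell's reads are then drawn without replacement from the resulting fragment pool, which naturally produces sparsity and dropouts.

import numpy as np

rng = np.random.default_rng(4)
n_genes, n_cells = 2000, 100

# Gene-level expression intensities (Gamma), turned into an integer pool of available fragments.
intensity = rng.gamma(shape=0.5, scale=20.0, size=n_genes)
pool = np.round(intensity * 50).astype(np.int64)

# Each cell's counts are a multivariate hypergeometric draw from the pool at the cell's depth,
# so lowly expressed genes frequently drop out.
depths = rng.integers(5_000, 20_000, size=n_cells)
counts = np.column_stack([rng.multivariate_hypergeometric(pool, int(d)) for d in depths])

print(f"simulated matrix: {counts.shape}, zero fraction = {(counts == 0).mean():.1%}")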

A Pan-Cancer Analysis of the TME Shows Reproducibility but Tissue-Dependent Associations between RNA-based Immune Signatures, Tumor Mutational Burden, and Outcomes
Wendell Jones

EA Genomics-Q2 Solutions, Morrisville, NC USA

Immune-based biomarkers are now commonly available when measuring the tumor microenvironment (TME), although many are based on immunohistochemistry methods, which sometimes have drawbacks in specificity, quantitation and variety. Clinical-grade RNA assays, whether using NGS, NanoString, or HTG methods, are now much more accessible (for example, processing formalin-fixed material has a much higher degree of success) and are commonly employed in clinical trials to measure tumor activity and immune cell content in the TME.

From previous research and pan-cancer information from TCGA, we have analyzed gene signatures for immune activity encompassing a dozen distinct subcomponents of the immune system and TME response to immune infiltration and their association with outcomes. We have applied these immune signatures in multiple independent oncology datasets and in multiple solid tumor indications using TCGA and other public sources of data including more recent clinical trial data.

The analysis encompasses
a) multiple platforms (e.g., microarray and NGS) from the same cohort to examine signature measurement consistency and reproducibility across RNA measurement platforms
b) multiple independent cohorts from the same indication to examine signature reproducibility related to association with outcomes such as overall survival (OS) and event-free survival (EFS)
c) multiple independent cohorts from different tissues and indications to examine signature generalizability from one tissue and indication to another

Results suggest that mature RNA platforms can reproduce important immune signatures in general, although some are more easily reproduced than others. Further, in examining one particular indication in detail (ovarian cancer), we see general reproducibility among important RNA-based signatures in multiple independent cohorts. However, generalizability of immune signatures and tumor mutational burden (TMB) across tissues, indications, and endpoints is more elusive. The results also suggest that relevant biomarkers for immune status can be gleaned and developed from RNA from the TME and from TMB from the tumor, but their utility is tissue- and subtype-specific.

Evaluation of de novo assembly and structural variation detection in paired tumor-normal samples
Chunlin Xiao1, Huixiao Hong2, Wenming Xiao3

1NCBI/NLM/NIH, 45 Center Drive, Bethesda, MD 20892, USA
2NCTR/FDA, 3900 NCTR Road, Jefferson, AR 72079, USA
3CDRH/FDA, 10903 New Hampshire Avenue, Silver Spring, MD 20993, USA

Structural variations (SVs) are well known to contribute to the genetic diversity of human populations, affect biological functions, and even cause various human diseases. However, accurately identifying SVs with correct sizes and locations in the human genome, particularly in cancer samples, remains a challenge due to the complexities of the human genome, the limitations of sequencing technologies, and the drawbacks of analysis methods. Recent advancements in next-generation sequencing technologies have significantly reduced the sequencing cost while substantially increasing the lengths of sequencing reads. Therefore, using de novo assembly-based approaches for detecting a full spectrum of SVs in the human genome becomes appealing. While various assembly methods have been developed for general use, the relative efficiency and predictive accuracy of SV calling based on these assembly methods have not been fully evaluated. In this study, we applied several popular de novo assembly tools to sequencing read data that were generated using multiple sequencing technologies, with technical replicates, for a pair of tumor-normal samples. Assemblies were produced from the data of individual sequencing platforms or combinations of multiple sequencing platforms, and the quality of each assembly was assessed by various approaches. SV callsets were generated from each of the assemblies. Repeatability of the SVs across technical replicates and reproducibility across sequencing sites were also evaluated. Results showed that there was substantial variability between SV callsets based on various assemblies, both within a sequencing platform and across sequencing sites. We also observed considerable variability between assembly-based SV callsets and alignment-based SV callsets. The sources of these variabilities have been analyzed. These results allow a better understanding of the impact of de novo assembly methods on SV calling, thus providing better insight for precision medicine.
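A minimal sketch of one repeatability metric used in this kind of comparison (hypothetical callsets; real evaluations typically also match SV type and breakpoint confidence intervals): two deletion callsets are matched by 50% reciprocal overlap and summarized by a Jaccard index.

# Hypothetical sketch: Jaccard similarity of two SV (deletion) callsets on one chromosome.
def reciprocal_overlap(a, b, min_frac=0.5):
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    return overlap >= min_frac * (a[1] - a[0]) and overlap >= min_frac * (b[1] - b[0])

def callset_jaccard(calls_a, calls_b):
    matched_a = sum(any(reciprocal_overlap(a, b) for b in calls_b) for a in calls_a)
    matched_b = sum(any(reciprocal_overlap(b, a) for a in calls_a) for b in calls_b)
    matched = min(matched_a, matched_b)
    return matched / (len(calls_a) + len(calls_b) - matched)

replicate_1 = [(10_000, 12_500), (40_000, 41_000), (90_000, 95_000)]
replicate_2 = [(10_050, 12_400), (90_500, 95_200), (120_000, 121_000)]
print(f"callset Jaccard between replicates: {callset_jaccard(replicate_1, replicate_2):.2f}")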

M-RNA Lung Cancer Data Classification Using Feature Selection
Amrane Meriem1, Oukid Salyha1, Tolga Ensari2

1 Blida 1 University
2 Istanbul University

The number of cancer patients has increased dramatically over the last decade, and lung cancer (LC) is the most common cancer type. Several studies have shown that LC can be eradicated if the patient receives an early diagnosis and a suitable treatment. The main problem remains that it does not show any symptoms when it starts, even with MRI images and chest X-rays. This leads biologists and radiologists to look for new methods of detection. In biology, for example, we can study the genotype of cells and try to find correlations with the phenotype. In this study, we use a microarray dataset to detect the genes that are causing the disease. We combine the Fisher score and mutual information methods to rank genes; our study shows that the best 3 target biomarkers are 'hsa-miR-126', 'hsa-miR-423-5p' and 'hsa-let-7d'. We classify our data using SVM and random forest and obtain an accuracy of 99.30%.
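A minimal sketch of the described pipeline under stated assumptions (synthetic data in place of the microarray set; a hand-rolled Fisher score naively summed with scikit-learn's mutual information, followed by SVM and random forest classification):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(9)
X = rng.normal(size=(120, 500))   # synthetic miRNA expression matrix
y = np.repeat([0, 1], 60)         # tumor vs. normal labels
X[y == 1, :5] += 1.5              # a few informative features

def fisher_score(X, y):
    mu = X.mean(axis=0)
    num = sum((y == c).sum() * (X[y == c].mean(axis=0) - mu) ** 2 for c in np.unique(y))
    den = sum((y == c).sum() * X[y == c].var(axis=0) for c in np.unique(y)) + 1e-12
    return num / den

combined = fisher_score(X, y) + mutual_info_classif(X, y, random_state=0)
top20 = np.argsort(combined)[::-1][:20]
print("top 3 candidate biomarkers (feature indices):", top20[:3])

for clf in (SVC(kernel="linear"), RandomForestClassifier(n_estimators=300, random_state=0)):
    acc = cross_val_score(clf, X[:, top20], y, cv=5).mean()
    print(f"{type(clf).__name__} accuracy on selected features: {acc:.3f}")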

Machine learning and mechanistic models of pathway activity to discover new therapeutic targets
Matías M. Falcó1,2, Cankut Çubuk1, Carlos Loucera1, Marina Esteban1, Inmaculada Álamo1, Marta R. Hidalgo3, María Peña-Chilet1,2, and Joaquín Dopazo1,2,4

1Clinical Bioinformatics Area, Fundación Progreso y Salud (FPS), Hospital Virgen del Rocío, Sevilla, Spain
2Bioinformatics in Rare Diseases (BiER-U715), Centro de Investigación Biomédica en Red de Enfermedades Raras (CIBERER), Sevilla, Spain;
3Bioinformatics and Biostatistics Unit, Centro de Investigación Príncipe Felipe (CIPF), Valencia, Spain
4FPS, INB-ELIXIR-ES, Sevilla, Spain


In spite of the increasing availability of genomic and transcriptomic data, there is still a gap between the detection of perturbations in gene expression and the understanding of their contribution to the molecular mechanisms that ultimately account for cell and individual phenotypes. Alterations in signaling and metabolism are behind the initiation and progression of many diseases, including cancer. The wealth of available knowledge on signal transduction and metabolic networks can therefore be used to derive mechanistic models that link gene expression perturbations to changes in signaling and/or metabolic activity, providing relevant clues on the molecular mechanisms of disease and on drug modes of action (MoA).

Since pathway activities are causative of cell outcomes that ultimately dictate cell fate, such activities can be used in a machine learning framework to predict the effect of interventions on specific targets. Specifically, we use a multi-task regression framework based on Gaussian processes for this purpose. Gaussian processes are fully probabilistic machine learning models with the appealing property of capturing the complex relationships between the inputs (pathway activities) and the outputs (cell fate). To overcome dimensionality problems, we use a sparse approximation that can be learnt with a variational formulation jointly inferring both the support samples and the kernel hyperparameters. Variational approximations are less prone to overfitting the data, one of the major drawbacks of nonlinear learning methods when applied to genomic Big Data analysis. Different versions of the massive knock-down Achilles experiment are used for training and validating the model, respectively.
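
As a simplified illustration of the regression setting (pathway activities as inputs, a cell-fate readout as output), the sketch below fits a plain Gaussian process with scikit-learn; it does not reproduce the sparse variational multi-task model described above, and all variable names and data are hypothetical.

    # Simplified sketch: a plain GP regression from pathway activities to a
    # cell-fate readout (scikit-learn). The authors' model is a sparse
    # variational multi-task Gaussian process, which this does not reproduce.
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    rng = np.random.default_rng(0)
    pathway_activities = rng.uniform(size=(200, 10))   # samples x pathways (hypothetical)
    cell_fate = pathway_activities[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)

    kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(
        pathway_activities[:150], cell_fate[:150]
    )
    mean, std = gp.predict(pathway_activities[150:], return_std=True)
    print("held-out R^2:", round(gp.score(pathway_activities[150:], cell_fate[150:]), 3))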

Mechanistic models of pathways are freely available at: http://metabolizer.babelomics.org and http://hipathia.babelomics.org. The effect of interventions over specific targets can be simulated in: http://pathact.babelomics.org. A Bioconductor package is also available at: http://bioconductor.org/packages/devel/bioc/html/hipathia.html.

Analytic Opportunities and Challenges with Single-Cell RNA-Sequencing Data
Kelci Miclaus1, Meijian Guan1, Eliver Ghosn2

1JMP Life Sciences SAS Institute Inc., Cary NC USA
2Emory Vaccine Center, Emory University School of Medicine, Atlanta GA USA

Single-cell RNA sequencing technologies (scRNA-seq) are growing rapidly, providing opportunities for researchers to uncover new and rare cell populations, track trajectories of distinct cell lineages in development, and reveal differentially expressed genes between specific cell types. However, as with any new technology, these data pose new challenges in visualization and statistical analysis due to their high dimensionality, sparsity, and distinct applications with varying heterogeneity across cell populations. Advanced visualization dimension-reduction techniques have been developed and applied to scRNA-seq data, most notably t-Distributed Stochastic Neighbor Embedding (t-SNE) and uniform manifold approximation and projection (UMAP). While such methods are powerful tools to detect local structures of scRNA-seq data, appropriate sparsity handling, parameterization and interpretation are required. Furthermore, discrepancies between study areas, such as cancer research versus immunology, impact the appropriate application of reproducible, robust statistical methods based on varying levels of cellular heterogeneity for a given field. In this presentation we demonstrate interactive implementations of t-SNE and UMAP for visualization and interpretation of scRNA-seq data and discuss challenges for developing analysis pipelines for differential gene expression between heterogeneous cell types.
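
The following is a minimal Scanpy-based sketch of a typical scRNA-seq embedding workflow of the kind discussed above; the counts are synthetic and the parameter choices (number of neighbors, perplexity, number of variable genes) are illustrative assumptions that, as noted, deserve careful tuning in practice.

    # Minimal sketch of a typical scRNA-seq embedding workflow with Scanpy;
    # counts and all parameter choices are illustrative only.
    import numpy as np
    import anndata as ad
    import scanpy as sc

    counts = np.random.poisson(1.0, size=(1000, 2000)).astype(np.float32)  # cells x genes (synthetic)
    adata = ad.AnnData(counts)

    sc.pp.normalize_total(adata, target_sum=1e4)          # library-size normalization
    sc.pp.log1p(adata)
    sc.pp.highly_variable_genes(adata, n_top_genes=500)
    adata = adata[:, adata.var.highly_variable].copy()
    sc.pp.pca(adata, n_comps=30)
    sc.pp.neighbors(adata, n_neighbors=15)
    sc.tl.umap(adata, min_dist=0.1)
    sc.tl.tsne(adata, perplexity=30)
    sc.tl.leiden(adata)                                    # graph-based clustering (requires leidenalg)
    sc.pl.umap(adata, color="leiden", show=False)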

Improving reproducibility in RNA-seq data analysis with svaseq
A. Muszyńska1, R. Przewłocki4, P. Łabaj1,2,3

1Małopolska Centre of Biotechnology UJ, Kraków, Poland
2Austrian Academy of Sciences, Vienna, Austria
3Chair of Bioinformatics RG, Boku University, Vienna, Austria
4Department of Molecular Neuropharmacology, Institute of Pharmacology Polish Academy of Sciences, Kraków, Poland


The reproducibility crisis is a widely known problem in science, and RNA-seq experiments are one of the fields that require improvement. This issue was addressed by the SEQC project [1]. However, in that work the tools for RNA-seq analysis were only tested on an artificially created dataset. Those approaches failed when dealing with the dataset presented in our work, probably because the expected changes in gene expression levels were not as large as in the artificial one. Here we show how reproducibility among experiments can be improved using the svaseq method, which searches for hidden factors in the model. The approach was tested on RNA-seq data generated by sequencing mouse tissue. The experiment was performed in three batches, each with 4 samples from mice with neuropathic pain and 4 controls. Batches were analysed separately with and without the svaseq method, and the results were then compared. Preliminary results showed that without factor analysis almost no genes are classified as differentially expressed, although many would be expected given the presence of neuropathy in one of the groups. Applying svaseq removes unwanted variation and enhances the signal, as described previously [1][2], and the number of detected genes increased to several thousand. However, the signal is still not stable across experiments: of all the genes detected, only about 8% are common to all batches. This might be due to biological differences between individual mice. Nevertheless, svaseq is a promising approach. Further investigation is needed to assess which tools should be used when dealing with data where the expected changes are small.

[1] Su, Z. et al. A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nature Biotechnology 32, 903-914 (2014)
[2] Łabaj PP, Kreil DP. Sensitivity, specificity, and reproducibility of RNA-Seq differential expression calls. Biol Direct. 11(1):66. doi: 10.1186/s13062-016-0169-7 (2016)
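
As an illustration of the idea behind this approach (the authors used the R/Bioconductor svaseq package; the sketch below is not that implementation), hidden factors can be estimated from the residual variation left after fitting the known design and then added to the model used for differential expression testing:

    # Conceptual sketch of hidden-factor estimation in Python (NOT svaseq itself).
    # Y: samples x genes log-counts, X: known design matrix (samples x covariates).
    import numpy as np

    def estimate_hidden_factors(Y, X, n_factors=2):
        # Remove the fitted effect of the known design, then take the leading
        # left singular vectors of the residuals as surrogate variables.
        beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
        residuals = Y - X @ beta
        U, _, _ = np.linalg.svd(residuals - residuals.mean(axis=0), full_matrices=False)
        return U[:, :n_factors]

    rng = np.random.default_rng(0)
    n_samples, n_genes = 24, 1000
    group = np.repeat([0, 1], n_samples // 2)               # neuropathy vs control (hypothetical)
    X = np.column_stack([np.ones(n_samples), group])
    Y = rng.normal(size=(n_samples, n_genes))

    sv = estimate_hidden_factors(Y, X)
    X_augmented = np.column_stack([X, sv])                  # design used downstream for DE testing
    print(X_augmented.shape)                                # (24, 4)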

SEQC2 calibration benchmarks for reproducible accurate gene transcript profiling: A comprehensive onco-panel reference for gene structure variant identification
D. P. Kreil1, P. P. Łabaj1-3 and A. Bergstrom Lucas4

1Chair of Bioinformatics, Boku University Vienna
2Małopolska Centre of Biotechnology UJ, Kraków, Poland
3Austrian Academy of Sciences, Vienna
4Agilent Technologies, Santa Clara, USA

While the reproducibility of differential screens at the pathway level tends to be better, and sufficient reproducibility at the gene level can be achieved with stringent filters, screen results for alternative gene transcripts depend drastically on the choice of computational analysis pipeline, limiting their application in predictive medicine.

Cancer research, however, increasingly recognizes that functionally relevant genomic variations often lead to consequential changes in gene transcription, making them easier to detect at the RNA level. Specifically, both gene fusions and changes in alternative splicing have regularly been implicated in the disease. As a result, capture panels for cancer diagnostics and research are beginning to also target RNA.

Here we present a novel validation strategy appropriate for targeted gene structure variant discovery and detection in RNA. We report on the compilation of a unique reference for assessing regular (short-read) RNA-seq platforms for targeted sequencing. For this we combine community-standardized SEQC2 sample mixtures, deep targeted long-read RNA sequencing, and DNA calibration of capture panel affinities. Several hundred genes have been selected for their relevance to the field.

We present a first characterization of a comprehensive deep and long targeted RNA-seq reference for the assessment of the discovery and detection of gene fusion events, of changes in alternative splicing, as well as the detection and exploitation of sequence mutations for the dissection of gene activities of diluted actors in heterogeneous samples, such as applied in cancer clone chase.

Besides the direct application of the validated platforms in cancer research and diagnostics, the presented reference forms a critical community resource for future method refinement and consolidation of short-read assays and pipelines. As a benchmark for the discovery and detection of gene fusion events, it will support improvements in fusion calls, towards an inter-analysis concordance beyond the currently typical agreement of < 20%. For the discovery and detection of changes in alternative splicing, it will support more reliable differential splicing calls, with inter-analysis concordance beyond the currently typical agreement of < 40%. This also means improved quantification of gene and transcript expression in general, a prerequisite for reproducible, meaningful predictive medicine.

Automating the analysis of high content screens by artificial intelligence
Paula A. Marin Zapataa, Stefan Prechtla, Sebastian Raesea, and Djork-Arne Cleverta

aBayer AG, Research & Development, Pharmaceuticals, Berlin

High content screening makes use of automated microscopy to evaluate functional changes in cells treated with thousands to millions of compounds. Such screens are traditionally analyzed by human-crafted algorithms, which are time consuming, require expert knowledge and are potentially biased towards human perception. To overcome these drawbacks, we developed a pipeline to analyze microscope images based on deep neural networks, which performs quantitative image analysis almost fully automatically. With minimal human input, our method matched the performance of expert-designed algorithms and discovered over 1500 novel primary hits when tested in two of our screens. Furthermore, our pipeline reduces the analysis time from weeks to days thanks to computation on Graphics Processing Units (GPUs). Overall, this work has a significant impact on the efficacy and quality of high content screens, prospectively increasing the number of candidate compounds in early drug research.

Standardizing the classification of herbicide phenotypes by transfer learning and ranking models
Paula A. Marin Zapataa, Sina Rothb, Dirk Schmutzlerb, Thomas Wolfb, Erica Manesso b, and Djork-Arne Cleverta
aBayer AG, Research & Development, Pharmaceuticals, Berlin
bBayer AG, Research & Development, Crop Science, Frankfurt

High-throughput phenotypic herbicide screening tests the effects of hundreds of thousands of compounds by means of miniaturized plant assays and automated image acquisition. Acquired images are fed to processing pipelines to extract quantitative features which describe the corresponding phenotype. Unfortunately, such pipelines require a priori knowledge and underuse the information contained in the images, since they only extract a handful of features. To overcome these drawbacks, we propose an information-rich approach to characterize plant phenotypes based on transfer learning. Initially, a neural network pre-trained on the ImageNet dataset is used to extract ~4000 features per image. The extracted features undergo a dimensionality reduction step, consisting of another neural network trained to rank similar plant images based on time-course information. Finally, the dimension-reduced features are normalized and clustered to define phenotypic groups. Our approach is able to separate normal development from herbicide-induced traits with high sensitivity. Moreover, by identifying phenotypic subgroups in a fully unsupervised manner, we provide an unbiased stratification of herbicide symptoms.
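
A minimal Python sketch of the transfer-learning idea described above, assuming an ImageNet-pretrained ResNet-50 as a fixed feature extractor, with PCA and k-means as simplified stand-ins for the ranking network and the clustering step (model choice, image size and cluster count are illustrative assumptions, not the authors' setup):

    # Pretrained CNN as a fixed feature extractor, then reduce and cluster.
    import torch
    import torchvision.models as models
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    backbone = models.resnet50(weights="DEFAULT")   # downloads ImageNet weights on first use
    backbone.fc = torch.nn.Identity()               # drop the classification head, keep 2048-d features
    backbone.eval()

    images = torch.rand(64, 3, 224, 224)            # hypothetical batch of plant images
    with torch.no_grad():
        features = backbone(images).numpy()         # 64 x 2048 feature matrix

    features = StandardScaler().fit_transform(features)
    reduced = PCA(n_components=16).fit_transform(features)    # stand-in for the ranking network
    labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(reduced)
    print(labels[:10])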

Adapting the MAQC-DAP to the UNICEF MICS Dataset to Predict Child Mortality in Low- and Middle-Income Countries
Andrea Bizzego1, Giulio Gabrieli2, Marc H. Bornstein3,4, Kirby Deater-Deckard5, Diane L. Putnick4, Jennifer E. Lansford6, Robert H. Bradley7, Megan Costa7, and Gianluca Esposito1,2

1 University of Trento, Italy
2 Nanyang Technological University, Singapore
3 Institute for Fiscal Studies, UK
4 Eunice Kennedy Shriver National Institute of Child Health and Human Development, USA
5 University of Massachusetts, USA
6 Duke University, USA
7 Arizona State University, USA

Child mortality (CM) still affects 3.9% of children below five years of age worldwide, with a large and worrisome disparity between high-income (0.9%) and low-income countries (7.5% in African countries). The objective of this study is to identify risk factors associated with CM, based on data from the Multiple Indicator Cluster Survey (MICS, waves 4 and 5): a nationally representative collection of surveys from 71 middle- and low-income countries (N=5,846,662), led by UNICEF. We selected 48 indicators representative of household and woman status to be used as predictors. All women who gave birth to at least one child were selected (Age M=31.6; SD=8.5) and categorized as: (i) mothers with children older than 5 who survived (N=74,507; Age M=30.8; SD=8.4) and (ii) mothers who lost at least one child under the age of 5 (N=14,303; Age M=35.7; SD=8.0). The 10 countries with all 48 indicators were considered. The MicroArray Quality Control Data Analysis Plan (MAQC-DAP) was adopted to optimize and train a Random Forest model for classifying the category and to obtain the list of predictors, ranked by relative Mean Decrease in Impurity (rMDI). The model's performance, evaluated in terms of the Matthews Correlation Coefficient (MCC), was MCC=0.325, 90% CI [0.321, 0.329] on the training set and MCC=0.329 on the test set. The top 3 predictors (cumulative rMDI=0.34) are: mother's age (rMDI=0.176), availability of electricity (rMDI=0.087) and highest level of school attended (rMDI=0.077). These results provide important signposts to frame effective development interventions and healthcare enhancements to reduce CM. This application brings a broadly validated framework, previously limited to bioinformatics, to the field of human developmental science.
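
A minimal sketch of the modelling step described above (Random Forest, impurity-based importances, MCC evaluation); the data, labels and indicators are synthetic placeholders, not MICS variables:

    # Random Forest with relative Mean Decrease in Impurity and MCC evaluation.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import matthews_corrcoef
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 48))                               # 48 household/woman indicators (synthetic)
    y = (X[:, 0] + rng.normal(size=5000) > 1).astype(int)         # child-loss label (synthetic)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
    rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)

    print("test MCC:", round(matthews_corrcoef(y_te, rf.predict(X_te)), 3))
    # Relative Mean Decrease in Impurity, normalized to sum to 1, as in the ranking above.
    rMDI = rf.feature_importances_ / rf.feature_importances_.sum()
    print("top 3 indicators:", np.argsort(rMDI)[::-1][:3])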

X-CNV: predicting the pathogenicity of copy number variations with XGBoost
Yiran Tao#, Dongsheng Yuan#, Chengkai Lv#, Tieliu Shi*

1Center for Bioinformatics and Computational Biology, and the Institute of Biomedical Sciences, School of Life Sciences, East China Normal University, Shanghai, China, 200241
#Equal contribution
Contact: Tieliu Shi, email: tieliushi@yahoo.com

Advances in high-throughput DNA sequencing have promoted the identification of genomic variants in the human genome. However, inferring the effects of copy-number variants (CNVs) is still a challenge. Although there have been preliminary attempts at CNV impact prediction in the past, none of the existing tools provides a quantitative prediction of CNV pathogenicity. Here we present X-CNV, a novel computational tool for predicting the pathogenic impact of copy-number variants with an XGBoost-based algorithm. We incorporated 85 individual CNV annotation features in the model, distributed across coding, non-coding, and intergenic regions. We also included, as a component feature, the population allele frequency of each CNV, systematically curated and calculated with a maximum-clique method. In total, we collected structural variants from over forty thousand normal samples from seven major ethnic groups around the world as a reference. Our model achieved robust performance, with an AUC of 0.9537 and an F1-score of 0.9645 for pathogenicity prediction on the test dataset, and an AUC of 0.9736 on an independent validation dataset. A convenient web server is freely available at www.megabionet.org/x-cnv.
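
A minimal sketch of an XGBoost pathogenicity classifier over CNV annotation features, evaluated by AUC and F1-score; the feature matrix and labels are synthetic placeholders, not the X-CNV training data:

    # XGBoost classifier over (synthetic) CNV annotation features.
    import numpy as np
    import xgboost as xgb
    from sklearn.metrics import roc_auc_score, f1_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10000, 85))                                      # 85 annotation features
    y = (X[:, :5].sum(axis=1) + rng.normal(size=10000) > 0).astype(int)   # pathogenic vs benign (synthetic)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
    clf = xgb.XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
    clf.fit(X_tr, y_tr)

    proba = clf.predict_proba(X_te)[:, 1]
    print("AUC:", round(roc_auc_score(y_te, proba), 4))
    print("F1 :", round(f1_score(y_te, clf.predict(X_te)), 4))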

Electronic Health Records encoding for neurodevelopmental disorders stratification
Isotta Landi1,2, Riccardo Miotto3, Paola Venuti2, Joel T. Dudley3, Cesare Furlanello1

1FBK-Bruno Kessler Foundation, Trento, Italy
2DIPSCO Department of Cognitive Sciences, University of Trento, Italy
3Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY

Neurodevelopmental disorders (NDs), such as Autism Spectrum Disorder (ASD), are complex disorders characterized by heterogeneity at multiple levels of analysis (e.g. genetics, neural systems, cognition, behavior and development) throughout disorder trajectories. Clinical event sequences within Electronic Health Records (EHRs) can act as fingerprints for disease progression. By leveraging EHRs, we implemented an end-to-end deep learning architecture that couples Convolutional Neural Networks and Autoencoders (CNN-AE) to generate unsupervised encoded representations of patients in a deep feature space. The extracted features are then grouped by hierarchical clustering to identify possible subgroups within the same condition. Our aim is to decompose heterogeneity, facilitating progress towards precision medicine. Following an initial work on Multiple Myeloma (Landi et al., 2019), we considered EHR sequences of medications and diagnoses for patients with NDs from the Mount Sinai Health System's data warehouse. We selected 13,117 subjects with: ASD (N=733), Attention Deficit Hyperactivity Disorder (ADHD; N=6,141), Intellectual Disabilities (N=1,148) and Pervasive Developmental Disorder (N=201). For model validation purposes, we also included a group of randomly selected patients (OTH; N=4,894). To prove the effectiveness of our model representations, we investigated whether the CNN-AE encodings performed better than the embedding of the Term Frequency - Inverse Document Frequency (TFIDF) matrix at discriminating the ND classes from OTH. Our model outperformed (Entropy=0.89; Purity=0.66) the TFIDF approach (Entropy=0.91; Purity=0.63). For disease stratification, we selected the number of clusters that maximizes the Silhouette Index (SI). With the CNN-AE model (SI=0.03), we obtained subgroups of patients with ADHD and with ASD that differ with respect to comorbidities and medications. In comparison, the TFIDF model succeeded in identifying ADHD subgroups, but failed to detect ASD subgroups (SI=0.22).

Landi, I., Miotto, R., Lee, H., Danieletto, M., Laganà, A., Furlanello, C., Dudley, J. T., Medical sequence encoding for the stratification of complex disorders. AMIA Informatics Summit 2019.
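
A minimal sketch of the cluster-selection step described above, choosing the number of clusters that maximizes the Silhouette Index over hierarchical clusterings of encoded patient representations; the encodings are synthetic placeholders, not CNN-AE outputs:

    # Hierarchical clustering with Silhouette-based selection of the cluster number.
    import numpy as np
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.metrics import silhouette_score

    rng = np.random.default_rng(0)
    encodings = rng.normal(size=(500, 64))          # patients x deep-feature dimensions (synthetic)

    best_k, best_si = None, -1.0
    for k in range(2, 11):
        labels = AgglomerativeClustering(n_clusters=k).fit_predict(encodings)
        si = silhouette_score(encodings, labels)
        if si > best_si:
            best_k, best_si = k, si
    print("selected clusters:", best_k, "silhouette:", round(best_si, 3))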

Integrating Deep Learning and Radiomics for predictive models from CT/PET imaging
Damiana Salvalai1, Andrea Bizzego2, Alessandro Fracchetti3, Giuseppe Jurman4, Francesco Ricci1, Cesare Furlanello4

1Faculty of Computer Science, Free University of Bozen-Bolzano, Bolzano, Italy
2Department of Psychology and Cognitive Science, University of Trento, Trento, Italy
3Department of Medical Physics, Ospedale Centrale di Bolzano, Bolzano, Italy
4Predictive Models for Biomedicine and Environment, Fondazione Bruno Kessler, Trento, Italy


We explore an AI framework for cancer prognosis that combines deep learning and radiomics features. In the deep learning approach, the network model is optimized during a training phase, encoding the input image space through a cascade of transformations. One of the top layers (typically before the final classification step) can then be used as a projection space where intrinsic patterns in the image can be best separated in terms of deep features. This automatic construction of features from data is dual to the process of pre-defining a set of transformations extracting image features thought to be the best discriminants for the task, e.g. shape of the tumor mass, texture and image intensity statistics (radiomics features). Deep features are expected to be more effective, but they sacrifice interpretability, while by design radiomics features allow a direct interpretation of image characteristics in terms of clinically relevant patterns. Here we aim at merging the two approaches (i.e. comparison and combination of the two types of features) into a unified classification pipeline. Models are trained and evaluated with a data analysis plan originally developed for reproducibility in the MicroArray Quality Control (MAQC-II) project. For interpretability, we apply Uniform Manifold Approximation and Projection (UMAP) to compare the different feature options. We applied the integrative DL/radiomics approach in the multimodal context of CT/PET images, where Positron Emission Tomography (PET) and Computed Tomography (CT) features are both considered, for prognosis of locoregional recurrence of head and neck cancer (N=298). Deep features are extracted from a multi-modal Convolutional Neural Network pre-trained for diagnostic classification of T-stage. The integration of deep and radiomics features improves over previous results (sensitivity 0.67 vs 0.56; specificity 0.99 vs 0.67; accuracy 0.96 vs 0.65). The source code is publicly released.
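
A minimal sketch of the feature-fusion idea, concatenating standardized deep and radiomics feature matrices before a single classifier; the matrices, sizes and classifier choice are illustrative assumptions, not the released pipeline:

    # Concatenate deep and radiomics features, then train one classifier.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    deep_features = rng.normal(size=(298, 256))        # from the pre-trained CNN (synthetic)
    radiomics_features = rng.normal(size=(298, 100))   # shape/texture/intensity statistics (synthetic)
    y = rng.integers(0, 2, size=298)                   # locoregional recurrence label (synthetic)

    fused = np.hstack([StandardScaler().fit_transform(deep_features),
                       StandardScaler().fit_transform(radiomics_features)])
    clf = RandomForestClassifier(n_estimators=500, random_state=0)
    print("CV accuracy:", round(cross_val_score(clf, fused, y, cv=5).mean(), 3))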

ML4Tox - A Machine Learning Framework for Predictive Toxicology
Marco Chierici1, Marco Giulini1, Nicole Bussola1,2, Giuseppe Jurman1, and Cesare Furlanello1

1Fondazione Bruno Kessler, Trento, Italy
2Centre for Integrative Biology, University of Trento, Italy


Environmental exposure to chemical compounds poses high risks for human health, with potential impact on the endocrine system causing adverse immune, neurological and developmental effects. The large-scale Collaborative Estrogen Receptor Activity Prediction Project (CERAPP) explored the use of machine learning to evaluate the binding interactions of environmental chemicals to the ligand-binding domain of the human estrogen receptor (ER) from high-throughput screening data. Further, the CERAPP project includes three activity prediction tasks, namely agonist, antagonist, and binding. Given the costs of experimental testing of chemicals, this class of prediction tasks aims at prioritizing compounds, with a preference for higher-sensitivity models to be applied for risk prevention. We applied ML4Tox, a predictive toxicology computational framework for modeling the potential endocrine disruption of environmental chemicals. Machine learning models (Deep Learning and Support Vector Machine) were trained to predict chemical compound agonist, antagonist, and binding activity for the human ER ligand-binding domain on the CERAPP ToxCast training set of 1677 chemicals, described by 777 molecular features computed by Mold2. In order to control for selection bias and other overfitting effects, the models were developed and evaluated in terms of Balanced Accuracy and Matthews Correlation Coefficient with a 10 × 5-fold cross-validation schema, based on a Data Analysis Protocol previously designed for the MAQC/SEQC initiatives. On the CERAPP “All Literature” evaluation set (agonist: 6319 compounds; antagonist: 6539; binding: 7283), ML4Tox significantly improved sensitivity (Se) over [Mansouri et al, 2016] on all three tasks, with agonist: Se=0.78 vs 0.56; antagonist: 0.69 vs 0.11; binding: 0.66 vs 0.26.
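
A minimal sketch of the evaluation schema described above, 10 iterations of 5-fold cross-validation reporting Balanced Accuracy and MCC for an SVM; the descriptor matrix and labels are synthetic placeholders for the Mold2 features:

    # 10x5-fold cross-validation with Balanced Accuracy and MCC.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
    from sklearn.metrics import make_scorer, matthews_corrcoef

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1677, 777))                 # compounds x Mold2-like descriptors (synthetic)
    y = rng.integers(0, 2, size=1677)                # active vs inactive (synthetic)

    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
    scores = cross_validate(SVC(class_weight="balanced"), X, y, cv=cv,
                            scoring={"bal_acc": "balanced_accuracy",
                                     "mcc": make_scorer(matthews_corrcoef)})
    print("Balanced accuracy:", round(scores["test_bal_acc"].mean(), 3))
    print("MCC             :", round(scores["test_mcc"].mean(), 3))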

The AIN: ATAC-seq integrity number for assessing the pre-sequencing quality of libraries
Gosia Golda1,2,3, Lara Bossini-Castillo1, Natalia Kunowska1, Dafni Glinos1, Pawel Labaj3,4,5, Gosia Trynka1

1Wellcome Sanger Institute, Cambridge, United Kingdom
2Faculty of Biochemistry, Biophysics, and Biotechnology, Jagiellonian University, Krakow, Poland
3Malopolska Centre of Biotechnology, Jagiellonian University, Krakow, Poland
4Austrian Academy of Sciences, Vienna, Austria
5Chair of Bioinformatics RG, Boku University, Vienna, Austria


ATAC-seq is an increasingly popular assay for defining gene regulatory elements marked as chromatin accessible regions. Dependable results rely on the overall quality of the ATAC-seq library; however, clear-cut evidence of experiment quality can currently be obtained only from the sequencing data. A reliable and comprehensive standard for estimating sample quality prior to sequencing could help make better use of resources and guide sequencing study design. Common practice for assessing ATAC-seq library quality is to analyse Agilent 2100 Bioanalyzer electropherograms. However, this approach can be inconsistent, since the analysis of electrophoretic traces requires expert knowledge and introduces human-specific bias, as known from other areas. The goal of our study is to establish an automated and reliable procedure for standardizing ATAC-seq quality control. We gathered a large collection of sequencing data and electrophoretic ATAC-seq measurements recorded with an Agilent 2100 Bioanalyzer for a variety of cell types and ATAC-seq protocols. Our initial logistic regression model showed better accuracy (94.9%) than the human eye (56.5%) and applicability across different cell types and protocols. We are now developing an artificial neural network model, the ATAC-seq Integrity Number (AIN) algorithm, for more accurate estimates. Further approaches are being developed in collaboration with the EpiQC working group of the SEQC consortium.
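
A minimal sketch of a logistic-regression baseline of the kind described above, with illustrative (assumed) summary features computed from an electropherogram trace; this is not the AIN model:

    # Logistic regression on simple summaries of an electropherogram trace.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def trace_features(trace):
        # trace: 1-D fluorescence signal over fragment size; summarize its shape.
        trace = np.asarray(trace, dtype=float)
        norm = trace / (trace.sum() + 1e-9)
        return [norm.max(), norm.argmax() / len(norm), (norm > 0.5 * norm.max()).mean()]

    rng = np.random.default_rng(0)
    traces = rng.random(size=(200, 500))             # 200 libraries (synthetic traces)
    X = np.array([trace_features(t) for t in traces])
    y = rng.integers(0, 2, size=200)                 # good vs poor library (synthetic)

    model = LogisticRegression(max_iter=1000)
    print("CV accuracy:", round(cross_val_score(model, X, y, cv=5).mean(), 3))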

Intrinsic Capacity Index in Older Adults living with HIV: data analytics from a prospective clinical trial to standardise a healthy aging tool
Giovanni Guaraldi1, Federica Mandreoli2, Riccardo Martoglia2

1Modena HIV Metabolic Clinic - University of Modena and Reggio Emilia, Modena, Italy
2FIM - University of Modena and Reggio Emilia, Modena, Italy


The World Health Organization (WHO) is searching for healthy aging tools to support health-care professionals and self-management in the routine care of older people. The healthy aging construct is based on intrinsic capacity, defined as the composite of all the physical and mental capacities of an individual, divided into 5 domains: locomotion, cognition, psychological, vitality and sensory (ref. https://www.who.int/ncds/prevention/be-healthy-be-mobile/en/). Digital innovations including smartphone applications and wearables can support data generation from these 5 domains, whereas data analytics, deep learning and machine learning can be exploited to build an intrinsic capacity index.

MySmart Age with HIV (MySAwH) is a multi-center prospective ongoing study designed to empower older adults (aged >50 years) living with HIV (OALWH) in Italy, Australia and Hong Kong to achieve healthy aging. The study involves around 280 patients monitored for 24 months. The collected data are highly heterogeneous in kind, origin and acquisition rate. Comprehensive geriatric assessment and HIV variables are collected by healthcare workers during study visits at 0, 9 and 18 months. Step counts, sleep hours and calories are collected daily by a wearable device (Vivofit 2, Garmin), and patient-reported outcomes, including a large set of questions exploring functional abilities and quality of life, are recorded three times a week through a dedicated smartphone app (MySAwHApp). Data are integrated into an Internet of Medical Things (IoMT) framework that describes the 5 domains of intrinsic capacity.

The objective of this study is to follow a data-driven approach to select variables and standardise an Intrinsic Capacity Index (ICI) in relation to age and relevant health outcomes including frailty and HIV status. We will investigate to what extent deep learning and machine learning techniques can be exploited to predict adverse health outcomes and patient’s status from the selected ICI variables.

Metaproteomics in ASD Families: A New Approach for Gut Microbiota Profiling in Autism Spectrum Disorders
Ilaria Basadonnea, Alessandro Zandonàb, Stefano Levi Morterac, Pamela Vernocchic, Marco Chiericid*, Andrea Quagliarielloc, Giuseppe Jurmand, Cesare Furlanellod, Lorenza Putignanic, Paola Venutia

a Department of Psychology and Cognitive Sciences, University of Trento, Rovereto, Italy
b Department of Information Engineering, University of Padua, Padua, Italy
c Human Microbiome Unit, Bambino Gesù Children’s Hospital, Rome, Italy
d Bruno Kessler Foundation, Trento, Italy.


A possible role for gut microbiota in Autism Spectrum Disorders (ASD) has been addressed using metaproteomics, machine learning and network analysis. Fecal samples (N=36) were collected from 10 ASD children (age 6.74 ± 2.08 years), their biological parents and a non-ASD sibling, to evaluate possible effects of shared environment and genetics, and from 10 typically developing children matched for age and gender with the ASD children. LC-MS/MS shotgun proteomics experiments on isolated bacterial proteins were conducted on a TripleTOF 5600+ instrument. Raw data were processed through the ProteinPilot 4.0 software and database searches were carried out against the NCBInr database. Protein sequences in FASTA format were uploaded to the open-source web service WebMGA for taxonomic and functional assessment in terms of Operational Taxonomic Units (OTUs) and Clusters of Orthologous Groups (COGs). Random Forest (RF) classifiers were trained on the OTU and COG datasets, and feature importance was calculated as the mean decrease in the Gini impurity. RF models were developed within a Data Analysis Protocol derived from the MAQC-II project, with the Matthews Correlation Coefficient (MCC) assessed by a 10-times iterated 5-fold Cross Validation. A list of top-ranked discriminant features was derived as a Borda-aggregated list from the 50 ranked lists. In discriminating ASD and typically developing children, the best performance for OTUs (MCC=0.92) was achieved with the Enterococcus and Oscillibacter genera, whereas for COGs (MCC=0.96) it was achieved with the classes Function unknown, General function prediction only, Signal transduction mechanism and Membrane biogenesis. Co-abundance undirected weighted networks were built using OTUs or COGs as nodes for each cohort (ASD or typical) using the absolute Pearson Correlation Coefficient, and compared using the glocal HIM distance, with denser networks found for ASD vs ASD-family members (HIM=0.28). Acknowledgments: IB is supported by the TRAIN (TRentino Autism INitiative).
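
A minimal sketch of the Borda aggregation step described above, combining feature rankings from repeated cross-validation into a single list; apart from the two genera mentioned in the text, the feature names are hypothetical:

    # Borda aggregation of ranked feature lists into one consensus ranking.
    def borda_aggregate(ranked_lists):
        # Each ranked list orders the same features from best to worst; a feature
        # receives (n_features - position) points per list, summed over lists.
        features = ranked_lists[0]
        n = len(features)
        points = {f: 0 for f in features}
        for ranking in ranked_lists:
            for position, feature in enumerate(ranking):
                points[feature] += n - position
        return sorted(points, key=points.get, reverse=True)

    lists = [
        ["Enterococcus", "Oscillibacter", "Bacteroides", "Prevotella"],
        ["Oscillibacter", "Enterococcus", "Prevotella", "Bacteroides"],
        ["Enterococcus", "Bacteroides", "Oscillibacter", "Prevotella"],
    ]
    print(borda_aggregate(lists))   # Enterococcus ranks first in this toy example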

Evaluating reproducibility of Deep Learning models on Digital Pathology images
Nicole Bussola1, Andrea Bizzego2, Marco Chierici1, Valerio Maggio1, Margherita Francescatto1, Luca Cima3, Marco Cristoforetti1, Giuseppe Jurman1, Cesare Furlanello1

1Fondazione Bruno Kessler, Trento, Italy
2Department of Psychology and Cognitive Science, University of Trento, Trento, Italy
3Pathology Unit, Santa Chiara Hospital, Trento, Italy


As artificial intelligence rapidly increases its impact on healthcare, a steady research effort is directed at enhancing the interpretability and reproducibility of predictive models, moving from black-box models to methods that provide new insights, e.g. on clinico-pathological features for medical images. Here, we apply the DAPPER framework to evaluate deep learning models and other machine learning classifiers applied to digital pathology (H&E), considering a Data Analysis Plan originally developed in the FDA's MAQC project and designed to analyze causes of variability in predictive biomarkers. The models were trained to identify the tissue/organ of origin of H&E-stained Whole Slide Images (WSI) publicly available at the Genotype-Tissue Expression (GTEx) portal (N=787). From each WSI, we extracted up to 100 informative patches of size 512x512, creating a benchmark dataset (HINT: Histological Imaging - Newsy Tiles) composed of 53,000 histological tiles, with 80% of tiles used for (stratified) training and 20% for validation. The models were trained with four HINT sub-datasets of 5, 10, 20 or 30 classes. Further, we validated the pipeline on the KIMIA Path24 dataset for identification of the slide of origin (24 classes). Moreover, deep model accuracy was compared with an expert pathologist on a test set, both at tile level (Nt=2000) and WSI level (Ni=200): DAPPER outperforms the pathologist at both levels, with an improvement of more than 20% at tile level. Further, we analyzed the accuracy and feature stability of three different deep learning architectures (VGG, ResNet and Inception) used as feature extractors, combined with different classifiers (a fully connected multilayer network, Support Vector Machine and Random Forest), also introducing diagnostic tests (e.g., random labels) to cope with potential selection bias. The DAPPER software and the HINT dataset are publicly released as a resource for standardization and validation initiatives in AI for digital pathology.
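
A minimal sketch of the random-label diagnostic mentioned above: a classifier cross-validated on permuted labels should score near chance, flagging information leakage otherwise. Features and labels are synthetic placeholders for deep tile features:

    # Random-label diagnostic: compare cross-validated MCC on true vs permuted labels.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.metrics import make_scorer, matthews_corrcoef

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 128))                         # tiles x deep features (synthetic)
    y = np.digitize(X[:, 0], bins=[-0.8, 0.0, 0.8])          # 4 synthetic tissue classes

    mcc = make_scorer(matthews_corrcoef)
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    true_score = cross_val_score(clf, X, y, cv=5, scoring=mcc).mean()
    perm_score = cross_val_score(clf, X, rng.permutation(y), cv=5, scoring=mcc).mean()
    print("true labels MCC:", round(true_score, 3))
    print("random labels MCC (should be ~0):", round(perm_score, 3))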

Reproducible Integration of Real-World Sequencing Data to Benchmark Germline Variant Calls for Proficiency Test
Yuanting Zheng, Luyao Ren, Wanwan Hou, Jingcheng Yang, Yechao Huang, Ding Bao, Ying Yu, Leming Shi

Center for Pharmacogenomics, School of Life Sciences, Fudan University, Shanghai 200433, China.

Whole-genome sequencing (WGS) has emerged as a promising diagnostic tool in a wide range of applications. However, the lack of ground truths at the whole-genome scale and of appropriate thresholds for QC metrics hampers proficiency testing. Several human genome reference DNA materials have been released and have demonstrated their value in evaluating the performance of sequencing facilities. However, it is usually difficult to judge whether the performance of an individual sequencing site is good enough without knowing the acceptable thresholds of the performance metrics.

We developed four whole-genome DNA reference materials from four immortalized cell lines of a “Chinese Quartet” family, including the father, the mother, and two monozygotic twin daughters. WGS data were generated at different sites using different platforms, including Illumina X Ten, Illumina NovaSeq, BGI-500, and BGI-2000, resulting in dozens of WGS datasets per reference material. All the WGS data were analyzed in parallel by seven sites using their respective in-house pipelines for germline variant (SNV and Indel) calling. The WGS reference dataset was developed using a reproducible integration approach. Briefly, high-confidence SNVs and Indels were defined as those supported by >50% of technical replicates, >50% of sequencing sites, >50% of variant-calling pipelines, and >50% of platforms. Moreover, the “genetic ground truths” from the monozygotic twins made the benchmark calls more reliable by minimizing false positive variant calls. All the VCF files contributing to the benchmark integration were used to retrospectively calculate the distributions of performance metrics such as precision, recall, F1-score, and accuracy, yielding performance-metric thresholds that can be used for cross-laboratory proficiency comparison. The reference materials and reference datasets from the “Chinese Quartet” are being used in a national proficiency test, so that more real-world sequencing datasets can be collected and the performance of participating sites can be objectively assessed and improved for more reliable disease diagnosis.
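
A minimal sketch of the reproducible-integration rule described above, retaining a variant only if it is supported by more than half of the technical replicates, sites, pipelines and platforms; call sets are simplified to tuples rather than full VCF records:

    # Majority-support integration of variant call sets across replicates/sites/pipelines/platforms.
    from collections import defaultdict

    def integrate_calls(callsets):
        """callsets: list of dicts with keys 'replicate', 'site', 'pipeline',
        'platform' and 'variants' (a set of (chrom, pos, ref, alt) tuples)."""
        support = defaultdict(lambda: defaultdict(set))
        totals = {k: set() for k in ("replicate", "site", "pipeline", "platform")}
        for cs in callsets:
            for key in totals:
                totals[key].add(cs[key])
            for v in cs["variants"]:
                for key in totals:
                    support[v][key].add(cs[key])
        # Keep a variant only if each support dimension exceeds 50% of its total.
        return {
            v for v, sup in support.items()
            if all(len(sup[key]) > 0.5 * len(totals[key]) for key in totals)
        }

    calls = [
        {"replicate": "r1", "site": "s1", "pipeline": "p1", "platform": "IlluminaXTen",
         "variants": {("chr1", 12345, "A", "G")}},
        {"replicate": "r2", "site": "s2", "pipeline": "p2", "platform": "BGI-500",
         "variants": {("chr1", 12345, "A", "G"), ("chr2", 500, "C", "T")}},
    ]
    print(integrate_calls(calls))   # only the variant supported by both call sets survives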

Benchmarking NGS of HLA Genes – Current Status and Challenges
Chia-Jung Chang1, Marcelo Fernandez-Viña2, Wenzhong Xiao1,3

1 Stanford Genome Technology Center, Stanford Medical School
2 Stanford Blood Center, Stanford Medical School
3 Massachusetts General Hospital, Harvard Medical School


The Human Leucocyte Antigen (HLA) genes encode cell-surface proteins that are key elements of the immune system, and they are the primary determinants of histocompatibility in transplantation. HLA polymorphisms are significantly associated with many human diseases. To date, >50 million clinical tests have been performed by the 200 accredited clinical histocompatibility laboratories in the US and the >800 labs around the world. To advance the fields of histocompatibility and immunogenetics through the application of NGS technologies, we evaluated the benchmark studies of the 17th International HLA and Immunogenetics Workshop (IHIW), including NGS of 11 HLA genes of 50 cell lines by 34 participating IHIW laboratories, and 2500 trios as a real-world application. The results show that allele-level typing using NGS is ready for clinical application, while consensus sequences from current NGS protocols are still problematic, because of challenges in amplification, sequencing errors in regions of homopolymers and short tandem repeats, and the phasing of short reads. Amplification of longer regions of the HLA genes will help resolve allele ambiguities, and long-read sequencing is potentially necessary to resolve phasing ambiguities.

Grouping Classification Method based on Ensemble Clustering
Loai Abdallah1, and Malik Yousef2

1Department of Information Systems, The Max Stern Yezreel Valley Academic College, Israel. e-mail: Loai1984@gmail.com
2Department of Community Information Systems, Zefat Academic College, Zefat, Israel. e-mail: malik.yousef@gmail.com


The performance of many supervised and unsupervised machine learning algorithms depends very much on the distance metric used to determine the similarity between data points. A suitable distance metric can significantly improve classification performance and the clustering process. Distance metrics over a given data space should reflect the actual similarity between objects. One obvious weakness of the Euclidean distance is in dealing with data represented by a large number of attributes, where the Euclidean distance does not capture the actual relationship between points; objects belonging to the same cluster usually share some common traits even though their Euclidean distance might be relatively large. In this study, we propose a new classification method named GrbClassifierEC that replaces the given data space with a categorical space based on ensemble clustering (EC). The similarity between objects is defined by the number of times the objects are assigned to different clusters. The EC space is defined by tracking the membership of the points over multiple runs of clustering algorithms. Points that fall into the same clusters across runs are represented as a single point, and our algorithm assigns all of them to the same class. To evaluate the suggested method, we compare its results to the k-nearest neighbors, Decision Tree and Random Forest classification algorithms on several benchmark datasets from miRBase. The data consist of microRNA precursor sequences, where each sequence is composed of the 4 nucleotide letters {A, U, C, G} and is about 70 nucleotides long. The results confirm that the suggested new algorithm GrbClassifierEC outperforms the other algorithms.
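
A minimal sketch of the ensemble-clustering (EC) representation described above, re-describing each object by its cluster memberships over repeated k-means runs and then classifying in that categorical space with a Hamming-distance k-NN; all settings and data are illustrative assumptions, not the authors' GrbClassifierEC implementation:

    # Build an ensemble-clustering (EC) representation and classify in that space.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import cross_val_score

    def ensemble_clustering_space(X, n_runs=10, k_range=(2, 10), seed=0):
        rng = np.random.default_rng(seed)
        memberships = []
        for _ in range(n_runs):
            k = int(rng.integers(k_range[0], k_range[1] + 1))
            labels = KMeans(n_clusters=k, n_init=10,
                            random_state=int(rng.integers(1_000_000))).fit_predict(X)
            memberships.append(labels)
        # Each column records which cluster a point landed in for one clustering run.
        return np.column_stack(memberships)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 70))           # e.g. encoded precursor sequences (synthetic)
    y = rng.integers(0, 2, size=300)         # class labels (synthetic)

    EC = ensemble_clustering_space(X)
    # Hamming distance counts the fraction of runs in which two points fall in different clusters.
    knn = KNeighborsClassifier(n_neighbors=5, metric="hamming")
    print("CV accuracy in EC space:", round(cross_val_score(knn, EC, y, cv=5).mean(), 3))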