1 Table of Contents

2 Abstract

Copy-number variants (CNVs) are a form of genetic structural variation with increasing importance in complex human disorders. Both DNA sequencing and microarray data can be used to call CNVs, which can then be used in genetic association tests. Unlike genotype calling, CNV detection from microarrays relies on the observed intensity signal at each probe, which limits imputability in analyses that span multiple array types. Thus far, a consensus set of probes (those present on all arrays) has been used to circumvent the problem of differing array-specific sensitivities. This approach substantially reduces overall sensitivity, because probe overlap between arrays can be low. To overcome this limitation, we developed MarkerMatch, a proximity-based algorithm that matches probes across different genotyping microarrays to maximize the number of probes considered by the CNV calling algorithm, thereby increasing resolution and sensitivity while preserving precision.

By analyzing CNV calls from 4,906 individuals genotyped across three different arrays, we show that the MarkerMatch approach improves sensitivity by increasing the density of probes available for CNV calling, while maintaining or improving precision relative to the current practice (i.e., use of consensus probes only). We further demonstrate that MarkerMatch outperforms the current practice in terms of F1 score and PPV. We also optimize the MarkerMatch parameters DMAX and Method, and find an optimal DMAX setting of 10 kb but no clearly optimal Method, indicating that the Method parameter should be chosen on a case-by-case basis.
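For readers less familiar with the evaluation metrics named above, the following minimal Python sketch shows the standard definitions of PPV (precision), sensitivity, and F1 score computed from counts of true-positive, false-positive, and false-negative calls. The function name `cnv_call_metrics` and the example counts are hypothetical and are not results from this study; how a call is judged a true positive against the truth set is not specified here and is left outside the sketch.

```python
def cnv_call_metrics(tp, fp, fn):
    """Standard PPV, sensitivity and F1 from call counts (illustrative helper)."""
    # PPV (precision): fraction of reported calls that are in the truth set.
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    # Sensitivity (recall): fraction of truth-set calls that were recovered.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of PPV and sensitivity.
    denom = ppv + sensitivity
    f1 = 2 * ppv * sensitivity / denom if denom else 0.0
    return {"ppv": ppv, "sensitivity": sensitivity, "f1": f1}


# Made-up counts, for illustration only.
metrics = cnv_call_metrics(tp=80, fp=20, fn=20)
print(metrics)
```

Because F1 is a harmonic mean, it rewards a callset only when PPV and sensitivity are both high, which is why it is a useful single summary when comparing matched and consensus manifests.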

Figure 1. Panel A. Diagram of the MarkerMatch algorithm. MarkerMatch proceeds step by step to identify the best match for each probe across the two selected manifests while ensuring that no probe is matched more than once. In the first step (exact matching), MarkerMatch takes the intersection of probes from the two manifests and keeps them in the output. In the second step (Method matching), MarkerMatch takes a probe from the reference manifest and compares it with all remaining probes in the matching manifest (those not used in the first step) that lie within the specified DMAX distance. The probe from the matching manifest with the smallest difference in the selected Method relative to the reference probe is retained. Once a probe from the matching manifest is paired with a reference probe, it is removed from the matching manifest so that it cannot be matched again. This process continues until all reference probes have been considered.

Panel B. Graphical representation of the experimental setup. Blue boxes represent unprocessed array data, with red borders indicating reference manifests and green borders indicating matching manifests. Yellow boxes represent the MarkerMatch algorithm for WAE (for all Methods and 10 bp < DMAX < 5 Mb) and CAE (for all Methods and DMAX = 10 kb). Green boxes represent output manifests of the MarkerMatch algorithm (the -MAT suffix indicates output manifests from MarkerMatch). Red boxes represent exact match manifests as currently used in CNV association analyses (intersections, also called consensus manifests, indicated by the -EM suffix). In WAE, we compared matched OMNI callsets to full OMNI as a truth set (also called the Full Set). In CAE, we compared matched GSA1 callsets to full OEE as a truth set, and matched OEE callsets to full GSA1 as a truth set. These processes were repeated for each combination of Method and DMAX. OMNI: Omni2.5 array; GSA1: Global Screening Array; OEE: Omni Express Exome array.
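The two-step matching procedure described for Panel A can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the MarkerMatch package's implementation (the package itself is an R package): the `Probe` class, the `marker_match` function, the use of probe names for exact matching, the restriction of candidates to the same chromosome, and the tie-breaking rule are all hypothetical choices made for the example. The `metric` field stands in for whichever per-probe quantity the selected Method compares.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    name: str
    chrom: str
    pos: int       # base-pair position
    metric: float  # per-probe value compared by the chosen Method (hypothetical field)


def marker_match(reference, matching, dmax):
    """Return (reference_name, matching_name) pairs; each matching probe is used once."""
    pairs = []
    pool = {p.name: p for p in matching}  # matching probes still available

    # Step 1 (exact matching): keep probes present in both manifests.
    # Assumption: "present in both" is decided by probe name.
    unmatched = []
    for ref in reference:
        if ref.name in pool:
            pairs.append((ref.name, ref.name))
            del pool[ref.name]
        else:
            unmatched.append(ref)

    # Step 2 (Method matching): for each remaining reference probe, pick the
    # available matching probe within dmax whose metric is closest.
    for ref in unmatched:
        candidates = [
            p for p in pool.values()
            if p.chrom == ref.chrom and abs(p.pos - ref.pos) <= dmax
        ]
        if not candidates:
            continue  # no probe within DMAX; this reference probe stays unmatched
        # Assumption: ties are broken by distance, then by name.
        best = min(
            candidates,
            key=lambda p: (abs(p.metric - ref.metric), abs(p.pos - ref.pos), p.name),
        )
        pairs.append((ref.name, best.name))
        del pool[best.name]  # remove so it cannot be matched again

    return pairs


# Toy manifests, invented for illustration.
reference = [
    Probe("rs1", "1", 1000, 0.50),
    Probe("rs2", "1", 5000, 0.20),
    Probe("rs3", "1", 9000, 0.80),
    Probe("rs4", "2", 1000, 0.30),
]
matching = [
    Probe("rs1", "1", 1000, 0.48),
    Probe("m_a", "1", 5200, 0.60),
    Probe("m_b", "1", 6500, 0.25),
    Probe("m_c", "1", 9100, 0.10),
    Probe("m_d", "2", 900000, 0.30),
]

pairs = marker_match(reference, matching, dmax=10_000)
print(pairs)  # [('rs1', 'rs1'), ('rs2', 'm_b'), ('rs3', 'm_a')]
```

In this toy run, `rs3` is paired with `m_a` rather than the physically closer `m_c`, because within DMAX the choice is driven by the metric difference and not by distance. `rs4` remains unmatched because its only same-chromosome candidate lies beyond DMAX. Note that in this sketch the result depends on the order in which reference probes are visited, since a matching probe claimed by an earlier reference probe is no longer available to later ones.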

3 Citation

4 Notebook

4.1 Environments

4.1.1 Local

Local analyses were run on a Windows machine with the following specifications:

Component	Value
Processor	12th Gen Intel(R) Core(TM) i9-12900KS 3.40 GHz
Installed RAM	128 GB (128 GB usable)
System type	64-bit operating system, x64-based processor
Edition	Windows 11 Pro
Version	23H2
OS build	22631.4037

4.1.2 Cloud

Cloud analyses were conducted on the University of Florida’s cluster computing service, HiPerGator, using the SLURM scheduler to schedule and run jobs. Resources included 4 TB of storage and 10 processors.

4.1.3 Software & Packages

MarkerMatch

MarkerMatch is a free R package for proximity-based probe matching for CNV calling with PennCNV.

PennCNV

PennCNV is a free software tool for Copy Number Variation (CNV) detection from SNP genotyping arrays.