Assembling Over 1,000 Human Genomes Affordably: New Method Powers Medicine's Future

Publish date: 03 Apr 2026

Stay updated on the job market

Popular Articles

7種最難有拖拍職業

平等只係口號？職場歧視無處不在

打工仔6種最常見職業病

【網路廣告術語知多D】2026數位廣告趨勢最新廣告KPI大洗牌

女性不宜返夜班？研究指工作表現遜男性

HANGZHOU, China, April 3, 2026 /PRNewswire/ -- A research team led by Zhen-Xing Endowed Professor Jian Yang at the School of Life Sciences, Westlake University, together with collaborators, published their latest findings in Nature on April 1. The study innovatively developed a pangenome-informed genome assembly (PIGA) method. By combining a cost-effective hybrid sequencing strategy of long and short reads, the team successfully constructed a pangenome for over a thousand individuals. This achievement breaks through the limitations of previous small-sample pangenomes and provides a critical foundational infrastructure for medical and population genetics research.

Since the completion of the Human Genome Project, single linear reference genomes (such as GRCh38) have served as the foundation for biomedical research. However, the genetic backgrounds of human individuals vary significantly, and a single reference genome cannot capture the full extent of genetic diversity across populations. This leads to complex forms of genetic variations, such as structural variants (SVs) and tandem repeats (TRs), being overlooked in traditional analyses. To address this challenge, researchers proposed the concept of a pangenome—a collection of genome sequences representing the genetic diversity of a population.

While advancements in long-read sequencing have enabled the assembly of high-quality diploid genomes, the high costs of sequencing have limited the sample sizes of previous pangenomes to only a few dozen individuals. Such small sample sizes are insufficient to accurately estimate the frequency of genetic variants in populations or to resolve low-frequency variants and high-complexity regions. Therefore, developing a cost-effective pangenome construction strategy for large-scale populations has become an urgent requirement for resolving the functional impact of complex variants and enhancing clinical diagnostics.

Yang's team has long been dedicated to methodological research in statistical genetics, genomics, and the big data analysis of human complex traits. By developing efficient computational methods, the team has consistently tackled core challenges in processing large-scale genomic data. Analysis tools developed by the team, such as GCTA-GREML, SMR, and gsMap, have been widely adopted globally. To address the challenge in constructing large-scale pangenomes, the research team developed the pangenome-informed genome assembly (PIGA) workflow (Fig. 1). Unlike de novo assembly approaches, which rely on sequencing data from individual samples, PIGA adopts a pangenome-guided framework to integrate sequence information across the entire cohort. It fully leverages a cost-effective hybrid sequencing strategy based on modest-coverage Illumina short-read and PacBio long-read whole-genome sequencing (WGS) data. This approach substantially reduces sequencing costs while enabling the assembly of genomes from modest-coverage data, thereby providing a practical new technical pathway for future population-scale hybrid sequencing studies.

Applying this method, the research team constructed the world's largest human pangenome to date, comprising 1,116 diploid genomes with a mean quality value (QV) of 46. The pangenome identified 405.3 million base pairs (Mb) of non-reference sequences absent from current references (GRCh38 and CHM13). Notably, the team annotated 26.2 Mb of these sequences as functional genic and predicted regulatory elements, greatly expanding our understanding of the non-reference sequences in the human genome.

Fig. 1. The pangenome-informed genome assembly (PIGA) workflow.

Leveraging the large-scale assembly dataset, the researchers compiled a comprehensive catalog of genetic variation. In addition to 35.4 million small variants, the catalog captured a wide range of complex variants, including 110,530 SVs, 485,575 TRs, and 0.86 million nested variants embedded within non-reference sequences.

Using this catalog, the team characterized medically relevant variations at multiple scales (Fig. 2), including gene-altering SVs, pathogenic TR expansions, gene cluster variations, and HLA gene haplotypes. These findings indicate that the 1KCP variant catalog provides an important reference for the clinical screening of pathogenic mutations.

By integrating gene expression data, the team conducted pan-variant expression quantitative trait loci (eQTL) mapping. They identified 3,256 eQTLs involving complex variants (SVs, TRs, and nested variants), elucidating the regulatory complexity of these diverse variant types.

Together, this study significantly advances our understanding of complex genetic variants and their functional implications, establishing a new paradigm for human health research and pangenome studies in other species.

Ph.D. student Yifei Wang and Research Assistant Professor Zhongqu Duan are the co-first authors of the study. Professor Jian Yang is the last author. This work was supported by the National Natural Science Foundation of China, the National Key R&D Program, the Zhejiang "Pioneer & Leading Goose" Program, and the New Cornerstone Science Foundation. Computational resources were provided by the High-Performance Computing Center at Westlake University.

Professor Jian Yang's research group is dedicated to developing statistical genetics and bioinformatics methods. By deeply analyzing genomic and multi-omic data from large-scale population cohorts, they aim to uncover the genetic architecture and molecular mechanisms underlying complex diseases, translating these discoveries into novel strategies for disease diagnosis, drug target discovery, and precision medicine.

Related links:
Paper link: https://www.nature.com/articles/s41586-026-10315-y
Jian Yang lab website: https://yanglab.westlake.edu.cn/

Media contact:
Chi Zhang
media@westlake.edu.cn
+86-15659837873