The Sparkle database is a comprehensive, user-friendly, one-click interactive web resource for analyzing various cancer omics data. It includes genomics, transcriptomics, proteomics, epigenetics, single-cell sequencing, spatial transcriptomics, and more, providing easy access to publicly available cancer omics data. It enables users to identify biomarkers or perform in silico validation of potential target genes. These resources empower researchers to gather valuable information and data on genes or targets of interest.

Citation: article in preparation

Establish best bioinformatics practices to minimize unwanted batch effects while preserving biological signals

Graphical abstract

Dear users, with the widespread availability of the internet, researchers are no longer hindered by a lack of information, but rather overwhelmed by information redundancy. We often spend considerable time discerning what is truly valuable. As sequencing costs continue to drop, vast amounts of sequencing data are being generated and made publicly available at an unprecedented rate. Building databases has become a popular endeavor, and many bioinformaticians have made outstanding contributions. However, databases, although intended as tools, have themselves begun to suffer from information redundancy. While researchers benefit from diverse analytical methods that help solve biological problems, they also struggle with the overwhelming number of databases, each of which takes time to learn and master. More importantly, valuable scientific tools scattered across the web often remain undiscovered and are ultimately forgotten.

Many public databases are funded by research grants. Once funding ends, operators may discontinue maintenance, leading to database shutdowns. In recent years, numerous databases have been discontinued due to lack of resources, rendering data irretrievable and causing significant scientific waste. Driven by the aspiration for permanent database operation and the urgent need to preserve precious research data, our team established the Sparkle database. It integrates the world’s most comprehensive cancer omics data and advanced algorithms, enabling one-click data analysis and visualization while standardizing and simplifying bioinformatics workflows. After six months of data and code preparation, we launched free public testing on February 1, 2024. Starting April 1, 2024, we adopted a semi-open, semi-restricted operational model. Through this approach, we aim to sustain continuous updates, integrate the latest scientific advances, and explore a viable path toward sustainable research infrastructure. As of the submission date, our database has been accessed by over 20,000 researchers, with more than 1,000 registered users.

Here, we sincerely call upon all researchers: if you lose grant funding and find it difficult to maintain your database, please do not discontinue it. The Sparkle database is willing to host and preserve your work. As mentioned, we will never stop updating. Perhaps we cannot become Newtons, but we hope to be the apple that helps a future Newton discover the laws of biology. Sparkle aims to become the world’s largest bioinformatics encyclopedia, empowering thousands of researchers with knowledge and enabling them to advance further and faster.

Every great achievement begins with a courageous first step. The Sparkle database adopts a semi-restricted access policy so that it can operate permanently without depending on official grant funding. Like a rope sawing through wood or water droplets wearing through stone, the Sparkle database will, over time, collect and curate an ever-growing volume of invaluable data. In humanity’s fight against cancer, artificial intelligence shows remarkable promise. Unified, standardized big data will play a pivotal role in empowering AI-driven breakthroughs in life sciences.

TCGA Introduction

The Cancer Genome Atlas (TCGA) was launched in 2006 with a 3-year pilot project focusing on glioblastoma multiforme (GBM), lung squamous cell carcinoma (LUSC), and ovarian cancer (OV). This was followed by the full-scale project from 2009 to 2015. At the conclusion of this decade-long initiative, TCGA Network researchers had characterized the molecular landscapes of tumors from 11,160 patients across 33 cancer types and defined multiple molecular subtypes for each.

Unlike previous bioinformatics databases that used TPM or FPKM for data construction, the Sparkle database adopts the standardized, normalized, batch-corrected, and platform-corrected RNA matrix file generated by the PanCancer Atlas consortium to eliminate platform effects. This matrix (EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv) contains data from 11,069 samples.

TCGA strongly recommends using the published TCGA-CDR-SupplementalTableS1.xlsx file for clinical elements and survival outcomes to support high-quality analyses. For more details about this clinical dataset, please refer to "An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics". The consortium processed 33 initial enrollment data files and 97 follow-up data files from 11,160 patients across 33 cancer types. The Sparkle database obtained the curated survival data from the UCSC Xena database (dataset: phenotype - Curated clinical data) and intersected its sample IDs with those in the expression file above, ultimately obtaining 11,060 samples with both expression and clinical information (737 normal samples and 10,323 tumor samples) for downstream analyses. The actual number of samples used may vary between analyses: for example, outliers are removed using z-scores in differential expression analysis, and samples with zero or missing survival times or missing survival status are excluded in survival analysis.
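The sample selection described above can be illustrated with a minimal Python sketch. It is not the code used by the Sparkle database. The function names, the record keys ("time", "status"), and the z-score cutoff of 3 are all assumptions made for illustration, as is the use of the TCGA barcode sample-type code (characters 14-15; 01-09 for tumor, 10-19 for normal) to separate tumor from normal samples.

```python
import math
from statistics import mean, pstdev


def intersect_samples(expr_ids, clin_ids):
    """Keep expression sample IDs that also have clinical records (order preserved)."""
    clin = set(clin_ids)
    return [s for s in expr_ids if s in clin]


def sample_type(barcode):
    """Classify a TCGA barcode by its sample-type code (characters 14-15).

    Assumption: codes 01-09 are tumor and 10-19 are normal, following the
    TCGA barcode convention; anything else is reported as "other".
    """
    code = int(barcode[13:15])
    if 1 <= code <= 9:
        return "tumor"
    if 10 <= code <= 19:
        return "normal"
    return "other"


def drop_outliers(values, z_cut=3.0):
    """Remove samples whose expression z-score exceeds z_cut in absolute value.

    `values` maps sample ID -> expression of one gene. The cutoff of 3 is a
    hypothetical default; the source does not state the threshold it uses.
    """
    mu = mean(values.values())
    sd = pstdev(values.values())
    if sd == 0:
        return dict(values)  # no variation, so nothing can be an outlier
    return {s: v for s, v in values.items() if abs((v - mu) / sd) <= z_cut}


def usable_for_survival(record):
    """True if the record has a positive survival time and a known status."""
    t = record.get("time")
    status = record.get("status")
    if t is None or status is None:
        return False
    if isinstance(t, float) and math.isnan(t):
        return False
    return t > 0
```

In use, `intersect_samples` would be applied once to the expression and clinical sample IDs, while `drop_outliers` and `usable_for_survival` would be applied per analysis, which is why the number of samples can differ between modules.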

Declaration

Acknowledgements

We sincerely thank all scholars for their generous contribution of publicly available sequencing data, which forms the foundation of our database. Special thanks to Dr. Zaoqu Liu and his team for their invaluable support and assistance during the database development. Additionally, we are grateful to every user who supports us. The database will continue to be updated: as long as there are users, we will keep it alive.

Funding Information

In the initial phase, the Sparkle database was funded by Hefei GuangRe Biotechnology Co., Ltd. and Yuyao Liu personally, ensuring its operation for at least the next 10 years. The Sparkle database is a non-profit platform. To ensure its long-term sustainability, we rely on donations from users. All funds received will be used exclusively for database maintenance and further development. Moreover, the Sparkle database will support the development of more free databases, aiming to establish the world's first pan-disease database.

Responsibility

Yuyao Liu from Hefei GuangRe Biotechnology Co., Ltd. serves as the database administrator, responsible for the construction and operation of the database. Other authors participated only in the academic development of the database and are not involved in its operation, management, or subsequent maintenance.

Our Team

Yuyao Liu, Postgraduate
Bioinformatics R&D Department, Hefei GuangRe Biotechnology Co., Ltd.
Hefei, China
Email: ahmulyy@163.com
Research Interests: Skin Melanoma | R Package Development | R Web Scraping | Database Construction | LLM
Main Contributions: Database Development | Database Operation | Project Leader

Haoxue Zhang, MD
Anhui Medical University
Hefei, China
Email: 215672062@qq.com
Research Interests: Skin Melanoma | Bioinformatics
Main Contributions: Database Paper Writing | Project Leader

Competing Interests

The authors declare that no competing interests exist.

References

If you use different modules of our database, please cite the corresponding literature below:
Bulk RNA Transcriptomics:
Liu, Z., Liu, L., Weng, S. et al. 'BEST: a web application for comprehensive biomarker exploration on large-scale data in solid tumors.' Journal of Big Data 10, 165 (2023). https://doi.org/10.1186/s40537-023-00844-y.
Single-Cell Transcriptomics:
Han Y, Wang Y, Dong X, Sun D, Liu Z, Yue J, Wang H, Li T, Wang C. TISCH2: expanded datasets and new tools for single-cell transcriptome analyses of the tumor microenvironment. Nucleic Acids Res. 2023 Jan 6;51(D1):D1425-D1431. doi: 10.1093/nar/gkac959. PMID: 36321662; PMCID: PMC9825603.
Spatial Transcriptomics:
Shi J, Wei X, Xun Z, Ding X, Liu Y, Liu L, Ye Y. The Web-Based Portal SpatialTME Integrates Histological Images with Single-Cell and Spatial Transcriptomics to Explore the Tumor Microenvironment. Cancer Res. 2024 Apr 15;84(8):1210-1220. doi: 10.1158/0008-5472.CAN-23-2650. PMID: 38315776.
Xun, Z., Ding, X., Zhang, Y. et al. Reconstruction of the tumor spatial microenvironment along the malignant-boundary-nonmalignant axis. Nat Commun 14, 933 (2023).
Malignancy Characteristic Gene Scoring:
Yuan H, Yan M, Zhang G, Liu W, Deng C, Liao G, Xu L, Luo T, Yan H, Long Z, Shi A, Zhao T, Xiao Y, Li X. CancerSEA: a cancer single-cell state atlas. Nucleic Acids Res. 2019;47(D1).
Pan-Cancer Immune Subtypes:
Thorsson V, Gibbs DL, et al. The Immune Landscape of Cancer. Immunity. 2018 Apr 17;48(4):812-830.e14. doi: 10.1016/j.immuni.2018.03.023. Epub 2018 Apr 5. Erratum in: Immunity. 2019 Aug 20;51(2):411-412. doi: 10.1016/j.immuni.2019.08.004. PMID: 29628290; PMCID: PMC5982584.
GSVA-Combined z-score:
Lee E, Chuang H-Y, Kim J-W, et al. Inferring pathway activity toward precise disease classification. PLoS Comput Biol 2008;4:e1000217.
Hänzelmann, S., Castelo, R. & Guinney, J. GSVA: gene set variation analysis for microarray and RNA-Seq data. BMC Bioinformatics 14, 7 (2013). https://doi.org/10.1186/1471-2105-14-7.
GSEA Enrichment Analysis:
Wu T, Hu E, Xu S, Chen M, Guo P, Dai Z, Feng T, Zhou L, Tang W, Zhan L, Fu X, Liu S, Bo X, Yu G. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. Innovation (Camb). 2021 Jul 1;2(3):100141. doi: 10.1016/j.xinn.2021.100141. PMID: 34557778; PMCID: PMC8454663.
TCPA:
Chun-Jie Liu, Fei-Fei Hu, Gui-Yan Xie, Ya-Ru Miao, Xin-Wen Li, Yan Zeng, An-Yuan Guo. GSCA: an Integrated Platform for Gene Set Cancer Analysis at Genomic, Pharmacogenomic, and Immunogenomic Levels. Briefings in bioinformatics, 2022, bbac558.
Multi-Algorithm Tumor Microenvironment Scoring:
Li T, Fu J, Zeng Z, Cohen D, Li J, Chen Q, Li B, Liu XS. TIMER2.0 for analysis of tumor-infiltrating immune cells. Nucleic Acids Res. 2020 Jul 2;48(W1):W509-W514. doi: 10.1093/nar/gkaa407. PMID: 32442275; PMCID: PMC7319575.
Gene Methylation Site Annotation:
Tian Y, Morris T, Stirling L, Teschendorff A (2023). ChAMPdata: Data Packages for ChAMP package. R package version 2.32.0. doi:10.18129/B9.bioc.ChAMPdata, https://bioconductor.org/packages/ChAMPdata.

Preparation started: 2023-07-01
Pilot run started: 2023-11-06
Formal operation started: 2024-04-01

The Sparkle database is a non-profit platform that does not accept investment or engage in agency services or any other commercial activities.
All user support will be used for database maintenance and further development.

Sparkle v1.0 | ©2024 | Bioinformatics R&D Department, Hefei GuangRe Biotechnology Co., Ltd. | 皖ICP备2023011057号-1 | 皖公网安备34118102000909号
