Workflow of p-ClustVal
Our research tackles a critical challenge in modern biology: making sense of complex single-cell RNA sequencing data. We introduce p-ClustVal, a data transformation technique inspired by p-adic number theory that significantly improves how we identify and group similar cells. The method's power lies in transforming data into a format where cell groups become more distinctly separated, while integrating seamlessly with existing analysis tools. In extensive testing across 30 experiments, p-ClustVal improved clustering accuracy in 91% of cases, with gains of up to 0.84 points with RaceID and 0.60 points with SIMLR. The method also adapts automatically to different datasets without manual tuning, making it both powerful and user-friendly. This advancement opens new possibilities for understanding cellular biology at finer resolution, potentially accelerating discoveries in fields ranging from development to disease research.
Sharma, P., Mishra, S., Kurban, H. et al. p-clustval: a novel p-adic approach for enhanced clustering of high-dimensional single-cell RNASeq data. Int J Data Sci Anal (2025). https://doi.org/10.1007/s41060-024-00709-4
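For readers unfamiliar with the p-adic idea that p-ClustVal draws on, the sketch below computes the standard p-adic valuation and p-adic absolute value of an integer. It is background only: the function names are illustrative assumptions, and this is not the transformation defined in the paper.

```python
from fractions import Fraction


def p_adic_valuation(n: int, p: int) -> int:
    """Return the largest k such that p**k divides n (n must be nonzero)."""
    if n == 0:
        raise ValueError("the valuation of 0 is infinite")
    if p < 2:
        raise ValueError("p must be at least 2 (a prime in the classical setting)")
    n = abs(n)
    k = 0
    while n % p == 0:
        n //= p
        k += 1
    return k


def p_adic_abs(n: int, p: int) -> Fraction:
    """Return the p-adic absolute value |n|_p = p**(-v_p(n)), with |0|_p = 0."""
    if n == 0:
        return Fraction(0)
    return Fraction(1, p ** p_adic_valuation(n, p))


# 48 = 2**4 * 3, so its 2-adic valuation is 4 and its 2-adic size is 1/16.
print(p_adic_valuation(48, 2), p_adic_abs(48, 2))
```

Under this notion of size, integers divisible by high powers of p count as small, which is a very different geometry from the usual absolute value.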
Illustration of the Data-Centric principles for optimization
This paper proposes k-means-d, a data-centric enhancement to the classic k-means clustering algorithm that achieves significant performance improvements while preserving accuracy. The key innovation lies in classifying data points as either high expressive (HE) or low expressive (LE) based on their impact on the objective function, which lets the algorithm avoid redundant computations on LE points and focus only on the HE points that significantly affect convergence. The method integrates data-centric principles directly into the algorithm's iterative core, making it an interesting example of applying data-centric AI to fundamental algorithms. In extensive experiments across multiple datasets, k-means-d achieved up to a 19× reduction in distance computations compared to existing faster k-means variants while producing identical clustering results. It outperformed both classical k-means and state-of-the-art alternatives, particularly on larger datasets. This advancement opens up new possibilities for infusing data-centricity into other canonical algorithms, potentially changing how we think about algorithmic optimization in the data-centric AI era.
What Data-Centric AI Can Do For k-means: a Faster, Robust k-means-d. Parichit Sharma, Hasan Kurban and Mehmet M. Dalkilic. 41st International Conference on Machine Learning (ICML), Data-centric Machine Learning Research (DMLR): Datasets for Foundation Models, Vienna, Austria, 2024.
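The general idea behind k-means-d, skipping work on points that cannot change the outcome, can be illustrated with a small sketch. Note the substitution: the skip test below is a Hamerly-style triangle-inequality bound, a different and well-known technique used here only because the paper's own HE/LE criterion is not spelled out in this summary. The function name `lloyd_with_skipping` and its parameters are assumptions.

```python
import numpy as np


def lloyd_with_skipping(X, centroids, n_iter=20):
    """Lloyd's k-means that skips points whose assignment provably cannot change.

    Stand-in criterion (NOT the paper's HE/LE rule): a point is skipped when an
    upper bound on the distance to its own centroid is below a lower bound on
    the distance to every other centroid.
    Returns (labels, centroids, number of point-to-centroid distances computed).
    """
    X = np.asarray(X, dtype=float)
    C = np.array(centroids, dtype=float)
    n, k = X.shape[0], C.shape[0]
    if k < 2:
        raise ValueError("this sketch needs at least two centroids")
    labels = np.zeros(n, dtype=int)
    upper = np.full(n, np.inf)  # upper bound: distance to the assigned centroid
    lower = np.zeros(n)         # lower bound: distance to any other centroid
    n_dist = 0
    for _ in range(n_iter):
        active = upper >= lower  # only these points might change cluster
        if active.any():
            Xa = X[active]
            D = np.linalg.norm(Xa[:, None, :] - C[None, :, :], axis=2)
            n_dist += D.size
            labels[active] = D.argmin(axis=1)
            two_smallest = np.partition(D, 1, axis=1)
            upper[active] = two_smallest[:, 0]
            lower[active] = two_smallest[:, 1]
        new_C = C.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                new_C[j] = members.mean(axis=0)
        shift = np.linalg.norm(new_C - C, axis=1)
        C = new_C
        if shift.max() == 0:
            break
        # Centroids moved, so loosen both bounds by how far they moved.
        upper += shift[labels]
        lower -= shift.max()
    return labels, C, n_dist
```

Because a point is skipped only when the bounds prove its nearest centroid is unchanged, the assignments match those of plain Lloyd iterations; the saving shows up in the returned distance count.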
Visualizing the result of p-ClustVal transformation
P. Sharma, S. Mishra, H. Kurban and M. Dalkilic, p-ClustVal: A Novel p-Adic Approach for Enhanced Clustering of High-Dimensional scRNASeq Data (Extended Abstract), 2024 IEEE 11th International Conference on Data Science and Advanced Analytics (DSAA), San Diego, CA, USA, 2024, pp. 1-3, doi: 10.1109/DSAA61799.2024.10722799.
Architectural framework overview
We developed an approach to make NBA fantasy leagues more realistic by incorporating crucial aspects often overlooked in traditional systems. While current fantasy teams rely solely on individual player statistics, we address the gap of team chemistry (TC) and position-specific statistical scaling, recognizing that a player's contribution varies significantly with their role. Our solution introduces a novel method to quantify team chemistry by analyzing season-wise player pairing history, combined with scaled position statistics that reflect each position's unique impact. When we tested the approach on data from the NBA's API with various machine learning models, our best-performing model achieved 75.4% accuracy in predicting playoff qualification, an 8% improvement over baseline methods.
Ganesh Arkanath, Nishad Gupta, Hasan Kurban, Parichit Sharma, KR Madhavan, Elham Buxton, Mehmet Dalkilic, Novel NBA Fantasy League driven by Engineered Team Chemistry and Scaled Position Statistics, 2023 IEEE International Conference on Big Data (BigData), Sorrento, Italy, 2023, pp. 4268-4275, doi: 10.1109/BigData59044.2023.10386444.
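To make the idea of a chemistry score built from season-wise pairing history concrete, here is a minimal sketch. The scoring rule (average number of shared seasons per player pair), the data layout, and the function names are all hypothetical illustrations, not the formula used in the paper.

```python
from collections import Counter
from itertools import combinations


def pair_history(rosters_by_season):
    """Count, for each unordered player pair, the seasons they shared a roster.

    Assumed layout: {season: {team: [player, ...]}}.
    """
    counts = Counter()
    for rosters in rosters_by_season.values():
        for players in rosters.values():
            for pair in combinations(sorted(set(players)), 2):
                counts[pair] += 1
    return counts


def team_chemistry(roster, counts):
    """Hypothetical score: mean number of shared seasons over all roster pairs."""
    pairs = list(combinations(sorted(set(roster)), 2))
    if not pairs:
        return 0.0
    return sum(counts[pair] for pair in pairs) / len(pairs)


# Made-up rosters with placeholder player names.
history = {
    "2021": {"X": ["A", "B", "C"]},
    "2022": {"X": ["A", "B"], "Y": ["C", "D"]},
}
counts = pair_history(history)
print(team_chemistry(["A", "B", "C"], counts))
```

A score like this could then be fed to a classifier alongside position-scaled statistics.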
DCEM package workflow
DCEM is a novel approach to modernizing the Expectation Maximization (EM) algorithm for big data. While EM has long been a cornerstone technique for clustering and parameter estimation, its performance degrades significantly on large-scale datasets, a critical limitation in today's data-intensive world. To address this, DCEM introduces a data-centric strategy that dynamically separates high expressive (HE) from low expressive (LE) data points based on their impact on the objective function, restructuring how the algorithm processes information. The solution inserts hierarchical structures into the algorithm so that iterations alternate between using only HE data and using HE+LE data, dramatically reducing computational overhead without compromising accuracy. The results are promising: DCEM outperforms traditional EM and contemporary alternatives across various dataset sizes and dimensionalities, converging faster while maintaining or even improving clustering quality. It clustered datasets of up to 2 million points in approximately 33 minutes, while competing methods failed to converge within a 2-hour threshold.
Parichit Sharma, Hasan Kurban, Mehmet Dalkilic, DCEM: An R package for clustering big data via data-centric modification of Expectation Maximization, SoftwareX, Volume 17, 2022, 100944, ISSN 2352-7110, https://doi.org/10.1016/j.softx.2021.100944
Impact of High Expressive (HE) Data on algorithm's convergence
This work tackles a fundamental challenge in modern data science: understanding and leveraging the inherent value of data in algorithmic processing. With data growing exponentially while computing power advances linearly, there is an urgent need to rethink how we interact with data in AI systems. To address this, we introduce the concept of "data expressiveness", a way to dynamically evaluate how much each data point contributes to the learning process, and demonstrate its application through a data-centric enhancement of the Expectation Maximization (EM) algorithm. Our solution, EM-DC, employs balanced binary search trees to efficiently separate high-expressive (HE) from low-expressive (LE) data, allowing the algorithm to focus computational resources on the data points that significantly impact the objective function. When tested on real-world datasets, including a cropland classification dataset with 325,834 records and 175 attributes, EM-DC improved on traditional EM and previous data-centric versions: it achieved comparable accuracy while using only one-third of the data in its structure, significantly reduced training time and iteration counts, and scaled better as the number of clusters, dimensions, and data points increased.
Data Expressiveness and Its Use in Data-centric AI, Hasan Kurban, Parichit Sharma and Mehmet M. Dalkilic, 35th Conference on Neural Information Processing Systems (NeurIPS 2021), First Data-centric AI workshop, Sydney, Australia, 2021
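As a rough illustration of separating HE from LE points between two EM iterations, the sketch below ranks points by a placeholder expressiveness score. Both the score (the change in a point's largest cluster responsibility) and the use of `heapq` in place of the balanced binary search trees mentioned for EM-DC are assumptions made for brevity; the paper's actual definition may differ.

```python
import heapq


def split_by_expressiveness(prev_resp, curr_resp, he_fraction=0.3):
    """Split point indices into (HE, LE) using a placeholder score.

    prev_resp and curr_resp hold one list of cluster responsibilities per point,
    taken from two consecutive EM iterations. The score used here, the absolute
    change in the largest responsibility, is hypothetical.
    """
    if len(prev_resp) != len(curr_resp):
        raise ValueError("both iterations must cover the same points")
    scores = [abs(max(c) - max(p)) for p, c in zip(prev_resp, curr_resp)]
    if not scores:
        return [], []
    n_he = max(1, round(he_fraction * len(scores)))
    # Heap-based selection of the n_he highest-scoring points.
    he = heapq.nlargest(n_he, range(len(scores)), key=scores.__getitem__)
    he_set = set(he)
    le = [i for i in range(len(scores)) if i not in he_set]
    return sorted(he), le


# Made-up responsibilities for four points and two clusters.
prev = [[0.5, 0.5], [0.9, 0.1], [0.6, 0.4], [0.99, 0.01]]
curr = [[0.9, 0.1], [0.9, 0.1], [0.7, 0.3], [0.99, 0.01]]
he_idx, le_idx = split_by_expressiveness(prev, curr, he_fraction=0.5)
print(he_idx, le_idx)
```

In a full algorithm, the next iteration would then update parameters from the HE points only, revisiting the LE points periodically.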
Feature correlation and distribution in the TiO2 nanoparticle data
Kurban, H., Kurban, M., Sharma, P., Dalkilic, M.M., 2021. Predicting Atom Types of Anatase TiO2 Nanoparticles with Machine Learning. KEM 880, 89–94. https://doi.org/10.4028/www.scientific.net/kem.880.89
Visualizing the interactions in a multimer protein complex
Protein-protein interactions are crucial for cellular function, but determining them experimentally is time-consuming and expensive. PIZSA (Protein Interaction Z Score Assessment) is a web server tool that evaluates protein-protein interactions by analyzing interface residue contacts. Using a scoring scheme that considers atom-mediated interactions between residue pairs, PIZSA identified native protein structures with 84% accuracy, outperforming existing methods. The tool handles both simple and complex protein assemblies without requiring explicit chain identification, making it highly versatile. Testing on multiple datasets showed that PIZSA achieves a classification accuracy of 89%, a balanced accuracy of 90%, and a Matthews correlation coefficient of 0.80, demonstrating its reliability for predicting stable protein interactions.
Ankit A Roy, Abhilesh S Dhawanjewar, Parichit Sharma, Gulzar Singh, M S Madhusudhan, Protein Interaction Z Score Assessment (PIZSA): an empirical scoring scheme for evaluation of protein–protein interactions, Nucleic Acids Research, Volume 47, Issue W1, 02 July 2019, Pages W331–W337, https://doi.org/10.1093/nar/gkz368
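The three metrics reported for PIZSA (accuracy, balanced accuracy, and Matthews correlation coefficient) all follow from a binary confusion matrix. The sketch below uses the standard definitions; the function name and the example counts are made up for illustration and are not PIZSA's data.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, balanced accuracy, and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    balanced_accuracy = (sensitivity + specificity) / 2
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is taken as 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
        "mcc": mcc,
    }


# Hypothetical counts: 40 true positives, 10 false positives,
# 40 true negatives, 10 false negatives.
print(binary_metrics(tp=40, fp=10, tn=40, fn=10))
```

Balanced accuracy and MCC are useful alongside plain accuracy because they remain informative when the two classes differ in size.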