Handling the massive datasets collected in modern applications is a growing challenge, especially when traditional statistical methods struggle to scale. A common workaround is to work with a smaller subset of the data, a training sample chosen to represent the full dataset. But how do we choose a training sample that truly captures the essence of the full data, especially when we do not want to commit to a specific model too early? Recent methods focus on selecting “optimal” design points under an assumed model, such as linear or logistic regression, but these subsamples can fall short when the model is misspecified. In this talk, I will introduce a new algorithm for selecting a “good” training sample: one that preserves the key characteristics of the full data and remains robust across different modeling choices. This approach not only improves prediction performance but also provides the flexibility to build and select the best model for the data at hand.
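
As a rough illustration of this contrast only (not the algorithm presented in the talk), the sketch below compares a subsample chosen by a greedy D-optimal criterion under an assumed linear model with a subsample chosen greedily to mimic the full covariate distribution via an energy-distance heuristic, then fits working models of different complexity on each. The simulated data, the function names (d_optimal_greedy, representative_greedy, fit_and_test), and the specific criteria are all hypothetical choices made for this sketch; only NumPy is assumed.

import numpy as np

rng = np.random.default_rng(0)

# Simulated "full" data: the true regression function is nonlinear, so any
# low-degree polynomial working model is misspecified to some extent.
N, n = 2000, 40
X = rng.uniform(-2.0, 2.0, size=(N, 2))
y = np.sin(2.0 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.3, size=N)

def d_optimal_greedy(X, n):
    """Greedy D-optimal subsample under an assumed linear model with intercept."""
    F = np.column_stack([np.ones(len(X)), X])          # design matrix of the assumed model
    chosen = [int(np.argmax(np.sum(F ** 2, axis=1)))]  # start from the largest-norm row
    for _ in range(n - 1):
        M = F[chosen].T @ F[chosen] + 1e-8 * np.eye(F.shape[1])
        var = np.einsum("ij,jk,ik->i", F, np.linalg.inv(M), F)  # prediction variance per candidate
        var[chosen] = -np.inf
        chosen.append(int(np.argmax(var)))             # max-variance point gives the largest det(M) gain
    return np.array(chosen)

def representative_greedy(X, n):
    """Greedy model-free subsample that mimics the full covariate cloud
    (energy-distance-style criterion)."""
    d_to_full = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2).mean(axis=1)
    chosen = [int(np.argmin(d_to_full))]               # point closest, on average, to the full data
    for _ in range(n - 1):
        m = len(chosen)
        d_to_chosen = np.linalg.norm(X[:, None, :] - X[chosen][None, :, :], axis=2).mean(axis=1)
        score = d_to_full - (m / (m + 1.0)) * d_to_chosen  # attracted to dense regions, repelled by near-duplicates
        score[chosen] = np.inf
        chosen.append(int(np.argmin(score)))
    return np.array(chosen)

def fit_and_test(idx, degree):
    """Fit a polynomial working model on the chosen subsample; report test RMSE."""
    def feats(Z):
        cols = [np.ones(len(Z))]
        for d in range(1, degree + 1):
            cols += [Z[:, 0] ** d, Z[:, 1] ** d]
        return np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(feats(X[idx]), y[idx], rcond=None)
    X_test = rng.uniform(-2.0, 2.0, size=(5000, 2))
    y_true = np.sin(2.0 * X_test[:, 0]) + X_test[:, 1] ** 2
    return np.sqrt(np.mean((feats(X_test) @ beta - y_true) ** 2))

for name, idx in [("D-optimal (assumed linear model)", d_optimal_greedy(X, n)),
                  ("representative subsample", representative_greedy(X, n))]:
    print(f"{name:32s} linear-fit RMSE: {fit_and_test(idx, 1):.3f}"
          f"   cubic-fit RMSE: {fit_and_test(idx, 3):.3f}")

The point of the sketch is only the design contrast: the D-optimal choice is tied to the assumed model, whereas the representative choice depends on the covariate distribution alone and so can be reused when the working model is changed after the subsample is drawn.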