With advances in data collection and storage technology, data analysis in modern scientific research and practice has shifted from analyzing a single data set to combining several. Incorporating external data sources introduces significant challenges because data quality and compatibility vary across sources. These challenges fall into three key areas:
Using external information effectively and safely is crucial for improving the robustness and accuracy of scientific analyses. In this talk, I will present two novel methodologies that address these integration challenges for different purposes. First, I will introduce an approach that incorporates external information as constraints in kernel regression models, improving predictive performance even in the presence of partial covariates, summary-level data, and heterogeneity. Second, I will discuss strategies for safely integrating external control data to improve the estimation of average treatment effects, accounting for heterogeneity so that inferences remain valid and reliable. These methodologies offer robust solutions to data integration challenges across scientific domains.