HRSA Geo Data (Link) This dataset provides geospatial information on healthcare facilities, workforce distribution, and access to federally funded health programs. It was used to analyze geographic disparities in healthcare accessibility.
Census Small Area Health Insurance Estimates (SAHIE) (Link) The SAHIE dataset offers county-level estimates of health insurance coverage, segmented by age, sex, and income level. This data was used to assess uninsured populations and their distribution across different demographics.
San Diego County Community Health Statistics (Link) This database includes a wide range of health indicators, such as chronic disease prevalence, leading causes of death, and risk factors affecting public health in San Diego County.
San Diego County Disease Atlas (Link) The Disease Atlas provides interactive maps displaying the geographic distribution of various diseases in San Diego County. These visualizations were used to identify health trends and disease clusters.
San Diego County Health Equity Reports (Link) This report series highlights disparities in health outcomes across racial, ethnic, and socioeconomic groups. It was used to assess the impact of social determinants on healthcare access.
San Diego Racial Equity Dashboard (Link) This interactive dashboard presents data on racial disparities in health, education, and economic well-being. It was incorporated to evaluate the intersection of race and healthcare inequities.
HRSA Medically Underserved Areas (MUA) Finder (Link) The MUA Finder identifies regions with limited access to primary care services based on criteria such as provider-to-population ratios and poverty levels. This was used to pinpoint underserved communities.
HRSA Health Center Service Delivery (HSCD) Data (Link) The HSCD dataset includes information on federally qualified health centers, patient demographics, and service utilization patterns. It was used to analyze healthcare service distribution and demand.
Model Geographic Focus: San Diego and Imperial Counties were chosen as primary focus areas for the model due to their unique healthcare dynamics and the significant role that UC San Diego Health (UCSD) plays in both regions.
San Diego County is home to UCSD’s main medical campus, which serves as a critical hub for medical research, patient care, and healthcare education. The county has a large, diverse population with varying healthcare needs, making it an ideal area to study the impact of healthcare resource allocation and patient volume modeling.
Imperial County, while more rural, is closely connected to San Diego County in terms of healthcare infrastructure. Many residents of Imperial County rely on UCSD Health for specialized medical services, due to the limited healthcare options available locally. As a result, UCSD's role in providing care to the underserved population of Imperial County underscores the importance of optimizing resources and patient volume models for this region.
By focusing on these two counties, the models aim to address critical healthcare challenges, including resource distribution, patient care access, and optimization of healthcare services across both urban and rural areas in California.
2) Data Preprocessing
All data processing and modeling were performed in Python.
Handling Missing Values – Missing numerical values were imputed using the column mean to maintain dataset completeness.
Feature Engineering – A new target variable, Patient Volume, was defined as the number of uninsured individuals (NUI) divided by the number of available healthcare sites plus one, i.e. NUI / (sites + 1). The added one avoids division by zero in areas with no sites.
Data Cleaning – Non-numeric columns, such as county names, were removed to facilitate numerical analysis.
Feature Scaling – Standardization was applied using StandardScaler to rescale each feature to zero mean and unit variance, putting all features on a comparable scale.
Dataset Splitting – The dataset was divided into training (80%) and test (20%) subsets to evaluate model generalizability.
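The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the toy data and the column names (`county`, `NUI`, `num_sites`, `imu_score`) are assumptions, as is `random_state=42`. The report does not state whether scaling happened before or after the split; the sketch fits the scaler on the training rows only, which is one common way to avoid leaking test-set statistics.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; the column names are assumptions, not the project's schema.
df = pd.DataFrame({
    "county": ["San Diego", "Imperial"] * 5,
    "NUI": [1200, 300, np.nan, 450, 900, 250, 1500, 380, 1100, 410],
    "num_sites": [12, 0, 8, 2, np.nan, 1, 15, 3, 10, 2],
    "imu_score": [55.0, 48.5, 60.2, np.nan, 58.1, 47.0, 61.3, 49.9, 57.4, 50.2],
})

# Data cleaning: keep numeric columns only (drops e.g. county names).
numeric = df.select_dtypes(include="number")

# Missing values: impute each numeric column with its mean.
numeric = numeric.fillna(numeric.mean())

# Feature engineering: Patient Volume = NUI / (sites + 1); the +1 avoids division by zero.
numeric["patient_volume"] = numeric["NUI"] / (numeric["num_sites"] + 1)

X = numeric.drop(columns="patient_volume")
y = numeric["patient_volume"]

# 80/20 train/test split; random_state is an arbitrary choice for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training rows only, then apply it to both subsets.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

With ten rows, the split leaves eight rows for training and two for testing.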
3) Exploratory Data Analysis
Extracting Feature Importances – The feature_importances_ attribute of the trained Random Forest model was used to quantify the contribution of each feature to the model's predictions.
Sorting Features by Importance – The np.argsort() function was applied to rank features in descending order of importance.
Visualizing Feature Importance – The ranked importances were plotted as a Matplotlib bar chart, with rotated axis labels for readability.
Interpretation – The ranking identified the key drivers of patient volume, enabling further refinement of the predictive model.
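A minimal sketch of the ranking and plotting steps, assuming hypothetical importance values and feature names in place of the trained model's `feature_importances_`. The helper names `rank_features` and `plot_importances` are illustrative only. Note that `np.argsort()` sorts in ascending order, so the sketch reverses the index array to get a descending ranking.

```python
import numpy as np

def rank_features(importances, feature_names):
    """Return (name, importance) pairs sorted from most to least important."""
    importances = np.asarray(importances)
    # np.argsort sorts ascending, so the index array is reversed for descending order.
    order = np.argsort(importances)[::-1]
    return [(feature_names[i], float(importances[i])) for i in order]

def plot_importances(ranked):
    """Draw a bar chart of ranked importances with rotated x-axis labels."""
    import matplotlib.pyplot as plt  # imported here so ranking works without Matplotlib
    names, values = zip(*ranked)
    fig, ax = plt.subplots()
    ax.bar(names, values)
    ax.set_ylabel("Importance")
    ax.tick_params(axis="x", labelrotation=45)
    fig.tight_layout()
    return fig

# Hypothetical importances, standing in for model.feature_importances_.
ranked = rank_features([0.15, 0.60, 0.25], ["imu_score", "NUI", "num_sites"])
```

Calling `plot_importances(ranked)` returns a Matplotlib figure that can be shown or saved.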
4) Machine Learning Model Selection
Random Forest Regressor
Non-Linearity Handling – Healthcare resource needs are influenced by multiple interacting factors, such as uninsured rates, Index of Medical Underservice (IMU) scores, and the number of healthcare sites. Random forests can model non-linear relationships more effectively than linear regression.
Feature Importance Analysis – The model can naturally rank key predictors of resource demand, helping policymakers and administrators prioritize interventions in high-need areas.
Robustness to Outliers & Noise – Decision tree-based models like Random Forests are less sensitive to outliers, which is useful in healthcare datasets where certain regions may have disproportionately high or low uninsured rates or site availability.
Reduced Overfitting – By averaging predictions across multiple trees, Random Forest mitigates the overfitting that could occur with a single decision tree, improving generalizability to new data.
Scalability & Performance – The model uses 200 estimators, combining many decision trees for accurate predictions while keeping computational cost reasonable.
Strong Predictive Performance – The evaluation metrics, Mean Absolute Error (MAE) and R² Score, quantify model accuracy and indicate whether the predictions are reliable enough to inform resource distribution.
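The model setup described above can be sketched as follows. Only `n_estimators=200` comes from this report; the synthetic data, the random seeds, and the use of scikit-learn defaults for all other hyperparameters are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a non-linear target; not the project's dataset.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 3))
y = X[:, 0] / (X[:, 1] + 1) + 0.1 * rng.normal(size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# n_estimators=200 matches the report; other hyperparameters are scikit-learn defaults.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Evaluate on the held-out test set with the two metrics named in the report.
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae:.3f}  R^2: {r2:.3f}")
```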
5) Model Execution & Visualization
Learning Curve (Scikit-Learn, Matplotlib) – Analyzes training vs. validation error to detect overfitting or underfitting.
Actual vs. Predicted Scatter Plot (Seaborn, Matplotlib) – Compares model predictions to real patient volume with a reference line.
SHAP Summary Plot (SHAP, Matplotlib) – Highlights feature contributions to predictions for interpretability.
Feature Importance Heatmap (Seaborn, Pandas, Matplotlib) – Visualizes key drivers of patient volume and resource need.
3D Scatter Plot (Matplotlib, NumPy) – Displays relationships between uninsured rates, clinic availability, and resource demand.
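As an illustration of the first visualization in this list, the data behind a learning curve can be computed with scikit-learn's `learning_curve`. This is a sketch under stated assumptions: the data is synthetic, and the choice of 50 trees, 5-fold cross-validation, and five training-set sizes is made here to keep the example fast, not taken from the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve

# Synthetic stand-in data; not the project's dataset.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 3))
y = X[:, 0] / (X[:, 1] + 1) + 0.1 * rng.normal(size=200)

train_sizes, train_scores, val_scores = learning_curve(
    RandomForestRegressor(n_estimators=50, random_state=0),
    X,
    y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
    scoring="neg_mean_absolute_error",
)

# Scores are negated MAE, so flip the sign to get errors, averaged over folds.
train_mae = -train_scores.mean(axis=1)
val_mae = -val_scores.mean(axis=1)

for n, tr, va in zip(train_sizes, train_mae, val_mae):
    print(f"n={n:4d}  train MAE={tr:.3f}  validation MAE={va:.3f}")
```

A persistent gap between low training error and higher validation error suggests overfitting, while two high, converged curves suggest underfitting. The arrays can be passed to Matplotlib to draw the curve.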
6) Model Evaluation
Mean Absolute Error (MAE) and R-squared (R²) are among the most commonly used metrics for evaluating regression models because they provide complementary insights into model performance:
MAE is widely used because it directly quantifies the average magnitude of prediction errors, without considering their direction. It is expressed in the same units as the target, which makes it easy to interpret, and it weights every error in proportion to its size rather than penalizing large errors disproportionately, as squared-error metrics do.
R², on the other hand, indicates how much of the variance in the target the model explains. It assesses goodness of fit and, when computed on the held-out test set, shows how well that fit carries over to unseen data. Higher R² values indicate that the model captures a greater proportion of the variability, which matters most in contexts where understanding the underlying patterns is critical.
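To make the two metrics concrete, the sketch below computes them from their standard definitions on a small made-up example: MAE is the mean of the absolute errors, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. The function names and the four data points are illustrative only.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: the average of |actual - predicted|."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, the share of variance explained."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return float(1.0 - ss_res / ss_tot)

# Made-up patient volumes: the errors are 2, 2, 3, 3.
y_true = [10, 20, 30, 40]
y_pred = [12, 18, 33, 37]

print(mae(y_true, y_pred))           # 2.5
print(round(r2(y_true, y_pred), 3))  # 0.948
```

Here the absolute errors average to 2.5, the residual sum of squares is 26, and the total sum of squares is 500, giving R² = 1 - 26/500 = 0.948.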
7) Bias & Ethical Mitigation
Data Bias: The datasets used may underrepresent certain populations, such as undocumented individuals or marginalized communities, leading to skewed predictions. Drawing on diverse data sources and selecting features carefully helps mitigate this bias.
Data Completeness: Public datasets may lack granular details, such as real-time patient data, specific clinic capacities, or undocumented populations, limiting the model’s accuracy.
Data Timeliness: Government and county health reports are often updated annually or with delays, meaning the model may not reflect current healthcare needs.
Model Bias: Features such as the IMU Score and uninsured rates may reflect systemic disparities rather than true healthcare needs. The model must be carefully interpreted to avoid reinforcing existing inequities.
Fairness in Resource Allocation: Predictions should not be used to justify reducing resources in underserved areas. Instead, they should guide equitable distribution by prioritizing high-need communities.
Ethical Use of Predictions: The model should support decision-making rather than dictate policies. Human oversight is necessary to balance data-driven insights with contextual knowledge.
8) Future Work
Enhancing Data Quality: Incorporating more granular and real-time healthcare data, such as hospital admission rates and patient demographics, to improve predictive performance.
Developing a User-Friendly Interface: Creating interactive dashboards using Tableau or Dash to visualize predictions for stakeholders, improving accessibility and usability.
Exploring Additional AI Techniques: Testing other machine learning algorithms, such as deep learning and reinforcement learning, to improve the model's predictive accuracy and adaptability.