Sequential decision-making in dynamic, heterogeneous environments is often hindered by multivariate and correlated outcomes. This study introduces a unified Copula-CNN-LSTM Deep Q-Network (DQN) framework for multi-stage individualized policy learning in pseudo-temporal settings. Motivated by the need for agents that account for inter-outcome dependencies, we extend static covariates from two benchmark datasets (Boston Housing and Wine Quality Red) into pseudo-temporal sequences to emulate state transitions. Multivariate rewards with controlled correlations (ρ = 0.5 and ρ = −0.5) are standardized via an empirical copula transformation to assess policy robustness under varying dependency structures. The DQN agent optimizes policies using experience replay and temporal discounting over state-action-reward trajectories. The framework shows stable convergence in average rewards on both datasets under positive and negative correlation structures. Analysis of the resulting dynamic conditional average treatment effects (CATEs) across outcome dimensions highlights the model’s ability to discern heterogeneous treatment impacts. Furthermore, learned policy matrices and dynamic Directed Acyclic Graphs (DAGs) reveal interpretable temporal dependencies, with edge structures reflecting the multivariate nature of the optimal policy. Overall, the proposed framework captures inter-temporal dependencies and adapts to correlated rewards, providing a scalable and interpretable solution for sequential decision-making in complex environments.
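As a rough illustration of the reward standardization step mentioned in the abstract, the following sketch simulates two correlated reward dimensions and maps each to the unit interval with a rank-based empirical copula transform. The function names, the bivariate Gaussian generator, the sample size, and the n + 1 rank scaling are assumptions made for this example; the abstract does not specify how the paper implements these steps.

```python
import numpy as np


def simulate_correlated_rewards(n, rho, seed=0):
    """Draw n bivariate Gaussian rewards with correlation rho (assumed generator)."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    return rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)


def empirical_copula_transform(y):
    """Map each column of y to pseudo-observations in (0, 1) via its ranks.

    Ranks are divided by n + 1 (an assumed convention) so that the
    transformed values never reach exactly 0 or 1.
    """
    n = y.shape[0]
    # argsort of argsort gives 0-based ranks per column; add 1 for 1-based ranks.
    ranks = np.argsort(np.argsort(y, axis=0), axis=0) + 1
    return ranks / (n + 1)


# Hypothetical usage with the positive-correlation setting from the abstract.
rewards = simulate_correlated_rewards(n=2000, rho=0.5)
u = empirical_copula_transform(rewards)
```

Because the transform depends only on ranks, each reward dimension ends up with approximately uniform margins while the rank dependence between dimensions is preserved, which is why such a step can put differently scaled outcomes on a common footing.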
Table of Contents
Abstract
1. INTRODUCTION
2. METHODS
2.1. Sequential Decision-Making Framework
2.2. Treatment Assignment Model
2.3. Multistage Outcome Generation with Heterogeneous Treatment Effects
2.4. Empirical Copula Transformation for Normalization
2.5. Dynamic Conditional Average Treatment Effect (CATE)
2.6. Deep Q-Network (DQN) for Multistage Policy Learning
2.7. Theoretical Justification and Policy Learning
2.8. Learned Dynamic Directed Acyclic Graph (DAG)
2.9. Performance Metrics
3. DATA ANALYSIS
3.1. Data Sets
3.2. Descriptive Statistics
3.3. Experimental Setup
3.4. Outcome Simulation and Transformation
3.5. Policy Learning and Replay Memory
3.6. Dynamic DAG Estimation
3.7. Evaluation Metrics
3.8. Results Overview
3.9. Visualization
3.10. Analysis of Temporal Model Structure and Dynamics
4. CONCLUSION
DATA AVAILABILITY
REFERENCES
Keywords
Deep Q-Network, Sequential Decision-Making, Multivariate Outcomes, Copula, CNN, LSTM, Conditional Average Treatment Effect
Authors
Jong-Min Kim [ Statistics Discipline, Division of Science and Mathematics, University of Minnesota-Morris, Morris, MN 56267, USA / EGADE Business School, Tecnologico de Monterrey, Ave. Rufino Tamayo, Garza García, NL, CP 66269, Mexico ]
Jinhwa Kim [ School of Business, Sogang University, Seoul, South Korea ]
Corresponding Author