Abstract: The theory of statistical inference combined with the divide-and-combine strategy for large-scale data analysis has recently attracted considerable interest, owing to the great popularity of the MapReduce scheme in the Hadoop platform. The key to developing such inference lies in the method used to combine results obtained from separately mapped data batches. One seminal solution in the literature, based on the confidence distribution, was proposed in the setting of maximum likelihood estimation. We consider a more general inferential methodology based on estimating functions, of which maximum likelihood is a special case. This generalization allows us to perform regression analyses of massive, complex data via the MapReduce scheme, including longitudinal data analysis, survival analysis, and quantile regression, none of which can be handled by the maximum likelihood method. The proposed statistical inference inherits many key large-sample properties of estimating functions. In addition, because the proposed method is closely connected to the generalized method of moments (GMM) and Crowder’s optimality, its optimality over existing methods can be conveniently verified. Our method provides a unified framework for many kinds of statistical models and data types, as illustrated with numerical examples from both simulation studies and real-world data analyses.
This is joint work with Ling Zhou.
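To give a concrete sense of the divide-and-combine step described above, the following is a minimal sketch of one standard inverse-variance combination rule; the notation ($K$ batches, estimating function $\psi$, batch estimators $\hat{\theta}_k$, and variance estimates $\hat{V}_k$) is illustrative rather than taken from the abstract, and the proposed GMM-type combination may differ in its details.

\[
\sum_{i \in \mathcal{D}_k} \psi(z_i; \hat{\theta}_k) = 0, \qquad k = 1, \dots, K,
\]
\[
\hat{\theta}_{\mathrm{comb}} = \Big( \sum_{k=1}^{K} \hat{V}_k^{-1} \Big)^{-1} \sum_{k=1}^{K} \hat{V}_k^{-1} \hat{\theta}_k,
\]
where each mapper solves its batch-level estimating equation and the reducer combines the resulting estimators, weighting each by the inverse of its (sandwich-type) variance estimate.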