Medical Statistical Analysis


Medical Statistical Analysis is a broad term for the use of modeling and analysis to identify and predict trends in medical data and to optimize treatment. The analysis may be conducted on individual patient data or on a collection of depersonalized medical records. Simple trend identification relies on conventional statistical processes: collecting a simple random sample, applying a curve-fitting algorithm, constructing a confidence interval, and rejecting or failing to reject the null hypothesis. For prediction and optimization, commonly used methods include Bayesian probability calculus, the Monte Carlo method, and Markov chain analysis. Analysis of frequently updated data sets can also be automated to flag potentially alarming trends or to identify potentially erroneous outliers. In application, the results of medical statistical analyses are often combined with research data and patient symptoms to form the basis of evidence-based medicine and some forms of artificial intelligence.

A diagram illustrating the place of Medical Statistical Analysis within the context of Evidence Based Medicine.


Statistical analysis is a powerful technique that enables a researcher to draw meaningful conclusions from a study in which data are collected through observation, survey, or experimentation. The success of a medical study, however, depends to a great extent on the proper statistical analysis of the data it produces. Many medical researchers overlook this fact. As a consequence, some very interesting medical studies may be rendered useless by insufficient or improper statistical analysis.
In the field of medicine and health, research problems range from simple to complicated. Examples include: monitoring the weight of a group of infants receiving a specific diet and testing whether their average weight differs from the general average weight for that age; comparing the efficacy of a new medicine with that of an existing one; assessing the effectiveness of different doses of a medication; comparing several treatments simultaneously and choosing the best one; estimating the effect of personal factors on a particular disease status, e.g. diabetes; categorizing a person as healthy or not based on their response variables; diagnosing a cancer subtype and assessing changes in patients' health after repeated applications of chemotherapy; prescribing an appropriate diet to patients with multi-disease syndrome; and predicting the survival time of HIV-infected patients.
To solve these types of problems, researchers collect data from the subjects involved in the study. In a statistical sense, such problems can be categorized as univariate or multivariate, cross-sectional or longitudinal, and case-control or cohort designs. When data are collected from all members of the population, it may be sufficient to describe and summarize them using numerical and graphical descriptive statistical methods. Under most circumstances, however, it is not feasible to investigate the entire population, so information is collected from only a sample of members representing the population. Descriptive analysis in this case can describe only the features of the sample data. Conclusions about the population should not be drawn at this stage; rather, inferential statistical procedures should be applied to draw conclusions about the population on the basis of the sample data. Population parameters can be estimated using point estimation and confidence interval estimation methods. To test hypotheses about the parameters, one can choose a parametric or non-parametric technique depending on the problem under investigation and the type of data collected. Each method is based on certain assumptions. It is of the utmost importance to verify that all the assumptions are met before the selected inferential method is applied to the data. After the statistical analysis is performed, the findings should be interpreted precisely and the probabilities of specific eventualities stated clearly. It is also crucial to highlight the limitations of the study.
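As an illustration of point estimation and confidence interval estimation, the following minimal sketch computes a sample mean and an approximate 95% confidence interval for a population mean. It assumes a large sample so that the normal approximation is reasonable; for small samples a t-based interval would be the usual choice. The function name `mean_confidence_interval` and the synthetic infant-weight data are hypothetical and not taken from any study.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Return (point estimate, (lower, upper)) for the population mean.

    Assumption: the sample is large enough for the normal approximation.
    """
    n = len(sample)
    xbar = mean(sample)                      # point estimate of the mean
    se = stdev(sample) / math.sqrt(n)        # estimated standard error
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # two-sided critical value
    return xbar, (xbar - z * se, xbar + z * se)


# Synthetic infant weights in kg (made-up values for illustration only).
rng = random.Random(1)
weights = [rng.gauss(7.5, 0.9) for _ in range(200)]

estimate, (lower, upper) = mean_confidence_interval(weights)
print(f"point estimate: {estimate:.2f} kg, 95% CI: ({lower:.2f}, {upper:.2f})")
```

The interval describes the uncertainty in the estimate of the population mean, not the spread of individual weights.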
To summarize, in data-driven medical studies it is reasonable to convert a medical problem into a statistical problem; collect data through a relevant experimental design or questionnaire; choose the most appropriate statistical method; apply it properly, that is, with careful attention to the underlying theory and requirements of the selected method; and finally interpret the outcome adequately while indicating the level of uncertainty involved. Proper and complete statistical analysis leads to reliable and valid conclusions. In health care and clinical trial studies involving human subjects, it is highly desirable that proper treatment therapies be used. If decisions are based on improper statistical analyses, the consequences can be disastrous.
The statistical analysis toolkit contains many tools. A brief orientation is provided in Figure 1. Most of these methods are available in statistical software packages such as SPSS, SAS, Minitab, and Epi Info.

Figure 1: Statistical analysis kit
This figure is, however, not exhaustive. Many other testing procedures and user-defined, model-based techniques can be developed in programming languages such as R and S-Plus to address particular problems. These packages assist users by automating the calculations involved in applying a statistical method. The onus remains on the researcher to choose the appropriate method for a given problem and to interpret the software's output properly.


Identification Example:
A lab technician is concerned that cholesterol levels observed in blood samples from a single county are significantly higher than the state average. He has collected a large number of samples and believes that they are representative of the population. The technician's postulate, that this county's average cholesterol level is greater than the state average, forms the alternative hypothesis. The null hypothesis is that the two averages are equal. By setting a significance level and calculating the p-value for his sample, the technician can decide whether to reject the null hypothesis (when the p-value falls below the significance level) or fail to reject it.
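The identification example can be sketched as a one-sided, one-sample z-test. This is a minimal sketch under stated assumptions: the sample is large (so the normal approximation holds), the state average is treated as a known constant, and the numbers (a state mean of 200 mg/dL and the synthetic county sample) are invented for illustration. The function name `one_sided_z_test` is hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def one_sided_z_test(sample, mu0, alpha=0.05):
    """Test H0: mean == mu0 against H1: mean > mu0.

    Returns (z statistic, p-value, reject_null). Assumes a large sample.
    """
    n = len(sample)
    se = stdev(sample) / math.sqrt(n)        # estimated standard error
    z = (mean(sample) - mu0) / se            # standardized test statistic
    p_value = 1.0 - NormalDist().cdf(z)      # upper-tail probability under H0
    return z, p_value, p_value < alpha


# Synthetic county cholesterol readings in mg/dL (made-up values).
rng = random.Random(0)
county_sample = [rng.gauss(205.0, 30.0) for _ in range(400)]

z, p_value, reject = one_sided_z_test(county_sample, mu0=200.0, alpha=0.05)
print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject}")
```

Failing to reject the null hypothesis does not show that the averages are equal; it means only that the sample does not provide sufficient evidence of a higher county average at the chosen significance level.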

Prediction Example:
A college campus clinic is receiving a large influx of students suffering from influenza. The clinic also has a large database of historical data documenting the frequency of past flu diagnoses. Clinic management is concerned that demand may quickly exceed the clinic's capacity to provide care. To preempt this, an automatic prediction and flagging algorithm based on Markov chain analysis is put in place to warn clinic staff when the probability of exceeding capacity reaches 25% or greater. The historical data are parsed and a transition matrix is formed, in which each entry represents the probability that the rate of change (in flu cases per day) will move from an arbitrary state A to an arbitrary state B. Each day, the initial state vector is loaded with the current rate of change. After carrying out the matrix operations and adding the predicted rates of change to the most recently observed case count, the software can notify staff if the alert level is breached.
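One possible reading of the prediction example is sketched below. It treats the daily change in flu cases as the Markov state, estimates transition probabilities from a historical sequence of daily changes, and then propagates the joint distribution of (latest change, case count) forward to find the probability that the count exceeds capacity within a given horizon. The function names, the synthetic history, the capacity, and the five-day horizon are all hypothetical; the 25% alert level comes from the example above.

```python
from collections import Counter, defaultdict


def estimate_transitions(daily_changes):
    """Estimate P[a][b] = P(next change = b | current change = a) from history."""
    counts = defaultdict(Counter)
    for a, b in zip(daily_changes, daily_changes[1:]):
        counts[a][b] += 1
    return {a: {b: c / sum(row.values()) for b, c in row.items()}
            for a, row in counts.items()}


def prob_exceed_capacity(P, current_change, current_cases, capacity, horizon):
    """Probability that the case count exceeds capacity within `horizon` days."""
    # dist maps (latest daily change, case count) -> probability of that state
    dist = {(current_change, current_cases): 1.0}
    exceeded = 0.0
    for _ in range(horizon):
        nxt = defaultdict(float)
        for (change, cases), p in dist.items():
            # Sketch-only assumption: a change with no observed successor persists.
            row = P.get(change, {change: 1.0})
            for new_change, q in row.items():
                new_cases = max(0, cases + new_change)
                if new_cases > capacity:
                    exceeded += p * q        # absorbed: capacity was breached
                else:
                    nxt[(new_change, new_cases)] += p * q
        dist = nxt
    return exceeded


# Synthetic history of daily changes in flu cases (made-up values).
history = [0, 1, 1, 2, 1, 0, -1, 0, 1, 2, 2, 1, 0, -1, -1, 0, 1, 1, 2, 1]
P = estimate_transitions(history)

p = prob_exceed_capacity(P, current_change=history[-1], current_cases=40,
                         capacity=45, horizon=5)
print(f"P(exceed capacity within 5 days) = {p:.3f}, alert: {p >= 0.25}")
```

Tracking the case count alongside the Markov state keeps the calculation exact for small state spaces; a larger model might instead simulate many trajectories with the Monte Carlo method.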

Web Resources:

Medical Statistics Made Easy (Google Book Preview)
Markov Chain
Monte Carlo Method

Related Terminology:

Evidence Based Medicine - EBM
Artificial Intelligence in Medicine
National Committee on Vital and Health Statistics (NCVHS)

