Statistics in Epidemiology: Difference between revisions
Revision as of 19:55, 4 January 2017
Epidemiology
Epidemiology is the study of the occurrence of disease or other health-related characteristics in human and animal populations. Epidemiologists study the frequency of disease and whether that frequency differs across groups of people, for example to establish a cause-effect relationship between an exposure and an illness. Diseases do not occur at random; they have causes. Many diseases could be prevented if their causes were known. The methods of epidemiology have been crucial in identifying many causative factors, which in turn have led to health policies designed to prevent disease, injury and premature death.
Basic Terms
Epidemiology is an academic discipline which deals with the occurrence of diseases, their causes, events which lead to these diseases and their consequences. The most important terms which define epidemiology are incidence, prevalence, mortality and lethality of diseases.[1]
Incidence is the number of people in a particular group (population) who newly fell ill during a certain time period. [1]
Prevalence is used to measure the frequency of diseases. This is calculated by dividing the number of ill people, at the moment of the examination, by the total number of people in the sample group.[1]
Example: 400 people are tested for the common flu. 100 people in the sample group are found to have the flu. Dividing the 100 infected people by the total sample size of 400 gives the prevalence:
100/400 = 1/4, that is, 1 out of 4 people or 25%.
Mortality measures how many people die in a certain time period. It can be expressed as the death rate: (number of deaths in the sample group during a specified period) / (total number of people in the sample group during that period)
Example: In the flu case, 10 people died from the flu and 10 from other causes. Dividing the number of people who died by the total sample size gives 20/400 = 1/20, or 5%. That is the mortality rate. Note that it counts the TOTAL number of deaths, not just those caused by the disease. For a measure specific to the disease, use the lethality rate.
Lethality of a disease is a ratio: the number of people who died of the disease in a certain time period divided by the number of people who fell ill in the same period. [1]
Example: Take the flu case again. 10 of the 100 people who were sick died from the flu. So 10/100 = 1/10; in other words, there is a 10% chance of dying from this disease. That is the lethality.
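The three worked examples above can be reproduced in a few lines. This is only a sketch using the counts from the flu example; the variable names are chosen here for illustration.

```python
# Counts taken from the flu example in the text.
sample_size = 400    # people examined
ill = 100            # people found to have the flu
deaths_flu = 10      # deaths caused by the flu
deaths_other = 10    # deaths from other causes

# Prevalence: ill people divided by everyone examined.
prevalence = ill / sample_size

# Mortality: ALL deaths divided by everyone in the sample group.
mortality = (deaths_flu + deaths_other) / sample_size

# Lethality: deaths from the disease divided by people who fell ill.
lethality = deaths_flu / ill

print(f"prevalence = {prevalence:.1%}")  # 25.0%
print(f"mortality  = {mortality:.1%}")   # 5.0%
print(f"lethality  = {lethality:.1%}")   # 10.0%
```

Note that mortality and lethality share neither numerator nor denominator, which is why the two figures differ even though they describe the same outbreak.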
Diagnostic Tests
Diagnostic tests are performed to determine whether a certain disease or illness is present in a patient. A test may involve procedures such as various scans, or it may rely merely on symptoms. [1]
A test result falls into one of two main categories, positive or negative, where a positive result indicates the presence of the disease. Each category is subdivided further into true and false results. A true positive result correctly indicates that the illness is present. A false positive result, by contrast, indicates the disease although it is actually not present in the patient. True negative and false negative results follow the same pattern: a true negative correctly indicates that the disease is absent, whereas a false negative misses a disease that is present. [1]
Certain attributes may be assigned to the tests used in diagnosis, such as their sensitivity and specificity. The sensitivity of a test is its ability to detect the condition when the condition is present. The specificity is the ability of a test to give a negative result when the condition is absent. [1]
Table illustrating the subcategories of positive and negative test results [2]
The Fourfold Table
The fourfold table (confusion matrix) is a type of contingency table, that is, a tabular cross-classification of data in which the subcategories of one characteristic are shown horizontally (in rows) and the subcategories of another characteristic are shown vertically (in columns), so that the relationship between the two characteristics can be tested. The table is divided into positives (P) and negatives (N), which are further divided into true positives, false negatives, false positives and true negatives. [1]
We build the table by taking the results predicted by a data model and comparing them with the tested results for the same subjects. For a binary classifier (positive or negative results), the fourfold table (confusion matrix) then contains the following counts [1],[3],[4]:
- True positive (predicted positive & tested positive)
- True negative (predicted negative & tested negative)
- False positive (predicted positive & tested negative)
- False negative (predicted negative & tested positive)
- P & N: actual positives (P) and actual negatives (N)
- p & n: predicted positives (p) and predicted negatives (n)
From these we can derive the following measures (there are more, but these are the most important) [1],[3],[4]:
- Accuracy of predictions: (TP + TN)/(P + N) = ACC
- True Positive Rate (Sensitivity, also called Recall for a binary classifier; the probability of detecting a positive): TP/P = TPR
- True Negative Rate (Specificity; the probability of correctly identifying a negative): TN/N = SPC
- False Positive Rate (the probability that a non-relevant result is retrieved by the query): FP/N or 1 - Specificity = FPR
- Positive Predictive Rate/Precision (the probability that a predicted positive is also tested positive): TP/(TP + FP) = PPR
Ideally, the True Positive Rate (Sensitivity) is as close to 1 as possible and the False Positive Rate is as close to 0 as possible. The closer a test comes to these values, the better the chance that its results are correct.
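The measures listed above can be sketched as a small helper function. The function name `confusion_metrics` and the counts passed to it are hypothetical and chosen only for illustration; they do not come from the sources cited in this article.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Derive the basic measures from the four cells of a fourfold table."""
    p = tp + fn  # actual positives (P)
    n = fp + tn  # actual negatives (N)
    return {
        "accuracy": (tp + tn) / (p + n),
        "sensitivity": tp / p,          # true positive rate, recall
        "specificity": tn / n,          # true negative rate
        "false_positive_rate": fp / n,  # equals 1 - specificity
        "precision": tp / (tp + fp),    # positive predictive rate
    }


# Hypothetical counts: 100 actual positives, 200 actual negatives.
metrics = confusion_metrics(tp=90, fn=10, fp=30, tn=170)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts the sensitivity is 0.900 and the false positive rate is 0.150, so the sketch also shows that FPR and specificity always add up to 1.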
Sensitivity and Specificity of Diagnostic Test Calculated from Fourfold table
A diagnostic test is a procedure performed to determine whether an individual has an illness. The sensitivity of a diagnostic test is the likelihood that a diseased individual in the tested population is identified as diseased by the test. The specificity is the likelihood that a healthy individual is identified as non-diseased by the test. A fourfold table relates the test result to the actual disease status and makes it possible to judge whether these two variables are linked. [1], [5]
The letters a, b, c, d symbolize the numbers in the fourfold table.
- a: diseased individuals with a positive test (true positives).
- b: healthy individuals with a positive test (false positives).
- c: diseased individuals not detected by the test (false negatives).
- d: healthy individuals with a negative test (true negatives).
Sensitivity = a / (a + c).
Specificity = d / (b + d).
Predictive value of a positive test result = a / (a + b).
Predictive value of a negative test result = d / (c + d).
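As a numeric illustration of the four formulas, the sketch below fills the cells a, b, c and d with hypothetical counts (they are not data from any cited study) and evaluates each ratio.

```python
# Hypothetical fourfold table:
#                 diseased   healthy
# test positive     a = 80    b = 40
# test negative     c = 20    d = 360
a, b, c, d = 80, 40, 20, 360

sensitivity = a / (a + c)  # diseased individuals who test positive
specificity = d / (b + d)  # healthy individuals who test negative
ppv = a / (a + b)          # positive tests that are truly diseased
npv = d / (c + d)          # negative tests that are truly healthy

print(f"sensitivity = {sensitivity:.3f}")
print(f"specificity = {specificity:.3f}")
print(f"predictive value of a positive test = {ppv:.3f}")
print(f"predictive value of a negative test = {npv:.3f}")
```

Sensitivity and specificity are computed down the columns (by disease status), whereas the two predictive values are computed along the rows (by test result), so a test can have high sensitivity and still a modest positive predictive value.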
Receiver Operating Characteristic (ROC) Curve
ROC curves are used to show whether a classification method is acceptable. In a ROC curve the true positive rate (TPR) is plotted against the false positive rate (FPR); the pairs of values are obtained by applying various threshold values when determining the true positive (TP) and false positive (FP) counts. In a two-class prediction problem, the numbers of condition-positive and condition-negative members of the total population are determined first. Then, using a certain threshold value, predictions are made and the results are classified as:
- (a) true positive (TP): a positive prediction for a positive condition
- (b) false negative (FN): a negative prediction for a positive condition
- (c) false positive (FP): a positive prediction for a negative condition
- (d) true negative (TN): a negative prediction for a negative condition
Based on these four parameters, the relative four ratios are determined as follows:
- TPR (true positive rate, or sensitivity): the number of true positives divided by the number of condition positives
- FNR (false negative rate): the number of false negatives divided by the number of condition positives
- FPR (false positive rate): the number of false positives divided by the number of condition negatives
- TNR (true negative rate, or specificity): the number of true negatives divided by the number of condition negatives
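The procedure described above (apply a threshold, count the four outcomes, form the four ratios) can be sketched as follows. The function names, the measurements and the rule "predict positive when the measurement is at or above the cut-off" are assumptions made for this illustration only.

```python
def classify_counts(measurements, has_condition, cutoff):
    """Count TP, FN, FP and TN for one cut-off value.

    Assumption: a measurement at or above the cut-off is predicted positive.
    """
    tp = fn = fp = tn = 0
    for value, positive in zip(measurements, has_condition):
        predicted_positive = value >= cutoff
        if positive and predicted_positive:
            tp += 1
        elif positive:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def rates(tp, fn, fp, tn):
    """Return (TPR, FNR, FPR, TNR) from the four counts."""
    condition_positive = tp + fn
    condition_negative = fp + tn
    return (
        tp / condition_positive,
        fn / condition_positive,
        fp / condition_negative,
        tn / condition_negative,
    )


# Hypothetical data: four condition-negative and four condition-positive subjects.
measurements = [85, 95, 102, 110, 98, 120, 150, 104]
has_condition = [False, False, False, False, True, True, True, True]

counts = classify_counts(measurements, has_condition, cutoff=100)
tpr, fnr, fpr, tnr = rates(*counts)
print("TP, FN, FP, TN =", counts)
print(f"TPR={tpr:.2f} FNR={fnr:.2f} FPR={fpr:.2f} TNR={tnr:.2f}")
```

Repeating the call with several cut-off values yields one (FPR, TPR) pair per threshold, which are the points of a ROC curve.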
ROC curves have the form:
The example of predicting diabetes from the blood glucose level is used to illustrate how these curves are generated. In this example, it is supposed that the blood glucose level of a healthy person is 70-110 mg/dL (average value 90 mg/dL), corresponding to condition negative, and that the blood glucose level of a non-healthy person is 90 to 180 mg/dL, corresponding to condition positive.
For this specific example, it is assumed that the sample comprises 1000 members, of whom 500 are non-healthy (condition positive) and 500 are healthy (condition negative). The distributions of the healthy and non-healthy populations are given in Tables 1 and 2, and their graphical presentation is given in Fig. 1:
Based on data provided in Tables 1 and 2 (or in Fig.1), the parameters TP, FN, FP and TN, as well as the respective rates (TPR, FNR, FPR and TNR) were determined for five different cut-off (threshold) levels: 90, 92, 95, 100 and 105 mg/dL.
The results of these calculations are presented in Table 3.
Then, the pairs of FPR and TPR were plotted to give the ROC curve.
Table 4. Pairs of FPR, TPR for plotting
FPR | TPR |
---|---|
0 | 0 |
0.02 | 0.962 |
0.084 | 0.978 |
0.188 | 0.988 |
0.442 | 0.998 |
0.526 | 1 |
1 | 1 |
As seen in Table 3, the maximum accuracy (ACC) was obtained in Case A3, which used 105 mg/dL as the cut-off (threshold) value; this value should therefore be used for the classification. Furthermore, the ROC curve passes very close to the perfect classification point (0,1), so the classification is very effective.
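Both statements can be checked against the (FPR, TPR) pairs of Table 4. The sketch below assumes the 500/500 split between condition-positive and condition-negative members stated for this example, recomputes the accuracy of each pair, and measures how far each point lies from (0,1). Table 4 does not say which cut-off produced which pair, so the pairs are not labelled with cut-off values here.

```python
import math

CONDITION_POSITIVE = 500  # non-healthy members in the example
CONDITION_NEGATIVE = 500  # healthy members in the example

# (FPR, TPR) pairs copied from Table 4.
roc_points = [
    (0, 0), (0.02, 0.962), (0.084, 0.978), (0.188, 0.988),
    (0.442, 0.998), (0.526, 1), (1, 1),
]


def accuracy(fpr, tpr, p=CONDITION_POSITIVE, n=CONDITION_NEGATIVE):
    """ACC = (TP + TN) / (P + N), with TP = TPR * P and TN = (1 - FPR) * N."""
    return (tpr * p + (1 - fpr) * n) / (p + n)


def distance_to_perfect(fpr, tpr):
    """Euclidean distance from a ROC point to the perfect point (0, 1)."""
    return math.hypot(fpr, 1 - tpr)


for fpr, tpr in roc_points:
    print(f"FPR={fpr:<6} TPR={tpr:<6} ACC={accuracy(fpr, tpr):.3f} "
          f"distance={distance_to_perfect(fpr, tpr):.3f}")

best_point = max(roc_points, key=lambda point: accuracy(*point))
print("highest accuracy at (FPR, TPR) =", best_point)
```

Under these assumptions the pair (0.02, 0.962) gives the highest accuracy, about 0.971, and it is also the point nearest to (0,1).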
Example of a Possible Screening and Confirmation Test in Medicine
Sensitivity and specificity are used to evaluate the validity of laboratory tests (not the results of the tests). In essence, they help decide whether to use a certain test and in which situations it works best. [1], [7]
Imagine we have 2 different buttons that start an alarm. The first button starts the alarm when it is barely touched, for example by a gust of wind or a feather. This button has high sensitivity and low specificity: it reacts to the smallest of signals, but it is not very specific to an intentional start of the alarm. We never miss a chance to start the alarm (low FN), but we often start it accidentally when we should not (high FP). [1], [7]
The second button only sets off the alarm if great pressure is applied. This button has high specificity and low sensitivity. It is very specific to setting off the alarm only when deliberately pressed, but it is not very sensitive to weak pressure. [1], [7]
In the real world, no test has both full sensitivity and full specificity. We are usually faced with a choice between a test with high sensitivity (and lower specificity) and a test with high specificity (and lower sensitivity). Usually a test with high sensitivity is used as the initial screening test. Those who receive a positive result on the first test are given a second test with high specificity, which is used as the confirmatory test. In these situations, both tests need to be positive for a definitive diagnosis. A single positive reading is not enough for a diagnosis, because each individual test has either a high chance of FP or a high chance of FN. For example, HIV is diagnosed using 2 tests: first an ELISA screening test, and then a confirmatory Western Blot if the first test is positive. [1], [7]
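The effect of requiring two positive tests can be sketched numerically. The sketch assumes that the two tests err independently of each other given the true disease status, which real tests may not satisfy, and the sensitivity and specificity values are hypothetical; they are not the published characteristics of ELISA or Western Blot.

```python
def serial_testing(se_screen, sp_screen, se_confirm, sp_confirm):
    """Combined sensitivity and specificity when BOTH tests must be positive.

    Assumption: the two tests are conditionally independent given the
    true disease status.
    """
    # A diseased person is diagnosed only if both tests detect the disease.
    combined_sensitivity = se_screen * se_confirm
    # A healthy person is misdiagnosed only if both tests are falsely positive.
    combined_specificity = 1 - (1 - sp_screen) * (1 - sp_confirm)
    return combined_sensitivity, combined_specificity


# Hypothetical tests: a sensitive screening test and a specific confirmatory test.
se, sp = serial_testing(se_screen=0.99, sp_screen=0.90,
                        se_confirm=0.95, sp_confirm=0.99)
print(f"combined sensitivity = {se:.4f}")
print(f"combined specificity = {sp:.4f}")
```

Under these assumptions the pair of tests reaches a higher specificity than either test alone, at the cost of a somewhat lower sensitivity, which matches the reasoning behind screening first and confirming second.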
There are also specific situations in which high specificity or high sensitivity is particularly important. Consider screening donations to a blood bank for blood-borne pathogens. In this situation, you want a very high sensitivity, because the consequences of a false negative (spreading disease to a recipient) are far worse than those of a false positive (discarding 1 blood donation). Now consider testing a patient for the presence of a disease that is treatable, but whose treatment has very serious side effects. In this case, you want a test with high specificity, because a false positive has major drawbacks. [1], [7]
References
- [1] Porta, Miquel. A Dictionary of Epidemiology. Sixth edition. Oxford, 2014.
- [2] Test Statistics. (n.d.). Retrieved November 23, 2016, from http://groups.bme.gatech.edu/groups/biml/resources/useful_documents/Test_Statistics.pdf
- [3] Fawcett, Tom (2006). "An Introduction to ROC Analysis". Pattern Recognition Letters, pp. 861-874.
- [4] Powers, David M W (2011). "Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation". Journal of Machine Learning Technologies, pp. 37-63.
- [5] Loong T (2003). Understanding sensitivity and specificity with the right side of the brain.
- [6] Altman DG, Bland JM (1994). Diagnostic tests. 1: Sensitivity and specificity.
- [7] Farlex Partner Medical Dictionary: Epidemiology, ROC analysis. © Farlex 2012.