An example of a ROC curve. (A) Ten test cases are ranked in decreasing order of classification score (e.g. estimated class posterior probability). Each threshold on the score is associated with a specific false positive rate and true positive rate. For example, by thresholding the scores at 0.45 (or anywhere in the interval between 0.4 and 0.5), we misclassify one actual negative case (#3) and one actual positive case (#8). We may translate this into a classification rule: 'if p(x) ≥ 0.45, classify x as positive; otherwise classify it as negative'.
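The mapping from a single score threshold to one (FPR, TPR) point can be spelled out in a few lines. The scores and labels below are hypothetical stand-ins for the ten ranked cases in panel (A), arranged so that the 0.45 cutoff misclassifies one negative and one positive as described in the caption; the variable names are ours, not the paper's.

```python
import numpy as np

# Hypothetical scores and labels standing in for the ten ranked cases in panel (A);
# the actual values appear only in the figure.
scores = np.array([0.95, 0.90, 0.80, 0.70, 0.65, 0.60, 0.50, 0.40, 0.30, 0.20])
labels = np.array([1,    1,    0,    1,    1,    1,    1,    1,    0,    0])  # 1 = actual +, 0 = actual -

threshold = 0.45                     # any cutoff in (0.4, 0.5) gives the same confusion matrix
predicted = scores >= threshold      # the rule: predict + when the score reaches the threshold

tp = np.sum(predicted & (labels == 1))
fp = np.sum(predicted & (labels == 0))
fn = np.sum(~predicted & (labels == 1))
tn = np.sum(~predicted & (labels == 0))

tpr = tp / (tp + fn)                 # true positive rate
fpr = fp / (fp + tn)                 # false positive rate
print(f"FPR = {fpr:.2f}, TPR = {tpr:.2f}")   # one FP (case #3), one FN (case #8)
```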
Examples of ROC curves calculated by pairwise sequence comparisons using BLAST [20] and Smith-Waterman [21] and by structural comparisons using DALI [22]. The query was Cytochrome C6 from Bacillus pasteurii, the + group consisted of the other members of the Cytochrome C superfamily, and the – set was the rest of the SCOP40mini dataset, both taken from record PCB00019 of the Protein Classification Benchmark collection [47]. The diagonal corresponds to the random classifier. Curves running higher indicate better classification performance.
that the analysis behind the ROC convex hull extends to multiple classes and multi-dimensional convex hulls.
One method for handling n classes is to produce n different ROC graphs, one for each class. Call this the class reference formulation. Specifically, if C is the set of all classes, ROC graph i plots the classification performance using class c_i as the positive class and all other classes as the negative class, i.e.
$P_i = c_i$  (2)
$N_i = \bigcup_{j \neq i} c_j \in C$  (3)
While this is a convenient formulation, it compromises one of the attractions of ROC graphs, namely that they are insensitive to class skew: each negative set N_i comprises the union of n − 1 classes, so changes in prevalence within these classes may alter the ROC graph for class c_i.
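As a concrete illustration of the class reference formulation, the sketch below builds one one-vs-rest ROC graph per class. The dataset and classifier (iris with logistic regression, via scikit-learn) are our own choices for illustration and are not taken from the paper.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

# Illustrative three-class problem; any classifier that produces scores would do.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = clf.predict_proba(X_test)          # one score column per class

# Class reference formulation: ROC graph i uses class c_i as positive
# and the union of all other classes as negative.
for i in range(probs.shape[1]):
    fpr, tpr, _ = roc_curve(y_test == i, probs[:, i])
    print(f"class {i}: AUC = {auc(fpr, tpr):.3f}")
```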
Experiments with deliberately corrupted microarray data. Circles represent cases of class ER–, full dots represent cases of class ER+. Model #1 results from applying diagonal linear discriminant analysis to the original training set (upper panel). Model #2 results from applying diagonal linear discriminant analysis to a corrupted training set (middle panel), in which the class labels of two randomly selected cases (one ER+ and one ER–) were swapped. Both models are then applied to the same test set.
Averaging and comparison of ROC curves. Repeating the calculation on data randomly sampled from the same set generates a bundle of ROC curves that can be aggregated into an average curve (bold line) with confidence intervals. The error bars in the figure correspond to 1.96 SD (95% confidence). Higher running curves and larger AUC values correspond to better performance. Inset A: If ROC curves cross (data taken from Table 1; the continuous line corresponds to column a, the dotted line
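A minimal sketch of how such an average curve and its 1.96 SD band can be produced, here by vertical averaging of ROC curves over bootstrap resamples; the resampling scheme and the function name are our assumptions, not the exact procedure behind the figure.

```python
import numpy as np
from sklearn.metrics import roc_curve

def averaged_roc(y, scores, n_rep=100, seed=0):
    """Vertically average ROC curves computed on repeated random resamples."""
    y, scores = np.asarray(y), np.asarray(scores)
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.0, 1.0, 101)          # common FPR grid for interpolation
    tprs = []
    for _ in range(n_rep):
        idx = rng.integers(0, len(y), len(y))  # bootstrap resample of the data
        if len(np.unique(y[idx])) < 2:         # skip resamples missing a class
            continue
        fpr, tpr, _ = roc_curve(y[idx], scores[idx])
        tprs.append(np.interp(grid, fpr, tpr))
    tprs = np.asarray(tprs)
    mean_tpr = tprs.mean(axis=0)
    band = 1.96 * tprs.std(axis=0)             # 95% band, as in the figure's error bars
    return grid, mean_tpr, mean_tpr - band, mean_tpr + band
```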
Ranking scenarios for calculating ROC. In the elementwise scenario (A), each query is compared to a dataset of + and – train examples. A ROC curve is prepared for each query, and the integrals (AUC values) are combined to give the final result for a group of queries. In the groupwise scenario (B), the queries of the test set are ranked according to their similarity to the +train group, and the AUC value calculated from this ranking is assigned to the group. Note that both A and B are one-class scenarios.
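The two scenarios can be written down compactly. In the sketch below, sim[q, t] is assumed to hold the similarity score of test query q to training example t, and the label arrays mark members of the + class with 1; the names and the nearest-neighbour group score are our assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def elementwise_auc(sim, train_labels):
    # (A) one ROC per query: rank the +/- train examples by similarity to the
    # query, take the AUC, then combine the per-query AUCs (here: average them).
    return float(np.mean([roc_auc_score(train_labels, sim[q]) for q in range(sim.shape[0])]))

def groupwise_auc(sim, train_labels, test_labels):
    # (B) one ROC per group: each test query receives a single score against the
    # +train group (here: its nearest-neighbour similarity), and the whole test
    # set is ranked by that score.
    group_scores = sim[:, np.asarray(train_labels) == 1].max(axis=1)
    return roc_auc_score(test_labels, group_scores)
```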
Experiments with deliberately corrupted microarray data. Model #1 is built from the original training set, whereas model #2 is built from deliberately corrupted training data. Both models are applied to the same test set.
Database-wide comparison using cumulative AUC curves and similarity measures. The three methods are BLAST [20], Smith-Waterman [21] and DALI [22]; the comparison includes 55 classification tasks defined in the SCOP40mini dataset of the Protein Classification Benchmark [47]. The comparison was carried out by nearest neighbor analysis using a groupwise scenario (Figure 7). Each graph plots the total number of classification tasks for which a given method exceeds a score threshold (left axis). The right
The piece-wise constant calibration map derived from the convex hull in Fig. 3. The original score distributions are indicated at the top of the figure, and the calibrated distributions are on the right. We can clearly see the combined effect of binning the scores and redistributing them over the interval [0, 1].
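For readers who want to reproduce such a map, a piecewise-constant calibration of this kind can be obtained with isotonic regression (the pool-adjacent-violators algorithm), which is closely related to calibrating via the ROC convex hull; the data below are placeholders, not the scores behind the figure.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Placeholder scores and 0/1 labels; in practice these come from a held-out set.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 200)
scores = labels * 0.3 + rng.normal(0.4, 0.25, 200)   # noisy scores correlated with the labels

# Isotonic regression yields a monotone, piecewise-constant map from raw scores
# to calibrated values in [0, 1], i.e. it bins the scores and redistributes them.
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrated = calibrator.fit_transform(scores, labels)
```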
(A) A test set containing n = 12 cases of two classes that require two decision thresholds. (B) An AUC of 0.5 does not necessarily indicate a useless model when the classification requires two thresholds (the XOR problem).
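A tiny numeric illustration of panel (B): with the + cases concentrated in the middle of the score range, no single threshold separates the classes and the AUC is 0.5, yet a rule with two thresholds classifies every case correctly. The twelve scores below are hypothetical, chosen only to reproduce this situation.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical n = 12 test set: the + class occupies the middle of the score range.
scores = np.array([0.10, 0.20, 0.30, 0.40, 0.45, 0.50, 0.55, 0.60, 0.70, 0.80, 0.90, 1.00])
labels = np.array([0,    0,    0,    1,    1,    1,    1,    1,    1,    0,    0,    0])

print(roc_auc_score(labels, scores))                        # 0.5: one threshold is useless
two_threshold_rule = (scores >= 0.40) & (scores <= 0.70)    # accept scores between two cutoffs
print((two_threshold_rule == labels.astype(bool)).mean())   # 1.0: perfect with two thresholds
```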
Comparison of various ranking and scoring scenarios, calculated by varying the number of negatives in the ranking. Average AUCs were calculated for all 246 classification tasks defined on sequences taken from the SCOP database, compared with the Smith-Waterman algorithm. The error bars indicate standard deviations calculated over the 246 tasks; they measure dataset variability, not the uncertainty of the evaluation. Note that the group-wise scenario with likelihood-ratio scoring gives values that
ROC curves from a plain chest radiography study of 70 patients with solitary pulmonary nodules (Table 3).
A. A plot of test sensitivity (y coordinate) versus false positive rate (x coordinate) obtained at each cutoff level.
B. The fitted, or smooth, ROC curve estimated under the assumption of a binormal distribution (see the sketch following this caption). The parametric estimate of the area under the smooth ROC curve is 0.734, with a 95% confidence interval of 0.602 to 0.839.
C. The empirical ROC curve. The disc
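The binormal model referred to in panel B assumes normally distributed scores in both groups, which gives the smooth curve TPR = Φ(a + b·Φ⁻¹(FPR)) and the closed-form area Φ(a/√(1 + b²)). The parameter values below are illustrative placeholders (picked to give an area near the reported 0.734), not the study's fitted estimates.

```python
import numpy as np
from scipy.stats import norm

# Illustrative binormal parameters (not the fitted values from the radiography study).
a, b = 0.88, 1.0

fpr = np.linspace(1e-6, 1 - 1e-6, 500)        # avoid the endpoints of the probit
tpr = norm.cdf(a + b * norm.ppf(fpr))         # smooth (fitted) ROC curve
auc = norm.cdf(a / np.sqrt(1 + b**2))         # closed-form area under the binormal ROC
print(f"binormal AUC = {auc:.3f}")            # ~0.73 with these placeholder parameters
```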
Two ROC curves (A and B) with equal area under the ROC curve. However, these two ROC curves are not identical. In the high false positive rate range (or high sensitivity range) test B is better than test A, whereas in the low false positive rate range (or low sensitivity range) test A is better than test B.
Constructing a ROC curve from ranked data. The TP, TN, FP and FN values are determined relative to a moving threshold; an example is shown by an arrow in the ranked list (left). Above the threshold, + data items count as TP and − data items as FP. Therefore, a threshold of 0.6 produces the point FPR = 0.1, TPR = 0.7, as shown in inset B. The full plot is produced by moving the threshold through the entire range of scores. Data were randomly generated according to the distributions shown in inset A.
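The construction described above amounts to the following sweep over the ranked list. The helper below is a sketch (tied scores are not handled specially); with ten + and ten − items arranged as in inset A, the threshold of 0.6 that keeps seven true positives and one false positive yields the point FPR = 0.1, TPR = 0.7.

```python
import numpy as np

def roc_points(labels, scores):
    """Move the threshold through the ranked list and record one (FPR, TPR)
    point per step; items at or above the threshold are predicted +."""
    order = np.argsort(-np.asarray(scores))       # rank by decreasing score
    labels = np.asarray(labels)[order]
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for lab in labels:                            # lower the threshold one item at a time
        if lab:
            tp += 1                               # a + item above the threshold is a TP
        else:
            fp += 1                               # a - item above the threshold is an FP
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return fpr, tpr
```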
Four ROC curves with different values of the area under the ROC curve. A perfect test (A) has an area under the ROC curve of 1. The chance diagonal (D, the line segment from (0, 0) to (1, 1)) has an area under the ROC curve of 0.5. ROC curves of tests with some ability to distinguish between those subjects with and those without a disease (B, C) lie between these two extremes. Test B, with the higher area under the ROC curve, has a better overall diagnostic performance than test C.
An introduction to ROC analysis
Tom Fawcett
Institute for the Study of Learning and Expertise, 2164 Staunton Court, Palo Alto, CA 94306, USA
Available online 19 December 2005
Abstract
Receiver operating characteristics (ROC) graphs are useful for organizing classifiers and visualizing their performance. ROC graphs are commonly used in medical decision making, and in recent years have been used increasingly in machine learning and data mining research. Although ROC graphs are apparently simple, there are some common misconceptions and pitfalls when using them in practice.
Binary classification. Binary classifiers are algorithms (models, classifiers) capable of distinguishing two classes, denoted + and −. The parameters of the model are determined from known + and – examples; this is the training phase. In the testing phase, test examples are shown to the predictor. Discrete classifiers can assign only labels (+ or −) to the test examples. Probabilistic classifiers assign a continuous score to the test examples, which can be used for ranking.
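A short illustration of this glossary entry, using scikit-learn with synthetic data (our choice, not part of the text): the model is fitted on known + and − examples in the training phase, a discrete prediction yields labels only, and the probabilistic output yields a continuous score that can be used for ranking.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for known + and - examples.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)  # training phase

hard_labels = model.predict(X_test)          # discrete classifier: labels only
scores = model.predict_proba(X_test)[:, 1]   # probabilistic classifier: continuous scores
ranking = scores.argsort()[::-1]             # the scores can be used to rank the test examples
```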
ROC curves calculated with the perfcurve function for (from left to right) a perfect classifier, a typical classifier, and a classifier that does no better than a random guess.