Title: | Agreement of Nominal Scale Raters (with a Gold Standard) |
---|---|
Description: | Estimate agreement of a group of raters with a gold standard rating on a nominal scale. For a single gold standard rater the average pairwise agreement of raters with this gold standard is provided. For a group of (gold standard) raters the approach of S. Vanbelle, A. Albert (2009) <doi:10.1007/s11336-009-9116-1> is implemented. Bias and standard error are estimated via delete-1 jackknife. |
Authors: | Matthias Kuhn [aut, cre] |
Maintainer: | Matthias Kuhn <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.4.0 |
Built: | 2025-02-08 02:51:38 UTC |
Source: | https://github.com/cran/kappaGold |
The data are reported in a textbook by Fleiss; they are probably fictitious.
agreem_binary
A list of three matrices. Each matrix contains the results of one study involving two raters who use a binary rating scale ("+" and "-").
Chapter 18, Problems 18.3
Fleiss, J. L., Levin, B., & Paik, M. C. Statistical Methods for Rates and Proportions, 3rd edition, 2003, ISBN 0-471-52629-0
Fifty general hospital patients, admitted to the Monash Medical Centre in Melbourne, were randomly drawn from a larger sample described by Clarke et al. (1993). Agreement between two different screening tests and a diagnosis of depression was compared. The definition of depression included DSM-III-R Major Depression, Dysthymia, Adjustment Disorder with Depressed Mood, and Depression NOS. Depression was determined empirically using the Cutoff (McKenzie & Clarke, 1992) program. The screening tests consisted of
the Beck Depression Inventory (BDI) (Beck et al., 1961) and
the General Health Questionnaire (GHQ) (Goldberg & Williams, 1988).
depression
A matrix with 50 observations and 3 variables:
diagnoses as determined by the Cutoff program
Beck Depression Inventory
General Health Questionnaire
McKenzie, D. P. et al., Comparing Correlated Kappas by Resampling: Is One Level of Agreement Significantly Different from Another? J. psychiat. Res, Vol. 30, 1996. doi:10.1016/S0022-3956(96)00033-7
N = 30 patients were each diagnosed by n = 6 psychiatrists out of 43 psychiatrists in total, each assigning one of k = 5 diagnoses. The diagnoses are
Depression
PD (=Personality Disorder)
Schizophrenia
Neurosis
Other
diagnoses
A matrix with 30 rows and 6 columns:
1st of the six ratings per patient
2nd of the six ratings per patient
3rd of the six ratings per patient
4th of the six ratings per patient
5th of the six ratings per patient
6th of the six ratings per patient
A total of 43 psychiatrists provided diagnoses. In the actual study (Sandifer, Hordern, Timbury, & Green, 1968), between 6 and 10 psychiatrists from the pool of 43 were unsystematically selected to diagnose a subject. Fleiss randomly selected six diagnoses per subject to bring the number of assignments per patient down to a constant of six.
As there is no fixed set of six raters, ratings in the same column are not related to each other. Therefore, compared to the dataset of the same name in package irr, the six ratings per patient have been permuted.
Sandifer, M. G., Hordern, A., Timbury, G. C., & Green, L. M. Psychiatric diagnosis: A comparative study in North Carolina, London and Glasgow. British Journal of Psychiatry, 1968, 114, 1-9.
Fleiss, J. L. Measuring nominal scale agreement among many raters. Psychological Bulletin, 1971, 76(5), 378–382. doi:10.1037/h0031619
This dataset is also available as diagnoses in the irr package on CRAN.
The null hypothesis states that the kappas of all involved groups are the same ("homogeneous"). A prerequisite is that the groups are independent of each other, i.e., each group comprises different subjects and different raters. Each rater employs a nominal scale. The test requires estimates of kappa and its standard error per group.
kappa_test(kappas, val = "value0", se = "se0", conf.level = 0.95)
kappas |
list of kappas from different groups. It uses the kappa estimate and its standard error. |
val |
character. Name of field to extract kappa coefficient estimate. |
se |
character. Name of field to extract standard error of kappa. |
conf.level |
numeric. confidence level of confidence interval for overall kappa |
A common overall kappa coefficient across groups is estimated. The test statistic assesses the weighted squared deviance of the individual kappas from the overall kappa estimate. The weights depend on the provided standard errors. Under H0, the test statistics is chi-square distributed.
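A minimal sketch of this computation, assuming the usual inverse-variance weighting from Fleiss (section 18.1); this is a hypothetical helper for illustration, not the package's internal code:

# kappas: vector of per-group kappa estimates; ses: their standard errors
kappa_homogeneity <- function(kappas, ses) {
  w <- 1 / ses^2                                # inverse-variance weights
  overall <- sum(w * kappas) / sum(w)           # common overall kappa
  stat <- sum(w * (kappas - overall)^2)         # weighted squared deviance
  df <- length(kappas) - 1
  list(overall = overall, statistic = stat, df = df,
       p.value = pchisq(stat, df = df, lower.tail = FALSE))
}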
list containing the test results, including the entries statistic and p.value (class htest)
Joseph L. Fleiss, Statistical Methods for Rates and Proportions, 3rd ed., 2003, section 18.1
# three independent agreement studies (different raters, different subjects)
# each study involves two raters that employ a binary rating scale
k2_studies <- lapply(agreem_binary, kappa2)
# combined estimate and test for homogeneity of kappa
kappa_test(kappas = k2_studies, val = "value", se = "se")
Bootstrap test on kappa based on data with common subjects. The differences in kappa between all groups (but first) relative to first group (e.g., Group 2 - Group 1) are considered.
kappa_test_corr( ratings, grpIdx, kappaF, kappaF_args = list(), B = 100, alternative = "two.sided", conf.level = 0.95 )
ratings |
matrix. ratings as sbj x raters, including the multiple groups to be tested |
grpIdx |
list. Comprises numeric index vectors per group. Each group is defined as set of raters (i.e., columns) |
kappaF |
function or list of functions. kappa function to apply on each group. |
kappaF_args |
list. Further arguments for the kappa function. By default, these settings apply to all groups, but the settings can be specified per group (as list of lists). |
B |
numeric. number of bootstrap samples. At least 1000 are recommended for stable results. |
alternative |
character. Direction of alternative. Currently only "two.sided" is supported. |
conf.level |
numeric. confidence level for confidence intervals |
list. test results as class htest. The confidence interval shown by print refers to the 1st difference k1-k2.
Due to limitations of the htest print method, the confidence interval shown by print refers to the 1st difference k1-k2. If there are more than two groups, access all confidence intervals via the entry conf.int.
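To illustrate the resampling idea, here is a rough sketch assuming subject-level (row) resampling and a percentile interval for the difference of the first two groups; this is a simplified illustration, not the package's exact procedure:

# m: subjects x raters matrix; grp1, grp2: column indices of two rater groups
boot_kappa_diff <- function(m, grp1, grp2, kappaF = kappam_fleiss, B = 1000) {
  diffs <- replicate(B, {
    i <- sample(nrow(m), replace = TRUE)                   # resample subjects
    kappaF(m[i, grp1])$value - kappaF(m[i, grp2])$value    # difference k1 - k2
  })
  quantile(diffs, c(0.025, 0.975))                         # percentile interval
}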
# Compare Fleiss kappa between students and expert raters
# For real analyses use more bootstrap samples (B >= 1000)
kappa_test_corr(ratings = SC_test, grpIdx = list(S = 1:39, E = 40:50), B = 125,
                kappaF = kappam_fleiss,
                kappaF_args = list(variant = "fleiss", ratingScale = -2:2))
Cohen's kappa is the classical agreement measure when two raters provide ratings for subjects on a nominal scale.
kappa2(ratings, robust = FALSE, ratingScale = NULL)
ratings |
matrix (dimension nx2), containing the ratings as subjects by raters |
robust |
flag. Use robust estimate for random chance of agreement by Brennan-Prediger? |
ratingScale |
Possible levels for the rating, or NULL (the default). |
The ratings must be stored in a two-column object: each rater is a column and the subjects are in the rows. Every rating category is used and the levels are sorted. Weighting of categories is currently not implemented.
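For illustration, a minimal sketch of the classical (non-robust) computation from such a two-column matrix; the package's kappa2() additionally provides a standard error and the robust variant:

cohen_kappa <- function(m) {
  lev <- sort(unique(c(m[, 1], m[, 2])))                 # common rating scale
  tab <- table(factor(m[, 1], lev), factor(m[, 2], lev))
  p <- tab / sum(tab)
  po <- sum(diag(p))                                     # observed agreement
  pe <- sum(rowSums(p) * colSums(p))                     # chance agreement from marginals
  # the robust (Brennan-Prediger) variant would use pe <- 1 / length(lev)
  (po - pe) / (1 - pe)
}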
list containing Cohen's kappa agreement measure (value) or NULL if no valid subjects
# 2 raters have assessed 4 subjects into categories "A", "B" or "C"
# organize ratings as two column matrix, one row per subject rated
m <- rbind(sj1 = c("A", "A"),
           sj2 = c("C", "B"),
           sj3 = c("B", "C"),
           sj4 = c("C", "C"))

# Cohen's kappa -----
kappa2(ratings = m)

# robust variant ---------
kappa2(ratings = m, robust = TRUE)
Estimate agreement with a gold-standard rating for nominal categories.
Maintainer: Matthias Kuhn [email protected] (ORCID)
Authors:
Jonas Breidenstein [email protected]
When multiple raters judge subjects on a nominal scale, we can assess their agreement with Fleiss' kappa. It is a generalization of Cohen's kappa for two raters, and there are different variants of how chance agreement is assessed.
kappam_fleiss( ratings, variant = c("fleiss", "conger", "robust", "uniform"), detail = FALSE, ratingScale = NULL )
ratings |
matrix (subjects by raters), containing the ratings |
variant |
Which variant of kappa? Default is Fleiss (1971). Other options are Conger (1980) or robust variant. |
detail |
Should category-wise Kappas be computed? Only available for the Fleiss (1971) variant. |
ratingScale |
Specify possible levels for the rating. Default is NULL. |
Different variants of Fleiss' kappa are implemented.
By default (variant="fleiss"), the original Fleiss kappa (1971) is calculated, together with an asymptotic standard error and a test for kappa=0. The raters who rate the different subjects are not assumed to be the same (one-way ANOVA setting). The marginal category proportions determine the chance agreement.
Setting variant="conger" gives the variant of Conger (1980) that reduces to Cohen's kappa for m=2 raters. It assumes identical raters for the different subjects (two-way ANOVA setting). The chance agreement is based on the category proportions of each rater separately. Typically, the Conger variant yields slightly higher values than the Fleiss kappa.
variant="robust" assumes the chance agreement of two raters to be simply 1/q, where q is the number of categories (uniform model).
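For orientation, a small sketch of how the chance agreement differs between the Fleiss and the robust/uniform variant (pooled category proportions vs. 1/q); this mirrors the description above and is not the package code:

chance_agreement <- function(m, variant = c("fleiss", "robust")) {
  variant <- match.arg(variant)
  pj <- prop.table(table(c(m)))        # pooled category proportions over all raters
  if (variant == "fleiss") sum(pj^2)   # chance agreement from marginal proportions
  else 1 / length(pj)                  # uniform model: 1/q for q categories
}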
list containing Fleiss's kappa agreement measure (value) or NULL if no subjects
# 4 subjects were rated by 3 raters in categories "1", "2" or "3"
# organize ratings as matrix with subjects in rows and raters in columns
m <- matrix(c("3", "2", "3",
              "2", "2", "1",
              "1", "3", "1",
              "2", "2", "3"), ncol = 3, byrow = TRUE)
kappam_fleiss(m)

# show category-wise kappas -----
kappam_fleiss(m, detail = TRUE)
First, Cohen's kappa is calculated for each rater against the gold standard, which is taken from the 1st column by default. The average of these kappas is returned as 'kappam_gold0'. The variant setting (robust=) is forwarded to Cohen's kappa. A bias-corrected version 'kappam_gold' and a corresponding confidence interval are provided as well, via the jackknife method.
kappam_gold( ratings, refIdx = 1, robust = FALSE, ratingScale = NULL, conf.level = 0.95 )
ratings |
matrix. subjects by raters |
refIdx |
numeric. index of reference gold-standard raters. Currently, only a single gold-standard rater is supported. By default, it is the 1st rater. |
robust |
flag. Use robust estimate for random chance of agreement by Brennan-Prediger? |
ratingScale |
Possible levels for the rating, or NULL (the default). |
conf.level |
confidence level for confidence interval |
list. agreement measures (raw and bias-corrected kappa) with confidence interval. The entry raters refers to the number of tested raters, not counting the reference rater.
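The bias correction described above relies on the delete-1 jackknife over subjects. A minimal sketch of that correction under the standard jackknife formulas (hypothetical helpers, not the package's exact internals):

# m: ratings matrix with the gold standard in column 1
avg_kappa_gold <- function(m) {
  mean(sapply(2:ncol(m), function(j) kappa2(m[, c(1, j)])$value))
}
jackknife_corrected <- function(m) {
  full <- avg_kappa_gold(m)                     # raw average of pairwise kappas
  jack <- sapply(seq_len(nrow(m)),
                 function(i) avg_kappa_gold(m[-i, , drop = FALSE]))
  bias <- (nrow(m) - 1) * (mean(jack) - full)   # delete-1 jackknife bias estimate
  full - bias                                   # bias-corrected kappa
}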
# matrix with subjects in rows and raters in columns.
# 1st column is taken as gold-standard
m <- matrix(c("O", "G", "O",
              "G", "G", "R",
              "R", "R", "R",
              "G", "G", "O"), ncol = 3, byrow = TRUE)
kappam_gold(m)
This function expands upon Cohen's and Fleiss' kappa as measures of interrater agreement: it quantifies the agreement between two independent groups of raters while taking the heterogeneity within each group into account.
kappam_vanbelle( ratings, refIdx, ratingScale = NULL, weights = c("unweighted", "linear", "quadratic"), conf.level = 0.95 )
ratings |
matrix of subjects x raters for both groups of raters |
refIdx |
numeric. indices of raters that constitute the reference group. Can also be all negative to define rater group by exclusion. |
ratingScale |
character vector of the levels for the rating, or NULL (the default). |
weights |
optional weighting scheme: "unweighted" (the default), "linear", or "quadratic" |
conf.level |
confidence level for interval estimation |
Data need to be stored with raters in columns.
list. kappa agreement between two groups of raters
Vanbelle, S., Albert, A. Agreement between Two Independent Groups of Raters. Psychometrika 74, 477–491 (2009). doi:10.1007/s11336-009-9116-1
# compare student ratings with ratings of 11 experts
kappam_vanbelle(SC_test, refIdx = 40:50)
In medical education, the script concordance test (SCT) (Charlin, Gagnon, Sibert, & Van der Vleuten, 2002) is used to score physicians or medical students in their ability to solve clinical situations as compared to answers given by experts. The test consists of a number of items to be evaluated on a 5-point Likert scale.
SC_test
A matrix with 34 rows and 50 columns. Columns 1 to 39 are student raters, columns 40 to 50 are experts. Each rater applies to each clinical situation one of five levels ranging from -2 to 2 with the following meaning:
-2: The assumption is practically eliminated;
-1: The assumption becomes less likely;
0: The information has no effect on the assumption;
+1: The assumption becomes more likely;
+2: The assumption is virtually the only possible one.
Each item represents a clinical situation (called an 'assumption') likely to be encountered in the physician's practice. The situation has to be unclear, even for an expert. The task of the subjects being evaluated is to consider the effect of new information on the assumption in order to solve the situation. The data comprise 50 raters: 39 students and 11 experts.
Each rater judges the same 34 assumptions.
Sophie Vanbelle (personal communication, 2021)
Vanbelle, S., Albert, A. Agreement between Two Independent Groups of Raters. Psychometrika 74, 477–491 (2009). doi:10.1007/s11336-009-9116-1
The function generates simulation data according to given categories and probabilities, and can repeatedly apply the function kappam_gold(). Currently, there is no variation in probabilities from rater to rater; only sampling variability from the multinomial distribution is at work.
simulKappa(nRater, cats, nSubj, probs, mcSim = 10, simOnly = FALSE)
nRater |
numeric. number of raters. |
cats |
categories specified either as character vector or just the numbers of categories. |
nSubj |
numeric. number of subjects per gold standard category. Either a single number or as vector of numbers per category, e.g. for non-balanced situation. |
probs |
numeric square matrix (nCat x nCat) with classification probabilities. Row i gives the probabilities with which a rater classifies a subject whose gold-standard category is i. |
mcSim |
numeric. Number of Monte-Carlo simulations. |
simOnly |
logical. Need only simulation data? Default is FALSE. |
This function is future-aware: the repeated evaluation of kappam_gold() that it triggers can run according to the active future plan.
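For example, a parallel back end can be set up before the call (assuming the standard future workflow; the choice of back end is up to the user):

# enable parallel evaluation of the Monte-Carlo repetitions
future::plan(future::multisession)
res <- simulKappa(nRater = 8, cats = 3, nSubj = 11,
                  probs = matrix(c(.8, .1, .1,
                                   .1, .8, .1,
                                   .1, .1, .8), nrow = 3, byrow = TRUE),
                  mcSim = 50)
future::plan(future::sequential)   # back to sequential processing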
data frame of kappa-gold on the simulated datasets, or (when simOnly=TRUE) a list of length mcSim in which each element is a simulated data set with the gold rating in the first column followed by the raters.
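As a rough illustration of the data-generation step described above, one simulated dataset could be built like this (hypothetical helper under the stated assumptions, not the package's internal code):

# cats: vector of category labels (e.g., 1:3); probs: rows per gold category
simulate_one <- function(nRater, cats, nSubj, probs) {
  gold <- rep(cats, each = nSubj)                    # gold-standard category per subject
  ratings <- t(sapply(gold, function(g)
    sample(cats, size = nRater, replace = TRUE,
           prob = probs[match(g, cats), ])))         # each rater draws from the row of probs
  cbind(gold, ratings)                               # gold rating in the first column
}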
# repeatedly estimate agreement with goldstandard for simulated data
simulKappa(nRater = 8, cats = 3, nSubj = 11,
           # assumed prob for classification by raters
           probs = matrix(c(.6, .2, .1,   # subjects of cat 1
                            .3, .4, .3,   # subjects of cat 2
                            .1, .4, .5    # subjects of cat 3
                           ), nrow = 3, byrow = TRUE))
Staging of carcinoma is done by different medical professions. The gold standard is the (histo-)pathological rating of a tissue sample, but this information typically becomes available only late, after surgery. However, prior to surgery the carcinoma is also staged by radiologists in the clinical setting on the basis of MRI scans.
stagingData
A data frame with 21 observations and 6 variables:
the (histo-)pathological staging (gold standard) with categories I, II or III
the clinical staging with categories I, II or III by radiologist 1
the clinical staging with categories I, II or III by radiologist 2
the clinical staging with categories I, II or III by radiologist 3
the clinical staging with categories I, II or III by radiologist 4
the clinical staging with categories I, II or III by radiologist 5
These fictitious data were inspired by the OCUM trial. The simulation uses the following two assumptions: over-staging occurs more frequently than under-staging and an error by two categories is less likely than an error by only one category.
Stages conform to the UICC classification according to the TNM classification. Note that cases in stage IV do not appear in this data set and that the following description of stages is simplified.
I: up to T2, N0, M0
II: from T3, N0, M0
III: any T, N1/N2, M0
simulated data
Kreis, M. E. et al., MRI-Based Use of Neoadjuvant Chemoradiotherapy in Rectal Carcinoma: Surgical Quality and Histopathological Outcome of the OCUM Trial doi:10.1245/s10434-019-07696-y
Quick and simple jackknife routine to estimate the bias and standard error of an estimator.
victorinox(est, idx)
est |
estimator function |
idx |
maximal index vector for data of estimator |
list with jackknife information, bias and SE
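The delete-1 jackknife quantities that such a routine provides can be written compactly; a sketch under the usual jackknife formulas (hypothetical helper, the actual implementation may organize its output differently):

jackknife_stats <- function(est, idx) {
  full <- est(idx)                                            # estimate on the full data
  jack <- sapply(seq_along(idx), function(i) est(idx[-i]))    # delete-1 replicates
  n <- length(idx)
  bias <- (n - 1) * (mean(jack) - full)                       # jackknife bias estimate
  se <- sqrt((n - 1) / n * sum((jack - mean(jack))^2))        # jackknife standard error
  list(estimate = full, bias = bias, se = se)
}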
https://de.wikipedia.org/wiki/Jackknife-Methode