Hi all,
I’d like to announce the debut of the “Online Kappa Calculator.” It calculates free-marginal and fixed-marginal kappa–a chance-adjusted measure of interrater agreement–for any number of cases, categories, or raters. Roman and Nikko will appreciate this; now we don’t have to do these calculations the hard way!
Thanks to Walubengo M. Singoro for his fantastic programming work on this.
Check out:
http://justus.randolph.name/kappa
I’m going on vacation to the States tomorrow. If you have any comments, I’ll get back to you in a few weeks. Is anyone else interested in creating a collection of calculators for obscure, but intensely useful, statistics? I have ideas for many student programming projects along this line.

49 comments
August 20, 2008 at 6:49 pm
Brenda
I have not been able to open the calculator (Java applet problem)… Any ideas on getting the applet to run?
August 21, 2008 at 10:53 pm
Justus Randolph
Hi Brenda,
I checked the site and it is working on my computer. Earlier I had a problem on this computer where Java wouldn’t load. Uninstalling and then reinstalling the latest version of Java resolved the issue. Let me know if this tack doesn’t work.
August 21, 2008 at 10:56 pm
Justus Randolph
P.S., you have to accept the digital certificate, so that you can cut and paste your data from a spreadsheet program.
August 22, 2008 at 2:14 pm
roman
Usability-wise, I think you should move the calculator behind one extra link (or a ‘Start’ button). As it is, the Java application loads whether you want it or not (e.g., when you just want to check the references). On most computers, the loading takes quite a long while and browsers often choke on it.
September 12, 2008 at 4:43 am
Greg
I spent about two hours entering data into the calculator (224 cases and 17 categories). When I pushed the button to calculate the results, I was continually asked to ensure I had entered the number of raters (which I had–2 raters). Eventually, I lost everything, including two hours of my life.
September 13, 2008 at 3:40 am
Justus Randolph
We figured out that the two hours of lost life was because the empty cells were left blank and not filled in with zeros. I’ll make a note in the “Instructions” that no cell can be left empty–empty cells need to have a zero.
Greg, thanks for pointing out this point of confusion and sorry about the wasted time.
Justus
September 13, 2008 at 4:12 am
Greg
When I made it clear that the calculator was not working for me, Justus went to a lot of trouble to solve the problem for me. Now that we understand the error that was made, everything works great.
The calculator caused a headache for a period, but once Justus figured out where things were going wrong, the calculator has been fantastic and I can feel my lost life returning!
November 11, 2008 at 6:14 am
Jeet
Hi Justus,
First of all I would like to thank you very much for hosting this online kappa calculator. I have entered my data (with 0 where a category was not used), but I am not getting any result when the calculate button is clicked. Please tell me what else to do or how to get the results. Regards.
Jeet
November 12, 2008 at 5:55 am
GA
Justus,
Thank you for providing an online calculator. Could you please tell me if my data set is appropriate for this analysis? I have 8 pairs of raters who each rated their own set of 10 subjects. (The same people are always paired together.) So, different subjects are sometimes rated by the same raters and sometimes not. I want to put the data from all rater-pairs into one analysis, rather than calculating a kappa for each pair of raters. Is that possible?
Thanks very much.
November 12, 2008 at 9:24 pm
Justus Randolph
Hi GA,
The Online Kappa Calculator isn’t set up for this kind of situation. Kappa only works when you have all raters rating the same sample. It sounds like a generalizability theory sort of problem to me. See:
http://www.psychology.sdsu.edu/faculty/matt/Pubs/GThtml/GTheory_GEMatt.html
If you wanted to stay with a Kappa approach, you could report the value of kappa for all rater pairs since there are only eight. It seems like that would tell the reader what they needed to know: highest kappa, lowest kappa, median/mean kappa. They would subjectively get a sense of overall interrater agreement.
November 12, 2008 at 9:50 pm
Justus Randolph
Hi Jeet,
Sometimes I have to hit the “submit” button a few times if the server is slow. Did you get it working? If not, get back to me and tell me more about your data set and I’ll try to figure it out.
Justus
November 13, 2008 at 5:07 am
GA
Thanks very much, Justus. I appreciate your time.
November 20, 2008 at 7:37 pm
Jeet
Thanks for the tip Justus, but unfortunately it did not work.
My data has 31 cases, 6 categories and 3 observers.
0 3 0 0 0 0
0 3 0 0 0 0
0 0 1 2 0 0
0 3 0 0 0 0
0 0 1 1 0 1
1 2 0 0 0 0
1 2 0 0 0 0
0 2 0 1 0 0
1 2 0 0 0 0
0 3 0 0 0 0
1 2 0 0 0 0
0 1 0 2 0 0
1 2 0 0 0 0
0 3 0 0 0 0
0 3 0 0 0 0
0 2 0 1 0 0
0 3 0 0 0 0
0 1 0 1 1 0
2 1 0 0 0 0
0 3 0 0 0 0
0 2 0 1 0 0
0 0 2 0 1 0
0 1 1 1 0 0
0 0 1 0 1 1
0 0 1 0 1 1
0 0 2 0 1 0
0 0 2 0 1 0
0 0 1 1 1 0
0 0 1 1 1 0
0 2 0 1 0 0
0 1 1 0 1 0
Hope this may help you to help me more. Thanks for your time.
Cheers
Jeet
November 20, 2008 at 10:15 pm
Justus Randolph
Hi Jeet,
I was able to cut and paste your data set above right into the calculator and it worked fine. I’m not sure what kind of error you were having, but often uninstalling and then reinstalling Java will get the Online Kappa Calculator working.
Anyway, your results are below:
Percent of overall agreement: 0.419355
Fixed-Marginal Kappa: 0.153976
Free-Marginal Kappa: 0.303226
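For readers who want to check these numbers without the applet, here is a minimal Python sketch. It assumes that fixed-marginal kappa is Fleiss' kappa and that free-marginal kappa uses 1/number of categories as the expected agreement, as described elsewhere in this thread. The function and variable names are illustrative and are not taken from the calculator's source code.

```python
# Minimal sketch: multirater kappa from a cases-by-categories table of counts.
# Each row is one case; each entry is how many raters chose that category.

def kappa_stats(table):
    """Return (percent of overall agreement, fixed-marginal kappa, free-marginal kappa)."""
    n_cases = len(table)
    n_categories = len(table[0])
    n_raters = sum(table[0])  # assumes every case was rated by the same number of raters
    ordered_pairs = n_raters * (n_raters - 1)

    # Overall agreement: share of rater pairs that agree, averaged over cases.
    p_overall = sum(
        sum(count * (count - 1) for count in row) / ordered_pairs for row in table
    ) / n_cases

    # Fixed-marginal expected agreement (Fleiss): sum of squared category proportions.
    total_ratings = n_cases * n_raters
    proportions = [
        sum(row[j] for row in table) / total_ratings for j in range(n_categories)
    ]
    p_expected_fixed = sum(p * p for p in proportions)

    # Free-marginal expected agreement: 1 / number of categories.
    p_expected_free = 1 / n_categories

    fixed = (p_overall - p_expected_fixed) / (1 - p_expected_fixed)
    free = (p_overall - p_expected_free) / (1 - p_expected_free)
    return p_overall, fixed, free


# Jeet's data as posted above: 31 cases, 6 categories, 3 raters.
JEET_DATA = """
0 3 0 0 0 0
0 3 0 0 0 0
0 0 1 2 0 0
0 3 0 0 0 0
0 0 1 1 0 1
1 2 0 0 0 0
1 2 0 0 0 0
0 2 0 1 0 0
1 2 0 0 0 0
0 3 0 0 0 0
1 2 0 0 0 0
0 1 0 2 0 0
1 2 0 0 0 0
0 3 0 0 0 0
0 3 0 0 0 0
0 2 0 1 0 0
0 3 0 0 0 0
0 1 0 1 1 0
2 1 0 0 0 0
0 3 0 0 0 0
0 2 0 1 0 0
0 0 2 0 1 0
0 1 1 1 0 0
0 0 1 0 1 1
0 0 1 0 1 1
0 0 2 0 1 0
0 0 2 0 1 0
0 0 1 1 1 0
0 0 1 1 1 0
0 2 0 1 0 0
0 1 1 0 1 0
"""

table = [[int(cell) for cell in line.split()] for line in JEET_DATA.strip().splitlines()]
p_overall, fixed, free = kappa_stats(table)
print(f"Percent of overall agreement: {p_overall:.6f}")
print(f"Fixed-Marginal Kappa: {fixed:.6f}")
print(f"Free-Marginal Kappa: {free:.6f}")
```

Under those assumptions the sketch should print the same three values reported above, which is a handy cross-check when the applet will not load.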
November 26, 2008 at 10:31 pm
Jeet
Thank you very much Justus.
I am going to try using online kappa calculator from home and see, as I still have the same problem with it from my office (after pasting the data and clicking calculate, nothing happens, even after repeated clicking). I will let you know what happens from home.
Once again thanks for calculating my data.
Jeet
December 17, 2008 at 6:27 pm
JKS
Hi,
Thanks for providing this great resource! BUT…I’m also having problems getting the program to work at all. I’ve tried changing browsers (Firefox, IE) and asking a co-worker to try on their Apple computer, but nothing happens when I push CALCULATE. I also reinstalled Java.
I have 239 cases with 4 categories and 5 raters. Is it possible this is just too many cases?
Thanks for any advice,
/JKS
5 0 0 0
0 0 1 4
4 1 0 0
0 0 2 3
1 0 4 0
0 0 5 0
2 0 3 0
0 0 0 5
3 0 2 0
0 1 2 2
0 0 4 1
4 1 0
4 0 0 1
2 2 0 1
0 0 4 1
0 0 0 5
0 0 0 5
0 0 4 1
0 0 4 1
5 0 0 0
0 0 0 5
0 0 3 2
0 0 0 5
2 0 3 0
5 0 0 0
5 0 0 0
5 0 0 0
0 0 2 3
1 0 0 4
1 0 4 0
0 0 0 5
0 0 1 4
1 0 4 0
0 0 0 5
0 0 0 5
4 0 1 0
4 0 1 0
0 0 0 5
0 0 5 0
5 0 0 0
5 0 0 0
0 0 3 2
0 0 0 5
0 0 1 4
0 1 3 1
5 0 0 0
0 0 0 5
5 0 0 0
5 0 0 0
0 0 5 0
5 0 0 0
5 0 0 0
0 0 0 5
5 0 0 0
5 0 0 0
0 0 3 2
0 0 2 3
0 0 5 0
0 0 1 4
0 0 0 5
0 1 3 1
0 0 1 4
0 0 1 4
4 0 1 0
0 0 5 0
0 0 2 3
0 0 4 1
0 0 4 1
0 0 0 5
0 0 4 1
0 0 2 3
0 0 2 3
0 0 2 3
0 5 0 0
0 0 0 5
0 0 5 0
4 0 1 0
0 0 5 0
0 0 5 0
1 0 2 2
3 0 2 0
1 0 3 1
5 0 0 0
0 0 5 0
0 0 2 3
0 0 3 2
4 0 1 0
0 0 0 5
2 0 0 3
4 0 1 0
0 0 1 4
0 0 0 5
5 0 0 0
5 0 0 0
0 0 4 1
0 0 4 1
5 0 0 0
0 0 4 1
2 0 3 0
2 0 3 0
0 0 2 3
1 0 4 0
0 0 1 4
0 0 0 5
0 1 4 0
0 0 0 5
0 0 0 5
0 0 5 0
2 0 3 0
1 0 4 0
1 0 4 0
0 0 4 1
0 0 3 2
2 0 3 0
1 0 3 1
0 0 4 1
0 0 4 1
2 0 3 0
2 0 3 0
2 0 1 2
0 0 2 3
0 0 3 2
0 0 0 5
0 0 3 2
0 0 3 2
0 0 0 5
0 2 3 0
0 0 1 4
0 0 2 3
0 0 3 2
1 0 2 2
5 0 0 0
0 0 3 2
4 0 1 0
0 0 1 4
5 0 0 0
1 0 0 4
0 0 1 4
5 0 0 0
5 0 0 0
5 0 0 0
5 0 0 0
5 0 0 0
0 0 4 1
0 0 5 0
0 0 1 4
0 0 1 4
0 0 5 0
5 0 0 0
4 0 1 0
0 0 1 4
0 0 0 5
0 1 2 2
0 0 0 5
0 0 2 3
0 0 0 5
3 0 2 0
5 0 0 0
0 1 4 0
5 0 0 0
4 0 1 0
0 0 4 1
0 0 2 3
1 0 4 0
0 1 4 0
0 0 2 3
0 0 1 4
0 1 4 0
0 0 0 5
0 0 5 0
5 0 0 0
0 0 0 5
0 0 1 4
4 0 1 0
3 0 2 0
1 0 4 0
0 0 0 5
0 0 0 5
0 0 0 5
0 0 0 5
0 0 0 5
3 0 2 0
3 0 2 0
0 0 0 5
0 0 0 5
0 0 0 5
0 1 3 1
0 0 1 4
0 0 2 3
0 0 1 4
0 0 0 5
1 0 4 0
0 0 4 1
0 0 1 4
0 0 5 0
5 0 0 0
1 0 0 4
0 0 0 5
0 0 0 5
5 0 0 0
1 1 2 1
3 0 2 0
0 0 0 5
0 0 1 4
0 0 4 1
0 0 2 3
0 2 3 0
5 0 0 0
0 0 0 5
0 2 3 0
0 0 0 5
1 0 4 0
5 0 0 0
0 0 0 5
1 4 0 0
5 0 0 0
0 0 0 5
5 0 0 0
0 0 4 1
5 0 0 0
0 0 4 1
3 0 2 0
2 0 2 1
0 0 4 1
0 0 5 0
5 0 0 0
5 0 0 0
5 0 0 0
0 0 2 3
0 0 0 5
0 0 0 5
0 0 0 5
0 0 0 5
1 0 4 0
0 0 0 5
4 0 1 0
2 0 0 3
4 0 1 0
0 0 0 5
December 17, 2008 at 6:32 pm
JKS
Hmm! I now see there is an error in my data, with a missing value. Now it works!
So never mind! The only suggestion I have is that it would be nice to get an error message when this type of error occurs. As it was, I wasn’t sure if I had done something wrong or if the program wasn’t working anymore.
Anyway, THANKS so much for making this available!
/JKS
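A pre-check along the lines JKS suggests can be sketched in a few lines of Python. This is not part of the Online Kappa Calculator; the function name and messages are hypothetical. It only assumes what the thread describes: every row needs one count per category, and the counts in a row should add up to the number of raters.

```python
def find_bad_rows(text, n_categories, n_raters):
    """Return a list of (line number, problem) for rows that would break the calculation."""
    problems = []
    for line_no, line in enumerate(text.strip().splitlines(), start=1):
        cells = line.split()
        if len(cells) != n_categories:
            problems.append((line_no, f"expected {n_categories} values, found {len(cells)}"))
            continue
        if not all(cell.isdigit() for cell in cells):
            problems.append((line_no, "non-numeric entry"))
            continue
        total = sum(int(cell) for cell in cells)
        if total != n_raters:
            problems.append((line_no, f"counts sum to {total}, expected {n_raters}"))
    return problems


# A few rows in the style of the data above; the third row is missing a value.
sample = """5 0 0 0
0 0 1 4
4 1 0
4 0 0 1"""

for line_no, problem in find_bad_rows(sample, n_categories=4, n_raters=5):
    print(f"Row {line_no}: {problem}")
```

Running a check like this before pasting the data points straight at the short row instead of leaving you to guess why nothing happens.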
January 7, 2009 at 10:23 am
Kim
Justus,
This is so cool! Thank you, thank you.
I wonder if you can tell me if I have things set up correctly. I have 16 items, rated on a 4 point scale, with 21 subjects rating the items.
So….
16 cases
4 categories
21 raters
Am I right?
Kim
January 7, 2009 at 10:36 am
Kim
Justus,
One other thing, I have collected my data through the Delphi Survey Method. I wonder which of the statistics, Fixed-Marginal Kappa or Free-Marginal Kappa would be most appropriate?
Kim
January 8, 2009 at 12:14 am
Justus Randolph
Hi Kim,
I wonder if either of the Kappa statistics are appropriate in your case. First, if I understand the Delphi method correctly, raters are able to change their ratings after hearing a summary of ratings, making the ratings dependent. I don’t think that the kappa family of statistics are appropriate for dependent ratings. (By “dependent” I mean that one rating or rater affects the ratings of another.) By being able to revise their answers based on other answers, it is obvious that the raters will be able to do better than chance.
Second, it seems that your scale is not actually categorical, but rather is continuous or ordinal because you wrote “on a 4 point scale.” If it is continuous or ordinal you would be better off using a different statistic. A great multirater agreement statistic for continuous scales is the intraclass correlation coefficient (see, e.g., http://www.nyu.edu/its/statistics/Docs/intracls.html).
However, if your ratings are independent and your scale is categorical (apples, oranges, and pears), one of the kappa statistics would be right for you and indeed there would be 16 cases, 4 categories, and 21 raters. Like Brennan and Prediger, I suggest using free-marginal kappa if raters didn’t need to have a certain number of “1” ratings, a certain number of “2” ratings, and so on.
Hope this helps. Feel free to write back if you have any other questions. The next comment might help explain some of the reasons why there are better agreement statistics than kappa for ordinal or continuous scales.
–Justus
January 8, 2009 at 1:04 am
Justus Randolph
An Online Kappa Calculator user named Lindsay and I had an e-mail discussion that I thought other users might benefit from. I will excerpt parts of our conversation below, with permission. Lindsay, thanks for your great questions and for letting me share them with others. Feel free to write back if you have any more questions or if I didn’t answer your questions.
***Lindsay wrote:
I am interested in using your online multi-rater free margin Kappa calculator for a research project; however, I am having a statistical problem and hoping you can help me understand.
I have an ordinal scale of 0 (unacceptable), 1 (acceptable), and 3 (excellent).
I have 3 raters, each using the scale above to rate 30 images.
When I enter the data into your online calculator, I get 0.10 free margin Kappa. If I transform the data into a dichotomous scale (0 unacceptable and 1 acceptable), the free margin Kappa goes up to 0.86.
What bothers me is that performing standard Cohen’s Kappa calculations via SPSS for Rater 1 vs. Rater 2, Rater 2 vs. Rater 3 and so on yields much lower kappas for the dichotomous ratings, while your online calculator yields much higher for dichotomous variables.
I’m trying to understand why it’s reversed.
***Justus wrote:
It seems like there are a few questions going on here:
Why does the free-marginal kappa differ from the fixed-marginal kappa?
Why does that relationship change when I dichotomize my variable?
Should I dichotomize my variable or not?
What statistic should I use?
===
–Why does the free-marginal kappa differ from the fixed-marginal kappa?
All other things being equal, free-marginal and fixed-marginal kappa
differ because of prevalence and bias. You can read about this in:
Brennan, R. L., & Prediger, D. J. (1981). Coefficient Kappa: Some
uses, misuses, and alternatives. Educational and Psychological
Measurement, 41, 687-699.
Or
Randolph, J. J. (2005). Free-marginal multirater kappa: An alternative
to Fleiss’ fixed-marginal multirater kappa. Paper presented at the
Joensuu University Learning and Instruction Symposium 2005, Joensuu,
Finland, October 14-15th, 2005. (ERIC Document Reproduction Service
No. ED490661)
–Why does that relationship change when I dichotomize my variable?
Theoretically, since the free-marginal kappa will increase as the
number of categories increases, the free-marginal kappa should go down
after dichotomizing. However, I suspect that dichotomizing drastically
increased the percent of overall agreement and that is why you saw the
kappa values you did. I bet that it is much easier to categorize a
case as (unacceptable) or (acceptable or excellent) than unacceptable,
acceptable, or excellent. I’m guessing that there is a strong
distinction between unacceptable and (acceptable or excellent) and a
very fine distinction between acceptable and excellent. I strongly
suggest that you crosstabulate how many of each type of errors there
were (unacceptable-acceptable; unacceptable-excellent;
acceptable-excellent). Given chance, there should be about the same
number of errors in each category. I’m guessing that that won’t be the
case. Graphing where the errors are can tell you a lot about the
construct or phenomenon you are investigating. I always suggest doing
that crosstabulation.
In short, the kappa is going to change when you dichotomize the
variable because the percent of overall agreement is probably going to
change because of systematic categorizing errors in your case.
–Should I dichotomize my variable or not?
One can “massage” free-marginal kappa by increasing the number of
categories. Therefore, I suggest using only as many categories as are
theoretically justifiable. Since you chose three categories a priori,
I would stick with three categories.
–Which statistic should I use?
The Kappa family of statistics are appropriate when you have nominal
variables. So, a different type of statistic might be better for you.
My rationale for using a different statistic is that treating your
three categories as nominal assumes that an unacceptable-excellent
disagreement is the same degree of error as an acceptable-excellent or
unacceptable-acceptable error. If you want to treat your categories as
ordinal, there are better statistics than Kappa. I don’t have my books
handy right now, but you could probably find the right statistic from
Siegel, S., & Castellan, N. J. (1988). Nonparametric statistics for
the behavioral sciences (2nd ed.). New York: McGraw-Hill.
If you think it is appropriate to treat your categories as continuous,
a good candidate would be one of the variations of the intra-class
correlation coefficient. See:
http://www.ats.ucla.edu/stat/spss/library/whichicc.htm
***Lindsay wrote:
So free-marginal kappa increases as the number of categories increases. This is interesting, when applied to my results using your online calculator. I have found the exact opposite, entering in Non-dichotomous (3 category) data for 8 attributes using 3 raters and then entering the Dichotomous data. The dichotomous results are uniformly higher – much higher – than their non-dichotomous counterparts.
Some samples:
Q1 – 0.48 (Non-Dichotomous) 0.87 (Dichotomous)
Q2 – 0.33 (ND) 0.73 (D)
Q3 – 0.37 (ND) 0.82 (D)
Q4 – 0.27 (ND) 0.78 (D)
Strange. The pattern is consistent all the way down the line of attributes.
Any idea why this would occur? I have 30 cases per attribute, so I’m not working with extremely small samples.
*** Justus wrote:
So, ALL OTHER THINGS BEING EQUAL, an increase in the number of
categories will increase the value of free-marginal kappa. As an
exercise, try holding the percent of overall agreement constant and
changing the number of categories. I illustrate this in Figure 3 of
http://www.eric.ed.gov/ERICDocs/data/ericdocs2sql/content_storage_01/0000019b/80/1b/c3/27.pdf
I expect that the discrepancy you find between kappa values when you
convert a three-category system to a two-category system is because
the percent of overall agreement changes drastically when you switch
from a three-category system to a two-category system. The increase in
percent of overall agreement from reducing categories must outweigh
the associated increase in agreement expected by chance. (Remember
that the kappa formula is (Poverall – Pexpected)/(1 – Pexpected); two
variables affect the kappa value–not just the percent of expected
agreement.) In your data set, doesn’t the percent of overall agreement
dramatically increase when you treat the three categories as if they
were two categories? If so, that would help explain the discrepancy.
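A quick way to try the exercise suggested above is a short Python sketch of the free-marginal formula, with expected agreement set to 1/number of categories. The agreement values used here (0.60, 0.55, and 0.90) are hypothetical illustrations, not Lindsay's data.

```python
def free_marginal_kappa(p_overall, n_categories):
    """Free-marginal kappa: (Poverall - Pexpected) / (1 - Pexpected), Pexpected = 1/k."""
    p_expected = 1 / n_categories
    return (p_overall - p_expected) / (1 - p_expected)


# Hold the percent of overall agreement constant and vary the number of categories.
for k in (2, 3, 4, 5):
    print(k, round(free_marginal_kappa(0.60, k), 3))

# Hypothetical case: collapsing three categories into two raises overall agreement
# from 0.55 to 0.90, which more than offsets the higher chance agreement (1/2 vs. 1/3).
three_category = free_marginal_kappa(0.55, 3)
two_category = free_marginal_kappa(0.90, 2)
print(round(three_category, 3), round(two_category, 3))
```

With agreement held at 0.60, kappa rises as categories are added. In the hypothetical collapse, the two-category kappa still comes out higher because overall agreement changed so much.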
I would make a graph of what kind of errors (i.e., disagreements) the
raters made (e.g., 1-3, 1-2, and 2-3). I suspect how you split up the
three categories into two categories will make a difference in percent
of overall agreement and, therefore, make a difference in the values
of kappa you find. If the errors are not split up evenly, then how you
dichotomize the categories will make a difference in percent of
overall agreement. For example, if most of the errors are 2-3 errors,
then you can “hide” those errors by combining categories 2 and 3 into
let’s say a “4” category and then recalculating the errors as if there
were only a 1 and 4 category.
Overall, I would suggest that you use as many categories as you
originally had the raters use. That seems like it would give you the
most accurate picture of agreement.
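The crosstabulation of disagreements described above can be sketched in Python as follows. The ratings are hypothetical (three raters, categories 1 to 3), and the function names are illustrative only.

```python
from collections import Counter
from itertools import combinations


def disagreement_counts(ratings):
    """Count disagreeing rater pairs by type, e.g. (2, 3) for a 2-versus-3 disagreement."""
    counts = Counter()
    for case in ratings:
        for a, b in combinations(case, 2):
            if a != b:
                counts[tuple(sorted((a, b)))] += 1
    return counts


def percent_agreement(ratings):
    """Share of rater pairs that agree, pooled over all cases."""
    pairs = [(a, b) for case in ratings for a, b in combinations(case, 2)]
    return sum(a == b for a, b in pairs) / len(pairs)


# Hypothetical ratings: one tuple per case, one entry per rater.
ratings = [(1, 1, 1), (2, 3, 3), (2, 2, 3), (1, 2, 2), (3, 3, 3), (2, 3, 2)]

print(disagreement_counts(ratings))
print(round(percent_agreement(ratings), 3))

# Merging categories 2 and 3 "hides" every 2-3 disagreement.
merged = [tuple(2 if r == 3 else r for r in case) for case in ratings]
print(disagreement_counts(merged))
print(round(percent_agreement(merged), 3))
```

In this made-up example most disagreements are 2-3 disagreements, so merging those two categories raises the percent of overall agreement sharply, which is the effect described above.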
February 13, 2009 at 6:43 am
Jessica
The table is not showing up on the site. Has it been moved?
Thanks!
February 13, 2009 at 8:45 am
Justus Randolph
Hi Jessica,
The Online Kappa Calculator works fine for me. I just tried it. Please describe your problem in more detail and perhaps I can help troubleshoot. I had a computer once where I had to install the latest version of Java to get the Calculator to load properly.
March 3, 2009 at 8:45 pm
Sofie
Hi Justus,
First I want to thank you for your work and the free calculator on the internet!
For my reliability study, I have calculated Fleiss’ kappa and in addition the free-marginal multirater kappa. My question is how I can calculate the confidence intervals, which are also important to report. I have already calculated the confidence intervals for Fleiss’ kappa in another statistical program. Can I report the same confidence intervals for the free-marginal kappas?
Kind regards,
Sofie
March 5, 2009 at 5:45 pm
Claire
Hi Justus,
I’m in a bit of a similar situation to Kim (one of the previous posters). I’m using a 4-point Likert scale to calculate inter-rater reliability. I used Cohen’s kappa when I only had 2 raters, but in my next study I have up to ten raters.
Someone suggested using a weighted version of Fleiss’ kappa (as I used a weighted version of Cohen’s kappa for my first study). So I guess I have a few questions:
1. Does a weighted Fleiss’ kappa exist?
2. If Fleiss’ kappa is still appropriate for me to use, which statistic would be most appropriate: Fixed-Marginal Kappa or Free-Marginal Kappa?
3. What should I do about missing variables (i.e. only 9 responses, but 10 raters)?
Thanks so much for your help & great program!
Claire
March 8, 2009 at 4:07 am
Justus Randolph
Hi Sofie,
Provide me with a little bit of information and I can figure out the confidence intervals for you and show you how I did it.
I’ll need:
-The total number of cases
-The number of cases you calculated kappa for
-The percent of overall agreement
-The number of categories
-The confidence intervals you are interested in (e.g., 95%, 90%. . . )
March 8, 2009 at 4:31 am
Justus Randolph
Hi Claire,
1. I haven’t read anything about a weighted Fleiss’s Kappa, but that doesn’t mean it doesn’t exist. If you can’t find it, it probably wouldn’t be too hard to figure a formula out. I’m swamped right now, but that might make a nice statistical paper someday for someone ; )
2. Since you have a continuous or ordinal scale (depending on how you think about Likert scales), I wouldn’t use a Kappa statistic at all. Why not use an intra-class correlation if you can consider the scale to be continuous? (See P. E. Shrout & Joseph L. Fleiss (1979). “Intraclass Correlations: Uses in Assessing Rater Reliability”. Psychological Bulletin 86 (2): 420–428. doi:10.1037//0033-2909.86.2.420. Or, see http://www.ats.ucla.edu/stat/spss/library/whichicc.htm for how to compute it with SPSS.) If it’s ordinal, Siegel and Castellan’s Nonparametric Statistics for the Behavioral Sciences has a whole chapter on measures of association for ordinal data.
As you know, unweighted kappa statistics (like the ones used in the Online Kappa Calculator) are meant for assessing the reliability of categorical data. For example, unweighted Kappa statistics give the same weight to a strong disagreement (e.g., strongly disagree v. strongly agree) as a weak disagreement (e.g., strongly disagree v. disagree). Weighting the Kappa can let you assign weights for each level of disagreement if that is appropriate in your situation. That might be particularly helpful if you want to assign customized weights to various levels of disagreement. But, there are plenty of commonly used and easily computed multirater statistics for ordinal and continuous data. Why not use those?
About which to use if you were going to use a multi-rater kappa: The Kappa family of statistics can be divided into two categories: those that use 1/number of categories as the percent of expected agreement (free-marginal) and those that don’t (fixed-marginal). There is a lot of debate about which situations call for the various types of Kappa, but I’m convinced by Brennan and Prediger’s argument (you can find the reference on the bottom of the Online Kappa Calculator page) that one should use fixed-marginal kappas (like Cohen’s kappa or Fleiss’s kappa) when you have a situation where you tell raters, for example, “Categorize these ten cases into two categories, AND MAKE SURE THAT YOU END UP WITH FIVE CASES IN EACH CATEGORY” and should use free-marginal kappas when you have a situation where you tell raters, for example, “Categorize these ten cases into two categories. It doesn’t matter how many cases end up in each category.”
3. Like most Kappa formulas, the formula used in the Online Kappa Calculator needs a full data set. My best advice in your case is to just format the data set disregarding the rater who didn’t fully respond. I’m not aware of research that investigates the effects of using different missing data strategies on the values of the various kappa statistics. –There’s another good topic for a statistics paper.
Let me know if you have any other questions,
Take care,
Justus
March 8, 2009 at 8:43 pm
Sofie
Hi Justus,
I hope this is the information you need:
Total number of cases: 40 (population)
The number of cases I calculated kappa for: 18
Percent of overall agreement:
0,680 – 0,660 – 0,675 – 0,745 – 0,740 – 0,880 – 0,865 – 0,960 – 0,975 – 0,650 – 0,655 – 0,635 – 0,745 – 0,800 – 0,905 – 0,955 – 0,970 – 0,990
Number of categories: 2
Confidence interval I would like to calculate: 95 %
Thanks for your help!
Sofie
March 10, 2009 at 1:37 am
Justus Randolph
Hi Sofie,
One more thing–how many raters were there?
Justus
March 12, 2009 at 12:06 am
Sofie
There were 5!
March 12, 2009 at 1:54 am
Justus Randolph
Hi Sofie,
I’m still having a little trouble making sense of this. I figured that I could average the 18 percent-of-overall-agreement values you gave to get the overall percent of overall agreement, but I’m not sure if that is right. That average is 80.4722%, but it seems to me that the percent of overall agreement should be a multiple of 1/180 (with five raters, there are ten possibilities for agreement per case: 1-2, 1-3, 1-4, 1-5, 2-3, 2-4, 2-5, 3-4, 3-5, 4-5; 10 possibilities * 18 cases = 180 possibilities). I need to figure out how many agreements there actually were out of the 180 possibilities for agreement in your data set.
When you use the Online Kappa Calculator for your data set that has 18 cases (drawn from a sample of 40 cases), 5 raters, and 2 categories, what are the specific values for percent of overall agreement, fixed-marginal and free-marginal kappa?
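The pair counting above can be checked with a few lines of Python. This sketch only restates the arithmetic in this comment, using the 18 agreement values Sofie posted (written with decimal points); the variable names are illustrative.

```python
from math import comb

n_raters = 5
n_items = 18  # the 18 agreement values posted above

# Number of distinct rater pairs per case, and in total across the 18 items.
pairs_per_item = comb(n_raters, 2)
possibilities = pairs_per_item * n_items

values = [
    0.680, 0.660, 0.675, 0.745, 0.740, 0.880, 0.865, 0.960, 0.975,
    0.650, 0.655, 0.635, 0.745, 0.800, 0.905, 0.955, 0.970, 0.990,
]
mean_agreement = sum(values) / len(values)

# If the 18 values really were 18 cases, this would have to be a whole number.
implied_agreements = mean_agreement * possibilities

print(pairs_per_item, possibilities)
print(round(mean_agreement, 6))
print(round(implied_agreements, 2))
```

The implied number of agreements works out to 144.85 rather than a whole number, which is why the 18 values cannot be agreement figures for 18 single cases.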
March 15, 2009 at 3:43 am
Sofie
Hi Justus,
I think there must have been a misunderstanding on my side. I’ll explain the whole situation: 5 raters examined 18 ribs (rib 2-10 right and rib 2-10 left) of 40 people. For each rib, the 5 raters chose whether there was a blockade “yes” or “no” (=2 categories). Then I calculated the Fleiss’ kappa and the Free-marginal kappa for each rib.
So, I calculated the Fleiss’ and Free-marginal kappa 18 times, but for each rib there were 40 cases.
So for each rib:
Total number of cases: 40 (population)
Number of categories: 2
Confidence interval I would like to calculate: 95 %
Number of Raters: 5
Percent of overall agreement:
Rib 2 right = 0,680
Rib 3 right = 0,660
Rib 4 right = 0,675
Rib 5 right = 0,745
Rib 6 right = 0,740
Rib 7 right = 0,880
Rib 8 right = 0,865
Rib 9 right = 0,960
Rib 10 right = 0,975
Rib 2 left = 0,650
Rib 3 left = 0,655
Rib 4 left = 0,635
Rib 5 left = 0,745
Rib 6 left = 0,800
Rib 7 left = 0,905
Rib 8 left = 0,955
Rib 9 left = 0,970
Rib 10 left = 0,990
Sofie
March 27, 2009 at 6:49 am
jrandolp
Hi Sofie,
Sorry for taking so long to get back to you. It’s been a really hectic week, or two.
Now I’ve got a good sense for what your data set is like. However, I’m not entirely sure what population you mean to make inferences about. If you still need help after reading my explanations below send me an e-mail (justus@randolph.name) and we can set up an appointment for me to do some statistical consulting for you, if you desire. I’m happy to write this long explanation here because it might have value for Online Kappa Calculator users in general.
On a side note, it looks like it’s not very easy for people to agree on whether a rib is blocked or not. When there is a lot of disagreement I suggest that you make a table of who tends to agree with whom and who tends to disagree with whom and why they tend to agree or disagree. You might find out that a simple clarification of the procedure is what is needed to boost agreement. Plus, you can find a lot of interesting results from that kind of sleuthing. For example, in a content analysis I did on publication bias we found that researchers tend to bury nonsignificant results in nonnumerical text and emphasize significant findings in numerical text and tables. We wouldn’t have found that had we not had low kappa and tried to troubleshoot the source of our disagreement.
GENERATING CONFIDENCE INTERVALS AROUND FREE-MARGINAL KAPPA WHEN YOU ARE INFERRING FROM A RELIABILITY SUBSAMPLE TO THE ENTIRE SAMPLE
Typically, I see confidence intervals drawn around a kappa statistic when one is making inferences from a reliability sub-sample to a sample. This might happen when you have limited time or resources and have multiple raters rate different cases in your sample. In this case, it is customary to overlap some of the cases to determine to what degree raters are agreeing. If the raters agree, there is justification for having different raters rate a different set of cases. Even if rating work isn’t shared, it’s good practice to have a second rater (or more raters) rate a sub-sample of cases to establish that the obtained ratings are not “the idiosyncratic results of one rater’s subjective judgment” (Neuendorf, K.A., 2002, The Content Analysis Guidebook. Thousand Oaks, CA: Sage.).
In the case where not all raters rate all cases, the confidence intervals around the sample Kappa show the range where Kappa likely would have fallen had all raters rated all cases. That is, the inference is from the subsample of reliability cases to the population of sample cases. For example, in my dissertation I randomly sampled a set of 352 cases. I rated all cases and had a second rater rate a random subsample of 53 cases. I calculated kappa for each case and confidence intervals around the kappa. The confidence intervals indicated the range in which the population Kappa would likely have fallen had the second rater rated all cases. You can read the details from http://www.archive.org/details/randolph_dissertation
Sofie, in the case of your data set, since all raters rated all 40 cases, you don’t need to calculate confidence intervals if you meant to make an inference from the reliability subsample to all sampled cases. You know what the population parameter is; you don’t have to make any inferences.
Since this will probably be of interest to other Online Kappa Calculator users, I wrote a program to calculate confidence intervals around free-marginal kappa in the case of generalizing from a reliability subsample to a sample. I’m a big fan of resampling, especially for obscure statistics like free-marginal Kappa, so I used resampling here. You can run this program on a giftware resampling program, Statistics 101, which is based on the Resampling Stats language. You can download Statistics 101 from http://www.statistics101.net/. See http://www.statistics101.net/QuickReference.pdf for a quick reference to the software and to make sense of the program below.
STATISTICS 101 PROGRAM TO CALCULATE CONFIDENCE INTERVALS AROUND FREE-MARGINAL KAPPA AND PERCENT OF OVERALL AGREEMENT WHEN INFERRING FROM A RELIABILITY SUBSAMPLE TO ALL SAMPLED CASES.
This program presumes that there is a total of 1000 sampled cases, that the number of expected agreements in the 1000 sampled cases is 750, that the total number of expected disagreements in the sampled cases is 250, that the size of the reliability subsample is 100, that the percent of overall agreement found in the reliability subsample is 0.75 (or 75%), that the percent of expected agreement is 0.5 (remember that percent of expected agreement is 1/number of rating categories), and that the desired confidence intervals are 95% intervals. Agreements are labeled with a “1” and disagreements are labeled with a “2”.
You can find the number of expected agreements to put in line 1 by multiplying your percent of overall agreement by the total number of sampled cases (here, 0.75*1000 = 750). The number of expected disagreements is the total number of sampled cases multiplied by (1-percent of overall agreement); here it is 250 because (1-0.75)*1000 = 250. Note that only whole numbers are usable here, so you might have to make your sample size slightly larger or smaller so that your percent of overall agreement can be expressed as a proportion of whole numbers.
You can modify this program to fit your own needs by changing the numerical values in line 1 (i.e., replace 750 with your own number of expected agreements and 250 with your own number of expected disagreements), line 2 (replace 100 with the size of the reliability subsample), and line 3 (replace 0.5 with the percent of expected agreement in the reliability subsample; expected agreement = 1/number of rating categories).
URN 750#1 250#2 pop
COPY 100 size
COPY 0.5 expect
REPEAT 100000
SAMPLE size pop samp$
COUNT samp$ =1 agree$
DIVIDE agree$ size over$
SUBTRACT over$ expect numerat
SUBTRACT 1 expect denom
DIVIDE numerat denom kappa$
SCORE kappa$ kappa
SCORE over$ over
END
PERCENTILE kappa (2.5 50 97.5) kappatiles
PERCENTILE over (2.5 50 97.5) overtiles
PRINT kappatiles
PRINT overtiles
The results are displayed below:
kappatiles: (0.32 0.5 0.66). This means that the 95% confidence interval around the median free-marginal kappa (0.5) runs from 0.32 to 0.66.
overtiles: (0.66 0.75 0.83). This means that the 95% confidence interval around the median percent of overall agreement (0.75) runs from 0.66 to 0.83.
Note that it’s possible to get a discrepancy between the median kappa or median percent of overall agreement reported here and the values reported by the Online Kappa Calculator, for two reasons: (1) The number of expected agreements and expected disagreements in the population might not correspond exactly with the percent of overall agreement. Remember that only whole numbers are possible in line 1. (2) The Online Kappa Calculator calculates kappa and percent of overall agreement differently than they are calculated here. The Online Kappa Calculator does not use the median kappa and does not use resampling.
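For readers who don't use Statistics101, the same resampling idea can be sketched in Python. This is only a rough translation under my own assumptions, not the program above: the function names are made up for illustration, disagreements are coded 0 instead of 2, sampling is done with replacement (which is how I understand SAMPLE to behave), and the percentile rule is a simple nearest-rank rule that may differ slightly from the one Statistics101 uses.

```python
import math
import random


def percentiles(values, points=(2.5, 50, 97.5)):
    """Nearest-rank percentiles; Statistics101 may interpolate differently."""
    ordered = sorted(values)
    n = len(ordered)
    return tuple(ordered[min(n - 1, max(0, math.ceil(p / 100 * n) - 1))]
                 for p in points)


def subsample_intervals(n_agree=750, n_disagree=250, size=100,
                        expect=0.5, reps=10000, seed=1):
    """Resample reliability subsamples from the full set of sampled cases."""
    rng = random.Random(seed)
    # 1 = agreement, 0 = disagreement (the program above uses 2 instead of 0)
    pop = [1] * n_agree + [0] * n_disagree
    kappas, overalls = [], []
    for _ in range(reps):
        samp = rng.choices(pop, k=size)   # draw with replacement
        over = sum(samp) / size           # percent of overall agreement
        kappas.append((over - expect) / (1 - expect))
        overalls.append(over)
    return percentiles(kappas), percentiles(overalls)


if __name__ == "__main__":
    kappatiles, overtiles = subsample_intervals()
    print("kappatiles:", kappatiles)
    print("overtiles:", overtiles)
```

Because the draws are random, the endpoints will wobble a little from run to run and from the Statistics101 output, but they should land close to the intervals reported above.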
GENERATING CONFIDENCE INTERVALS AROUND FREE-MARGINAL KAPPA WHEN INFERRING FROM ALL SAMPLED CASES TO ALL POSSIBLE CASES IN THE UNIVERSE.
One use for confidence intervals around kappa that I haven’t personally seen in any research, but that I can imagine, is making an inference from all sampled cases to the universe of cases that could have been sampled. Sofie, I think that these are probably the Kappa intervals that you are looking for. If I understand correctly, you had 5 raters rate 40 of the same rib (i.e., rib 1A or whatever). I would calculate percent of overall agreement for each of the 40 cases and use the program below. Replace the values in parentheses in line 1 with the 40 percent-of-overall-agreement values you got for a particular rib. I suspect that you want to report interrater reliability for each rib (i.e., 1a, 2a, etc.) separately.
PROGRAM FOR GENERATING CONFIDENCE INTERVALS AROUND FREE-MARGINAL KAPPA WHEN GENERALIZING FROM ALL SAMPLED CASES TO THE UNIVERSE OF ALL POSSIBLE CASES
To modify this program to meet your own needs, replace the data in parentheses in line 1 with the percent of overall agreement you found for each case. Replace the numerical value in line 3 with your own percent of expected agreement (i.e., 1/number of rating categories).
COPY ( 1 1 1 1 1 0 0 0 0 0 ) pop
SIZE pop size
COPY 0.5 expect
REPEAT 10000
SHUFFLE pop pop$
SAMPLE size pop$ agree$
MEAN agree$ mean$
SUBTRACT mean$ expect numerat
SUBTRACT 1 expect denom
DIVIDE numerat denom kappa$
SCORE kappa$ kappa
SCORE mean$ mean
END
PERCENTILE kappa (2.5 50 97.5) kappatiles
PERCENTILE mean (2.5 50 97.5) overtiles
PRINT kappatiles
PRINT overtiles
The results are listed below:
kappatiles: (-0.6 0.0 0.60). This means that the 95% confidence interval around the median free-marginal kappa (0.0) runs from -0.6 to 0.60.
overtiles: (0.2 0.5 0.8). This means that the 95% confidence interval around the median mean percent of overall agreement (0.5) runs from 0.2 to 0.8.
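Here is a hedged Python sketch of the two steps described for this case: first compute the percent of overall agreement for each case, then resample those per-case values. The per-case formula (the proportion of agreeing rater pairs) is my reading of "percent of overall agreement for each case"; the function names and the nearest-rank percentile rule are my own assumptions, not part of Statistics101 or the Online Kappa Calculator.

```python
import math
import random


def case_agreement(row):
    """Proportion of agreeing rater pairs for one case.

    row holds the number of raters who chose each category.
    """
    n = sum(row)
    return (sum(c * c for c in row) - n) / (n * (n - 1))


def percentiles(values, points=(2.5, 50, 97.5)):
    """Nearest-rank percentiles; Statistics101 may interpolate differently."""
    ordered = sorted(values)
    n = len(ordered)
    return tuple(ordered[min(n - 1, max(0, math.ceil(p / 100 * n) - 1))]
                 for p in points)


def universe_intervals(agreements, expect=0.5, reps=10000, seed=1):
    """Resample the per-case agreement values with replacement."""
    rng = random.Random(seed)
    size = len(agreements)
    kappas, means = [], []
    for _ in range(reps):
        resample = rng.choices(agreements, k=size)
        mean = sum(resample) / size
        kappas.append((mean - expect) / (1 - expect))
        means.append(mean)
    return percentiles(kappas), percentiles(means)


if __name__ == "__main__":
    # Hypothetical counts: 5 raters, 2 categories, 4 cases (not real data)
    table = [[5, 0], [4, 1], [3, 2], [5, 0]]
    agreements = [case_agreement(row) for row in table]
    print(universe_intervals(agreements, expect=0.5))
```

With the ten example values from the program above (five 1s and five 0s), this should give intervals close to the ones reported there.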
April 8, 2009 at 11:57 pm
Sofie
Hey,
I used the “program when generalizing from all sampled cases to the universe”. And it worked! The free-marginal kappas correspond to the ones I calculated with your online kappa calculator, but now they are all negative. Is it ok if I just take the absolute value or is there more I (or you) should know?
The free-marginal kappas I calculated now are also a little bit bigger. For example, 0,67 in the online kappa calculator becomes 0,69 (median) in Statistics 101.
Sofie
April 9, 2009 at 3:36 am
jrandolp
Hi Sofie,
A negative kappa means that you would have done better if you had just guessed (i.e., a kappa of zero). So, no, the absolute value doesn’t work for Kappas. If the point estimate is on, but the sign is wrong, I’m guessing that you (or I) just got a sign switched somewhere.
I’m glad that this worked for you.
Take care,
Justus
April 16, 2009 at 9:57 am
thelovelydays
Hi Justus,
I have used the equations you described in your paper presented at the symposium in 2005 to calculate a free marginal multi-rater kappa, using Excel. It works well and I have checked its accuracy with your online calculator.
Now, I want to be able to calculate SE and 95% CI. I see your discussion on this above. I downloaded the program (Statistics101), but really could not make heads or tails of it. Can you provide me with an equation I can use, or an Excel equation, for calculating SE? Also, what about p values? Can we derive those somehow too?
The data I have analysed is from 22 raters, 6 categories, 15 subjects.
Thanks for your help.
Ben
May 8, 2009 at 12:47 am
Justus Randolph
Hi Ben,
I replied a few posts down.
Justus
May 7, 2009 at 11:31 pm
Andy
Hello Justus
I have just come across your online calculator and had a go at one data set, entering the data manually. Worked brilliantly! Many thanks for setting this up. Now doing a second data set but wanted to paste an Excel spreadsheet to save inputting all those noughts. Sorry for being a bit dense, but how do I accept the digital certificate that allows me to proceed?
Cheers
May 8, 2009 at 12:46 am
Justus Randolph
Hi Andy,
The first time you went to the site, there should have been a pop-up screen that asked you if you would accept the digital certificate. Since you were successfully able to use the calculator, I think that you have already accepted it. Some folks have had a hard time inputting data from a spreadsheet. If worst comes to worst, you might have to enter it by hand. If you get an error, check to make sure there is a number in each cell and that each row adds up to the number of raters.
Justus
May 8, 2009 at 12:04 am
Justus
Hi Ben,
I’m a convert to resampling these days. The Statistics101 program has a good book and online documentation with it to help make sense of the formula. They also have good info at http://www.resample.com.
You could easily change it to get P values or SE; however, why calculate those things when the confidence intervals are what is really meaningful? (See the APA’s Task Force on Statistical Inference’s statement for a rationale.) I think the inferential question that we want to answer is what the likely values of the population Kappa are, not the probability of a kappa’s being 0. If you did do a P value, I would calculate the probability of the population’s kappa being at or above the threshold acceptability value (e.g., .70).
Now, with that said, you could find confidence intervals (or p values) around kappa in Excel using resampling. There’s a great article in Teaching Statistics called Resampling with Excel that will tell you how to do it. It is a fun exercise. The article is linked below:
http://www3.interscience.wiley.com/journal/118769449/abstract?CRETRY=1&SRETRY=0
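To make the SE and threshold ideas a bit more concrete, here is a minimal Python sketch under my own assumptions. It treats the standard deviation of the resampled kappas as a resampling estimate of the SE, and it reports the share of resampled kappas at or above .70. That share is only a rough stand-in for the threshold probability mentioned above, not a formal p value. The per-case agreement values are hypothetical, and the function names are made up for illustration.

```python
import random
import statistics


def bootstrap_kappas(agreements, expect, reps=10000, seed=1):
    """Free-marginal kappa for each resample of the per-case agreements."""
    rng = random.Random(seed)
    size = len(agreements)
    kappas = []
    for _ in range(reps):
        resample = rng.choices(agreements, k=size)   # with replacement
        kappas.append((sum(resample) / size - expect) / (1 - expect))
    return kappas


def summarize(kappas, threshold=0.70):
    """Return (resampling SE, share of resampled kappas >= threshold)."""
    se = statistics.stdev(kappas)
    share = sum(k >= threshold for k in kappas) / len(kappas)
    return se, share


if __name__ == "__main__":
    # Hypothetical per-case agreement for 15 subjects (not real data);
    # with 6 categories, expected agreement is 1/6.
    agreements = [1.0, 0.9, 0.8, 1.0, 0.7, 0.6, 0.9, 1.0,
                  0.8, 0.5, 0.9, 1.0, 0.7, 0.8, 0.9]
    se, share = summarize(bootstrap_kappas(agreements, expect=1 / 6))
    print("SE:", round(se, 3), "share at or above .70:", round(share, 3))
```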
June 20, 2009 at 3:53 am
Jeff Levsky
Dear Justus,
Thank you very much for your online calculator. I was trying to use it online (3 raters, 2 categories, 21 samples) but was getting a java console exception:
java.lang.NoSuchMethodError: java.math.BigDecimal.<init>(I)V
at KappaCal$1.mouseClicked(KappaCal.java:200)
at java.awt.AWTEventMulticaster.mouseClicked(Unknown Source)
Can you help me get this to work? Unfortunately, I have limited control over updating the java engine on my machines at work.
Thanks,
Jeff
June 20, 2009 at 4:11 am
Justus Randolph
Hi Jeff,
I don’t know a lot about Java, so I can’t help you with that error. I checked the calculator from two of my computers and the calculator worked fine in both cases. Perhaps the easiest solution is to find a different machine in which the Java is working. Whenever I’ve had a technical problem with the calculator, uninstalling then reinstalling Java worked every time. Sorry I couldn’t be of more help.
Take care,
Justus
August 22, 2009 at 1:21 am
Tracy
Hello Justus!
I have followed your online Kappa blog and it helped me realize I need to use an intraclass correlation (ICC) rather than a Kappa correlation. I have a question about the results I obtained…
I am trying to figure out if my calculations on the intraclass reliability coefficients I have obtained are adequate for my purposes. I asked 8 experts in the field of school climate 12 questions on a questionnaire I created, asking them to rate the importance of certain reliability, validity, norms, and variables on an assessment of school climate. This is the Likert scale I used: 0-Not important at all 1-Somewhat important 2-Moderately important 3-Very important 4-No opinion 5-Don’t know.
I decided to throw out the responses from the raters that chose either 4 or 5 as a response, since it did not indicate ordinal data. This left me with responses for only 5 raters.
I plugged in my data into: Calculation form of intraclass correlation coefficient by Dr.Funatsu, Professor of Meisei University, Tokyo at http://www.wwq.jp/javascript/intracorre.html and calculated an ICC of 0.5192307692307757.
I listed the frequency for each variable as “1”. Is this correct? Is 0.5192307692307757 considered an adequate ICC? Or is my sample size simply too small to give an adequate reading?
If you have a chance to respond, that would be wonderful–unfortunately, I have a very short deadline and I am running out of resources to determine an accurate answer.
Thanks so much!
Best regards,
Tracy
September 29, 2009 at 9:42 pm
Markku Paanalahti
Dear Justus
I have one case
8 categories
Two raters
Cat.0 Cat1 Cat2 Cat3 Cat4 Cat5 Cat6 Cat7
1 2 1 1 1 2 1 1
The program tells me that Kappa cannot be calculated. The two raters do not disagree at the same time in any of the categories.
Sincerely
Markku
October 6, 2009 at 3:32 am
jrandolp
Hi Markku,
The problem is that you need to retabulate your data so that the cases are in rows, the categories are in columns, and the number of raters who chose each category is in the cells. Each row should sum to two, not ten, in your case since there are only two raters. Currently, looking at your data set from the perspective of how the Online Kappa Calculator views data, it tells me that there was one case, eight categories, and 10 raters (1+2+1+1+1+2+1+1=10).
For kappa, each case can belong to one and only one category. So if you have a case that is of the type “check all that apply,” you would have to have separate kappas for each characteristic. For example, if you wanted to have two raters rate a cake on three characteristics (color, flavor, texture), you would have to calculate Kappa for each of those three characteristics.
Color: Is it (1) white, (2) blue, or (3) pink?
Flavor: Is it (1) chocolate or (2) angel food?
Texture: Is it (1) hard or (2) soft?
Suppose that for color, both raters agree that it was white. You would tabulate the data thusly (the case in row 1; white, blue, and pink in the first three columns; and the number of raters choosing each category in the cells):
                   White (1) | Blue (2) | Pink (3)
Case A (a cake):       2     |    0     |    0
The 2 in cell A1 means that 2 raters thought that the cake was white.
You mentioned that the raters agreed every time on all characteristics; therefore, as a shortcut I can tell you that for each characteristic, the value of Kappa, free-marginal or fixed, is 1.00.
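The retabulation described in this reply can be sketched in Python as follows. This is an illustration under my own assumptions, not the calculator's code: `tabulate` and `free_marginal_kappa` are hypothetical names, and the kappa formula is the usual multirater one (observed agreement from the squared cell counts, expected agreement = 1/number of categories).

```python
def tabulate(ratings, categories):
    """Turn raw ratings into a cases-by-categories table of rater counts.

    ratings: one list per case, holding the category each rater chose.
    """
    return [[case.count(c) for c in categories] for case in ratings]


def free_marginal_kappa(table):
    """Free-marginal multirater kappa from a table of rater counts."""
    n_cases = len(table)
    n_raters = sum(table[0])
    n_categories = len(table[0])
    if n_raters < 2 or n_categories < 2:
        raise ValueError("need at least two raters and two categories")
    for row in table:
        # Every case must be rated by the same number of raters
        if sum(row) != n_raters:
            raise ValueError("each row must sum to the number of raters")
    squares = sum(count * count for row in table for count in row)
    observed = (squares - n_cases * n_raters) / (
        n_cases * n_raters * (n_raters - 1))
    expected = 1 / n_categories
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # The cake example: two raters both say the color is white
    table = tabulate([["white", "white"]], ["white", "blue", "pink"])
    print(table)                        # [[2, 0, 0]]
    print(free_marginal_kappa(table))   # 1.0
    # The original row sums to 10, so it reads as 10 raters, not 2
    print(sum([1, 2, 1, 1, 1, 2, 1, 1]))   # 10
```

The row-sum check is the same rule the calculator needs: each row has to add up to the number of raters.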
October 6, 2009 at 3:43 am
Markku Paanalahti
Hi Rudolph
Thank you very much for your answer. I am starting to understand a little bit more about Kappa statistics. Quite a relief, I can tell you.
Markku
October 26, 2009 at 11:16 am
James Rucker
Hi Justus
Your calculator is rather wonderful and has saved my research project. Many thanks. I have about 180 sets of data I want to calculate Fleiss’ free marginal kappa on so I am wondering if there is a faster way to do it than having to cut and paste in each individual set of data, and do the same for each result?
Incidentally, on Apple Macs and Linux you can’t cut and paste data into the applet, even if you accept the certificates. Windows is fine.
Cheers
James
October 27, 2009 at 11:48 pm
Justus
Hi James,
I’m sorry that there is only one way to set up data in the OKC. An OKC user recently told me that he had success making an Excel file with the formulas used here. There is a reference to the formula for Fleiss’s Kappa on the OKC page. I’m glad that you find it useful.
Justus
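For many data sets, another option besides Excel is a short script. The Python sketch below is only a rough illustration under my own assumptions: it uses the standard multirater formulas (free-marginal expected agreement = 1/number of categories; fixed-marginal expected agreement from the overall category proportions, as in Fleiss's kappa), the data set names are hypothetical, and it has not been checked against the OKC's own code.

```python
def kappas(table):
    """Return (free_marginal, fixed_marginal) kappa for one counts table.

    table: one row per case, one column per category, cells = rater counts.
    """
    n_cases = len(table)
    n_raters = sum(table[0])
    n_categories = len(table[0])
    if n_raters < 2 or n_categories < 2:
        raise ValueError("need at least two raters and two categories")
    if any(sum(row) != n_raters for row in table):
        raise ValueError("each row must sum to the number of raters")
    total = n_cases * n_raters
    squares = sum(c * c for row in table for c in row)
    observed = (squares - total) / (total * (n_raters - 1))
    # Free-marginal: chance agreement depends only on the category count
    free_expected = 1 / n_categories
    free = (observed - free_expected) / (1 - free_expected)
    # Fixed-marginal: chance agreement from the overall category shares
    shares = [sum(row[j] for row in table) / total
              for j in range(n_categories)]
    fixed_expected = sum(p * p for p in shares)
    if fixed_expected == 1:
        fixed = None   # every rating fell in one category; left undefined here
    else:
        fixed = (observed - fixed_expected) / (1 - fixed_expected)
    return free, fixed


if __name__ == "__main__":
    # Hypothetical data sets; in practice these might be read from files
    datasets = {
        "set_001": [[2, 0], [0, 2], [2, 0]],
        "set_002": [[2, 0], [1, 1], [1, 1]],
    }
    for name, table in datasets.items():
        print(name, kappas(table))
```

This sketch returns None for the fixed-marginal value when every rating falls in a single category rather than picking a convention; other tools may handle that case differently.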
August 22, 2009 at 3:41 am
jrandolp
Hi Tracy,
I’m glad that you found the Online Kappa Calculator discussion to be useful for you. To be honest, I’m getting out of my area of expertise with the ICC. I think that Dr. Funatsu could help you better than I can.
To answer what I can though:
I’m not sure what you should put for the frequency. I quickly looked at Dr. Funatsu’s calculator, and it wasn’t intuitive to me what was going on.
Since you can’t change your sample size now, I would calculate confidence intervals for the ICC if that’s possible. I think that it’s possible with SPSS. That way you would know the probable range of the parameter ICC that you are inferring to based on your sample size.
If I understand the ICC correctly, you can interpret it basically as you would the ubiquitous Pearson correlation (r). The ICC ranges from 1 to -1, where 0 indicates no correlation. Whether 0.52 is an adequate ICC is really relative. I would find other similar studies that have used the ICC and use them as reference points.
If you need help calculating intervals around the ICC, I offer personalized statistical support services. Send me a private e-mail if you are interested, justus@randolph.name