Hi all,
I’d like to announce the debut of the “Online Kappa Calculator.” It calculates free-marginal and fixed-marginal kappa (chance-adjusted measures of interrater agreement) for any number of cases, categories, or raters. Roman and Nikko will appreciate this; now we don’t have to do these calculations the hard way!
Thanks to Walubengo M. Singoro for his fantastic programming work on this.
Check out:
http://justus.randolph.name/kappa
I’m going on vacation to the States tomorrow. If you have any comments, I’ll get back to you in a few weeks. Is anyone else interested in creating a collection of calculators for obscure, but intensely useful, statistics? I have ideas for many student programming projects along this line.
242 comments
August 20, 2008 at 6:49 pm
Brenda
I have not been able to open the calculator (Java applet problem). Any ideas on getting the applet to run?
August 3, 2010 at 8:27 am
John
Great program, Justus. I have had no problem getting it to run, but I could not get my data to paste from SPSS or Excel. I entered all 200 cases by hand. Then I read Fam’s post on using Ctrl-V. That worked, and the results, happily, confirmed that I had entered the data correctly by hand the first time. Thanks for the OKC, it’s a great tool. John at Pitt.
August 21, 2008 at 10:53 pm
Justus Randolph
Hi Brenda,
I checked the site and it is working on my computer. Earlier I had a problem on this computer where Java wouldn’t load. Uninstalling and then reinstalling the latest version of Java resolved the issue. Let me know if this tack doesn’t work.
August 21, 2008 at 10:56 pm
Justus Randolph
P.S., you have to accept the digital certificate, so that you can cut and paste your data from a spreadsheet program.
January 8, 2010 at 6:18 pm
Sarah
I still can’t paste into the calculator from Excel (or any other app), even after accepting the signature. What can be wrong? Please help.
August 22, 2008 at 2:14 pm
roman
Usability-wise, I think you should move the calculator behind one extra link (or a “Start” button). What happens now is that the Java application loads whether you want it to or not (e.g., when you just want to check the references). On most computers, the loading takes quite a while, and browsers often choke on it.
September 12, 2008 at 4:43 am
Greg
I spent about two hours entering data into the calculator (224 cases and 17 categories). When I pushed the button to calculate the results, I was continually asked to ensure I had entered the number of raters (which I had: 2 raters). Eventually, I lost everything, including two hours of my life.
September 13, 2008 at 3:40 am
Justus Randolph
We figured out that the two hours of lost life was because the empty cells were left blank and not filled in with zeros. I’ll make a note in the “Instructions” that no cell can be left empty; empty cells need to have a zero.
Greg, thanks for pointing out this point of confusion and sorry about the wasted time.
Justus
September 13, 2008 at 4:12 am
Greg
When I made it clear that the calculator was not working for me, Justus went to a lot of trouble to solve the problem for me. Now that we understand the error that was made, everything works great.
The calculator caused a headache for a period, but once Justus figured out where things were going wrong, the calculator has been fantastic and I can feel my lost life returning!
November 11, 2008 at 6:14 am
Jeet
Hi Justus,
First of all, I would like to thank you very much for hosting this online kappa calculator. I have fed in my data (0 when a category was not used), but I am not getting any result when the calculate button is clicked. Please tell me what else to do or how to get the results? Regards.
Jeet
January 28, 2010 at 11:12 am
Fam
Hi Justus,
My case is identical to that of Jeet. I accessed the site multiple times at different times of the day, each time clicking the button repeatedly, but the calculator never displayed any results. Sending you my data every time I want to check my results would be a bit hard, as more participants are taking part in my study.
Thanks in advance.
January 28, 2010 at 11:31 pm
Justus Randolph
Hi,
My best advice is to uninstall, then reinstall Java or try another computer. The OKC works from my computer.
Justus
February 1, 2010 at 8:31 am
Fam
Hi Justus,
I figured out what the problem was. It was a missing “0”. I think the program would benefit greatly if an extra check were introduced to search for blanks and show an error message such as “xx cell(s) are blank, please fill before continuing” instead of displaying nothing and making it look like it’s not working.
You could also add a small note telling users that to paste their data they have to use Ctrl+V, not a right mouse click. Some may not know this.
Thanks Justus, and well done for providing the community with an easy-to-use, effective, and much-needed tool.
November 12, 2008 at 5:55 am
GA
Justus,
Thank you for providing an online calculator. Could you please tell me if my data set is appropriate for this analysis? I have 8 pairs of raters who each rated their own set of 10 subjects. (The same people are always paired together.) So, different subjects are sometimes rated by the same raters and sometimes not. I want to put the data from all rater-pairs into one analysis, rather than calculating a kappa for each pair of raters. Is that possible?
Thanks very much.
November 12, 2008 at 9:24 pm
Justus Randolph
Hi GA,
The Online Kappa Calculator isn’t set up for this kind of situation. Kappa only works when you have all raters rating the same sample. It sounds like a generalizability theory sort of problem to me. See:
http://www.psychology.sdsu.edu/faculty/matt/Pubs/GThtml/GTheory_GEMatt.html
If you wanted to stay with a Kappa approach, you could report the value of kappa for all rater pairs since there are only eight. It seems like that would tell the reader what they needed to know: highest kappa, lowest kappa, median/mean kappa. They would subjectively get a sense of overall interrater agreement.
November 12, 2008 at 9:50 pm
Justus Randolph
Hi Jeet,
Sometimes I have to hit the “submit” button a few times if the server is slow. Did you get it working? If not, get back to me and tell me more about your data set and I’ll try to figure it out.
Justus
November 13, 2008 at 5:07 am
GA
Thanks very much, Justus. I appreciate your time.
November 20, 2008 at 7:37 pm
Jeet
Thanks for the tip Justus, but unfortunately it did not work.
My data has 31 cases, 6 categories, and 3 observers.
0 3 0 0 0 0
0 3 0 0 0 0
0 0 1 2 0 0
0 3 0 0 0 0
0 0 1 1 0 1
1 2 0 0 0 0
1 2 0 0 0 0
0 2 0 1 0 0
1 2 0 0 0 0
0 3 0 0 0 0
1 2 0 0 0 0
0 1 0 2 0 0
1 2 0 0 0 0
0 3 0 0 0 0
0 3 0 0 0 0
0 2 0 1 0 0
0 3 0 0 0 0
0 1 0 1 1 0
2 1 0 0 0 0
0 3 0 0 0 0
0 2 0 1 0 0
0 0 2 0 1 0
0 1 1 1 0 0
0 0 1 0 1 1
0 0 1 0 1 1
0 0 2 0 1 0
0 0 2 0 1 0
0 0 1 1 1 0
0 0 1 1 1 0
0 2 0 1 0 0
0 1 1 0 1 0
Hope this helps you to help me more. Thanks for your time.
Cheers
Jeet
November 20, 2008 at 10:15 pm
Justus Randolph
Hi Jeet,
I was able to cut and paste your data set above right into the calculator and it worked fine. I’m not sure what kind of error you were having, but often uninstalling and then reinstalling Java will get the Online Kappa Calculator working.
Anyway, your results are below:
Percent of overall agreement: 0.419355
Fixed-Marginal Kappa: 0.153976
Free-Marginal Kappa: 0.303226
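If you want to check results like these by hand, here is a minimal sketch in Python of the statistics the calculator reports, following the Fleiss (1971) and Brennan and Prediger (1981) formulas cited on the OKC page. This is my reconstruction from those formulas, not the applet’s actual source. Each row is one case, each column a category, and each cell counts the raters who chose that category, so every row sums to the number of raters.

def agreement_stats(matrix):
    n_raters = sum(matrix[0])
    n_cases = len(matrix)
    n_categories = len(matrix[0])
    pairs = n_raters * (n_raters - 1)
    # Percent of overall agreement: mean proportion of agreeing rater pairs.
    p_overall = sum(sum(c * (c - 1) for c in row) / pairs
                    for row in matrix) / n_cases
    # Free-marginal kappa: expected agreement is 1 / number of categories.
    p_free = 1.0 / n_categories
    # Fixed-marginal (Fleiss) kappa: expected agreement from column marginals.
    total = n_cases * n_raters
    p_fixed = sum((sum(row[j] for row in matrix) / total) ** 2
                  for j in range(n_categories))
    kappa_fixed = (p_overall - p_fixed) / (1 - p_fixed)
    kappa_free = (p_overall - p_free) / (1 - p_free)
    return p_overall, kappa_fixed, kappa_free

# Toy check: 4 cases, 2 categories, 3 raters.
print(agreement_stats([[3, 0], [0, 3], [2, 1], [3, 0]]))

Pasting Jeet’s 31 rows in as the matrix reproduces the three numbers above.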
November 26, 2008 at 10:31 pm
Jeet
Thank you very much Justus.
I am going to try using the online kappa calculator from home, as I still have the same problem with it from my office (after pasting the data and clicking calculate, nothing happens, even after repeated clicking). I will let you know what happens from home.
Once again thanks for calculating my data.
Jeet
December 17, 2008 at 6:27 pm
JKS
Hi,
Thanks for providing this great resource! BUT…I’m also having problems getting the program to work at all. I’ve tried changing browsers (Firefox, IE) and asking a co-worker to try on their Apple computer, but nothing happens when I push CALCULATE. I also reinstalled Java.
I have 239 cases with 4 categories and 5 raters. Is it possible this is just too many cases?
Thanks for any advice,
/JKS
5 0 0 0
0 0 1 4
4 1 0 0
0 0 2 3
1 0 4 0
0 0 5 0
2 0 3 0
0 0 0 5
3 0 2 0
0 1 2 2
0 0 4 1
4 1 0
4 0 0 1
2 2 0 1
0 0 4 1
0 0 0 5
0 0 0 5
0 0 4 1
0 0 4 1
5 0 0 0
0 0 0 5
0 0 3 2
0 0 0 5
2 0 3 0
5 0 0 0
5 0 0 0
5 0 0 0
0 0 2 3
1 0 0 4
1 0 4 0
0 0 0 5
0 0 1 4
1 0 4 0
0 0 0 5
0 0 0 5
4 0 1 0
4 0 1 0
0 0 0 5
0 0 5 0
5 0 0 0
5 0 0 0
0 0 3 2
0 0 0 5
0 0 1 4
0 1 3 1
5 0 0 0
0 0 0 5
5 0 0 0
5 0 0 0
0 0 5 0
5 0 0 0
5 0 0 0
0 0 0 5
5 0 0 0
5 0 0 0
0 0 3 2
0 0 2 3
0 0 5 0
0 0 1 4
0 0 0 5
0 1 3 1
0 0 1 4
0 0 1 4
4 0 1 0
0 0 5 0
0 0 2 3
0 0 4 1
0 0 4 1
0 0 0 5
0 0 4 1
0 0 2 3
0 0 2 3
0 0 2 3
0 5 0 0
0 0 0 5
0 0 5 0
4 0 1 0
0 0 5 0
0 0 5 0
1 0 2 2
3 0 2 0
1 0 3 1
5 0 0 0
0 0 5 0
0 0 2 3
0 0 3 2
4 0 1 0
0 0 0 5
2 0 0 3
4 0 1 0
0 0 1 4
0 0 0 5
5 0 0 0
5 0 0 0
0 0 4 1
0 0 4 1
5 0 0 0
0 0 4 1
2 0 3 0
2 0 3 0
0 0 2 3
1 0 4 0
0 0 1 4
0 0 0 5
0 1 4 0
0 0 0 5
0 0 0 5
0 0 5 0
2 0 3 0
1 0 4 0
1 0 4 0
0 0 4 1
0 0 3 2
2 0 3 0
1 0 3 1
0 0 4 1
0 0 4 1
2 0 3 0
2 0 3 0
2 0 1 2
0 0 2 3
0 0 3 2
0 0 0 5
0 0 3 2
0 0 3 2
0 0 0 5
0 2 3 0
0 0 1 4
0 0 2 3
0 0 3 2
1 0 2 2
5 0 0 0
0 0 3 2
4 0 1 0
0 0 1 4
5 0 0 0
1 0 0 4
0 0 1 4
5 0 0 0
5 0 0 0
5 0 0 0
5 0 0 0
5 0 0 0
0 0 4 1
0 0 5 0
0 0 1 4
0 0 1 4
0 0 5 0
5 0 0 0
4 0 1 0
0 0 1 4
0 0 0 5
0 1 2 2
0 0 0 5
0 0 2 3
0 0 0 5
3 0 2 0
5 0 0 0
0 1 4 0
5 0 0 0
4 0 1 0
0 0 4 1
0 0 2 3
1 0 4 0
0 1 4 0
0 0 2 3
0 0 1 4
0 1 4 0
0 0 0 5
0 0 5 0
5 0 0 0
0 0 0 5
0 0 1 4
4 0 1 0
3 0 2 0
1 0 4 0
0 0 0 5
0 0 0 5
0 0 0 5
0 0 0 5
0 0 0 5
3 0 2 0
3 0 2 0
0 0 0 5
0 0 0 5
0 0 0 5
0 1 3 1
0 0 1 4
0 0 2 3
0 0 1 4
0 0 0 5
1 0 4 0
0 0 4 1
0 0 1 4
0 0 5 0
5 0 0 0
1 0 0 4
0 0 0 5
0 0 0 5
5 0 0 0
1 1 2 1
3 0 2 0
0 0 0 5
0 0 1 4
0 0 4 1
0 0 2 3
0 2 3 0
5 0 0 0
0 0 0 5
0 2 3 0
0 0 0 5
1 0 4 0
5 0 0 0
0 0 0 5
1 4 0 0
5 0 0 0
0 0 0 5
5 0 0 0
0 0 4 1
5 0 0 0
0 0 4 1
3 0 2 0
2 0 2 1
0 0 4 1
0 0 5 0
5 0 0 0
5 0 0 0
5 0 0 0
0 0 2 3
0 0 0 5
0 0 0 5
0 0 0 5
0 0 0 5
1 0 4 0
0 0 0 5
4 0 1 0
2 0 0 3
4 0 1 0
0 0 0 5
December 17, 2008 at 6:32 pm
JKS
Hmm! I now see there is an error in my data, with a missing value. Now it works!
So never mind! The only suggestion I have is that it might be nice to get an error message when this type of error occurs. As it was, I wasn’t sure whether I had done something wrong or the program wasn’t working anymore.
Anyway, THANKS so much for making this available!
/JKS
January 7, 2009 at 10:23 am
Kim
Justus,
This is so cool! Thank you, thank you.
I wonder if you can tell me if I have things set up correctly. I have 16 items, rated on a 4 point scale, with 21 subjects rating the items.
So….
16 cases
4 categories
21 raters
Am I right?
Kim
January 7, 2009 at 10:36 am
Kim
Justus,
One other thing, I have collected my data through the Delphi Survey Method. I wonder which of the statistics, Fixed-Marginal Kappa or Free-Marginal Kappa would be most appropriate?
Kim
January 8, 2009 at 12:14 am
Justus Randolph
Hi Kim,
I wonder if either of the kappa statistics is appropriate in your case. First, if I understand the Delphi method correctly, raters are able to change their ratings after hearing a summary of ratings, making the ratings dependent. I don’t think that the kappa family of statistics is appropriate for dependent ratings. (By “dependent” I mean that one rating or rater affects the ratings of another.) Because they can revise their answers based on other answers, the raters will obviously be able to do better than chance.
Second, it seems that your scale is not actually categorical, but rather continuous or ordinal, because you wrote “on a 4-point scale.” If it is continuous or ordinal, you would be better off using a different statistic. A great multirater agreement statistic for continuous scales is the intraclass correlation coefficient (see, e.g., http://www.nyu.edu/its/statistics/Docs/intracls.html).
However, if your ratings are independent and your scale is categorical (apples, oranges, and pears), one of the kappa statistics would be right for you, and indeed there would be 16 cases, 4 categories, and 21 raters. Like Brennan and Prediger, I suggest using free-marginal kappa if raters didn’t need to have a certain number of “1” ratings, a certain number of “2” ratings, and so on.
Hope this helps. Feel free to write back if you have any other questions. The next comment might help explain some of the reasons why there are better agreement statistics than kappa for ordinal or continuous scales.
–Justus
January 8, 2009 at 1:04 am
Justus Randolph
An Online Kappa Calculator user named Lindsay and I had an e-mail discussion that I thought other users might benefit from. I will excerpt parts of our conversation below, with permission. Lindsay, thanks for your great questions and for letting me share them with others. Feel free to write back if you have any more questions or if I didn’t answer your questions.
***Lindsay wrote:
I am interested in using your online multirater free-marginal kappa calculator for a research project; however, I am having a statistical problem and hoping you can help me understand it.
I have an ordinal scale of 0 (unacceptable), 1 (acceptable), and 3 (excellent).
I have 3 raters, each using the scale above to rate 30 images.
When I enter the data into your online calculator, I get a 0.10 free-marginal kappa. If I transform the data into a dichotomous scale (0 unacceptable and 1 acceptable), the free-marginal kappa goes up to 0.86.
What bothers me is that performing standard Cohen’s kappa calculations via SPSS for Rater 1 vs. Rater 2, Rater 2 vs. Rater 3, and so on yields much lower kappas for the dichotomous ratings, while your online calculator yields much higher values for the dichotomous variables.
I’m trying to understand why it’s reversed.
***Justus wrote:
It seems like there are a few questions going on here:
Why does the free-marginal kappa differ from the fixed-marginal kappa?
Why does that relationship change when I dichotomize my variable?
Should I dichotomize my variable or not?
What statistic should I use?
===
–Why does the free-marginal kappa differ from the fixed-marginal kappa?
All other things being equal, free-marginal and fixed-marginal kappa differ because of prevalence and bias. You can read about this in:
Brennan, R. L., & Prediger, D. J. (1981). Coefficient kappa: Some uses, misuses, and alternatives. Educational and Psychological Measurement, 41, 687-699.
or
Randolph, J. J. (2005). Free-marginal multirater kappa: An alternative to Fleiss’ fixed-marginal multirater kappa. Paper presented at the Joensuu University Learning and Instruction Symposium 2005, Joensuu, Finland, October 14-15, 2005. (ERIC Document Reproduction Service No. ED490661)
–Why does that relationship change when I dichotomize my variable?
Theoretically, since free-marginal kappa increases as the number of categories increases, the free-marginal kappa should go down after dichotomizing. However, I suspect that dichotomizing drastically increased the percent of overall agreement, and that is why you saw the kappa values you did. I bet that it is much easier to categorize a case as (unacceptable) or (acceptable or excellent) than as unacceptable, acceptable, or excellent. I’m guessing that there is a strong distinction between unacceptable and (acceptable or excellent) and a very fine distinction between acceptable and excellent. I strongly suggest that you crosstabulate how many of each type of error there were (unacceptable-acceptable; unacceptable-excellent; acceptable-excellent). Given chance, there should be about the same number of errors in each category. I’m guessing that won’t be the case. Graphing where the errors are can tell you a lot about the construct or phenomenon you are investigating. I always suggest doing that crosstabulation.
In short, the kappa is going to change when you dichotomize the variable because the percent of overall agreement is probably going to change, owing to systematic categorizing errors in your case.
–Should I dichotomize my variable or not?
One can “massage” free-marginal kappa by increasing the number of categories. Therefore, I suggest using only as many categories as are theoretically justifiable. Since you chose three categories a priori, I would stick with three categories.
–Which statistic should I use?
The kappa family of statistics is appropriate when you have nominal variables. So, a different type of statistic might be better for you. My rationale for using a different statistic is that treating your three categories as nominal assumes that an unacceptable-excellent disagreement is the same degree of error as an acceptable-excellent or unacceptable-acceptable error. If you want to treat your categories as ordinal, there are better statistics than kappa. I don’t have my books handy right now, but you could probably find the right statistic in:
Siegel, S., & Castellan, N. J. (1988). Nonparametric statistics for the behavioral sciences (2nd ed.). New York: McGraw-Hill.
If you think it is appropriate to treat your categories as continuous, a good candidate would be one of the variations of the intraclass correlation coefficient. See:
http://www.ats.ucla.edu/stat/spss/library/whichicc.htm
***Lindsay wrote:
So free-marginal kappa increases as the number of categories increases. This is interesting when applied to my results using your online calculator. I have found the exact opposite: entering non-dichotomous (3-category) data for 8 attributes using 3 raters and then entering the dichotomous data, the dichotomous results are uniformly higher – much higher – than their non-dichotomous counterparts.
Some samples:
Q1 – 0.48 (Non-Dichotomous) 0.87 (Dichotomous)
Q2 – 0.33 (ND) 0.73 (D)
Q3 – 0.37 (ND) 0.82 (D)
Q4 – 0.27 (ND) 0.78 (D)
Strange. The pattern is consistent all the way down the line of attributes.
Any idea why this would occur? I have 30 cases per attribute, so I’m not working with extremely small samples.
*** Justus wrote:
So, ALL OTHER THINGS BEING EQUAL, an increase in the number of categories will increase the value of free-marginal kappa. As an exercise, try holding the percent of overall agreement constant and changing the number of categories. I illustrate this in Figure 3 of the paper linked below:
Click to access 27.pdf
I expect that the discrepancy you find between kappa values when you convert a three-category system to a two-category system arises because the percent of overall agreement changes drastically when you make the switch. The increase in percent of overall agreement from reducing categories must outweigh the associated decrease in agreement expected by chance. (Remember that the kappa formula is (P_overall - P_expected)/(1 - P_expected); two variables affect the kappa value, not just the percent of expected agreement.) In your data set, doesn’t the percent of overall agreement dramatically increase when you treat the three categories as if they were two categories? If so, that would help explain the discrepancy.
I would make a graph of what kinds of errors (i.e., disagreements) the raters made (e.g., 1-3, 1-2, and 2-3). I suspect that how you split the three categories into two will make a difference in the percent of overall agreement and, therefore, in the values of kappa you find. If the errors are not split up evenly, then how you dichotomize the categories will make a difference in percent of overall agreement. For example, if most of the errors are 2-3 errors, then you can “hide” those errors by combining categories 2 and 3 into, let’s say, a “4” category and then recalculating the errors as if there were only a 1 and a 4 category.
Overall, I would suggest that you use as many categories as you originally had the raters use. That seems like it would give you the most accurate picture of agreement.
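To make the arithmetic concrete, here is a tiny illustration with made-up numbers (mine, not Lindsay’s data): dichotomizing raises the free-marginal expected agreement from 1/3 to 1/2, so kappa only goes up if the percent of overall agreement rises enough to outweigh that.

def free_kappa(p_overall, n_categories):
    # kappa = (P_overall - P_expected) / (1 - P_expected), with
    # P_expected = 1 / number of categories for free-marginal kappa.
    p_expected = 1.0 / n_categories
    return (p_overall - p_expected) / (1 - p_expected)

print(free_kappa(0.55, 3))  # 3 categories, 55% agreement -> 0.325
print(free_kappa(0.85, 2))  # 2 categories, 85% agreement -> about 0.70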
February 13, 2009 at 6:43 am
Jessica
The table is not showing up on the site. Has it been moved?
Thanks!
February 13, 2009 at 8:45 am
Justus Randolph
Hi Jessica,
The Online Kappa Calculator works fine for me. I just tried it. Please describe your problem in more detail and perhaps I can help troubleshoot. I had a computer once where I had to install the latest version of Java to get the Calculator to load properly.
March 3, 2009 at 8:45 pm
Sofie
Hi Justus,
First I want to thank you for your work and the free calculator on the internet!
For my reliability study, I have calculated Fleiss’ kappa and, in addition, the free-marginal multirater kappa. My question is how I can calculate the confidence intervals, which are also important to mention. I have already calculated the confidence intervals for the Fleiss’ kappa in another statistical program. Can I report the same confidence intervals for the free-marginal kappas?
Kind regards,
Sofie
March 5, 2009 at 5:45 pm
Claire
Hi Justus,
I’m in a bit of a similar situation to Kim (one of the previous posters). I’m using a 4-point Likert scale to calculate inter-rater reliability. I used Cohen’s kappa when I only had 2 raters, but in my next study I have up to ten raters.
Someone suggested using a weighted version of Fleiss’ kappa (as I used a weighted version of Cohen’s kappa for my first study). So I guess I have a few questions:
1. Does a weighted Fleiss’ kappa exist?
2. If Fleiss’ kappa is still appropriate for me to use, which statistic would be most appropriate: Fixed-Marginal Kappa or Free-Marginal Kappa?
3. What should I do about missing data (i.e., only 9 responses, but 10 raters)?
Thanks so much for your help & great program!
Claire
March 8, 2009 at 4:07 am
Justus Randolph
Hi Sofie,
Provide me with a little bit of information and I can figure out the confidence intervals for you and show you how I did it.
I’ll need:
-The total number of cases
-The number of cases you calculated kappa for
-The percent of overall agreement
-The number of categories
-The confidence intervals you are interested in (e.g., 95%, 90%, ...)
March 8, 2009 at 4:31 am
Justus Randolph
Hi Claire,
1. I haven’t read anything about a weighted Fleiss’s kappa, but that doesn’t mean it doesn’t exist. If you can’t find it, it probably wouldn’t be too hard to work out a formula. I’m swamped right now, but that might make a nice statistical paper someday for someone ; )
2. Since you have a continuous or ordinal scale (depending on how you think about Likert scales), I wouldn’t use a kappa statistic at all. Why not use an intraclass correlation if you can consider the scale to be continuous? (See P. E. Shrout & Joseph L. Fleiss (1979). “Intraclass Correlations: Uses in Assessing Rater Reliability”. Psychological Bulletin 86 (2): 420–428. doi:10.1037//0033-2909.86.2.420. Or see http://www.ats.ucla.edu/stat/spss/library/whichicc.htm for how to compute it with SPSS.) If it’s ordinal, Siegel and Castellan’s Nonparametric Statistics for the Behavioral Sciences has a whole chapter on measures of association for ordinal data.
As you know, unweighted kappa statistics (like the ones used in the Online Kappa Calculator) are meant for assessing the reliability of categorical data. For example, unweighted kappa statistics give the same weight to a strong disagreement (e.g., strongly disagree v. strongly agree) as to a weak disagreement (e.g., strongly disagree v. disagree). Weighting the kappa lets you assign a customized weight to each level of disagreement, if that is appropriate in your situation. But there are plenty of commonly used and easily computed multirater statistics for ordinal and continuous data. Why not use those?
About which to use if you were going to use a multirater kappa: the kappa family of statistics can be divided into two categories, those that use 1/number of categories as the percent of expected agreement (free-marginal) and those that don’t (fixed-marginal). There is a lot of debate about which situations call for the various types of kappa, but I’m convinced by Brennan and Prediger’s argument (you can find the reference at the bottom of the Online Kappa Calculator page) that one should use fixed-marginal kappas (like Cohen’s kappa or Fleiss’s kappa) when you tell raters, for example, “Categorize these ten cases into two categories, AND MAKE SURE THAT YOU END UP WITH FIVE CASES IN EACH CATEGORY,” and free-marginal kappas when you tell raters, for example, “Categorize these ten cases into two categories. It doesn’t matter how many cases end up in each category.”
3. Like most kappa formulas, the formula used in the Online Kappa Calculator needs a full data set. My best advice in your case is to format the data set disregarding the rater who didn’t fully respond. I’m not aware of research that investigates the effects of different missing-data strategies on the values of the various kappa statistics. There’s another good topic for a statistics paper.
Let me know if you have any other questions,
Take care,
Justus
March 8, 2009 at 8:43 pm
Sofie
Hi Justus,
I hope this is the information you need:
Total number of cases: 40 (population)
The number of cases I calculated kappa for: 18
Percent of overall agreement:
0.680 – 0.660 – 0.675 – 0.745 – 0.740 – 0.880 – 0.865 – 0.960 – 0.975 – 0.650 – 0.655 – 0.635 – 0.745 – 0.800 – 0.905 – 0.955 – 0.970 – 0.990
Number of categories: 2
Confidence interval I would like to calculate: 95 %
Thanks for your help!
Sofie
March 10, 2009 at 1:37 am
Justus Randolph
Hi Sofie,
One more thing–how many raters were there?
Justus
March 12, 2009 at 12:06 am
Sofie
There were 5!
March 12, 2009 at 1:54 am
Justus Randolph
Hi Sofie,
I’m still having a little trouble making sense of this. I figured that, using the 18 percents of overall agreement you gave, I could average them to get the overall percent of overall agreement? But I’m not sure if that is right. I figured that the percent of overall agreement was 80.4722, but it seems to me that it should be based on 180 possibilities (with five raters, there are ten possible agreement pairs per case: 1-2, 1-3, 1-4, 1-5, 2-3, 2-4, 2-5, 3-4, 3-5, 4-5; 10 possibilities * 18 cases = 180 possibilities). I need to figure out, out of the 180 possibilities for agreement in your data set, how many agreements there actually were.
When you use the Online Kappa Calculator for your data set that has 18 cases (drawn from a sample of 40 cases), 5 raters, and 2 categories, what are the specific values for percent of overall agreement, fixed-marginal and free-marginal kappa?
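As a small sketch of that pairwise bookkeeping (Python, just for illustration):

from math import comb

raters, cases = 5, 18
pairs_per_case = comb(raters, 2)   # 10 pairs: 1-2, 1-3, ..., 4-5
print(pairs_per_case * cases)      # 180 possibilities for agreement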
March 15, 2009 at 3:43 am
Sofie
Hi Justus,
I think there must have been a misunderstanding on my side. I’ll explain the whole situation: 5 raters examined 18 ribs (ribs 2-10 right and ribs 2-10 left) of 40 people. For each rib, the 5 raters chose whether there was a blockade, “yes” or “no” (= 2 categories). Then I calculated the Fleiss’ kappa and the free-marginal kappa for each rib.
So, I calculated the Fleiss’ and free-marginal kappa 18 times, but for each rib there were 40 cases.
So for each rib:
Total number of cases: 40 (population)
Number of categories: 2
Confidence interval I would like to calculate: 95 %
Number of Raters: 5
Percent of overall agreement:
Rib 2 right = 0.680
Rib 3 right = 0.660
Rib 4 right = 0.675
Rib 5 right = 0.745
Rib 6 right = 0.740
Rib 7 right = 0.880
Rib 8 right = 0.865
Rib 9 right = 0.960
Rib 10 right = 0.975
Rib 2 left = 0.650
Rib 3 left = 0.655
Rib 4 left = 0.635
Rib 5 left = 0.745
Rib 6 left = 0.800
Rib 7 left = 0.905
Rib 8 left = 0.955
Rib 9 left = 0.970
Rib 10 left = 0.990
Sofie
March 27, 2009 at 6:49 am
jrandolp
Hi Sofie,
Sorry for taking so long to get back to you. It’s been a really hectic week or two.
Now I’ve got a good sense for what your data set is like. However, I’m not entirely sure what population you mean to make inferences about. If you still need help after reading my explanations below send me an e-mail (justus@randolph.name) and we can set up an appointment for me to do some statistical consulting for you, if you desire. I’m happy to write this long explanation here because it might have value for Online Kappa Calculator users in general.
On a side note, it looks like it’s not very easy for people to agree on whether a rib is blocked or not. When there is a lot of disagreement, I suggest that you make a table of who tends to agree or disagree with whom, and why. You might find that a simple clarification of the procedure is what is needed to boost agreement. Plus, you can find a lot of interesting results from that kind of sleuthing. For example, in a content analysis I did on publication bias, we found that researchers tend to bury nonsignificant results in nonnumerical text and emphasize significant findings in numerical text and tables. We wouldn’t have found that had we not had a low kappa and tried to troubleshoot the source of our disagreement.
GENERATING CONFIDENCE INTERVALS AROUND FREE-MARGINAL KAPPA WHEN YOU ARE INFERRING FROM A RELIABILITY SUBSAMPLE TO THE ENTIRE SAMPLE
Typically, I see confidence intervals drawn around a kappa statistic when one is making inferences from a reliability subsample to a sample. This might happen when you have limited time or resources and have multiple raters rate different cases in your sample. In this case, it is customary to overlap some of the cases to determine to what degree raters are agreeing. If the raters agree, there is justification for having different raters rate different sets of cases. Even if rating work isn’t shared, it’s good practice to have a second rater (or more raters) rate a subsample of cases to establish that the obtained ratings are not “the idiosyncratic results of one rater’s subjective judgment” (Neuendorf, K. A., 2002, The Content Analysis Guidebook. Thousand Oaks, CA: Sage).
In the case where not all raters rate all cases, the confidence intervals around the sample kappa show the range where kappa likely would have fallen had all raters rated all cases. That is, the inference is from the subsample of reliability cases to the population of sample cases. For example, in my dissertation I randomly sampled a set of 352 cases. I rated all cases and had a second rater rate a random subsample of 53 cases. I calculated kappa and confidence intervals around it. The confidence intervals indicated the range in which the population kappa would likely have fallen had the second rater rated all cases. You can read the details at http://www.archive.org/details/randolph_dissertation
Sofie, in the case of your data set, since all raters rated all 40 cases, you don’t need to calculate confidence intervals if you meant to make an inference from the reliability subsample to all sampled cases. You know what the population parameter is; you don’t have to make any inferences.
Since this will probably be of interest to other Online Kappa Calculator users, I wrote a program to calculate confidence intervals around free-marginal kappa in the case of generalizing from a reliability subsample to a sample. I’m a big fan of resampling, especially for obscure statistics like free-marginal Kappa, so I used resampling here. You can run this program on a giftware resampling program, Statistics 101, which is based on the Resampling Stats language. You can download Statistics 101 from http://www.statistics101.net/. See http://www.statistics101.net/QuickReference.pdf for a quick reference to the software and to make sense of the program below.
STATISTICS 101 PROGRAM TO CALCULATE CONFIDENCE INTERVALS AROUND FREE-MARGINAL KAPPA AND PERCENT OF OVERALL AGREEMENT WHEN INFERRING FROM A RELIABILITY SUBSAMPLE TO ALL SAMPLED CASES.
This program presumes that there are a total of 1,000 sampled cases, that the number of expected agreements in the 1,000 sampled cases is 750, that the total number of expected disagreements is 250, that the size of the reliability subsample is 100, that the percent of overall agreement found in the reliability subsample is 0.75 (or 75%), that the percent of expected agreement is 0.5 (remember that percent of expected agreement is 1/number of rating categories), and that the desired confidence intervals are 95% intervals. Agreements are labeled with a “1” and disagreements with a “2”.
You can find the number of expected agreements to put in line 1 by multiplying your percent of overall agreement by the total number of sampled cases (here, 0.75*1000 = 750). Your number of expected disagreements can be found by multiplying the total number of sampled cases by (1 - percent of overall agreement); here, the expected number of disagreements was 250 because (1-0.75)*1000 = 250. Note that only whole numbers are usable here, so you might have to make your population sample size slightly larger or smaller so that your percent of overall agreement is accurately reflected in whole numbers.
You can modify this program to fit your own needs by changing the numerical values in line 1 (i.e., replace 750 with your own number of expected agreements and 250 with your own number of expected disagreements), line 2 (replace 100 with the size of the reliability subsample), and line 3 (replace 0.5 with the percent of expected agreement in the reliability subsample; expected agreement = 1/number of rating categories).
URN 750#1 250#2 pop
COPY 100 size
COPY 0.5 expect
REPEAT 100000
SAMPLE size pop samp$
COUNT samp$ =1 agree$
DIVIDE agree$ size over$
SUBTRACT over$ expect numerat
SUBTRACT 1 expect denom
DIVIDE numerat denom kappa$
SCORE kappa$ kappa
SCORE over$ over
END
PERCENTILE kappa (2.5 50 97.5) Kappatiles
PERCENTILE over (2.5 50 97.5) overtiles
PRINT kappatiles
PRINT overtiles
The results are displayed below:
Kappatiles: (0.32 0.5 0.66), meaning that the 95% confidence intervals around the median free-marginal kappa (i.e., 0.5) are 0.32 and 0.66.
overtiles: (0.66 0.75 0.83), meaning that the 95% confidence intervals around the median percent of overall agreement (0.75) are 0.66 and 0.83.
Note that it’s possible to get a discrepancy between the median kappa or median percent of overall agreement reported here and the kappa reported by the Online Kappa Calculator for several reasons: (1) the number of expected agreements and expected disagreements in the population might not correspond exactly with the percent of overall agreement (remember that only whole numbers are possible in line 1); (2) the Online Kappa Calculator calculates kappa and percent of overall agreement differently than they are calculated here; it does not use the median kappa and does not use resampling.
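For anyone who would rather run this in a general-purpose language, here is a rough Python translation of the program above. It is my sketch, using the same assumed numbers, and it assumes SAMPLE draws with replacement:

import random

population = [1] * 750 + [2] * 250   # 1 = agreement, 2 = disagreement
subsample_size = 100
expected = 0.5                       # 1 / number of rating categories
kappas, overalls = [], []

for _ in range(100000):
    # Resample a reliability subsample with replacement.
    sample = random.choices(population, k=subsample_size)
    p_overall = sample.count(1) / subsample_size
    overalls.append(p_overall)
    kappas.append((p_overall - expected) / (1 - expected))

def percentiles(values, cuts=(2.5, 50, 97.5)):
    values = sorted(values)
    return [values[int(len(values) * c / 100)] for c in cuts]

print(percentiles(kappas))    # roughly (0.32, 0.50, 0.66)
print(percentiles(overalls))  # roughly (0.66, 0.75, 0.83)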
GENERATING CONFIDENCE INTERVALS AROUND FREE-MARGINAL KAPPA WHEN INFERRING FROM ALL SAMPLED CASES TO ALL POSSIBLE CASES IN THE UNIVERSE.
A use for drawing confidence intervals around kappa that I haven’t personally seen in any research, but that I can imagine, is when you want to make an inference from all sampled cases to the universe of cases that could have been sampled from. Sofie, I think this is probably the kappa interval you are looking for. If I understand correctly, you had 5 raters rate 40 cases of the same rib (i.e., rib 1A or whatever). I would calculate the percent of overall agreement for each of the 40 cases and use the program below. Replace the values in parentheses in line 1 with the 40 per-case percents of overall agreement you got for a particular rib. I suspect that you want to report interrater reliability for each rib (i.e., 1a, 2a, etc.) separately.
PROGRAM FOR GENERATING CONFIDENCE INTERVALS AROUND FREE-MARGINAL KAPPA WHEN GENERALIZING FROM ALL SAMPLED CASES TO THE UNIVERSE OF ALL POSSIBLE CASES
To modify this program to meet your own needs, replace the data in parentheses in line 1 with the percent of overall agreement you found for each case. Replace the numerical value in line 3 with your own percent of expected agreement (i.e., 1/number of rating categories).
COPY ( 1 1 1 1 1 0 0 0 0 0 ) pop
SIZE pop size
COPY 0.5 expect
REPEAT 10000
SHUFFLE pop pop$
SAMPLE size pop$ agree$
MEAN agree$ mean$
SUBTRACT mean$ expect numerat
SUBTRACT 1 expect denom
DIVIDE numerat denom kappa$
SCORE kappa$ kappa
SCORE mean$ mean
END
PERCENTILE kappa (2.5 50 97.5) kappatiles
PERCENTILE mean (2.5 50 97.5) overtiles
PRINT kappatiles
PRINT overtiles
The results are listed below:
kappatiles: (-0.6 0.0 0.60), meaning that the 95% confidence intervals around the median free-marginal kappa (i.e., 0.0) are -0.6 and 0.60.
overtiles: (0.2 0.5 0.8), meaning that the 95% confidence intervals around the median percent of overall agreement (0.5) are 0.2 and 0.8.
April 8, 2009 at 11:57 pm
Sofie
Hey,
I used the “program when generalizing from all sampled cases to the universe,” and it worked! The free-marginal kappas correspond to the ones I calculated with your online kappa calculator, but now they are all negative. Is it OK if I just take the absolute value, or is there more I (or you) should know?
The free-marginal kappas I calculated now are also a little bit bigger. For example, 0.67 in the Online Kappa Calculator becomes 0.69 (median) in Statistics101.
Sofie
April 9, 2009 at 3:36 am
jrandolp
Hi Sofie,
A negative kappa means that you would have done better if you had just guessed (i.e., a kappa of zero). So, no, the absolute value doesn’t work for Kappas. If the point estimate is on, but the sign is wrong, I’m guessing that you (or I) just got a sign switched somewhere.
I’m glad that this worked for you.
Take care,
Justus
April 16, 2009 at 9:57 am
thelovelydays
Hi Justus,
I have used the equations you described in your 2005 symposium paper to calculate a free-marginal multirater kappa in Excel. It works well, and I have checked its accuracy against your online calculator.
Now I want to be able to calculate the SE and 95% CI. I see your discussion on this above. I downloaded the program (Statistics101) but really could not make heads or tails of it. Can you provide me with an equation, or an Excel formula, for calculating the SE? Also, what about p values? Can we derive those somehow too?
The data I have analysed is from 22 raters, 6 categories, 15 subjects.
Thanks for your help.
Ben
May 8, 2009 at 12:47 am
Justus Randolph
Hi Ben,
I replied a few posts down.
Justus
May 7, 2009 at 11:31 pm
Andy
Hello Justus
I have just come across your online calculator and had a go at one data set, entering the data manually. Worked brilliantly! Many thanks for setting this up. Now I’m doing a second data set but wanted to paste in an Excel spreadsheet to save inputting all those noughts. Sorry for being a bit dense, but how do I accept the digital certificate that allows me to proceed?
Cheers
May 8, 2009 at 12:46 am
Justus Randolph
Hi Andy,
The first time you went to the site, there should have been a pop-up screen that asked you whether you would accept the digital certificate. Since you were able to use the calculator successfully, I think you have already accepted it. Some folks have had a hard time inputting data from a spreadsheet. If worse comes to worst, you might have to enter it by hand. If you get an error, check that there is a number in each cell and that each row adds up to the number of raters.
Justus
May 8, 2009 at 12:04 am
Justus
Hi Ben,
I’m a convert to resampling these days. The Statistics101 program comes with a good book and online documentation to help make sense of the code. There is also good info at http://www.resample.com.
You could easily change it to get p values or SEs; however, why calculate those when the confidence intervals are what is really meaningful? (See the APA Task Force on Statistical Inference’s statement for a rationale.) I think the inferential question we want to answer is what the likely values of the population kappa are, not the probability of a kappa being 0. If you did compute a p value, I would calculate the probability of the population’s kappa being at or above a threshold acceptability value (e.g., .70).
Now, with that said, you could find confidence intervals (or p values) around kappa in Excel using resampling. There’s a great article in Teaching Statistics called “Resampling with Excel” that will tell you how to do it. It is a fun exercise. The article is linked below:
http://www3.interscience.wiley.com/journal/118769449/abstract?CRETRY=1&SRETRY=0
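As a sketch of that threshold idea (in Python, with the same assumed numbers as the Statistics101 program further up: 750 agreements in 1,000 cases, a subsample of 100, and expected agreement of 0.5):

import random

population = [1] * 750 + [2] * 250
kappas = []
for _ in range(10000):
    sample = random.choices(population, k=100)  # resample with replacement
    p_overall = sample.count(1) / 100
    kappas.append((p_overall - 0.5) / (1 - 0.5))

# Share of resampled kappas at or above the acceptability threshold.
print(sum(k >= 0.70 for k in kappas) / len(kappas))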
June 20, 2009 at 3:53 am
Jeff Levsky
Dear Justus,
Thank you very much for your online calculator. I was trying to use it online (3 raters, 2 categories, 21 samples) but was getting a java console exception:
java.lang.NoSuchMethodError: java.math.BigDecimal.&lt;init&gt;(I)V
at KappaCal$1.mouseClicked(KappaCal.java:200)
at java.awt.AWTEventMulticaster.mouseClicked(Unknown Source)
Can you help me get this to work? Unfortunately, I have limited control over updating the java engine on my machines at work.
Thanks,
Jeff
June 20, 2009 at 4:11 am
Justus Randolph
Hi Jeff,
I don’t know a lot about Java, so I can’t help you with that error. I checked the calculator from two of my computers and it worked fine in both cases. Perhaps the easiest solution is to find a different machine on which Java is working. Whenever I’ve had a technical problem with the calculator, uninstalling and then reinstalling Java has worked every time. Sorry I couldn’t be of more help.
Take care,
Justus
August 22, 2009 at 1:21 am
Tracy
Hello Justus!
I have followed your online Kappa blog and it helped me realize I need to use an intraclass correlation (ICC) rather than a Kappa correlation. I have a question about the results I obtained…
I am trying to figure out if my calculations of the intraclass reliability coefficients I have obtained are adequate for my purposes. I asked 8 experts in the field of school climate 12 questions on a questionnaire I created, asking them to rate the importance of certain reliability, validity, norm, and other variables on an assessment of school climate. This is the Likert scale I used: 0 - Not important at all, 1 - Somewhat important, 2 - Moderately important, 3 - Very important, 4 - No opinion, 5 - Don’t know.
I decided to throw out the responses from raters who chose either 4 or 5, since those do not indicate ordinal data. This left me with responses from only 5 raters.
I plugged my data into the intraclass correlation coefficient calculation form by Dr. Funatsu, professor at Meisei University, Tokyo (http://www.wwq.jp/javascript/intracorre.html), and calculated an ICC of 0.5192307692307757.
I listed the frequency for each variable as “1”. Is this correct? Is 0.5192307692307757 considered an adequate ICC, or is my sample size simply too small to give an adequate reading?
If you have a chance to respond, that would be wonderful–unfortunately, I have a very short deadline and I am running out of resources to determine an accurate answer.
Thanks so much!
Best regards,
Tracy
August 22, 2009 at 3:41 am
jrandolp
Hi Tracy,
I’m glad that you found the Online Kappa Calculator discussion to be useful for you. To be honest, I’m getting out of my area of expertise with the ICC. I think that Dr. Funatsu could help you better than I can.
To answer what I can though:
I’m not sure what you should put for the frequency. I quickly looked at Dr. Funatsu’s calculator, and it wasn’t intuitive to me what was going on.
Since you can’t change your sample size now, I would calculate confidence intervals for the ICC if that’s possible. I think that it’s possible with SPSS. That way you would know the probable range of the parameter ICC that you are inferring to based on your sample size.
If I understand the ICC correctly, you can interpret it basically as you would the ubiquitous Pearson correlation (r). The ICC ranges from -1 to 1, where 0 indicates no correlation. Whether 0.52 is an adequate ICC is really relative. I would find other similar studies that have used the ICC and use them as reference points.
If you need help calculating intervals around the ICC, I offer personalized statistical support services. Send me a private e-mail if you are interested, justus@randolph.name
September 29, 2009 at 9:42 pm
Markku Paanalahti
Dear Justus
I have one case
8 categories
Two raters
Cat.0 Cat1 Cat2 Cat3 Cat4 Cat5 Cat6 Cat7
1 2 1 1 1 2 1 1
The program tells me that kappa cannot be calculated; the two raters do not disagree at the same time in any of the categories.
Sincerly
Markku
October 6, 2009 at 3:32 am
jrandolp
Hi Markku,
The problem is that you need to retabulate your data so that the cases are in rows, the categories in columns, and the sum of agreements per category in the cells. The sum of each row should equal two, not ten, in your case, since there are only two raters. Currently, looking at your data set from the perspective of how the Online Kappa Calculator views data, it tells me that there was one case, eight categories, and 10 raters (1+2+1+1+1+2+1+1 = 10).
For kappa, each case can belong to one and only one category. So if you have a case that is of the type “check all that apply,” you would have to have separate kappas for each characteristic. For example, if you wanted to have two raters rate a cake on three characteristics (icing color, flavor, texture), you would have to calculate kappa for each of those three characteristics:
Color: Is it white (1), blue (2), or pink (3)?
Flavor: Is it (1) chocolate or (2) angel food?
Texture: Is it (1) hard or (2) soft?
Suppose that for color, both raters agree that it was white. You would tabulate the data thusly (the case in row 1; white, blue, and pink in the first three columns; and the number of agreements per category in the cells):
White (1) | Blue (2) | Pink (3)
Case A (a cake): 2 | 0 | 0
The 2 in cell A1 means that 2 raters thought that the cake was white.
You mentioned that the raters agreed every time on all characteristics; therefore, as a shortcut, I can tell you that for each characteristic the value of kappa, free-marginal or fixed, is 1.00.
October 6, 2009 at 3:43 am
Markku Paanalahti
Hi Rudolph
Thank you very much for your answer. I’m starting to understand a little bit more about kappa statistics. Quite a relief, I can tell you.
Markku
October 26, 2009 at 11:16 am
James Rucker
Hi Justus
Your calculator is rather wonderful and has saved my research project. Many thanks. I have about 180 sets of data I want to calculate free-marginal kappa on, so I am wondering if there is a faster way to do it than cutting and pasting each individual set of data, and doing the same for each result?
Incidentally, on Apple Macs and Linux you can’t cut and paste data into the applet, even if you accept the certificates. Windows is fine.
Cheers
James
October 27, 2009 at 11:48 pm
Justus
Hi James,
I’m sorry, but there is only one way to set up data in the OKC. An OKC user recently told me that he had success making an Excel file with the formulas used here. There is a reference to the formula for Fleiss’s kappa on the OKC page. I’m glad that you find it useful.
Justus
November 17, 2009 at 6:55 pm
Precimax
We require a sample calculation for an attribute study, like a kappa study. Please give me one sample of the same.
December 17, 2009 at 2:03 pm
Tom Hartley
Hi Justus, I have been given a problem by our 5 cytology screeners: are they all giving roughly the same diagnoses over the last 6 months of work? The data set they have given me is not case by case but cytologist by cytologist, e.g.:
Cytologist 1: 316 negatives, 45 low grade, 30 intermediate, 13 high grade
Cytologist 2: 135 negatives, 13 low grade, 19 intermediate, 19 high grade
Etc Etc.
They don’t have the same cases and they have different workloads but the cases all come from the same panel of clinics.
Is it possible to calculate kappa from consolidated data such as this?
Thanks
December 18, 2009 at 1:12 am
Justus Randolph
Hi Tom,
Since the raters are categorizing different cases, I would put these into crosstabs and look at how the proportions differ, the adjusted residuals, and the value of chi-square. (A rule of thumb is that adjusted residuals greater than 2 in absolute value indicate that one thing is not like the others. On crosstabs with many cells, though, you will probably get some large residuals just by chance.) I’ll e-mail you a printout of what I mean. From the data set you gave me, these raters seem to rate differently (chi-square = 15.45, df = 3, p = .001). The biggest difference was in the high-grade category: rater 1 put about 3% of cases in the high-grade category, the other about 10%. The adjusted residual for the high-grade category was |3.5|.
If you wanted to do an inter-rater reliability study, you could have them rate a sample of the same cases and then calculate kappa.
Call or e-mail if you need further help. I offer private statistical consulting services.
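As a sketch of that crosstab approach in Python, using the two rows of counts Tom posted (SciPy’s chi2_contingency does the chi-square test; the adjusted residuals would take a few more lines):

from scipy.stats import chi2_contingency

table = [
    [316, 45, 30, 13],  # Cytologist 1: negative, low, intermediate, high
    [135, 13, 19, 19],  # Cytologist 2
]
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, dof, p)  # roughly 15.46, 3, .0015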
December 18, 2009 at 7:25 am
b.evans
Howdy there!
Thank you very much for this great calculator! I have accepted the certificate, and I have definitely had problems pasting my info into it from BOTH Excel and SPSS. However, my list is quite large (16 categories, 250 subjects, three raters), and I really don’t want to enter it by hand. Can you help at all? Thanks! All the best.
December 18, 2009 at 9:21 am
Justus Randolph
Hi B. Evans,
It’s hard to tell what the problem is. Often when this happens, it’s because the source data set isn’t the same size as the destination data set. Your source and destination data sets should have 250 rows and 16 columns. For the data set to work, the empty cells should be filled in with zeros and each row should sum to 3, since there are three raters.
Hope this helps,
Justus
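A quick sanity check along these lines (my own sketch, not part of the calculator) can catch both problems before pasting:

def validate(matrix, n_raters):
    # Flag blank cells and rows that don't sum to the number of raters.
    for i, row in enumerate(matrix, start=1):
        if any(cell is None or cell == "" for cell in row):
            print(f"row {i}: blank cell - fill it with a zero")
        elif sum(row) != n_raters:
            print(f"row {i}: sums to {sum(row)}, expected {n_raters}")

validate([[2, 1, 0], [1, 1, 0], [3, 0, 0]], n_raters=3)  # flags row 2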
January 23, 2010 at 11:50 pm
Jorge
Hi Justus,
I have been struggling to derive the confidence intervals for the kappa scores we have generated with your online calculator. In brief, we have 5 observers, 295 cases, and 5 categories (nominal values). There is no prerequisite for the number of cases in each category.
Which Kappa score test should we use? How can we calculate the confidence intervals?
Many thanks for your help!
Best wishes,
Jorge
January 25, 2010 at 9:18 am
Justus Randolph
Hi Jorge,
I’ve written a resampling stats program for calculating confidence intervals in the March 27, 2009 post above. You can download free software to run it from http://www.statistics101.net/.
Justus
March 6, 2010 at 7:04 pm
Isabelle
Dear Justus,
I have come across your website with the online kappa calculator, which seems like a fantastic resource. However, unlike the other users of your blog, I am not yet at the stage of data analysis; my question is much more preliminary.
My research project investigates risk factors for suicide attempts. As part of it, I have developed a rating system to determine whether or not a given incident of self harming behaviour constitutes a suicide attempt. Accordingly, an event can be classed as a suicide attempt, not a suicide attempt, or undetermined.
Despite an intensive literature search, I have not been able to work out what would be an adequate sample size of the vignettes to be rated and the number of judges for calculating kappa. The papers I have read on the sample size requirements (e.g., Cantor, 1996; Flack et al., 1988; Sim & Wright, 2005) provide formulae or tables for estimating sample size for significance tests for kappa, but nothing on what would be considered appropriate if kappa were to be simply used as the sample estimate. Looking for clues, I have conducted an informal review of 10 studies where inter-rater agreement was calculated and kappa was not subject to hypothesis testing. What I found was that the authors had not commented on the rationale for decisions related to the size of their samples, or the number of raters used.
Unfortunately, I have not come across a paper that would provide recommendations in regards to when having three (or more) raters is advised, compared to having only two raters.
I thought perhaps you could suggest what is likely to be reasonable from the statistical point of view in terms of the sample size and preferable in terms of the number of raters. Any advice would be much, much appreciated.
Best regards,
Isabelle
March 6, 2010 at 11:29 pm
Justus Randolph
Hi Isabelle,
I’m not sure what you mean by “sample estimate” in the sentence: “estimating sample size for significance tests for kappa, but nothing on what would be considered appropriate if kappa were to be simply used as the sample estimate.” Please explain.
Could you send me the papers you referred to by e-mail and I’ll try to see what you mean. I’m guessing that you could use those formulas or tables there if you are using Cohen’s kappa to estimate the inter-rater reliability of your scale.
You might also want to check out generalizability theory. That might be another method of answering your question, if you want to think of raters as random factors.
March 6, 2010 at 11:34 pm
Justus Randolph
A few more questions:
Do you want to make an inference from a subset of cases, which you will have both judges rate, to all cases that you have data for?
Or
Do you want to make an inference from all cases you have data for to the universe of possible cases?
Or
Do you want to make an inference to all judges from the judges you have chosen?
Justus
March 7, 2010 at 3:27 pm
Isabelle
Hi Justus,
Thank you for your prompt reply. I have sent you an email with the additional info.
Best regards,
Isabelle
March 13, 2010 at 6:05 am
christian
@Isabelle
Maybe this is helpful:
Bonett, D. G. (2002). Sample size requirements for estimating intraclass correlations with desired precision. Statistics in Medicine, 21, 1331-1335.
March 19, 2010 at 8:31 am
Isabelle
Hi Christian,
Thank you for your comment. An article by Altaye et al. (2001), “A general goodness-of-fit approach for inference procedures concerning the kappa statistic” (Statistics in Medicine, 20, 2479-2488), has solved my problem. I recommend this paper to anyone who wants to estimate the sample size in a reliability study that involves 3 categories and would like to compare the numbers needed when only two judges make the ratings vs. when there are 3-7 raters.
I have downloaded the paper you suggested – a good resource to have for future reference!
Isabelle
March 20, 2010 at 5:51 am
Justus Randolph
Hi Isabelle,
Thanks for the great resource! This looks as if it will really benefit the research community of kappa users. I look forward to reading it.
Justus
March 20, 2010 at 5:38 am
Sara
Hi Justus
I’ve been having problems using my university’s preferred statistics package to calculate values with the kappa statistic, and no one there knew how to use the package for that purpose.
I came across your online kappa calculator and was able to obtain some values. What I would like to know is: how did you formulate this calculator? Is it regularly calibrated, and can anyone change it so that the results it provides aren’t accurate?
I would like to use the results I obtained but need to be assured of the accuracy of the calculator.
Many thanks
March 20, 2010 at 5:57 am
Justus Randolph
Hi Sara,
The formulas used in the OKC are listed in the description and the references at the bottom of the OKC page.
The Java program that the OKC is based on is password protected; only I have the password. I haven’t checked the calibration lately, but it was extensively tested by the programmer and me by hand-calculating results from the formulas and comparing them with the OKC’s. Specifically, I used known kappa values from a data set in the formulas’ source, among other tests, and compared them to OKC results. The results were accurate at the time. To the best of my knowledge, nothing has been changed in the original program. You could run a few tests comparing the OKC results to hand-calculated results from the formulas used.
Write again if you need clarification. I appreciate your attention to detail.
Take care,
Justus
March 29, 2010 at 9:14 am
Mary
Justus, What a wonderful tool! It was great to be able to calculate the IRR for my study!
Thanks!!
Mary
April 6, 2010 at 4:11 am
Naomi Fineberg
I cannot find the “button” to start the program. What exactly do I have to do to run the program?
Thanks,
Naomi
May 12, 2010 at 12:29 am
Justus
Hi Naomi, Sorry for not getting back to you earlier. It’s the “Show Table” button.
Justus
May 11, 2010 at 9:52 pm
Hillel
Hi,
This is such a great tool! Thanks!
I was wondering, is there a significance (p) level that accompanies the kappa value? I think SPSS has that for the simple 2-judge kappa, but it does not appear in the output from the online calculator. I would need that info for reporting this in a paper, no?
Thanks,
hillel
May 12, 2010 at 12:32 am
Justus
Hi Hillel,
I think that SPSS can calculate p values or confidence intervals for Cohen’s 2-rater kappa. There is also an SPSS macro for Fleiss’s kappa; it’s mentioned in one of the comments above.
I’ve written Resampling Stats/Statistics 101 code for calculating confidence intervals around free-marginal multirater kappa. It’s about half-way up the page. Hope this helps.
Justus
May 31, 2010 at 12:54 pm
Ken
I used your great tool to calculate a kappa for 104 cases last Friday, but today my Java runtime blocks it with the following dump. Before getting the error I was asked two questions: do you trust this program? (yes); block unsafe content? (no). Any advice? thx
Java Plug-in 1.6.0_20
Using JRE version 1.6.0_19-b04 Java HotSpot(TM) Client VM
—————————————————-
c: clear console window
f: finalize objects on finalization queue
g: garbage collect
h: display this help message
l: dump classloader list
m: print memory usage
o: trigger logging
q: hide console
r: reload policy configuration
s: dump system and deployment properties
t: dump thread list
v: dump thread stack
x: clear classloader cache
0-5: set trace level to
—————————————————-
java.lang.SecurityException: class “KappaCal” does not match trust level of other classes in the same package
at com.sun.deploy.security.CPCallbackHandler$ChildElement.checkResource(Unknown Source)
at com.sun.deploy.security.DeployURLClassPath$JarLoader.checkResource(Unknown Source)
at com.sun.deploy.security.DeployURLClassPath$JarLoader.getResource(Unknown Source)
at com.sun.deploy.security.DeployURLClassPath.getResource(Unknown Source)
at sun.plugin2.applet.Plugin2ClassLoader$2.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at sun.plugin2.applet.Plugin2ClassLoader.findClassHelper(Unknown Source)
at sun.plugin2.applet.Applet2ClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.plugin2.applet.Plugin2ClassLoader.loadCode(Unknown Source)
at sun.plugin2.applet.Plugin2Manager.createApplet(Unknown Source)
at sun.plugin2.applet.Plugin2Manager$AppletExecutionRunnable.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Exception: java.lang.SecurityException: class “KappaCal” does not match trust level of other classes in the same package
June 14, 2010 at 6:07 pm
Isabelle
Hi Justus,
I have read your discussion with Sofie on deriving the confidence intervals, downloaded the Statistics101 software, but I am not clear on the values I need to input to be able to calculate CIs for my data. I was wondering if you would be able to help me figure out what I am missing.
I have 70 incidents, three categories (suicide attempt, not suicide attempt, undetermined) and three raters.
I would like to calculate CIs for the agreement on the final category (% overall agreement=0.80, free-marginal kappa=0.70), as well as for the subcomponents of the rating form, which were (as a short hand):
Self-injury – 92.4% agreement, free-marginal kappa=0.85
Explicit presence – 91.1% agreement, free-marginal kappa=0.82
Explicit absence – 92.2% agreement, free-marginal kappa=0.84
Implicit presence – 76.3% agreement, free-marginal kappa=0.53
Thank you in advance.
Best regards,
Isabelle
July 1, 2010 at 4:21 pm
eugen
Thank you!
July 16, 2010 at 6:13 pm
Melek
Hi Justus,
I am working on a study which investigates whether English language teachers implement the communicative language teaching method properly in their classrooms. Data were collected by video recording the lessons of 20 teachers; 60 lessons were videotaped for analysis. An observation form with 30 items has been constructed. An example is below:
1) The teacher implements group work in the class. YES NO
Two raters will analyse the data. In order to achieve inter-rater reliability, I decided to conduct a kind of training for the raters. For the training session, they are going to watch 15 lessons and fill in the observation forms individually. In order to rate the inter-rater reliability, I will use Cohen’s kappa. However, the problem is that I do not know how to cover all the items of the observation form (30 items). I would appreciate it very much if you could advise me about how to do it.
I searched the web, but the sample examples given online did not fit my research.
July 17, 2010 at 1:17 am
Justus
Hi Melek,
I would calculate a kappa for each item. My dissertation has an example of what I mean: http://www.archive.org/details/randolph_dissertation
Hope this helps.
Justus
July 19, 2010 at 2:26 pm
Melek
Thank you very much. That helped a lot.
July 22, 2010 at 3:18 pm
Bruce
Hi Justus,
My study is of 3 doctors rating of MR images.
25 cases, 4 categories and 3 raters.
They had to decide whether there was a pinched nerve on the MRI and say:
No = 0, yes and it’s on the right = 1, yes and it’s on the left = 2, and yes and it’s on both the right and left = 3.
Given this, do I use free or fixed marginal kappa?
Thanks for any help.
Bruce
July 23, 2010 at 12:57 am
Justus
Hi Bruce,
There is no consensus in the research community on this issue; however, I think Brennan and Prediger (see the reference at the bottom of the OKC page) make a great argument for using free-marginal kappa in a research situation like yours.
Hope this helps,
Justus
July 23, 2010 at 4:04 pm
Bruce
Thank you for your prompt response.
Is there any way to generate 95% CIs for these kappa values?
July 24, 2010 at 1:14 am
Justus
Hi Bruce,
I have written a resampling stats code for generating CIs around free marginal kappa. The code is in a March 27, 2009 comment to Sofie above. You can get free software to run the program at http://www.statistics101.net/
Take care,
Justus
September 6, 2010 at 11:18 pm
frederique
Hi Justus
I tried your online Kappa calculator with imaging data (47 cases, 2 categories) and 3 analysis methods (qualitative response) and it works perfectly well. Thanks for your help.
September 10, 2010 at 8:33 pm
Justus
You’re welcome. I’m happy to help the research community.
September 10, 2010 at 5:26 am
Robin
I’ve been trying to understand Kappa for months now and still don’t. I had a Likert survey that I gave to 5 people, with 37 items on it. They were to rate each item on a scale from 1 to 6 (1=not at all, 2=most likely not, 3=may not, 4=may, 5=most likely, 6=fully). The question for each of the 37 items was “does this statement measure this construct”. The whole idea was that about half of them were supposed to be items that measure the construct and half were supposed to not measure that construct.
At a glance, it looks like there is pretty good agreement on all of the items. I used a 6-point Likert scale so that it would be a forced choice: if they selected 4, 5 or 6, it would indicate a good representation of the construct, whereas if they selected 1, 2 or 3, it would suggest that this item did NOT measure the construct.
I was told that I needed to report inter-rater reliability, and have already done percentages, which have shown a lot of agreement. So I don’t understand why the Kappa values are coming up as they are. I posted my Likert data below, and the result of the calculation. Please help.
a 0 0 0 0 1 4
b 0 0 0 0 1 4
c 0 0 0 0 0 5
d 0 0 1 2 2 0
e 0 0 0 0 3 2
f 0 0 0 2 2 1
g 0 0 1 0 3 1
h 0 1 0 2 2 0
i 0 0 1 2 1 1
j 0 1 1 3 0 0
k 0 0 1 0 2 2
l 0 1 3 1 0 0
m 0 0 0 2 2 1
n 0 0 0 2 1 2
o 1 0 0 3 0 1
p 0 0 0 1 3 1
q 0 0 0 2 1 2
r 0 0 1 0 4 0
s 0 1 0 1 1 2
t 0 0 1 1 2 1
u 1 0 0 4 0 0
v 1 1 1 1 1 0
w 0 3 1 1 0 0
x 2 2 0 0 1 0
y 1 1 3 0 0 0
z 0 1 1 1 2 0
{ 1 0 0 3 0 1
| 0 1 1 1 1 1
} 1 2 1 1 0 0
~ 1 2 1 1 0 0
1 2 0 2 0 0
0 2 0 2 1 0
0 0 2 0 2 1
1 1 2 1 0 0
2 0 2 1 0 0
2 2 0 1 0 0
2 3 0 0 0 0
Percent of overall agreement Po : 0.267567
Fixed-marginal kappa : 0.105881
Free-marginal kappa : 0.121080
September 10, 2010 at 5:30 am
Robin
Also, I do have SPSS 15.0, and know to go through “analyze… descriptives… crosstabs” and click on Kappa in the statistics box, but I get an error in there. It doesn’t run the Kappa, and I’m not sure why. That’s why I looked online and found your site.
September 10, 2010 at 12:10 pm
El
Fabulous program! It’s things like this that make the Internet truly great.
September 10, 2010 at 8:32 pm
Justus
Thank you!
Justus
September 10, 2010 at 8:31 pm
Justus
Since your data are ordinal (or arguably interval), kappa might not be right for you: to kappa, a 5-6 disagreement counts as being just as severe as a 1-6 disagreement.
I might try the intraclass correlation or one of the many other statistics that take the interval or ordinal nature of your scale into consideration. Kappa works best for nominal agreements. If you were to collapse your scale into a yes/no scale, you could recast it with two categories and try kappa again. You might also think about using alpha to measure the internal consistency of your scale, and factor analysis to see if there are indeed two factors, as you posit.
Typically, one reports a kappa for each variable when there are multiple items in a scale.
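For illustration, here is a minimal Python sketch of the collapsing step suggested above, assuming per-item rows of rater counts in the six-column format Robin posted (the helper name is made up):

# Collapse six-column Likert counts into two categories:
# ratings 1-3 ("does not measure") vs. ratings 4-6 ("measures").
def collapse_row(counts):
    # counts: raters per scale point for one item, e.g. [0, 0, 0, 0, 1, 4]
    return [sum(counts[:3]), sum(counts[3:])]

rows = [[0, 0, 0, 0, 1, 4],  # item a from the data above
        [0, 1, 3, 1, 0, 0]]  # item l
print([collapse_row(r) for r in rows])  # [[0, 5], [4, 1]]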
September 15, 2010 at 2:00 am
Sylvano
Your multi-rater Kappa calculator is a gem.
Such a lot of work… which I do respect. A professional Statistician here… praise. A question… if I may…? What use are point estimates of Kappa ( however derived), without confidence intervals? To labour a point… say an estimate of Kappa = 0.78. Makes a world of difference if that estimate is based on 5 raters / 20 subjects… as opposed to 5 raters / 200 subjects.
Without some measure of point estimate sampling error (reflecting sample size), all you have is a number… information content = zero. For me anyway.
Put a 95% CI on your applet results… then we really know where we are. Assuming all necessary experimental design issues are satisfied.
All well meant. You have a diamond… but (to me) it is rough, not brilliant cut.
All seriously well meant, and no offence intended.
September 15, 2010 at 2:49 am
Justus
Hi Sylvano,
I do intend to put CI’s around the various kappas someday, when things slow down, which might be a while!
Justus
September 22, 2010 at 9:51 am
jemima
OK, this calculator won’t work for those who have rows that total different numbers of raters, such as:
category 1 category 2
A 5 10
B 93 19
See, if you put raters as 15, then B won’t go through. The total (n=127) is what raters should be, but I don’t get the calculator, lol.
November 6, 2010 at 2:47 am
Leslie
Thanks for making this available. (and thanks for making it possible to cut and paste data for all 111 subjects in one fell swoop.) Kind of cool that anyone even thought to create this.
November 19, 2010 at 9:31 pm
Madawa
Hi Justus,
Thank you for making this programme available.
In my case, I carried out a survey among 406 medical educators on the importance of different professional attributes of doctors (30 items). The rating scale was: extremely important, very important, somewhat important, slightly important and not important. In the analysis of results, I collapsed extremely & very into one category, somewhat & slightly into another, and unimportant into the final one. I entered the collapsed categories into the calculator: number of categories = 3, cases = 30, raters = 406. The percentage agreement was 0.85 and the free-marginal kappa was 0.77. Am I methodologically correct? Your feedback is highly appreciated.
Many thanks.
November 20, 2010 at 2:15 am
Justus Randolph
Hi Madawa,
What you have is a summated rating scale. Therefore, kappa is not the right statistic for your analysis. The following books might be useful for you:
Hope this helps,
Justus
December 7, 2010 at 8:55 pm
Madawa
Many thanks for your advice.
November 27, 2010 at 8:46 am
Mark
My study design involves 128 professionals reviewing a single, identical case summary and rating whether or not they believe child abuse occurred in the case. In addition to percent agreement on the dichotomous outcome, I would like to add a kappa coefficient. Does your program apply to such a design? How much does sample size matter: will an agreement rate of 85% generate the same kappa with 100 raters as with 500 raters?
November 30, 2010 at 3:30 am
Justus Randolph
Hello Mark,
Good question.
Yes, kappa is an appropriate statistic for that situation. In the OKC, use 1 case, 2 categories, and 128 raters. In the cells include the number of raters who believed it did occur and the number who didn’t.
The kappa estimate will not change based on the N size; the confidence intervals around kappa will change however. If you need smaller CIs, then you should increase the N size. If you look earlier in this discussion, I’ve included some programs to calculate CIs around free-marginal kappa. You can experiment with those to get the sample size you would need under different situations. I was curious and modified the program I wrote above to show the 95% CIs around kappa and percent of overall agreement if half agreed and half didn’t. The results are below:
kappatiles: (-0.171875 0.0 0.171875)
overtiles: (0.4140625 0.5 0.5859375)
(The middle number is the median. Remember, kappa is 0.0 and percent of overall agreement is 0.5 when the results are exactly as good as chance. The lowest and highest numbers are the lower and upper 95% CI bounds.)
With that sample size (128), in the most conservative case when half agree and half disagree, there is a ±.17 margin of error (95% CI) around kappa and a ±9% margin of error (95% CI) around the percent of overall agreement.
The Resampling Stats/Statistics 101 program I used to calculate this is given below:
NUMBERS 64#1 one
NUMBERS 64#0 two
CONCAT one two pop
Print pop
SIZE pop size
COPY 0.5 expect
REPEAT 10000
SHUFFLE pop pop$
SAMPLE size pop$ agree$
MEAN agree$ mean$
SUBTRACT mean$ expect numerat
SUBTRACT 1 expect denom
DIVIDE numerat denom kappa$
SCORE kappa$ kappa
SCORE mean$ mean
END
PERCENTILE kappa (2.5 50 97.5) kappatiles
PERCENTILE mean (2.5 50 97.5) overtiles
PRINT kappatiles
PRINT overtiles
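For readers without the Statistics101 interpreter, a rough Python translation of the program above (a sketch only; it assumes Resampling Stats’ SAMPLE draws with replacement, which makes the SHUFFLE step redundant):

import random

pop = [1] * 64 + [0] * 64               # 64 agreements, 64 disagreements
expect = 0.5                            # chance agreement with 2 categories
kappas, agreements = [], []
for _ in range(10000):
    resample = random.choices(pop, k=len(pop))  # bootstrap resample
    p = sum(resample) / len(resample)           # proportion agreeing
    kappas.append((p - expect) / (1 - expect))
    agreements.append(p)

def percentile(values, q):
    # Simple index-based percentile of a list of values.
    ordered = sorted(values)
    return ordered[round(q / 100 * (len(ordered) - 1))]

print([percentile(kappas, q) for q in (2.5, 50, 97.5)])
print([percentile(agreements, q) for q in (2.5, 50, 97.5)])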
Hope this helps.
Justus
February 9, 2011 at 9:26 am
Roxana Patricia De las Salas Martínez
Dear Justus Randolp
1. I have a question about the minimal sample size for Fleiss’s kappa for five non-ordinal categories. Could you help me with information about it?
2. For Fleiss’s kappa, does the sample-size distribution for weighted kappa (2Ke2) apply?
Thank you very much.
I’m waiting for your answer. Please contact me by e-mail.
Roxana
e-mail contact: rdelassalas@gmail.com
February 10, 2011 at 4:36 am
Justus Randolph
1. If you are using ordinal categories, you can still use kappa, but there are better statistics, because kappa treats the data as nominal categories.
I don’t have a formula for sample size but you can experiment with the Resampling Stats programs above to figure out what sample size you need to get the confidence intervals you want. If you just want to demonstrate that the ratings are intersubjective, you really just need two raters.
2. I don’t believe that Fleiss’s kappa is weighted in the way that Cohen’s weighted kappa is.
Hope this helps,
Justus
February 28, 2011 at 5:32 pm
Lubos
Hi Justus, Thanks for a great tool.
One of my raters skipped some of the cases. Can your calculation take into account missing values of this type or do I have to create a “no answer” category for this?
March 1, 2011 at 3:49 am
Justus Randolph
Hi Lubos,
You’ll probably have to delete those cases. Theoretically, “no answer” doesn’t really seem like a valid response category.
March 1, 2011 at 12:59 pm
Lubos
Thanks Justus. Unfortunately, deleting whole cases is not an option for me. I have a fixed set of 10 cases I am investigating through 13 raters. I could maybe remove the rater but I would lose/ignore his other valuable ratings.
I have a few more questions:
1/ Could your kappa be extended to consider missing values?
2/ Could your kappa be extended to consider weightings of various disagreements?
3/ Have you published your formula in any journal?
4/ A few responses above, you mentioned madawa had a summated rating scale. What do you mean by “summated”? I do not think Spector’s book talks about combining answer choices in that sense.
March 1, 2011 at 5:41 pm
Justus Randolph
Hi Lubos,
Good questions:
1. Like most kappa statistics, and other statistics in general, the formula does not work with missing data. You either have to remove raters, remove cases, or impute missing data values.
2. No, not yet at least.
3. No, but it turns out that my multirater free-marginal kappa (the free-marginal kappa in my calculator) is unique and has come to be called Randolph’s kappa. You can read a review of it in Warrens (2010).
Warrens, M. J. (2010). Inequalities between multi-rater kappas. Advances in Data Analysis and Classification, 4, 271-286. doi: 10.1007/s11634-010-0073-4
4. Summated means that the researcher sums (or averages) the value of each survey item to get a single score for the scale or subscale. This, of course, doesn’t work for categorical variables.
March 1, 2011 at 9:04 pm
Lubos
Thanks again Justus.
1. That is a pity and maybe a good opportunity to fill that gap.
2. To me, weightings would make it quite robust.
3. Thanks for the reference. Hope you will get more recognition like that.
4. OK, we have a common understanding of the term but then I do not think that madawa’s design was a summated rating scale. I think he had an ordered categorical scale (or possibly interval-level) used to rate the importance of various personal characteristics of doctors to ascertain the crucial ones. Then he decided to collapse the scale to three points, effectively coming very close to a nominal scale.
March 29, 2011 at 12:48 am
shital
What does Po mean? I had 4 raters, 3 categories and 102 cases. The Po was 0.625817.
Does that mean the percentage of the time that 3 or more raters agreed? Or do all 4 have to agree?
When I calculated with a simple calculator how many times 3 or more raters agreed, it gave me 67.6% agreement.
March 29, 2011 at 4:17 am
Justus Randolph
Hi Shital,
Po is the percent of overall agreement. It’s the proportion of times that raters agreed over all possible pairs of ratings. You might have gotten a different number because, for example, if there are three raters, there are three possible pairwise agreements per case (i.e., agreement between raters 1 and 2, 2 and 3, and 1 and 3). The OKC counts the number of agreements between all pairs of raters, not just the cases when all three agreed.
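To make that concrete, here is a small Python sketch of the pairwise count (this is the standard Fleiss-style observed-agreement formula; that the OKC computes Po exactly this way is an assumption, and the example rows are made up):

def case_agreement(counts):
    # counts: raters per category for one case, e.g. [3, 1, 0] = 4 raters.
    # Returns the proportion of rater pairs that agree on this case.
    n = sum(counts)
    return (sum(c * c for c in counts) - n) / (n * (n - 1))

rows = [[3, 1, 0], [4, 0, 0], [2, 1, 1]]  # three illustrative cases
po = sum(case_agreement(r) for r in rows) / len(rows)
print(round(po, 6))  # 0.555556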
April 4, 2011 at 5:33 pm
Menka
Hi,
I’m trying to calculate kappa for agreement between two annotators on a range of words. How can I do this with your online calculator? The problem is that there is a text, and each rater finds certain spans of text which we want to annotate. Now, the spans may be identical, or just one of the boundaries may match, or both boundaries may differ. I was thinking of doing a weighted kappa where I can give a score of 2 for an exact span match, 1 for one boundary match, and 0 for none. But I am not sure how to do it on the calculator. Can you please help?
April 5, 2011 at 2:20 am
Justus Randolph
Hi Menka,
That’s a good question. The OKC doesn’t calculate weighted kappa. However, you could also use a nonparametric measure of association and use that as the measure of reliability. See Siegel and Castellan’s Nonparametric Statistics for the Behavioral Sciences; I love that book.
April 27, 2011 at 8:24 am
Sandy
Hi Justus,
I’m working on a coding sheet for a meta-analysis and would like to use kappa to report inter-coder reliability. I am unable to calculate this in SPSS, as several items may be checked under one question (e.g., one participant may have several disabilities), and this type of data is not compatible with SPSS (or so I think). Also, each question has a different number of options for coders to select. Will your calculator work in this particular case? If not, do you have any suggestions?
Thanks.
April 29, 2011 at 11:49 am
Justus
Great question Sandy.
Let’s imagine that you have this item in your coding book.
Please circle all categories that apply to this article:
a
b
c
There are two things you can do. The first is to consider a, b, and c to be dichotomous variables. So, you would calculate a value of kappa for a, b, and c. You would have two categories (yes and no) and however many raters and cases you have for each item.
If you are interested in how many cases are a only, b only, c only, a and b only, a and c only, b and c only, and a, b, and c, then you would use kappa with seven categories and however many raters and cases you have.
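A minimal Python sketch of the two recodings described above (the option labels and helper names are illustrative):

OPTIONS = ["a", "b", "c"]

def as_dichotomous(answer):
    # Option 1: one yes/no variable per category -> three separate kappas.
    return {opt: opt in answer for opt in OPTIONS}

def as_combination(answer):
    # Option 2: one variable with seven categories (each non-empty subset),
    # assuming raters always circle at least one option.
    return "+".join(opt for opt in OPTIONS if opt in answer)

print(as_dichotomous({"a", "c"}))  # {'a': True, 'b': False, 'c': True}
print(as_combination({"a", "c"}))  # a+c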
Hope this helps,
Justus
June 27, 2011 at 4:36 am
Walter Moerkerken
Hello Justus,
This answer helps me as well! One follow-up question on this: if I choose to use your first suggestion (calculate a value of kappa for each variable, using the categories ‘yes’ and ‘no’), do I have to switch from using the free-marginal kappa to the fixed-marginal kappa? The answer will now always be yes XOR no.
May 27, 2011 at 3:50 am
Jill
Hi, The tool is great. My question is, what do you recommend for calculating weighted kappa? I have ordinal data – a Likert Scale where 16 individuals are rating 289 individual items.
June 10, 2011 at 12:19 am
Justus
Hi Jill,
Sorry for not getting back to you earlier. I’ve been really busy.
If you have ordinal data, I wouldn’t use kappa because you lose the ordinal information. I might use something like this instead:
http://www.springerlink.com/content/u52p148r225225pw/
Hope this helps,
Justus
June 23, 2011 at 3:57 am
John Close
Hi Justus,
First, I want to thank you for the great program. We have found it very useful. I have a question: we used the Online Kappa Calculator to compute the multirater agreement among 6 raters on 2 categories (accept or reject) on numerous items (journal articles). The ratings were done on three sets of articles (over 200 in each set) at 3 time periods: pre rater calibration, post calibration, and post-post calibration. The respective results were k=.489 (74% agreement), k=.599 (80% agreement), and k=.612 (81% agreement). My question is how is percent/proportion agreement calculated in this case where there are six raters? With only two raters it is straightforward, but I cannot determine how it is done with 6 raters.
I wish to do an omnibus significance test that the proportion of agreement differs significantly over the three rating times. I plan to use the Friedman test and, if significant, follow it with pairwise McNemar tests. I cannot find a test that will let me do the same with the kappas.
Thanks much.
John
University of Pittsburgh
June 27, 2011 at 4:58 am
Justus
Hi John,
Great questions. The formula that I use to calculate multirater percentage of agreement is given in the following reference:
Randolph, J. J. (2005, October). Free-marginal multirater kappa: An alternative to Fleiss’s fixed-marginal multirater kappa. Paper presented at the Joensuu University Learning and Instruction Symposium 2005, Joensuu, Finland. (ERIC Document Reproduction Service No. ED490661)
That is a good question about comparing the kappas. I’ll have to think about that one. In the meantime, if you really need to know the answer I could probably write a bootstrapping code for it.
An easier solution might be to use what Agresti calls M^2 on the number of agreements/disagreements over measurements. It’s basically a chi-square test for a linear trend; SPSS calls it the linear-by-linear association test, or something like that. Imagine a cross-tabulation with six cells: the columns represent your three measurements in order, and the rows are the numbers of agreements and disagreements. The M^2 will tell you if there is a linear increase in the proportion of agreements over time. If the parameters (categories, raters) stay the same over measurements, kappa isn’t really relevant if you just want to know whether calibration worked. Hope this helps.
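A sketch of that test in Python, assuming M^2 = (N-1)r^2 with integer scores for the ordered columns (the counts below are invented purely to show the mechanics):

import math

# 2 x 3 table: rows = (disagree, agree), columns = three rating times.
table = [[60, 45, 40],     # disagreements at times 1, 2, 3
         [140, 155, 160]]  # agreements at times 1, 2, 3
row_scores, col_scores = [0, 1], [1, 2, 3]

pairs = [(row_scores[i], col_scores[j], table[i][j])
         for i in range(2) for j in range(3)]
n = sum(c for _, _, c in pairs)
mx = sum(x * c for x, _, c in pairs) / n   # weighted mean of row scores
my = sum(y * c for _, y, c in pairs) / n   # weighted mean of column scores
sxy = sum((x - mx) * (y - my) * c for x, y, c in pairs)
sxx = sum((x - mx) ** 2 * c for x, _, c in pairs)
syy = sum((y - my) ** 2 * c for _, y, c in pairs)
r = sxy / math.sqrt(sxx * syy)             # correlation of the two scores
m2 = (n - 1) * r ** 2                      # M^2 statistic, df = 1
p = math.erfc(math.sqrt(m2 / 2))           # chi-square(1) upper tail
print(round(m2, 3), round(p, 4))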
June 27, 2011 at 5:02 am
Justus
Hi Walter,
Nope, you don’t have to switch kappas. I think that a good rule of thumb is to use a free-marginal kappa when raters don’t have a certain number of cases to assign to a category, regardless of the number of categories.
Justus
July 6, 2011 at 7:45 pm
Matt
Hi Justus,
Just as I was giving up trying to work SPSS, I found your online calculator. It is great, thanks!
Just a couple of questions based on the scenario below:
I am getting anaesthetists (79 at present) to rate the physical status of 16 hypothetical cases using a well-recognised 1-5 scale called the ASA physical status classification, where:
1 – healthy patient
2- patient with mild disease
3- patient with severe disease
4- patient with severe disease that is a constant threat to life
5- a dying patient
My questions are:
Firstly, I’ve entered my preliminary data and got the kappa statistics out. Which of the free- or fixed-marginal kappa statistics should I report, and what reference can I use to support this decision?
Secondly, it would be useful if I could tell what the level of agreement was for each individual case, so I could assess which cases were the most controversial and caused the biggest disagreement. Is this appropriate, or indeed possible?
Kind regards,
Matt
September 8, 2011 at 10:03 pm
Justus
Hi Matt,
On the OKC website there are a few articles that discuss the free-marginal kappa used here. Check out the Randolph article and the Mathias one.
I think that it’s a great idea to do an error analysis for each case and each rater. You can gain lots of insights that way.
August 23, 2011 at 4:05 am
Breon
Dear Justus
I also am looking for a solution SPSS cannot provide. I have 37 subjects classified into 8 categories by two raters.
Can your programme allow me to get a kappa with a 95% confidence interval?
Many thanks
Breon
September 8, 2011 at 10:01 pm
Justus
Hi Breon,
See my response to Addy. So far you have to use resampling stats code. I’m working on implementing CIs and SEs here.
September 6, 2011 at 12:10 am
Arun
Dear Justus
Great job!
Is there a way to test the significance of a change in the kappa value? How do I get p values and confidence intervals?
Regards.
Addy
September 6, 2011 at 10:44 pm
Justus
Hi Addy,
Check the posts looking for what looks like computer code. That will give you info on how to create CIs. Take care,
Justus
September 8, 2011 at 9:22 pm
Karin
Hi Justus,
Your online calculator is great! You’ve helped me very much.
This is actually my first time using any kind of statistics program for my study (medicine, University of Utrecht, Holland) and I am finding myself to be a bit lost. I was assigned to be part of an ongoing medical study, more specifically to determine the inter-rater agreement between 8 radiologists rating 35 patients. There are five categories, and for one patient I have a missing rating. I decided to leave this case out of the calculator and use 34 cases, with 5 categories and 8 raters. This gave me an overall agreement of 0.62500, a fixed-marginal kappa of 0.159372 and a free-marginal kappa of 0.53125.
I didn’t know the difference between fixed- and free-marginal kappa, so I read your article: Free-Marginal Multirater Kappa (multirater K[free]): An Alternative to Fleiss’ Fixed-Marginal Multirater Kappa. This helped me understand a lot more about it, and seeing as the raters had no a priori knowledge of the amounts in each category, I am assuming I should definitely not use the fixed-marginal kappa. I still am not sure whether to use the overall agreement or the free-marginal kappa. Could you help me out? And is there a way not to leave the one case with a missing value out of the calculation?
I have to warn you: when it comes to statistics, I am a complete nitwit.
Thanks for your time,
and I wanted to say that reading all the questions (well, the ones I understood, at least…) has helped me a lot in learning more about Fleiss’ kappa. I didn’t know what I was expected to calculate at first, just that it was called inter-rater agreement. Now I have a feeling I sort of know what I’m doing.
Thanks again,
Karin
September 8, 2011 at 10:00 pm
Justus
Hi Karin,
Thanks for the kind words. I’m glad that it’s useful.
There are various ways to handle missing data, but I’m not an expert on any of them. You could impute the missing data point, for example: when the other raters gave an X rating, what did the rater in the missing case tend to do? You could report kappa both with and without replacement of the missing value.
September 8, 2011 at 11:29 pm
Karin
Thanks Justus!
I hadn’t considered the possibility of reporting two kappa values. Could you explain to me the difference between overall agreement and free-marginal kappa?
Karin
September 9, 2011 at 12:12 am
Justus
Overall agreement, as the name implies, is the proportion of ratings in which raters agreed. It can go from 0% agreement to 100% agreement. Kappa, on the other hand, is adjusted for chance. Like a correlation coefficient, it can take on values from -1 to 1, where 0 = agreement just as good as chance and 1 = perfect agreement above chance. Kappas of .7 or above are considered acceptable.
September 9, 2011 at 12:13 am
Justus
Overall agreement is the percentage of ratings in which raters agreed. It goes from 0-100. Kappa is adjusted for chance. It can’t hurt to report both.
September 28, 2011 at 6:54 pm
Cat
Hi, sorry if I’m being really dense here, but I have 4 categories of data with 2 raters coding. I don’t understand this: “input a zero if no raters agreed that a case belonged to that category. The sum of each row should equal the number of raters.” How can each row equal the number of raters if no raters agree on a category? I think I’m missing something!
My data is coded as yes/no for each category, so if I have one rater saying yes and the other saying no, do I enter this in that category as 0?
Help much appreciated!
Thanks
September 28, 2011 at 7:18 pm
Cat
Sorry, scrap that last message – I have worked it out, thanks!
October 5, 2011 at 12:52 am
Chiao
Hi, Justus, I am a student who is doing a study which is related to translation methods. To be honest, I did not know much about statistics and kappa, so I have spent time reading books/articles related to it. Now, I have basic knowledge, but I bump into a problem while using your online kappa calculator.
In my research, I want to examine the way(s) translators deal with certain words. For example, “verbally” is the word that I am probing into, so I select 10 sentences in which “verbally” is contained, and these 10 samples are translated by 10 different translators (i.e. every sample is translated by 1 translator, and 10 different translators tackle these 10 samples). I categorize each sample into suitable translation method category (9 in total). In this case, I think that I should enter 10 in “NO of Cases”, 9 in “NO of categories” and 1 in “NO of Raters”. However, I find that the free-marginal kappa is always -0.125000 no matter how I arrange the figures in the table. Is it because there should be more than 2 raters? Or is there any mistake in my setting? Should I actually see the 10 different samples as 1, and put 1 as the NO of Cases and 10 as NO of Raters?
Thank you very much for your suggestions!!!
Chiao
October 5, 2011 at 9:10 pm
Justus
Hi Chiao,
I think that the problem is that you only have one rater. Kappa is useful when you want to calculate interrater reliability, so you would need two or more raters rating the same cases and using the same rating categories. Kappa would work for example if you and another person (2 raters) looked at each of the ten translations (10 cases) and decided which of the nine translation methods was used (9 categories). Does that make sense? Using two raters helps ensure that the ratings are intersubjective . . . that is, they are not purely subjectively based on your own ratings.
Take care,
Justus
October 5, 2011 at 9:25 pm
Chiao
Yes, that makes sense, and actually I sort of knew what the problem was… In addition to this analysis, I have conducted a test in which every sample was translated by 6 translators. The results from this test will be more suitable for kappa.
I was thinking about comparing the kappa values for this analysis results with those for my test results. But it seems to be infeasible now. I will have to figure out a new path. Thank you very much for your reply!!
November 12, 2011 at 1:15 pm
Sharon
Hello Justus:
I am looking for your guidance in using kappa to understand to what extent there is interrater reliability among raters using a new state teacher evaluation instrument. In my research study, 39 raters were instructed to rate 17 elements using six categories. Four teachers were rated. Am I correct in saying that there are 17 cases? I have tried to input the data using 17 cases (the number of elements rated), 6 categories, and 39 raters. Across the table, a tally of the six categories should equal 39 (they do – I’ve double-checked!). I get a message that the p should equal the number of raters and that the calculator can’t compute. I did this for the first teacher and got a value of 0.3, and did the second teacher the same way, yet I get this message. What am I doing wrong?
Sharon
November 15, 2011 at 6:03 am
Sharon
Justus:
I stuck with the task and found the issue that was keeping the calculation from being done!
November 24, 2011 at 7:20 pm
lizzie
This is fantastic! I calculated Fleiss’ Kappa on my data and it was quite clear that the values were not representative of the actual levels of agreement (I was getting high agreement but low kappa), because of the way in which the expected levels of agreement are calculated. A free marginal multirater kappa was exactly what I needed. Thank you so much. This has saved me a lot of headaches!
November 25, 2011 at 6:37 am
percentage calculator
Brilliant! I’m such a fan of calculators, they make things so much easier, especially things that are already extremely difficult 🙂
December 6, 2011 at 7:47 pm
Fredrik, Sweden
I am amazed both by the calculator and by your answers to people’s questions. I will be using the calculator in a pilot study, and have two questions.
I have six raters (this n will be larger in a coming study); they will conduct an interview with three patients (actors using three standardized scripts) using a clinical interview with 7 items (referred to as items i-o in the data example below). Each item is assessed by the rater as “symptom absent (0)” or “symptom present (1)”. This means that a case may end up with a yes (1) anywhere from 0-7 times.
Question 1. Can I calculate an overall Kappa for all seven items, or should I calculate a Kappa for each item? The data may look like this:
Case a;
rater 1)item i)1, item j)1, item k)0, item l)1, item m)0, item n)1, item o)0
rater 2)item i)1, item j)1, item k)0, item l)1, item m)0, item n)1, item o)0
rater 3)item i)0, item j)1, item k)0, item l)1, item m)0, item n)1, item o)0
rater 4)item i)0, item j)1, item k)0, item l)1, item m)0, item n)1, item o)0
rater 5)item i)1, item j)1, item k)0, item l)1, item m)0, item n)0, item o)1
rater 6)item i)0, item j)1, item k)0, item l)1, item m)0, item n)1, item o)0
Case b
rater 1)item i)0, item j)1, item k)0, item l)1, item m)0, item n)1, item o)0
rater 2)item i)0, item j)1, item k)0, item l)1, item m)0, item n)1, item o)0
rater 3)item i)0, item j)1, item k)1, item l)1, item m)0, item n)1, item o)0
rater 4)item i)0, item j)1, item k)1, item l)0, item m)0, item n)1, item o)0
rater 5)item i)0, item j)1, item k)0, item l)0, item m)0, item n)1, item o)0
rater 6)item i)0, item j)1, item k)0, item l)1, item m)0, item n)1, item o)0
Case c
rater 1)item i)0, item j)1, item k)1, item l)1, item m)0, item n)1, item o)0
rater 2)item i)0, item j)1, item k)0, item l)1, item m)0, item n)0, item o)0
rater 3)item i)0, item j)1, item k)1, item l)0, item m)0, item n)1, item o)0
rater 4)item i)0, item j)1, item k)0, item l)1, item m)0, item n)1, item o)0
rater 5)item i)0, item j)1, item k)0, item l)0, item m)0, item n)1, item o)0
rater 6)item i)0, item j)1, item k)0, item l)1, item m)0, item n)1, item o)1
Question 2. Would you use a fixed or free marginal Kappa?
Once again thanks for a great calculator!
December 7, 2011 at 6:50 am
Justus
Hi Fredrik,
I would calculate a kappa for each item. It would only take one sentence to report: “the kappas for items i, j, k, l, m, n, and o were …, respectively.” In that case there would be six raters, three cases, and two categories. You could compute and report descriptive stats for those seven kappas if you wanted to get a sense of the big picture.
I would use the free-marginal kappa, since you aren’t specifying that a certain number of cases have to be 1s and 0s.
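To make the per-item arithmetic concrete, here is a minimal Python sketch using the free-marginal formula (Pe = 1/k, with Fleiss-style pairwise agreement) applied to item i from the data above; that this matches the OKC digit-for-digit is an assumption:

def free_marginal_kappa(rows, k):
    # rows: per-case category counts; k: number of rating categories.
    def pairwise(counts):
        n = sum(counts)
        return (sum(c * c for c in counts) - n) / (n * (n - 1))
    po = sum(pairwise(r) for r in rows) / len(rows)
    pe = 1.0 / k
    return (po - pe) / (1 - pe)

# Item i above: case a has three 0s and three 1s; cases b and c are all 0s.
item_i = [[3, 3], [6, 0], [6, 0]]
print(round(free_marginal_kappa(item_i, 2), 3))  # 0.6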
Take care,
Justus
December 7, 2011 at 2:38 pm
Fredrik, Sweden
I would nominate you for a Nobel prize if there were one in statistics; you’re the best! Thanks Justus.
December 8, 2011 at 9:06 pm
Justus
Hi Fredrik,
Thanks for the kind words.
December 10, 2011 at 11:27 am
Webbie
Hello Justus,
Two questions for you. I have obtained two kappa values for two different image sets (same readers, same ordinal scale). I’m interested in knowing whether the two kappas are significantly different… or is that even a valid concern (i.e., would I simply report one as having “good agreement” and the other as having “moderate agreement”)? Secondly, I was wondering how I could obtain a confidence interval for a given kappa value, and is that possible through your calculator?
Thanks for any feedback you can provide.
-Webbie
February 27, 2012 at 6:16 pm
Justus
Hi Webbie,
I don’t have a program for seeing if two kappas are statistically different. You could use resampling to do it though.
Speaking of resampling, I have code written above in the resampling stats/Statistics 101 language to calculate confidence intervals around kappa. The OKC doesn’t do this yet.
December 11, 2011 at 10:53 pm
Odhrán Murray
Hi,
When trying to use RefGrab-it to bring your webpage into my RefWorks account, it keeps grabbing the second reference at the bottom of your page instead of your website’s details. Any suggestions?
Odhrán
February 27, 2012 at 6:17 pm
Justus
Hi Odhran,
I don’t use that software. I’m not sure how to help you.
December 15, 2011 at 11:11 pm
Fredrik, Sweden
Dear Justus, I wish to consult you with another inter-rater reliability question this time:
Are kappa statistics suitable for “free-text” ratings as well? I have an idea for a study in which I would interview triads (1 patient and 2 treating clinicians) and get them to describe the patient’s main “problems”, “problem behaviors”, “goal behaviors”, and “interventions”. As all these variables can vary in an almost infinite number of ways, I don’t think a meaningful categorical list can be created (unless I use thousands of categories, and that would be impossible for raters to use), and even if it could be done it would be too general and non-specific to be clinically interesting. Instead I would like to interview the patients and clinicians and elicit verbal descriptions at a certain predetermined level of detail (e.g., 2-5 word specific descriptions). Is kappa suitable when there is no fixed number of categories available and answers/ratings can vary infinitely? How would you analyze such data?
Your help is invaluable!
/Fredrik
January 26, 2012 at 10:28 pm
Justus Randolph
Hi Fredrik,
Sorry to take so long lately. I’ve been swamped.
Anyway, kappa isn’t suitable in this case. It sounds like you need to go through one or two rounds of pilot testing to see what the range of category descriptions is likely to be, then create new categories where you find themes in those descriptions. Once you have a fixed number of categories, you can use kappa.
Take care,
Justus
March 12, 2012 at 11:24 am
Michal Schneider
Hi, thanks for creating this site! I have tried to use it but receive a message that says I need to enter numerical values for my raters. I thought I did! I am puzzled… I have 74 cases, 2 categories (yes, no) and 4 raters. The data is as follows:
3 1
4 0
3 1
4 0
1 3
3 1
3 1
3 1
3 1
1 3
1 3
1 3
4 0
3 1
3 1
3 1
4 0
4 0
1 3
1 3
1 3
3 1
2 2
1 3
1 3
4 0
4 0
3 1
1 3
1 3
4 0
1 3
3 1
1 3
4 0
3 1
2 2
4 0
1 3
1 3
3 1
4 0
3 1
3 1
1 3
3 1
3 1
2 2
1 3
2 2
1 3
3 1
3 1
4 0
3 1
1 3
1 3
4 0
1 3
4 0
3 1
4 0
1 3
3 1
4 0
3 1
1 3
4 0
4 0
3 1
2 2
1 3
What am I doing wrong here? Thanks, Michelle
March 12, 2012 at 7:51 pm
Justus
Did you do this?:
NO of cases = 74
NO of categories = 2
NO of raters = 4
If so, you need to make sure that all of the rows add up to 4 (either 4,0; 1,3; 2,2; 3,1; or 0,4).
One mistake will lead to an error. You can create the data set in Excel to check each row, then cut and paste.
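A tiny Python sketch of that check, for anyone who wants to validate a pasted table before hitting Calculate (the function name and example rows are made up):

def bad_rows(rows, n_raters):
    # Return 1-based indices of rows whose counts don't sum to n_raters.
    return [i + 1 for i, row in enumerate(rows) if sum(row) != n_raters]

rows = [[3, 1], [4, 0], [2, 1]]  # third row sums to 3, not 4
print(bad_rows(rows, 4))  # [3]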
March 12, 2012 at 7:56 pm
Justus
Wait, do you only have 72 cases?
March 13, 2012 at 6:40 am
Michal Schneider
Thanks – got the results now! One more question: the overall kappa was 0.61, but the fixed-marginal kappa (appropriate in my scenario, I think – raters had to assign a ‘yes’ or ‘no’ value to each case) was only 0.178. How can there be such a huge difference? And which kappa should I use?
Many thanks for helping everyone out! Much appreciated!
Michelle
March 13, 2012 at 5:34 pm
Justus
There’s debate still on which to use under which circumstances. I agree with Brennan and Prediger and, therefore, think that free-marginal kappa is appropriate in your case. Here are some references if you want to explore further:
Brennan, R. L., & Prediger, D. J. (1981). Coefficient Kappa: Some uses, misuses, and alternatives. Educational and Psychological Measurement (41), 687-699.
Warrens, M. J. (2010). Inequalities between multi-rater kappas. Advances in Data Analysis and Classification.
Advance online publication. doi:10.1007/s11634-010-0073-4
Randolph, J. J. (2005). Free-marginal multirater kappa: An alternative to Fleiss’ fixed-marginal multirater kappa. Paper presented at the Joensuu University Learning and Instruction Symposium 2005, Joensuu, Finland, October 14-15th, 2005. (ERIC Document Reproduction Service No. ED490661)
March 22, 2012 at 4:35 am
Tony
I’d like to ask what is probably a very naive question. I need to analyze survey results. I have subjects make judgments about the truth or falsity of sentences based on scenarios they have been given. I want to determine the degree of agreement among subjects for each item. In other words, overall agreement on the task is less important for my purposes than the degree of agreement among subjects for each item on the survey. Is a free-marginal kappa what I want here? What would generally be considered a good minimum value for this? Thanks!
March 22, 2012 at 11:33 am
Justus
Hi Tony,
I think that kappa would be appropriate in this situation. If you need it for each individual item, set the No of cases to one, the No of raters to however many raters you have, and the No of categories to two (I’m guessing: true or false). Then repeat for each item.
A rule of thumb is that a kappa of .70 or above indicates adequate interrater agreement.
There’s debate on which version to use, but I think that free-marginal kappa is appropriate unless you are forcing raters to categorize a certain number of cases as true and a certain number as false. See the Brennan and Prediger article referenced at the bottom of the OKC page for more info.
March 22, 2012 at 12:54 pm
Tony
Justus, thanks so much. From your explanation, I do think the free marginal kappa is what I need. I will also check out the Brennan and Prediger article. Many thanks!
March 31, 2012 at 2:16 am
Ricardo
Hi Justus, first of all thank you for providing this wonderful Kappa calculator. I’d like to ask a question about the following setting: 50 cases, 3 categories (OK, M+ and M-) and 3 raters. When I insert all this data in your calculator, it calculates the kappa considering all 3 categories, as expected.
I would like to know if your calculator allows calculating the kappa for just one category. For example, I would like to know what the agreement of the raters is for the category “OK”. If your calculator allows the previous operation, how should I put the input?
Thank you for your time.
March 31, 2012 at 2:33 am
Justus Randolph
Hi Ricardo,
Turn the variable into an OK/Not OK (dichotomous) variable and use two categories: either it’s 1 (yes) or 0 (no).
Hope this helps.
Justus
June 12, 2012 at 3:33 am
divya
Mine says error because the number of agreements isn’t equal to the number of raters, but I’ve checked it many times… I don’t know what could have gone wrong…
June 12, 2012 at 10:14 pm
divya
justus,
I did a study: a questionnaire with 32 questions was applied to 15 patients by 3 raters. The questions are marked on a 4-point scale. So now how many cases and categories do I have? I’m kind of clueless about how to go about the calculations.
June 17, 2012 at 12:10 am
Justus
I would calculate a kappa for each question. Therefore, there would be 15 cases, 3 raters, and 4 categories.
June 13, 2012 at 10:08 pm
Mohamed
Hi Justus,
I filled in the table with details involving 7 cases, 12 categories and 2 observers.
Pressing calculate, an error message appears stating that “Kappa cannot be calculated because the number of agreements in case (a) is not equal to the number of raters, which is 2”.
I am not sure where it went wrong. I checked all the entries and they are correct.
Could you help me, please?
Thanks,
Mohamed
June 17, 2012 at 12:11 am
Justus
A common error is leaving a cell blank: you have to put in a 0 (e.g., 2 and 0, not 2 and an empty cell). Did you try that?
July 30, 2012 at 9:20 pm
Lyn
Hi Justus,
I am trying to calculate kappa with this resource. When I press calculate, it gives me this message: “Ensure that you enter numeric values for number of raters”. My number of raters is 2 and I have indicated “2”. Am I doing something wrong? Thanks for a prompt response.
July 30, 2012 at 9:43 pm
Justus
The calculator is working from my end. I’m not sure what’s wrong. What is the number of raters, categories, and cases?
August 3, 2012 at 1:44 pm
Lyn
Hi Justus,
Thanks for the reply. I have 2 raters, 3 categories and 131 cases.
I’ve attached my table so that you can try it. Thanks again for your prompt response.
Evelyn
Personal/social complaint | Information need | Request for assistance
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
0 0 2
1 1 0
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
1 1 0
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
1 0 1
1 1 0
2 0 0
2 0 0
2 0 0
0 0 2
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
1 0 1
1 0 1
2 0 0
0 0 2
2 0 0
2 0 0
2 0 0
1 0 1
2 0 0
2 0 0
0 0 2
2 0 0
2 0 0
2 0 0
2 0 0
0 0 2
2 0 0
0 0 2
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
0 0 2
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
0 0 2
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
0 0 2
1 0 1
2 0 0
0 0 2
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
0 0 2
2 0 0
0 0 2
0 0 2
2 0 0
0 0 2
0 0 2
0 0 2
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
2 0 0
0 0 2
0 0 2
0 0 2
0 0 2
0 0 2
0 0 2
0 0 2
1 0 1
0 0 2
2 0 0
2 0 0
2 0 0
2 0 0
December 22, 2012 at 1:17 am
Justus
About your reply below: the data worked for me. That you only had 130 cases might have been the problem.
December 18, 2012 at 7:26 pm
Ahmed M
Downloading the Java calculator is taking me a long time. As every minute passes, the remaining time shown increases, even by hours. When I started loading, the remaining time was less than 2 hours, but now it has reached 4 hours. What is the problem?
December 22, 2012 at 1:12 am
Justus
It loads fine on my computer. You might try uninstalling and reinstalling Java. Sometimes my server gets slow, but it usually only takes a few minutes to load.
December 21, 2012 at 2:34 am
gregory smith
Thanks for this calculator. I had implemented my own version and needed to verify my results. This calculator and my program get the same answers.
Now my problem… Why do I get a kappa of -0.091 when the respondents were so closely aligned? Here is my data:
3 0 0
3 0 0
3 0 0
2 1 0
I get kappa = -0.091 and I would expect it to be very close to 1.0 since all the respondents agreed except in one case.
December 22, 2012 at 1:10 am
Justus
Here are three good references on the free-marginal kappa and the cases when it might be appropriate to use it.
Brennan, R. L., & Prediger, D. J. (1981). Coefficient Kappa: Some uses, misuses, and alternatives. Educational and Psychological Measurement (41), 687-699.
Warrens, M. J. (2010). Inequalities between multi-rater kappas. Advances in Data Analysis and Classification.
Advance online publication. doi:10.1007/s11634-010-0073-4
Randolph, J. J. (2005). Free-marginal multirater kappa: An alternative to Fleiss’ fixed-marginal multirater kappa. Paper presented at the Joensuu University Learning and Instruction Symposium 2005, Joensuu, Finland, October 14-15th, 2005. (ERIC Document Reproduction Service No. ED490661)
December 21, 2012 at 3:51 am
gregory smith
After some more reading, I see I’ve stumbled upon the problem with Fleiss’ Kappa and why Randolph’s R(S) Free-Marginal Kappa is offered. Where can I get details on R(S)? Is the source code for the online calculator available?
December 22, 2012 at 1:11 am
Justus
See the comment above. The low kappa probably had to do with marked marginal asymmetry.
December 28, 2012 at 10:57 pm
Lucy
Hi Justus, I just wanted to thank you for making the kappa calculator available. I wish I had found it a couple of years ago, when I calculated Fleiss’ kappa by hand (I’m no mathematician!). This time round my analysis has been much easier 🙂 This forum is also really helpful for understanding the various issues with the kappa coefficients – thank you! Lucy
December 29, 2012 at 6:23 am
mef
Hello Justus.
First, thank you for your hard work and research on this topic and for making your calculator available to the public.
Second, the link to your article:
Randolph, J. J. (2005). Free-marginal multirater kappa: An alternative to Fleiss’ fixed-marginal multirater kappa. Paper presented at the Joensuu University Learning and Instruction Symposium 2005, Joensuu, Finland, October 14-15th, 2005. (ERIC Document Reproduction Service No. ED490661)
is no longer working. Do you know where we can access this presentation/paper now?
Thank you
January 12, 2013 at 6:42 pm
ajanta akhuly
Hi, I am a PhD student in psychology and need to calculate a kappa coefficient for my data, but I cannot find the link to your kappa calculator. Could you please give the link? The OS on my computer is Windows; will it work?
January 13, 2013 at 2:01 am
Justus
You can find the Online Kappa Calculator here:
http://justusrandolph.net/kappa/
Take care,
Justus
January 13, 2013 at 12:29 pm
ajanta akhuly
Thanks.. but i went to that link but can’t see it 😦
January 15, 2013 at 3:58 pm
ajanta
Thanks a lot… I needed to install Java to get to the calculator.
January 29, 2013 at 10:58 am
Addy
Hi there, thanks for the great calculator. I was just wondering if there’s a simple method of comparing 2 kappa values (testing significance) from the same and from different samples/cases? Thanks.
February 1, 2013 at 12:24 am
Justus
Hi Andy,
If you scroll up (March 27, 2009 — To Sofie), I’ve written some Resampling Stats/Statistics 101 code for creating confidence intervals around kappa. You can use that to compare kappa values.
February 1, 2013 at 12:24 am
Justus
Addy, not Andy. Sorry.
March 8, 2013 at 8:31 pm
Vina
Hi Justus,
I have a question about your publication on the free-marginal multirater kappa. There is something I don’t quite understand: why does the kappa value increase with an increasing number of categories? One should expect the kappa to decrease, right, because of the increased variation (along with the increased categories)?
Looking forward to your reply!
March 8, 2013 at 11:35 pm
Justus Randolph
Hi Vina,
Good question. It might help just to explain with an example, but first a review.
1. A kappa of zero means that you are guessing just as well as chance. A kappa greater than zero means that you are guessing better than chance. A kappa less than zero means that you are guessing worse than chance.
The formula for kappa is
(observed agreement – expected agreement by chance)/ (1 – expected agreement by chance)
2. In a free-marginal kappa, the percent expected by chance on any one trial is 1 divided by the number of categories.
So, if I flip a coin and can guess the right result 50% of the time, that’s not very impressive, because one would expect to guess correctly 50% of the time just by chance. In that case, the kappa would be zero: (.50 - .50)/(1 - .50) = 0. However, it would be impressive if I could correctly guess the results of rolling a die 50% of the time, because the expected probability by chance would only be about 17% (i.e., 1 out of 6 categories). Since the observed agreement (50%) is greater than the agreement expected by chance (17%), the value of a free-marginal kappa would be greater than zero; in fact, it would be .40: (.50 - .17)/(1 - .17) = .33/.83 = .40.
So, in short, holding the percent of overall agreement constant, the more categories you have in a free-marginal kappa, the greater the value of kappa. I demonstrate this in the article. That’s why it’s important to use only the number of categories that are theoretically justifiable in a free-marginal kappa.
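The same arithmetic as a one-function Python sketch (the inputs are just the coin and die numbers from the explanation above):

def free_marginal_kappa(po, k):
    # po: percent of overall agreement; k: number of categories (Pe = 1/k).
    pe = 1.0 / k
    return (po - pe) / (1 - pe)

print(free_marginal_kappa(0.50, 2))            # coin: 0.0
print(round(free_marginal_kappa(0.50, 6), 2))  # die: 0.4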
March 31, 2013 at 10:59 pm
ajanta akhuly
Maybe the answer is already there on the blog, but I’m asking it anyway:
how are your fixed-marginal and free-marginal kappas related to Fleiss’s multirater kappa coefficient and Conger’s multirater kappa coefficient?
thanks
ajanta
April 9, 2013 at 1:50 am
jrandolp
You might check out these articles:
Brennan, R. L., & Prediger, D. J. (1981). Coefficient Kappa: Some uses, misuses, and alternatives. Educational and Psychological Measurement (41), 687-699.
Warrens, M. J. (2010). Inequalities between multi-rater kappas. Advances in Data Analysis and Classification.
Advance online publication. doi:10.1007/s11634-010-0073-4
Randolph, J. J. (2005). Free-marginal multirater kappa: An alternative to Fleiss’ fixed-marginal multirater kappa. Paper presented at the Joensuu University Learning and Instruction Symposium 2005, Joensuu, Finland, October 14-15th, 2005. (ERIC Document Reproduction Service No. ED490661)
Justus
May 14, 2013 at 6:17 pm
Justin W Walthers
Justus,
Your free-marginal kappa calculator is great and I have found it very useful. However, in your 2005 paper, you reference SPSS syntax to calculate the free-marginal multirater kappa, located at
http://www.geocities.com/justusrandolph/mrak_macro.
Following the link leads to a dead end, unfortunately. I was wondering if perhaps you still had the SPSS macro or at least knew where it could be located?
Thanks!
May 16, 2013 at 7:24 pm
Justus Randolph
Hi Justin,
I totally forgot about this macro. You might check it against the values in the OKC. Basically, it’s just a modification of Nichol’s SPSS code for fixed-marginal kappas.
You can find it at:
http://justusrandolph.net/kappa/mrakmacro.htm
July 13, 2013 at 2:33 pm
Jenni
Hi there, sorry to land you with such a preliminary question, but I’m not sure how to enter my data.
I have 73 codes (each describing a different portrayal of a person who stutters) and 40 films. 2 raters have decided whether each of these codes exists or not in each film. So for each code, for each of the 40 films, either 2 raters, 1 rater, or 0 raters have marked the code as existing.
I’ve fallen at the first hurdle and am unsure how to proceed. Perhaps I should just calculate % agreement and be done with it. Is kappa possible?
Many thanks,
Jenni
July 17, 2013 at 4:07 pm
Justus Randolph
Hi Jenni,
Just to clarify. You had two raters rate 40 films. Those raters coded those films on 73 binary variables. If that’s the case, the number of raters will always be 2, and the number of categories will also be 2. If you want to report kappa for each of the 73 characteristics, there will be 40 cases (you would end up with 73 kappas–one for each characteristic). If you want to report a kappa for each film, the number of cases will be 73. If you wanted to get it down to a single number, you could make the number of cases 73*40 if it’s not important to report the kappa for each film or characteristic; however, you would be ignoring the dependent structure of the data.
Hope this helps.
Justus
July 24, 2013 at 4:42 pm
Mari
Hi Justus,
I’ve been trying to calculate the (Fleiss) kappa for a SCID-I assessment. There are 44 subjects and 26 diagnostic categories, ranging from panic disorder to depression. For each of these diagnostic categories there are 2 options: either the subject has the disorder, depression for example, (1) or hasn’t (0).
There are 3 raters.
For some reason I can’t grasp what to fill in for ‘cases’. Is it even possible to calculate the kappa for the entire SCID-I? I do understand how to calculate the kappa for just one of the diagnostic categories, but I really would like the kappa of the entire diagnostic instrument.
So,
– 44 subjects
– 3 raters (PRE, POST and Follow-up)
– 2 options (yes/no)
– 26 diagnostic categories.
Thank you,
Mari
July 24, 2013 at 4:50 pm
Justus Randolph
Hi Mari,
It looks like you have the same research situation as Jenni; see the comment above. You could either create a kappa for each diagnostic category or combine all of the data and try to get an overall kappa. However, you ignore the data structure if you do the latter.
Hope this helps,
Justus
July 24, 2013 at 5:11 pm
Mari
Alright, I will do both in my research paper just to be sure. If I were to calculate the kappa for the category Depression, is this the right way to go:
Number of cases: 44
Number of categories: 2
Number of raters: 3
Then I would do this 25 more times for each category.
And if I were to combine the 26 categories, would that mean I’m left with the single category “Disorder (unspecified)”?
July 24, 2013 at 5:18 pm
Justus Randolph
Hi Mari,
That sounds like a good plan for presenting all of the kappas, especially in a thesis. You might find that raters can easily agree on whether some disorders are present, while others are harder to agree on. That sounds like an interesting finding in itself.
If you did collapse it, I guess you could call the category “disorder or not.” If I were you, I would just report the categories separately; it’s more meaningful than collapsing all of your data. Collapsing would be as if you assumed there were 44*26 cases being rated by three people on one variable, but that’s not really the situation.
Hope this helps,
Justus
July 24, 2013 at 5:25 pm
Mari
Yes, exactly I will do that.
But just to be sure, is this the right way to go:
Number of cases: 44
Number of categories: 2
Number of raters: 3
Thank you so much for your help!
September 19, 2013 at 12:13 pm
Tony Wright
Hello Justus! I was wondering if I could ask you a question. I have used your online kappa calculator to get the kappa values for a survey task in which raters gave true/false/not sure answers to questions. Each question has a control-condition version and an experimental-condition version. I want to get the kappa score for each control-condition item and compare it with the kappa for the equivalent experimental-condition version. For most items, the majority answers went from majority “true” with a high kappa score to majority “false” with kappa scores ranging from fairly high to fairly low, as expected.
I then wanted to calculate the “distances,” as it were, between the kappas for each control-condition item and its experimental-condition counterpart. I did this by changing the sign of the kappas for majority “false” answers to negative (keeping the sign of majority “true” answers positive) and subtracting the experimental-condition kappa from the control-condition kappa. This, I felt, gives me a distance between the two kappa scores on a number line, so as to see the magnitude of the effect of the experimental condition on raters’ judgments. In this way, high agreement on “true” puts an item far to the right of zero on the number line, and high agreement on “false” puts it far to the left. An item with majority “true” answers with high agreement in the control condition and majority “false” answers with high agreement in the experimental condition shows a very large effect; an item that goes from majority “true” with high agreement to majority “true” with low agreement shows a smaller effect, and so on. Does this seem like a valid thing to do with kappa scores?
Sorry for the long question. Thanks so much for this valuable tool you’ve made available!
November 4, 2013 at 2:06 pm
Justus Randolph
Hi Tony,
I haven’t seen that approach before, but it seems logical to me. The distance would be the effect size.
I have written resampling stats code for getting confidence intervals around a free-marginal kappa. You could see whether the control condition’s kappa falls outside the confidence interval of the experimental condition’s kappa to see if the difference is statistically significant. The code is posted higher on this blog.
Good luck,
Justus
September 24, 2013 at 9:42 am
PIPELIER
Hello,
I would just like some more information, please. Can you tell me the maximum number of cases, categories, and raters your Online Kappa Calculator accepts?
Thank you.
November 4, 2013 at 2:08 pm
Justus Randolph
I had the code changed years ago so that the OKC has very high limits. I can’t remember exactly what they are, but they are high enough that no one has asked for higher limits yet. Let me know if you run into a problem.
Justus
October 30, 2013 at 10:26 am
Kristin Pontoski
Hi Justus,
Thank you for your effort in creating this excellent resource! I was wondering if you know of any published examples that used your calculator (or more specifically the formula you present for the free-marginal statistic for multiple raters in your 2005 paper)? I’ve used your calculator to compute the free-marginal kappa and it resulted in generally good agreement among raters. However, as relative novices in this subject area, my co-authors and I would like to make sure that this method has support in the literature before we include its use in an upcoming submission in the event that reviewers ask about it.
Also, do you have recommendations for acceptable cut-off values for what designates adequate agreement? We’ve been going with a value of 0.6 as it’s presented elsewhere in the literature, but I know this can be a topic of some debate.
Thank you for your help!
Kristin
November 4, 2013 at 2:13 pm
Justus Randolph
Hi Kristin,
According to Google Scholar, there have been 120 citations of Randolph’s kappa (the multirater free-marginal kappa) so far. You can see a list of them on Google Scholar. Try this link:
http://scholar.google.com/scholar?oi=bibs&hl=en&cites=7804045286045263621
Let me know if that link doesn’t work. Alternatively, you could find the citations by searching for “Justus Randolph” on Google Scholar and then clicking on “Citations” for the article “Free-marginal multirater kappa (multirater κfree): An alternative to Fleiss’ fixed-marginal multirater kappa.”
Hope this helps,
Justus
November 4, 2013 at 2:27 pm
Justus Randolph
Hi Kristin,
My heuristic is that .70 or greater is the cut-off for what I would consider “acceptable” agreement in my field (educational research). I think I originally got that from Neuendorf’s Content Analysis Guidebook. But you’re right; what is acceptable is debatable within a field. A .60 kappa might be good agreement in some fields. Plus, like p < .05, an “acceptable” kappa of .70 or greater is an arbitrary distinction.
I have written resampling stats code for creating CIs around a free-marginal kappa. That might give your readers additional information when interpreting the level of agreement. You can find it earlier on this blog.
Hope this helps,
Justus
November 4, 2013 at 2:32 pm
Kristin Pontoski
Justus, thanks so much for your replies to my questions!
April 13, 2014 at 11:58 pm
Vandana
Hello — I have a large dataset, greater than 1,000 rows. Is there an easy way to paste into the generated table? I’m working on a Mac and I’ve accepted the digital certificate. However, the table I see doesn’t give me a scroll bar, and trying to select all the cells to paste from Excel doesn’t seem to work. Any help you can provide would be much appreciated. Vandana
April 14, 2014 at 1:20 pm
Justus Randolph
Hi Vandana,
I’ve heard that the cut-and-paste feature doesn’t work well on Macs. One user’s solution was to try the cut-and-paste procedure from a PC. Sorry about the inconvenience.
Justus
April 14, 2014 at 1:26 pm
Jean-Yves
Hi Vandana,
If I remember correctly, you copy your rows with Cmd+C and paste them into the Online Kappa Calculator with Ctrl+V, as on a PC.
June 23, 2014 at 4:51 pm
Elizabeth Witt
I don’t even have a calculator appearing. There is a blank spot in the upper right of the screen. I did get a pop-up asking me to update Java, which I did, including a step to uninstall a previous version. But I still don’t see a calculator or any way to enter data.
June 24, 2014 at 4:35 pm
Justus Randolph
Hi Elizabeth,
The calculator is working on both of my computers. I have to accept the digital certificate to get it to work. If that doesn’t work, temporarily turn down the security: go to Java in your apps, then go to Configure Java, and from the Security tab choose Medium. That worked for me when I had a problem with the OKC after a Java update. Hope this helps. Let me know what happens.
Take care,
Justus
January 18, 2015 at 12:26 pm
Su Lucas
Hi Justus
Your Randolph’s Kappa seems to be the answer to our prayers, but I would like to check that we are correct in our interpretation:
In our field (diagnostic radiology) we often have 3 readers reviewing images retrospectively for research purposes. Their findings are documented on a binary tick sheet, i.e. is the finding present on the image, yes/no. Up to now we have been using Fleiss’ Kappa to test for inter-reader agreement, but we know that it works best if the prevalence of a finding is approximately 50%.
Our problem comes in when the prevalence of a radiological finding is either very high or very low. This is where we think the free-marginal multirater kappa may be the solution.
Are we correct? If not, is there a different test you could advise we use?
Your suggestions, advice, and input will be greatly appreciated.
Thank you in advance
Su
January 20, 2015 at 10:59 am
Justus Randolph
Hi Su,
This is a great example of where Randolph’s kappa (i.e., the free-marginal multirater kappa) would work well. Fixed-marginal kappas (e.g., Fleiss’s or Cohen’s) can get very small even when there is very high agreement, if the prevalence is low. When using free-marginal kappas (like Randolph’s kappa), be sure not to have extraneous categories; they inflate the value of a free-marginal kappa. Hope this helps.
Justus
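To make the prevalence point concrete, here is a small numeric illustration in Python, with hypothetical counts for two raters and one rare finding; it is a sketch of the standard formulas, not output from the OKC.

```python
# Hypothetical counts for two raters and a rare finding on 100 images:
# both say "present" on 2, each disagrees on 2, both say "absent" on 94.
a, b, c, d = 2, 2, 2, 94               # yes/yes, yes/no, no/yes, no/no
n = a + b + c + d

po = (a + d) / n                       # observed agreement = 0.96

# Cohen's (fixed-marginal) kappa: chance agreement from the observed marginals.
p1_yes = (a + b) / n
p2_yes = (a + c) / n
pe_fixed = p1_yes * p2_yes + (1 - p1_yes) * (1 - p2_yes)
kappa_fixed = (po - pe_fixed) / (1 - pe_fixed)

# Free-marginal kappa: chance agreement = 1 / number of categories = 0.5.
pe_free = 0.5
kappa_free = (po - pe_free) / (1 - pe_free)

print(round(po, 2), round(kappa_fixed, 2), round(kappa_free, 2))
# 0.96 0.48 0.92 -- same high raw agreement, very different kappas
```

With 96% raw agreement, the skewed marginals pull the fixed-marginal kappa down to about 0.48, while the free-marginal kappa stays at 0.92.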
June 24, 2015 at 11:08 am
Elizabeth Witt
Justus, Any progress in fixing this? I just opened the kappa calculator and am not able to see the data entry table, nor does it even ask me to approve the certificate. All my Java settings are as they should be.
Elizabeth
June 29, 2015 at 12:24 pm
Justus Randolph
Hi Elizabeth,
Sorry for not getting back to you sooner. I have a programmer working on a second version of the OKC that is more stable.
He said that as Java changes, which browsers work also changes.
I tried the OKC with Firefox today and it worked. There are more security steps to go through in Chrome. I’ll let you know when V2 is released.
Take care,
Justus
June 29, 2015 at 2:49 pm
eawitt
Thanks, Justus. It is such a wonderful, handy tool when it works!
Elizabeth
May 25, 2016 at 5:22 am
Rosemary
Hi Justus
I have the same issue as Elizabeth. Could you please let me know the status of the release?
Thanks
Rosemary
March 9, 2017 at 10:15 am
Justus Randolph
I had it reprogrammed in JavaScript. Go for it.
March 9, 2017 at 3:30 pm
Rosemary
Hi Justus
Thank you so much.
Rosemary
November 18, 2015 at 12:32 pm
Zazaza
Hi
If there are 5 raters rating 30 cases as 0 or 1, what is the formula for Po? Thanks. I would have thought it would be the total number of cases with five 0s or five 1s divided by 30, but that does not seem to be the case.
March 9, 2017 at 10:16 am
Justus Randolph
See the accompanying article on the help page for the formulas.
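For later readers with the same question: the observed agreement in the multirater formulas is the usual Fleiss-style Po, which credits agreeing pairs of ratings rather than only unanimous cases; that is why “unanimous cases divided by 30” doesn’t match. A sketch of that reading, to be checked against the accompanying article:

```python
# Sketch of the Fleiss-style observed agreement (Po): counts[i][j] is the
# number of raters who put case i into category j. Po is the average, over
# cases, of the proportion of agreeing PAIRS of ratings, so partial
# agreement earns partial credit.
def po_fleiss(counts, n_raters):
    per_case = [sum(c * (c - 1) for c in row) / (n_raters * (n_raters - 1))
                for row in counts]
    return sum(per_case) / len(per_case)

# A single case where 4 of 5 raters said "0" and 1 said "1":
print(po_fleiss([[4, 1]], 5))   # 0.6 -- a 4-1 split is not counted as zero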
November 18, 2015 at 2:22 pm
Zazaza
Hello, please disregard my first post. I believe I found the equation.
March 8, 2017 at 2:16 pm
samuel
Hi!
I have been using this app and I like it!!!
But I have one big doubt: which kappa statistic does the app implement? Fleiss’s or Cohen’s?
March 9, 2017 at 10:14 am
Justus Randolph
The birater version and multirater version of the fixed-marginal kappa should be Cohen’s and Fleiss’s kappa, respectively. See the help information and accompanying refs for details.
Justus
May 25, 2018 at 5:48 am
Bárbara Lorence Lara
Hi!
I have been using this app to calculate Fleiss’s kappa, but I’m not sure I’m applying it correctly.
I have 3 raters (experts) who assessed wellbeing indicators of their country on 21 survey items. These items have different categorical options, sometimes three options and sometimes two. My questions are:
Could I apply Fleiss’s kappa with only one case (the country)?
If applicable, would free-marginal or fixed-marginal be better?
Should I apply Fleiss’s kappa to each item? If so, could I calculate an overall average?
I tried computing the index for one of the items with disagreement, and the results are:
Percent overall agreement: 33.33%
Free-marginal kappa = 0.20
95% CI for free-marginal kappa (NaN, NaN)
Fixed-marginal Kappa = -0.50
95% CI for fixed-marginal kappa (-1.00, 0.89)
Given these results, what does (NaN, NaN) mean?
Thank you in advance for your support
May 25, 2018 at 3:14 pm
Justus Randolph
Hi Barbara,
Thank you for your question. The OKC was recently updated, and I see that there is a bug in the free-marginal confidence intervals when the number of cases is equal to one. I’ll try to get it corrected when I get a chance.
There is a lot of debate on the merits of free- v. fixed-marginal kappas. I tend to agree with Brennan and Prediger that free-marginal is appropriate when there is no restriction on how many cases one can assign to a given category. Fixed-marginal kappas are more popular though. The Information page of the OKC has a link to the Brennan and Prediger article and other articles that review the merits of the different kappas.
Yes, it’s appropriate to have a kappa for each item. When people want to evaluate the interrater reliability of individual items, they tend to report a kappa for each item as well as summary statistics over all items’ kappas.
I hope this helps. If you send me an e-mail, I’ll send you a message when the CIs get fixed for the one-case instance, if it is important for your research to have them. I used Gwet’s variance formula for free-marginal kappa (he calls it the Brennan-Prediger kappa), so you should be able to calculate it manually. His book is also referenced on the Information page.
Take care,
Justus
May 25, 2018 at 5:31 pm
Justus Randolph
Hi Barbara,
I checked the formulas; they are OK. I might have to think about it more, but I believe the CIs are simply undefined in the single-case instance for free-marginal kappas. The kappas and CIs can also be undefined when there is perfect agreement in the single case for fixed-marginal kappas. Things get weird in the extreme cases.
June 1, 2018 at 9:34 am
Bárbara
Dear Justus,
Thank you very much for your quick and helpful support in this regard. We will report the kappa index and leave out the confidence interval.
We will keep following your advances on the topic.
Best regards,
Bárbara Lorence
October 21, 2019 at 12:51 am
David
Dear Justus,
I was wondering if you could explain something to me. Why do Situations A and B below have the same free-marginal kappa values? I guess I am not getting something, but I thought that the closer together the “clusters” of ratings are, the higher the kappa value would be. In these cases they are both 0.17.
Situation A (3 raters)
1 2 3 4 5
Presentation 1 0 0 1 2 0
Presentation 2 0 0 1 2 0
Presentation 3 0 0 1 2 0
Presentation 4 0 0 1 2 0
Presentation 5 0 0 1 2 0
Percent overall agreement = 33.33%
Free-marginal kappa = 0.17
Situation B (3 raters)
1 2 3 4 5
Presentation 1 1 0 0 2 0
Presentation 2 1 0 0 2 0
Presentation 3 1 0 0 2 0
Presentation 4 1 0 0 2 0
Presentation 5 1 0 0 2 0
Percent overall agreement = 33.33%
Free-marginal kappa = 0.17
October 21, 2019 at 2:12 pm
Justus Randolph
Hi David,
That is a good question. The percent of overall agreement for each presentation is the same in both Situation A and Situation B: in every case, two of the three ratings were in agreement. The categories in a free-marginal kappa are independent, so it doesn’t matter whether the ratings are in adjacent categories or not. For example, these sets of ratings would all have the same overall agreement: (0, 0, 1, 2, 0), (1, 2, 0, 0, 0), (1, 0, 0, 2, 0). An assumption of free-marginal kappa is that the ratings are categorical and that the chance probability of a rating landing in any category is 1 divided by the number of categories.
If your categories are ordinal, then a kappa statistic probably isn’t the best statistic for you. (I’m guessing that your rating categories are ordinal, since you assume that situations with adjacent ratings should show higher agreement.) Instead, you would want a multirater ordinal measure of interrater agreement; Gwet’s AC2 is one statistic that comes to mind. Using an ordinal measure of reliability, Situation A should have higher reliability than Situation B, since its ratings fall in adjacent categories. You can find AC2 in Gwet’s Handbook of Interrater Reliability (2nd ed.). Note that Gwet calls the free-marginal multirater kappa calculated in the Online Kappa Calculator the BP (Brennan and Prediger) statistic. There are many other ordinal measures too if AC2 doesn’t work for you.
P.S. You have some categories that never received a rating. If you use a free-marginal kappa, you should check whether each category is justified. One idiosyncrasy of free-marginal kappa is that it will increase as you increase the number of categories, all other things being equal. So I recommend using as few categories as you can while retaining the categories that are theoretically needed, meaningful, and/or useful. Fixed-marginal kappas don’t have this idiosyncrasy; instead, they are affected by the marginal symmetry of categories, all other things being equal.
I hope this helps,
Justus
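As a quick check of the arithmetic behind this answer, here is a short sketch that reproduces David’s 0.17 for both situations, using the free-marginal formula with chance agreement of 1 divided by the number of categories:

```python
# Reproduce David's numbers: 5 presentations, 3 raters, 5 categories.
# Each row is the count of raters per category for one presentation.
def kappa_free(counts, n_raters, n_categories):
    po = sum(sum(c * (c - 1) for c in row) / (n_raters * (n_raters - 1))
             for row in counts) / len(counts)
    pe = 1 / n_categories               # chance agreement, free marginals
    return po, (po - pe) / (1 - pe)

situation_a = [[0, 0, 1, 2, 0]] * 5     # ratings in adjacent categories 3 and 4
situation_b = [[1, 0, 0, 2, 0]] * 5     # ratings in distant categories 1 and 4

print(kappa_free(situation_a, 3, 5))    # (0.333..., 0.1666...)
print(kappa_free(situation_b, 3, 5))    # identical: category positions don't matter
```

Both situations give Po = 33.33% and kappa = 0.17, matching the calculator’s output above.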
October 22, 2019 at 11:19 pm
David
Hi Justus,
Thank you very much for the prompt and easy to understand reply. I am just getting up to speed with all of this, and as I do not have a background in statistics it is rather overwhelming. Nonetheless, your explanation was very, very helpful.
Cheers,
October 23, 2019 at 1:05 pm
Justus Randolph
Hi David, I am happy to help. Feel free to ask as many questions as you have. If you have a question, there are probably hundreds of people in the research community who have the same one.
The seminal text on nonparametric statistics–Nonparametric Statistics for the Behavioral and Social Sciences–also has some statistics of ordinal association that might be of interest to you.
Another option is the intraclass correlation coefficient (ICC), if you can consider the rankings to be continuous rather than ordinal. Here is a good resource on how to choose the right ICC: http://web1.sph.emory.edu/observeragreement/spss.pdf A rule of thumb is that a rating needs at least fifteen equal-interval values it can take on to be considered continuous. So, for example, a variable with a Likert rating from 1 – 5 is probably better considered ordinal. However, if your variable is the sum or average of a large number of items from a rating scale, you can probably consider that summated score continuous, since it can take on many more values than 1 – 5.
I hope this helps. Good luck on your project!
Justus
April 13, 2020 at 8:00 pm
SJH
Dear Justus,
Thank you for developing this tool and for taking the time to read my query. I have a limited knowledge of statistics so I apologize in advance for the ignorance of my question.
I want to measure inter-rater reliability for a project in which three raters gave scores from 0-50 to a series of documents. For example, the three raters scored document A: Rater #1 scored it 32, Rater #2 scored it 31, and Rater #3 scored it 30.
I cannot figure out how to use your calculator to calculate the Fleiss’ kappa of these three scores.
Again, thank you for your response and patience.
SJH
April 14, 2020 at 12:15 pm
Justus Randolph
Hi SJH,
It looks like you have a continuous rating variable. In that case, an intraclass correlation coefficient (ICC) would be a good multirater interrater reliability statistic for you: https://www.medcalc.org/manual/intraclasscorrelation.php If you use SPSS, there is an easy way to calculate the ICC; I think it’s under the Scale menu. I’ve found this guide by Nichols helpful in deciding which of the various ICC options to use:
http://web1.sph.emory.edu/observeragreement/spss.pdf
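For readers who work in Python rather than SPSS, one option (an assumption on my part, not something recommended in the thread) is the pingouin package’s intraclass_corr function. The scores below are invented for illustration, and you would still pick the ICC type per the Nichols guide:

```python
# Sketch: ICC for a design like SJH's (three raters scoring documents 0-50),
# using the pingouin package. The scores below are made up for illustration.
import pandas as pd
import pingouin as pg

data = pd.DataFrame({
    "document": ["A"] * 3 + ["B"] * 3 + ["C"] * 3,
    "rater":    ["r1", "r2", "r3"] * 3,
    "score":    [32, 31, 30, 12, 15, 14, 45, 44, 47],
})

icc = pg.intraclass_corr(data=data, targets="document",
                         raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])  # choose the ICC type per Nichols's guide
```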
October 21, 2020 at 2:32 pm
Siyavash
Hi, and thanks for your help in understanding kappa. I was wondering how we could also calculate a p-value from this.
October 22, 2020 at 2:11 pm
Justus Randolph
Hi Siyavash,
You could backtrack from the CIs to get a p value. See this link:
https://www.bmj.com/content/343/bmj.d2304
Or, if you are interested in a simple null hypothesis test, the value of kappa is statistically significant (p < .05) if the 95% confidence interval doesn't include zero. Zero is the value of kappa you would expect from chance judgments alone.
Take care,
Justus
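For anyone who wants the arithmetic from that BMJ piece spelled out, here is a minimal sketch in Python (standard normal approximation; the kappa and CI values are placeholders, not from any real analysis):

```python
# Back out a p-value from an estimate and its 95% CI, per Altman & Bland's
# BMJ note: SE = (upper - lower) / (2 * 1.96), z = estimate / SE, and
# p = 2 * (1 - Phi(|z|)). The kappa and CI below are placeholder values.
from math import erf, sqrt

def p_from_ci(estimate, lower, upper):
    se = (upper - lower) / (2 * 1.96)     # standard error recovered from the CI
    z = abs(estimate) / se                # test statistic against kappa = 0
    phi = 0.5 * (1 + erf(z / sqrt(2)))    # standard normal CDF
    return 2 * (1 - phi)

print(round(p_from_ci(0.45, 0.12, 0.78), 4))  # hypothetical kappa and its 95% CI
```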