
What is Reliability?





In 1989, Fleischmann and Pons announced to the world that they had managed to generate heat from nuclear fusion at room temperature, instead of in the huge and expensive tori used in most research into nuclear fusion. This announcement shook the world, but researchers at many other institutions failed to replicate the experiment, so the result could not be considered reliable.

Internal reliability is about the consistency across separate items within a measure. A test is internally consistent if its items all measure the same construct, with each item contributing equally to the overall score.

If you are a physicist or a chemist, repeat experiments should give exactly or almost exactly the same results, time after time. The behavior of phosphorus atoms, DNA molecules or natural forces like gravity is very unlikely to change. Ecologists and social scientists, on the other hand, understand that achieving identical results on repeat experiments is practically impossible.

Complex systems, human behavior and biological organisms are subject to far more random error and variation. While any experimental design must attempt to eliminate confounding variables and natural variations, there will always be some disparities in these disciplines.

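Reliability and validity are often confused; the terms describe two inter-related but completely different concepts. This difference is best described with an example. A researcher devises a new test that measures IQ more quickly than the standard IQ test. If the new test gives the same person wildly different scores each time they take it, it is neither reliable nor valid. If it gives the same score every time, but that score is consistently far from the person's IQ on the standard test, it is reliable but not valid.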

Reliability is an essential component of validity but, on its own, is not a sufficient measure of validity. A test can be reliable but not valid, whereas a test cannot be valid yet unreliable. A bathroom scale that always reads 5 kg too heavy is reliable but not valid. A test that is extremely unreliable is essentially not valid either: a bathroom scale that gives your weight as one value one day and as 2 kg the next is unreliable, and because its readings are so inconsistent, it cannot be measuring what it is meant to.

There are several methods to assess the reliability of instruments. In the social sciences and psychology, testing internal reliability is essentially a matter of comparing the instrument with itself. How could you determine whether each item on an inventory is contributing to the final score equally? One technique is the split-half method, which cuts the test into two pieces and compares those pieces with each other. The test can be split in a few ways, for example into the first and second halves of the items, or into the odd-numbered and even-numbered items. Split-half methods can only be done on tests measuring one construct, for example an extroversion subscale on a personality test.

The internal consistency test compares two different versions of the same instrument, to ensure that there is a correlation and that they ultimately measure the same thing. For example, imagine that an examining board wants to test that its new mathematics exam is reliable, and selects a group of test students. For each section of the exam, such as calculus, geometry, algebra and trigonometry, they actually ask two questions, designed to measure the aptitude of the student in that particular area.

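If internal consistency is high, i.e. students' scores on the two questions in each pair agree closely, the paired questions are likely to be measuring the same skill. This differs from the test-retest method, which involves two separate administrations of the same instrument; internal consistency compares two different versions at the same time.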

Researchers may use internal consistency to develop two equivalent tests to later administer to the same group. A statistical formula called Cronbach's alpha tests the reliability by, in effect, comparing all the pairs of questions at once. Luckily, modern computer programs take care of the details, saving researchers from doing the calculations themselves.
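For readers curious about what those programs compute, here is a minimal Python sketch of the usual textbook formula for Cronbach's alpha: it compares the sum of the individual item variances with the variance of respondents' total scores. It assumes the same hypothetical layout as before (one list of item scores per respondent); the function name `cronbach_alpha` and the data are illustrative only.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of respondents' item-score lists."""
    k = len(item_scores[0])                  # number of items
    items = list(zip(*item_scores))          # transpose: one tuple per item
    sum_item_vars = sum(variance(item) for item in items)
    total_var = variance([sum(row) for row in item_scores])
    # alpha = k / (k - 1) * (1 - sum of item variances / variance of total score)
    return k / (k - 1) * (1 - sum_item_vars / total_var)


# Hypothetical answers from five respondents to a six-item, 1-5 scale.
responses = [
    [4, 5, 4, 4, 5, 4],
    [2, 2, 3, 2, 2, 3],
    [3, 4, 3, 3, 4, 3],
    [5, 5, 4, 5, 5, 5],
    [1, 2, 1, 2, 1, 1],
]
print(round(cronbach_alpha(responses), 2))  # 0.98
```

Using sample variances throughout (or population variances throughout) gives the same alpha, since only their ratio matters.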

There are two common ways to establish external reliability: the test-retest method and inter-rater reliability. The test-retest method is the simplest method for testing external reliability, and involves testing the same subjects once and then again at a later date, then measuring the correlation between those results.

One difficulty with this method lies with the time between the tests. This method assumes that nothing has changed in the meantime.

If the tests are administered too close together, participants can easily remember the material and score higher on the second round. But if they are administered too far apart, other variables can enter the picture, such as genuine changes in the participants' knowledge or circumstances. To prevent learning or recency effects, researchers may administer a second test that is different but equivalent to the first.

Anyone who has watched American Idol or a cooking competition will understand the principle of inter-rater reliability: several judges independently score the same performance. An example is clinical psychology role-play examinations, where students are rated on their performance in a mock session. Another example is the grading of a portfolio of photographic work or essays for a competition.

Processes that rely on expert rating of performance or skill are subject to their own kind of error, however. Inter-rater reliability is a measure of the agreement or concordance between two or more raters in their respective appraisals, i.e. the degree to which they give the same performance the same assessment. The principle is simple: if all the judges give a performance roughly the same score, their assessments are reliable. If, however, the judges have wildly different assessments of that performance, their assessments show low reliability.
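One way to put a number on agreement between two raters is Cohen's kappa, which corrects the raw proportion of agreement for the agreement expected by chance. Kappa is not mentioned in the text above; it is swapped in here as a common choice for categorical ratings. This is a minimal sketch with hypothetical pass/fail ratings from two judges; `cohens_kappa` is an illustrative name, and the formula is undefined when chance agreement is exactly 1 (both judges always give the same single category).

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical ratings of the same items."""
    n = len(rater_a)
    # Proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's category frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n**2
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts from two judges on the same ten performances.
judge_a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
judge_b = ["pass", "pass", "fail", "pass", "pass", "pass", "pass", "fail", "pass", "fail"]
print(round(cohens_kappa(judge_a, judge_b), 2))  # 0.52
```

The judges agree on 8 of 10 performances, but because both say "pass" most of the time, much of that agreement could occur by chance, so kappa is noticeably lower than 0.8.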

Importantly, reliability is a characteristic of the ratings, and not the performance being rated. In psychometrics, for example, the constructs being measured first need to be isolated before they can be measured.

If our measure, $X$, is reliable, we should find that if we measure or observe it twice on the same persons, the scores are pretty much the same.

But why would they be the same? If you look at the figure, you should see that the only thing the two observations have in common is their true score, $T$.

How do you know that? Because the error scores $e_1$ and $e_2$ have different subscripts, indicating that they are different values, but the true score symbol $T$ is the same for both observations. This means that the two observed scores, $X_1$ and $X_2$, are related only to the degree that the observations share true score. You should remember that the error score is assumed to be random. Sometimes errors will lead you to perform better on a test than your true ability (e.g. you guessed well on a few items), and at other times they will lead you to perform worse (e.g. you were tired that day). But the true score, your true ability on that measure, would be the same on both observations, assuming, of course, that your true ability didn't change between the two measurement occasions.
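In symbols, the relationship just described is usually written as follows, where $T$ is the true score shared by both observations and $e_1$, $e_2$ are the random errors on the two occasions. This sketch of the standard true score model assumes the errors are uncorrelated with $T$ and with each other.

```latex
X_1 = T + e_1, \qquad X_2 = T + e_2
```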

With this in mind, we can now define reliability more precisely. Reliability is a ratio or fraction. In layperson terms, we might define this ratio as the true level on the measure divided by the entire measure. You might think of reliability as the proportion of "truth" in your measure. Now, we don't speak of the reliability of a measure for an individual; reliability is a characteristic of a measure that's taken across individuals. So, to get closer to a more formal definition, let's restate the definition above in terms of a set of observations. The easiest way to do this is to speak of the variance of the scores.

Remember that the variance is a measure of the spread or distribution of a set of scores. So, we can now state the definition as: reliability is the variance of the true scores divided by the variance of the measure.

We might put this into slightly more technical terms by using the abbreviated name for the variance and our variable names:

$$ \text{reliability} = \frac{\operatorname{var}(T)}{\operatorname{var}(X)} $$

We're getting to the critical part now. If you look at the equation above, you should recognize that we can easily determine or calculate the bottom part of the reliability ratio; it's just the variance of the set of scores we observed. You remember how to calculate the variance, don't you?

It's just the sum of the squared deviations of the scores from their mean, divided by the number of scores. But how do we calculate the variance of the true scores? We can't see the true scores (we only see $X$)! Only God knows the true score for a specific observation. And, if we can't calculate the variance of the true scores, we can't compute our ratio, which means we can't compute reliability! The bottom line is that reliability cannot be computed directly. So where does that leave us? If we can't compute reliability, perhaps the best we can do is to estimate it.
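The variance calculation described above (sum of squared deviations from the mean, divided by the number of scores) can be written in a few lines of Python. The function name `population_variance` and the scores are hypothetical; the standard library's `statistics.pvariance` uses the same divide-by-n definition, whereas `statistics.variance` divides by n - 1 to estimate a population variance from a sample.

```python
from statistics import pvariance


def population_variance(scores):
    """Sum of squared deviations from the mean, divided by the number of scores."""
    mean = sum(scores) / len(scores)
    return sum((x - mean) ** 2 for x in scores) / len(scores)


observed = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observed scores, mean 5
print(population_variance(observed))                         # 4.0
print(population_variance(observed) == pvariance(observed))  # True
```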

Maybe we can get an estimate of the variability of the true scores. How do we do that? Remember our two observations, $X_1$ and $X_2$? We assume, using true score theory, that these two observations would be related to each other to the degree that they share true scores. So, let's calculate the correlation between $X_1$ and $X_2$.

Here's a simple formula for the correlation:

$$ r_{X_1 X_2} = \frac{\operatorname{cov}(X_1, X_2)}{\operatorname{sd}(X_1)\,\operatorname{sd}(X_2)} $$

If we look carefully at this equation, we can see that the covariance, which simply measures the "shared" variance between measures, must be an indicator of the variability of the true scores, because the true scores in $X_1$ and $X_2$ are the only thing the two observations share!

So, the top part is essentially an estimate of $\operatorname{var}(T)$ in this context. And, since the bottom part of the equation multiplies the standard deviation of one observation with the standard deviation of the same measure at another time, we would expect these two values to be the same (it is the same measure we're taking), so their product is essentially the same thing as squaring the standard deviation for either observation.

But the square of the standard deviation is the same thing as the variance of the measure. So, the bottom part of the equation becomes the variance of the measure, or $\operatorname{var}(X)$.
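The argument of the last few paragraphs can be condensed into a short derivation. It is a sketch under the usual true score assumptions: $X_1 = T + e_1$ and $X_2 = T + e_2$, with random errors uncorrelated with $T$ and with each other, and both observations having the same standard deviation $\operatorname{sd}(X)$.

```latex
\begin{aligned}
\operatorname{cov}(X_1, X_2) &= \operatorname{cov}(T + e_1,\; T + e_2) = \operatorname{var}(T) \\
\operatorname{sd}(X_1)\,\operatorname{sd}(X_2) &= \operatorname{sd}(X)^2 = \operatorname{var}(X) \\
r_{X_1 X_2} &= \frac{\operatorname{cov}(X_1, X_2)}{\operatorname{sd}(X_1)\,\operatorname{sd}(X_2)} = \frac{\operatorname{var}(T)}{\operatorname{var}(X)}
\end{aligned}
```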

If you read the last few paragraphs carefully, you should see that the correlation between two observations of the same measure is an estimate of reliability.




Reliability is a necessary ingredient for determining the overall validity of a scientific experiment and enhancing the strength of the results. Debate between social and pure scientists concerning reliability is robust and ongoing.


Reliability in research

Reliability, like validity, is a way of assessing the quality of the measurement procedure used to collect data in a dissertation. In order for the results from a study to be considered valid, the measurement procedure must first be reliable.


Test-retest reliability is a measure of reliability obtained by administering the same test twice over a period of time to a group of individuals. The scores from Time 1 and Time 2 can then be correlated in order to evaluate the test for stability over time.

Reliability and validity are commonly used in quantitative research, and they are now being reconsidered in the qualitative research paradigm. Since reliability and validity are rooted in a positivist perspective, they should be redefined for use in a naturalistic approach.


Internal validity: the instruments or procedures used in the research measured what they were supposed to measure. Example: As part of a stress experiment, people are shown photos of war atrocities.

If findings from research are replicated consistently, they are reliable. A correlation coefficient can be used to assess the degree of reliability. If a test is reliable, it should show a high positive correlation.

Saul McLeod