Practical Guidelines to Develop and Evaluate a Questionnaire

Address for correspondence: Dr. Dipankar De, Additional Professor, Department of Dermatology, Post Graduate Institute of Medical Education and Research (PGIMER), Chandigarh, India. E-mail: dr_dipankar_de@yahoo.in

Received 2020 Aug 21; Revised 2020 Dec 11; Accepted 2021 Jan 25. Copyright: © 2021 Indian Dermatology Online Journal

This is an open access journal, and articles are distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 License, which allows others to remix, tweak, and build upon the work non-commercially, as long as appropriate credit is given and the new creations are licensed under the identical terms.

Abstract

Life expectancy is gradually increasing due to continuously improving medical and nonmedical interventions. This increase is desirable but brings with it issues such as impairment of quality of life, disease perception, cognitive health, and mental health. Thus, questionnaire building and data collection through questionnaires have become an active area of research. However, questionnaire development can be challenging and suboptimal in the absence of careful planning and a user-friendly literature guide. Keeping in mind the intricacies of constructing a questionnaire, researchers need to carefully plan, document, and follow systematic steps to build a reliable and valid questionnaire. Additionally, questionnaire development is technical, jargon-filled, and not part of most graduate and postgraduate training. Therefore, this article attempts to initiate an understanding of questionnaire fundamentals, technical challenges, and the sequential flow of steps required to build a reliable and valid questionnaire.

Keywords: Instrument, psychometrics, questionnaire development, reliability, scale construction, validity

Introduction

There has been an increase in the use of questionnaires to understand and measure patients' perception of medical and nonmedical care. Recently, with increased interest in the quality of life associated with chronic diseases, there has been a surge in the usage and types of questionnaires. Questionnaires are also known as scales and instruments. Their significant advantage is that they capture information about unobservable characteristics such as attitude, belief, intention, or behavior. Multiple items measuring specific domains of interest are required to obtain hidden (latent) information from participants. However, the questions or items need to be validated and evaluated both individually and holistically.

Item formulation is an integral part of scale construction. The literature describes many approaches for framing items, such as the Thurstone, Rasch, Guttman, and Likert methods. The Thurstone scale is labor intensive, time-consuming, and practically no better than the Likert scale.[1] In the Guttman method, cumulative attributes of the respondents are measured with a group of items framed from the "easiest" to the "most difficult." For example, for a stem, a participant may have to choose from the options (a) stand, (b) walk, (c) jog, and (d) run. It requires a strict ordering of items. The Rasch method adds a stochastic component to the Guttman method, which laid the foundation of the modern and powerful item response theory for scale construction. All the approaches have their fair share of advantages and disadvantages. However, Likert scales based on classical test theory are widely established and preferred by researchers to capture intrinsic characteristics. Therefore, in this article, we will discuss only the psychometric properties required to build a Likert scale.

A hallmark of scientific research is that it needs to meet rigorous scientific standards. A questionnaire evaluates characteristics whose value can change significantly with time, place, and person. Error variance, along with systematic variation, plays a significant part in ascertaining unobservable characteristics. Therefore, it is critical to rigorously evaluate instruments testing human traits. In the context of questionnaire development and validation, such evaluations are known as psychometric evaluations. Scientific standards are available for selecting items, subscales, and entire scales. Researchers can broadly segment the scientific criteria for a questionnaire into reliability and validity.

Despite their increasing usage, many academicians grossly misunderstand scales. A further complication is that many authors in the past did not adhere to rigorous standards. Thus, questionnaire-based research was criticized by many in the past as a soft science.[2] Scale construction is also not part of most graduate and postgraduate training. Given the previous discussion, the primary objective of this article is to sensitize researchers to the various intricacies and the importance of each step in scale construction. The emphasis is also on making researchers aware of, and motivated to use, multiple metrics to assess psychometric properties. Table 1 describes a glossary of essential terminologies used in the context of questionnaires.

Table 1

Glossary of important terms used in context to psychometric scale

Psychometrics: A science that deals with the quantitative assessment of abilities that are not directly observable, e.g., confidence, intelligence
Reliability: Refers to the degree of consistency of an instrument in its measurements, e.g., is a weighing machine giving similar results under consistent conditions?
Validity: Refers to the ability of an instrument to represent the intended measure correctly, e.g., is a weighing machine giving accurate results?
Likert scale: A psychometric scale consisting of multiple items arrived at through a systematic evaluation of reliability and validity, e.g., a quality-of-life score
Likert item: A statement with a fixed set of choices to express an opinion with a level of agreement or disagreement
Latent variable: Represents a concept or underlying construct that cannot be measured directly. Latent variables are also known as unobserved variables, e.g., health and socioeconomic status
Manifest variable: A variable that can be measured directly. Manifest variables are also known as observed variables, e.g., blood pressure and income
Double-barreled item: A question addressing two or more separate issues but providing the option for only one answer, e.g., do you like the house and locality?
Negative item: An item which is in the opposite direction from most of the questions on a scale
Factor loadings: The correlation coefficients between the observed variables and a factor. They quantify the strength of the relationship between a latent variable (factor) and the manifest variables and are key to understanding the relative importance of items in the final questionnaire. An item with a high factor loading is more important than the others
Cross-loading: An observed variable with a loading above the threshold value on two or more factors, e.g., education level with a value >0.35 for both the teaching and research domains. Items with cross-loadings are candidates for deletion from a questionnaire
Reverse scoring: The practice of reversing the score to cancel positive and negative loadings on the same factor, e.g., changing the maximum rating (such as strongly agree=5) to the minimum (such as strongly agree=1) or vice versa
Floor and ceiling effect: The inability of a scale to discriminate between participants in a study because a high proportion of participants obtain the worst/minimum or best/maximum score, e.g., more than 80% of responses fall on a single option among the five options of a Likert item. Such an item discriminates poorly between participants and is a candidate for deletion
Eigenvalue: An indicator of the amount of variance explained by a factor. The factor with the highest eigenvalue explains the maximum amount of variance, which practically makes it the most important factor. The eigenvalue is obtained as the column sum of squares of the factor loadings
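The reverse-scoring convention defined in the glossary amounts to a one-line transformation. The sketch below is a hypothetical illustration of that rule for a 5-point Likert item; the function name is an assumption, not from the article:

```python
# Reverse scoring for a 1..max_rating Likert item:
# new = (max_rating + 1) - old, so strongly agree (5) becomes 1 and vice versa.
def reverse_score(rating, max_rating=5):
    """Reverse a Likert rating on a 1..max_rating scale."""
    return (max_rating + 1) - rating

ratings = [1, 2, 3, 4, 5]
print([reverse_score(r) for r in ratings])  # [5, 4, 3, 2, 1]
```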

The process of building a questionnaire starts with item generation, followed by questionnaire development, and concludes with rigorous scientific evaluation. Figure 1 summarizes the systematic steps and the respective tasks at each stage of building a good questionnaire. There are specific essential requirements that are not directly a part of scale development and evaluation but that improve the utility of the instrument. These indirect but necessary conditions are documented and discussed under the miscellaneous category. We broadly segment and discuss the questionnaire development process under three domains: questionnaire development, questionnaire evaluation, and miscellaneous properties.

Figure 1. Flowchart demonstrating the various steps involved in the development of a questionnaire

Questionnaire Development

The development of a list of items is an essential and mandatory prerequisite for a good questionnaire. At this stage, the researcher decides which format, such as Guttman, Rasch, or Likert, to use for framing the items.[2] Further, the researcher carefully identifies appropriate members of the expert panel for face and content validity. Broadly, there are six steps in scale development.

Step I

It is crucial to select appropriate questions (items) to capture the latent trait. An exhaustive list of items is the most critical and primary requisite for laying the foundation of a good questionnaire. It needs considerable work in terms of literature search, qualitative study, discussion with colleagues, other experts, and general and targeted responders, and review of other questionnaires in and around the area of interest. General and targeted participants can also advise on items, wording, and the smoothness of the questionnaire, as they will be the potential responders.

Step II

It is crucial to arrange and reword the pool of questions to eliminate ambiguity, technical jargon, and loading. Further, one should avoid using double-barreled, long, and negatively worded questions. Arrange all items systematically to form a preliminary draft of the questionnaire. After generating the initial draft, review the instrument for flow of items, face validity, and content validity before sending it to experts. The researcher needs to assess whether the items in the scale are comprehensive (content validity) and appear to measure what they are supposed to measure (face validity). For example, does a scale intended to measure stress actually measure stress, or does it measure depression instead? There is no uniformity on the selection of a panel of experts; however, a general agreement is to use a minimum of 5-15 experts in a group.[3] These experts will ascertain the face and content validity of the questionnaire, which are subjective and objective measures of validity, respectively.

Step III

It is advisable to prepare an appealing, jargon-free, and nontechnical cover letter explaining the purpose and description of the instrument. Further, it is better to include the reason(s) for selecting the expert, the scoring format, and explanations of the response categories of the scale. It is advantageous to speak with experts telephonically, face to face, or electronically to request their participation before mailing the questionnaire. It is good to explain to them right at the beginning that this process unfolds over phases. The time allowed to respond can vary from hours to weeks; it is recommended to give at least 7 days. A nonresponse needs to be followed up by a reminder email or call. Usually, this stage takes two to three rounds. Therefore, it is essential to engage with experts regularly; otherwise, there is a risk of nonresponse. Table 2 gives general advice for preparing a cover letter, which researchers can modify appropriately for their studies. Authors can consult Rubio and coauthors for more details regarding the drafting of a cover letter.[4]

Table 2

General overview and the instructions for rating in the cover letter to be accompanied by the questionnaire

Construct: Definition of the characteristics of the measurement
Purpose: To evaluate the content and face validity
How: Please rate each item for its representativeness and clarity on a scale from 1 to 4. Evaluate the comprehensiveness of the entire instrument in measuring the domain. Please add, delete, or modify any item as per your understanding.
Measures: CVR (from ratings of importance) and CVI (from ratings of representativeness and clarity)
Scoring of importance (for CVR): 0=Not necessary; 1=Useful; 2=Essential
Scoring of representativeness (for CVI): 1=Not representative; 2=Needs major revisions to be representative; 3=Needs minor revisions to be representative; 4=Representative
Scoring of clarity (for CVI): 1=Not clear; 2=Needs major revisions to be clear; 3=Needs minor revisions to be clear; 4=Clear
Formulae (N=total number of experts):
CVR=(NE - N/2)/(N/2), where NE=number of experts rating an item as essential
CVIR=NR/N, where CVIR=CVI for representativeness and NR=number of experts rating an item as representative (3 or 4)
CVIC=NC/N, where CVIC=CVI for clarity and NC=number of experts rating an item as clear (3 or 4)

Step IV

The responses from each round will help in rewording, rephrasing, and reordering the items in the scale. A few questions may need deletion in the different rounds of the previous steps. Therefore, it is better to evaluate the content validity ratio (CVR), content validity index (CVI), and interrater agreement before deleting any question from the instrument. Readers can consult the formulae in Table 2 for calculating CVR and CVI. CVR is calculated and reported for the overall scale, whereas CVI is computed for each item. Researchers need to consult the Lawshe table to determine the cutoff value for CVR, as it depends on the number of experts in the panel.[5] A CVI >0.80 is recommended. Researchers interested in details regarding CVR and CVI can read the excellent articles by Zamanzadeh et al. and Rubio et al.[4,6] It is crucial to compute CVR, CVI, and kappa agreement for each item from the experts' ratings of importance, representativeness, and clarity. CVR and CVI do not account for chance. Since interrater agreement (IRA) incorporates chance, it is better to report CVR, CVI, and IRA measures together.
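The CVR and CVI formulae in Table 2 are straightforward to compute. The sketch below assumes a hypothetical panel of eight experts rating a single item; the function names and rating values are illustrative, not from the article:

```python
def content_validity_ratio(essential_count, n_experts):
    """CVR = (NE - N/2) / (N/2), where NE = experts rating the item 'essential'."""
    return (essential_count - n_experts / 2) / (n_experts / 2)

def content_validity_index(favourable_count, n_experts):
    """CVI = proportion of experts rating the item 3 or 4 (representative/clear)."""
    return favourable_count / n_experts

# Hypothetical ratings from 8 experts for one item.
importance = [2, 2, 2, 1, 2, 2, 0, 2]          # 0=not necessary, 1=useful, 2=essential
representativeness = [4, 3, 4, 2, 4, 3, 3, 4]  # 1..4 scale from Table 2

n = len(importance)
ne = sum(1 for r in importance if r == 2)        # experts rating 'essential'
nr = sum(1 for r in representativeness if r >= 3)  # experts rating 3 or 4

print(round(content_validity_ratio(ne, n), 2))   # 0.5
print(round(content_validity_index(nr, n), 2))   # 0.88
```

Whether a CVR of 0.5 is acceptable depends on the panel size; the Lawshe table cited in the text gives the cutoff for each N.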

Step V

Researchers need to address some subtle issues before administering a questionnaire to responders for pilot testing. The introduction and format of the scale play a crucial role in mitigating doubts and maximizing response. The front page of the questionnaire provides an overview of the research without using technical words. Further, it includes the roles and responsibilities of the participants, contact details of the researchers, research ethics (such as voluntary participation, confidentiality and withdrawal, risks and benefits), and informed consent for participation in the study. It is also better to incorporate anchors (levels of the Likert item) at the top or bottom of each page, or both, for ease of response. Readers can refer to Table 3 for details.

Table 3

A random set of questions with anchors at the top and bottom row

Each item is rated on a five-point scale: Strongly disagree (SD), Disagree (D), Neutral (N), Agree (A), Strongly agree (SA). The anchors appear in both the top and bottom rows of the table.

Items (each rated SD / D / N / A / SA):
- Duration of disease (since onset)
- Number of relapse(s) of the disease
- Duration of oral erosions (present episode)
- Number of relapse(s) of oral lesions
- Persistence of oral lesions after subsidence of cutaneous lesions
- Change in size of existing lesion in last 1 week
- Development of new lesions in last 1 week
- Difficulty in eating normal food
- Difficulty in eating food according to their consistency
- Inability to eat spicy food
- Inability to drink fruit juices
- Excessive salivation/drooling
- Difficulty in speaking
- Difficulty in brushing teeth
- Difficulty in swallowing
- Restricted mouth opening

Step VI

Pilot testing of an instrument in the target population is an important and essential requirement before testing it on a large sample of individuals. It helps in the elimination or revision of poorly worded items. At this stage, it is better to use floor and ceiling effects to eliminate poorly discriminating items. Further, random interviews of 5-10 participants can help mitigate problems such as difficulty, relevance, confusion, and order of the questions before testing on the study population. The general recommendation is to recruit a sample of between 30 and 100 participants for pilot testing.[4] Inter-question (item) correlation (IQC) and Cronbach's α can be assessed at this stage. Items with IQC below 0.3, or a scale with Cronbach's α (reliability) below 0.7, are suspect, and such items are candidates for elimination from the questionnaire. Cronbach's α, a measure of the internal consistency of a scale, together with IQC, indicates to the researcher the quality of the items in measuring the latent attribute at this initial stage. This process is important to refine and finalize the questionnaire before starting to test it on the study participants.
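The two pilot-stage checks above can be sketched in a few lines. Cronbach's α is computed with the standard formula α = k/(k-1) × (1 - Σ item variances / total-score variance), and the floor/ceiling check uses the >80% single-option rule from the glossary. The data and function names are hypothetical:

```python
from statistics import pvariance

def cronbach_alpha(items):
    """Cronbach's alpha for a list of item-score columns (one list per item)."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # total score per respondent
    item_var = sum(pvariance(col) for col in items)
    return (k / (k - 1)) * (1 - item_var / pvariance(totals))

def floor_ceiling(scores, levels=5, threshold=0.8):
    """Flag an item if any single response option captures > threshold of responses."""
    n = len(scores)
    return any(scores.count(level) / n > threshold for level in range(1, levels + 1))

# Hypothetical pilot data: 3 items (rows), 6 respondents (columns).
items = [
    [4, 5, 3, 4, 5, 4],
    [3, 4, 3, 3, 5, 4],
    [4, 4, 2, 3, 5, 3],
]
print(round(cronbach_alpha(items), 2))
print(floor_ceiling([5, 5, 5, 5, 5, 4]))  # True: one option dominates (5/6 > 0.8)
```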

Questionnaire Evaluation

The preliminary items and the questionnaire have, up to this stage, addressed issues of reliability, validity, and overall appeal in the target population. However, researchers need to rigorously evaluate the psychometric properties of the preliminary instrument before finally adopting it. The first step in this process is to calculate the appropriate sample size for administering the preliminary questionnaire in the target group. The evaluation of the various measures does not follow a sequential order as in the previous stage. Nevertheless, these measures are critical for evaluating the reliability and validity of the questionnaire.

Data entry

Correct data entry is the first requirement for evaluating the characteristics of a manually administered questionnaire. The primary need is to enter the data into an appropriate spreadsheet. Subsequently, clean the data for cosmetic and logical errors. Finally, prepare a master sheet and a data dictionary, for analysis and for reference to coding, respectively. Authors interested in more detail can read the "Biostatistics Series."[7,8] The data entry process for a questionnaire is like that of other cross-sectional study designs: the rows and columns represent participants and variables, respectively. It is better to enter the set of items by item number. First, it is tedious and time-consuming to find suitable variable names for many questions. Second, item numbers help in quickly identifying the significantly contributing and noncontributing items of the scale during the assessment of psychometric properties. Readers can see Table 4 for more detail.

Table 4

A sample of data entry format

(a) Illustration of master sheet
Participant  Age  Religion  Family  Height  Weight  Q1  Q2  Q3
1            25   1         1       185.0   85.0    1   5   2
2            26   3         1       155.0   63.0    2   5   1
3            22   2         2       155.0   57.0    4   2   1
4            35   2         1       158.5   67.5    3   2   2
5            49   1         2       175.0   64.0    2   4   3
6            40   4         1       159.0   78.0    2   4   3
Qi: ith question in the questionnaire, where i=1, 2, 3, … n
(b) Illustration of coding sheet
Variable label  Description                            Coding and valid range                                                  Measurement scale
Participant     A random serial number to participant  None                                                                    String
Age             Age in years                           None (30-70 years)                                                      Interval
Religion        Religion of the participant            1=Hindu; 2=Sikh; 3=Muslim; 4=Others                                     Nominal
Q               Level of agreement in the question     1=Strongly disagree; 2=Disagree; 3=Neutral; 4=Agree; 5=Strongly agree   Ordinal

Descriptive statistics

Spreadsheets are easy and flexible for routine data entry and cleaning. However, they lack the features needed for advanced statistical analysis. Therefore, the master sheet needs to be exported to appropriate statistical software. Descriptive analysis is the usual first step, which helps in understanding the fundamental characteristics of the data. Thus, report appropriate descriptive measures: mean and standard deviation for continuous symmetric data, and median and interquartile/interdecile range for asymmetric data.[9] Utilize exploratory tabular and graphical displays to inspect the distribution of the various items in the questionnaire. A stacked bar chart is a handy tool for investigating the distribution of data graphically. Further, ascertain linearity and the absence of extreme multicollinearity at this stage. Any IQC value >0.7 warrants further inspection of the items involved for deletion or modification. Help from a good biostatistician is of great value for data analysis and reporting.
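The multicollinearity screen described above can be run directly on the inter-item correlation matrix. A minimal sketch with hypothetical responses (NumPy is assumed to be available; the 0.7 cutoff is the one given in the text):

```python
import numpy as np

# Hypothetical responses: rows = participants, columns = items Q1..Q4.
scores = np.array([
    [4, 4, 2, 5],
    [5, 5, 3, 4],
    [3, 3, 2, 2],
    [4, 5, 4, 3],
    [2, 2, 1, 4],
    [5, 4, 3, 5],
])

corr = np.corrcoef(scores, rowvar=False)  # inter-question (item) correlation matrix

# Flag item pairs with IQC > 0.7, which warrant inspection for redundancy.
k = corr.shape[0]
flagged = [(i + 1, j + 1, round(corr[i, j], 2))
           for i in range(k) for j in range(i + 1, k)
           if corr[i, j] > 0.7]
print(flagged)
```

Each flagged pair is a candidate for merging or deleting one of the two items, since near-duplicate items add length without adding information.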

Missing data analysis

Missing data are the rule, not the exception, and most researchers encounter missing values in their data. There are usually three approaches to analyzing incomplete data. The first approach is to "take all," which uses all the available data for analysis. In the second approach, the analyst deletes the participants or variables with gross missingness, or both, from the analysis. The third approach consists of estimating the percentage and type of missingness. The typically recommended threshold for missingness is 5%.[10] There are broadly three types of missingness: missing completely at random, missing at random, and not missing at random. After identifying the missing-data mechanism, impute the data with single or multiple imputation approaches. Readers can refer to an excellent article by Graham for more details about missing data.[11]
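Estimating the percentage of missingness per item, the third approach above, can be sketched as follows. The data are hypothetical, with NaN marking missing entries, and the 5% threshold is the one cited in the text:

```python
import numpy as np

# Hypothetical item responses: rows = participants, columns = items; NaN = missing.
data = np.array([
    [4.0, 3.0, np.nan],
    [5.0, np.nan, 2.0],
    [3.0, 4.0, 3.0],
    [np.nan, 4.0, 4.0],
    [4.0, 5.0, 3.0],
])

# Percentage of missing values per item (column).
missing_pct = np.isnan(data).mean(axis=0) * 100
for item, pct in enumerate(missing_pct, start=1):
    if pct > 5:  # above the recommended 5% threshold
        print(f"Q{item}: {pct:.0f}% missing - examine the missingness mechanism")
```

Only after classifying the mechanism (MCAR, MAR, or NMAR) should single or multiple imputation be applied.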

Sample size

An optimum sample size is a vital requisite for building a good questionnaire, and there are many guidelines in the literature regarding recruiting an appropriate sample. The literature broadly segments sample size approaches into three domains: subject-to-variables ratio (SVR), minimum sample size, and factor loadings (FL). Factor analysis (FA) is a crucial component of questionnaire design; therefore, recent recommendations are to use FLs to determine the sample size. Readers can consult Table 5 for sample size recommendations under the various domains, and Beavers and colleagues for more detail.[12] The stability of the factors is essential in determining the sample size. Therefore, data analysis validates the sample size after data collection. The Kaiser-Meyer-Olkin (KMO) criterion, which tests the adequacy of the sample size, is available in the majority of statistical software packages. A higher value of KMO indicates a sufficient sample size for a stable factor solution.

Table 5

Sample size recommendations in the literature

Sample size criteria

Subject to variables ratio (SVR):
- Minimum 100 participants and SVR ≥5
- 51 participants + number of variables
- At least SVR >5

Minimum sample size:
- At least 300 participants
- At least 200 participants
- At least 150-300 participants

Factor loading (FL):
- At least 4 items with FL >0.60 (minimum 100 participants)
- At least 10 items with FL >0.40 (minimum 150 participants)
- Items with 0.30 ≤ FL ≤ 0.40 (minimum 300 participants)
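Although statistical packages report KMO directly, it can also be computed from the correlation matrix and its inverse (which yields the anti-image partial correlations). This is a minimal sketch under the standard KMO definition, not a procedure from the article; the simulated single-factor data are an assumption for illustration:

```python
import numpy as np

def kmo(data):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy (rows = subjects)."""
    corr = np.corrcoef(data, rowvar=False)
    inv = np.linalg.inv(corr)
    # Partial (anti-image) correlations from the inverse correlation matrix.
    d = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / d
    mask = ~np.eye(corr.shape[0], dtype=bool)  # off-diagonal entries only
    r2 = (corr[mask] ** 2).sum()
    p2 = (partial[mask] ** 2).sum()
    return r2 / (r2 + p2)

# Simulated data: 200 subjects, 4 items driven by one common factor plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
items = latent + 0.5 * rng.normal(size=(200, 4))
print(round(kmo(items), 2))  # high for this strongly one-factor structure
```

KMO ranges from 0 to 1; values near 1 indicate that the observed correlations are not offset by large partial correlations, i.e., the sample supports a stable factor solution.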