Quantitative Data Management Chapter 18

Nursing Research

Florida National University

Commencing Quantitative Data Management

As soon as you begin collecting data, it is time to consider what to do with it as it comes in. You could just allow the data to accumulate and deal with it later, but there are several reasons not to do this. The first reason is probably obvious: if you do not begin to manage the data right away, you will have a huge job awaiting you later.

Tappen, R. M. (2015). Advanced Nursing Research: From Theory to Practice (2nd ed.). Jones & Bartlett Learning.

Brief Review of Quantitative Research

Before we go further, let us make sure that we understand quantitative research. What is Quantitative Research/Methodology?

Quantitative methodology is the dominant research framework in the social sciences. It refers to a set of strategies, techniques and assumptions used to study psychological, social and economic processes through the exploration of numeric patterns.

Quantitative research gathers a range of numeric data. Some of the numeric data is intrinsically quantitative (e.g., personal income), while in other cases the numeric structure is imposed (e.g., “On a scale from 1 to 10, how depressed did you feel last week?”).

The collection of quantitative information allows researchers to conduct simple to extremely sophisticated statistical analyses that aggregate the data (e.g., averages, percentages), show relationships among the data (e.g., “Students with lower grade point averages tend to score lower on a depression scale”), or compare across aggregated data (e.g., the USA has a higher gross domestic product than Spain).

Quantitative research includes methodologies such as questionnaires, structured observations or experiments and stands in contrast to qualitative research.

Qualitative research involves the collection and analysis of narratives and/or open-ended observations through methodologies such as interviews, focus groups or ethnographies.

Coghlan, D., & Brydon-Miller, M. (2014). The SAGE encyclopedia of action research (Vols. 1-2). London: SAGE Publications Ltd. doi:10.4135/9781446294406

Watch For The Following Issues As They Relate To Quantitative Data Management

Signatures on consents are not witnessed; copies of the consent have not been given to every participant.

Duplicate identification (ID) numbers have been inadvertently assigned to participants. (This can happen, for example, if you use the last four digits of the Social Security number, which has been a common but not recommended practice; an automated check for this is sketched after this list.)

Rating scales are scored incorrectly. This is most likely to happen with scales that have complicated scoring rules.

Scale scores are not correctly totaled (mathematical errors).

The wrong version of a test was used—the short form of the CES-D (a depression scale) or STAI (an anxiety scale), for example, instead of the long form.

A page is missing from the test packet, so items are missing.

An important variable such as age or gender has been left out by mistake.

Items are missed or left blank.

Responses to open-ended questions are difficult to read or too abbreviated to be useful.
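Several of these problems can be caught early with a short automated check. Below is a minimal sketch in Python with pandas, assuming a hypothetical file study_data.csv with columns id, age, gender, a CES-D total (cesd_total), and items cesd_1 through cesd_20; none of these names comes from the text.

```python
import pandas as pd

data = pd.read_csv("study_data.csv")  # hypothetical file name

# Duplicate ID numbers inadvertently assigned to participants
dupes = data.loc[data["id"].duplicated(keep=False), "id"].unique()
if len(dupes):
    print("Duplicate IDs:", dupes)

# Important variables left out, or items missed/left blank
for col in ["age", "gender"]:
    print(col, "missing:", data[col].isna().sum())

cesd_items = [f"cesd_{i}" for i in range(1, 21)]
print(data[cesd_items].isna().sum())  # blanks per item

# Scale scores not correctly totaled: recompute and compare
recomputed = data[cesd_items].sum(axis=1)
print(data.loc[recomputed != data["cesd_total"], "id"])  # mismatched totals
```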


Most of these errors are more likely to occur if you have someone else collecting or entering data for you or if participants are completing the forms themselves either on paper or electronically.

A second reason: failure to correct these problems as quickly as possible may lead to serious difficulties later, particularly when you try to analyze the results of the study.

A third reason to begin data management as soon as data collection begins is to make it possible to conduct preliminary analyses of the results at several points during the study. If the study is funded, the funder may require interim reports including some preliminary results. Also, if your study involves an intervention, the preliminary analysis may indicate some concerns about the intervention that would otherwise not be known until the end of the study. A series of periodic preliminary or interim analyses are essential if the intervention requires a data and safety monitoring plan (Cook & DeMets, 2008).

Finally, if you are using multiple data collectors, you will also need to check interrater reliability periodically. Even if you have only one data collector, you still need to be alert to the possibility of rater drift over time, which can be detected through preliminary analysis.

Managing Data & Set Up a Tracking System

Managing Data

The more data you collect, the more important it is that you manage them well. In this section, a number of tasks related to data management are described. Some are one-time actions; others are done continuously, throughout the data collection phase of your study. Some may seem a little tedious, such as maintaining a tracking system, but keeping your data organized prevents many problems.

Set Up a Tracking System

As soon as you enroll your first participant, you should set up a paper or electronic tracking system. This allows you to monitor the success of your recruitment strategies, your randomization to treatment groups, characteristics of the sample that might be critical (for example, have you been able to recruit as many males as females?), and whether you have been able to keep up with your time frame for the study.
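A tracking system can be as simple as one table with a row per participant. Below is a minimal sketch with pandas, using hypothetical column names; a spreadsheet with the same columns works equally well.

```python
import pandas as pd

tracking = pd.DataFrame(columns=[
    "id", "date_enrolled", "recruitment_source",
    "group",            # randomized treatment group
    "gender",           # to monitor sample characteristics
    "baseline_done", "followup_done",
])

# Add a row as each participant is enrolled
tracking.loc[len(tracking)] = [101, "2015-03-02", "clinic flyer",
                               "intervention", "F", True, False]

# Monitor recruitment and randomization balance at any time
print(tracking["recruitment_source"].value_counts())
print(tracking.groupby("group")["gender"].value_counts())
```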

Ensure the Security of the Data

It is very likely that you have assured all participants both verbally and in a written consent form that all the information they provide will be kept confidential.

In studies where the information sought is especially sensitive, anonymity may have to be assured. It is your responsibility to ensure that participants’ confidence in these assurances is not violated, that their personal health information is not shared with unauthorized individuals, and that it cannot be inadvertently exposed.

Following are some suggestions for maintaining the security of your data:

Allow access only to those who have had training regarding the protection of human subjects.

Store all paper files in locked file cabinets, preferably within a locked storage area.

Use password protection to guard all electronic data files. Use complex passwords, not simple or obvious ones.

Separate participants’ names from the remaining data. Keep participant names and Social Security numbers (if needed) in a separate, secure file; a minimal sketch of this separation follows this list.

In some large studies, the data files are encrypted and kept on a secure computer that is designated for this purpose only.

Physically protect laptops and computers from unauthorized access as well.

Create backup copies of all electronic data files and store them in a separate, secure location.

If you are working on networked files, talk with the network administrator about your data security needs.
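One common way to implement the separation of names from data is to assign each participant a random study ID and keep the name-to-ID link in its own secured file. A minimal sketch in Python, with hypothetical file and function names; random IDs also avoid the duplicate-ID problem that arises with Social Security number fragments.

```python
import secrets

def new_study_id(existing_ids):
    """Return a random 6-digit study ID that is not already in use."""
    while True:
        sid = secrets.randbelow(900_000) + 100_000
        if sid not in existing_ids:
            return sid

# The linking file (name -> study ID) lives in a separate, secured
# location; the analysis database contains only the study ID, so the
# remaining data cannot be traced back to a name without that file.
used = {101234, 105678}
print(new_study_id(used))
```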

Develop a Filing System for Your Data & Review Each Packet/Response

Develop a Filing System for Your Data

Whether paper, electronic, or a combination of both, the data you have collected are not only personal and confidential (if not anonymous), but also very valuable.

A well-organized filing system for the original data, whether field notes, audiotapes, videos, or test packets, will prevent misplaced or lost information.

Consents may be filed with test information or separately, but it is necessary to be able to connect them with the data if a problem arises or an audit is conducted, unless you have IRB permission to keep data entirely anonymous.

Review Each Packet/Response

As indicated at the beginning of the chapter, several errors and omissions may occur during data collection. The sooner these are caught and corrected, the better.

For some errors, it may only be possible to make the correction in future data collection.

In other cases, scoring may be corrected, and blanks filled in on collected data.

Create Codes for Open-Ended Questions

This is a task that cannot be completed until a substantial amount of data has been collected. There are several different ways to create these codes.

The first is to anticipate participants’ answers and create the codes (categories) before data collection begins. This can be done if you are relatively certain of the range of answers that will be obtained.

The number of participants who select “other” will be one indication of how successful your prediction of their responses was.

A second approach is to create the codes after the data have been collected.

You can list all of the responses, tabulate any that were used multiple times to reduce the list, and then cluster responses that are similar. Ideally, you want to reduce the categories to a manageable number (no more than five or six) for analysis, but this is not always possible.
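A minimal sketch of this second approach in Python, using hypothetical responses to an open-ended question; the response strings and the code map are illustrative only.

```python
from collections import Counter

responses = ["cost", "no time", "too expensive", "work schedule",
             "cost of tuition", "family obligations"]

# List and tabulate responses that were used multiple times
counts = Counter(r.strip().lower() for r in responses)
print(counts.most_common())

# Cluster similar responses under a manageable set of codes
code_map = {
    "cost": "financial", "too expensive": "financial",
    "cost of tuition": "financial",
    "no time": "time", "work schedule": "time",
    "family obligations": "family",
}
coded = [code_map.get(r.strip().lower(), "other") for r in responses]
print(coded)
```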

SELECTING THE SOFTWARE FOR THE DATABASE

Are your data entirely quantitative?

This includes nominal data, such as gender, marital status, type of residence, or any other item that can be neatly divided into a small number of categories to which numbers can be assigned. Another common type of nominal data is the yes–no questions such as “Have you ever been told by a healthcare provider that you have asthma?”

Are there substantial amounts of open-ended responses that you may want to analyze qualitatively?

You probably want to create a second, qualitative database using qualitative data analysis software if the data are extensive or a word-processing program if they are not.

Will the data analysis be very simple, involving just descriptive data such as means, percentages, totals, and subtotals by various groupings?

You can use very simple software such as Excel, Minitab, or a similar program for analysis. However, most research studies require data analysis that goes beyond this very basic information, in which case you will want to use more sophisticated data analysis software such as SPSS, SAS, STATA, R, EcStatic, SYSTAT, or STATISTIX (Salkind, 2008).

Will you want to present your results graphically?

Sometimes a picture really is worth a thousand words, and a few well-designed graphs, charts, or tables may enliven your written report or presentation.

If this is the case, you will want to select data analysis software that produces attractive graphs, charts, and other visual displays. JMP is an example, but other programs also do this well.

Is there better support available for one software package versus another?


This may be a matter of personal preference.

If you encounter a problem you cannot resolve on your own, is there somewhere or someone you can turn to for help? Quantitative data analysis software is complex.

You do not have to be a novice researcher to encounter problems that you cannot resolve on your own.

Online help guides are not always sufficient; in some cases, they are more difficult to understand than the software itself, although the software producers have made considerable progress in improving their online support.

When selecting a software package for purchase, Salkind (2008) suggests you call the tech support number to see how long you are on hold before someone responds to your call.

What is the cost of the software?

The cost of some quantitative data analysis packages to a private individual is astronomical.

Many universities and other research-intensive organizations purchase multiuser licenses so that an individual may pay a very reasonable amount for access to the software.

In other cases, there are discount prices for students and for faculty doing research or online versions they can access free of charge.

It is worthwhile to inquire about these.

There are also free programs available online, but be sure they will meet your needs before adopting them.

Many are very limited in scope; others will handle only small amounts of data unless you pay an upgrade fee.

Is the software compatible with your system?

As operating systems change, the data analysis software may temporarily be incompatible with the newest version.

Further, the memory capacity (RAM) of your desktop or laptop may need to be increased to accommodate a very large program.

Check these requirements before making your selection.

DATABASE CREATION

Test the Program

Before launching the effort to create a complete database, it is a good idea to practice with the software you have selected.

You can do a trial run on a “mini” database with just a few (3 or 4) variables and a few (fewer than 10) hypothetical responses. Use a name for this database that will make it clear it is just a trial run.

Call it “Mini,” “Trial,” or “Junk,” and remember to delete it when you are done.

Tutorials are available for many software packages.

These are usually worthwhile, but they are not a substitute for trying to set up a mini database on your own.

Once you have mastered creation of a trial database, you are ready to create the real one.

Develop a Codebook

A codebook documents every variable in your database: the variable’s name, a label explaining what it measures, its type (numeric, string, or date), the permissible range of values, the meaning of each code assigned to a response option, and the codes used for missing data.

Prepare the codebook before data entry begins and keep it up to date whenever a variable is added or a code is changed.

A complete codebook keeps data entry consistent, especially when more than one person is entering data, and serves as an essential reference when you (or a statistician) return to the dataset later.
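A codebook can be kept anywhere the team can reach it, even a plain CSV file. A minimal sketch, with hypothetical variables and codes:

```python
import csv

codebook = [
    # name,     label,                          type,      values
    ("id",      "Participant ID",               "numeric", "1001-9999"),
    ("gender",  "Gender",                       "nominal", "1=male, 2=female"),
    ("age",     "Age in years",                 "numeric", "18-90; 999=missing"),
    ("cesd",    "CES-D depression total score", "scale",   "0-60; 99=missing"),
]

with open("codebook.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "label", "type", "values"])
    writer.writerows(codebook)
```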

SPSS Databases (Figure 18-1 in your text provides an excellent example)

1. Use names that tell you what the variable represents (ID, age, gender, and so forth) instead of symbolic names (Var1, Var2 or QuesA, QuesB), which are hard to remember.

2. Define the type of variable you have named, whether it is numeric (a number), string (text), or a date.

3. Label the variable, explaining what the short name stands for (this is an optional step). For example, ID stands for identification number; MMSE stands for Mini-Mental State Examination (Folstein, Folstein, & McHugh, 1975); TNVS stands for the Newest Vital Sign (a test of health literacy).

4. If you anticipate having any missing data, you can also specify how missing values will be designated, such as 88 or 99. If not designated, SPSS will assume the number is real and include it in any analysis (a sketch of this follows this list).

5. Under the Measure column you can specify whether the variable is nominal, ordinal, or scale (interval or ratio).
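To see why step 4 matters, consider a hypothetical age variable in which 99 was keyed in for missing values. A minimal sketch in Python; SPSS behaves analogously when the missing-value code is not declared.

```python
import numpy as np
import pandas as pd

age = pd.Series([34, 29, 99, 41, 99, 38])

print(age.mean())                      # 56.67 -- the 99s distort the mean
print(age.replace(99, np.nan).mean())  # 35.5  -- 99 treated as missing
```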

SAS Databases

The procedure for creating an SAS database is similar to using SPSS.

Using proc (i.e., procedure) FSEDIT, you indicate that a new database is being created and give it a name using SAS conventions for naming files.

Data input is easier if you list the variables in the same order as they appear in the test packet.

Variables are added by entering their name, indicating whether they are numeric (N) or character ($), and indicating the maximum length of the number or letter string for the variable and labels (which are optional).

External Files

SPSS, SAS, and other data analysis programs accept datasets created in other programs.

When you use this approach, it is most important that you use conventions acceptable to the data analysis program.
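As one example of preparing an external dataset, the sketch below (Python with pandas, hypothetical file names) adjusts variable names to a common convention, lowercase with no embedded spaces, before exporting to a neutral format.

```python
import pandas as pd

df = pd.read_excel("collected_data.xlsx")  # hypothetical external file

# Example convention fix: lowercase names with no embedded spaces
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# Save in a plain format that SPSS, SAS, and R can all import
df.to_csv("collected_data.csv", index=False)
```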

Data Input

You are ready now to enter your data. Accuracy is paramount, so try to do this at a time when distractions and interruptions can be kept to a minimum.

There are several ways to check the accuracy of the data input.

Entering the data twice and comparing the databases is an effective but time-consuming strategy.

Random checks of accuracy require less time but pick up the majority of the problems.
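A minimal sketch of the double-entry comparison in Python, assuming the same packets were keyed twice into hypothetical files entry1.csv and entry2.csv with identical columns:

```python
import pandas as pd

a = pd.read_csv("entry1.csv").sort_values("id").reset_index(drop=True)
b = pd.read_csv("entry2.csv").sort_values("id").reset_index(drop=True)

# compare() lists every cell where the two entries disagree
diffs = a.compare(b)
print(diffs if not diffs.empty else "The two entries match.")
```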

Summary

Data management is an interim step between data collection and data analysis.

Proactive data management can help you detect potential problems when they can still be corrected.

It also allows for the preliminary analyses that are often required, either for reports or for data and safety monitoring.

Beginning researchers are likely to need guidance from expert researchers, software tech support, tutorials, and manuals as well as their consultant statistician when setting up all but the simplest databases, but these tasks become much easier with practice.

Basic Quantitative Data Analysis Chapter 19

Nursing Research

Florida National University

DATA CLEANING

Now that all of the data have been entered into the database and the database has been imported into the analysis program (if you were using an external file), it is time to review it once more for any errors that could affect the outcomes of the analysis. The following are a few things to consider:

Look at the data matrix on screen or in the printout.

Recheck the scoring of tests to be sure it was done correctly.

Make sure the correct coding categories were used.

Look for outliers, any number (value) that is outside the expected range for a variable.

While you are going through the previous steps, watch for the frequency of missing data, especially for a pattern in the missing data.
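Checks like these can be scripted so they are easy to rerun as the database grows. A minimal sketch with pandas, assuming a hypothetical study_data.csv with an age variable whose expected range is 18 to 65:

```python
import pandas as pd

data = pd.read_csv("study_data.csv")

# Outliers: values outside the expected range for the variable
print(data[(data["age"] < 18) | (data["age"] > 65)])

# Frequency of missing data for each variable
print(data.isna().sum().sort_values(ascending=False))

# A crude look for patterns: variables whose values tend to be
# missing together have highly correlated missingness indicators
print(data.isna().astype(int).corr().round(2))
```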

Handling missing data at this phase of the study presents some challenges, which are discussed in the next section.

MISSING DATA

Data may be missing at the item level (some items are left unanswered), construct level (all questions about gastrointestinal problems are left blank), or person level (some participants were not retested) (Newman, 2014).

There are a number of reasons why data may still be missing despite efforts to fill in as much as possible during data collection:

The participant may have declined to answer a question or take a particular test.

The participant withdrew, moved, became ill, or died.

An item or test was missed inadvertently.

Participant fatigue, agitation, or another negative response made it necessary to omit all or part of the data collection.

A participant deliberately omitted the test or item because he or she thought it was not applicable or did not understand the question.

Poor directions or poorly worded questions elicited no response or the wrong response.

VISUAL REPRESENTATION

Now that all of the data have been entered into a database, the database has been cleaned, and missing data are filled in or replaced as much as possible, you are ready to initiate the actual analysis of the data.

If your work has been careful and thorough up to now, this phase should go relatively smoothly and more quickly than you might at first imagine. Informally, this first stage of analysis is called “eyeballing” the data.

Many statisticians caution eager researchers not to skip this initial phase of analysis.

Much of what is done here is not included in a final report or published article on your study, but it is a valuable first look at characteristics of the sample (the participants), the outcomes of the study, and the relationship between the two.

This first look may suggest relationships that were not anticipated and may cause you to use some additional analyses not anticipated when the study was designed. In other words, this may be a stage of discovery as well as evaluation.

Actually, you began eyeballing the data during the previous steps directed toward finding errors and dealing with missing data.

Now you will eyeball the data once more to look at the characteristics of the results on each variable and how the variables are related one to another.

Graphics capabilities differ considerably from one data analysis package to another, but most should be able to generate the simple visual representations discussed here.

There are many ways to visually represent data.

We will look at a few basic ones: stem and leaf displays, box plots, bar and pie charts, and scatter plots.

Data from a sample of 150 students at an urban university who were majoring in nursing and psychology will be used for illustrative purposes.
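The sketch below (Python with matplotlib) generates the three chart types discussed next from hypothetical values chosen to resemble, not reproduce, the figures described in the text.

```python
import matplotlib.pyplot as plt

years_in_us = [1, 2, 2, 3, 3, 4, 5, 5, 6, 8, 10, 15, 21, 30]
groups = ["European American", "African American",
          "Hispanic American", "Afro-Caribbean"]
counts = [40, 26, 42, 42]              # hypothetical group sizes (n = 150)
mean_cesd = [14.1, 16.3, 18.9, 14.4]   # hypothetical mean CES-D scores

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].boxplot(years_in_us)                           # box plot
axes[0].set_title("Years in the United States")
axes[1].pie(counts, labels=groups, autopct="%.2f%%")   # pie chart
axes[1].set_title("Ethnic group proportions")
axes[2].bar(groups, mean_cesd)                         # bar chart
axes[2].set_title("Mean CES-D by group")
axes[2].tick_params(axis="x", rotation=45)
plt.tight_layout()
plt.show()
```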

Stem and Leaf:

You may have to use a little imagination to actually see these representations as a stem and leaf, but this is often the first graphic studied. An example can be seen in Figure 19-1.

(Figure 19-1) Stem and leaf and box plot for years in the United States in a sample of college students.

Box Plots/Bar Charts and Pie Charts

Box Plots

A box plot is illustrated in Figure 19-1 (Previous Slide) to the right of the stem and leaf.

The box plot also suggests an imbalance toward the lower end of the range.

The horizontal line across the box indicates the median (middle value) or 50th percentile of this distribution.

The top of the box is the 75th percentile (75% of the cases fall at or below this line), and the bottom is the 25th percentile (25% of the cases fall at or below this line).

Bar Charts and Pie Charts

These charts are especially helpful in visualizing differences between several groups within a sample.

The simplest bar and pie charts show the frequency (total number) within each group.

For example, Figure 19-2 is a very simple pie chart that illustrates the proportion of the college student sample that is European American (26.75%), African American (17.20%), Hispanic American (28.03%), and Afro-Caribbean (28.03%).

It is quickly apparent that the African American group is somewhat smaller than the other three groups, and there are slightly fewer European Americans than there are Hispanic American and Afro-Caribbean students.

(Figure 19-2) Pie chart showing proportion of each ethnic group within the sample.

The bar chart in Figure 19-3 illustrates a slightly more complex relationship between nominal data (ethnic group) and interval data (depression scores): the mean scores on the CES-D depression scale (Radloff, 1977).

In this instance, you can see that the Hispanic American students have higher CES-D scores than the others.

The European American and Afro-Caribbean students have similar mean scores; the African American students fall between them and the Hispanic American students.

(Figure 19-3) Bar chart: Mean depression scores of nursing and psychology students by ethnic group membership.

(Figure 19-5) Plot of age by years in the United States: Hispanic American and Afro-Caribbean college students.

BASIC DESCRIPTIVE STATISTICS

It is time now to begin describing the characteristics of your sample numerically.

There are some basic characteristics that are almost always reported: sample size, age, gender, ethnicity, income, health status, and location (place of residence, type of facility, and so forth).
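A minimal sketch of pulling these basic characteristics with pandas, assuming hypothetical variable names in study_data.csv:

```python
import pandas as pd

data = pd.read_csv("study_data.csv")

print("n =", len(data))                               # sample size
print(data[["age", "income"]].describe())             # mean, SD, range
print(data["gender"].value_counts(normalize=True))    # proportions
print(data["ethnicity"].value_counts())               # counts per group
```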

Basic descriptive statistics are essential to virtually every quantitative data analysis. They are used to (1) continue examination of the distribution of values within the dataset and (2) describe the characteristics of the sample.

Normality

We look more formally now at the distribution of the values or scores in the dataset to evaluate the extent to which the data are normally distributed.

Skew

You may recall from your basic statistics course that a normal distribution of the values of a specific variable is bell-shaped and symmetrical (O’Rourke, Hatcher, & Stepanski, 2005) (see Figure 19-6).


(Figure 19-6) Example of normal curve and standard deviations above and below 0 (the mean). Percentages are the area under the curve, the proportion of subjects that fall within each standard deviation.

If you turn the stem and leaf on its side, you can see the extent to which the shape of the leaf does or does not resemble the bell-shaped curve.

In Figures 19-7 and 19-8, you can see that the distribution of values for years in the United States is not normal: there are too many values in the lower range and too few in the higher range.

This is a positive skew. The distribution of age has even more cases near the lower end of the range and fewer at the high end of the range, also a positive skew.
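Skew can also be checked numerically. A minimal sketch using scipy, with the same hypothetical column names as earlier; a positive value indicates a long tail at the high end, as in Figures 19-7 and 19-8.

```python
import pandas as pd
from scipy.stats import skew

data = pd.read_csv("study_data.csv")  # hypothetical file

print(skew(data["years_in_us"].dropna()))  # > 0: positive skew
print(skew(data["age"].dropna()))
```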



(Figure 19-7) Histogram and normal distribution curve for years in the United States of college student sample.

(Figure 19-8) Histogram and normal distribution curve for age of college student sample.

BIVARIATE ASSOCIATION

The term bivariate refers to relationships between two variables (O’Rourke et al., 2005).

The Pearson product moment correlation coefficient, represented symbolically as r, is the most commonly used bivariate measure of association.

There are several points relevant to analysis of study outcomes that need to be mentioned.

You can create a correlation matrix by entering several variables at one time as a set of variables to be analyzed (see Figure 19-11).

This helps you see the differences in the strengths of relationships between various pairs of variables.
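A minimal sketch of building such a matrix with pandas and scipy, using hypothetical column names; the Spearman and chi-square alternatives discussed at the end of this section are included for comparison.

```python
import pandas as pd
from scipy.stats import pearsonr, spearmanr, chi2_contingency

data = pd.read_csv("study_data.csv")  # assumed complete after cleaning

# Correlation matrix (Pearson r) for a set of variables
print(data[["age", "years_in_us", "cesd", "smas"]].corr().round(2))

# A single Pearson r with its p value
r, p = pearsonr(data["age"], data["years_in_us"])

# Spearman rank-order correlation for ordinal or seriously skewed data
rho, p_s = spearmanr(data["cesd"], data["smas"])

# Chi-square for two nominal variables: observed vs. expected counts
table = pd.crosstab(data["gender"], data["ethnicity"])
chi2, p_c, dof, expected = chi2_contingency(table)
```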

For example, in Figure 19-11, you can see that the relationship of age to years in the United States is r = 0.74, p < 0.0001, a strong and statistically significant (that is, not due to chance) relationship. It is also in the positive direction: as age goes up, years in the United States also rise.

(Figure 19-11) Pearson correlation coefficients: Correlation matrix.

There are also some negative relationships, such as the relationship of depression scores to degree of immersion in the dominant (mainstream American) society as measured by the Stephenson Multigroup Acculturation Scale (SMAS), r = -0.44 (rounded) (Stephenson, 2000). This moderately strong association suggests that depressive symptomatology is lower when acculturation is higher. Such a relationship calls for further investigation: Is this true of everyone in the sample, or of people in just one of the ethnic groups? Is it due to the type of questions asked (i.e., would this relationship hold if other measures of depression and acculturation were used)?

Note also that some of the very weak correlations, such as the number of years in the United States and the depression score, are not significant. (The p value for significance is found under each correlation value.)

ADDITIONAL MEASURES OF ASSOCIATION

There are times when another measure of association is more appropriately used (O’Rourke et al., 2005).

The Spearman rank-order correlation coefficient (rs) should be used, as its name implies, when ordinal (i.e., ranked or ordered) data are analyzed. It may also be used if the data distribution is seriously skewed (i.e., not normally distributed) (O’Rourke et al., 2005).

The chi-square statistic (χ2) is useful when both variables are nominal-level data (i.e., simple counts of people within a category such as gender or ethnic group membership). A frequently used statistic, chi-square compares the observed (actual) frequencies in each cell to the expected frequencies.

SUMMARY

The basic data cleanup, replacement of missing data, visual representations, descriptive statistics, and bivariate association statistics described in this chapter are useful in the analysis of virtually every nursing research study. For some studies, the analysis is complete after accomplishing these basic steps. For other, more complex studies, it is necessary to continue with more advanced analytic techniques.

Whether the analysis is basic or advanced, you will want to have a good statistics textbook, a guide to whatever statistical analysis software you are using, and guidance from an expert researcher and/or statistician to help you complete your quantitative data analysis correctly.