Multiple linear and logistic regression and sensitivity and specificity

Clients seeking care at a weight loss clinic answered a short survey about their experience. The survey collected information on their age, gender, and diet adherence, and also contained a few short screening tools that assessed the minutes they exercised per day, the minutes they were sedentary per day, and their confidence in losing weight. These variables were used to predict weight loss and whether clients would recommend the clinic to a friend.

Part A. Multiple Linear Regression

Task: Evaluate if there is a relationship (predict) between the personal characteristics and their weight loss. Prepare a short description of what was done and what you found.

Conduct a multiple linear regression to predict weight loss using all of the personal characteristic variables (if appropriate). Use the steps provided in module 9 as a guide of how to conduct this analysis and include in your description what you did, such as the following:

a. Define the hypothesis

H0: β1 = β2 = … = βk = 0

There are no significant associations at all. That is, all of the coefficients are zero and none of the variables belong in the model.

None of the personal characteristics will be associated with weight loss.

H1: At least one β is not zero

The alternative hypothesis is not that every variable belongs in the model but that at least one of the variables belongs in the model.

Example: At least one of the independent variables contributes significantly to weight loss.

b. Describe each variable using appropriate descriptive statistics; no need to recode anything but make sure dummy coding is correct

Table 1. Characteristics of selected clients of the weight loss clinic (N=51)

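| Variable | n | % | Range | Mean (SD) |
|---|---|---|---|---|
| Age (years) | 51 | | 19-46 | 26.9 (8.1) |
| Exercise minutes per day | 51 | | 30-50 | 39.6 (5.5) |
| Confidence in success | 51 | | 11-26 | 17.8 (3.4) |
| Sedentary minutes per day | 51 | | 106-124 | 114.4 (4.4) |
| Diet adherence | | | | |
| No | 19 | 37.3 | | |
| Yes | 32 | 62.7 | | |
| Gender | | | | |
| Male | 22 | 43.1 | | |
| Female | 29 | 56.9 | | |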

c. Run the bivariate associations to check for multicollinearity

Table 2. Correlation table (Spearman values shown when nominal variable included)

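| | exercise | confid | sedentary | sex* | diet* | age |
|---|---|---|---|---|---|---|
| exercise | 1 | | | | | |
| confid | -.456 | 1 | | | | |
| sedentary | .837 | -.352 | 1 | | | |
| sex | .005 | -.058 | -.039 | 1 | | |
| diet | -.242 | .447 | -.297 | -.098 | 1 | |
| age | .322 | .163 | .281 | -.028 | .033 | 1 |
| lbslost | -.542 | .515 | -.454 | .139 | .438 | -.06 |

* Nominal variable; Spearman values shown.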

Intellectus statistics

Table 1

Pearson Correlation Results Among age, confid, sedentary, exercise, and lbslost

| Combination | r | 95% CI Lower | 95% CI Upper | p |
|---|---|---|---|---|
| age-confid | 0.16 | -0.12 | 0.42 | .254 |
| age-sedentary | 0.28 | 0.01 | 0.52 | .046 |
| age-exercise | 0.32 | 0.05 | 0.55 | .021 |
| age-lbslost | -0.06 | -0.33 | 0.22 | .674 |
| confid-sedentary | -0.35 | -0.57 | -0.08 | .011 |
| confid-exercise | -0.46 | -0.65 | -0.21 | < .001 |
| confid-lbslost | 0.52 | 0.28 | 0.69 | < .001 |
| sedentary-exercise | 0.84 | 0.73 | 0.90 | < .001 |
| sedentary-lbslost | -0.45 | -0.65 | -0.20 | < .001 |
| exercise-lbslost | -0.54 | -0.71 | -0.31 | < .001 |

Note. The confidence intervals were computed using α = 0.05; n = 51; Holm corrections used to adjust p-values.

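The confidence limits reported for each correlation can be approximately reproduced with the Fisher z transformation. The sketch below assumes that method, since the output does not state how its intervals were computed, and uses the exercise-sedentary correlation of .837 from Table 2 with n = 51. The function name is illustrative.

```python
import math

def pearson_ci(r, n, z_crit=1.959964):
    """Approximate 95% CI for a Pearson r via the Fisher z transformation."""
    z = math.atanh(r)                 # move r onto the z scale
    se = 1.0 / math.sqrt(n - 3)       # standard error on the z scale
    lower = math.tanh(z - z_crit * se)
    upper = math.tanh(z + z_crit * se)
    return lower, upper

lower, upper = pearson_ci(0.837, 51)
print(f"95% CI: {lower:.2f} to {upper:.2f}")
```

The limits agree with the sedentary-exercise row of the correlation results when rounded to two decimals.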
Exercise and sedentary time were highly inter-correlated (r = .84), which indicates multicollinearity, so both should not be entered into the model (see Step 4 in the Munro book, p. 322). Exercise is more strongly related to the dependent variable (lbslost): as seen in Table 2, its correlation is -.542, compared with r = -.454 for sedentary and lbslost. So, of the two variables, exercise was chosen. Age, gender, and diet were also included in the final model. Almost always plan to include some sociodemographic measures in the model, as these are often important confounding variables. In the final regression model, five independent variables (age, gender, diet, exercise, and confidence) were included.
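The selection rule described here can be sketched in a few lines. This is a minimal illustration, not the procedure from the Munro text: the 0.80 cutoff for "highly correlated" is an assumed threshold, and the function and variable names are hypothetical. The correlations are taken from Table 2.

```python
# Correlations with the outcome (lbslost), from Table 2.
r_with_outcome = {"exercise": -0.542, "confid": 0.515, "sedentary": -0.454}

# Correlations between pairs of continuous predictors, from Table 2.
r_between = {
    ("exercise", "confid"): -0.456,
    ("exercise", "sedentary"): 0.837,
    ("confid", "sedentary"): -0.352,
}

def drop_collinear(r_between, r_with_outcome, cutoff=0.80):
    """For each highly correlated pair, drop the predictor that is more
    weakly related to the outcome. The cutoff is an assumption."""
    dropped = set()
    for (a, b), r in r_between.items():
        if abs(r) >= cutoff:
            weaker = a if abs(r_with_outcome[a]) < abs(r_with_outcome[b]) else b
            dropped.add(weaker)
    return dropped

dropped = drop_collinear(r_between, r_with_outcome)
kept = [v for v in r_with_outcome if v not in dropped]
print("dropped:", dropped, "kept:", kept)
```

With these values only the exercise-sedentary pair exceeds the cutoff, and sedentary is dropped because its correlation with lbslost is weaker in absolute value.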
d. Run the full model and check assumptions (it is OK to enter the selected variables all at once)
Intellectus statistic: Results for Linear Regression
Table 2
Results for Linear Regression with diet, sex, age, confid, and exercise predicting lbslost
| Variable | B | SE | 95% CI | β | t | p |
|---|---|---|---|---|---|---|
| (Intercept) | 28.22 | 6.95 | [14.23, 42.21] | 0.00 | 4.06 | < .001 |
| dietYes | 2.67 | 1.27 | [0.10, 5.24] | 0.26 | 2.09 | .042 |
| sexFemale | 1.57 | 1.12 | [-0.69, 3.83] | 0.16 | 1.40 | .167 |
| age | -0.01 | 0.08 | [-0.17, 0.15] | -0.02 | -0.13 | .896 |
| confid | 0.39 | 0.21 | [-0.03, 0.81] | 0.26 | 1.86 | .070 |
| exercise | -0.33 | 0.13 | [-0.59, -0.07] | -0.35 | -2.53 | .015 |

Note. CI is at the 95% confidence level. Results: F(5,45) = 7.53, p < .001, R2 = 0.46
Unstandardized Regression Equation: lbslost = 28.22 + 2.67*dietYes + 1.57*sexFemale - 0.01*age + 0.39*confid - 0.33*exercise
e. Summarize your findings in text.
We evaluated whether there was a relationship between several client characteristics (age, sex, diet adherence, exercise minutes, and confidence in their success) and their weight loss. Our sample consisted of 51 clients, slightly more of them female (56.9%), with a mean age of 26.9 years (Table 1). The mean minutes of exercise and sedentary activity per day were 39.6 and 114.4, respectively. Higher scores on confidence in success meant less confidence (mean = 17.8). Exercise and sedentary time were found to be highly correlated (r = .84), and to avoid multicollinearity only one of these variables was entered into the multiple regression model. Multiple linear regression was used to estimate weight loss from the independent variables. Forty-six percent of the variance in pounds lost was explained by five variables (age, sex, diet adherence, confidence in success, and exercise) (R2 = 0.46, adjusted R2 = .40). Overall, the model was statistically significant in predicting weight loss (F(5, 45) = 7.53, p < .001).
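As a consistency check, R2 and adjusted R2 can be recovered from the F statistic and degrees of freedom in the regression output, F(5, 45) = 7.53 with n = 51. The sketch below uses the standard identities relating F to R2; the variable names are illustrative.

```python
f_stat, df_model, df_resid = 7.53, 5, 45
n = df_model + df_resid + 1   # 51 clients

# R^2 = (F * df_model) / (F * df_model + df_resid)
r2 = (f_stat * df_model) / (f_stat * df_model + df_resid)

# Adjusted R^2 penalizes R^2 for the number of predictors in the model.
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - df_model - 1)

print(f"R2 = {r2:.3f}, adjusted R2 = {adj_r2:.3f}")
```

To two decimals this gives 0.46 and 0.40, which matches the R2 printed with the regression output.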
The estimated equation from the multiple linear regression analysis results is
Y = a + β1(Age) + β2(Sex) + β3(Diet) + β4(Exercise) + β5(Confidence)
lbslost = 28.225 - 0.01(Age) + 1.57(Sex) + 2.67(Diet) - 0.33(Exercise) + 0.39(Confidence)
Note. Sex is coded Male = 0, Female = 1; diet adherence is coded No = 0, Yes = 1.
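To show how the estimated equation is applied, the sketch below plugs values into it for one hypothetical client (a 30-year-old woman who adheres to the diet, exercises 40 minutes per day, and has a confidence score of 18). The client and the function name are made up for illustration; only the coefficients come from the analysis.

```python
# Unstandardized coefficients from the estimated equation.
INTERCEPT = 28.225
B_AGE, B_SEX, B_DIET, B_EXERCISE, B_CONFID = -0.01, 1.57, 2.67, -0.33, 0.39

def predict_lbslost(age, sex, diet, exercise, confid):
    """Predicted pounds lost. sex: 0 = male, 1 = female; diet: 0 = no, 1 = yes."""
    return (INTERCEPT + B_AGE * age + B_SEX * sex + B_DIET * diet
            + B_EXERCISE * exercise + B_CONFID * confid)

# Hypothetical client, not taken from the data set.
pred = predict_lbslost(age=30, sex=1, diet=1, exercise=40, confid=18)

# Same client without diet adherence: the difference equals the diet coefficient.
diff = pred - predict_lbslost(age=30, sex=1, diet=0, exercise=40, confid=18)
print(f"predicted = {pred:.3f} lb, diet effect = {diff:.2f} lb")
```

The second calculation illustrates how a dummy-coded coefficient is read: with everything else held fixed, switching diet from 0 to 1 shifts the prediction by exactly 2.67 pounds.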
We reject the null hypothesis because at least one coefficient was significantly associated with pounds lost. Age, sex, and confidence were not significant in the model (p > .05); however, exercise was significant (t = -2.53, p = .015). For each one-minute increase in exercise per day, pounds lost decreased on average by 0.33 (B = -0.33), holding the other variables constant. Diet adherence was also a significant predictor in this model (p = .042). Those adhering to the diet lost on average 2.67 pounds more than those not adhering to the diet.
If sedentary time had been used instead of exercise in the model:
Forty-one percent of the variance in pounds lost was explained by five variables (age, gender, diet adherence, sedentary time, and confidence) (R2 = 0.41, adjusted R2 = .35). Overall, the model was statistically significant in predicting weight loss (F = 6.39, p ≤ 0.001). We reject the null hypothesis because at least one of the coefficients was found to have a significant association with the outcome, pounds lost. Age, sex, diet adherence, and sedentary minutes were not significant in the model (p > .05). For each one-point increase in the confidence score, which meant less confidence, .53 more pounds were lost (coefficient = .53).

Part B Multiple logistic regression

Task: Now we would like to see if we can find a relationship (predict) between these baseline measures and whether clients will recommend the clinic to others. No model needs to be run; the output is provided, so we focus on interpreting it.

1. Is running a multiple logistic regression appropriate for this task? Explain why it is or is not appropriate.

Yes, it is appropriate. Logistic regression requires a categorical dependent variable, and recommend (yes/no) is a binary dependent variable.

2. Define the hypotheses

Null hypothesis is that none of the independent variables affects the probability of the dependent variable (yes or no). This implies that all of the coefficients are zero.

H0: β1 = β2 = … = βk = 0. None of the client variables will be associated with recommending the clinic.

H1: At least one β is not zero. At least one of the independent variables contributes significantly to the recommendation result.

3. How many and what percent of clients indicated they would recommend the clinic?

25 clients (49%)

4. Using the output as shown below write a summary of what we found.

DV: recommend clinic to others (1=yes vs 0=no)

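| Variable | B | S.E. | Sig. | OR | 95% C.I. for OR, Lower | 95% C.I. for OR, Upper |
|---|---|---|---|---|---|---|
| lbslost | .248 | .095 | .009 | 1.282 | 1.064 | 1.544 |
| Sex (female vs male) | 1.393 | .694 | .045 | 4.028 | 1.034 | 15.683 |
| age | .011 | .044 | .801 | 1.011 | .928 | 1.102 |
| diet | .113 | .757 | .882 | 1.119 | .254 | 4.935 |
| Constant | -7.228 | 2.780 | .009 | .001 | | |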

Two variables (lbslost and sex) were significant predictors in this logistic regression analysis. The odds ratio for sex is 4.03 (95% CI: 1.03 to 15.68). It indicates that the odds of recommending the clinic to friends are about four times higher for female clients than for male clients, holding the other variables constant. The odds ratio for weight loss is 1.28 (95% CI: 1.06 to 1.54). It indicates that for every additional pound lost, the odds of a client recommending the clinic go up by 28%. No association was found between age and recommending the clinic (p > .05) or between diet adherence and recommending the clinic (p > .05).
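The odds ratios and their confidence limits follow directly from the B and S.E. columns of the output: OR = exp(B), with Wald limits exp(B ± 1.96 × S.E.). The sketch below recomputes them for the two significant predictors. The function name is illustrative, and small differences in the last digit are expected because the printed B and S.E. are rounded.

```python
import math

def odds_ratio_ci(b, se, z_crit=1.959964):
    """Odds ratio and Wald 95% CI from a logistic coefficient and its SE."""
    return math.exp(b), math.exp(b - z_crit * se), math.exp(b + z_crit * se)

# (B, S.E.) pairs copied from the logistic regression output.
results = {
    "lbslost": odds_ratio_ci(0.248, 0.095),
    "sex": odds_ratio_ci(1.393, 0.694),
}
for name, (or_, lo, hi) in results.items():
    print(f"{name}: OR = {or_:.2f} (95% CI {lo:.2f} to {hi:.2f})")

# Percent change in the odds of recommending per additional pound lost.
pct_change = (results["lbslost"][0] - 1) * 100
print(f"odds change per pound: {pct_change:.0f}%")
```

Subtracting 1 from the odds ratio and multiplying by 100 is what turns an OR of 1.28 into the "28% higher odds" wording used in the summary.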

Part C. Sensitivity & Specificity

Our survey used a self-report measure of diet adherence. We want to assess whether the results are valid and accurate by comparing the self-report with a gold standard (a stool sample assessing the microbiome, which should show only small amounts of fats, sugars, etc. in adherent clients). We identify 15 true positives among the 32 clients who self-identified as adherent, and 18 true negatives.

1. Fill in the following table

| | Gold standard positive | Gold standard negative | Total |
|---|---|---|---|
| Self-report adherence (+) | 15 | 17 | 32 |
| Self-report nonadherence (-) | 1 | 18 | 19 |
| Total | 16 | 35 | 51 |

| | Gold standard positive | Gold standard negative | Total |
|---|---|---|---|
| Self-report adherence (+) | a (True positive) | b (False positive) | 32 |
| Self-report nonadherence (-) | c (False negative) | d (True negative) | 19 |
| Total | 16 | 35 | 51 |

2. Calculate the sensitivity of the self report. a/(a+c) = 15/16 = 94%

3. Calculate the specificity of the self report. d/(b+d) = 18/35 = 51%

4. What does this mean? Was our self-report response OK?

Low specificity implies many false positives. In d/(b+d), the total b+d is fixed at 35, so a low d means a larger b.

A limitation of the self-report is that people may want us to think they are following the diet when they aren't (the gold standard contradicts their self-report). These people would be misclassified as dieting. If we assume the diet works and people following it should lose weight, this misclassification may influence our findings such that we do not see the expected association. It is harder to find differences when true negatives (non-dieters) are mixed in with the true positives (for whom the diet is working); this tends to bias any relationship toward the null (making it harder to reject the null hypothesis).

If we do find an association, we may be incorrectly concluding that the diet works, because too many of the clients classified as adherent were not actually dieting. The self-report measure is therefore not valid.
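The Part C calculations can be reproduced from the four cell counts of the 2x2 table. The sketch below is a minimal illustration with assumed variable names; the false positive rate is added only to make the "many false positives" point concrete.

```python
# Cell counts from the 2x2 table (self-report vs gold standard).
tp, fp = 15, 17   # a, b: self-reported adherent
fn, tn = 1, 18    # c, d: self-reported nonadherent

sensitivity = tp / (tp + fn)           # a / (a + c)
specificity = tn / (fp + tn)           # d / (b + d)
false_positive_rate = fp / (fp + tn)   # 1 - specificity

print(f"Sensitivity = {sensitivity:.0%}")
print(f"Specificity = {specificity:.0%}")
print(f"False positive rate = {false_positive_rate:.0%}")
```

Of the 35 clients the gold standard classed as nonadherent, 17 reported being adherent, which is why specificity is only about 51%.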

Part D. Run Chart

In addition to creating a figure that illustrates the run chart, provide a summary of the context of the analysis and your run chart findings that include answers to a few questions.

Clinic X set out to improve the health of their diabetic patients. While developing an evidence-based educational program for their clinic, they monitored the proportion of patients with A1C levels below 7% for the first nine months of 2017 (mean = 0.30, standard deviation = 0.07). In October, November, and December of 2017 the clinicians made improvements in their diabetes education program. The run chart displayed above illustrates that after the educational program the mean proportion of patients with an A1C level below 7% increased to 0.64 (SD = 0.08). We can't be sure the changes led to the improvements, as more refined measurement is not available to determine what specifically led to the improvement. Additional data useful for next steps in this quality improvement initiative could include the range of hemoglobin A1C levels of all patients and the natural cutoff points for different groups of patients, e.g., those less than 7.0%, those 7.0% to less than 8.0%, those 8.0% to less than 9.0%, and those 9.0% or higher. This would show whether all groups of patients improved or whether improvements were made in only a few of the groups.