DBA801 Statistical Methods Cheat Sheet

Aligned with GGU/upGrad course material — quick reference for quantitative business research

CAT = Categorical/Nominal
ORD = Ordinal
CONT = Continuous
BIN = Binary (Yes/No)
MIX = Mixed types
(*) = Method NOT covered in DBA801 course material but essential for handling common research scenarios. Use with caution in exams — acknowledge it as supplementary knowledge.

The 3 Basic Model Groups (from DBA801)

A. Test of Differences

"Is there a significant difference between groups?"

2 groups T-test
3+ groups One-way ANOVA
2+ IVs Factorial ANOVA
+ covariate ANCOVA
Multiple DVs MANOVA / MANCOVA
Non-normal data* Mann-Whitney / Kruskal-Wallis

B. Test of Association / Relationships

"Is there a relationship between variables?"

2 continuous vars Correlation (Pearson r)
2 nominal vars Chi-Square
Predict continuous DV Simple / Multiple Regression
Predict binary DV Logistic Regression
Complex model + latent vars SEM / PLS-SEM
Mechanism (why?) Mediation
Boundary condition (when?) Moderation

C. Cluster Model

"How do observations group together?"

Segment customers/people K-means / Hierarchical Clustering
Reduce many variables Factor Analysis (EFA)
Attribute preference Conjoint Analysis
Time-dependent forecasting ARIMA / Time Series

A. Test of Differences (DBA801: Preparatory Content + Constructing Statistical Models)
Method When / Purpose IV → DV Business Example Key Thresholds & Caveats
T-test
Independent samples
Compare means of 2 groups. Course says: "conducted when the researcher has only two groups in the data set" CAT (2 groups)
CONT
Compare average test scores between two groups receiving different teaching methods
IV: Teaching method (A/B) → DV: Test scores
p < .05; n ≥ 30/group
Assumes normality & equal variances. Use Levene's test to check. If violated, use Welch's t-test.
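The decision flow above (check variances with Levene's test, then run a standard or Welch's t-test) can be sketched in Python with scipy on hypothetical score data; the course itself works in R Commander, so treat this as a supplementary illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical test scores for two teaching methods (illustrative data only)
method_a = rng.normal(72, 8, 35)
method_b = rng.normal(78, 8, 35)

# Levene's test checks the equal-variance assumption first
lev_stat, lev_p = stats.levene(method_a, method_b)

# If Levene's p < .05, fall back to Welch's t-test (equal_var=False)
t_stat, p_val = stats.ttest_ind(method_a, method_b, equal_var=(lev_p >= 0.05))
print(f"t = {t_stat:.2f}, p = {p_val:.4f}")
```

The same two-step pattern (assumption check, then test choice) applies to every parametric test in this section.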
Paired T-test *
Dependent samples
Compare means from same group at 2 time points (before/after, pre/post-test) CAT (time)
CONT
Did employee performance improve before vs after a training program? (pre-test / post-test)
IV: Time (pre/post) → DV: Performance score
p < .05
Differences must be ~normally distributed. Key for experimental pre/post designs discussed in the course.
One-way ANOVA
Analysis of Variance
Compare means across 3+ groups. Course: "conducted when the researcher has three or more groups" CAT (3+ groups)
CONT
Compare income across 3 age groups (under 30, 30-50, over 50). Or: CCC experiment comparing 4 incentive groups on program attendance
IV: Group (A/B/C/D) → DV: Programs attended
F-statistic; p < .05
Only tells you "a difference exists" — use post-hoc (Tukey/Bonferroni) to find WHICH groups differ. Check normality + homogeneity of variance.
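A minimal scipy sketch of a one-way ANOVA on hypothetical data shaped like the four-incentive-group example above (group means and sample sizes are made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical programs-attended scores for four incentive groups
groups = [rng.normal(mu, 2.0, 30) for mu in (5, 5.5, 7, 8)]

f_stat, p_val = stats.f_oneway(*groups)
print(f"F = {f_stat:.2f}, p = {p_val:.4f}")
# A significant F only says SOME groups differ; follow up with a post-hoc
# procedure (e.g. pairwise comparisons at a Bonferroni-corrected alpha of
# .05 / 6 for four groups) to see which pairs.
```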
Factorial ANOVA
Two-way / Three-way
Test effects of 2+ categorical IVs + their interaction on a continuous DV. Course covers 2×2, 3×3, 3×2, 5×7 designs CAT × CAT
CONT
Does the effect of ad format (video/image/text) on engagement vary by placement (social/search/email)?
Main effects + interaction effect. Course example: gender × age group on job satisfaction
F for main effects + interaction
Interaction effect is the key insight — it reveals when the effect of one IV depends on the level of another. Report eta squared (η²) for effect size. The DV must be continuous.
ANCOVA
Analysis of Covariance
Compare group means while controlling for a covariate. Course: "when a control variable needs to be considered for some aspects of a factor" CAT + CONT covariate
CONT
Compare sales across price points while controlling for advertising spend
IV: Price group + Covariate: Ad spend → DV: Sales
Homogeneity of regression slopes
Covariate must not interact with IV. Covariate must correlate with DV. Linearity between covariate and DV within each group.
MANOVA
Multivariate ANOVA
Compare groups on multiple DVs simultaneously. Course: "used when multiple dependent variables exist" CAT
→ Multiple CONT
Do different training programs affect both accuracy and speed and quality simultaneously?
Course: effect of price, quality, brand on satisfaction AND loyalty
Wilks' Λ; p < .05
Controls Type I error across multiple DVs. Assumes multivariate normality + homogeneity of covariance matrices. Can use bootstrapping if assumptions violated.
MANCOVA
MANOVA + controlling for covariates. Course: "extended version of ANCOVA that allows multiple DVs" CAT + CONT covariates
→ Multiple CONT
Effect of training programs on productivity while controlling for employee experience and education level
Same as MANOVA + covariate checks
Combines MANOVA + ANCOVA logic. Requires even larger samples.
Mann-Whitney U *
Wilcoxon rank-sum
Non-parametric alternative to t-test. Compare 2 groups on ordinal/non-normal data CAT (2 groups)
ORD / non-normal
Do entrepreneurs vs non-entrepreneurs differ on Likert-scale "experience" rating (1-5)?
IV: Start business? (Y/N) → DV: Likert rating
p < .05
No normality needed. Compares rank distributions. Better than chi-square for ordinal data as it preserves order information.
Kruskal-Wallis *
Non-parametric ANOVA
Non-parametric alternative to one-way ANOVA for 3+ groups on ordinal/non-normal data CAT (3+ groups)
ORD / non-normal
Do Likert culture ratings differ across 4 departments?
p < .05
Follow up with Dunn's post-hoc + Bonferroni correction. Use when ANOVA assumptions are violated.
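Both rank-based tests above are one-liners in scipy. A sketch on hypothetical 1-5 Likert data matching the entrepreneur and department examples (all numbers invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical 1-5 Likert ratings (ordinal, so rank-based tests apply)
entrepreneurs = rng.integers(2, 6, 40)       # skewed higher
non_entrepreneurs = rng.integers(1, 5, 40)   # skewed lower

# Mann-Whitney U: two groups, compares rank distributions
u_stat, u_p = stats.mannwhitneyu(entrepreneurs, non_entrepreneurs)

# Kruskal-Wallis generalises the same rank logic to 3+ groups
dept_ratings = [rng.integers(1, 6, 25) for _ in range(4)]
h_stat, h_p = stats.kruskal(*dept_ratings)
print(f"U = {u_stat}, p = {u_p:.4f}; H = {h_stat:.2f}, p = {h_p:.4f}")
```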
B. Test of Association / Relationships (DBA801: Preparatory Content + Multivariate Analyses)
Method When / Purpose Variable Types Business Example Key Thresholds & Caveats
Correlation
Pearson r
Find general association between 2 continuous variables. Course: "measures the strength and direction of the linear relationship" CONT × CONT
Relationship between advertising expenditures and sales revenue
r ranges from −1 to +1. Positive = both increase together. Negative = one increases, other decreases.
|r| .1=small .3=med .5=large
Only detects LINEAR relationships. Correlation ≠ causation. Both vars must be ~normal. Sensitive to outliers.
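A scipy sketch of the ad-spend example above, on invented data with a built-in linear link plus noise:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Hypothetical ad spend vs revenue with a linear relationship plus noise
ad_spend = rng.uniform(10, 100, 50)
revenue = 3.0 * ad_spend + rng.normal(0, 25, 50)

r, p = stats.pearsonr(ad_spend, revenue)
print(f"r = {r:.2f}, p = {p:.4g}")  # |r| of .1 / .3 / .5 = small / medium / large
```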
Spearman ρ *
Rank correlation
Non-parametric correlation for ordinal or non-normal data ORD × ORD
or non-normal CONT
Relationship between employee rank and satisfaction rating (Likert scale)
Same benchmarks as Pearson
No normality needed. Detects monotonic (not just linear) relationships. Use for Likert scales.
Chi-Square
χ² test of independence
Find association between 2 nominal variables. Course: "measures the significance of the association between two variables by comparing observed with expected frequencies" CAT × CAT
Is customer demographic (age group) associated with product type purchased? Or: CCC group (A/B/C/D) × Renewal (yes/no)
Cross-tabulation table. Compare observed vs expected counts.
p < .05; expected freq ≥ 5
Does NOT show direction or strength — report Cramér's V for effect size. Use Fisher's Exact if expected < 5. Loses info on ordinal data.
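A scipy sketch of the group × renewal cross-tab above (counts invented), including the Cramér's V effect size the caveat calls for:

```python
import numpy as np
from scipy import stats

# Hypothetical cross-tab: incentive group (rows) x renewal yes/no (columns)
observed = np.array([[30, 20],
                     [35, 15],
                     [25, 25],
                     [40, 10]])

chi2, p, dof, expected = stats.chi2_contingency(observed)

# Chi-square gives significance only; Cramér's V adds strength
n = observed.sum()
k = min(observed.shape) - 1
cramers_v = np.sqrt(chi2 / (n * k))
print(f"chi2 = {chi2:.2f}, p = {p:.4f}, Cramér's V = {cramers_v:.2f}")
assert (expected >= 5).all()  # expected-frequency rule of thumb holds here
```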
Simple Regression
Predict DV from 1 IV + establish magnitude + direction. Course: "analyzes the relationship between one dependent variable and a single independent variable" CONT
CONT
Predict sales based on temperature (ice cream example from course)
Y = β₀ + β₁X. β₁ positive = positive relationship
p < .05
Check: linearity, normality of residuals, homoscedasticity. R² = % of variance explained. β sign tells direction.
Multiple Regression
OLS regression
Predict DV from multiple IVs. Course: "examines the relationship between one dependent and multiple independent variables" MIX (multiple)
CONT
Predict GPA from hours studying, attendance, and extracurricular activities
Y = β₀ + β₁X₁ + β₂X₂ + ... Each β shows unique contribution
VIF < 5
Check multicollinearity (VIF), residual normality, homoscedasticity. Report β weights for relative importance. Rule of thumb: n ≥ 50 + 8k (k = IVs).
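The GPA example plus the R² and VIF checks above can be sketched in plain numpy (all data and coefficients invented; in practice a package like statsmodels reports these directly):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 120
# Hypothetical predictors of GPA (illustrative data)
hours = rng.uniform(0, 20, n)
attendance = rng.uniform(50, 100, n)
extracurr = rng.uniform(0, 10, n)
gpa = 1.0 + 0.08 * hours + 0.02 * attendance + rng.normal(0, 0.3, n)

# OLS fit: first column of ones is the intercept
X = np.column_stack([np.ones(n), hours, attendance, extracurr])
beta, *_ = np.linalg.lstsq(X, gpa, rcond=None)

# R^2 = 1 - SS_res / SS_tot
resid = gpa - X @ beta
r2 = 1 - resid.var() / gpa.var()

def vif(X, j):
    # VIF_j = 1 / (1 - R^2 from regressing predictor j on the others)
    others = np.delete(X, j, axis=1)
    b, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
    e = X[:, j] - others @ b
    r2_j = 1 - e.var() / X[:, j].var()
    return 1.0 / (1.0 - r2_j)

print("R2 =", round(r2, 3), "VIFs =", [round(vif(X, j), 2) for j in (1, 2, 3)])
```

With independently generated predictors the VIFs land near 1; correlated predictors push them up toward the VIF < 5 cutoff.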
Multivariate Regression
Multiple IVs predicting multiple DVs. Course distinguishes this from "multiple regression" MIX
→ Multiple CONT
Predict cardiovascular disease, diabetes, AND hypertension from diet, exercise, and stress simultaneously
Course: combine DV items into composite OR test separately
R² per DV
Different from "multiple regression" (1 DV). Can create composite DV by averaging items, or analyze DVs separately.
Hierarchical Regression *
Enter IVs in theoretically-ordered blocks to test incremental contribution MIX (blocks)
CONT
Block 1: Demographics (age, gender) → Block 2: Job factors (tenure, role). Does Block 2 add explanatory power?
ΔR² shows how much EXTRA variance each block explains
ΔR² sig at p < .05
Block order must be theory-driven. Very common in management/OB research. Addresses "above and beyond" questions.
Logistic Regression
Predict a binary outcome. Course: "dependent variable is binary (win/lose, yes/no, success/failure)" MIX
BIN
Predict whether employee will leave company (yes/no) from age, salary, satisfaction, tenure, commute
S-shaped logistic curve maps to probability 0–1. Course uses R Commander.
Odds Ratio (OR); AIC (lower = better)
OR > 1 = increases odds, OR < 1 = decreases odds. No normality assumption. Check Hosmer-Lemeshow fit. n ≥ ~10 events per predictor.
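The course fits logistic regression in R Commander; as a supplementary from-scratch sketch, the model can be estimated by Newton-Raphson in numpy on hypothetical attrition data, showing how a coefficient converts to an odds ratio:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
# Hypothetical attrition data: satisfaction (1-5) drives leave probability
satisfaction = rng.uniform(1, 5, n)
true_logit = 2.0 - 1.0 * satisfaction            # invented true model
leave = rng.random(n) < 1 / (1 + np.exp(-true_logit))

X = np.column_stack([np.ones(n), satisfaction])
beta = np.zeros(2)
for _ in range(25):                              # Newton-Raphson iterations
    p = 1 / (1 + np.exp(-X @ beta))
    grad = X.T @ (leave - p)                     # gradient of log-likelihood
    hess = (X * (p * (1 - p))[:, None]).T @ X    # observed information
    beta += np.linalg.solve(hess, grad)

odds_ratio = np.exp(beta[1])  # OR per 1-point satisfaction gain; OR < 1 lowers odds
print(f"b1 = {beta[1]:.2f}, OR = {odds_ratio:.2f}")
```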
M. Mediation & Moderation (DBA801: Constructing Statistical Models)
Method When / Purpose Variable Setup Business Example (from course) Key Thresholds & Caveats
Mediation
"Why / How?"
A mediator explains the mechanism by which IV affects DV. Course: "a middle construct explains the relationship between variables" IV → MEDIATOR → DV Employee morale → Work-life balance → Productivity. The effect of morale on productivity is partially explained by work-life balance (ABC Solutions example)
Also: advertising → brand awareness → sales
Indirect effect CI excludes 0
Use bootstrapping (5000+ samples) for indirect effect. Course warns: "do not start research with a mediating variable" — establish direct effect first. Used in "mature studies."
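The bootstrapped indirect effect can be sketched in numpy on hypothetical data shaped like the morale → work-life balance → productivity example (path coefficients invented; dedicated tools such as PROCESS or SEM software do this in practice):

```python
import numpy as np

rng = np.random.default_rng(11)
n = 200
# Hypothetical data: morale -> work-life balance -> productivity
morale = rng.normal(0, 1, n)
wlb = 0.5 * morale + rng.normal(0, 1, n)                        # path a
productivity = 0.4 * wlb + 0.2 * morale + rng.normal(0, 1, n)   # paths b, c'

def indirect(idx):
    x, m, y = morale[idx], wlb[idx], productivity[idx]
    a = np.polyfit(x, m, 1)[0]                  # slope of M on X
    Z = np.column_stack([np.ones(len(idx)), x, m])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return a * coef[2]                          # a * b = indirect effect

# Percentile bootstrap CI for the indirect effect (2000 resamples here;
# 5000+ recommended in practice)
boot = [indirect(rng.integers(0, n, n)) for _ in range(2000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"indirect effect 95% CI: [{lo:.3f}, {hi:.3f}]")  # CI excluding 0 = mediation
```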
Moderation
"When / For whom?"
A moderator changes the strength/direction of the IV→DV relationship. Course: "the relationship is strong or weak for different groups" IV × MODERATOR → DV Weekly parties → Employee engagement, moderated by Age. Younger employees show stronger effect than older ones (XYZ Corp example)
Also: engagement → satisfaction, moderated by compensation level
Interaction term p < .05
Mean-center variables before creating interaction. Use simple slopes to probe. Course: "requires larger sample size and statistical power than simple regression." Plan in advance.
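Mean-centering, the interaction term, and simple slopes can be sketched in numpy on hypothetical data shaped like the parties → engagement, age-moderated example (all coefficients invented):

```python
import numpy as np

rng = np.random.default_rng(13)
n = 250
# Hypothetical data: parties -> engagement, moderated by age
parties = rng.normal(0, 1, n)
age = rng.normal(40, 10, n)

age_c = age - age.mean()            # mean-center BEFORE forming the product
interaction = parties * age_c
engagement = 0.5 * parties - 0.02 * interaction + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), parties, age_c, interaction])
beta, *_ = np.linalg.lstsq(X, engagement, rcond=None)

# Simple slopes: effect of parties at mean age -1 SD vs +1 SD
sd = age_c.std()
slope_young = beta[1] + beta[3] * (-sd)
slope_old = beta[1] + beta[3] * (+sd)
print(f"slope at -1SD age: {slope_young:.2f}, at +1SD age: {slope_old:.2f}")
```

The gap between the two simple slopes is exactly what the interaction coefficient quantifies.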
+ Advanced & Multivariate Methods (DBA801: Enhancing Statistical Models + Multivariate Analyses)
Method When / Purpose Data Setup Business Example Key Thresholds & Caveats
Factor Analysis
EFA / CFA
Data reduction — reduce many variables into manageable constructs. Course: "a multivariate technique... creating a mathematical model to identify patterns" Many CONT / ORD items
→ Latent factors
Reduce variables (price, quality, brand, advertising) into key constructs driving consumer behavior
Course: marketing team condensing variables into "brand reputation" + "perceived quality"
Factor loading > 0.707; eigenvalue > 1
Course emphasis: loading ≥ 0.707 means ≥50% variance captured → keep. Below 0.707 → revise/remove. n ≥ 5-10 per item. Use EFA first, CFA to confirm (not on same data).
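Full EFA needs dedicated tooling (R Commander in the course), but the eigenvalue > 1 retention screen can be illustrated with a plain eigendecomposition of the correlation matrix, here on invented data with two latent factors driving six items:

```python
import numpy as np

rng = np.random.default_rng(21)
n = 300
# Two hypothetical latent factors, each driving three observed items
f1, f2 = rng.normal(0, 1, (2, n))
items = np.column_stack(
    [0.8 * f1 + rng.normal(0, 0.5, n) for _ in range(3)]
    + [0.8 * f2 + rng.normal(0, 0.5, n) for _ in range(3)]
)

corr = np.corrcoef(items, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]

# Kaiser criterion: retain factors with eigenvalue > 1
n_factors = int((eigenvalues > 1).sum())
print("eigenvalues:", np.round(eigenvalues, 2), "-> retain", n_factors)
```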
SEM
Structural Equation Modeling
Test complex models with latent variables, multiple paths simultaneously. Course: "performed when we have multiple constructs, multiple IVs, and multiple DVs" MIX (latent + observed)
MIX
Full model: Leadership → PS → Performance, with PD moderating (your DBA801 paper). Course uses SmartPLS
Inner model (paths) + Outer model (indicators)
AVE > .50; VIF < 3
Course steps: (1) Data quality + Cronbach α, (2) Estimate model, (3) Inner/outer model, (4) Reliability + AVE, (5) Discriminant validity (Fornell-Larcker), (6) VIF for collinearity, (7) Path coefficients, (8) R².
Conjoint Analysis
Determine relative importance of product attributes. Course: "used to determine the relative importance of different attributes of a product or service" Attribute profiles
→ Preference ranking
How much do customers value screen size vs battery life vs camera vs price vs brand for a phone?
Steps: select attributes → orthogonal design → rank profiles → utility function
Utility scores per attribute
Orthogonal design ensures attributes aren't correlated. Commonly used in product development and pricing research.
Cluster Analysis
Group similar observations into segments. Course: "group data based on similar characteristics or behavior" Multiple CONT / MIX
→ Segments
Segment customers by purchasing behavior. Course: growing importance in ML and AI
K-means or Hierarchical clustering in R Commander
Silhouette > .5 good
No single "correct" solution. Standardize variables first. Different from factor analysis: FA groups variables, Cluster groups observations.
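A k-means sketch on invented two-segment customer data, using scipy (the course runs this in R Commander); note the standardization step the caveat above requires:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(8)
# Hypothetical customers: two (spend, frequency) segments
seg_a = rng.normal([20, 2], 3, (50, 2))
seg_b = rng.normal([80, 10], 3, (50, 2))
customers = np.vstack([seg_a, seg_b])

# Standardize first: k-means is scale-sensitive
z = (customers - customers.mean(0)) / customers.std(0)

centroids, labels = kmeans2(z, 2, seed=42, minit='++')
sizes = np.bincount(labels)
print("cluster sizes:", sizes)
```

In real data the right k is unknown; compare solutions with silhouette scores rather than trusting one run.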
Time Series / ARIMA
Forecast values over time. Course: "ARIMA used for time series data forecasting; simple regression is NOT suitable for such data" Time-ordered CONT
→ Future values
Forecast monthly sales for next 12 months. Stock prices, weather patterns, economic indicators
Components: AutoRegressive (AR) + Integrated, i.e. differencing (I) + Moving Average (MA)
AIC for model selection
Handles trends, seasonality, autocorrelation that regular regression cannot. Course distinguishes cross-sectional vs panel data. Fixed effects vs random effects for panel data.
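Full ARIMA fitting needs a dedicated package (e.g. statsmodels in Python or forecast in R); the autoregressive component alone can be sketched in numpy by regressing each value on its predecessor, on an invented AR(1) sales series:

```python
import numpy as np

rng = np.random.default_rng(6)
# Hypothetical monthly sales as an AR(1) process: y_t = c + phi*y_{t-1} + e_t
n, phi, c = 120, 0.7, 30.0
y = np.empty(n)
y[0] = c / (1 - phi)                 # start at the long-run mean
for t in range(1, n):
    y[t] = c + phi * y[t - 1] + rng.normal(0, 5)

# Fit the AR(1) coefficient by regressing y_t on y_{t-1}
X = np.column_stack([np.ones(n - 1), y[:-1]])
(c_hat, phi_hat), *_ = np.linalg.lstsq(X, y[1:], rcond=None)

# One-step-ahead forecast from the fitted model
forecast = c_hat + phi_hat * y[-1]
print(f"phi_hat = {phi_hat:.2f}, next-month forecast = {forecast:.1f}")
```

This is exactly the autocorrelation that plain cross-sectional regression ignores, which is why the course rules it out for time series.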
R. Reliability & Validity (DBA801: Measurement and Sampling + Multivariate Analyses, incl. SEM)
Measure What It Tests Business Example Key Thresholds & Notes
Cronbach's Alpha
Internal consistency — do items in a scale consistently measure the same thing? Course: "evaluates the degree to which all items on a test or scale are related" Do 5 items measuring "job satisfaction" yield consistent responses?
Course: expressed as a number between 0 and 1
α ≥ 0.707 strong; α > .95 suspicious
Course uses the 0.707 threshold (same as factor loading). α > .95 may mean item redundancy. Report "if item deleted" values.
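The alpha formula itself is short enough to sketch in numpy, here on an invented 5-item scale driven by one latent trait:

```python
import numpy as np

def cronbach_alpha(items):
    """items: (n_respondents, k_items) array. Standard alpha formula:
    alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

rng = np.random.default_rng(4)
n = 200
# Hypothetical 5-item satisfaction scale: one latent trait plus item noise
trait = rng.normal(0, 1, n)
scale = np.column_stack([trait + rng.normal(0, 0.6, n) for _ in range(5)])
print("alpha =", round(cronbach_alpha(scale), 3))
```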
Test-Retest Reliability
Stability over time — same test, same people, different times. Course: "consistency between two measurements conducted under different circumstances" Administer employee performance test twice under different conditions → consistent results?
r > .70 acceptable
High correlation between test 1 and test 2 = reliable measure.
Interrater Reliability
Agreement between raters. Course: "measures agreement between raters rating the same phenomenon" Multiple interviewers rate same candidate — do they agree?
Cohen's κ > .60
High agreement = reliable. Low = need rater training or clearer criteria.
Face Validity
Does the measure look legitimate on surface? Course: "how genuine a result appears based solely on its appearance" Ask potential customers to rate a new logo's appearance on 1-10
Subjective assessment. Weakest form of validity — necessary but not sufficient. No technical methods required.
Internal Validity
Can we establish cause and effect? Course: "whether a certain intervention can produce the intended outcome" Training program → improved sales? Was it the training or other factors?
Course threats: history/maturation, measurement error, regression to mean, attrition, environmental factors. Controlled by randomization + control groups.
External Validity
Generalizability. Course: "extent to which research findings can be extended to a larger, more diverse population" Can results from 200 employees at one company apply to all companies?
Requires representative sampling. Consider: sample demographics, setting similarity, time period.
AVE
Average Variance Extracted
Convergent validity in SEM — do items capture enough variance of the construct? Do PS scale items converge on the "psychological safety" construct?
AVE ≥ .50
Used in SEM/PLS-SEM. AVE < .50 = more error than signal. Course mentions this in SmartPLS steps.
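AVE is simply the mean of the squared standardized loadings, which a two-line numpy sketch makes concrete (loadings invented; SmartPLS reports this value directly):

```python
import numpy as np

# AVE = mean of squared standardized loadings (hypothetical loadings)
loadings = np.array([0.78, 0.82, 0.71, 0.75])
ave = np.mean(loadings ** 2)
print(f"AVE = {ave:.3f}")  # prints AVE = 0.587; >= .50 supports convergent validity
```

Note the link to the 0.707 loading rule: a loading of 0.707 squared is exactly 0.50.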
Discriminant Validity
Fornell-Larcker
Are constructs truly distinct from each other? Are "psychological safety" and "trust" measuring different things?
√AVE > inter-construct r
Course (SEM section): compare √AVE of each construct against cross-loadings and squared correlations. HTMT < .85 is a better modern alternative*.

Course framework reminder — the 3 model groups: (A) Test of Differences = T-test, ANOVA family · (B) Test of Relationships = Correlation, Chi-square, Regression, Logistic Regression, SEM · (C) Cluster Model = Cluster Analysis, Factor Analysis

Key concepts from course: True Score Theory: X = T + Er + Es (observed = true score + random error + systematic error). Reduce systematic error via literature review, training collectors, pilot tests, multiple measures.

Sample size rules of thumb: n ≥ 30 per group for parametric tests. Regression: n ≥ 50 + 8k (k = predictors). SEM: n ≥ 200. PLS-SEM: 10× max paths pointing at any construct.

Effect sizes matter more than p-values. Course: "Simply looking at a low p-value may not necessarily indicate meaningful results." Also: "Big data sets make p-values useless as almost every model will be significant."

Type I error = reject true H₀ (false positive). Type II error = fail to reject false H₀ (false negative: a real effect goes undetected). α = .05 means 5% chance of Type I error.

The Likert debate: Single Likert item = ordinal → non-parametric tests (Mann-Whitney*, Spearman*). Mean of multiple Likert items = often treated as continuous → parametric OK (Norman, 2010).

(*) Methods not in DBA801 material but widely used: Paired t-test, Spearman ρ, Mann-Whitney U, Kruskal-Wallis, Hierarchical Regression. These are standard in published business research — safe to reference but be transparent about sourcing.

Course tools: R Commander (ANOVA, Regression, Logistic Regression, Factor Analysis, Cluster) + SmartPLS (SEM) + Qualtrics (Survey design) + Excel/SPSS (Data entry)