
Structural equation modeling: Difference between revisions

Article snapshot taken from Wikipedia with creative commons attribution-sharealike license.
Revision as of 20:02, 3 July 2023 by HelpingSEM (edit summary: Estimation of free parameters: expanding for fundamental understanding)
Revision as of 22:33, 3 July 2023 by HelpingSEM (edit summary: Evaluation of models via model fit: extending this section of the discussion as has been requested since about 2019)
Line 70: Line 70:
One common problem is that a coefficient’s estimated value may be underidentified because it is insufficiently constrained by the model and data. No unique best-estimate exists unless the model and data together sufficiently constrain or restrict a coefficient’s value. For example, the magnitude of a single data correlation between two variables is insufficient to provide estimates of a reciprocal pair of modeled effects between those variables. The correlation might be accounted for by one of the reciprocal effects being stronger than the other effect, or the other effect being stronger than the one, or by effects of equal magnitude. Underidentified effect estimates can be rendered identified by introducing additional model and/or data constraints. For example, reciprocal effects can be rendered identified by constraining one effect estimate to be double, triple, or equivalent to, the other effect estimate,<ref name="Hayduk96" /> but the resultant estimates will only be trustworthy if the additional model constraint corresponds to the world’s structure. Data on a third variable that directly causes only one of a pair of reciprocally causally connected variables can also assist identification.<ref name="Rigdon95" /> Constraining a third variable to not directly cause one of the reciprocally-causal variables breaks the symmetry otherwise plaguing the reciprocal effect estimates because that third variable must be more strongly correlated with the variable it causes directly than with the variable at the “other” end of the reciprocal which it impacts only indirectly.<ref name="Rigdon95"/> Notice that this again presumes the properness of the model’s causal specification – namely that there really is a direct effect leading from the third variable to the variable at this end of the reciprocal effects and no direct effect on the variable at the “other end” of the reciprocally connected pair of variables.
Theoretical demands for null/zero effects provide helpful constraints assisting estimation, though theories often fail to clearly report which effects are allegedly nonexistent. One common problem is that a coefficient’s estimated value may be underidentified because it is insufficiently constrained by the model and data. No unique best-estimate exists unless the model and data together sufficiently constrain or restrict a coefficient’s value. For example, the magnitude of a single data correlation between two variables is insufficient to provide estimates of a reciprocal pair of modeled effects between those variables. The correlation might be accounted for by one of the reciprocal effects being stronger than the other effect, or the other effect being stronger than the one, or by effects of equal magnitude. Underidentified effect estimates can be rendered identified by introducing additional model and/or data constraints. For example, reciprocal effects can be rendered identified by constraining one effect estimate to be double, triple, or equivalent to, the other effect estimate,<ref name="Hayduk96" /> but the resultant estimates will only be trustworthy if the additional model constraint corresponds to the world’s structure. 
Data on a third variable that directly causes only one of a pair of reciprocally causally connected variables can also assist identification.<ref name="Rigdon95" /> Constraining a third variable to not directly cause one of the reciprocally-causal variables breaks the symmetry otherwise plaguing the reciprocal effect estimates because that third variable must be more strongly correlated with the variable it causes directly than with the variable at the “other” end of the reciprocal which it impacts only indirectly.<ref name="Rigdon95"/> Notice that this again presumes the properness of the model’s causal specification – namely that there really is a direct effect leading from the third variable to the variable at this end of the reciprocal effects and no direct effect on the variable at the “other end” of the reciprocally connected pair of variables. Theoretical demands for null/zero effects provide helpful constraints assisting estimation, though theories often fail to clearly report which effects are allegedly nonexistent.
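The underidentification of a reciprocal pair of effects can be illustrated numerically. The following is a minimal sketch, not taken from the cited sources: it assumes two standardized variables correlated 0.5, independent error terms, and hypothetical function names. It shows that different pairs of reciprocal effects reproduce exactly the same observed variances and covariance, so the data alone cannot choose among them.

```python
import math

def implied_cov(b12, b21, psi1, psi2):
    """Covariances implied by y1 = b12*y2 + e1 and y2 = b21*y1 + e2,
    assuming independent errors with variances psi1 and psi2."""
    d = 1.0 - b12 * b21  # determinant of (I - B)
    var1 = (psi1 + b12**2 * psi2) / d**2
    var2 = (b21**2 * psi1 + psi2) / d**2
    cov12 = (b21 * psi1 + b12 * psi2) / d**2
    return var1, cov12, var2

def solve_given_b12(b12, var1, cov12, var2):
    """Choose b12 arbitrarily, then find the b21 and error variances
    that reproduce the observed variances and covariance exactly."""
    b21 = (cov12 - b12 * var2) / (var1 - b12 * cov12)
    psi1 = var1 - 2 * b12 * cov12 + b12**2 * var2
    psi2 = var2 - 2 * b21 * cov12 + b21**2 * var1
    return b21, psi1, psi2

# Illustrative "data": two standardized variables correlated 0.5.
observed = (1.0, 0.5, 1.0)

solutions = []
for b12 in (0.0, 0.2, 0.5):
    b21, psi1, psi2 = solve_given_b12(b12, *observed)
    implied = implied_cov(b12, b21, psi1, psi2)
    solutions.append((b12, b21, implied))
    print(f"b12={b12:.2f}  b21={b21:.4f}  implied={implied}")
```

Every row of the output implies the same covariances even though the reciprocal effects differ, which is why an additional constraint (an equality constraint, or a third variable affecting only one end of the reciprocal pair) is required before unique estimates exist.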


=== Evaluation of models via model fit === === Model Assessment ===
{{More citations needed section|date=February 2019}}


Model assessment depends on the theory, the data, the model, and the estimation strategy. Hence model assessments consider:
An important task is to examine how well an estimated model “fits” the available data. The output from SEM programs includes a matrix reporting the relationships between the observed variables that would be observed if the estimated effects in the model actually controlled the observed variables’ values. The “fit” of a model reports match or mismatch between the model-implied relationships (often covariances) and the observed relationships among the variables. Large and significant differences between the data and the model’s implications signal problems. The probability accompanying a χ2 (chi-square) test is the probability that the data could arise by random sampling variations if the estimated model constituted the real underlying population forces. A small χ2 probability reports it would be unlikely for the current data to have arisen if the model structure constituted the real population causal forces – with the remaining differences attributed to random sampling variations.
* '''whether the data contain reasonable measurements of appropriate variables''',
* '''whether the modeled cases are causally homogeneous''', (It makes no sense to estimate one model if the data cases reflect two or more different causal networks.)
Numerous fit indices quantify how closely a model fits the data but the SEM literature is divided on how to respond to varying amounts of ill fit.<ref name="Barrett07"> Barrett, P. (2007). “Structural equation modeling: Adjudging model fit.” Personality and Individual Differences. 42 (5): 815-824. </ref> One complication is that the size or amount of ill fit is not assuredly coordinated with the severity or nature of the issues producing the data inconsistency.<ref name="Hayduk14a">Hayduk, L.A. (2014a) “Seeing perfectly-fitting factor models that are causally misspecified: Understanding that close-fitting models can be worse.” Educational and Psychological Measurement. 74 (6): 905-926. doi: 10.1177/0013164414527449 </ref> Models with different causal structures which fit the data identically well have been called equivalent models.<ref name="Kline16">Kline, Rex. (2016) Principles and Practice of Structural Equation Modeling (4th ed). New York, Guilford Press. ISBN 978-1-4625-2334-4</ref> Such models are data-fit-equivalent though not causally equivalent, so at least one of the so-called equivalent models must be inconsistent with the world’s structure. If there is a perfect 1.0 correlation between variables X and Y and the model claims X causes Y, there will be perfect fit between the data and the model’s implication. But that model might not match the world because Y might cause X, or both X and Y might be responding to common-cause Z, or the world might contain a mixture of these effects (such as a common cause plus an effect of Y on X). The perfect fit of the X causes Y model does not guarantee the model’s structure corresponds to the world’s structure – maybe it does, maybe it doesn’t. And getting closer to perfect fit similarly does not guarantee the model is getting closer to matching the world’s structure – maybe it is, maybe it isn’t.
* '''whether the model appropriately represents the theory or features of interest''', (Models are unpersuasive if they omit features required by a theory, or contain coefficients inconsistent with that theory.)
* '''whether the estimates are statistically justifiable''', (Substantive assessments may be devastated by violating assumptions, by using an inappropriate estimator, and/or by encountering non-convergence of iterative estimators.)
* '''the substantive reasonableness of the estimates''', (Negative variances, and correlations exceeding 1.0 or -1.0, are impossible. Statistically possible estimates that are inconsistent with theory may also challenge theory, and our understanding.)
* '''the remaining consistency, or inconsistency, between the model and data'''. (The estimation process minimizes the differences between the model and data but important and informative differences may remain.)


Research claiming to test or “investigate” a theory requires attending to beyond-chance model-data inconsistency. Estimation adjusts the model’s free coefficients to provide the best possible fit to the data. The output from SEM programs includes a matrix reporting the relationships among the observed variables that would be observed if the estimated model effects actually controlled the observed variables’ values. The “fit” of a model reports match or mismatch between the model-implied relationships (often covariances) and the corresponding observed relationships among the variables. Large and significant differences between the data and the model’s implications signal problems. The probability accompanying a χ2 (chi-square) test is the probability that the data could arise by random sampling variations if the estimated model constituted the real underlying population forces. A small χ2 probability reports it would be unlikely for the current data to have arisen if the modeled structure constituted the real population causal forces – with the remaining differences attributed to random sampling variations.
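The comparison between observed and model-implied covariances can be sketched for the smallest possible case. The sketch below is an illustration under stated assumptions, not the procedure of any particular SEM program: it uses two observed variables, a hypothetical model fixing their covariance at zero (one degree of freedom), the maximum likelihood discrepancy function, and the common (N − 1) multiplier (some programs use N). The function names are made up for this example.

```python
import math

def ml_discrepancy_2x2(S, Sigma):
    """F_ML = ln|Sigma| - ln|S| + tr(S Sigma^-1) - p for 2x2 matrices (p = 2).
    S is the observed covariance matrix, Sigma the model-implied one."""
    (s11, s12), (_, s22) = S
    (m11, m12), (_, m22) = Sigma
    det_S = s11 * s22 - s12**2
    det_M = m11 * m22 - m12**2
    # tr(S Sigma^-1), written out using the closed-form 2x2 inverse
    trace = (s11 * m22 - 2 * s12 * m12 + s22 * m11) / det_M
    return math.log(det_M) - math.log(det_S) + trace - 2

def chi_square_test_df1(S, Sigma, n):
    """Model chi-square and its probability for a model with 1 degree of freedom."""
    T = max((n - 1) * ml_discrepancy_2x2(S, Sigma), 0.0)
    # Survival function of a chi-square variate with 1 df
    p = math.erfc(math.sqrt(T / 2.0))
    return T, p

# Illustrative data: observed correlation of 0.3 among N = 100 cases.
S = ((1.0, 0.3), (0.3, 1.0))
# The model claims the two variables are unrelated, so it implies zero covariance.
Sigma = ((1.0, 0.0), (0.0, 1.0))

T, p = chi_square_test_df1(S, Sigma, n=100)
print(f"chi-square = {T:.3f}, df = 1, p = {p:.4f}")
```

A small probability here reports that a correlation of this size would rarely arise from sampling variation alone if the variables were truly unrelated in the population, which is the sense in which the model is inconsistent with the data.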
This logical difficulty is especially pronounced whenever a structural equation model is significantly inconsistent with the data,<ref name="Barrett07"/> because there is no general justification for why a researcher should “accept” a causally wrong model, rather than correct detected misspecifications. Several forces continue to propagate reporting “close” fit. Dag Sorbom reported that when someone asked Karl Joreskog, the developer of the first structural equation modeling program: “Why have you then added GFI?” to your LISREL program, Joreskog replied “Well, users threaten us saying they would stop using LISREL if it always produces such large chi-squares. So we had to invent something to make people happy. GFI serves that purpose.”
<ref name="CDS01"> Cudeck, R; du Toit R.; Sorbom, D. (editors) (2001) Structural Equation Modeling: Present and Future: Festschrift in Honor of Karl Joreskog. Scientific Software International: Lincolnwood, IL.</ref> The χ2 evidence of model-data inconsistency was too statistically solid to be disregarded, but the GFI and other fit indices could still distract from unwelcome evidence of model-data inconsistency.


If a model remains inconsistent with the data despite selecting optimal coefficient estimates, an honest research response reports and attends to this evidence (often a significant model χ2 test).<ref name="Hayduk14b"/> Beyond-chance model-data inconsistency challenges both the coefficient estimates and the model’s capacity for adjudicating the model’s structure, irrespective of whether the inconsistency originates in problematic data, inappropriate statistical estimation, or incorrect model specification.
Replication is unlikely to detect misspecified models which inappropriately-fit the data. If the replicate data is within random variations of the original data, the same incorrect coefficient placements that provided inappropriate-fit to the original data will likely also inappropriately-fit the replicate data. Replication helps detect issues such as data mistakes, but is especially weak at detecting misspecifications after exploratory model modification – as when confirmatory factor analysis (CFA) is applied to a random second-half of data following exploratory factor analysis (EFA) of first-half data.
Coefficient estimates in data-inconsistent (“failing”) models are interpretable, as reports of how the world would appear to someone believing a model that conflicts with the available data. The estimates in data-inconsistent models do not necessarily become “obviously wrong” by becoming statistically strange, or wrongly signed according to theory. The estimates may even closely match a theory’s requirements but the remaining data inconsistency renders the match between the estimates and theory unable to provide succor. Failing models remain interpretable, but only as interpretations that conflict with available evidence.


A modification index is an estimate of how much a model’s fit to the data would “improve” (but not necessarily how much the model’s structure would improve) if a specific currently-fixed model coefficient were freed for estimation. Researchers confronting data-inconsistent models can easily free coefficients the modification indices report as likely to produce substantial improvements in fit. This simultaneously introduces a substantial risk of moving from a causally-wrong-and-failing model to a causally-wrong-but-fitting model because improved data-fit does not provide assurance that the freed coefficients are substantively reasonable or world matching. The original model may contain causal misspecifications such as incorrectly directed effects, or incorrect assumptions about unavailable variables, and such problems cannot be corrected by adding coefficients to the current model. Consequently, such models remain misspecified despite the closer fit provided by additional coefficients. Fitting yet worldly-inconsistent models are especially likely to arise if a researcher committed to a particular model (for example a factor model having a desired number of factors) gets an initially-failing model to fit by inserting measurement error covariances “suggested” by modification indices.
A cautionary instance was provided by Browne, MacCallum, Kim, Andersen, and Glaser (2002) who addressed the mathematics behind why the χ2 test can have (though it does not always have) considerable power to detect model misspecification.<ref name="BMKAG02"> Browne, M.W.; MacCallum, R.C.; Kim, C.T.; Andersen, B.L.; Glaser, R. (2002) “When fit indices and residuals are incompatible.” Psychological Methods. 7: 403-421.</ref> They presented a factor model as acceptable despite that model being significantly inconsistent with their data. Incorporating an overlooked experimental feature provided a model fitting the same data and contradicting the original model.<ref name="HP-RCLB05">Hayduk, L. A.; Pazderka-Robinson, H.; Cummings, G.G.; Levers, M-J. D.; Beres, M. A. (2005) “Structural equation model testing and the quality of natural killer cell activity measurements.” BMC Medical Research Methodology. 5 (1): 1-9. doi: 10.1186/1471-2288-5-1. Note the correction of .922 to .992, and the correction of .944 to .994 in the Hayduk, et al. Table 1.</ref> The fault was not in the math of the indices or in the over-sensitivity of χ2 testing. The fault was in forgetting, neglecting, or overlooking, that the amount of ill fit cannot be trusted to correspond to the nature or seriousness of problems in a model’s specification.<ref name="Hayduk14a"/>

Reporting fit-index values to distract from evidence of model-data inconsistency introduces discipline-wide costs. The discipline pays the opportunity cost of not having pursued a structurally improved understanding of the discipline’s data.
“Accepting” failing models as “close enough” is also not a reasonable alternative. A cautionary instance was provided by Browne, MacCallum, Kim, Andersen, and Glaser who addressed the mathematics behind why the χ2 test can have (though it does not always have) considerable power to detect model misspecification.<ref name="BMKAG02"> Browne, M.W.; MacCallum, R.C.; Kim, C.T.; Andersen, B.L.; Glaser, R. (2002) “When fit indices and residuals are incompatible.” Psychological Methods. 7: 403-421.</ref> The probability accompanying a χ2 test is the probability that the data could arise by random sampling variations if the current model, with its optimal estimates, constituted the real underlying population forces. A small χ2 probability reports it would be unlikely for the current data to have arisen if the current model structure constituted the real population causal forces – with the remaining differences attributed to random sampling variations. Browne, MacCallum, Kim, Andersen, and Glaser presented a factor model they viewed as acceptable despite the model being significantly inconsistent with their data according to χ2. The fallaciousness of their claim that close-fit should be treated as good enough was demonstrated by Hayduk, Pazderka-Robinson, Cummings, Levers and Beres<ref name="HP-RCLB05">Hayduk, L. A.; Pazderka-Robinson, H.; Cummings, G.G.; Levers, M-J. D.; Beres, M. A. (2005) “Structural equation model testing and the quality of natural killer cell activity measurements.” BMC Medical Research Methodology. 5 (1): 1-9. doi: 10.1186/1471-2288-5-1. Note the correction of .922 to .992, and the correction of .944 to .994 in the Hayduk, et al. Table 1. </ref> who demonstrated a fitting model for Browne, et al.’s own data by incorporating an experimental feature Browne, et al. overlooked. The fault was not in the math of the indices or in the over-sensitivity of χ2 testing.
The fault was in Browne, MacCallum, and the other authors forgetting, neglecting, or overlooking, that the amount of ill fit cannot be trusted to correspond to the nature, location, or seriousness of problems in a model’s specification.<ref name="Hayduk14a"/>

Many researchers tried to justify switching to fit-indices, rather than testing their models, by claiming that χ2 increases (and hence χ2’s probability decreases) with increasing sample size (N). There are two mistakes in discounting χ2 on this basis. First, for proper models, χ2 does not increase with increasing N,<ref name="Hayduk14b"/> so if χ2 increases with N that itself is a sign that something is detectably problematic. And second, for models that are detectably misspecified, χ2’s increase with N provides the good-news of increasing statistical power to detect model misspecification (namely a reduced risk of Type II error). Some kinds of important misspecifications cannot be detected by χ2,<ref name="Hayduk14a"/> so any amount of ill fit beyond what might be reasonably produced by random variations warrants report and consideration.<ref name="Barrett07"/><ref name="Hayduk14b"/> The χ2 model test, possibly adjusted,<ref name="SB94"/> is the strongest available structural equation model test.
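The two behaviours of χ2 under increasing N can be sketched with a standard large-sample approximation, which is an assumption of this illustration rather than something stated in the cited sources: the expected value of the model χ2 is roughly its degrees of freedom plus (N − 1) times the population discrepancy, and that discrepancy is zero for a properly specified model. The example model (two variables with the covariance fixed at zero) and the function names are hypothetical.

```python
import math

def pop_discrepancy_zero_cov_model(rho):
    """Population ML discrepancy for a two-variable model that fixes the
    covariance at zero when the true correlation is rho."""
    return -math.log(1.0 - rho**2)

def expected_chi_square(pop_discrepancy, df, n):
    """Approximate expected model chi-square: df plus the noncentrality (n - 1) * F0."""
    return df + (n - 1) * pop_discrepancy

proper = []  # model is correct: true correlation really is zero
wrong = []   # model is misspecified: true correlation is 0.2
for n in (100, 400, 1600):
    proper.append(expected_chi_square(pop_discrepancy_zero_cov_model(0.0), df=1, n=n))
    wrong.append(expected_chi_square(pop_discrepancy_zero_cov_model(0.2), df=1, n=n))
    print(f"N={n:5d}  proper model: {proper[-1]:.2f}  misspecified model: {wrong[-1]:.2f}")
```

Under this approximation the proper model's expected χ2 stays at its degrees of freedom regardless of N, while the misspecified model's expected χ2 grows with N, which corresponds to growing power to detect the misspecification.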

Numerous fit indices quantify how closely a model fits the data but all fit indices suffer from the logical difficulty that the size or amount of ill fit is not trustably coordinated with the severity or nature of the issues producing the data inconsistency.<ref name="Hayduk14a"/> Models with different causal structures which fit the data identically well have been called equivalent models.<ref name="Kline16"/> Such models are data-fit-equivalent though not causally equivalent, so at least one of the so-called equivalent models must be inconsistent with the world’s structure. If there is a perfect 1.0 correlation between X and Y and we model this as X causes Y, there will be perfect fit and zero residual error. But the model may not match the world because Y may actually cause X, or both X and Y may be responding to a common cause Z, or the world may contain a mixture of these effects (e.g. a common cause plus an effect of Y on X), or other causal structures. The perfect fit does not tell us the model’s structure corresponds to the world’s structure, and this in turn implies that getting closer to perfect fit does not necessarily correspond to getting closer to the world’s structure – maybe it does, maybe it doesn’t. This makes it incorrect for a researcher to claim that even perfect model fit implies the model is correctly causally specified. For even moderately complex models, precisely equivalently-fitting models are rare. Models almost-fitting the data, according to any index, unavoidably introduce additional potentially-important yet unknown model misspecifications. These models constitute a greater research impediment.
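The point about equivalent models can be made concrete with two variables. The sketch below uses made-up covariances and hypothetical function names, and it is only an illustration: it fits an "X causes Y" model and a "Y causes X" model to the same observed variances and covariance and shows that both reproduce the data exactly, so fit cannot distinguish the two causal directions.

```python
import math

def fit_x_causes_y(var_x, cov_xy, var_y):
    """Estimates for Y = slope * X + error: (var X, slope, residual variance of Y)."""
    slope = cov_xy / var_x
    return var_x, slope, var_y - slope**2 * var_x

def implied_by_x_causes_y(var_x, slope, resid_var_y):
    """(var X, cov XY, var Y) implied by the X-causes-Y model."""
    return var_x, slope * var_x, slope**2 * var_x + resid_var_y

def fit_y_causes_x(var_x, cov_xy, var_y):
    """Estimates for X = slope * Y + error: (var Y, slope, residual variance of X)."""
    slope = cov_xy / var_y
    return var_y, slope, var_x - slope**2 * var_y

def implied_by_y_causes_x(var_y, slope, resid_var_x):
    """(var X, cov XY, var Y) implied by the Y-causes-X model."""
    return slope**2 * var_y + resid_var_x, slope * var_y, var_y

# Illustrative observed moments: var(X) = 4, cov(X, Y) = 3, var(Y) = 9.
observed = (4.0, 3.0, 9.0)

implied_xy = implied_by_x_causes_y(*fit_x_causes_y(*observed))
implied_yx = implied_by_y_causes_x(*fit_y_causes_x(*observed))
print("X causes Y implies:", implied_xy)
print("Y causes X implies:", implied_yx)
```

Both models imply the observed moments exactly, yet their slopes and residual variances differ and at most one of the causal directions can match the world.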

This logical weakness renders all fit indices “unhelpful” whenever a structural equation model is significantly inconsistent with the data,<ref name="Barrett07"/> but several forces continue to propagate fit-index use. For example, Dag Sorbom reported that when someone asked Karl Joreskog, the developer of the first structural equation modeling program, “Why have you then added GFI?” to your LISREL program, Joreskog replied “Well, users threaten us saying they would stop using LISREL if it always produces such large chi-squares. So we had to invent something to make people happy. GFI serves that purpose.”<ref name="Sorbom2001"/> The χ2 evidence of model-data inconsistency was too statistically solid to be dislodged or discarded, but people could at least be provided a way to distract from the “disturbing” evidence. Career-profits can still be accrued by developing additional indices, reporting investigations of index behavior, and publishing models intentionally burying evidence of model-data inconsistency under an MDI (a mound of distracting indices). There seems no general justification for why a researcher should “accept” a causally wrong model, rather than attempting to correct detected misspecifications. And some portions of the literature seem not to have noticed that “accepting a model” (on the basis of “satisfying” an index value) suffers from an intensified version of the criticism applied to “acceptance” of a null-hypothesis. Introductory statistics texts usually recommend replacing the term “accept” with “failed to reject the null hypothesis” to acknowledge the possibility of Type II error. A Type III error arises from “accepting” a model hypothesis when the current data are sufficient to reject the model.

The considerations relevant to using fit indices include checking:
# whether data concerns have been addressed (to ensure data mistakes are not driving model-data inconsistency);
# whether criterion values for the index have been investigated for models structured like the researcher’s model (e.g. index criteria based on factor structured models are only appropriate if the researcher’s model actually is factor structured);
# whether the kinds of potential misspecifications in the current model correspond to the kinds of misspecifications on which the index criteria are based (e.g. criteria based on simulation of omitted factor loadings may not be appropriate for misspecification resulting from failure to include appropriate control variables);
# whether the researcher knowingly agrees to disregard evidence pointing to the kinds of misspecifications on which the index criteria were based. (If the index criterion is based on simulating a missing factor loading or two, using that criterion acknowledges the researcher’s willingness to accept a model missing a factor loading or two.);
# whether the latest, not outdated, index criteria are being used (because the criteria for some indices tightened over time);
# whether satisfying criterion values on pairs of indices are required (e.g. Hu and Bentler<ref name="HB99">Hu, L.; Bentler,P.M. (1999) “Cutoff criteria for fit indices in covariance structure analysis: Conventional criteria versus new alternatives.” Structural Equation Modeling. 6: 1-55. </ref> report that some common indices function inappropriately unless they are assessed together.);
# whether a model test is, or is not, available. (A χ2 value, degrees of freedom, and probability will be available for models reporting indices based on χ2.)
# and whether the researcher has considered both alpha (Type I) and beta (Type II) errors in making their index-based decisions (E.g. if the model is significantly data-inconsistent, the “tolerable” amount of inconsistency is likely to differ across medical, business, social and psychological contexts.).


The considerations relevant to assessing fit include checking:
1) whether data concerns have been addressed (To ensure data mistakes are not driving model-data inconsistency.);
2) whether criterion values for the index have been investigated for models structured like the researcher’s model (Index criteria based on factor structured models are only appropriate if the researcher’s model actually is factor structured.);
3) whether the kinds of potential misspecifications in the current model correspond to the kinds of misspecifications on which the index criteria are based (Criteria based on simulation of omitted factor loadings may not be appropriate for misspecifications resulting from failure to include appropriate control variables.);
4) whether the researcher knowingly agrees to disregard evidence pointing to the kinds of misspecifications on which the index criteria were based. (If the index criterion is based on missing a factor loading or two, using that criterion acknowledges the researcher’s willingness to accept a model missing a factor loading or two.);
5) whether the latest, not outdated, index criteria are being used (The criteria for the indices are subject to debate{{sfn|MacCallum|Austin|2000|p=218-219}} and have tightened over time.<ref name="HB99"/>);
6) whether satisfying criterion values on pairs of indices are required (Hu and Bentler<ref name="HB99">Hu, L.; Bentler,P.M. (1999) “Cutoff criteria for fit indices in covariance structure analysis: Conventional criteria versus new alternatives.” Structural Equation Modeling. 6: 1-55. </ref> report that some common indices function inappropriately unless they are assessed together.);
7) whether a model test is, or is not, available. (A χ2 value, degrees of freedom, and probability will be available for models reporting indices based on χ2.)
8) and whether the researcher has considered both alpha (Type I) and beta (Type II) errors in making their index-based decisions (The “tolerable” amount of data-inconsistency is likely to be lower in medical contexts.).


Some of the more commonly used fit statistics include Some of the more commonly used fit statistics include
Line 99: Line 117:
** A fundamental test of fit used in the calculation of many other fit measures. It is a function of the discrepancy between the observed covariance matrix and the model-implied covariance matrix. Chi-square increases with sample size only if the model is detectably misspecified.<ref name="Hayduk14b"> Hayduk, L.A. (2014b) “Shame for disrespecting evidence: The personal consequences of insufficient respect for structural equation model testing.” BMC Medical Research Methodology, 14 (124): 1-10 DOI 10.1186/1471-2288-14-124 http://www.biomedcentral.com/1471-2288/14/124 </ref> ** A fundamental test of fit used in the calculation of many other fit measures. It is a function of the discrepancy between the observed covariance matrix and the model-implied covariance matrix. Chi-square increases with sample size only if the model is detectably misspecified.<ref name="Hayduk14b"> Hayduk, L.A. (2014b) “Shame for disrespecting evidence: The personal consequences of insufficient respect for structural equation model testing.” BMC Medical Research Methodology, 14 (124): 1-10 DOI 10.1186/1471-2288-14-124 http://www.biomedcentral.com/1471-2288/14/124 </ref>
* Akaike information criterion (AIC) * Akaike information criterion (AIC)
** A test of relative model fit: The preferred model is the one with the lowest AIC value. ** An index of relative model fit: The preferred model is the one with the lowest AIC value.
** <math>\mathit{AIC} = 2k - 2\ln(L)\,</math> ** <math>\mathit{AIC} = 2k - 2\ln(L)\,</math>
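** where ''k'' is the number of parameters in the statistical model, and ''L'' is the maximized value of the likelihood function of the model. ** where ''k'' is the number of parameters in the statistical model, and ''L'' is the maximized value of the likelihood function of the model.
* Root Mean Square Error of Approximation (RMSEA) * Root Mean Square Error of Approximation (RMSEA)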
**Fit index where a value of zero indicates the best fit.{{sfn|Kline|2011|p=205}} While the guideline for determining a "close fit" using RMSEA is highly contested,{{sfn|Kline|2011|p=206}} most researchers concur that an RMSEA of .1 or more indicates poor fit.{{sfn|Hu|Bentler|1999|p=11}}<ref name="Browne1993" /> **Fit index where a value of zero indicates the best fit.{{sfn|Kline|2011|p=205}} Guidelines for determining a "close fit" using RMSEA are highly contested.{{sfn|Kline|2011|p=206}}
* Standardized Root Mean Squared Residual (SRMR) * Standardized Root Mean Squared Residual (SRMR)
** The SRMR is a popular absolute fit indicator. Hu and Bentler (1999) suggested .08 or smaller as a guideline for good fit.{{sfn|Hu|Bentler|1999|p=27}} Kline (2011) suggested .1 or smaller as a guideline for good fit. ** The SRMR is a popular absolute fit indicator. Hu and Bentler (1999) suggested .08 or smaller as a guideline for good fit.{{sfn|Hu|Bentler|1999|p=27}}
* Comparative Fit Index (CFI) * Comparative Fit Index (CFI)
**In examining baseline comparisons, the CFI depends in large part on the average size of the correlations in the data. If the average correlation between variables is not high, then the CFI will not be very high. A CFI value of .95 or higher is desirable.{{sfn|Hu|Bentler|1999|p=27}} **In examining baseline comparisons, the CFI depends in large part on the average size of the correlations in the data. If the average correlation between variables is not high, then the CFI will not be very high. A CFI value of .95 or higher is desirable.{{sfn|Hu|Bentler|1999|p=27}}
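Several of the statistics listed above can be computed directly from a model's χ2, degrees of freedom, and sample size. The following is a minimal sketch using commonly published formulas, which are assumptions of this example: only the AIC formula appears in the text above, and SEM programs differ in details such as using N or N − 1 in the RMSEA. The function names and all input numbers are hypothetical.

```python
import math

def aic(k, log_likelihood):
    """AIC = 2k - 2 ln(L), with k free parameters and ln(L) the maximized log-likelihood."""
    return 2 * k - 2 * log_likelihood

def rmsea(chi2, df, n):
    """Root Mean Square Error of Approximation; zero when chi2 <= df."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def cfi(chi2_model, df_model, chi2_baseline, df_baseline):
    """Comparative Fit Index relative to a baseline (independence) model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_baseline - df_baseline, d_model)
    return 1.0 if d_base == 0 else 1.0 - d_model / d_base

# Hypothetical results for one model estimated from N = 300 cases.
chi2_m, df_m, n = 85.3, 40, 300
chi2_b, df_b = 900.0, 55  # hypothetical baseline model

rmsea_value = rmsea(chi2_m, df_m, n)
cfi_value = cfi(chi2_m, df_m, chi2_b, df_b)
print(f"RMSEA = {rmsea_value:.3f}, CFI = {cfi_value:.3f}")

# AIC compares models fit to the same data: lower is preferred.
aic_small = aic(k=10, log_likelihood=-1200.0)
aic_large = aic(k=12, log_likelihood=-1198.5)
print(f"AIC (10 parameters) = {aic_small}, AIC (12 parameters) = {aic_large}")
```

In this made-up example the index values could look tolerable against some published cutoffs even though a χ2 of 85.3 on 40 degrees of freedom is significant, which is the situation where the surrounding text urges reporting and attending to the model test.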


For each measure of fit, a decision as to what represents a good-enough fit between the model and the data reflects the researcher's modeling objective (perhaps challenging someone else's model, or improving measurement); whether or not the model is to be claimed as having been tested; and whether the researcher is comfortable "disregarding" evidence of the index-documented degree of ill fit.<ref name="Hayduk14b" /> The following table provides references documenting these, and other, features for some common indices: the RMSEA (Root Mean Square Error of Approximation), SRMR (Standardized Root Mean Squared Residual), CFI (Comparative Fit Index), and the TLI (the Tucker-Lewis Index). Additional indices such as the AIC (Akaike Information Criterion) can be found in most SEM introductions.<ref name="Kline16"/> For each measure of fit, a decision as to what represents a good-enough fit between the model and the data reflects the researcher's modeling objective (perhaps challenging someone else's model, or improving measurement); whether or not the model is to be claimed as having been "tested"; and whether the researcher is comfortable "disregarding" evidence of the index-documented degree of ill fit.<ref name="Hayduk14b" />
{| class="wikitable" {| class="wikitable"
|+ '''Features of Fit Indices''' |+ '''Features of Fit Indices'''
Line 170: Line 188:
|<ref name="Barrett07"/> |<ref name="Barrett07"/>
|} |}
<ref name="Barrett07"> Barrett, P. (2007). “Structural equation modeling: Adjudging model fit.” Personality and Individual Differences. 42 (5): 815-824. </ref>
<ref name="BC92"> Browne, M.W.; Cudeck, R. (1992) “Alternate ways of assessing model fit.” Sociological Methods and Research. 21(2): 230-258.</ref> <ref name="BC92"> Browne, M.W.; Cudeck, R. (1992) “Alternate ways of assessing model fit.” Sociological Methods and Research. 21(2): 230-258.</ref>
<ref name="S90">Steiger, J. H. (1990) “Structural Model Evaluation and Modification: An Interval Estimation Approach”. Multivariate Behavioral Research 25:173-180.</ref> <ref name="S90">Steiger, J. H. (1990) “Structural Model Evaluation and Modification: An Interval Estimation Approach”. Multivariate Behavioral Research 25:173-180.</ref>

Revision as of 22:33, 3 July 2023

Form of causal modeling that fits networks of constructs to data.
For the journal, see Structural Equation Modeling (journal).
Figure 1. An example structural equation model after estimation. Latent variables are sometimes indicated with ovals while observed variables are shown in rectangles. Residuals and variances are sometimes drawn as double-headed arrows (shown here) or single arrows and a circle (as in Figure 2). The latent IQ variance is fixed at 1 to provide scale to the model. Figure 1 depicts measurement errors influencing each indicator of latent intelligence and each indicator of latent achievement. Neither the indicators nor the measurement errors of the indicators are modeled as influencing the latent variables.
Figure 2. An example structural equation model before estimation. Similar to Figure 1 but without standardized values and fewer items. Because intelligence and academic performance are merely imagined or theory-postulated variables, their precise scale values are unknown, though the model specifies that each latent variable’s values must fall somewhere along the observable scale possessed by one of the indicators. The 1.0 effect connecting a latent to an indicator specifies that each real unit increase or decrease in the latent variable’s value results in a corresponding unit increase or decrease in the indicator’s value. It is hoped a good indicator has been chosen for each latent, but the 1.0 values do not signal perfect measurement because this model also postulates that there are other unspecified entities causally impacting the observed indicator measurements, thereby introducing measurement error. This model postulates that separate measurement errors influence each of the two indicators of latent intelligence, and each indicator of latent achievement. The unlabeled arrow pointing to academic performance acknowledges that things other than intelligence can also influence academic performance.

Structural equation modeling (SEM) is a label for a diverse set of methods used by scientists doing both observational and experimental research. SEM is used mostly in the social and behavioral sciences but it is also used in epidemiology, business, and other fields. A definition of SEM is difficult without reference to technical language, but a good starting place is the name itself.

SEM involves a model representing how various aspects of some phenomenon are thought to causally connect to one another. Structural equation models often contain postulated causal connections among some latent variables (variables thought to exist but which can’t be directly observed). Additional causal connections link those latent variables to observed variables whose values appear in a data set. The causal connections are represented using equations but the postulated structuring can also be presented using diagrams containing arrows as in Figures 1 and 2. The causal structures imply that specific patterns should appear among the values of the observed variables. This makes it possible to use the connections between the observed variables’ values to estimate the magnitudes of the postulated effects, and to test whether or not the observed data are consistent with the requirements of the hypothesized causal structures.

The boundary between what is and is not a structural equation model is not always clear but SE models often contain postulated causal connections among a set of latent variables (variables thought to exist but which can’t be directly observed, like an attitude, intelligence or mental illness) and causal connections linking the postulated latent variables to variables that can be observed and whose values are available in some data set. Variations among the styles of latent causal connections, variations among the observed variables measuring the latent variables, and variations in the statistical estimation strategies result in the SEM toolkit including confirmatory factor analysis, confirmatory composite analysis, path analysis, multi-group modeling, longitudinal modeling, partial least squares path modeling, latent growth modeling and hierarchical or multilevel modeling.

SEM researchers use computer programs to estimate the strength and sign of the coefficients corresponding to the modeled structural connections, for example the numbers connected to the arrows in Figure 1. Because a postulated model such as Figure 1 may not correspond to the worldly forces controlling the observed data measurements, the programs also provide model tests and diagnostic clues suggesting which indicators, or which model components, might introduce inconsistency between the model and observed data. Criticisms of SEM methods hint at: disregard of available model tests, problems in the model’s specification, a tendency to accept models without considering external validity, and potential philosophical biases.

A great advantage of SEM is that all of these measurements and tests occur simultaneously in one statistical estimation procedure, where all the model coefficients are calculated using all information from the observed variables. This means the estimates are more accurate than if a researcher were to calculate each part of the model separately.

History

Structural equation modeling (SEM) began differentiating itself from correlation and regression when Sewall Wright provided explicit causal interpretations for a set of regression-style equations based on a solid understanding of the physical and physiological mechanisms producing direct and indirect effects among his observed variables. The equations were estimated like ordinary regression equations but the substantive context for the measured variables permitted clear causal, not merely predictive, understandings. O. D. Duncan introduced SEM to the social sciences in his 1975 book and SEM blossomed in the late 1970’s and 1980’s when increasing computing power permitted practical model estimation. In 1987 Hayduk provided the first book-length introduction to structural equation modeling with latent variables, and this was soon followed by Bollen’s popular text (1989).

Different yet mathematically related modeling approaches developed in psychology, sociology, and economics. Early Cowles Commission work on simultaneous equations estimation centered on Koopmans and Hood's (1953) algorithms from transport economics and optimal routing, with maximum likelihood estimation, and closed form algebraic calculations, as iterative solution search techniques were limited in the days before computers. The convergence of two of these developmental streams (factor analysis from psychology, and path analysis from sociology via Duncan) produced the current core of SEM. One of several programs Karl Jöreskog developed at Educational Testing Service, LISREL embedded latent variables (which psychologists knew as the latent factors from factor analysis) within path-analysis-style equations (which sociologists inherited from Wright and Duncan). The factor-structured portion of the model incorporated measurement errors which permitted measurement-error-adjustment, though not necessarily error-free estimation, of effects connecting different postulated latent variables.

Traces of the historical convergence of the factor analytic and path analytic traditions persist as the distinction between the measurement and structural portions of models; and as continuing disagreements over model testing, and whether measurement should precede or accompany structural estimates. Viewing factor analysis as a data-reduction technique deemphasizes testing, which contrasts with path analytic appreciation for testing postulated causal connections – where the test result might signal model misspecification. The friction between factor analytic and path analytic traditions continues to surface in the literature and on SEMNET – a free listserv circulating SEM postings to more than 3,000 registrants, supported by a University of Alabama listserv. The SEMNET archive can be searched for user-specified topics, and many of the 2023 updates to this Misplaced Pages page were previously circulated on SEMNET for knowledgeable review and comment.

Wright's path analysis influenced Hermann Wold, Wold’s student Karl Jöreskog, and Jöreskog’s student Claes Fornell, but SEM never gained a large following among U.S. econometricians, possibly due to fundamental differences in modeling objectives and typical data structures. The prolonged separation of SEM’s economic branch led to procedural and terminological differences, though deep mathematical and statistical connections remain. The economic version of SEM can be seen in SEMNET discussions of endogeneity, and in the heat produced as Judea Pearl’s approach to causality via directed acyclic graphs (DAG’s) rubs against economic approaches to modeling. Discussions comparing and contrasting various SEM approaches are available but disciplinary differences in data structures and the concerns motivating economic models make reunion unlikely. Pearl extended SEM from linear to nonparametric models, and proposed causal and counterfactual interpretations of the equations. Nonparametric SEMs permit estimating total, direct and indirect effects without making any commitment to linearity of effects or assumptions about the distributions of the error terms.

SEM analyses are popular in the social sciences because computer programs make it possible to estimate complicated causal structures, but the complexity of the models introduces substantial variability in the quality of the results. Some, but not all, results are obtained without the "inconvenience" of understanding experimental design, statistical control, the consequences of sample size, and other features contributing to good research design.


General SEM Steps and Considerations

The following considerations apply to the construction and assessment of many structural equation models.

Model Specification

Building or specifying a model requires attending to:

  • the set of variables to be employed,
  • what is known about the variables,
  • what is presumed or hypothesized about the variables’ causal connections and disconnections,
  • what the researcher seeks to learn from the modeling,
  • and the cases for which values of the variables will be available (kids? workers? companies? countries? cells? accidents? cults?).

Structural equation models attempt to mirror the worldly forces operative for causally homogeneous cases – namely cases enmeshed in the same worldly causal structures but whose values on the causes differ and who therefore possess different values on the outcome variables. Causal homogeneity can be facilitated by case selection, or by segregating cases in a multi-group model. A model’s specification is not complete until the researcher specifies:

  • which effects and/or correlations/covariances are to be included and estimated,
  • which effects and other coefficients are forbidden or presumed unnecessary,
  • and which coefficients will be given fixed/unchanging values (e.g. to provide measurement scales for latent variables as in Figure 2).

The latent level of a model is composed of endogenous and exogenous variables. The endogenous latent variables are the true-score variables postulated as receiving effects from at least one other modeled variable. Each endogenous variable is modeled as the dependent variable in a regression-style equation. The exogenous latent variables are background variables postulated as causing one or more of the endogenous variables and are modeled like the predictor variables in regression-style equations. Causal connections among the exogenous variables are not explicitly modeled but are usually acknowledged by modeling the exogenous variables as freely correlating with one another. The model may include intervening variables – variables receiving effects from some variables but also sending effects to other variables. As in regression, each endogenous variable is assigned a residual or error variable encapsulating the effects of unavailable and usually unknown causes. Each latent variable, whether exogenous or endogenous, is thought of as containing the cases’ true-scores on that variable, and these true-scores causally contribute valid/genuine variations into one or more of the observed/reported indicator variables.

The LISREL program assigned Greek names to the elements in a set of matrices to keep track of the various model components. These names became relatively standard notation, though the notation has been extended and altered to accommodate a variety of statistical considerations. Texts and programs “simplifying” model specification via diagrams or by using equations permitting user-selected variable names, re-convert the user’s model into some standard matrix-algebra form in the background. The “simplifications” are achieved by implicitly introducing default program “assumptions” about model features with which users supposedly need not concern themselves. Unfortunately, these default assumptions easily obscure model components that leave unrecognized issues lurking within the model’s structure, and underlying matrices.
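For orientation, the Greek-letter matrix notation is usually summarized in three equations. The block below is a sketch of the conventional textbook presentation, not something stated in this article, and the symbol names are the customary LISREL ones.

```latex
\begin{aligned}
\eta &= B\eta + \Gamma\xi + \zeta  && \text{(structural equations among the latent variables)} \\
y    &= \Lambda_y \eta + \varepsilon && \text{(indicators of the endogenous latent variables)} \\
x    &= \Lambda_x \xi + \delta       && \text{(indicators of the exogenous latent variables)}
\end{aligned}
```

Here η and ξ hold the endogenous and exogenous latent variables, ζ holds the residuals of the endogenous latents, and ε and δ hold the measurement errors. In this notation a fixed 1.0 in Λ supplies a latent variable's scale, and a fixed 0.0 in B or Γ asserts the absence of a direct effect.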

Two main components of models are distinguished in SEM: the structural model showing potential causal dependencies between endogenous and exogenous latent variables, and the measurement model showing the causal connections between the latent variables and the indicators. Exploratory and confirmatory factor analysis models, for example, focus on the causal measurement connections, while path models more closely correspond to SEM's latent structural connections.

Modelers specify each coefficient in a model as being free to be estimated, or fixed at some value. The free coefficients may be postulated effects the researcher wishes to test, background correlations among the exogenous variables, or the variances of the residual or error variables providing additional variations in the endogenous latent variables. The fixed coefficients may be values like the 1.0 values in Figure 2 that provide scales for the latent variables, or values of 0.0 which assert causal disconnections such as the assertion of no-direct-effects (no arrows) pointing from Academic Achievement to any of the four scales in Figure 1. SEM programs provide estimates and tests of the free coefficients, while the fixed coefficients contribute importantly to testing the overall model structure. Various kinds of constraints between coefficients can also be used. The model specification depends on what is known from the literature, the researcher's experience with the modeled indicator variables, and the features being investigated by using the specific model structure.

There is a limit to how many coefficients can be estimated in a model. If there are fewer data points than the number of estimated coefficients, the resulting model is said to be "unidentified" and no coefficient estimates can be obtained. Reciprocal effects, and other causal loops, may also interfere with estimation.
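The counting limit can be sketched in a few lines. The sketch assumes that "data points" means the distinct variances and covariances among the observed variables, which is the usual reading for covariance-based estimation; the function names and the example counts are hypothetical.

```python
def unique_moments(n_observed: int) -> int:
    """Number of distinct variances and covariances among the observed variables."""
    return n_observed * (n_observed + 1) // 2


def degrees_of_freedom(n_observed: int, n_free: int) -> int:
    """Data moments minus free coefficients; a negative value means too many coefficients."""
    return unique_moments(n_observed) - n_free


# Hypothetical model: 4 observed indicators and 9 free coefficients.
print(unique_moments(4), degrees_of_freedom(n_observed=4, n_free=9))
```

A non-negative result is necessary but not sufficient: a model can pass this count and still contain individual coefficients, such as a reciprocal pair of effects, that the data cannot pin down.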

Estimation of Free Model Coefficients

Model coefficients fixed at zero, 1.0, or other values, do not require estimation because they already have specified values. Estimated values for free model coefficients are obtained by maximizing fit to, or minimizing difference from, the data relative to what the data’s features would be if the free model coefficients took on the estimated values. The model’s implications for what the data should look like for a specific set of coefficient values depends on: a) the coefficients’ locations in the model (e.g. which variables are connected/disconnected), b) the nature of the connections between the variables (covariances or effects; with effects often assumed to be linear), c) the nature of the error or residual variables (often assumed to be independent of, or causally-disconnected from, many variables), and d) the measurement scales appropriate for the variables (interval level measurement is often assumed).

A stronger effect connecting two latent variables implies the indicators of those latents should be more strongly correlated. Hence, a reasonable estimate of a latent’s effect will be whatever value best matches the correlations between the indicators of the corresponding latent variables – namely the estimate-value maximizing the match with the data, or minimizing the differences from the data. With maximum likelihood estimation, the numerical values of all the free model coefficients are individually adjusted (progressively increased or decreased from initial start values) until they maximize the likelihood of observing the sample data – whether the data are the variables’ covariances/correlations, or the cases’ actual values on the indicator variables. Ordinary least squares estimates are the coefficient values that minimize the squared differences between the data and what the data would look like if the model was correctly specified, namely if all the model’s estimated features correspond to real worldly features.
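The quantities being maximized or minimized can be written out explicitly. The block below gives the discrepancy functions as they are commonly presented in SEM texts, not as stated in this article; the least squares version shown is the unweighted form, and the symbols S, Σ(θ), θ and p are introduced here for illustration.

```latex
F_{\mathrm{ML}}(\theta) = \ln\lvert\Sigma(\theta)\rvert
  + \operatorname{tr}\!\left(S\,\Sigma(\theta)^{-1}\right)
  - \ln\lvert S\rvert - p
\qquad
F_{\mathrm{ULS}}(\theta) = \tfrac{1}{2}\operatorname{tr}\!\left[\left(S - \Sigma(\theta)\right)^{2}\right]
```

S is the observed covariance matrix, Σ(θ) is the covariance matrix implied by the model when the free coefficients take the values θ, and p is the number of observed variables. Both functions equal zero when the model-implied matrix reproduces the data exactly and grow as the two matrices diverge.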

The appropriate statistical feature to maximize or minimize to obtain estimates depends on the variables’ levels of measurement (estimation is generally easier with interval level measurements than with nominal or ordinal measures), and where a specific variable appears in the model (e.g. endogenous dichotomous variables create more estimation difficulties than exogenous dichotomous variables). Most SEM programs provide several options for what is to be maximized or minimized to obtain estimates of the model’s coefficients. The choices often include maximum likelihood estimation (MLE), full information maximum likelihood (FIML), ordinary least squares (OLS), weighted least squares (WLS), diagonally weighted least squares (DWLS), and two stage least squares.

One common problem is that a coefficient’s estimated value may be underidentified because it is insufficiently constrained by the model and data. No unique best-estimate exists unless the model and data together sufficiently constrain or restrict a coefficient’s value. For example, the magnitude of a single data correlation between two variables is insufficient to provide estimates of a reciprocal pair of modeled effects between those variables. The correlation might be accounted for by one of the reciprocal effects being stronger than the other effect, or the other effect being stronger than the one, or by effects of equal magnitude. Underidentified effect estimates can be rendered identified by introducing additional model and/or data constraints. For example, reciprocal effects can be rendered identified by constraining one effect estimate to be double, triple, or equivalent to, the other effect estimate, but the resultant estimates will only be trustworthy if the additional model constraint corresponds to the world’s structure. Data on a third variable that directly causes only one of a pair of reciprocally causally connected variables can also assist identification. Constraining a third variable to not directly cause one of the reciprocally-causal variables breaks the symmetry otherwise plaguing the reciprocal effect estimates because that third variable must be more strongly correlated with the variable it causes directly than with the variable at the “other” end of the reciprocal which it impacts only indirectly. Notice that this again presumes the properness of the model’s causal specification – namely that there really is a direct effect leading from the third variable to the variable at this end of the reciprocal effects and no direct effect on the variable at the “other end" of the reciprocally connected pair of variables. 
Theoretical demands for null/zero effects provide helpful constraints assisting estimation, though theories often fail to clearly report which effects are allegedly nonexistent.
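The reciprocal-effects example can be checked numerically. The sketch below assumes two standardized variables correlated 0.5, linear effects, and uncorrelated residuals; all names and values are hypothetical. It shows three different pairs of reciprocal effects that each reproduce the same data.

```python
import math

# Hypothetical data: two standardized variables correlated 0.5.
S = [[1.0, 0.5], [0.5, 1.0]]


def matmul(a, b):
    """Multiply two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def transpose(a):
    """Transpose a 2x2 matrix."""
    return [[a[j][i] for j in range(2)] for i in range(2)]


def residual_cov(b12, b21, s=S):
    """Residual covariances Psi = (I - B) S (I - B)' required for the effects to reproduce S."""
    a = [[1.0, -b12], [-b21, 1.0]]
    return matmul(matmul(a, s), transpose(a))


def implied_cov(b12, b21, psi):
    """Covariances implied by y1 = b12*y2 + e1 and y2 = b21*y1 + e2."""
    det = 1.0 - b12 * b21
    inv = [[1.0 / det, b12 / det], [b21 / det, 1.0 / det]]  # inverse of (I - B)
    return matmul(matmul(inv, psi), transpose(inv))


equal = 2.0 - math.sqrt(3.0)  # solves 0.5 * (1 + b*b) = 2*b, giving equal effects
candidates = [(0.5, 0.0), (0.0, 0.5), (equal, equal)]

for b12, b21 in candidates:
    psi = residual_cov(b12, b21)  # off-diagonal is zero, so residuals stay uncorrelated
    sigma = implied_cov(b12, b21, psi)
    print(b12, b21, [[round(v, 6) for v in row] for row in sigma])
```

Each candidate leaves uncorrelated residuals with positive variances, so the single correlation cannot discriminate among them; an additional constraint or a suitably placed third variable is needed.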

Model Assessment

Model assessment depends on the theory, the data, the model, and the estimation strategy. Hence model assessments consider:

  • whether the data contain reasonable measurements of appropriate variables,
  • whether the modeled cases are causally homogeneous, (It makes no sense to estimate one model if the data cases reflect two or more different causal networks.)
  • whether the model appropriately represents the theory or features of interest, (Models are unpersuasive if they omit features required by a theory, or contain coefficients inconsistent with that theory.)
  • whether the estimates are statistically justifiable, (Substantive assessments may be devastated: by violating assumptions, by using an inappropriate estimator, and/or by encountering non-convergence of iterative estimators.)
  • the substantive reasonableness of the estimates, (Negative variances, and correlations above 1.0 or below -1.0, are impossible. Statistically possible estimates that are inconsistent with theory may also challenge theory, and our understanding.)
  • the remaining consistency, or inconsistency, between the model and data. (The estimation process minimizes the differences between the model and data but important and informative differences may remain.)

Research claiming to test or “investigate” a theory requires attending to beyond-chance model-data inconsistency. Estimation adjusts the model’s free coefficients to provide the best possible fit to the data. The output from SEM programs includes a matrix reporting the relationships among the observed variables that would be observed if the estimated model effects actually controlled the observed variables’ values. The “fit” of a model reports match or mismatch between the model-implied relationships (often covariances) and the corresponding observed relationships among the variables. Large and significant differences between the data and the model’s implications signal problems. The probability accompanying a χ2 (chi-square) test is the probability that the data could arise by random sampling variations if the estimated model constituted the real underlying population forces. A small χ2 probability reports it would be unlikely for the current data to have arisen if the modeled structure constituted the real population causal forces – with the remaining differences attributed to random sampling variations.
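As a rough illustration of the χ2 probability, the sketch below converts a minimized discrepancy value into a test statistic and an upper-tail probability. It assumes the common (N − 1) × F scaling (some programs use N instead) and, to stay within the standard library, uses a closed form that is valid only for even degrees of freedom. The function names and input values are hypothetical; a statistics library would be used in practice.

```python
import math


def model_chi_square(f_min: float, n_cases: int) -> float:
    """Test statistic from the minimized discrepancy value: (N - 1) * F_min."""
    return (n_cases - 1) * f_min


def chi_square_p_value_even_df(chi2: float, df: int) -> float:
    """Upper-tail chi-square probability; this closed form holds only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch handles positive even df only")
    half = chi2 / 2.0
    terms = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * terms


chi2 = model_chi_square(f_min=0.06, n_cases=201)  # hypothetical values
print(chi2, chi_square_p_value_even_df(chi2, df=4))
```

A small probability here means that data this far from the model's implications would rarely arise from sampling variation alone if the model were correct.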

If a model remains inconsistent with the data despite selecting optimal coefficient estimates, an honest research response reports and attends to this evidence (often a significant model χ2 test). Beyond-chance model-data inconsistency challenges both the coefficient estimates and the model’s capacity for adjudicating the model’s structure, irrespective of whether the inconsistency originates in problematic data, inappropriate statistical estimation, or incorrect model specification. Coefficient estimates in data-inconsistent (“failing”) models are interpretable, as reports of how the world would appear to someone believing a model that conflicts with the available data. The estimates in data-inconsistent models do not necessarily become “obviously wrong” by becoming statistically strange, or wrongly signed according to theory. The estimates may even closely match a theory’s requirements but the remaining data inconsistency renders the match between the estimates and theory unable to provide succor. Failing models remain interpretable, but only as interpretations that conflict with available evidence.

A modification index is an estimate of how much a model’s fit to the data would “improve” (but not necessarily how much the model’s structure would improve) if a specific currently-fixed model coefficient were freed for estimation. Researchers confronting data-inconsistent models can easily free coefficients the modification indices report as likely to produce substantial improvements in fit. This simultaneously introduces a substantial risk of moving from a causally-wrong-and-failing model to a causally-wrong-but-fitting model because improved data-fit does not provide assurance that the freed coefficients are substantively reasonable or world matching. The original model may contain causal misspecifications such as incorrectly directed effects, or incorrect assumptions about unavailable variables, and such problems cannot be corrected by adding coefficients to the current model. Consequently, such models remain misspecified despite the closer fit provided by additional coefficients. Fitting yet worldly-inconsistent models are especially likely to arise if a researcher committed to a particular model (for example a factor model having a desired number of factors) gets an initially-failing model to fit by inserting measurement error covariances “suggested” by modification indices.

“Accepting” failing models as “close enough” is also not a reasonable alternative. A cautionary instance was provided by Browne, MacCallum, Kim, Andersen, and Glaser who addressed the mathematics behind why the χ2 test can have (though it does not always have) considerable power to detect model misspecification. The probability accompanying a χ2 test is the probability that the data could arise by random sampling variations if the current model, with its optimal estimates, constituted the real underlying population forces. A small χ2 probability reports it would be unlikely for the current data to have arisen if the current model structure constituted the real population causal forces – with the remaining differences attributed to random sampling variations. Browne, MacCallum, Kim, Andersen, and Glaser presented a factor model they viewed as acceptable despite the model being significantly inconsistent with their data according to χ2. The fallaciousness of their claim that close-fit should be treated as good enough was demonstrated by Hayduk, Pazderka-Robinson, Cummings, Levers and Beres who demonstrated a fitting model for Browne, et al.’s own data by incorporating an experimental feature Browne, et al. overlooked. The fault was not in the math of the indices or in the over-sensitivity of χ2 testing. The fault was in Browne, MacCallum, and the other authors forgetting, neglecting, or overlooking, that the amount of ill fit cannot be trusted to correspond to the nature, location, or seriousness of problems in a model’s specification.

Many researchers tried to justify switching to fit-indices, rather than testing their models, by claiming that χ2 increases (and hence χ2’s probability decreases) with increasing sample size (N). There are two mistakes in discounting χ2 on this basis. First, for proper models, χ2 does not increase with increasing N, so if χ2 increases with N that itself is a sign that something is detectably problematic. And second, for models that are detectably misspecified, χ2’s increase with N provides the good-news of increasing statistical power to detect model misspecification (namely power to avoid Type II error). Some kinds of important misspecifications cannot be detected by χ2, so any amount of ill fit beyond what might be reasonably produced by random variations warrants report and consideration. The χ2 model test, possibly adjusted, is the strongest available structural equation model test.
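The two points about sample size can be illustrated with the usual noncentral chi-square approximation, under which the expected test statistic is roughly the degrees of freedom plus (N − 1) times the population misfit. This is a sketch under that assumption; the function name and all numbers are hypothetical.

```python
def expected_chi_square(df: int, population_misfit: float, n_cases: int) -> float:
    """Approximate expected chi-square: df plus the noncentrality (N - 1) * F0."""
    return df + (n_cases - 1) * population_misfit


for n in (100, 400, 1600):
    proper = expected_chi_square(df=10, population_misfit=0.0, n_cases=n)
    misspecified = expected_chi_square(df=10, population_misfit=0.05, n_cases=n)
    print(n, proper, misspecified)
```

With zero population misfit the expected value stays at the degrees of freedom whatever the sample size, whereas any nonzero misfit makes it climb with N, which is the increase in power described above.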

Numerous fit indices quantify how closely a model fits the data but all fit indices suffer from the logical difficulty that the size or amount of ill fit is not trustably coordinated with the severity or nature of the issues producing the data inconsistency. Models with different causal structures which fit the data identically well, have been called equivalent models. Such models are data-fit-equivalent though not causally equivalent, so at least one of the so-called equivalent models must be inconsistent with the world’s structure. If there is a perfect 1.0 correlation between X and Y and we model this as X causes Y, there will be perfect fit and zero residual error. But the model may not match the world because Y may actually cause X, or both X and Y may be responding to a common cause Z, or the world may contain a mixture of these effects (e.g. like a common cause plus an effect of Y on X), or other causal structures. The perfect fit does not tell us the model’s structure corresponds to the world’s structure, and this in turn implies that getting closer to perfect fit does not necessarily correspond to getting closer to the world’s structure – maybe it does, maybe it doesn’t. This makes it incorrect for a researcher to claim that even perfect model fit implies the model is correctly causally specified. For even moderately complex models, precisely equivalently-fitting models are rare. Models almost-fitting the data, according to any index, unavoidably introduce additional potentially-important yet unknown model misspecifications. These models constitute a greater research impediment.
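The point about equivalent models can be made concrete with two variables. The sketch below assumes linear effects and hypothetical observed moments; it fits "X causes Y" and "Y causes X" to the same variances and covariance and shows that both reproduce the data exactly.

```python
def implied_moments(var_cause: float, slope: float, resid_var: float):
    """Moments implied by effect = slope * cause + residual.

    Returns (variance of cause, variance of effect, covariance).
    """
    var_effect = slope ** 2 * var_cause + resid_var
    cov = slope * var_cause
    return var_cause, var_effect, cov


# Hypothetical observed moments.
var_x, var_y, cov_xy = 4.0, 9.0, 3.0

# Model A: X causes Y.
slope_a = cov_xy / var_x
resid_a = var_y - slope_a ** 2 * var_x
vx_a, vy_a, cxy_a = implied_moments(var_x, slope_a, resid_a)

# Model B: Y causes X.
slope_b = cov_xy / var_y
resid_b = var_x - slope_b ** 2 * var_y
vy_b, vx_b, cxy_b = implied_moments(var_y, slope_b, resid_b)

print((vx_a, vy_a, cxy_a), (vx_b, vy_b, cxy_b))
```

Both directions match the observed moments, so fit alone cannot say which causal direction, if either, corresponds to the world.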

This logical weakness renders all fit indices “unhelpful” whenever a structural equation model is significantly inconsistent with the data, but several forces continue to propagate fit-index use. For example, Dag Sörbom reported that when someone asked Karl Jöreskog, the developer of the first structural equation modeling program, “Why have you then added GFI?” to the LISREL program, Jöreskog replied “Well, users threaten us saying they would stop using LISREL if it always produces such large chi-squares. So we had to invent something to make people happy. GFI serves that purpose.” The χ2 evidence of model-data inconsistency was too statistically solid to be dislodged or discarded, but people could at least be provided a way to distract from the “disturbing” evidence. Career-profits can still be accrued by developing additional indices, reporting investigations of index behavior, and publishing models intentionally burying evidence of model-data inconsistency under an MDI (a mound of distracting indices). There seems no general justification for why a researcher should “accept” a causally wrong model, rather than attempting to correct detected misspecifications. And some portions of the literature seem not to have noticed that “accepting a model” (on the basis of “satisfying” an index value) suffers from an intensified version of the criticism applied to “acceptance” of a null-hypothesis. Introductory statistics texts usually recommend replacing the term “accept” with “failed to reject the null hypothesis” to acknowledge the possibility of Type II error. A Type III error arises from “accepting” a model hypothesis when the current data are sufficient to reject the model.

The considerations relevant to using fit indices include checking:

  1. whether data concerns have been addressed (to ensure data mistakes are not driving model-data inconsistency);
  2. whether criterion values for the index have been investigated for models structured like the researcher’s model (e.g. index criteria based on factor structured models are only appropriate if the researcher’s model actually is factor structured);
  3. whether the kinds of potential misspecifications in the current model correspond to the kinds of misspecifications on which the index criteria are based (e.g. criteria based on simulation of omitted factor loadings may not be appropriate for misspecification resulting from failure to include appropriate control variables);
  4. whether the researcher knowingly agrees to disregard evidence pointing to the kinds of misspecifications on which the index criteria were based (if the index criterion is based on simulating a missing factor loading or two, using that criterion acknowledges the researcher’s willingness to accept a model missing a factor loading or two);
  5. whether the latest, not outdated, index criteria are being used (because the criteria for some indices tightened over time);
  6. whether satisfying criterion values on pairs of indices are required (e.g. Hu and Bentler report that some common indices function inappropriately unless they are assessed together);
  7. whether a model test is, or is not, available (a χ² value, degrees of freedom, and probability will be available for models reporting indices based on χ²);
  8. and whether the researcher has considered both alpha (Type I) and beta (Type II) errors in making their index-based decisions (e.g. if the model is significantly data-inconsistent, the “tolerable” amount of inconsistency is likely to differ across medical, business, social and psychological contexts).
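
Item 7 in the list above refers to the χ² value, degrees of freedom, and probability that accompany a model test. As a rough illustration of how the probability follows from the other two numbers, the sketch below computes the upper-tail probability of a χ² distribution for integer degrees of freedom using only the Python standard library. The function name `chi_square_p_value` and the example values (χ² = 85.0 with 40 degrees of freedom) are hypothetical and chosen only for illustration; SEM programs report this probability directly.

```python
import math


def chi_square_p_value(chi_square: float, df: int) -> float:
    """Upper-tail probability P(X >= chi_square) for X ~ chi-square(df).

    Uses the recurrence Q(a + 1, x) = Q(a, x) + x**a * exp(-x) / Gamma(a + 1)
    for the regularized upper incomplete gamma function, starting from the
    closed forms for df = 1 (via erfc) and df = 2 (via exp).
    """
    if df < 1:
        raise ValueError("df must be a positive integer")
    if chi_square <= 0:
        return 1.0
    half = chi_square / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-half)             # df = 2 starting point
    else:
        a, q = 0.5, math.erfc(math.sqrt(half))  # df = 1 starting point
    while a < df / 2.0:
        # Work on the log scale so large df values do not overflow.
        q += math.exp(a * math.log(half) - half - math.lgamma(a + 1.0))
        a += 1.0
    return min(q, 1.0)


# Hypothetical model test result, for illustration only.
p = chi_square_p_value(85.0, 40)
print(f"chi-square = 85.0, df = 40, p = {p:.5f}")
```

A small probability here signals beyond-chance model-data inconsistency; what the researcher does with that evidence is the substantive question the list above addresses.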


Some of the more commonly used fit statistics include:

  • Chi-square
    • A fundamental test of fit used in the calculation of many other fit measures. It is a function of the discrepancy between the observed covariance matrix and the model-implied covariance matrix. Chi-square increases with sample size only if the model is detectably misspecified.
  • Akaike information criterion (AIC)
    • An index of relative model fit: The preferred model is the one with the lowest AIC value.
    • $\mathit{AIC} = 2k - 2\ln(L)$
    • where k is the number of parameters in the statistical model, and L is the maximized value of the likelihood of the model.
  • Root Mean Square Error of Approximation (RMSEA)
    • Fit index where a value of zero indicates the best fit. Guidelines for determining a "close fit" using RMSEA are highly contested.
  • Standardized Root Mean Squared Residual (SRMR)
    • The SRMR is a popular absolute fit indicator. Hu and Bentler (1999) suggested .08 or smaller as a guideline for good fit.
  • Comparative Fit Index (CFI)
    • In examining baseline comparisons, the CFI depends in large part on the average size of the correlations in the data. If the average correlation between variables is not high, then the CFI will not be very high. A CFI value of .95 or higher is desirable.
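
Two of the indices listed above can be computed directly from quantities most SEM programs report. The sketch below is a minimal illustration, not any program's implementation: `aic` follows the formula AIC = 2k − 2 ln(L) given above, and `rmsea` uses the common expression sqrt((χ² − d) / (d(N − 1))), where d is the model degrees of freedom and N the sample size. Truncating a negative numerator to zero is an assumed convention, and all numeric inputs are hypothetical.

```python
import math


def aic(k: int, log_likelihood: float) -> float:
    """AIC = 2k - 2 ln(L); log_likelihood is ln(L) at its maximum."""
    return 2 * k - 2 * log_likelihood


def rmsea(chi_square: float, df: int, n: int) -> float:
    """RMSEA = sqrt((chi2 - df) / (df * (N - 1))).

    Assumption: when chi2 < df the numerator is truncated to zero,
    so the index never goes below zero.
    """
    if df <= 0 or n <= 1:
        raise ValueError("df must be positive and n must exceed 1")
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))


# Hypothetical values for two competing models fitted to the same data.
aic_a = aic(k=10, log_likelihood=-1200.0)
aic_b = aic(k=12, log_likelihood=-1199.5)
print("Lower AIC:", "model A" if aic_a < aic_b else "model B")

# Hypothetical model test: chi-square = 85.0, df = 40, N = 301.
print(f"RMSEA = {rmsea(85.0, 40, 301):.4f}")
```

Because the RMSEA is a function of χ², df, and N, it carries no evidence beyond the model test itself; it only re-expresses the amount of ill fit on a different scale.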

The following table provides references documenting these, and other, features for some common indices: the RMSEA (Root Mean Square Error of Approximation), SRMR (Standardized Root Mean Squared Residual), CFI (Comparative Fit Index), and the TLI (the Tucker-Lewis Index). Additional indices such as the AIC (Akaike Information Criterion) can be found in most SEM introductions. For each measure of fit, a decision as to what represents a good-enough fit between the model and the data reflects the researcher's modeling objective (perhaps challenging someone else's model, or improving measurement); whether or not the model is to be claimed as having been "tested"; and whether the researcher is comfortable "disregarding" evidence of the index-documented degree of ill fit.

Features of Fit Indices

| | RMSEA | SRMR | CFI |
|---|---|---|---|
| Index name | Root Mean Square Error of Approximation | Standardized Root Mean Squared Residual | Comparative Fit Index |
| Formula | RMSEA = sqrt((χ² − d) / (d(N − 1))) | | |
| Basic references | | | |
| Factor model: proposed wording for critical values | .06 wording? | | |
| Non-factor model: proposed wording for critical values | | | |
| References proposing revised/changed, disagreements over critical values | | | |
| References indicating two-index or paired-index criteria are required | | | |
| Index based on χ² | Yes | No | Yes |
| References recommending against use of this index | | | |

Model modification

The model may need to be modified in order to more closely match the world's structure and thereby improve the fit. Many programs provide modification indices which report the change in χ² that would result from freeing fixed parameters, usually by adding to the model a coefficient that is currently fixed at zero. Modifications that improve model fit may be flagged as potential changes that can be made to the model. Modifications to a model are changes to the theory claimed to be true. Modifications therefore must make sense in terms of the theory being tested, or be acknowledged as limitations of that theory. Changes to a measurement model are effectively claims that the items/data are impure indicators of the latent variables specified by theory.

A modification index is an estimate of how much a model’s chi-square fit to the data would “improve” (but not necessarily how much the model’s structure would improve) if a specific currently-fixed model coefficient were freed for estimation. Researchers confronting data-inconsistent models can easily free coefficients the modification indices report as likely to produce substantial fit improvements. This simultaneously introduces a substantial risk of moving from a causally-wrong-and-failing model to a causally-wrong-but-fitting model because improved data-fit does not provide assurance that the freed coefficients are substantively reasonable or world matching. The original model may contain causal misspecifications such as incorrectly directed effects, or incorrect assumptions about unavailable variables, which may not be correctable through addition of coefficients to the current model. Consequently, these models remain misspecified despite their closer fit. Fitting yet worldly-inconsistent models are especially likely to arise if a researcher committed to a particular model (for example a factor model having a desired number of factors) converts an initially-failing model into a fitting model by inserting measurement error covariances “suggested” by modification indices. MacCallum (1986) demonstrated that “even under favorable conditions, models arising from specification searches must be viewed with caution.” Model misspecification may sometimes be corrected by insertion of coefficients suggested by the modification indices, but many more corrective possibilities are raised by employing a few indicators of similar-yet-importantly-different latent variables.
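
Because each modification index estimates the χ² change from freeing a single coefficient, it is often read against a χ² distribution with one degree of freedom. The sketch below ranks a set of modification indices and attaches the corresponding one-degree-of-freedom probability. The coefficient labels and index values are hypothetical and do not follow any particular program's output format, and the 3.841 cutoff is simply the approximate 5% critical value for one degree of freedom. In keeping with the cautions above, a large index only nominates a coefficient for consideration; it says nothing about whether freeing it is substantively reasonable.

```python
import math
from typing import Dict, List, Tuple

CRITICAL_CHI_SQUARE_1DF = 3.841  # approximate 5% critical value, 1 df


def one_df_p_value(chi_square_change: float) -> float:
    """P(X >= change) for chi-square with 1 df, i.e. erfc(sqrt(x / 2))."""
    if chi_square_change <= 0:
        return 1.0
    return math.erfc(math.sqrt(chi_square_change / 2.0))


def rank_modification_indices(
    indices: Dict[str, float],
) -> List[Tuple[str, float, float]]:
    """Sort fixed coefficients by estimated chi-square drop, largest first."""
    ranked = sorted(indices.items(), key=lambda item: item[1], reverse=True)
    return [(name, mi, one_df_p_value(mi)) for name, mi in ranked]


# Hypothetical modification indices, for illustration only.
modification_indices = {
    "error covariance: item1 with item2": 12.4,
    "loading: item3 on factor B": 4.1,
    "effect: X on Y": 1.2,
}

for name, mi, p in rank_modification_indices(modification_indices):
    if mi > CRITICAL_CHI_SQUARE_1DF:
        note = "worth considering only if theoretically defensible"
    else:
        note = "little expected change in fit"
    print(f"{name}: MI = {mi:.1f}, p = {p:.4f} ({note})")
```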

Sample size and power

While researchers agree that large sample sizes are required to provide sufficient statistical power and precise estimates using SEM, there is no general consensus on the appropriate method for determining adequate sample size. Considerations for determining sample size include the number of observations per parameter, the number of observations required for fit indexes to perform adequately, and the number of observations per degree of freedom. Researchers have proposed guidelines based on simulation studies, professional experience, and mathematical formulas. Sample size requirements to achieve a particular significance and power in SEM hypothesis testing are similar to requirements for similar sized multiple regressions.
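
The ratios mentioned above are simple arithmetic once the model's size is known. The following sketch assumes a covariance-structure model without a mean structure, so that p observed variables supply p(p + 1)/2 distinct variances and covariances and the degrees of freedom equal that count minus the number of free parameters. The function names and the example numbers are hypothetical, and no particular ratio is endorsed as adequate.

```python
from typing import Dict, Optional


def covariance_moments(n_observed: int) -> int:
    """Distinct variances and covariances among n_observed variables."""
    return n_observed * (n_observed + 1) // 2


def sample_size_ratios(
    n: int, n_observed: int, n_free_parameters: int
) -> Dict[str, Optional[float]]:
    """Observations per free parameter and per degree of freedom.

    Assumption: covariance structure only (no means), so
    df = p(p + 1)/2 - number of free parameters.
    """
    if n_free_parameters <= 0:
        raise ValueError("n_free_parameters must be positive")
    df = covariance_moments(n_observed) - n_free_parameters
    return {
        "df": float(df),
        "n_per_parameter": n / n_free_parameters,
        "n_per_df": n / df if df > 0 else None,
    }


# Hypothetical study: N = 300 cases, 9 observed variables, 21 free parameters.
print(sample_size_ratios(n=300, n_observed=9, n_free_parameters=21))
```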

In the past, researchers frequently justified switching to fit indices, rather than testing their models, by claiming that χ² increases (and hence χ²’s probability decreases) with increasing sample size (N). There are two mistakes in discounting χ² on this basis. First, for proper models, χ² does not increase with increasing N, so if χ² increases with N that itself is a sign that something is detectably problematic. Second, for models that are detectably misspecified, χ²’s increase with N provides the good statistical news of increasing power to detect model misspecification (namely, power to avoid Type II error). Some important misspecifications cannot be detected by χ² irrespective of N, so any amount of ill fit beyond what might be reasonably produced by random variations warrants report and consideration. The χ² model test, possibly adjusted, is the strongest available structural equation model test.
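
The relationship between χ² and N described above can be illustrated with the maximum likelihood discrepancy function, F = ln|Σ| + tr(SΣ⁻¹) − ln|S| − p, with the model χ² taken as (N − 1)F. The sketch below is a deliberately stylized two-variable case in plain Python: the matrices `S` (data covariances) and `SIGMA` (model-implied covariances) are hypothetical, and holding the discrepancy fixed while N varies is an assumption made only to show the scaling. When the model-implied matrix reproduces the data matrix the discrepancy is zero at any N; when a discrepancy persists, χ² grows in proportion to N − 1.

```python
import math
from typing import List

Matrix = List[List[float]]


def det2(m: Matrix) -> float:
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inv2(m: Matrix) -> Matrix:
    """Inverse of a 2x2 matrix."""
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]


def trace_of_product(a: Matrix, b: Matrix) -> float:
    """trace(a @ b) for 2x2 matrices."""
    return sum(a[i][j] * b[j][i] for i in range(2) for j in range(2))


def ml_discrepancy(s: Matrix, sigma: Matrix) -> float:
    """F = ln|Sigma| + tr(S Sigma^-1) - ln|S| - p, with p = 2 here."""
    return (math.log(det2(sigma)) + trace_of_product(s, inv2(sigma))
            - math.log(det2(s)) - 2)


def model_chi_square(s: Matrix, sigma: Matrix, n: int) -> float:
    """Model chi-square as (N - 1) times the ML discrepancy."""
    return (n - 1) * ml_discrepancy(s, sigma)


# Hypothetical data covariances and a model-implied matrix that misses them.
S = [[1.0, 0.5], [0.5, 1.0]]
SIGMA = [[1.0, 0.3], [0.3, 1.0]]

for n in (101, 501, 1001):
    print(n, round(model_chi_square(S, S, n), 6),
          round(model_chi_square(S, SIGMA, n), 3))
```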

Coordinating the Multiple Model Assessment Components

Assessing models and their coefficient estimates depends on the data, the theory, the model, and the estimation strategy. Hence model assessments consider:

    whether the data contain reasonable measurements of appropriate variables,
    whether the modeled cases are causally homogeneous (it makes no sense to estimate one model if the data cases reflect two or more different causal networks),
    whether the model appropriately represents the theory or features of interest (models are unpersuasive if they omit features required by a theory, or contain coefficients inconsistent with that theory),
    whether the estimates are statistically justifiable (substantive assessments may be devastated by violating assumptions, by using an inappropriate estimator, and/or by encountering non-convergence of iterative estimators),
    the substantive reasonableness of the estimates (negative variances, and correlations above 1.0 or below -1.0, are impossible; statistically possible estimates that are inconsistent with theory may also challenge theory, and our understanding),
    and any remaining inconsistency between the model and data.
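
The check for impossible estimates mentioned in the list above (negative variances and correlations outside the range -1.0 to 1.0) is mechanical and can be sketched in a few lines. The function below scans an estimated covariance matrix and reports such values; the function name, variable labels, and example matrix are hypothetical. It does not address the other considerations in the list, which require substantive judgment.

```python
import math
from typing import List


def inadmissible_estimates(cov: List[List[float]], labels: List[str]) -> List[str]:
    """Report negative variances and implied correlations beyond +/-1."""
    problems = []
    for i, name in enumerate(labels):
        if cov[i][i] < 0:
            problems.append(f"negative variance for {name}")
    for i in range(len(labels)):
        for j in range(i):
            vi, vj = cov[i][i], cov[j][j]
            # A correlation is only defined when both variances are positive.
            if vi > 0 and vj > 0:
                r = cov[i][j] / math.sqrt(vi * vj)
                if abs(r) > 1.0:
                    problems.append(
                        f"correlation between {labels[j]} and {labels[i]} "
                        f"is {r:.3f}"
                    )
    return problems


# Hypothetical estimated covariance matrix among three latent variables.
estimated = [
    [1.0, 1.3, 0.2],
    [1.3, 1.2, 0.1],
    [0.2, 0.1, -0.05],
]
for problem in inadmissible_estimates(estimated, ["eta1", "eta2", "eta3"]):
    print(problem)
```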

The estimation process minimizes the differences between the model and data but important and informative differences may remain. Research claiming to test or “investigate” a theory requires attending to beyond-chance model-data inconsistency. Estimation adjusts the model’s free coefficients to provide the best possible fit to the data. If a model remains inconsistent with the data despite selection of optimal coefficient estimates, an honest research response reports and attends to this evidence. Beyond-chance model-data inconsistency challenges both the coefficient estimates and the model’s capacity for adjudicating the model’s structure, irrespective of whether the inconsistency originates in problematic data, inappropriate statistical estimation, or incorrect model specification.

Coefficient estimates in data-inconsistent (“failing”) models are interpretable, as reports of how the world would appear to someone believing a model that conflicts with the available data. The estimates in data-inconsistent models do not necessarily become “obviously wrong” by becoming statistically strange, or wrongly-signed according to theory. The estimates may even closely match a theory’s requirements, but the remaining data inconsistency renders the match between the estimates and theory unpersuasive. Failing models remain interpretable, but only as interpretations that conflict with available evidence, even if the model is the best of several alternative models.

Caution should be taken when making claims of causality even when experimentation or time-ordered studies have been done. The term causal model must be understood to mean "a model that conveys causal assumptions", not necessarily a model that produces validated causal conclusions. Collecting data at multiple time points and using an experimental or quasi-experimental design can help rule out certain rival hypotheses but even a randomized experiment cannot rule out all such threats to causal inference. No research design, no matter how clever, can fully ascertain causal structures.

SEM-specific software

Numerous software packages exist for fitting structural equation models. LISREL was the first such software, initially released in the 1970s. Frequently used software implementations among researchers include Mplus, the R packages lavaan and sem, LISREL, OpenMx, SPSS AMOS, and Stata. Barbara M. Byrne published multiple instructional books on using a variety of these software packages as part of the Society of Multivariate Experimental Psychology's Multivariate Applications book series.

Scholars consider it good practice to report which software package and version was used for SEM analysis because packages differ in their capabilities and may use slightly different methods to perform similarly named techniques.

References

  1. Salkind, Neil J. (2007). "Intelligence Tests". Encyclopedia of Measurement and Statistics. doi:10.4135/9781412952644.n220. ISBN 978-1-4129-1611-0.
  2. Boslaugh, S.; McNutt, L-A. (2008). “Structural Equation Modeling”. Encyclopedia of Epidemiology. doi:10.4135/9781412953948.n443. ISBN 978-1-4129-2816-8.
  3. Shelley, M. C. (2006). “Structural Equation Modeling”. Encyclopedia of Educational Leadership and Administration. doi:10.4135/9781412939584.n544. ISBN 978-0-7619-3087-7.
  4. Pearl, J. (2009). Causality: Models, Reasoning, and Inference. Second edition. New York: Cambridge University Press.
  5. Kline, Rex B. (2016). Principles and practice of structural equation modeling (4th ed.). New York. ISBN 978-1-4625-2334-4. OCLC 934184322.
  6. Hayduk, L. (1987) Structural Equation Modeling with LISREL: Essentials and Advances. Baltimore, Johns Hopkins University Press. ISBN 0-8018-3478-3
  7. Bollen, Kenneth A. (1989). Structural equations with latent variables. New York: Wiley. ISBN 0-471-01171-1. OCLC 18834634.
  8. Kaplan, David (2009). Structural equation modeling: foundations and extensions (2nd ed.). Los Angeles: SAGE. ISBN 978-1-4129-1624-0. OCLC 225852466.
  9. Curran, Patrick J. (2003-10-01). "Have Multilevel Models Been Structural Equation Models All Along?". Multivariate Behavioral Research. 38 (4): 529–569. doi:10.1207/s15327906mbr3804_5. ISSN 0027-3171. PMID 26777445. S2CID 7384127.
  10. Tarka, Piotr (2017). "An overview of structural equation modeling: Its beginnings, historical development, usefulness and controversies in the social sciences". Quality & Quantity. 52 (1): 313–54. doi:10.1007/s11135-017-0469-8. PMC 5794813. PMID 29416184.
  11. MacCallum & Austin 2000, p. 209.
  12. Wright, Sewall. (1921) “Correlation and causation”. Journal of Agricultural Research. 20: 557-585.
  13. Wright, Sewall. (1934) “The method of path coefficients”. The Annals of Mathematical Statistics. 5 (3): 161-215. doi:10.1214/aoms/1177732676.
  14. Wolfle, L.M. (1999) “Sewall Wright on the method of path coefficients: An annotated bibliography”. Structural Equation Modeling. 6 (3): 280-291.
  15. Duncan, Otis Dudley. (1975). Introduction to Structural Equation Models. New York: Academic Press. ISBN 0-12-224150-9.
  16. Bollen, K. (1989). Structural Equations with Latent Variables. New York, Wiley. ISBN 0-471-01171-1.
  17. Jöreskog, Karl; Gruvaeus, Gunnar T.; van Thillo, Marielle. (1970) ACOVS: A General Computer Program for Analysis of Covariance Structures. Princeton, N.J.; Educational Testing Services.
  18. Jöreskog, Karl Gustav; van Thillo, Mariella (1972). "LISREL: A General Computer Program for Estimating a Linear Structural Equation System Involving Multiple Indicators of Unmeasured Variables" (PDF). Research Bulletin: Office of Education. ETS-RB-72-56 – via US Government.
  19. Jöreskog, Karl; Sorbom, Dag. (1976) LISREL III: Estimation of Linear Structural Equation Systems by Maximum Likelihood Methods. Chicago: National Educational Resources, Inc.
  20. Hayduk, L.; Glaser, D.N. (2000) “Jiving the Four-Step, Waltzing Around Factor Analysis, and Other Serious Fun”. Structural Equation Modeling. 7 (1): 1-35.
  21. Hayduk, L.; Glaser, D.N. (2000) “Doing the Four-Step, Right-2-3, Wrong-2-3: A Brief Reply to Mulaik and Millsap; Bollen; Bentler; and Herting and Costner”. Structural Equation Modeling. 7 (1): 111-123.
  22. Westland, J.C. (2015). Structural Equation Modeling: From Paths to Networks. New York, Springer.
  23. Christ, Carl F. (1994). "The Cowles Commission's Contributions to Econometrics at Chicago, 1939-1955". Journal of Economic Literature. 32 (1): 30–59. ISSN 0022-0515. JSTOR 2728422.
  24. Imbens, G.W. (2020). “Potential outcome and directed acyclic graph approaches to causality: Relevance for empirical practice in economics”. Journal of Economic Literature. 58 (4): 1129-1179.
  25. Bollen, K.A.; Pearl, J. (2013) “Eight myths about causality and structural equation models.” In S.L. Morgan (ed.) Handbook of Causal Analysis for Social Research, Chapter 15, 301-328, Springer. doi:10.1007/978-94-007-6094-3_15
  26. Borsboom, D.; Mellenbergh, G. J.; van Heerden, J. (2003). “The theoretical status of latent variables.” Psychological Review, 110 (2): 203–219. https://doi.org/10.1037/0033-295X.110.2.203
  28. Rigdon, E. (1995). “A necessary and sufficient identification rule for structural models estimated in practice.” Multivariate Behavioral Research. 30 (3): 359-383.
  29. Hayduk, L. (1996) LISREL Issues, Debates, and Strategies. Baltimore, Johns Hopkins University Press. ISBN 0-8018-5336-2
  30. Hayduk, L.A. (2014b) “Shame for disrespecting evidence: The personal consequences of insufficient respect for structural equation model testing.” BMC Medical Research Methodology. 14 (124): 1-10. doi:10.1186/1471-2288-14-124. http://www.biomedcentral.com/1471-2288/14/124
  31. Browne, M.W.; MacCallum, R.C.; Kim, C.T.; Andersen, B.L.; Glaser, R. (2002) “When fit indices and residuals are incompatible.” Psychological Methods. 7: 403-421.
  32. Hayduk, L. A.; Pazderka-Robinson, H.; Cummings, G.G.; Levers, M-J. D.; Beres, M. A. (2005) “Structural equation model testing and the quality of natural killer cell activity measurements.” BMC Medical Research Methodology. 5 (1): 1-9. doi:10.1186/1471-2288-5-1. Note the correction of .922 to .992, and the correction of .944 to .994 in the Hayduk, et al. Table 1.
  34. Barrett, P. (2007). “Structural equation modeling: Adjudging model fit.” Personality and Individual Differences. 42 (5): 815-824.
  35. Satorra, A.; and Bentler, P. M. (1994) “Corrections to test statistics and standard errors in covariance structure analysis”. In A. von Eye and C. C. Clogg (Eds.), Latent variables analysis: Applications for developmental research (pp. 399-419). Thousand Oaks, CA: Sage.
  37. Hu, L.; Bentler, P.M. (1999) “Cutoff criteria for fit indices in covariance structure analysis: Conventional criteria versus new alternatives.” Structural Equation Modeling. 6: 1-55.
  38. Kline 2011, p. 205.
  39. Kline 2011, p. 206.
  40. Hu & Bentler 1999, p. 27.
  41. Steiger, J. H.; and Lind, J. (1980) “Statistically Based Tests for the Number of Common Factors.” Paper presented at the annual meeting of the Psychometric Society, Iowa City.
  42. Steiger, J. H. (1990) “Structural Model Evaluation and Modification: An Interval Estimation Approach”. Multivariate Behavioral Research. 25: 173-180.
  43. Browne, M.W.; Cudeck, R. (1992) “Alternate ways of assessing model fit.” Sociological Methods and Research. 21 (2): 230-258.
  44. Loehlin, J. C. (2004). Latent Variable Models: An Introduction to Factor, Path, and Structural Equation Analysis. Psychology Press.
  45. MacCallum, Robert (1986). "Specification searches in covariance structure modeling". Psychological Bulletin. 100: 107–120. doi:10.1037/0033-2909.100.1.107.
  46. Hayduk, L. A.; Littvay, L. (2012) “Should researchers use single indicators, best indicators, or multiple indicators in structural equation models?” BMC Medical Research Methodology, 12 (159): 1-17. doi:10.1186/1471-2288-12-159
  47. Quintana & Maxwell 1999, p. 499.
  48. Westland, J. Christopher (2010). "Lower bounds on sample size in structural equation modeling". Electron. Comm. Res. Appl. 9 (6): 476–487. doi:10.1016/j.elerap.2010.07.003.
  49. Chou, C. P.; Bentler, Peter (1995). "Estimates and tests in structural equation modeling". In Hoyle, Rick (ed.). Structural equation modeling: Concepts, issues, and applications. Thousand Oaks, CA: Sage. pp. 37–55.
  50. Bentler, P. M; Chou, Chih-Ping (2016). "Practical Issues in Structural Modeling". Sociological Methods & Research. 16 (1): 78–117. doi:10.1177/0049124187016001004. S2CID 62548269.
  51. MacCallum, Robert C; Browne, Michael W; Sugawara, Hazuki M (1996). "Power analysis and determination of sample size for covariance structure modeling". Psychological Methods. 1 (2): 130–49. doi:10.1037/1082-989X.1.2.130.
  52. Pearl, Judea (2000). Causality: Models, Reasoning, and Inference. Cambridge University Press. ISBN 978-0-521-77362-1.
  53. Rosseel, Yves (2012-05-24). "lavaan: An R Package for Structural Equation Modeling". Journal of Statistical Software. 48 (2): 1–36. doi:10.18637/jss.v048.i02. Retrieved 27 January 2021.
  54. Narayanan, A. (2012-05-01). "A Review of Eight Software Packages for Structural Equation Modeling". The American Statistician. 66 (2): 129–138. doi:10.1080/00031305.2012.708641. ISSN 0003-1305. S2CID 59460771.
  55. "Barbara Byrne Award for Outstanding Book or Edited Volume | SMEP". smep.org. Retrieved 2022-10-25.
  56. Kline 2011, p. 79-88.
