Model-Data Fit Using the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and the Sample-Size-Adjusted BIC

The study determined whether the 1PL, 2PL, 3PL, or 4PL item response theory model best fit the data from the 2016 NECO Mathematics objective tests. An ex-post facto design was adopted. The population comprised the 1,022,474 candidates who enrolled for and sat the June/July 2016 NECO SSCE Mathematics Examination. The sample comprised 276,338 candidates who sat the examination in three purposively selected geopolitical zones in Nigeria (South-West, South-East, and North-West). The research instruments were the Optical Mark Record Sheets for the National Examinations Council (NECO) June/July 2016 SSCE Mathematics objective items. Responses were scored dichotomously, and the data were analyzed using the −2 log-likelihood chi-square. The likelihood ratio tests revealed that the 2PL model fitted the data significantly better than the 1PL model (χ²(59) = 820636.1, p < 0.05), that the 3PL model fitted the data better than the 2PL model, and that the 4PL model fitted the data significantly better than the 3PL model (χ²(60) = 216159.2, p < 0.05). The study concluded that the four-parameter logistic model best fitted the 2016 NECO Mathematics test items.


Background
Assessment is critical to the development of strong student performance when using a classroom-designed instrument, because measuring variables is one of the steps in the research process (Eluwa & Abang, 2011). Item response theory (IRT) is a modern approach to assessing a test instrument's psychometric properties; it has gained traction and established itself as a prominent measurement framework. Producing a psychometrically sound cognitive assessment requires a rigorous instrument development process. According to Ojerinde (2013), a latent trait or ability is a variable that is not directly observable but has a quantifiable impact on observable attributes. Standard measures, namely the test items, can be used to make inferences about the presence or extent of specific traits based on these observable attributes. The items are assumed to relate directly to the latent attribute and to be conditionally independent: the response to an item should be completely captured by the ability characteristic, any correlation between items should be due solely to their common dependence on the assumed latent trait, and there should be no covariance among item responses along any other latent dimension if the assumption of local independence is to be met. Item response theory, which is probabilistic in approach, formalizes the essential connection between the response to an item and the ability possessed by a person. The general form of the three-parameter IRT model is

P(θ) = c + (1 − c) / (1 + e^(−L)), with L = 1.7a(θ − b),

where e is the constant 2.718, 1.70 is the scaling factor, a is the discrimination parameter, b is the difficulty parameter, c is the pseudo-guessing parameter, L is the logistic deviate (logit), and θ is the ability level. The degree to which an item separates examinees is depicted by the a-parameter, also known as item discrimination.
The a-parameter is the slope of the item characteristic curve at its point of inflection (Harris, 1989). The a-parameter can be negative (−) or positive (+), with 2.0 being a common value for multiple-choice questions. An item with a low a-parameter discriminates poorly across a wide range of ability; an item with a higher a-value discriminates well, but only over a narrow range of ability. Items with an a-parameter of 0.80 or less are considered poor. At the point of inflection, the larger the a-parameter, the more sharply the item discriminates among examinees. The a-value can be computed numerically or with the aid of a computer program. The b-parameter is item difficulty, also known as the item threshold parameter. It is the inflection point on the ability scale, that is, the point at which examinees have a 50% chance of responding correctly to an item. When the ability scale is scaled with a mean of 0 and a standard deviation of 1.0, difficulty values can range from −3 to +3. Items with high b estimates are difficult, while items with low b values are ones to which even weak examinees have a moderate likelihood of responding correctly (Harris, 1989, in Ojerinde, Popoola, Ojo & Onyeneho, 2012). The guessing index, or lower-asymptote value, is the c-parameter, often known as the pseudo-chance parameter. An examinee who has no idea which option is best in a multiple-choice question can still respond correctly by guessing at random. Theoretically, the guessing parameter can range from 0.00 to 1.0, although in practice it rarely exceeds 0.3. Applying item response theory, in practice, involves fitting curves to observed proportions of category responses in the hope that the fit is good enough to justify faith in the model being fitted (Baker, 1992).
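The three-parameter model and its item parameters described above can be sketched in a few lines of code. This is a minimal illustration (the function name and example values are mine, not from the study); it simply evaluates the 3PL response probability for a given ability level.

```python
import math

def p_3pl(theta, a, b, c, D=1.7):
    """Probability of a correct response under the 3PL model:
    P(theta) = c + (1 - c) / (1 + exp(-D * a * (theta - b)))."""
    L = D * a * (theta - b)  # logistic deviate (logit)
    return c + (1.0 - c) / (1.0 + math.exp(-L))

# At theta == b (the inflection point) the probability is c + (1 - c)/2,
# i.e. exactly 50% only when the guessing parameter c is zero.
print(round(p_3pl(theta=0.0, a=1.0, b=0.0, c=0.0), 3))  # 0.5
print(round(p_3pl(theta=0.0, a=1.0, b=0.0, c=0.2), 3))  # 0.6
```

Note how a nonzero c raises the lower asymptote: even an examinee far below the item's difficulty retains roughly a c chance of answering correctly by guessing.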
However, Garcia-Pérez and Frary (1991) pointed out that this approach suffers from the fundamental contradiction of measuring fit after the parameters have been set so as to fit the model as closely as possible. In other words, there is no technique for independently verifying the appropriateness of modelled response functions, making it impossible to validate the fit of IRT models to data (Garcia-Pérez & Frary, 1991). Examining both the model's goodness of fit and the number of parameters used to obtain that fit is a more sophisticated approach to model selection than relying solely on fit statistics (Sclove, 1987). This is commonly accomplished by employing a penalty term that grows with the number of parameters in the fitted model.
A significant problem in modern science is choosing the most appropriate model to describe the phenomena under investigation. Because statisticians are inherently involved in this activity, it is not surprising that several statistical techniques for dealing with this critical issue have been proposed over the years. Model selection has been thoroughly researched from both frequentist and Bayesian standpoints, and many approaches for determining the "best model" from a group of contenders have been proposed in the literature. The Akaike information criterion (AIC) and the Bayesian information criterion (BIC) are two of the most extensively used families of model selection indicators. The AIC is an information-theoretic indicator based on the Kullback-Leibler divergence that essentially measures the information lost by a given model; the criterion therefore rests on the idea that the less information a model loses, the higher its quality. The BIC criterion is based on Bayesian theory and aims to maximize the posterior probability of a model given the data.
The Bayesian information criterion (BIC), or Schwarz criterion (also SBC, SBIC), is a statistical criterion for selecting a model from a finite set of options. It is closely related to the Akaike information criterion (AIC) and is likewise based in part on the likelihood function. It is possible to increase the likelihood of a fitted model by adding parameters, but this may result in overfitting. The BIC addresses this problem by introducing a penalty term for the number of parameters in the model.

Mathematically
The BIC is an asymptotic result derived under the assumption that the data distribution is in the exponential family. Let x be the observed data; n the number of data points in x (the number of observations, or sample size); and k the number of free parameters to be estimated, so that p(x|k) is the probability of the observed data given the number of parameters, that is, the likelihood of the parameters given the dataset. If the estimated model is a linear regression, k is the number of regressors, including the intercept. Let L be the maximized value of the likelihood function for the estimated model. The formula for the BIC is

BIC = k·ln(n) − 2·ln(L).    (1)

Under the assumption of normally distributed model errors, this can be written (up to an additive constant that does not depend on the model) in terms of the error variance as

BIC = n·ln(σ̂e²) + k·ln(n),    (2)

where the error variance is defined as

σ̂e² = (1/n) Σᵢ (xᵢ − x̂ᵢ)².    (3)

The Bayesian information criterion's characteristics are listed below. 1. It is independent of the prior, or the prior is "vague" (a constant). 2. It can be used to assess the parameterized model's accuracy in forecasting data. 3. It penalizes the model's complexity, defined as the number of parameters in the model. 4. It is approximately equal to the minimum description length criterion, but with a negative sign. 5. It can be used to choose the number of clusters according to the intrinsic complexity of a dataset. 6. It is closely related to other penalized likelihood criteria such as the RIC and the Akaike information criterion.
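The two forms of the BIC above can be checked numerically. The following is a minimal sketch (function name and example values are mine): it computes the BIC from a log-likelihood and verifies that, for Gaussian errors with ML error variance σ̂e² = RSS/n, the likelihood form and the error-variance form differ only by a constant that is the same for every model.

```python
import math

def bic(log_likelihood, k, n):
    """Schwarz's BIC: k*ln(n) - 2*ln(L_hat); lower is better."""
    return k * math.log(n) - 2.0 * log_likelihood

# Gaussian special case: with ML error variance s2 = RSS/n, the maximized
# log-likelihood is -(n/2)*(ln(2*pi*s2) + 1), so the likelihood form equals
# n*ln(s2) + k*ln(n) plus a model-independent constant.
n, k, rss = 100, 3, 42.0
s2 = rss / n
loglik = -0.5 * n * (math.log(2 * math.pi * s2) + 1)
full = bic(loglik, k, n)
reduced = n * math.log(s2) + k * math.log(n)
const = n * (math.log(2 * math.pi) + 1)  # dropped in the reduced form
print(abs(full - (reduced + const)) < 1e-9)  # True
```

Because the constant is identical across models fitted to the same data, either form ranks candidate models identically.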
In this vein, the Akaike information criterion (AIC; Akaike, 1977) provides a model selection method that has occasionally been used in IRT (e.g., Wilson, 1992). According to Sclove (1987), however, the AIC is not asymptotically consistent, and it is questionable whether the AIC is suitable for comparing models with different kinds of parameters, such as local category models and cumulative boundary models. The ideal observer index (IOI) is an extension of the AIC that uses the likelihood ratio of two comparison models instead of a single baseline model (Levine, Drasgow, Williams, McCusker, & Thomasson, 1992); this index is more appropriate than the AIC when comparing models with different sorts of parameters. Wainer and Wright (1980) noted, "It appears that the Rasch model gives fairly reasonable assessments of ability and difficulty even when its assumption of equal slopes is very lightly approximated." Furthermore, Lord and Novick (1968) observed that if the number of items is very large, inferences about an examinee's ability based on his total test score will be much the same whether the Rasch model or the 3PL model is used.
The application context, as well as one's philosophy of whether the data should fit the model or vice versa, should be considered when deciding among the one-parameter, two-parameter, and three-parameter models (e.g., sample size, instrument characteristics, assumption tenability, political realities, etc.). Given that the one-parameter (1PL) model is the most restrictive of the three, there has been considerable research on how it behaves when it misfits. Forsyth, Saisangjan, and Gilmer (1981), for example, investigated the Rasch model's robustness when its dimensionality and equal-discrimination assumptions were violated. Because their empirical data came from a multiple-choice test, it was assumed that some examinees would guess. Forsyth et al. concluded that "even when the model's assumptions are not met, the Rasch model yields reasonably invariant item parameter and ability estimates." Dinero and Haertel (1977) reached similar conclusions using simulated data. The evaluation of model-data fit proceeds by fitting the test data to the four IRT models and comparing the models' fit to the data set; the model that fits the data best is then selected. A variety of strategies can be used to achieve this purpose: according to Oguoma, Metibemu, and Okoye (2016), the chi-square difference test and information indices are relevant measures (Finch & French, 2015).
Model complexity is penalized in information indices, which are essentially measures of the variation that a model does not explain. Among the best known of these indices are the Akaike information criterion (AIC; Akaike, 1973), the Bayesian information criterion (BIC; Schwarz, 1978), and the sample-size-adjusted BIC (SBIC; Enders & Tofighi, 2008). These information indices are derived from the −2 log-likelihood chi-square value and are interpreted such that the model with the lower value fits the data better. The chi-square and likelihood ratio goodness-of-fit tests are also used to test the null hypothesis that two nested models provide the same fit to a set of data.
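The indices and the likelihood ratio (chi-square difference) statistic described above can be computed directly from each model's −2 log-likelihood. The sketch below is illustrative (function names and the sample figures are mine, not the study's output); the SABIC here uses the common (n + 2)/24 adjustment to the sample size.

```python
import math

def information_indices(neg2loglik, k, n):
    """AIC, BIC and sample-size-adjusted BIC (SABIC) from the
    -2*log-likelihood of a fitted model; lower values fit better."""
    aic = neg2loglik + 2 * k
    bic = neg2loglik + k * math.log(n)
    sabic = neg2loglik + k * math.log((n + 2) / 24.0)
    return aic, bic, sabic

def lr_statistic(neg2loglik_restricted, neg2loglik_full):
    """Chi-square difference statistic for two nested models; its
    degrees of freedom equal the difference in parameter counts."""
    return neg2loglik_restricted - neg2loglik_full

# Hypothetical -2logL values for nested models fitted to the same data:
aic1, bic1, _ = information_indices(18_000_000.0, k=61, n=276_338)
aic2, bic2, _ = information_indices(17_200_000.0, k=120, n=276_338)
print(aic2 < aic1 and bic2 < bic1)  # the less-restricted model is preferred
print(lr_statistic(18_000_000.0, 17_200_000.0))  # 800000.0
```

A statistically significant chi-square difference (compared against the critical value for the difference in parameter counts) rejects the null hypothesis that the two nested models fit equally well.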
A statistically significant result indicates that the models under investigation differ in fit. One of the most basic requirements in the use of parametric IRT models is that the model be appropriate for the data, which entails picking the right model and evaluating its fit (Edelen & Reeve, 2007). When choosing a model, the number of item response categories is the first consideration; the 1PL, 2PL, 3PL, and 4PL IRT models can be used with dichotomous data.

Objective
To determine whether the 1PL, 2PL, 3PL, or 4PL item response theory model best fits the data from the 2016 NECO Mathematics objective tests.

METHOD
The study employed an ex-post facto design. The population comprised the candidates who sat the June/July 2016 NECO SSCE Mathematics Examination. The sample comprised 276,338 candidates who sat the examination in three purposively selected geopolitical zones in Nigeria (i.e., South-West, South-East, and North-West). The research instruments were the Optical Mark Record Sheets for the National Examinations Council (NECO) June/July 2016 SSCE Mathematics objective items. Responses were scored dichotomously, and the data were analyzed using the −2 log-likelihood chi-square. Table 1 shows the model-data fit assessment of the 2016 NECO Mathematics test items. When the fit of the 1PL and 2PL models was compared, the 2PL model's AIC = 17289402, SABIC = 17290282, and BIC = 17290664 were all less than the 1PL model's AIC = 18109920, SABIC = 18110368, and BIC = 18110562. In addition, the likelihood ratio test comparing the two models was statistically significant (χ²(59) = 820636.1, p < 0.05). These results showed that the 2PL model fitted the data better than the 1PL model. Likewise, the 3PL model fitted the data better than the 2PL model: the 3PL model's AIC = 17120304, SABIC = 17121624, and BIC = 17122196 were respectively less than the 2PL model's AIC = 17289402, SABIC = 17290282, and BIC = 17290664, and the likelihood ratio test was statistically significant (χ²(60) = 169158.5, p < 0.05). Furthermore, in search of a better model for the test data, the fit of the 3PL model to the 2016 NECO Mathematics test items was in turn compared with the fit of the 4PL model.
The results showed that the 4PL model fitted the data better than the 3PL model: the 4PL model's AIC = 16904265, SABIC = 16906025, and BIC = 16906788 were respectively less than the 3PL model's AIC = 17120304, SABIC = 17121624, and BIC = 17122196, and the likelihood ratio test comparing the two models was statistically significant (χ²(60) = 216159.2, p < 0.05). The results revealed that the unidimensional four-parameter logistic model fitted the 2016 NECO Mathematics test items best. Thus, the test was calibrated using the four-parameter logistic model.
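The pairwise comparisons above amount to picking, for each information index, the model with the smallest value. The sketch below tabulates the reported indices (matching each model's AIC with the SABIC and BIC values reported for it across the comparisons; the dictionary layout is mine) and selects the minimum per index.

```python
# (AIC, SABIC, BIC) per model as reported in the study; lower is better.
reported = {
    "1PL": (18_109_920, 18_110_368, 18_110_562),
    "2PL": (17_289_402, 17_290_282, 17_290_664),
    "3PL": (17_120_304, 17_121_624, 17_122_196),
    "4PL": (16_904_265, 16_906_025, 16_906_788),
}

for i, name in enumerate(("AIC", "SABIC", "BIC")):
    best = min(reported, key=lambda m: reported[m][i])
    print(f"{name}: best-fitting model is {best}")
# Each criterion selects the 4PL model, matching the study's conclusion.
```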

Discussion of Findings
The study found that the 2PL model fitted the data better than the 1PL model. In a simulation study, Yen (1981) created several data sets based on various models and assessed the fit of the 1PL, 2PL, and 3PL models to these data. When she used the 3PL model to generate data, she discovered that the 2PL model fitted the data almost as well as the 3PL model, even though the item parameter estimates were not the same. She noted that it was difficult for the 2PL model to mimic a nonzero lower asymptote when an item was difficult and had moderate to high discrimination, and concluded that, while the 2PL model performed almost as well as the 3PL model in modeling response vectors, sample dependency may be observed when discrimination parameters for difficult items are estimated with low-proficiency examinees.
In a similar study, Mokobi and Adedoyin (2014) employed MULTILOG to examine item-level and model fit statistics for a three-parameter logistic model using the 2010 Botswana Junior Certificate Examination Mathematics paper one. To test item fit to the 1PL, 2PL, and 3PL models, the researchers used χ² goodness-of-fit statistics. The findings revealed that 10 items fit the 1PL model, 11 items fit the 2PL model, and 24 items fit the 3PL model.
The findings also showed that the 3PL model fitted the data better than the 2PL model. Using the more sophisticated 3PL model resulted in a better fit than the 2PL model; however, in light of the increased model complexity, this is not regarded as a substantial improvement in fit. According to De Ayala (2009), the 3PL model fits better than the 2PL and 1PL models, although it does not yield a significant fit improvement over either. Similarly, Orlando and Thissen (2000) analyzed model-data fit for fixed-format tests; across the three fit statistics compared, the three-parameter logistic model combined with the generalized partial credit model was the best match, while the one-parameter logistic model had the largest number of misfitting items. Finally, the study's findings revealed that the 4PL model fitted the data better than the 3PL model. Because models with more parameters generally tend to fit a data set better than models with fewer parameters, it is advisable to take the extra parameters into account when evaluating model-data fit; one can reduce the tendency toward model over-parameterization by considering the number of parameters necessary to attain a given degree of fit (De Ayala, 2009).

CONCLUSION
Based on the findings, and on examining the invariance of item parameter estimates across random calibration subsamples, the study concluded that the likelihood ratio test and the AIC and BIC statistics were the best approaches for assessing model-level fit, and that the 4PL model fitted the test data best. The following recommendations were made based on the findings of this study. 1. The selection of the best item response theory model should begin with assessing item fit statistics, so that item response theory can be applied with confidence. 2. Using more than one item response theory program allows the choice of the program that provides the most useful information about the real data set. 3. Unidimensionality tests such as Stout's T statistic, factor analysis, and conditional item covariances should be adopted in model-data fit assessment. 4. Item response theory should be further used by test developers, researchers, and stakeholders to better understand how to develop psychometrically sound measures.