Methods’ details for the bnclassify package

All notation and acronyms used here are introduced in vignette("overview", package="bnclassify").

The remaining vignettes can be listed with vignette(package = "bnclassify").

The CL-ODE algorithm adapts the Chow-Liu algorithm in order to find the maximum likelihood TAN model in time quadratic in n. Since the same method can be used to find ODE models which maximize decomposable penalized log-likelihood scores, bnclassify also uses it to maximize Akaike’s information criterion (AIC) and the BIC. While maximizing likelihood will always render a TAN, i.e., a network with n − 1 augmenting arcs, maximizing a penalized log-likelihood may render a FAN, since the inclusion of some arcs might degrade the penalized log-likelihood score.
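For illustration, a minimal example with the car data set bundled with the package, assuming that tan_cl()’s score argument accepts 'loglik', 'aic' and 'bic':

```r
library(bnclassify)
data(car)  # car evaluation data set shipped with bnclassify; the class variable is 'class'
tan_ml  <- tan_cl('class', car)                  # maximum-likelihood TAN (n - 1 augmenting arcs)
tan_aic <- tan_cl('class', car, score = 'aic')   # penalized scores; may return a FAN
tan_bic <- tan_cl('class', car, score = 'bic')
```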

Note that when the data are incomplete, CL-ODE does not necessarily return the optimal (with respect to penalized log-likelihood) ODE. Namely, that would require the computationally expensive calculation of the sufficient statistics Nijk which maximize parameter likelihood; instead, bnclassify approximates these statistics with the available case analysis heuristic described below under parameter estimation.

TAN HC and TAN HC SP may evaluate equivalent structures at each step. Adding the valid arcs Xi → Xj and Xj → Xi produces equivalent structures because of the tree (or forest) structure of the feature subgraph. Namely, |Pa(Xj) \ {C}| ≤ 1 for each feature Xj, and thus we can only add the arc Xi → Xj if Pa(Xj) = {C}. Adding an arc Xi → Xj therefore introduces no v-structures into the network (since C is a parent of both Xi and Xj), and both Xi → Xj and Xj → Xi only remove the independence between Xi and Xj. The two resulting networks thus correspond to identical factorizations of the joint distribution.

To avoid scoring equivalent structures, at each step bnclassify selects the arc Xi → Xj such that the name of Xi (that is, its column name in the data set) alphabetically precedes that of Xj. A preferable implementation would be to select between the two arcs at random.
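A small sketch of this tie-breaking rule; pick_arc() is a hypothetical helper written for illustration, not part of the package:

```r
# Hypothetical helper: return the candidate arc between features xi and xj in a
# canonical orientation, so that equivalent structures are scored only once.
pick_arc <- function(xi, xj) {
  if (xi < xj) c(from = xi, to = xj) else c(from = xj, to = xi)
}
pick_arc("safety", "buying")   # always yields buying -> safety
```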

bnclassify only handles discrete features. With fully observed data, it estimates the parameters with maximum likelihood or Bayesian estimation, according to Equation 2 in the “overview” vignette, with a single α for all local distributions. With incomplete data it uses available case analysis, substituting $N_{\cdot j \cdot}$ in Equation 2 in the “overview” vignette with $N_{i j \cdot} = \sum_{k = 1}^{r_i} N_{i j k}$, i.e., with the count of instances in which Pa(Xi) = j and Xi is observed:

$$\theta_{ijk} = \frac{N_{ijk} + \alpha}{N_{ij\cdot} + r_i \alpha}.$$
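As an illustration, the sketch below computes this estimate for a single feature and a single parent; aca_cpt() is a hypothetical helper (not bnclassify’s internal code) and assumes the parent is fully observed:

```r
# Hypothetical sketch: Bayesian estimate of P(X | Pa(X)) with a single smoothing
# parameter alpha, counting only the instances where X is observed
# (available case analysis).
aca_cpt <- function(x, pa, alpha = 1) {
  r_x <- nlevels(x)
  keep <- !is.na(x)                     # instances where X is observed
  counts <- table(x[keep], pa[keep])    # N_ijk: rows are values of X, columns parent configurations
  sweep(counts + alpha, 2, colSums(counts) + r_x * alpha, "/")
}

data(car, package = "bnclassify")
aca_cpt(car$buying, car$class)          # P(buying | class) for the car data
```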

The MANB parameter estimation method corresponds to exact Bayesian model averaging over the naive Bayes models obtained from all $2^n$ subsets of the n features, yet it is computed in time linear in n. The implementation in bnclassify follows the online appendix of the original MANB publication, extending it to the case where α ≠ 1 in Equation 2 of the “overview” vignette.

The estimate for a particular parameter θijkMANB is:

$$\theta_{ijk}^{MANB} = P(\mathcal{G}_{C \nbigCI X_i} \mid \mathcal{D})\, \theta_{ijk} + P(\mathcal{G}_{C \bigCI X_i} \mid \mathcal{D})\, \theta_{ik},$$

where $P(\mathcal{G}_{C \nbigCI X_i} \mid \mathcal{D})$ is the local posterior probability of an arc from C to Xi, whereas $P(\mathcal{G}_{C \bigCI X_i} \mid \mathcal{D}) = 1 - P(\mathcal{G}_{C \nbigCI X_i} \mid \mathcal{D})$ is that of the absence of such an arc (which is equivalent to omitting Xi from the model), while θijk and θik are the Bayesian parameter estimates obtained with Equation 2 in the “overview” vignette given the corresponding structures (i.e., with and without the arc from C to Xi).

Using Bayes’ theorem,

$$P(\mathcal{G}_{C \nbigCI X_i} \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \mathcal{G}_{C \nbigCI X_i})\, P(\mathcal{G}_{C \nbigCI X_i})}{P(\mathcal{D} \mid \mathcal{G}_{C \nbigCI X_i})\, P(\mathcal{G}_{C \nbigCI X_i}) + P(\mathcal{D} \mid \mathcal{G}_{C \bigCI X_i})\, P(\mathcal{G}_{C \bigCI X_i})}.$$

Assuming a Dirichlet prior with hyperparameter α = 1 in Equation 2 of the “overview” vignette, Equations (6) and (7) in the above-mentioned online appendix give formulas for $P(\mathcal{D} \mid \mathcal{G}_{C \nbigCI X_i})$ and $P(\mathcal{D} \mid \mathcal{G}_{C \bigCI X_i})$:

$$P(\mathcal{D} \mid \mathcal{G}_{C \nbigCI X_i}) = \prod_{j=1}^{r_C} \frac{(r_i - 1)!}{(N_{ij\cdot} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}!$$

and

$$P(\mathcal{D} \mid \mathcal{G}_{C \bigCI X_i}) = \frac{(r_i - 1)!}{(N_{i\cdot\cdot} + r_i - 1)!} \prod_{k=1}^{r_i} N_{i\cdot k}!,$$

where $N_{i \cdot k} = \sum_{j=1}^{r_C} N_{ijk}$ and $N_{i \cdot \cdot} = \sum_{k=1}^{r_i} N_{i \cdot k}$. Noting that the above are special cases of the general Bayesian–Dirichlet marginal likelihood, we can generalize this for any hyperparameter α > 0 as follows:

$$P(\mathcal{D} \mid \mathcal{G}_{C \nbigCI X_i}) = \prod_{j=1}^{r_C} \frac{\Gamma(r_i \alpha)}{\Gamma(N_{ij\cdot} + r_i \alpha)} \prod_{k=1}^{r_i} \frac{\Gamma(N_{ijk} + \alpha)}{\Gamma(\alpha)}$$

and

$$P(\mathcal{D} \mid \mathcal{G}_{C \bigCI X_i}) = \frac{\Gamma(r_i \alpha)}{\Gamma(N_{i\cdot\cdot} + r_i \alpha)} \prod_{k=1}^{r_i} \frac{\Gamma(N_{i\cdot k} + \alpha)}{\Gamma(\alpha)}.$$
Following the original MANB proposal, bnclassify assumes that the local prior probability of an arc from the class to a feature Xi, $P(\mathcal{G}_{C \nbigCI X_i})$, is the same for every feature and is given by the user. Denoting this prior by $p$, the prior of a naive Bayes structure 𝒢 with arcs from the class to $a$ out of the $n$ features, and no arcs to the remaining $n - a$ features, is then

$$P(\mathcal{G}) = p^{a}\, (1 - p)^{n - a}.$$

Note that bnclassify computes the above in logarithmic space to reduce numerical errors.
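The following sketch puts the above pieces together for a single feature with complete data. manb_theta() is a hypothetical helper written for illustration, not the package’s internal code; it assumes a factor feature x, a factor class y, a Dirichlet hyperparameter alpha and a single user-given arc prior:

```r
# Hypothetical sketch of MANB for one feature: marginal likelihoods with and
# without the arc C -> X, the posterior arc probability (in log space), and the
# model-averaged parameter estimates.
manb_theta <- function(x, y, alpha = 1, prior = 0.5) {
  r_x <- nlevels(x)
  n_jk <- table(y, x)                    # N_ijk: class values (rows) by feature values (columns)
  n_k  <- colSums(n_jk)                  # N_i.k: counts ignoring the class
  # Log marginal likelihood with the arc (one Dirichlet per class value)
  ll_arc <- sum(lgamma(r_x * alpha) - lgamma(rowSums(n_jk) + r_x * alpha)) +
    sum(lgamma(n_jk + alpha) - lgamma(alpha))
  # Log marginal likelihood without the arc (a single Dirichlet)
  ll_no <- lgamma(r_x * alpha) - lgamma(sum(n_k) + r_x * alpha) +
    sum(lgamma(n_k + alpha) - lgamma(alpha))
  # Posterior probability of the arc, computed in log space
  a <- ll_arc + log(prior); b <- ll_no + log(1 - prior); m <- max(a, b)
  p_arc <- exp(a - m) / (exp(a - m) + exp(b - m))
  # Bayesian estimates with and without the arc, then model averaging
  theta_arc <- sweep(n_jk + alpha, 1, rowSums(n_jk) + r_x * alpha, "/")
  theta_no  <- (n_k + alpha) / (sum(n_k) + r_x * alpha)
  p_arc * theta_arc + (1 - p_arc) * matrix(theta_no, nrow(theta_arc), r_x, byrow = TRUE)
}

data(car, package = "bnclassify")
manb_theta(car$buying, car$class, alpha = 1, prior = 0.5)
```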

The WANBIA method updates the naive Bayes parameters with a single exponent ‘weight’ per feature. The weights are computed by optimizing either the conditional log-likelihood or the root mean squared error of the predictions. bnclassify implements the conditional log-likelihood optimization as described in the original paper, namely optimizing it with the L-BFGS algorithm, with the i-th component of its gradient g given by

$$g_i = \sum_{j=1}^{N} \left( \log P(x_i^{(j)} \mid c^{(j)}) - \sum_{c} P(c \mid \mathbf{x}^{(j)}; \mathbf{w}) \log P(x_i^{(j)} \mid c) \right),$$

where the probabilities are those estimated with maximum likelihood, i.e., without taking weights into account, whereas P(c ∣ x; w) takes weights into account. This corresponds to discriminative learning of parameters, as a discriminative, rather than generative, objective function is optimized.

If $X_i$ is unobserved for some instance $j$, that is, $x_i^{(j)}$ is missing, then we replace both $P(X_i = x_i^{(j)} \mid c^{(j)})$ and $P(X_i = x_i^{(j)} \mid c)$ with 1 in the gradient above (as a leaf in the Bayesian network, an unobserved $X_i$ does not affect the conditional log-likelihood).
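As a rough illustration of this optimization, the sketch below defines the negative conditional log-likelihood and its gradient for use with a generic L-BFGS routine such as base R’s optim(method = "L-BFGS-B"). The helpers wanbia_obj() and wanbia_grad(), and the inputs logp (an N × n × r_C array holding the maximum-likelihood estimates of log P(x_i^(j) | c)), logprior (the log class prior) and y (the observed class indices), are assumptions made for this sketch, not part of bnclassify’s API:

```r
# Hypothetical sketch: negative conditional log-likelihood of the weights w and
# its gradient, following the formula above (complete data assumed).
wanbia_obj <- function(w, logp, logprior, y) {
  n <- dim(logp)[1]
  scores <- apply(logp, c(1, 3), function(v) sum(w * v)) +      # sum_i w_i log P(x_i | c)
    matrix(logprior, n, length(logprior), byrow = TRUE)         # + log P(c)
  norm <- apply(scores, 1, function(s) { m <- max(s); m + log(sum(exp(s - m))) })
  -sum(scores[cbind(seq_len(n), y)] - norm)                     # negative CLL
}

wanbia_grad <- function(w, logp, logprior, y) {
  n <- dim(logp)[1]
  scores <- apply(logp, c(1, 3), function(v) sum(w * v)) +
    matrix(logprior, n, length(logprior), byrow = TRUE)
  norm <- apply(scores, 1, function(s) { m <- max(s); m + log(sum(exp(s - m))) })
  post <- exp(scores - norm)                                    # P(c | x; w) per instance
  obs  <- vapply(seq_len(dim(logp)[2]),
                 function(i) sum(logp[cbind(seq_len(n), i, y)]), numeric(1))
  expd <- vapply(seq_len(dim(logp)[2]),
                 function(i) sum(post * logp[, i, ]), numeric(1))
  -(obs - expd)                                                 # gradient of the negative CLL
}

# e.g.: optim(rep(1, dim(logp)[2]), wanbia_obj, wanbia_grad, method = "L-BFGS-B",
#             logp = logp, logprior = logprior, y = y)$par
```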

The AWNB parameter estimation method is intended for the naive Bayes classifier, but in bnclassify it can be applied to any model. It exponentiates the conditional probability of a predictor,

$$P(X_i = k \mid Pa(X_i) = j)^{w_i} = \theta_{ijk}^{w_i},$$

reducing or maintaining its effect on the class posterior, since $w_i \in [0, 1]$ (note that a weight $w_i = 0$ omits Xi from the model, rendering it independent of the class). This is equivalent to updating the parameter estimates θijk given by Equation 2 in the “overview” vignette as

$$\theta_{ijk}^{AWNB} = \frac{\theta_{ijk}^{w_i}}{\sum_{k=1}^{r_i} \theta_{ijk}^{w_i}},$$

and plugging those estimates into Equation 1 in the “overview” vignette. Weights wi are computed as

$$w_i = \frac{1}{M}\sum_{t=1}^M \frac{1}{\sqrt{d_{ti}}},$$

where M is the number of bootstrap subsamples from 𝒟 and dti is the minimum testing depth of Xi in an unpruned classification tree learned from the t-th subsample (the term $\frac{1}{\sqrt{d_{ti}}}$ is taken as 0 if Xi is omitted from the t-th tree).
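For reference, a usage sketch with the package itself, assuming (as in the package documentation) that lp() exposes the awnb_trees and awnb_bootstrap arguments for the number of bootstrap trees M and the subsample fraction:

```r
library(bnclassify)
data(car)
nb_struct <- nb('class', car)
# AWNB parameter estimation: M = 10 bootstrap trees, each on 50% of the instances
nb_awnb <- lp(nb_struct, car, smooth = 1, awnb_trees = 10, awnb_bootstrap = 0.5)
```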

bnclassify implements prediction for augmented naive Bayes models with complete data. This amounts to multiplying the corresponding entries of the local distributions and is done in logarithmic space, applying the log-sum-exp trick before normalizing, in order to reduce the chance of underflow.
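A minimal sketch of that normalization step; posterior() is a hypothetical helper, and log_joint is assumed to hold the unnormalized log class posteriors of one instance (the summed log entries of the relevant local distributions):

```r
# Hypothetical illustration of the log-sum-exp trick: normalize unnormalized
# log class posteriors without underflowing.
posterior <- function(log_joint) {
  m <- max(log_joint)
  p <- exp(log_joint - m)             # shift before exponentiating to avoid underflow
  p / sum(p)
}
posterior(c(-1050, -1052, -1060))     # exp() of these would underflow to 0 directly
```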

With incomplete data this cannot be done, and bnclassify therefore uses the gRain package to perform exact inference. Such inference is time-consuming and, therefore, wrapper algorithms can be very slow when applied to incomplete data sets.