Methods’ details for the bnclassify package

All notation and acronyms used here are introduced in vignette("overview", package="bnclassify").

The remaining vignettes can be listed with vignette(package = "bnclassify").

The CL-ODE algorithm adapts the Chow-Liu algorithm in order to find the maximum likelihood TAN model in time quadratic in n. Since the same method can be used to find ODE models which maximize decomposable penalized log-likelihood scores, bnclassify also uses it to maximize Akaike’s information criterion (AIC) and the BIC. While maximizing likelihood will always render a TAN, i.e., a network with n − 1 augmenting arcs, maximizing a penalized log-likelihood may render a FAN, since the inclusion of some arcs might degrade the penalized log-likelihood score.
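For illustration, a minimal example with the car data set bundled with the package, assuming that tan_cl()’s score argument accepts 'loglik', 'aic' and 'bic':

```r
library(bnclassify)
data(car)  # car evaluation data set shipped with bnclassify; the class variable is 'class'
tan_ml  <- tan_cl('class', car)                  # maximum-likelihood TAN (n - 1 augmenting arcs)
tan_aic <- tan_cl('class', car, score = 'aic')   # penalized scores; may return a FAN
tan_bic <- tan_cl('class', car, score = 'bic')
```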

Note that when the data are incomplete, CL-ODE does not necessarily return the optimal (with respect to penalized log-likelihood) ODE. Namely, that would require the computationally expensive calculation of the sufficient statistics Nijk which maximize parameter likelihood; instead, bnclassify approximates these statistics with the available case analysis heuristic described below under parameter estimation.

TAN HC and TAN HC SP may evaluate equivalent structures at each step. Adding the valid arcs Xi → Xj and Xj → Xi produces equivalent structures because of the tree (or forest) structure of the feature subgraph. Namely, |Pa(Xj) \ {C}| ≤ 1 for each feature Xj, and thus we can only add the arc Xi → Xj if Pa(Xj) = {C}. Adding an arc Xi → Xj therefore introduces no v-structures into the network (since C is a parent of both Xi and Xj), and both Xi → Xj and Xj → Xi only remove the independence between Xi and Xj. The two resulting networks thus correspond to identical factorizations of the joint distribution.

To avoid scoring equivalent structures, at each step bnclassify selects the arc Xi → Xj such that the name of Xi (that is, its column name in the data set) alphabetically precedes that of Xj. A preferable implementation would be to select between the two arcs at random.
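A small sketch of this tie-breaking rule; pick_arc() is a hypothetical helper written for illustration, not part of the package:

```r
# Hypothetical helper: return the candidate arc between features xi and xj in a
# canonical orientation, so that equivalent structures are scored only once.
pick_arc <- function(xi, xj) {
  if (xi < xj) c(from = xi, to = xj) else c(from = xj, to = xi)
}
pick_arc("safety", "buying")   # always yields buying -> safety
```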

bnclassify only handles discrete features. With fully observed data, it estimates the parameters with maximum likelihood or Bayesian estimation, according to Equation 2 in the “overview” vignette, with a single α for all local distributions. With incomplete data it uses available case analysis, substituting $N_{\cdot j \cdot}$ in Equation 2 in the “overview” vignette with $N_{i j \cdot} = \sum_{k = 1}^{r_i} N_{i j k}$, i.e., with the count of instances in which Pa(Xi) = j and Xi is observed:

$$\theta_{ijk} = \frac{N_{ijk} + \alpha}{N_{ij\cdot} + r_i \alpha}.$$
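As an illustration, the sketch below computes this estimate for a single feature and a single parent; aca_cpt() is a hypothetical helper (not bnclassify’s internal code) and assumes the parent is fully observed:

```r
# Hypothetical sketch: Bayesian estimate of P(X | Pa(X)) with a single smoothing
# parameter alpha, counting only the instances where X is observed
# (available case analysis).
aca_cpt <- function(x, pa, alpha = 1) {
  r_x <- nlevels(x)
  keep <- !is.na(x)                     # instances where X is observed
  counts <- table(x[keep], pa[keep])    # N_ijk: rows are values of X, columns parent configurations
  sweep(counts + alpha, 2, colSums(counts) + r_x * alpha, "/")
}

data(car, package = "bnclassify")
aca_cpt(car$buying, car$class)          # P(buying | class) for the car data
```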

The MANB parameter estimation method corresponds to exact Bayesian model averaging over the naive Bayes models obtained from all $2^n$ subsets of the n features, yet it is computed in time linear in n. The implementation in bnclassify follows the online appendix of the original MANB publication, extending it to the case where α ≠ 1 in Equation 2 of the “overview” vignette.

The estimate for a particular parameter θijkMANB is:

$$\theta_{ijk}^{MANB} = P(\mathcal{G}_{C \nbigCI X_i} \mid \mathcal{D})\, \theta_{ijk} + P(\mathcal{G}_{C \bigCI X_i} \mid \mathcal{D})\, \theta_{ik},$$

where $P(\mathcal{G}_{C \nbigCI X_i} \mid \mathcal{D})$ is the local posterior probability of an arc from C to Xi, whereas $P(\mathcal{G}_{C \bigCI X_i} \mid \mathcal{D}) = 1 - P(\mathcal{G}_{C \nbigCI X_i} \mid \mathcal{D})$ is that of the absence of such an arc (which is equivalent to omitting Xi from the model), while θijk and θik are the Bayesian parameter estimates obtained with Equation 2 in the “overview” vignette given the corresponding structures (i.e., with and without the arc from C to Xi).

Using Bayes’ theorem,

$$P(\mathcal{G}_{C \nbigCI X_i} \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \mathcal{G}_{C \nbigCI X_i})\, P(\mathcal{G}_{C \nbigCI X_i})}{P(\mathcal{D} \mid \mathcal{G}_{C \nbigCI X_i})\, P(\mathcal{G}_{C \nbigCI X_i}) + P(\mathcal{D} \mid \mathcal{G}_{C \bigCI X_i})\, P(\mathcal{G}_{C \bigCI X_i})}.$$

Assuming a Dirichlet prior with hyperparameter α = 1 in Equation 2 of the “overview” vignette, Equations (6) and (7) in the above-mentioned online appendix give formulas for $P(\mathcal{D} \mid \mathcal{G}_{C \nbigCI X_i})$ and $P(\mathcal{D} \mid \mathcal{G}_{C \bigCI X_i})$:

$$P(\mathcal{D} \mid \mathcal{G}_{C \nbigCI X_i}) = \prod_{j=1}^{r_C} \frac{(r_i - 1)!}{(N_{ij\cdot} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}!$$

and

$$P(\mathcal{D} \mid \mathcal{G}_{C \bigCI X_i}) = \frac{(r_i - 1)!}{(N_{i\cdot\cdot} + r_i - 1)!} \prod_{k=1}^{r_i} N_{i\cdot k}!,$$

where $N_{i \cdot k} = \sum_{j=1}^{r_C} N_{ijk}$ and $N_{i \cdot \cdot} = \sum_{k=1}^{r_i} N_{i \cdot k}$. Noting that the above are special cases of the general Bayesian–Dirichlet marginal likelihood, we can generalize this for any hyperparameter α > 0 as follows:

$$P(\mathcal{D} \mid \mathcal{G}_{C \nbigCI X_i}) = \prod_{j=1}^{r_C} \frac{\Gamma(r_i \alpha)}{\Gamma(N_{ij\cdot} + r_i \alpha)} \prod_{k=1}^{r_i} \frac{\Gamma(N_{ijk} + \alpha)}{\Gamma(\alpha)}$$

and

$$P(\mathcal{D} \mid \mathcal{G}_{C \bigCI X_i}) = \frac{\Gamma(r_i \alpha)}{\Gamma(N_{i\cdot\cdot} + r_i \alpha)} \prod_{k=1}^{r_i} \frac{\Gamma(N_{i\cdot k} + \alpha)}{\Gamma(\alpha)}.$$
Following the original MANB proposal, bnclassify assumes that the local prior probability of an arc from the class to a feature Xi, $P(\mathcal{G}_{C \nbigCI X_i})$, is the same for every feature and is given by the user. Denoting this prior by $p$, the prior of a naive Bayes structure 𝒢 with arcs from the class to $a$ out of the $n$ features, and no arcs to the remaining $n - a$ features, is then

$$P(\mathcal{G}) = p^{a}\, (1 - p)^{n - a}.$$

Note that bnclassify computes the above in logarithmic space to reduce numerical errors.
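The following sketch puts the above pieces together for a single feature with complete data. manb_theta() is a hypothetical helper written for illustration, not the package’s internal code; it assumes a factor feature x, a factor class y, a Dirichlet hyperparameter alpha and a single user-given arc prior:

```r
# Hypothetical sketch of MANB for one feature: marginal likelihoods with and
# without the arc C -> X, the posterior arc probability (in log space), and the
# model-averaged parameter estimates.
manb_theta <- function(x, y, alpha = 1, prior = 0.5) {
  r_x <- nlevels(x)
  n_jk <- table(y, x)                    # N_ijk: class values (rows) by feature values (columns)
  n_k  <- colSums(n_jk)                  # N_i.k: counts ignoring the class
  # Log marginal likelihood with the arc (one Dirichlet per class value)
  ll_arc <- sum(lgamma(r_x * alpha) - lgamma(rowSums(n_jk) + r_x * alpha)) +
    sum(lgamma(n_jk + alpha) - lgamma(alpha))
  # Log marginal likelihood without the arc (a single Dirichlet)
  ll_no <- lgamma(r_x * alpha) - lgamma(sum(n_k) + r_x * alpha) +
    sum(lgamma(n_k + alpha) - lgamma(alpha))
  # Posterior probability of the arc, computed in log space
  a <- ll_arc + log(prior); b <- ll_no + log(1 - prior); m <- max(a, b)
  p_arc <- exp(a - m) / (exp(a - m) + exp(b - m))
  # Bayesian estimates with and without the arc, then model averaging
  theta_arc <- sweep(n_jk + alpha, 1, rowSums(n_jk) + r_x * alpha, "/")
  theta_no  <- (n_k + alpha) / (sum(n_k) + r_x * alpha)
  p_arc * theta_arc + (1 - p_arc) * matrix(theta_no, nrow(theta_arc), r_x, byrow = TRUE)
}

data(car, package = "bnclassify")
manb_theta(car$buying, car$class, alpha = 1, prior = 0.5)
```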

The WANBIA method updates the naive Bayes parameters with a single exponent ‘weight’ per feature. The weights are computed by optimizing either the conditional log-likelihood or the root mean squared error of the predictions. bnclassify implements the conditional log-likelihood optimization as described in the original paper, namely optimizing it with the L-BFGS algorithm, with the i-th component of its gradient g given by

$$g_i = \sum_{j=1}^{N} \left( \log P(x_i^{(j)} \mid c^{(j)}) - \sum_{c} P(c \mid \mathbf{x}^{(j)}; \mathbf{w}) \log P(x_i^{(j)} \mid c) \right),$$

where the probabilities are those estimated with maximum likelihood, i.e., without taking weights into account, whereas P(c ∣ x; w) takes weights into account. This corresponds to discriminative learning of parameters, as a discriminative, rather than generative, objective function is optimized.

If $X_i$ is unobserved for some instance $j$, that is, $x_i^{(j)}$ is missing, then we replace both $P(X_i = x_i^{(j)} \mid c^{(j)})$ and $P(X_i = x_i^{(j)} \mid c)$ with 1 in the gradient above (as a leaf in the Bayesian network, an unobserved $X_i$ does not affect the conditional log-likelihood).
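As a rough illustration of this optimization, the sketch below defines the negative conditional log-likelihood and its gradient for use with a generic L-BFGS routine such as base R’s optim(method = "L-BFGS-B"). The helpers wanbia_obj() and wanbia_grad(), and the inputs logp (an N × n × r_C array holding the maximum-likelihood estimates of log P(x_i^(j) | c)), logprior (the log class prior) and y (the observed class indices), are assumptions made for this sketch, not part of bnclassify’s API:

```r
# Hypothetical sketch: negative conditional log-likelihood of the weights w and
# its gradient, following the formula above (complete data assumed).
wanbia_obj <- function(w, logp, logprior, y) {
  n <- dim(logp)[1]
  scores <- apply(logp, c(1, 3), function(v) sum(w * v)) +      # sum_i w_i log P(x_i | c)
    matrix(logprior, n, length(logprior), byrow = TRUE)         # + log P(c)
  norm <- apply(scores, 1, function(s) { m <- max(s); m + log(sum(exp(s - m))) })
  -sum(scores[cbind(seq_len(n), y)] - norm)                     # negative CLL
}

wanbia_grad <- function(w, logp, logprior, y) {
  n <- dim(logp)[1]
  scores <- apply(logp, c(1, 3), function(v) sum(w * v)) +
    matrix(logprior, n, length(logprior), byrow = TRUE)
  norm <- apply(scores, 1, function(s) { m <- max(s); m + log(sum(exp(s - m))) })
  post <- exp(scores - norm)                                    # P(c | x; w) per instance
  obs  <- vapply(seq_len(dim(logp)[2]),
                 function(i) sum(logp[cbind(seq_len(n), i, y)]), numeric(1))
  expd <- vapply(seq_len(dim(logp)[2]),
                 function(i) sum(post * logp[, i, ]), numeric(1))
  -(obs - expd)                                                 # gradient of the negative CLL
}

# e.g.: optim(rep(1, dim(logp)[2]), wanbia_obj, wanbia_grad, method = "L-BFGS-B",
#             logp = logp, logprior = logprior, y = y)$par
```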

The AWNB parameter estimation method is intended for the naive Bayes classifier, but in bnclassify it can be applied to any model. It exponentiates the conditional probability of a predictor,

$$P(X_i = k \mid Pa(X_i) = j)^{w_i} = \theta_{ijk}^{w_i},$$

reducing or maintaining its effect on the class posterior, since $w_i \in [0, 1]$ (note that a weight $w_i = 0$ omits Xi from the model, rendering it independent of the class). This is equivalent to updating the parameter estimates θijk given by Equation 2 in the “overview” vignette as

$$\theta_{ijk}^{AWNB} = \frac{\theta_{ijk}^{w_i}}{\sum_{k=1}^{r_i} \theta_{ijk}^{w_i}},$$

and plugging those estimates into Equation 1 in the “overview” vignette. Weights wi are computed as

$$w_i = \frac{1}{M}\sum_{t=1}^M \frac{1}{\sqrt{d_{ti}}},$$

where M is the number of bootstrap subsamples from 𝒟 and dti is the minimum testing depth of Xi in an unpruned classification tree learned from the t-th subsample (the term $\frac{1}{\sqrt{d_{ti}}}$ is taken as 0 if Xi is omitted from the t-th tree).
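For reference, a usage sketch with the package itself, assuming (as in the package documentation) that lp() exposes the awnb_trees and awnb_bootstrap arguments for the number of bootstrap trees M and the subsample fraction:

```r
library(bnclassify)
data(car)
nb_struct <- nb('class', car)
# AWNB parameter estimation: M = 10 bootstrap trees, each on 50% of the instances
nb_awnb <- lp(nb_struct, car, smooth = 1, awnb_trees = 10, awnb_bootstrap = 0.5)
```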

bnclassify implements prediction for augmented naive Bayes models with complete data. This amounts to multiplying the corresponding entries of the local distributions and is done in logarithmic space, applying the log-sum-exp trick before normalizing, in order to reduce the chance of underflow.
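A minimal sketch of that normalization step; posterior() is a hypothetical helper, and log_joint is assumed to hold the unnormalized log class posteriors of one instance (the summed log entries of the relevant local distributions):

```r
# Hypothetical illustration of the log-sum-exp trick: normalize unnormalized
# log class posteriors without underflowing.
posterior <- function(log_joint) {
  m <- max(log_joint)
  p <- exp(log_joint - m)             # shift before exponentiating to avoid underflow
  p / sum(p)
}
posterior(c(-1050, -1052, -1060))     # exp() of these would underflow to 0 directly
```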

With incomplete data this cannot be done, and bnclassify therefore uses the gRain package to perform exact inference. Such inference is time-consuming and, therefore, wrapper algorithms can be very slow when applied to incomplete data sets.