bnclassify: Learning Bayesian Network Classifiers

Introduction

Bayesian network classifiers are competitive classifiers with the added benefit of interpretability. Their simplest member, the naive Bayes (NB), is well known. More elaborate models exist, taking advantage of the Bayesian network formalism for representing complex probability distributions. The tree augmented naive Bayes (TAN) and the averaged one-dependence estimators (AODE) are among the most prominent.

A Bayesian network classifier is simply a Bayesian network applied to classification, that is, to the prediction of the probability P(c ∣ x) of some discrete (class) variable C given some features X. The \CRANpkg{bnlearn} package already provides state-of-the-art algorithms for learning Bayesian networks from data. Yet, learning classifiers is specific, as the implicit goal is to estimate P(c ∣ x) rather than the joint probability P(x, c). Thus, specific search algorithms, network scores, parameter estimation and inference methods have been devised for this setting. In particular, many search algorithms consider a restricted space of structures, such as that of augmented naive Bayes models. Unlike with general Bayesian networks, it makes sense to omit a feature Xi from the model as long as the estimation of P(c ∣ x) is no better than that of P(c ∣ x \ xi). Discriminative scores, related to the estimation of P(c ∣ x) rather than P(x, c), are used to learn both structure and parameters. Some of the prominent classifiers are ensembles of networks, and there are even heuristics applied at inference time, such as the lazy elimination technique. Many of these methods are, at best, available in standalone implementations published alongside the original papers.

The bnclassify package implements state-of-the-art algorithms for learning structure and parameters. The implementation is efficient enough to allow for time-consuming discriminative scores on relatively large data sets. It provides utility functions for prediction and inference, model evaluation with network scores and cross-validated estimation of predictive performance, and model analysis, such as querying structure type or graph plotting via the \CRANpkg{igraph} package. It integrates with the \CRANpkg{caret} and \CRANpkg{mlr} packages for straightforward use in machine learning pipelines. Currently it supports only discrete variables. The functionalities are illustrated in an introductory vignette, while an additional vignette provides details on the implemented methods. The package includes over 200 unit and integration tests that give a code coverage of 94 percent.

The rest of this paper is structured as follows. Section provides background on Bayesian network classifiers. Section describes the implemented functionalities. Section illustrates usage with a synthetic data set. Section discusses implementation while Section briefly surveys related software. Finally, Section concludes and outlines future work.

Background

A Bayesian network classifier is a Bayesian network used for predicting a discrete class variable C. It assigns x, an observation of n predictor variables (features) X = (X1, …, Xn), to the most probable class:

$$ c^* = \argmax_c P(c \mid \mathbf{x}) = \argmax_c P(\mathbf{x}, c).$$

The classifier factorizes P(x, c) according to a Bayesian network ℬ = ⟨𝒢, θ⟩. 𝒢 is a directed acyclic graph with a node for each variable in (X, C), encoding conditional independencies: a variable X is independent of its nondescendants in 𝒢 given the values pa(x) of its parents. 𝒢 thus factorizes the joint distribution into local (conditional) distributions over subsets of variables:

$$P(\mathbf{x}, c) = P(c \mid \mathbf{pa}(c)) \prod_{i=1}^{n} P(x_i \mid \mathbf{pa}(x_i)).$$

Local distributions P(C ∣ pa(c)) and P(Xi ∣ pa(xi)) are specified by parameters θ(C, pa(c)) and θ(Xi, pa(xi)), with θ = {θ(C, pa(c)), θ(X1, pa(x1)), …, θ(Xn, pa(xn))}. It is common to assume each local distribution has a parametric form, such as the multinomial, for discrete variables, and the Gaussian for real-valued variables.

We learn ℬ from a data set 𝒟 = {(x1, c1), …, (xN, cN)} of N observations of X and C. There are two main approaches to learning the structure 𝒢 from 𝒟: a) testing for conditional independence among triplets of sets of variables and b) searching a space of possible structures in order to optimize a network quality score. Under assumptions such as a limited number of parents per variable, approach a) can produce the correct network in polynomial time. On the other hand, finding the optimal structure, even with at most two parents per variable, is NP-hard. Thus, heuristic search algorithms, such as greedy hill-climbing, are commonly used. Ways to reduce model complexity, in order to avoid overfitting the training data 𝒟, include searching in restricted structure spaces and penalizing dense structures with appropriate scores.
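
To make the search concrete, the following is a minimal sketch of a generic greedy hill-climbing loop; initial_structure, neighbors and score are hypothetical placeholders (a starting DAG, the candidates generated by the search operators, and a network quality score such as penalized log-likelihood) and are not part of any package API.

hill_climb <- function(initial_structure, neighbors, score, epsilon = 0) {
  # Greedy search: repeatedly move to the best-scoring neighbor until no
  # candidate improves the current score by more than epsilon.
  current <- initial_structure
  current_score <- score(current)
  repeat {
    candidates <- neighbors(current)
    if (length(candidates) == 0) break
    scores <- vapply(candidates, score, numeric(1))
    best <- which.max(scores)
    if (scores[best] - current_score <= epsilon) break
    current <- candidates[[best]]
    current_score <- scores[best]
  }
  current
}

# Toy usage on a stand-in search space (integers instead of DAGs), only to
# show that the loop runs; a real application would search over structures.
toy_neighbors <- function(s) list(s - 1L, s + 1L)
toy_score <- function(s) -(s - 3)^2
hill_climb(0L, toy_neighbors, toy_score)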

Common scores in structure learning are the penalized log-likelihood scores, such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). They measure the model's fit to the empirical distribution of (X, C), adding a penalty term that is a function of structure complexity. They are decomposable with respect to 𝒢, allowing for efficient search algorithms. Yet, with limited N and a large n, discriminative scores based on P(c ∣ x), such as conditional log-likelihood and classification accuracy, are more suitable to the classification task. These, however, are not decomposable according to 𝒢. While one can add a complexity penalty to discriminative scores, they are instead often cross-validated to induce a preference towards structures that generalize better, making their computation even more time demanding.

For Bayesian network classifiers, a common structure space is that of augmented naive Bayes models (see Figure ), factorizing P(X, C) as

$$P(\mathbf{x}, c) = P(c) \prod_{i=1}^{n} P(x_i \mid \mathbf{pa}(x_i)),$$

with C ∈ Pa(Xi) for all Xi and Pa(C) = ∅. Models of different complexity arise by extending or shrinking the parent sets Pa(Xi), ranging from the NB, with Pa(Xi) = {C} for all Xi, to those with a limited-size Pa(Xi), to those with unbounded Pa(Xi). While the NB can only represent linearly separable classes, more complex models are more expressive. Simpler models, with sparser Pa(Xi), may perform better with less training data, due to their lower variance, yet worse with more data, as the bias due to wrong independence assumptions tends to dominate the error.

The algorithms that produce the above structures are generally instances of greedy hill-climbing, with arc inclusion and removal as their search operators. Some add node inclusion or removal, thus embedding feature selection within structure learning. Alternatives include the adaptation of the Chow-Liu algorithm to find the optimal one-dependence estimator (ODE) with respect to decomposable penalized log-likelihood scores in time quadratic in n. Some structures, such as NB or AODE, are fixed and thus require no search.

Given 𝒢, learning θ in order to best approximate the underlying P(X, C) is straightforward. For discrete variables Xi and Pa(Xi), Bayesian estimation can be obtained in closed form by assuming a Dirichlet prior over θ. With all Dirichlet hyper-parameters equal to α,

$$\theta_{ijk} = \frac{N_{ijk} + \alpha}{N_{j} + r_i \alpha},$$

where Nijk is the number of instances in 𝒟 such that Xi = k and pa(xi) = j, corresponding to the j-th possible instantiation of pa(xi), Nj is the number of instances in which pa(xi) = j, while ri is the cardinality of Xi. Setting α = 0 yields the maximum likelihood estimate of θijk. With incomplete data, the parameters of the local distributions are no longer independent and we cannot separately maximize the likelihood for each Xi as above. Optimizing the likelihood then requires a time-consuming algorithm such as expectation maximization, which only guarantees convergence to a local optimum.
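
As an illustration of the closed-form estimate above, the following minimal sketch (ours, not the package's internal code) computes a conditional probability table for a single discrete feature given one parent; the function name estimate_cpt is hypothetical.

estimate_cpt <- function(x, pa, alpha = 1) {
  # N_ijk: counts of X = k for each configuration j of the parent
  counts <- table(x, pa)
  r_i <- nrow(counts)  # cardinality of X
  # (N_ijk + alpha) / (N_j + r_i * alpha), applied column-wise
  sweep(counts + alpha, 2, colSums(counts) + r_i * alpha, "/")
}

# Toy usage with made-up data
x <- factor(c("a", "a", "b", "b", "a"))
pa <- factor(c("p", "q", "p", "q", "q"))
estimate_cpt(x, pa, alpha = 1)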

While the NB can separate any two linearly separable classes given the appropriate parameters θ, learning θ by approximating P(x, c) cannot recover the optimal values in some cases. Several methods therefore learn a weight wi ∈ [0, 1] for each feature and then update θ as

$$\theta_{ijk}^{w} = \frac{\theta_{ijk}^{w_i}}{\sum_{k'=1}^{r_i} \theta_{ijk'}^{w_i}}.$$

A wi < 1 reduces the effect of Xi on the class posterior, with wi = 0 omitting Xi from the model, making weighting more general than feature selection. The weights can be found by maximizing a discriminative score or by computing the usefulness of a feature in a decision tree. While weighting has mainly been applied to naive Bayes models, a generalization for augmented naive Bayes classifiers has recently been developed.
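
The following minimal sketch (ours; the exact update used by WANBIA or AWNB may differ) applies such a weight to a single conditional probability table by exponentiating with wi and renormalizing, as in the update shown above.

weight_cpt <- function(cpt, w) {
  # Exponentiate by the feature weight and renormalize each column
  # (one column per parent/class configuration).
  weighted <- cpt ^ w
  sweep(weighted, 2, colSums(weighted), "/")
}

# Toy CPT with rows = values of X and columns = classes
cpt <- matrix(c(0.7, 0.2, 0.1,
                0.1, 0.3, 0.6),
              nrow = 3, dimnames = list(x = c("a", "b", "c"), class = c("c1", "c2")))
weight_cpt(cpt, w = 0.5)  # shrinks the effect of X on the posterior
weight_cpt(cpt, w = 0)    # uniform: X no longer affects the posterior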

Another parameter estimation method for the naive Bayes is Bayesian model averaging over the 2^n possible naive Bayes structures with up to n features. It is computed in time linear in n and provides the posterior probability of an arc from C to Xi.

Computing P(c ∣ x) for a fully observed x means multiplying the corresponding parameters θ. With an incomplete x, however, exact inference requires summing out the unobserved features and is NP-hard in the general case, yet it can be tractable with limited-complexity structures. The AODE ensemble computes P(c ∣ x) as the average of the Pi(c ∣ x) of its n base models. A special case is the lazy elimination heuristic, which omits xi from the computation of P(c ∣ x) if P(xi ∣ xj) = 1 for some xj.
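
For the AODE averaging step, a minimal sketch (ours, not the package's implementation) is simply:

# 'posteriors' is a list of matrices of class posteriors, one per base model,
# with rows = instances and columns = classes.
average_posteriors <- function(posteriors) {
  Reduce(`+`, posteriors) / length(posteriors)
}

# Toy usage with two base models and two instances
p1 <- matrix(c(0.9, 0.1, 0.2, 0.8), nrow = 2, byrow = TRUE,
             dimnames = list(NULL, c("c1", "c2")))
p2 <- matrix(c(0.7, 0.3, 0.4, 0.6), nrow = 2, byrow = TRUE,
             dimnames = list(NULL, c("c1", "c2")))
average_posteriors(list(p1, p2))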

Functionalities

The package has four groups of functionalities:

  1. Learning network structure and parameters
  2. Analyzing the model
  3. Evaluating the model
  4. Predicting with the model

Learning is split into two separate steps, the first being structure learning and the second, which is optional, parameter learning. The obtained models can be evaluated, used for prediction, or analyzed. The following provides a brief overview of this workflow. For details on some of the underlying methods, please see the vignette on the implemented methods.

Structures

The learning algorithms produce the following network structures:

  • Naive Bayes (NB) (Figure )
  • One-dependence estimators (ODE)
    • Tree-augmented naive Bayes (TAN) (Figure )
    • Forest-augmented naive Bayes (FAN) (Figure )
  • k-dependence Bayesian classifier (k-DB)
  • Semi-naive Bayes (SNB) (Figure )
  • Averaged one-dependence estimators (AODE)

Figure  shows some of these structures and their factorizations of P(x, c). We use k-DB in the sense meant by , rather than that by , as we impose no minimum on the number of augmenting arcs. SNB is the only structure whose complexity is not a priori bounded: the feature subgraph might be complete in the extreme case.

Algorithms

Each structure learning algorithm is implemented by a single R function. Table  lists these algorithms along with the corresponding structures that they produce, the scores they can be combined with, and their R functions. Below we provide their abbreviations and references, comment on them briefly, and illustrate the function calls.

Fixed structure

We implement two algorithms:

  • NB
  • AODE
The NB and AODE structures are fixed given the number of variables, and thus no search is required to estimate them from data. For example, we can get a NB structure with

nb <- nb('class', car)

where 'class' is the name of the class variable C and car is the data set containing observations of C and X.

Optimal ODEs with decomposable scores

We implement one algorithm:

  • Chow-Liu for ODEs (CL-ODE; )

Maximizing log-likelihood will always produce a TAN, while maximizing penalized log-likelihood may produce a FAN, since including some arcs can degrade such a score. With incomplete data our implementation does not guarantee the optimal ODE, as that would require computing maximum likelihood parameters. The arguments of the tan_cl() function are the network score to use and, optionally, the root of the features' subgraph:

tan_cl('class', car, score = 'aic')

Greedy hill-climbing with global scores

The package implements five algorithms:

  • Hill-climbing tree augmented naive Bayes (TAN-HC)
  • Hill-climbing super-parent tree augmented naive Bayes (TAN-HCSP)
  • Forward sequential selection and joining (FSSJ)
  • Backward sequential elimination and joining (BSEJ)
  • k-dependence Bayesian classifier (k-DB)

These algorithms use the cross-validated estimate of predictive accuracy as a score. Only FSSJ and BSEJ perform feature selection. The arguments of the corresponding functions include the number of cross-validation folds (k) and the minimal absolute score improvement (epsilon) required for continuing the search:

fssj <- fssj('class', car, k = 5, epsilon = 0)

Parameters

The package only handles discrete features. With fully observed data, it estimates the parameters with maximum likelihood or Bayesian estimation, as given above, with a single α for all local distributions. With incomplete data it uses available case analysis and substitutes Nj in the above estimate with $N_{i j \cdot} = \sum_{k = 1}^{r_i} N_{i j k}$, i.e., with the count of instances in which Pa(Xi) = j and Xi is observed.
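
The following minimal sketch (ours; estimate_cpt_aca is a hypothetical helper, not a package function) illustrates this available case analysis for a single feature: counts are taken only over rows where the feature and its parent are both observed.

estimate_cpt_aca <- function(x, pa, alpha = 1) {
  observed <- !is.na(x) & !is.na(pa)
  counts <- table(x[observed], pa[observed])  # N_ijk over observed cases only
  r_i <- nrow(counts)
  # Denominator uses N_ij. = sum_k N_ijk, i.e., cases where X_i is observed
  sweep(counts + alpha, 2, colSums(counts) + r_i * alpha, "/")
}

# Toy usage with a missing entry
x <- factor(c("a", NA, "b", "b", "a"))
pa <- factor(c("p", "q", "p", "q", "q"))
estimate_cpt_aca(x, pa, alpha = 1)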

We implement two methods for weighted naive Bayes parameter estimation:

  • Weighting attributes to alleviate naive Bayes’ independence assumption (WANBIA)
  • Attribute-weighted naive Bayes (AWNB)

And one method for estimation by means of Bayesian model averaging over all NB structures with up to n features:

  • Model averaged naive Bayes (MANB)

It makes little sense to apply WANBIA, MANB and AWNB to structures other than the NB. WANBIA, for example, learns the weights by optimizing the conditional log-likelihood of the NB. Parameter learning is done with the lp() function. For example,

lp(nb, car, smooth = 1, manb_prior = 0.5)

computes Bayesian parameter estimates with α = 1 (the smooth argument) for all local distributions, and updates them with the MANB estimation obtained with a 0.5 prior probability for each class-to-feature arc.

Utilities

Single-structure-learning functions, as opposed to those that learn an ensemble of structures, return an S3 object of class bnc_dag. The following functions can be invoked on such objects:

  • Plot the network: plot()
  • Query model type: is_nb(), is_ode(), is_semi_naive(), …
  • Query model properties: narcs(), features(), …
  • Convert to a gRain object: as_grain()

Ensembles are of type bnc_aode and only print and model type queries can be applied to such objects. Fitting the parameters (by calling lp() or bnc()) of a bnc_dag produces a bnc_bn object. In addition to all bnc_dag functions, the following are meaningful:

  • Predict class labels and class posterior probability: predict()
  • Predict joint distribution: compute_joint()
  • Network scores: logLik(), AIC(), BIC(), cLogLik()
  • Cross-validated accuracy: cv()
  • Query model properties: params(), nparams(), …
  • Parameter weights: manb_arc_posterior(), …

The above functions for bnc_bn objects can also be applied to an ensemble with fitted parameters.
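
For instance, a minimal sketch combining a few of these calls on the car data set used below (only functions shown elsewhere in this document are used):

library(bnclassify)
data(car)
nb <- lp(nb('class', car), car, smooth = 0.01)  # structure plus parameters
narcs(nb)          # number of arcs in the network
features(nb)       # names of the feature variables
params(nb)$safety  # CPT of the 'safety' feature
plot(nb)           # plot the network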

Documentation

This paper provides an overview of the package's functionalities. Calling ?bnclassify gives a more concise overview, with pointers to the documentation of the individual functions. The introductory vignette presents more detailed usage examples, while an additional vignette provides details on the implemented methods.

Usage example

The available functionalities can be split into four groups:

  1. Learning network structure and parameters
  2. Analyzing the model
  3. Evaluating the model
  4. Predicting with the model

We illustrate these functionalities with the synthetic car data set, which has six features. We begin with a simple example for each functionality group and then elaborate on the options in the following sections. We first load the package and the data set,

library(bnclassify)
data(car)

then learn a naive Bayes structure and its parameters,

nb <- nb('class', car) 
nb <- lp(nb, car, smooth = 0.01) 

get the number of arcs in the network,

narcs(nb)
[1] 6

get the 10-fold cross-validation estimate of accuracy,

cv(nb, car, k = 10) 
[1] 0.8588107

and finally classify the entire data set

p <- predict(nb, car)
head(p) 
[1] unacc unacc unacc unacc unacc unacc
Levels: unacc acc good vgood

Learning

The functions for structure learning correspond to the different algorithms. They all receive the name of the class variable and the data set as their first two arguments, which are followed by optional arguments. The following runs the CL-ODE algorithm with the AIC score, followed by the FSSJ algorithm, to learn another model:

ode_cl_aic <- tan_cl('class', car, score = 'aic')   
suppressWarnings(RNGversion("3.5.0"))
set.seed(3)
fssj <- fssj('class', car, k = 5, epsilon = 0)

The bnc() function is a shorthand for learning structure and parameters in a single step,

ode_cl_aic <- bnc('tan_cl', 'class', car, smooth = 1, dag_args = list(score = 'aic'))

where the first argument is the name of the structure learning function, while its optional arguments go in dag_args.

Analyzing

Printing a model, such as the ode_cl_aic object above, provides basic information about it.

ode_cl_aic

  Bayesian network classifier

  class variable:        class 
  num. features:   6 
  num. arcs:   9 
  free parameters:   131 
  learning algorithm:    tan_cl 

While plotting the network is especially useful for small networks, printing the structure with modelstring() may be more useful for larger ones:

ms <- modelstring(ode_cl_aic)
strwrap(ms, width = 60)
[1] "[class] [buying|class] [doors|class] [persons|class]"
[2] "[maint|buying:class] [safety|persons:class]"         
[3] "[lug_boot|safety:class]"                             

We can query the type of structure with functions such as is_ode(); params() lets us access the conditional probability tables (CPTs); while features() lists the features:

is_ode(ode_cl_aic)
[1] TRUE
params(nb)$buying
       class
buying         unacc          acc         good        vgood
  low   0.2132243562 0.2317727320 0.6664252607 0.5997847478
  med   0.2214885458 0.2994740131 0.3332850521 0.3999077491
  high  0.2677680077 0.2812467451 0.0001448436 0.0001537515
  vhigh 0.2975190903 0.1875065097 0.0001448436 0.0001537515
length(features(fssj))
[1] 5

For example, fssj() has selected five out of six features.

manb_arc_posterior() provides the MANB posterior probabilities for the arcs from the class to each of the features:

manb <- lp(nb, car, smooth = 0.01, manb_prior = 0.5)
round(manb_arc_posterior(manb))
  buying    maint    doors  persons lug_boot   safety 
       1        1        0        1        1        1 

With a posterior probability of 0% for the arc from class to doors, and 100% for all others, MANB renders doors independent from the class while leaving the other features' parameters unaltered. We can see this by printing out the CPTs:

params(manb)$doors 
       class
doors   unacc  acc good vgood
  2      0.25 0.25 0.25  0.25
  3      0.25 0.25 0.25  0.25
  4      0.25 0.25 0.25  0.25
  5more  0.25 0.25 0.25  0.25
all.equal(params(manb)$buying, params(nb)$buying)
[1] TRUE

For more functions for querying a structure with parameters (a bnc_bn object), see the package documentation. The same applies to a structure without parameters (a bnc_dag object).

Evaluating

Several scores can be computed:

logLik(ode_cl_aic, car)
'log Lik.' -13307.59 (df=131)
AIC(ode_cl_aic, car)
[1] -13438.59

The cv() function estimates the predictive accuracy of one or more models with a single run of stratified cross-validation. In the following we assess the above models produced by the NB and CL-ODE algorithms:

suppressWarnings(RNGversion("3.5.0"))
set.seed(0)
cv(list(nb = nb, ode_cl_aic = ode_cl_aic), car, k = 5, dag = TRUE)
        nb ode_cl_aic 
 0.8582303  0.9345913 

Above, k is the desired number of folds, and dag = TRUE evaluates both structure and parameter learning, while dag = FALSE would keep the structure fixed and evaluate just the parameter learning. The output gives 86% and 93% accuracy estimates for NB and CL-ODE, respectively. The \CRANpkg{caret} and \CRANpkg{mlr} packages provide additional options for evaluating predictive performance, such as different metrics, and bnclassify is integrated with both (see the package vignettes).

Predicting

As shown above, we can predict class labels with . We can also get the class posterior probabilities:

pp <- predict(nb, car, prob = TRUE)
# Show class posterior distributions for the first six instances of car
head(pp)
         unacc          acc         good        vgood
[1,] 1.0000000 2.171346e-10 8.267214e-16 3.536615e-19
[2,] 0.9999937 6.306269e-06 5.203338e-12 5.706038e-19
[3,] 0.9999908 9.211090e-06 5.158884e-12 4.780777e-15
[4,] 1.0000000 3.204714e-10 1.084552e-15 1.015375e-15
[5,] 0.9999907 9.307467e-06 6.826088e-12 1.638219e-15
[6,] 0.9999864 1.359469e-05 6.767760e-12 1.372573e-11

Implementation

With complete data, bnclassify implements prediction for augmented naive Bayes models as well as for ensembles of such models. It multiplies the corresponding parameters θ in logarithmic space, applying the log-sum-exp trick before normalizing, to reduce the chance of underflow. On instances with missing entries, it uses the \CRANpkg{gRain} package to perform exact inference, which is noticeably slower. Network plotting is implemented with the \CRANpkg{igraph} package. Some functions are implemented in C++ with \CRANpkg{Rcpp} for efficiency. The package is extensively tested, with over 200 unit and integration tests that give a 94% code coverage.
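
To illustrate the log-space computation, here is a minimal sketch (ours, not the package's code) of normalizing a vector of log joint probabilities with the log-sum-exp trick:

posterior_log_space <- function(log_joint) {
  # log_joint: log P(x, c) for each class c, up to a common constant
  m <- max(log_joint)                           # subtract the max for stability
  log_evidence <- m + log(sum(exp(log_joint - m)))
  exp(log_joint - log_evidence)                 # normalized P(c | x)
}

# Toy usage with hypothetical log joint probabilities for three classes
posterior_log_space(c(-40.1, -42.3, -45.0))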

Conclusion

The bnclassify package implements several state-of-the-art algorithms for learning Bayesian network classifiers. It also provides features such as model analysis and evaluation. It is reasonably efficient and can handle large data sets. We hope that bnclassify will be useful to practitioners as well as to researchers wishing to compare their methods to existing ones.

Future work includes handling real-valued features via conditional Gaussian models. Straightforward extensions include adding flexibility to the hill-climbing algorithm, such as restarts to avoid local minima.