Databases with missing data

The aim of this project is to study query answering over databases with missing data, where the missingness is described by a missingness graph. One of our approaches is to reduce this problem to query answering over block-independent probabilistic databases.

Query answering over block-independent probabilistic databases

We consider several notions of query answering for a numerical query \(q\) (including Boolean queries, whose answers are 0 or 1) over a block-independent probabilistic database (BIPDB) \(D\):

the expected value, defined by \(E(q(D))\)
a most probable answer is a possible answer having the highest probability
an answer on a most probable class is an answer on a most probable distribution of the tuples. In this case, the possible worlds that have the same distribution of tuples are considered equivalent: we say that they form a class.
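
The three notions can be compared by brute force on a small instance. The following is a minimal Python sketch, not the project's implementation. It assumes a single numerical column, blocks whose alternatives have probabilities summing to 1, and it reads "the same distribution of tuples" as "the same multiset of values". The instance `bipdb` and all function names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction as F
from itertools import product

# Hypothetical BIPDB with one numerical column: each block lists mutually
# exclusive (value, probability) alternatives; blocks are independent.
bipdb = [
    [(0, F(2, 5)), (1, F(7, 20)), (2, F(1, 4))],
    [(0, F(2, 5)), (1, F(7, 20)), (2, F(1, 4))],
]

def possible_worlds(db):
    """Yield (values, probability) for each choice of one alternative per block."""
    for choice in product(*db):
        prob = F(1)
        for _, p in choice:
            prob *= p
        yield tuple(v for v, _ in choice), prob

def expected_value(db, q):
    """E(q(D)): probability-weighted mean of q over all possible worlds."""
    return sum(p * q(w) for w, p in possible_worlds(db))

def most_probable_answers(db, q):
    """Answers of q whose total probability is maximal."""
    dist = defaultdict(F)
    for w, p in possible_worlds(db):
        dist[q(w)] += p
    best = max(dist.values())
    return sorted(a for a, p in dist.items() if p == best)

def answers_on_most_probable_classes(db, q):
    """Answers of q on the classes (assumed: multisets of values) of maximal probability."""
    classes = defaultdict(F)
    for w, p in possible_worlds(db):
        classes[tuple(sorted(w))] += p
    best = max(classes.values())
    return sorted({q(c) for c, p in classes.items() if p == best})

print(expected_value(bipdb, sum))
print(most_probable_answers(bipdb, sum))
print(answers_on_most_probable_classes(bipdb, sum))
```

On this hypothetical instance, with the sum as the query, the expected value is 17/10, the most probable answer is 2, and the answer on the most probable class is 1. The enumeration is exponential in the number of blocks, so it is only meant for checking small cases.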

Results

The most probable answer and the answer on the most probable class differ even in simple cases where each of the two notions takes a single value. We show this on a simple example that queries the sum. The same example also shows that the most probable answer differs from the expected value.
When the probability is uniform among the blocks and the number of blocks is a multiple of a common denominator of the values' probabilities, the distribution of the values in the most probable class follows the one defined by the probabilities. A sketch of the proof is given here. In the following, we say that such a BIPDB is uniform and balanced.
The expected value and the answer on the most probable class are equal on uniform and balanced BIPDBs containing one column, for the average and the sum.
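
The last two results can be checked numerically on a small case. The sketch below is an illustration under assumptions, not the proof: it takes a hypothetical uniform BIPDB with 4 blocks, where every block offers the values 1, 2 and 5 with probabilities 1/2, 1/4 and 1/4, and it identifies a class with the vector counting how many blocks take each value. For a uniform BIPDB, the probability of such a class is the multinomial probability of its count vector. All names are hypothetical.

```python
from fractions import Fraction as F
from math import factorial

# Hypothetical uniform BIPDB: every block has the same (value, probability) alternatives.
alternatives = [(1, F(1, 2)), (2, F(1, 4)), (5, F(1, 4))]
n_blocks = 4  # a multiple of the common denominator 4, hence "balanced"

def count_vectors(n, k):
    """Yield all k-tuples of non-negative integers summing to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in count_vectors(n - first, k - 1):
            yield (first,) + rest

def class_probability(counts, probs):
    """Multinomial probability that value i is taken by counts[i] blocks."""
    coeff = factorial(sum(counts))
    prob = F(1)
    for c, p in zip(counts, probs):
        coeff //= factorial(c)
        prob *= p ** c
    return coeff * prob

values = [v for v, _ in alternatives]
probs = [p for _, p in alternatives]
classes = {c: class_probability(c, probs)
           for c in count_vectors(n_blocks, len(values))}

best = max(classes, key=classes.get)  # most probable class (unique on this instance)
class_sum = sum(c * v for c, v in zip(best, values))
expected_sum = n_blocks * sum(v * p for v, p in alternatives)

print(best, classes[best])
print(class_sum, expected_sum)
print(F(class_sum, n_blocks), expected_sum / n_blocks)
```

Here the most probable class is the count vector (2, 1, 1), with probability 3/16, which matches the proportions 1/2, 1/4, 1/4. The sum on this class and the expected sum are both 9, and the two averages are both 9/4.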

Ongoing work

Can we extend the result about the distribution of the values in the most probable class to BIPDBs that are not uniform, but are balanced for the blocks sharing the same probability? In this case, are the expected value and the answer on the most probable class still equal?
How can we compute the most probable classes for a uniform BIPDB that is not necessarily balanced?
What about the expected value and the answer on the most probable class on BIPDBs with two columns? Could a simple projection yield a new BIPDB to which the single-column results above apply?