Dear Colleagues,

Here is the basic idea resulting from our Skype call:

-1. Parallel implementation: let the query diameter be D_q. Then have each processor take a volume in space that covers m * D_q by m * D_q by m * D_q distance units, for m >> 1. To make sure nothing is lost, the volumes should overlap by D_q in every dimension.

0. Start with a query consisting of k points in 3-space, or with their pairwise distances.

1. Let's say the minimum distance between two points in the query is d_min. (I am calling it d_min rather than m to avoid a clash with the multiplier in step -1.)

2. Find the level in the octtree or multilayered grid structure at which the diameter of the nodes is less than d_min. Call that level Lmin. Set L to Lmin or above (i.e., possibly a coarser level); we'll decide which based on some cost model.

To see why the level matters, consider centroids c1 and c2, with star s1 inside the node having centroid c1 and star s2 inside the node having centroid c2, and suppose dist(c1, c2) - epsilon > dist(s1, s2). But d1 is the query distance and d1 + epsilon > dist(s1, s2). So at this level both s1 and s2 could be in c1's node, and equally both could be in c2's node: the centroids alone cannot tell. If we go down one level, c1's node splits into children with centroids c11, c12, c13, c14 and c2's node into children with centroids c21, c22, c23, c24. Say s1 has fallen into c12 and s2 into c21, or perhaps s2 is in c11; the finer centroids now distinguish those cases.

3. For each node n at level L, set n to be the "anchor" (the anchor should be, if I remember correctly, but Fabio please correct me if wrong, the centroid of the query points). Find the other nodes whose centroids are within the appropriate distance of the anchor, +/- the error and the radius. Then:

    while True:
        a) perform the basic join algorithm (Amir's algorithm) on the centroids at level L
        b) estimate the result size based on the number of stars within the nodes (my previous email)
        c) if the estimated result size indicates too much work to do the join:
               L += 1   # go down another level
           else:
               exit the while loop
    perform the star-by-star join in the voxels that remain

4. The main thing to figure out is (c). Our cost model is that the cost of a join is the cost of the cross-product calculation (as for a pure nested loop). There are two costs: the cost of expanding a level and the cost of the joins on stars. If at level L there is already just one star per voxel, then clearly there is no need to go down further (by the way, this implies that the grids/octtrees need to record how many stars are in them).

How can going down a level help? Suppose node n1 has count(n1) stars and node n2 has count(n2) stars. Then, provided n1 and n2 partially match, the cross-product calculation entails count(n1) * count(n2) comparisons. Suppose now that we descend a level and eliminate the farthest sub-node pairs. In the two-dimensional case each node splits into 4 children, giving 16 sub-node pairs; eliminating 2 of them removes 1/8 of the pairs, thus saving (1/8) * count(n1) * count(n2) comparisons. The cost of descending is approximately linear in count(n1) + count(n2) if the tree must be expanded.

The above was just an example. To determine how much we can save, say the distance between the centroids of n1 and n2 is c, the query distance we are looking for is d, the error is epsilon, and the diameter of the nodes is Diam. "Nodes n1 and n2 partially match" means

    c - (Diam + epsilon) <= d <= c + (Diam + epsilon)

When we go down a level, the diameter drops to Diam/2 and the centroids shift in the x and y directions by Diam/(2*sqrt(2)) in the 2D case. So, for example, the farthest sub-nodes are something like c + Diam apart, and for those sub-nodes to be partial matches we need

    c + Diam - (Diam/2 + epsilon) <= d <= c + Diam + (Diam/2 + epsilon)

The point is that we will likely have some sub-node pairs that do not match, provided c and d are not too close. (If, on the other hand, c = d, then we would have to go down two levels to see any benefit.)

The sketches below make steps -1, 2, and 4 concrete.
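To make the overlap rule in step -1 concrete, here is a minimal Python sketch of how a coordinator might carve space into per-processor volumes. The function name, the cubic extent, and the axis-aligned-box representation are all assumptions for illustration, not part of our design:

    def partition_volumes(extent, m, D_q):
        """Tile the cube [0, extent)^3 into volumes of side m * D_q that
        overlap their neighbors by D_q in every dimension, so any query of
        diameter D_q lies wholly inside at least one volume.
        Hypothetical helper: 'extent' and the return format are assumptions."""
        side = m * D_q           # each processor covers side x side x side
        stride = side - D_q      # advancing by side - D_q leaves a D_q overlap
        starts = []
        x = 0.0
        while x < extent:
            starts.append(x)
            x += stride
        # one (origin, side-length) box per processor
        return [((x0, y0, z0), side)
                for x0 in starts for y0 in starts for z0 in starts]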
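For step 2 and the partial-match predicate of step 4, a sketch under the same caveats: finding Lmin just halves the root diameter until it drops below d_min, and partial_match is exactly the inequality written out above.

    def min_level(root_diameter, d_min):
        """Step 2: the shallowest level Lmin at which the node diameter is
        less than d_min, the minimum pairwise distance in the query.
        Each level of the octtree/grid halves the diameter."""
        level, diam = 0, root_diameter
        while diam >= d_min:
            diam /= 2.0
            level += 1
        return level

    def partial_match(c, d, diam, epsilon):
        """Step 4: nodes whose centroids are c apart partially match the
        query distance d when c - (diam + epsilon) <= d <= c + (diam + epsilon)."""
        return c - (diam + epsilon) <= d <= c + (diam + epsilon)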
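And a sketch of estimating the eliminated portion p by testing partial_match (from the previous sketch) on every sub-node pair one level down. In the real structure the child centroids would come from the tree; here they are simply passed in:

    import math

    def dist(a, b):
        """Euclidean distance between two centroid tuples."""
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

    def eliminated_fraction(children1, children2, d, child_diam, epsilon):
        """The portion p of sub-node pairs ruled out one level down.
        children1/children2 are the child centroids of n1 and n2
        (4 each in 2D, 8 in 3D); child_diam is Diam/2."""
        pairs = [(a, b) for a in children1 for b in children2]
        matching = sum(1 for a, b in pairs
                       if partial_match(dist(a, b), d, child_diam, epsilon))
        return 1.0 - matching / len(pairs)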
The bottom line: compute the partial-match predicate for each sub-node pair and, for the portion of sub-node pairs that are eliminated, call it p, ask whether

    p * count(n1) * count(n2) > count(n1) + count(n2)

in the case that you need to build the subtree; the postscript works this out with numbers.

Please let me know if any of this is unclear.

Best,
Dennis
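P.S. A worked instance of the bottom-line test, with the counts assumed purely for illustration: in the 2D example above, p = 2/16 = 1/8. With, say, count(n1) = count(n2) = 100, the saving is (1/8) * 100 * 100 = 1250 comparisons against an expansion cost of about 100 + 100 = 200, so descending wins; with only 3 stars apiece, the saving is (1/8) * 3 * 3 = 1.125 against a cost of 6, so we should join at the current level. As a one-liner continuing the Python sketches above:

    def should_descend(p, count1, count2):
        """Descend a level only if the eliminated portion p saves more
        cross-product comparisons than the roughly linear cost of
        expanding the subtree."""
        return p * count1 * count2 > count1 + count2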