From: dennis shasha <dennisshasha@yahoo.com>
Subject: antipole paper with edits
To: alfredo@alpha.dipmat.unict.it, alfredo@etna.dipmat.unict.it,
        ferro@dipmat.unict.it
Cc: shasha@cs.nyu.edu


Dear Alfredos,
My comments are here. Please respond to
my normal NYU address. I can't get to that account this
weekend because of the Easter holiday.
Happy Easter,
Dennis

[Attachment: AntipoleRevised.tex]

	 alfredo@alpha.dipmat.unict.it alfredo@etna.dipmat.unict.it
	 FERRO@dipmat.unict.it
	Dear Colleagues,
	Please ask for a few more days (you should not think
	that my turnaround is always guaranteed to be so short:)
	to incorporate my many mostly minor comments.
	My comments begin with Dennis:
	I have reviewed neither the figures nor the pseudo-code.
	Alfredino, I'm depending on you for that.
	Note that we don't deal with the height-balanced issue.
\documentclass[12pt]{article}
\usepackage{amsmath}
\usepackage{graphics}
%\usepackage{latexsym}
\usepackage{graphicx}
\usepackage{epsfig}
\usepackage{amsfonts}
\usepackage{amssymb}
%\usepackage{layout}
%\usepackage{setspace}
\overfullrule=5pt

\setlength{\oddsidemargin}{0.2mm}
\setlength{\evensidemargin}{-0.2mm}
\setlength{\textwidth}{16.6 cm}
\setlength{\textheight}{24.2 cm}
\setlength{\topmargin}{-2.1 cm}
%\setlength{\headsep}{30pt}
\setlength{\unitlength}{0.3 mm}
\baselineskip = 18pt
%\setstretch{1.6}
\linespread{1.6}
%\topmargin -15mm
%\textwidth 162mm
%\textheight 230mm
%\oddsidemargin 2mm
%\evensidemargin 2mm



\newcommand{\A}{{\cal A}}
\newcommand{\hh}{{\cal H}}
\newcommand{\s}{{\cal S}}
\newcommand{\W}{{\cal W}}
\newcommand{\BH}{\mathbf B(\cal H)}
\newcommand{\KH}{\cal  K(\cal H)}
\newcommand{\Real}{\mathbb{R}}
\newcommand{\Complex}{\mathbb C}
\newcommand{\Field}{\mathbb F}
\newcommand{\RPlus}{[0,\infty)}
\newcommand{\abs}[1]{\left\vert#1\right\vert}
\newcommand{\set}[1]{\left\{#1\right\}}
\newcommand{\seq}[1]{\left<#1\right>}
\newcommand{\eps}{\varepsilon}
\newcommand{\To}{\longrightarrow}
\newcommand{\RE}{\operatorname{Re}}
\newcommand{\IM}{\operatorname{Im}}
\newcommand{\Poly}{{\cal{P}}(E)}
\newcommand{\EssD}{{\cal{D}}}
\newtheorem{thm}{Theorem}[section]
\newtheorem{defn}{Definition}[section]
\newcommand{\ine}{\mathsf{ \textsc{is\_not\_empty}}}
\newcommand{\dps}{\mathsf{ \textsc{PAGE\_SIZE}}}
\newcommand{\upcat}{\mathsf{ \textsc{update\_dataset\_catalog}}}
\newcommand{\rmf}{\mathsf{ \textsc{remove\_first}}}
\newcommand{\pt}{\mathsf{ \textsc{APPROX\_1\_MEDIAN}}}
\newcommand{\win}{\mathsf{ \textsc{1-MEDIAN}}}
\newcommand{\app}{\mathsf{ \textsc{APPROX\_ANTIPOLE}}}
\newcommand{\appb}{\mathsf{ \textsc{ADAPTED\_APPROX\_ANTIPOLE}}}
\newcommand{\bapt}{\mathsf{ \textsc{BUILD}}}
\newcommand{\baptE}{\mathsf{ \textsc{BUILD\_EUCLIDEAN}}}
\newcommand{\EuclidAclu}{\mathsf{ \textsc{2D\_ANTIPOLE\_CLUSTERING}}}
\newcommand{\EuclidAcluBS}{\mathsf{ \textsc{2D\_ANTIPOLE\_CLUSTERING\_BS}}}
\newcommand{\chap}{\mathsf{ \textsc{CHECK}}}
\newcommand{\rs}{\mathsf{ \textsc{RANGE\_SEARCH}}}
\newcommand{\diag}{\mathsf{ \textsc{DIAGONAL}}}
\newcommand{\diagB}{\mathsf{ \textsc{ANTIPOLE\_DIAGONAL}}}
\newcommand{\diagC}{\mathsf{ \textsc{2D\_DIAGONAL}}}
\newcommand{\vc}{\mathsf{ \textsc{VISIT\_CL}}}
\newcommand{\mclu}{\mathsf{ \textsc{MAKE\_CLUSTER}}}
\newcommand{\winb}{\mathsf{ \textsc{LOCAL\_WINNER}}}
\newcommand{\ap}{\mathsf{ \textsc{FIND\_ANTIPOLE}}}
\newcommand{\rot}{\mathsf{ \textsc{ROTATE\_SET}}}
\newcommand{\knn}{\mathsf{ \textsc{K\_NEAREST\_NEIGHBOR}}}
\newcommand{\knnvisit}{\mathsf{ \textsc{KNN\_VISIT}}}
\newcommand{\ck}{\mathsf{ \textsc{KNN\_CHECK}}}
\newcommand{\kvc}{\mathsf{ \textsc{KNN\_VISIT\_CLUSTER}}}
\newcommand{\hi}{\mathsf{ \textsc{HEAP\_INSERT}}}
\newcommand{\hem}{\mathsf{ \textsc{HEAP\_EXTRACT\_MAX}}}
\newcommand{\bp}{\mathsf{ \textsc{BEST\_PATH\_SEARCH}}}
\newcommand{\aknn}{\mathsf{ \textsc{APPROXIMATED\_KNN}}}
\newcommand{\secMEM}{\mathsf{ \textsc{BUILD\_SM}}}
\newcommand{\maken}{\mathsf{ \textsc{MAKE\_NODE}}}
\newcommand{\mclub}{\mathsf{ \textsc{MAKE\_CLUSTER\_SECMEM}}}
\newcommand{\Diam}{\mathsf{ \textsc{DIAMETER}}}
\newcommand{\upadd}{\mathsf{ \textsc{SET\_CHILD\_ADDRESS}}}
\newcommand{\leafpage}{\mathsf{ \textsc{L\_PAGE}}}
\newcommand{\diskpred}{\mathsf{ \textsc{I\_PRED}}}
\newcommand{\diskpage}{\mathsf{ \textsc{I\_PAGE}}}
\newcommand{\nodesize}{\mathsf{ \textsc{node\_size}}}
\newcommand{\allnode}{\mathsf{ \textsc{ALLOCATE\_L\_PAGE}}}
\newcommand{\adata}{\mathsf{ \textsc{ADD\_DATA}}}
\newcommand{\aclust}{\mathsf{ \textsc{ANTICLUSTAL}}}
\newcommand{\malign}{\mathsf{ \textsc{MERGE\_ALIGNMENT}}}
\newcommand{\alignclu}{\mathsf{ \textsc{ALIGN\_CLUSTER}}}
\newcommand{\gprof}{\mathsf{ \textsc{GET\_PROFILE}}}
\newcommand{\aseq}{\mathsf{ \textsc{ALIGN\_SEQUENCES}}}
\newcommand{\optal}{\mathsf{ \textsc{OPTIMAL\_ALIGNMENT}}}
\newcommand{\alproseq}{\mathsf{ \textsc{ALIGN\_PROFILE\_vs\_SEQUENCE}}}
\newcommand{\cm}{\mathsf{ \textsc{CLOSEST\_MATCH}}}
\newcommand{\rle}{\mathsf{ \textsc{REMOVE\_LONG\_EDGES}}}
\newcommand{\ri}{\mathsf{ \textsc{REMOVE\_INTERSECTIONS}}}
\newcommand{\riv}{\mathsf{ \textsc{REMOVE\_INTERSECTION\_VISIT}}}
\newcommand{\rb}{\mathsf{ \textsc{REASSIGN\_BACKBONE}}}
\newcommand{\ord}{\preceq}
\newcommand{\sord}{\prec}
\newcommand{\NN}{\mathbb{N}}
\newcommand{\RR}{\mathbb{R}}
\newcommand{\spz}{\vspace{5mm}}
\newcommand{\ovr}[1]{\overrightarrow{#1}}
\newcommand{\rnk}{\varrho}
\newcommand{\pa}{\pi_{a}}
\newcommand{\paa}[1]{\pa^{(#1)}}
\newcommand{\pii}[1]{\pi^{(#1)}}
\newcommand{\pp}[1]{p^{(#1)}(\ovr{\pa},b)}
\newcommand{\spa}{3 \paa{0} + 2\paa{1} + \paa{2}}
\newcommand{\constr}{C_{\ovr{q_{a}}}}
\newcommand{\bu}{\bullet}
\newcommand{\ci}{\circ}
\newcommand{\Sm}{\mu_{S}}
\newcommand{\Ss}{\sigma_{S}}
\newcommand{\eo}{\epsilon_{out}}
\newcommand{\id}{\noindent \underline{\em{Proof.}} \mbox{ } }
\newcommand{\eod}{\hfill{$\blacksquare$} \\ }
\newcommand{\dist}{\mathit{dist}}
\newcommand{\Cluster}{\mathit{Cluster}}
\newcommand{\List}{\mathit{List}}
\newcommand{\Radius}{\mathit{Radius}}
\newcommand{\Size}{\mathit{Size}}
\newcommand{\Rad}{\mathit{Rad}}
\newcommand{\myleft}{\mathit{left}}
\newcommand{\myright}{\mathit{right}}
\newcommand{\Leaf}{\mathit{Leaf}}
\newcommand{\Tdim}{\mathit{Tdim}}
\newcommand{\setT}{\mathit{SetT}}
\begin{document}
%\layout
\title{Antipole Tree Indexing to Support Range Search and K-Nearest Neighbor Search in Metric Spaces}
\author{D. Cantone$^1$, A. Ferro$^1$, A. Pulvirenti$^1$, D. Reforgiato$^1$, and D. Shasha$^2$
\\ $^1$Dipartimento di Matematica e Informatica \\
Universit\`a degli Studi di Catania \\
e-mail: \{cantone, ferro, apulvirenti, diegoref\}@dmi.unict.it \\
$^2$Computer Science Department \\
New York University \\ e-mail: shasha@cs.nyu.edu}
\date{}
\maketitle \abstract{ Range and $k$-nearest neighbor searching are
core problems in pattern recognition. Given a database $D$ of
objects in a metric space $M$ and a query object $q$ in $M$, one
wants to find those objects in $D$ that are similar to $q$. In
range searching one looks for the objects of $D$ within some
threshold distance of $q$. In $k$-nearest neighbor searching, the
$k$ elements of $D$ closest to $q$ must be produced. These
problems can obviously be solved with a linear number of distance
calculations, by comparing the query object against every object
in the database. The goal, however, is to solve the problem much
faster. We combine and extend ideas from the M-Tree, the
Multi-Vantage Point structure, and the FQ-tree to create a new
structure in the ``bisector tree'' class called the Antipole Tree.
Bisection is based on the proximity to an ``antipole'' pair of
elements generated by a suitable linear randomized tournament. The
final winners $a,b$ of this tournament are far enough apart to
approximate the diameter of the splitting set. If $\dist(a,b)$ is
larger than the chosen cluster diameter threshold, then the
cluster is split. The proposed data structure is an indexing
scheme suitable for (exact and approximate) best-match searching
in generic metric spaces. Comparing the Antipole Tree with
existing structures such as the List of Clusters, M-trees, and
others shows an improvement in many cases.
\\
\\
\noindent {\bf Index terms:}
Indexing methods, Similarity Measures, Information Search and Retrieval.
}
\section{Introduction}
Clustering is a basic tool in data mining whose goal is to
partition the data set into classes according to a certain metric.
Clustering algorithms proposed in the literature include
BIRCH~\cite{ZRL96}, DBSCAN~\cite{EKSX96}, CLIQUE~\cite{AGGR98},
BIRCH*~\cite{GRGPF99}, WaveClusters~\cite{SCZ20},
CURE~\cite{GRS98}, and CLARANS~\cite{NH02}. Those techniques
focus less on the efficiency of data retrieval than do the
MVP-Tree~\cite{BO99}, the M-Tree~\cite{CPZ97}, the
SLIM-Tree~\cite{TTSF99}, the FQ-Tree~\cite{BCMW94}, the List of
Clusters~\cite{CN2000}, and SAT~\cite{Nav2002} (see~\cite{CNBM01}
for a survey). Very relevant work on Euclidean spaces
includes~\cite{AWYE99,berchtold96xtree,Bey99,Sum00a,GIM99}.
We combine ideas from these structures together with special
randomized techniques coming from the approximation algorithms
community~\cite{BCCCH99} to design a simple and efficient
indexing structure called an {\em Antipole Tree}. This data
structure is able to support range queries and $k$-nearest
neighbor queries in general metric spaces.

The Antipole Tree belongs to the class of ``bisector
trees''~\cite{CNBM01,KM83,NVZ92}: binary trees where each node
represents a set of elements to be clustered. The process begins
by partitioning the input set of elements according to their
proximity to two randomly chosen centers $c_1$, $c_2$. The
resulting partitions are then treated as new input sets
recursively. The process continues until no element is left. Each
node of the tree contains the centers and the radii of the
underlying subsets of elements.

In order to produce a good pair of elements ($c_1$,$c_2$) we
make use of a simple randomized tournament among the elements of
the input set, related to the one described in~\cite{BCCCH99}.
Our tournament is played as follows. At each round, the winners of
the previous round are randomly partitioned into subsets of a
fixed size $t$. Then a procedure finds each subset's
1-median\footnote{We recall that the 1-median of a set
of points $S$ in a metric space is an element of $S$ whose average
distance to all points of $S$ is minimum.}
and discards it.
Rounds are played until fewer than $2t$ elements are left.
The farthest pair of points in that set is
our antipole pair of elements.

The paper is organized as follows. In the next section, we give
the basic definitions of range search and nearest neighbor queries
in general metric spaces and briefly review relevant previous
work, with special emphasis on those structures which have been
shown to be the most effective: the List of
Clusters~\cite{CN2000}, M-Trees \cite{CPZ97} and MVP-Trees
\cite{BO99}. In section~\ref{antipolesection} we describe the
Antipole Tree, giving in subsections~\ref{sec:algorithm1median}
and~\ref{sec:diameter} techniques to compute the approximate
1-median and diameter in general metric spaces. In section
\ref{rangesearchsection}, we present a procedure for range
searching on the Antipole Tree. Section \ref{knnsection} gives an
algorithm for the exact $k$-nearest neighbor problem.
Section~\ref{ExpAnalysis} compares the Antipole Tree with the List
of Clusters, M-Tree and MVP-Tree experimentally; in particular,
cluster diameter threshold tuning is discussed. An approximate
$k$-nearest neighbor algorithm is introduced in
section~\ref{approxknn}, and a comparison with the
approximate-search version of the List of Clusters~\cite{BN2002}
is given with a precision-recall analysis. In section
\ref{linearScan} we deal with the curse of dimensionality: in high
dimensions, linear scan may become competitive with the best
searching algorithms on uniform data sets, but most real-world
data sets are non-uniform, and we successfully compare our
algorithm with linear scan on non-uniform data sets in very high
dimensional Euclidean spaces. Finally we summarize the results in
section~\ref{concl}. In Appendix A we show an efficient
approximation scheme for the Euclidean case.

\section{Basic Definitions and Related Works}
Let $X$ be a non-empty set of objects and let $d: X \times X \To \Real$
be a function. If the following properties hold:
\begin{enumerate}
\item $(\forall \,x, \,y \, \in X)\,  d(x, y) \geq 0$
(positiveness);
\item $(\forall \, x, \, y \, \in X) \, d(x, y) \,=\, d( y, x)$
(symmetry);
\item $(\forall \,x \,\in \, X) \, d(x, x) = 0$ (reflexivity), \\
       $(\forall \,x, \, y \, \in X) \, (x \neq y  \,\rightarrow \, d(x, y) > 0)$ (strict
positiveness);
\item $(\forall \,x,\, y,\, z \, \in \, X) \, d(x, y) \leq d(x, z) \,+\, d(z,
y)$ (triangle inequality);
\end{enumerate}
then the pair $(X,d)$ is called a \emph{metric space} and $d$ is called its
{\em metric function}.
Well known metric functions include the Manhattan distance, the Euclidean
distance, the string edit distance, and the shortest-path distance through
a graph.
Our goal is to build a low-cost data structure for range searching
and nearest neighbor searching in metric spaces.
\begin{defn}(Range query).
Given a query object $q$, a database $D$, and a threshold $t$, the
Range Search Problem is to find all objects $\{o \in D| \dist(o,q)
\leq t\}$.
\end{defn}
\begin{defn}($k$-Nearest Neighbor query).
Given a query object $q$ and an
integer value $k > 0$, retrieve the $k$ closest elements to $q$ in
$D$.
\end{defn}
Our basic cost measure is the number of distance calculations, since these are often expensive
in metric spaces, e.g., when computing the edit distance between strings.
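
For concreteness, the following minimal Python sketch (ours, not part of the paper's implementation) shows the linear-scan baseline together with a wrapper that counts distance calls; the Euclidean metric and all names are merely illustrative.
\begin{verbatim}
import math

def euclidean(x, y):
    """Plain Euclidean distance; stands in for any metric function."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

class CountingMetric:
    """Wraps a metric and counts calls: our basic cost measure."""
    def __init__(self, dist):
        self.dist = dist
        self.calls = 0
    def __call__(self, x, y):
        self.calls += 1
        return self.dist(x, y)

def range_search_linear(D, q, t, dist):
    """Range query by linear scan: every object within distance t of q."""
    return [o for o in D if dist(o, q) <= t]

def knn_linear(D, q, k, dist):
    """k-nearest neighbor query by linear scan."""
    return sorted(D, key=lambda o: dist(o, q))[:k]

dist = CountingMetric(euclidean)
D = [(0, 0), (1, 0), (3, 4), (6, 8)]
print(range_search_linear(D, (0, 0), 2.0, dist))   # [(0, 0), (1, 0)]
print(knn_linear(D, (0, 0), 2, dist), dist.calls)  # ..., 8 calls
\end{verbatim}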
Three main sources of ideas have contributed to our work. The
FQ-tree \cite{BCMW94} is an example of a structure using pivots
(see~\cite{CNBM01} for an extended survey); it organizes the items
of a collection ranging over a metric space into the leaves of a
tree data structure. Viewed abstractly, FQ-trees consist of a
vector of reference objects $r_1, \cdots, r_k$ and a distance
vector $v_o$ associated with each object $o$ such that $v_o[i] =
\dist(o,r_i)$. A query object $q$ computes a distance to each
reference object, thus obtaining a vector $v_q$. Object $o$ cannot
be within a threshold distance $t$ of $q$ if for any $i$, $v_q[i]
> v_o[i] + t$: in that case the triangle inequality forces
$\dist(q,o) \geq v_q[i] - v_o[i] > t$.

We use a similar idea, except that our reference objects are the
centroids of clusters.
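
The exclusion rule is a one-liner; the sketch below (ours) uses its symmetric two-sided form, which follows from the same triangle-inequality argument.
\begin{verbatim}
def survives_pivot_filter(v_q, v_o, t):
    """Pivot-based exclusion, as in FQ-trees.

    v_q[i] = dist(q, r_i) and v_o[i] = dist(o, r_i).  By the triangle
    inequality dist(q, o) >= |v_q[i] - v_o[i]|, so a single pivot with
    |v_q[i] - v_o[i]| > t proves that o is out of range; only the
    survivors need an actual distance computation against q.
    """
    return all(abs(dq - do) <= t for dq, do in zip(v_q, v_o))
\end{verbatim}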
M-trees~\cite{CPZ97,CP98} are dynamically balanced trees. Nodes of
an M-tree store several items of the collection, provided that they are
``close'' and ``not too numerous''. If one of these conditions is
violated, the node is split and a suitable subtree originating in
the node is recursively constructed. In the M-tree, each parent
node corresponds to a cluster with a radius, and every child of
that node corresponds to a subcluster with a smaller radius. If a
centroid $x$ has a distance $\dist(x,q)$ from the query object and
the radius of the cluster is $r$, then the entire cluster
corresponding to $x$ can be discarded if $\dist(x,q) > t + r$.

We take from the M-tree the idea that a parent node corresponds to
a cluster and its children nodes are subclusters of that parent
cluster. The main differences are that our construction method is
quite different and that the clusters in the M-Tree must have a
limited number of elements. Furthermore, our search strategy is
different because our algorithm produces a binary tree data
structure.

VP-trees~\cite{Uh91,Y93} organize items coming from a metric
space into a binary tree. The items are stored both in the leaves
and in the internal nodes of the tree. The items stored in the
internal nodes are the ``vantage points''. Processing a query
requires computing the distance between the query point and some
of the vantage points. The construction of a VP-tree partitions a
data set according to the distances that the objects have with
respect to a reference point. The median value of these distances
is used as a separator to partition the objects into two balanced
subsets (those as close or closer than the median and those
farther than the median). The same procedure is recursively
applied to each of the two subsets.

The Multivantage-Point tree \cite{BO99} is an intellectual
descendant of the vantage point tree and the GNAT \cite{Bri95}
structure. The MVP-Tree appears to be superior to the
previous methods. The fundamental idea is that, given a point $p$,
one can partition all objects into $m$ partitions based on their
distances from $p$, where the first partition consists of those
points within distance $d_1$ from $p$, the second consists of
those points whose distance is greater than $d_1$ and less than or
equal to $d_2$, etc. Given two points, $p_a$ and $p_b$, the
partitions $a_1,\cdots, a_m$ based on $p_a$ and the partitions
$b_1, \cdots, b_m$ based on $p_b$ can be created. One can then
intersect all possible $a$- and $b$-partitions (i.e. $a_i \cap
b_j$ for $1 \leq i \leq m$ and $1 \leq j \leq m$) to get $m^2$
partitions. In an MVP-tree, each node corresponds to two objects
(vantage points) and $m^2$ children, where $m$ is a parameter of
the construction algorithm and each child corresponds to a
partition. When searching for objects within distance $t$ of a
query point $q$, the algorithm does the following: given a parent
node having vantage points $p_a$ and $p_b$, if some partition $Z$
has the property that for every object $z \in Z$, $\dist(z,p_a ) <
d_z$ and $\dist(q,p_a ) > d_z + t$, then $Z$ can be discarded.
There are other reasons for discarding clusters, also based on the
triangle inequality. Using multiple vantage points together with
pre-computed distances reduces the number of distance computations
at query time. Like the MVP-tree, our structure makes aggressive
use of the triangle inequality.
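
As a small illustration (ours; the cutoff distances are assumed to be given), the $m^2$ cells can be computed as follows.
\begin{verbatim}
import bisect

def mvp_cell(o, p_a, p_b, cuts_a, cuts_b, dist):
    """Label object o with the pair (i, j) of partitions induced by
    the vantage points p_a and p_b; cuts_a and cuts_b hold the sorted
    cutoffs d_1 < ... < d_{m-1} of each vantage point, so the result
    ranges over the m * m cells described above."""
    i = bisect.bisect_left(cuts_a, dist(o, p_a))
    j = bisect.bisect_left(cuts_b, dist(o, p_b))
    return (i, j)
\end{verbatim}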
Another relevant recent work is due to Ch\'avez et
al.~\cite{CN2000}; the paper presents a structure called the List
of Clusters. The method divides the space into clusters of bounded
diameter (or bounded number of elements) and returns a list of
such clusters. Clusters are generated one by one: starting from a
random point, a cluster with bounded diameter (or number of
objects) is constructed around it; the algorithm then selects
another point, typically the furthest from the previous one, forms
a cluster of the same diameter as the first cluster, and repeats
recursively until no more elements remain. The authors show with
analytical and experimental results that their structure
outperforms previous methods, and they also give parameters to
properly tune it.

Other sources of inspiration include~\cite{BK73,C99,FCCM99,G85,SW90,S77,TTSF99,Nav2002}.
\section{The Antipole Tree}\label{antipolesection}
Let ($S$, $\dist$) be a finite metric space and suppose that we
aim to split it into the minimum possible number of clusters whose
radius does not exceed a given threshold $\sigma$. This problem
has been studied by Hochbaum and Maass~\cite{HM85} for Euclidean
spaces. Their approximation algorithm was improved by Gonzalez
in~\cite{G91}. Similar ideas are used by Feder and
Greene~\cite{FG88} (see~\cite{Proc01} for an extended survey on
clustering methods in Euclidean spaces).

The Antipole clustering of bounded radius $\sigma$ is performed by
a recursive top-down procedure starting from the given finite set
of points $S$ and checking at each step whether a given splitting
condition $\Phi$ is satisfied. If this is not the case, then
splitting is not performed, the given subset is a cluster, and a
centroid having distance approximately less than $\sigma$ from
every other node in the cluster is computed
by the procedure described in section~\ref{sec:algorithm1median}.

Otherwise, if $\Phi$ is satisfied, then a pair of points $\{A,B\}$
of $S$, called the Antipole pair, is generated by the algorithm
described in~\ref{sec:diameter} and is used to split $S$ into two
subsets $S_{A}$ and $S_{B}$,
%($A \, \in \, S_A$ and $B \, \in \, S_B$)
obtained by assigning each point $p$ of $S$ to the subset
containing the endpoint of the Antipole $\{A,B\}$ closest to $p$.
The splitting condition $\Phi$ states that $\dist(A,B)$ is greater
than the cluster diameter threshold corrected by the error coming
from the Euclidean case analysis described in
Appendix~\ref{ediam}. Indeed, the diameter threshold is based on a
statistical analysis of the pairwise distances of the input set
(see section~\ref{CHOSEDIAM}), which can be used to evaluate the
intrinsic dimension~\cite{CNBM01} of the metric space. The tree
obtained by the above procedure is called an Antipole Tree. All
nodes are annotated with the Antipole endpoints and the
corresponding cluster radius; each leaf also contains the 1-median
of the corresponding final cluster. Its implementation is
described in section~\ref{APTreeInMS}.
\subsection{1-Median} \label{sec:algorithm1median}
In this section we review a randomized algorithm for approximate
1-median selection~\cite{CCFP03}, an important subroutine in our
Antipole Tree construction. It is based on a tournament played
among the elements of the input set $S$. At each round, the
elements which passed the preceding turn are randomly partitioned
into subsets, say $X_{1},\ldots,X_{k}$. Then, each subset $X_{i}$
is locally processed through a procedure which computes its exact
1-median $x_{i}$. The elements $x_{1},\ldots,x_{k}$ move to the
next round. The tournament terminates when we are left with a
single element $\overline{x}$, the final winner, which
approximates the exact 1-median of $S$.

A possible implementation of the above general method partitions
the elements at each round (while at least a threshold number of
them remain) into subsets of the same size $t$, with the possible
exception of one subset whose size must lie between $t$ and
$2t-1$. The iteration stops when the number of elements falls
below the {\it threshold}. At that point, a quadratic ``brute
force'' computation returns the exact 1-median of the residual
set. This procedure is summarized in Fig.~\ref{fig:algo}, where
the local optimization procedure $\win\:(X)$ returns the exact
1-median in $X$.


\begin{figure}[ht!]
\begin{center}
\fbox{ \scriptsize
\begin{minipage}{0.90\textwidth}
\begin{center}
{\bf \large The approximate 1-Median selection algorithm}
\end{center}
\begin{tabular}{c|c}
\begin{minipage}{0.45\textwidth}
\begin{tabbing}
 $\win\,(X)$ \\
\ \ \ \ \ \ \= \+
 1 \' \= {\bf for each} $x \in X$ {\bf do} \\
 2 \' \>  $\sigma_x \leftarrow \sum_{y \in X} d(x,y)$; \\
 3 \' \> Let $m \in X$ be an element such that  \\
   \' \> $\sigma_m = \min_{x \in X} (\sigma_x)$; \\
 4 \' \> {\bf return} $m$
\end{tabbing}
\end{minipage} &
%\vspace{1mm}
\begin{minipage}{0.45\textwidth}
\begin{tabbing}
 $\pt\,(S)$ \\
\ \ \ \ \ \ \= \+
1 \'  \= {\bf while} \= $|S| > $ {\it Threshold} {\bf do} \\
2 \'  \>             \>  $W \leftarrow \emptyset$; \\
3 \'  \>             \> {\bf while} \= $|S| \geq 2t$ {\bf do} \\
4 \'  \>             \>             \> Choose a random subset $T \subseteq S$ with $|T|=t$ (uniformly, without replacement); \\
5 \'  \>             \>             \> $S \leftarrow S \setminus T$; \\
6 \'  \>             \>             \> $W \leftarrow W \cup \{ \win\,(T) \}$; \\
7 \'  \>             \> {\bf end while}; \\
8 \'  \>             \> $S \leftarrow W \cup \{\win\,(S)\}$; \\
9 \'  \> {\bf end while}; \\
10 \' \> {\bf return} $\win\,(S)$;
\end{tabbing}
\end{minipage}
\end{tabular}
\end{minipage}}
\end{center}
\caption{Pseudo-code of the approximate 1-Median selection
algorithm.}\label{fig:algo}
\end{figure}

Notice that each random partitioning phase can be simplified by
introducing efficient pseudo-randomization methods
(see~\cite{BCCCH99}).
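
For concreteness, here is an executable Python rendering of Fig.~\ref{fig:algo} (a sketch under our reading of the pseudo-code; function names and default parameters are ours).
\begin{verbatim}
import random

def exact_1_median(X, dist):
    """1-MEDIAN(X): the element minimizing the sum of distances to X."""
    return min(X, key=lambda x: sum(dist(x, y) for y in X))

def approx_1_median(S, dist, t=3, threshold=10):
    """APPROX_1_MEDIAN(S): the randomized tournament above."""
    S = list(S)
    while len(S) > threshold:
        random.shuffle(S)                  # uniform, without replacement
        W = []
        while len(S) >= 2 * t:
            T, S = S[:t], S[t:]            # random subset of size t
            W.append(exact_1_median(T, dist))
        W.append(exact_1_median(S, dist))  # leftover, size t .. 2t-1
        S = W
    return exact_1_median(S, dist)         # brute force on the residue
\end{verbatim}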

%In particular, it is easily seen that in the case of
%uni-dimensional Euclidean spaces, the resulting approximate
%algorithm reduces exactly to the one given in \cite{BCCCH99}.
%%----------------------------------------------------------------------------
%%----------------------------------------------------------------------------


\subsubsection{Running time analysis of 1-Median computation} \label{sec:runtime1median}

The following analysis shows that the approximate algorithm given
in Fig.~\ref{fig:algo} is linear in the number of distances that
need to be computed. Let a splitting factor $t \in \NN$ be fixed,
and let $n=t^r$, with $r \in \NN$, be the cardinality of the input
set. Most of the work of the algorithm is spent in the subroutine
$\win$, which computes the local 1-median. The number $W(n)$ of
times this subroutine is executed satisfies the recurrence
equation:
\[
 W(n) = \left\{
 \begin{array}{ll}
 0 & \mbox{if} \; n=1, \\
 \frac{n}{t} + W(\frac{n}{t}) & \mbox{otherwise}, \\
 \end{array} \right.
\]
whose solution is $W(n) = \frac{1}{t-1}(n-1)$, when $n$ is a power
of $t$. At each call, procedure $\win$ performs
$\frac{t(t-1)}{2}$ distance computations. The following result
follows when the $\mathit{threshold}$ is $1$.
\begin{thm} \label{th1}
   Given an input set of cardinality $n=t^r$, with $r \in
   \NN$, our $\pt$ performs exactly $\frac{t}{2}(n-1)$ distance computations. \eod
\end{thm}

In the general case, namely when $n$ is not an exact power of $t$,
the leading term (and its coefficient) in the asymptotic
expression of the number of distance computations is the same as
in Theorem~\ref{th1}.
	Dennis: We should summarize empirical results
	which demonstrate that this gives a good approximation
	to the 1-median.
\subsection{The Diameter (Antipole) computation}\label{sec:diameter}
Let $(M,d)$ be a metric space with distance function $d : (M
\times M) \longmapsto \RR$ and let $S$ be a finite subset of $M$.
The {\em diameter computation problem} or {\em furthest pair
problem} is to find the pair of points $A,B$ in $S$ such that
$d(A,B) \geq d(x,y)$ for all $x,y \in S$.

As observed in~\cite{indyk}, we can construct a metric space where
all distances among objects are set to 1 except for one (randomly
chosen) distance, which is set to 2. In this case any algorithm
that tries to guarantee an approximation factor greater than $1/2$
must examine all pairs, so a randomized algorithm will not
necessarily find that pair.
Nevertheless, we expect a good outcome in nearly all cases. Here
we introduce a randomized algorithm inspired by the one proposed
for the 1-median computation~\cite{CCFP03} and reviewed in the
preceding section. In this case, each subset $X_{i}$ is locally
processed through a procedure $\winb$ which computes its exact
1-median $x_{i}$ and then returns the set $X_i$ without the
element $x_i$. We call the returned set $\bar{X}_{i}$. The
elements of $\bar{X}_{1}\, \cup \, \bar{X}_2\, \cup \ldots\, \cup
\,\bar{X}_{k}$ are used in the next step. The tournament
terminates when we obtain a single set $\overline{X}$, from which
we extract the final winners $A,B$, the furthest pair of points in
$\overline{X}$. $A,B$ is called the Antipole pair, and the
distance between its points represents the approximate diameter of
the set $S$.

The pseudo-code of the Antipole computation, similar to that of
the 1-Median computation, is given in Fig.~\ref{fig:Antipole}.
\begin{figure}[ht!]
\begin{center}
\fbox{ \scriptsize
\begin{minipage}{0.90\textwidth}
\begin{center}
{\bf \large The approximate Antipole selection algorithm}
\end{center}
\begin{tabular}{c|c}
\begin{minipage}{0.42\textwidth}
\vspace{-.2em}
\begin{tabbing}
$\winb(T)$ \\
\ \ \ \ \ \ \= \+
1 \' \= {\bf return} \= $T \setminus \win(T)$; \\
2 \' \= \textsc{end $\winb$}
\end{tabbing}
\mbox{}
\vspace{-.2em}
\begin{tabbing}
$\ap (T)$ \\
\ \ \ \ \ \ \= \+
1 \' \= {\bf return} \= $ {P_1,\,P_2} \in \, T$ {\it such that} \\
  \' \>              \> $\dist(P_1,P_2) \,\geq \, \dist(x,y) \,\forall x,y \,\in \,T $; \\
2 \' \> \textsc{end $\ap$}
\end{tabbing}
\end{minipage}
&
\begin{minipage}{0.42\textwidth}
\vspace{-.2em}
\begin{tabbing}
$\app (S)$ \\
\ \ \ \ \ \ \= \+
 1 \' \= {\bf while} \= $ |S| > $ {\it threshold} {\bf do} \\
2 \'    \>             \> {\it W $\leftarrow \, \emptyset$}; \\
3 \'    \>             \> {\bf while} \= $|S| \geq 2t$ {\bf do} \\
4 \'    \>             \>             \> {\it Randomly choose a
set $T \subseteq
S \, : \, |T| \, = \, t$;} \\
5  \'      \>           \>          \> $S \leftarrow S \, \setminus \, T$; \\
6  \'      \>           \>          \> $W \leftarrow W \, \cup \,$ $ \{ \winb(T) \}$; \\
7  \'      \>           \> {\bf end while} \\
8  \'      \>           \> {$S \, \leftarrow W \, \cup \,$} $\{\winb(S)\}$;\\
9  \'      \> {\bf end while}  \\
10 \'      \>{\bf return} $\ap$(S); \\
11 \'      \>\textsc{end $\app$}
\end{tabbing}
\end{minipage}
\end{tabular}
\end{minipage}
}
\end{center}
\caption{Pseudo-code of the approximate Antipole selection algorithm.} \label{fig:Antipole}
\end{figure}
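
An analogous executable sketch of Fig.~\ref{fig:Antipole} (ours), reusing \texttt{exact\_1\_median} from the previous sketch:
\begin{verbatim}
import itertools, random

def local_winner(T, dist):
    """LOCAL_WINNER(T): discard the 1-median of T, keep the rest."""
    m = exact_1_median(T, dist)
    return [x for x in T if x is not m]

def find_antipole(T, dist):
    """FIND_ANTIPOLE(T): the exact furthest pair in T."""
    return max(itertools.combinations(T, 2), key=lambda p: dist(*p))

def approx_antipole(S, dist, t=3, threshold=None):
    """APPROX_ANTIPOLE(S): same tournament, survivors are non-medians."""
    S = list(S)
    if threshold is None:
        threshold = 2 * t              # our choice; assumes len(S) >= 2
    while len(S) > threshold:
        random.shuffle(S)
        W = []
        while len(S) >= 2 * t:
            T, S = S[:t], S[t:]
            W.extend(local_winner(T, dist))
        W.extend(local_winner(S, dist))
        S = W
    return find_antipole(S, dist)
\end{verbatim}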

\subsubsection{Running time analysis of Antipole computation}
The number of distances computed is linear in the input size. Let
a splitting factor $t \in \NN$ be fixed. Most of the work of the
algorithm is spent in the subroutine $\winb$, which computes and
discards the local 1-median. The number $W(n)$ of times this
subroutine is executed satisfies the recurrence equation:
\[
 W(n) = \left\{
 \begin{array}{ll}
 0 & \mbox{if} \; n=1, \\
 \frac{n}{t} + W(\frac{n \times (t-1)}{t}) & \mbox{otherwise}, \\
 \end{array} \right.
\]
whose solution is bounded by the geometric series $W(n) \leq
\sum_{i \geq 1} \frac{n \, (t-1)^{i-1}}{t^{i}} \, = \, n$, hence
$W(n) =$ O($n$); at each call, $\winb$ performs
$\frac{t(t-1)}{2}$ distance computations. The following result is
therefore immediate when the $\mathit{threshold}$ is $t-1$.
\begin{thm} \label{th2}
   Given an input set $S$ of cardinality $n$, our $\app$ performs
   O($t^2 n$) distance computations. \eod
\end{thm}
Therefore the number of distance computations is quadratic in the
tournament subset size $t$ but linear in the size of the input.
When $t$ is a small constant (e.g. 5), the complexity is linear in
$n$.
	Dennis: Again, if we have empirical results showing
	this comes close to the diameter, that would be good.



%%----------------------------------------------------------------------------
%%----------------------------------------------------------------------------



\subsection{The Antipole Tree data structure in general metric spaces} \label{APTreeInMS}
The Antipole Tree data structure can be applied to any metric
space $(X,\dist)$, where $\dist$ is the distance metric. The
elements of the metric space are inserted into a type called
\textit{object}. An \textit{object} $O$ (Fig. \ref{OB} (a)) in the
Antipole data structure contains the following information: an
element $x$, an array $D_V$ storing the distances between $x$ and
all its ancestors (the antipole pairs) in the tree, and a variable
$D_C$ containing the distance to the centroid $C$ of $x$'s
cluster. A data set $S$ is a collection of \textit{objects}.


\begin{figure}[ht!]
\begin{center}
\fbox{
\begin{tabular}{cc}
\includegraphics[width=30mm]{Ob.eps} & \includegraphics[width=30mm]{CLU.eps} \\
(a) & (b)
\end{tabular}
}\caption{(a) A generic object in the antipole data structure.
(b) A generic cluster in the antipole data structure.} \label{OB}
\end{center}
\end{figure}

Each cluster (Fig. \ref{OB} (b)) is a set of \textit{objects} storing the
following information:
\begin{itemize}
\item the centroid $C$, the element that minimizes the sum of the
distances to the other objects;
\item the radius $\Radius$, the distance from $C$ to the farthest
object;
\item the member list $C_{\List}$, the catalogue of the objects
contained in the cluster;
\item the number of objects $\Size$ stored in the cluster.
\end{itemize}
The antipole data structure has internal nodes and leaf nodes.
\begin{itemize}
\item An internal node stores (i) the identities of two antipole
objects $A$ and $B$, called the antipole pair, at distance at
least $2\sigma$ apart, (ii) the radii $\Rad_{A}$ and $\Rad_{B}$ of
the two subsets $S_A$ and $S_B$ obtained by splitting $S$ based on
proximity to $A$ and $B$, respectively, and (iii) pointers to the
left and right subtrees $\myleft$ and $\myright$;
\item A leaf node stores a cluster.
\end{itemize}
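
A minimal Python sketch of these records (ours; the field names mirror the notation above, and \texttt{member\_dists} caches the $D_C$ value of each member):
\begin{verbatim}
from dataclasses import dataclass, field

@dataclass
class Obj:
    x: object                      # the element itself
    D_V: dict = field(default_factory=dict)  # ancestor antipole -> distance
    D_C: float = 0.0               # distance to the cluster centroid C

@dataclass
class Cluster:                     # payload of a leaf node
    C: object                      # centroid (approximate 1-median)
    radius: float                  # distance from C to the farthest member
    members: list                  # objects of the cluster, minus C
    member_dists: list             # cached D_C value of each member

@dataclass
class Node:
    leaf: bool = False
    cluster: object = None         # a Cluster, when leaf is True
    A: object = None               # antipole pair: dist(A, B) >= 2 * sigma
    B: object = None
    rad_A: float = 0.0             # radius of the subset closer to A
    rad_B: float = 0.0             # radius of the subset closer to B
    left: object = None            # subtree of points closer to A
    right: object = None           # subtree of points closer to B
\end{verbatim}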
\begin{figure}[ht!]
\begin{center}
\fbox{ {\scriptsize
\begin{minipage}[c]{160mm}
\begin{tabbing}
$\bapt$($S$, $\sigma$, $Q$) \\
\ \ \ \ \ \ \= \+
1 \' \= {\bf if} \= $Q$ $=$ $\emptyset$ {\bf then} \\
2 \' \>          \> $Q \leftarrow$ $\appb$($S$,$\sigma$); \\
3 \' \>          \> {\bf if} \= $Q \, = \, \emptyset$ {\bf then} // %nothing distant enough,
splitting condition $\Phi$ fails\\
4 \' \>          \>          \> $T.\Leaf \, \leftarrow \, TRUE$;\\
5 \' \>          \>          \> $T.\Cluster \, \leftarrow$ $\mclu$($S$);\\
6 \' \>          \>          \> {\bf return } $T$; \\
7 \' \>          \> {\bf end if}; \\
8 \' \= {\bf end if}; \\
9 \' \=  \{A,B\} $\leftarrow \, Q$; \\
10 \' \= $T.A \, \leftarrow \, A$;\\
11 \' \= $T.B \, \leftarrow \, B$;\\
12 \' \> $S_A \, \leftarrow \, \{O \in S | \dist(O,A) < \dist(O,B)\}$;\\
13 \' \> $S_B \, \leftarrow \, \{O \in S | \dist(O,B) \leq \dist(O,A)\}$;\\
14 \' \> {\bf for} \= {\bf each}  $\, O \, \in \, S$;\\
15 \' \>           \> $O.D_V \, \leftarrow \, O.D_V \, \cup \, \{(A, \dist(O,A)),\, (B,\dist(O,B))\}$; \\
16 \' \> {\bf end for each}; \\
17 \' \= $T.\Rad_A  \, \leftarrow \, \max_{O \, \in \, S_A} \dist(O,A)$; \\
18 \' \= $T.\Rad_B  \, \leftarrow \, \max_{O \, \in \, S_B} \dist(O,B)$; \\
19 \' \= $T.\myleft  \, \leftarrow$ $\bapt$($S_A$,$\sigma$,$\chap$($S_A$,$\sigma$,$A$));\\
20 \' \= $T.\myright \, \leftarrow$ $\bapt$($S_B$,$\sigma$,$\chap$($S_B$,$\sigma$,$B$));\\
21 \' \= {\bf return } $T$; \\
22 \' \textsc{end $\bapt$.}
\end{tabbing}
\end{minipage}}
}\caption{Antipole Algorithm.} \label{BuldAPTREE1}
\end{center}
\end{figure}
To build such a data structure, $\bapt$ (see Fig.
\ref{BuldAPTREE1}) takes as input the data set $S$, a target
cluster radius $\sigma$, and a set $Q$ which is initially empty.
The algorithm starts by checking whether $Q$ is empty; if so, it
calls $\appb$, which returns an antipole pair that is then
inserted into $Q$. The algorithm to find an antipole pair is as
described in the previous section, except that if a pair with
distance greater than $2 \sigma$ is found, the algorithm stops and
returns that pair.

Next the algorithm checks if the splitting condition holds. If so,
the set $S$ is divided into $S_A$ and $S_B$ according to the
distances from the antipole pair: the objects closer to $A$ are
put in $S_A$, and symmetrically for $B$. Otherwise a cluster is
generated. The other subroutine used in $\bapt$ is $\chap$, which
checks, using the distances already computed and cached, whether
there is an object $O$ in $S$ that may form an antipole pair with
the given endpoint $O^{*}$. If such an antipole is found it is
inserted into $Q$, and the recursive call in $\bapt$ skips the
computation of another antipole pair.

The routine $\mclu$ (Fig.~\ref{MakeClu}) creates a cluster of
objects with bounded radius. This procedure computes the cluster
centroid $C$ with the randomized algorithm $\pt$ and then computes
the distance between each object $O$ in the cluster and $C$.
\begin{figure}[ht!]
\begin{center}
\fbox{ {\scriptsize
\begin{minipage}[c]{160mm}
\begin{tabbing}
$\mclu$($S$) \\
\ \ \ \ \ \ \= \+
1 \' \= $\Cluster.C \, \leftarrow$ $\pt$($S$); \\
2 \' \= $\Cluster.\Radius \, \leftarrow \, \max_{x\, \in \, S} \dist(x,\Cluster.C)$ \\
3 \' \> $\Cluster.C_{\List} \, \leftarrow \, S \, \setminus \, \{\Cluster.C\}$; \\
4 \' \> {\bf for each} \= $\, x \, \in \, \Cluster.C_{\List} $;\\
5 \' \>                \> $x.D_{C} \, \leftarrow \, \dist(x,\Cluster.C)$; \\
6 \' \>  {\bf end for each};\\
7 \' \>  {\bf return} $\Cluster$;\\
\textsc{end $\mclu$}
\end{tabbing}
\end{minipage}}
} \caption{MakeCluster Algorithm} \label{MakeClu}
\end{center}
\end{figure}
The data structure resulting from $\bapt$ is a binary tree whose
leaves contain a set of clusters, each of which has an approximate
centroid such that the cluster radius based on that centroid is
less than $\sigma$.
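
Putting the previous sketches together, a condensed executable rendering of $\bapt$ (ours; it works on raw points and, for brevity, omits the $D_V$ ancestor caching and the $\chap$ shortcut):
\begin{verbatim}
def make_cluster(S, dist):
    """MAKE_CLUSTER: centroid, radius and cached centroid distances."""
    C = exact_1_median(S, dist)    # approx_1_median on large clusters
    members = [x for x in S if x is not C]
    dists = [dist(x, C) for x in members]
    return Node(leaf=True,
                cluster=Cluster(C=C, radius=max(dists, default=0.0),
                                members=members, member_dists=dists))

def build(S, sigma, dist, t=3):
    """BUILD: recursive Antipole Tree construction over the points S."""
    if len(S) >= 2:
        A, B = approx_antipole(S, dist, t)
        if dist(A, B) > 2 * sigma:     # splitting condition Phi holds
            S_A, S_B = [], []
            for o in S:                # assign o to its closer endpoint
                (S_A if dist(o, A) < dist(o, B) else S_B).append(o)
            return Node(A=A, B=B,
                        rad_A=max(dist(o, A) for o in S_A),
                        rad_B=max(dist(o, B) for o in S_B),
                        left=build(S_A, sigma, dist, t),
                        right=build(S_B, sigma, dist, t))
    return make_cluster(S, dist)       # Phi fails: S becomes a leaf
\end{verbatim}
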
Fig.~\ref{example1} (a) shows the evolution of the data set during
the construction of the tree. At the first step the pair $A$,$B$
is found by the algorithm $\appb$; then the input data set is
split into two subsets containing the objects nearest to $A$ and
the objects nearest to $B$, respectively. The second step proceeds
as the first for the subset containing $A$ while, for the subset
containing $B$, it produces a cluster because its diameter is less
than $2 \sigma$. The third and final step produces the final
clusters for the subsets containing $A_1$ and $B_1$.
Fig.~\ref{example1} (b) shows the corresponding Antipole data
structure.
\begin{figure}[ht!]
\begin{center}
\fbox{ \centering
\begin{tabular}{cc}
\includegraphics[width=30mm]{dataset.eps} & \includegraphics[width=30mm]{ds.eps} \\
(a) & (b)
\end{tabular}
} \caption{A clustering example (a) in a generic metric space and
(b) the corresponding Antipole data structure.}\label{example1}
\end{center}
\end{figure}

\subsection{Construction time analysis}
Let us compute the running time of each routine. Building the
Antipole Tree could take quadratic time in the worst case. For
example, consider a metric space in which the distance between any
pair of distinct objects is $2 \sigma+1$: every split then peels
off a single element, so the construction time is quadratic in the
number of points, as in other clustering algorithms.%, including MVP-tree.
\section{Range search algorithm}\label{rangesearchsection}
The range search algorithm takes as input the Antipole Tree $T$,
the query \textit{object} $q$, and the threshold $t$, and returns
the result of the range search of the database with threshold $t$.
The search algorithm recursively descends all branches of the tree
until either it reaches a leaf representing a cluster to be
visited or it detects a subtree that is certainly out of range and
therefore may be pruned. Such branches are filtered by applying
the triangle inequality. Notice that the triangle inequality is
used both for exclusion and for inclusion. The first use
establishes that an object can be pruned, thus avoiding the
computation of the distance between that object and the query. The
second establishes that an object must be included, because the
object is close enough to its cluster's centroid and the centroid
is close enough to the query object (see Figs.~\ref{RangeSearch}
and~\ref{VC_RangeSearch} for the pseudo-code).
\begin{figure}[ht!]
\begin{center}
\fbox{ {\scriptsize
\begin{minipage}[c]{160mm}
\begin{tabbing}
$\rs$($T$, $q$, $t$, $OUT$) \\
\ \ \ \ \ \ \= \+
1 \' \= {\bf if} \= ($T.\Leaf$ = {\it FALSE}) {\bf then}\\
2 \' \>          \> $D_{A} \, \leftarrow \,$ $\dist$($q,$ $T.A$);\\
3 \' \>          \> $D_{B} \, \leftarrow \,$ $\dist$($q,$ $T.B$);\\
4 \' \>          \> {\bf if} \= ($D_{A}  \, \leq \, t$)  {\bf then} \\
5 \' \>          \>          \> $OUT \, \leftarrow \, OUT \, \cup \,\{T.A\}$; \\
6 \' \>          \> {\bf end if;} \\
7 \' \>          \> {\bf if} \= ($D_{B} \, \leq \, t$)  {\bf then} \\
8 \' \>          \>          \> $OUT \, \leftarrow \, OUT \, \cup\, \{T.B\}$; \\
9 \' \>          \> {\bf end if;} \\
10 \' \>         \> $q.D_{V}  \, \leftarrow \, q.D_{V} \, \cup \, \{ D_{A},D_{B}\}$;\\
11 \' \>         \> {\bf if} \= ($D_{A} \, \leq \, t + T.\Rad_A$) {\bf then}\\
12 \' \>         \>          \> $\rs$($T.\myleft$, $q$, $t$, $OUT$); \\
13 \' \>         \> {\bf end if;} \\
14 \' \>         \> {\bf if} \= ($D_{B}\, \leq \, t + T.\Rad_B$) {\bf then} \\
15 \' \>         \>          \> $\rs$($T.\myright$, $q$, $t$, $OUT$); \\
16 \' \>         \> {\bf end if;} \\
17 \' \>         \> $q.D_{V}  \, \leftarrow \, q.D_{V} \, \setminus \, \{ D_{A},D_{B}\}$;\\
18 \' \>         \> {\bf return;} \\
19 \' \> {\bf else} \= \\
20 \' \>            \> $OUT \, \leftarrow \, OUT \, \cup$ $\{$$\vc$($T.\Cluster$, $q$, $t$, $OUT$)$\}$; \\
21 \' \= {\bf end if}; \\
22 \' \= {\textsc{end $\rs$}}. \\
\end{tabbing}
\end{minipage}}
}\caption{The Range Search Algorithm} \label{RangeSearch}
\end{center}
\end{figure}

\begin{figure}[ht!]
\begin{center}
\fbox{ {\scriptsize
\begin{tabular}{c|c}
\begin{minipage}[c]{160mm}
\begin{tabbing}
$\vc$($\Cluster$, $q$, $t$, $OUT$) \\
\ \ \ \ \ \ \= \+
1 \'  \= $q.D_{C} \, \leftarrow$ $\dist$($q,\, \Cluster.C$); \\
2 \'  \> {\bf if} \= ($q.D_{C} \, \leq \, t$) {\bf then} \\
3 \'  \>          \> $OUT \, \leftarrow \, OUT \, \cup \, \{\Cluster.C\}$; \\
4 \'  \> {\bf end if}; \\
5 \'  \= {\bf if} \= ($q.D_{C} \, \geq \, t \, + \, \Cluster.\Radius$) {\bf then}\\
6 \'  \>          \> {\bf return;} \\
7 \'  \= {\bf end if}; \\
8 \'  \> {\bf if} \= ($q.D_{C} \, \leq \, t \, - \, \Cluster.\Radius$) {\bf then}\\
9 \'  \>          \> $OUT \, \leftarrow \, OUT \, \cup \, \Cluster$;\\
10 \' \> \> {\bf return} $OUT$; \\
11 \' \= {\bf end if}; \\
12 \' \> {\bf for} \= {\bf each} $O \, \in \, \Cluster.C_{\List}$ {\bf do} \\
13 \' \>           \> {\bf if} \= ($q.D_{C} \, \geq \, t \, + \, O.D_{C}$) {\bf then}\\
14 \' \>           \>          \> {\bf continue;} \\
15 \' \>           \> {\bf end if}; \\
\end{tabbing}
\end{minipage}
&
\begin{minipage}[c]{160mm}
\begin{tabbing}
\ \ \ \ \ \ \ \ \= \= \=           \=  \+ \\
16 \' \>           \>           \>  {\bf if} \= ($q.D_{C} \, \leq \, t \, - \, O.D_{C}$) {\bf then}\\
17 \' \>           \>           \>           \> $OUT \, \leftarrow \, OUT \, \cup \, \{O\}$; \\
18 \' \>           \>           \>          \> {\bf continue;} \\
19 \' \>           \>           \> {\bf end if}; \\
20 \' \>           \>           \> {\bf if} \= $(\nexists (d_q \, \in \, q.D_V \,\, and \,\,d_O \, \in \, O.D_V)\, | $ \\
   \' \>           \>           \>          \> $\, d_q \, \geq \, t \, + \, d_O \,\, or \,\, d_q \, \leq \, t \, - \, d_O$) {\bf then}\\
22 \' \>           \>           \>          \> {\bf if} \= ($\dist$$(q,O) \, \leq \, t$) {\bf then}\\
23 \' \>           \>           \>          \>          \> $OUT \, \leftarrow \, OUT \,\cup \, \{O\}$; \\
24 \' \>           \>           \>          \> {\bf end if}; \\
25 \' \>           \>           \> {\bf else} \= \\
26 \' \>           \>           \>            \> {\bf if} \= ($d_q \, \leq \, t \, - \, d_O$) {\bf then} \\
27 \' \>           \>           \>            \>          \> $OUT \, \leftarrow \, OUT \, \cup \, \{O\}$; \\
28 \' \>           \>           \>            \> {\bf end if}; \\
29 \' \>           \>           \> {\bf end if}; \\
30 \' \> {\bf end for each}; \\
31 \' \> {\bf return} $OUT$; \\
32 \' \> \textsc{end $\vc$}. \\
\end{tabbing}
\end{minipage}
\end{tabular}}}\caption{VisitCluster} \label{VC_RangeSearch}
\end{center}
\end{figure}
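
The pruning logic of Figs.~\ref{RangeSearch} and~\ref{VC_RangeSearch} can be rendered compactly on the node records sketched earlier (ours; the test on cached ancestor distances $D_V$, lines 20--29 of Fig.~\ref{VC_RangeSearch}, is omitted for brevity):
\begin{verbatim}
def range_search(node, q, t, dist, out=None):
    """RANGE_SEARCH: collect every object within distance t of q."""
    out = [] if out is None else out
    if node.leaf:
        visit_cluster(node.cluster, q, t, dist, out)
        return out
    d_A, d_B = dist(q, node.A), dist(q, node.B)
    if d_A <= t:
        out.append(node.A)
    if d_B <= t:
        out.append(node.B)
    if d_A <= t + node.rad_A:    # left subtree may meet the query ball
        range_search(node.left, q, t, dist, out)
    if d_B <= t + node.rad_B:
        range_search(node.right, q, t, dist, out)
    return out

def visit_cluster(cl, q, t, dist, out):
    """VISIT_CL: triangle inequality used for exclusion and inclusion."""
    d_C = dist(q, cl.C)
    if d_C <= t:
        out.append(cl.C)
    if d_C > t + cl.radius:      # whole cluster certainly out of range
        return
    if d_C <= t - cl.radius:     # whole cluster certainly within range
        out.extend(cl.members)
        return
    for o, d_oC in zip(cl.members, cl.member_dists):
        if d_C > t + d_oC:       # excluded: dist(q,o) >= d_C - d_oC > t
            continue
        if d_C <= t - d_oC:      # included: dist(q,o) <= d_C + d_oC <= t
            out.append(o)
        elif dist(q, o) <= t:    # the one unavoidable computation
            out.append(o)
\end{verbatim}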

\section{K-Nearest Neighbor Algorithm}\label{knnsection}
The $k$-Nearest Neighbor search algorithm takes as input the
Antipole Tree $T$, the query object $q$, and the parameter $k$
indicating the number of objects requested. It returns the set of
objects in $S$ which are the $k$ nearest neighbors of $q$.
Hjaltason and Samet~\cite{HS1999} propose a method called
Incremental Nearest Neighbor to perform $k$-nearest neighbor
search in spatial databases. Their approach uses a priority queue
storing the subtrees that should be visited, ordered by their
distance from the query node. The authors claim that their
approach can be applied to all hierarchical data structures. Here
we show an application of their method to the Antipole Tree.

The incremental nearest neighbor algorithm starts by putting the
tree in the priority queue $pQueue$ and then repeatedly extracts
the minimum from the queue. If the extracted node is a leaf
(cluster) it visits the leaf; otherwise the algorithm determines
which subtrees need to be visited. Pruning uses a dynamically
maintained threshold $t$, initialized to $\infty$, which records
the largest distance from $q$ to any of the current $k$ nearest
neighbors; the current $k$ nearest neighbors are stored in another
heap, $outQueue$, to make insertions, deletions and updates
efficient. Figs.~\ref{KNNSearch} and~\ref{KNNcheck} summarize the
pseudo-code.
\begin{figure}[ht!]
\begin{center}
\fbox{ {\scriptsize
\begin{minipage}[c]{160mm}
\begin{tabbing}
$\knn$($T$, $q$, $t$, $outQueue$, $k$, $pQueue$) \\
\ \ \ \ \ \ \= \+
1 \' \=  Enqueue($pQueue,Tree,NULL,NULL$); \\
2 \' \>  {\bf while} \= NotEmpty($pQueue$) {\bf do} \\
3 \' \>              \>  $node$ = Dequeue($pQueue$);  \\
4 \' \>              \> {\bf if} \= ($node.\Leaf = {\it TRUE}$)  {\bf then}\\
5 \' \>              \>          \> $\kvc$($node$, $q$, $t$, $outQueue$,$k$); \\
6 \' \>              \> {\bf else} \= \\
7 \' \>              \>            \> $D_{A} \, \leftarrow \,$ $\ck$($q,$ $node.A,$ $t,$ $outQueue$);\\
8 \' \>              \>            \> $D_{B} \, \leftarrow \,$ $\ck$($q,$ $node.B,$ $t,$ $outQueue$);\\
9 \' \>              \>            \> {\bf if} \= ($|outQueue| < k$)  {\bf then}\\
10 \' \>             \>            \>          \> Enqueue($pQueue,node.\myleft$,$D_A$,$node.\Rad_A$); \\
11 \' \>             \>            \>          \> Enqueue($pQueue,node.\myright$,$D_B$,$node.\Rad_B$); \\
12 \' \>             \>            \> {\bf else} \= \\
13 \' \>             \>            \>            \> {\bf if} \= ($D_{A}   \, < t \, + node.\Rad_{A}$)   {\bf then} \\
14 \' \>             \>            \>            \>          \>  Enqueue($pQueue,node.\myleft$,$D_A$,$node.\Rad_A$); \\
15 \' \>             \>            \>            \> {\bf end if;} \\
16 \' \>             \>            \>            \> {\bf if} \= ($D_{B}   \, < t \, + node.\Rad_{B}$)   {\bf then} \\
17 \' \>             \>            \>            \>          \>  Enqueue($pQueue,node.\myright$,$D_B$,$node.\Rad_B$); \\
18 \' \>             \>            \>            \> {\bf end if;} \\
19 \' \>             \> {\bf end if;} \\
20 \' \> {\bf end while}; \\
21 \' \= {\textsc{end $\knn$}}. \\
\end{tabbing}
\end{minipage}}
}\end{center} \caption{k-Nearest Neighbor Search Algorithm}
\label{KNNSearch}
\end{figure}

\begin{figure}[ht!]
\begin{center}
\fbox{{\scriptsize
\begin{minipage}[c]{160mm}
\begin{tabbing}
$\ck$($q$, $O$, $t$, $OUT$) \\
\ \ \ \ \ \  \= \+
1 \'      \= $D_O \, \leftarrow \, \dist(q,O);$ \\
2 \'      \> {\bf if} \= ($|OUT| < k$) {\bf then} \\
3 \'      \>          \> $\hi$($O$, $OUT$);\\
4 \'      \>          \> $t \leftarrow $ $\hem$($OUT$);\\
5 \'      \> {\bf else} \= \\
6 \'      \>            \> {\bf if} \= ($D_O \,< \, t$) {\bf then}\\
7 \'      \>            \>          \> $\hi$($O$, $OUT$);\\
8 \'      \>            \>          \> $t \leftarrow $ $\hem$($OUT$);\\
9 \'      \>            \> {\bf end if}; \\
10 \'      \= {\bf end if}; \\
11 \'      \> {\bf return} $D_O$; \\
12 \'      \= {\textsc{end $\ck$}}. \\
\end{tabbing}
\end{minipage}}
}
\end{center}
\caption{Checking if the object $O$ should be added to the $OUT$
set.} \label{KNNcheck}
\end{figure}
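
An executable sketch of the whole search (ours; it simplifies the cluster visit to a plain scan and keys the priority queue by the lower bound $\max(\dist(q,c)-r,0)$ on the distance from $q$ to anything inside a subtree with center $c$ and radius $r$):
\begin{verbatim}
import heapq, itertools

def knn(root, q, k, dist):
    """Best-first k-nearest-neighbor search over the Antipole Tree."""
    counter = itertools.count()  # tie-breaker: heaps never compare nodes
    best = []                    # max-heap (negated keys) of the k best
    t = float("inf")             # distance to the current k-th neighbor

    def consider(o):             # plays the role of KNN_CHECK
        nonlocal t
        d = dist(q, o)
        if len(best) < k:
            heapq.heappush(best, (-d, next(counter), o))
        elif d < -best[0][0]:
            heapq.heapreplace(best, (-d, next(counter), o))
        if len(best) == k:
            t = -best[0][0]
        return d

    pq = [(0.0, next(counter), root)]  # min-heap keyed by lower bound
    while pq:
        lb, _, node = heapq.heappop(pq)
        if lb >= t:              # no remaining subtree can improve best
            break
        if node.leaf:
            for o in [node.cluster.C] + node.cluster.members:
                consider(o)
        else:
            d_A, d_B = consider(node.A), consider(node.B)
            if d_A < t + node.rad_A:
                heapq.heappush(pq, (max(d_A - node.rad_A, 0.0),
                                    next(counter), node.left))
            if d_B < t + node.rad_B:
                heapq.heappush(pq, (max(d_B - node.rad_B, 0.0),
                                    next(counter), node.right))
    return [(-nd, o) for nd, _, o in sorted(best, reverse=True)]
\end{verbatim}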

\section{Experimental Analysis} \label{ExpAnalysis}
In this section we evaluate the efficiency of constructing and
searching through an Antipole Tree. We have implemented the
structure in the C programming language under the Linux operating
system. The experiments use synthetic and real data sets. The
synthetic data sets are based on those used in \cite{BO99}:
\begin{itemize}
\item uniform 10-dimensional Euclidean space (more precisely, sets
of $100000, 200000, \ldots, 500000$ objects uniformly distributed
in $[0,1]^{10}$);
\item clustered 20-dimensional Euclidean space: a set of 1000000
objects organized in 100 clusters of equal size, with a uniform
distribution inside each cluster.
\end{itemize}
The real data sets are:
\begin{itemize}
\item a set of 45000 strings chosen from the Linux dictionary;
\item a set of 42000 images chosen from the Corel image database;
\item a data set of high-dimensional Euclidean vectors obtained
from the VISTEX texture database~\cite{VisTex}.
\end{itemize}

For each experiment we ran 100 queries, half of which are objects
in the data structure and half outside it.
\subsection{Construction Time}
We measure construction time in terms of the number of distance
computations and CPU time, on uniformly distributed objects in
$[0,1]^{10}$ as described above.
Fig.~\ref{prep} (a) illustrates a comparison among the Antipole
Tree, the MVP-tree and the M-tree, showing the distance
computations needed during construction. Data were again taken in
$[0,1]^{10}$, with sizes from 100000 to 500000 elements. The
cluster radius used was $\sigma = 0.625$, as found by our
estimation algorithm described below. We used the parameter
settings for MVP-trees and M-trees suggested by the authors
\cite{BO99,CP98}. Fig.~\ref{prep} (a) also shows that building the
Antipole Tree requires fewer distance computations than the M-tree
but more than the MVP-tree, by roughly a factor of 1.5.
Fig.~\ref{fraction} shows that the difference in construction
costs can be compensated for by faster range queries on less than
0.2\% of the entire input database. Thus, unless queries are very
rare, the antipole tree makes up in queries what it loses in
construction costs. We discuss this further in section \ref{RSA}.
\begin{figure}[ht!]
\begin{center}
\fbox{
\begin{minipage}{0.85\textwidth}
\begin{tabular}{cc}
\includegraphics[width=60mm]{distSize.eps} & \includegraphics[width=60mm]{timeSize.eps} \\
(a) & (b)
\end{tabular}
\end{minipage}
}
\end{center} \caption{(a) Construction complexity as measured by the number of
distance computations varying the size of the database to compare
Antipole Tree (cluster diameter is 1.25) vs. M-tree and MVP-tree.
In both experiments uniformly generated data sets were used. (b)
CPU time in seconds needed to build the Antipole Tree.}
\label{prep}
\end{figure}
\begin{figure}[ht!]
\begin{center}
\includegraphics[width=70mm]{fraction.eps}
\caption{Number of range queries, as a fraction of the data set
size, which are sufficient to recover the higher cost of antipole
tree construction with respect to MVP-tree construction.
} \label{fraction}\end{center}
\end{figure}
Fig.~\ref{prep} (b) shows the CPU time needed to bulk load the
proposed data structure; it also shows that the CPU time needed to
construct the Antipole Tree grows roughly linearly. Because the
MVP-Tree entails sorting, it requires at least O($n \log n$)
operations (though not distance calculations) to build the data
structure.
\subsection{Choosing the best cluster diameter} \label{CHOSEDIAM}
In this subsection we discuss ways to tune the Antipole Tree for
range search queries. We measure the cost by the number of
distance calculations among objects of the underlying metric
space.

To tune the Antipole Tree, we must choose the radius $\sigma$ of
the clusters very carefully by analyzing the data set properties.
In what follows we show that the optimal cluster radius depends on
the intrinsic dimensionality of the underlying metric space and on
the query threshold $t$.

We performed, as described before, our experiments in 10 and 20
dimensional spaces with uniform and clustered distributions of
size 100000. However, the methodology for finding the optimal
diameter can be applied to other dimensions and arbitrary data
sizes.
\begin{figure}[ht!]
\begin{center}
\fbox{
\begin{minipage}{0.85\textwidth}
\begin{tabular}{cc}
\includegraphics[width=60mm]{diamTuning01-1.eps} &
\includegraphics[width=60mm]{distClusterized.eps} \\
Uniform & Clustered \\
\includegraphics[width=60mm]{diamTuning0.25-2.5.eps} &
\includegraphics[width=60mm]{distClusterized2.eps} \\
Uniform & Clustered
\end{tabular}
\end{minipage}
}\caption{Diameter tuning on uniformly and clustered generated
points in 10 and 20 dimensions respectively.}\label{DiamEff1}
\end{center}
\end{figure}
Figs.~\ref{DiamEff1} (Uniform) and (Clustered) show that, across
different values of the threshold $t$ of the range search, the
best choice of the cluster diameter is 0.625 for the uniform data
set and 2 for the clustered one. Experiments with real and
synthetic data showed that choosing the cluster diameter 10\% less
than the median pairwise distance gives very good performance
regardless of the range search threshold, a quite surprising
result.
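A minimal sketch of this tuning rule in Python (the \texttt{dist}
function and the sample size are assumptions for illustration, not
part of the experimental setup):
\begin{verbatim}
import random

def tune_cluster_diameter(data, dist, samples=1000):
    # estimate the cluster diameter as 90% of the median
    # pairwise distance over a random sample of pairs
    ds = sorted(dist(*random.sample(data, 2))
                for _ in range(samples))
    return 0.9 * ds[len(ds) // 2]
\end{verbatim}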
\subsection{Range search analysis and comparisons}\label{RSA}
In this subsection we present an extended comparison among the
Antipole Tree, the MVP-tree, the M-tree, and the List of Clusters, in
terms of the number of distance computations for range queries.
The number of distance computations required by each query has
been estimated as the average over a set of one hundred queries.
In order to perform a fair comparison with the two competing data
structures, the MVP-tree and the M-tree, we set their
implementation parameters to the best values according
to~\cite{CP98}. For the MVP-tree, it is shown in~\cite{BO99} that
the best performance of the structure is achieved by setting the
parameters as follows:
\begin{enumerate}
\item two vantage points $v_1$ and $v_2$ in every internal node;
\item $m^2 \, = \, 4$ partition classes, i.e., four children for
each pair of vantage points;
\item $k \, = \, 13$, the maximum number of objects in a leaf node;
\item $p$ unbounded, where $p$ is the size of the vector storing
the distances between the objects in a leaf and their ancestors in
the tree (the vantage points). Such a vector is used during the
range search to discard objects without having to compute their
distance from the query object. Notice that the deeper a leaf lies
in the tree, the more vantage-point distances are available to
prune candidates, which improves the performance of the MVP-tree
in terms of distance computations. For this reason we set this
parameter to its maximum value: the height of the
MVP-tree.\footnote{We are grateful to T. Bozkaya and M. Ozsoyoglu
for giving us the program to generate the input for the clustered
data set.}
\end{enumerate}
For the M-tree implementation, we made use of the {\it
BulkLoading}\footnote{We are grateful to P. Ciaccia, M. Patella,
and P. Zezula for giving us the source code of the M-tree.}
algorithm \cite{CP98}. The two parameters needed to tune the data
structure for better performance are the minimum node utilization
and the secondary memory page size. The best search performance
was obtained with minimum node utilization 0.2 and page size 8K. \\
Concerning the List of Clusters data structure, we used clusters
of fixed radius determined as suggested by the
authors~\cite{CN2000}, with the $p3$ and $p5$ heuristics to choose
the $i$-th center of the $i$-th cluster. The $p3$ heuristic
consists in finding the element furthest from the $(i-1)$-th
center; the $p5$ heuristic consists in finding the element which
maximizes the sum of distances to the previous centers. \\
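For concreteness, the two heuristics amount to the following (a
Python sketch; \texttt{dist} is an assumed distance function):
\begin{verbatim}
def next_center_p3(data, prev_center, dist):
    # p3: pick the element furthest from the previous center
    return max(data, key=lambda x: dist(x, prev_center))

def next_center_p5(data, centers, dist):
    # p5: pick the element maximizing the sum of distances
    # to all previously chosen centers
    return max(data,
               key=lambda x: sum(dist(x, c) for c in centers))
\end{verbatim}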
In the first experiment (Fig.~\ref{UNIFORM1}) we compare the four
data structures on a uniform data set taken from $[0,1]^n$ with $n
\, = \, 10$, varying the query threshold from 0.1 to 0.8 and using
a data set of size 300,000. For the Antipole Tree we used two
different cluster radii $\sigma$: 0.5 and 0.625. The Antipole Tree
performs better than the other three data structures, computing
fewer distances during the search.\\
Notice that with a query threshold from 0.1 to 0.8 we capture in
the output from $0\%$ to $0.1\%$ of the elements of the entire
data set. Fig.~\ref{UNIFORM2} shows that with query thresholds
from 0.4 to 0.6 we save between 28\% and 70\% of the distance
computations, which in the figure is indicated as the gain
percentage.
\begin{figure}[ht!]
\begin{center}
\begin{minipage}{0.70\textwidth}
\begin{center}
\includegraphics[width=70mm]{DistSize300.eps}
\end{center}
\end{minipage}
 \caption{Comparisons in 10 dimensions with 300,000 randomly
generated vectors varying the query threshold from 0.1 to 0.8. The
picture shows the number of computed distances comparing the
Antipole Tree with diameter ($2\sigma$) 1 and 1.25 versus the
MVP-tree, the M-tree, and the List of Clusters.}
\label{UNIFORM1}\end{center}
\end{figure}
\begin{figure}[ht!]
\begin{center}
\fbox{
\begin{minipage}{0.85\textwidth}
\begin{center}
\begin{tabular}{cc}
\includegraphics[width=60mm]{DistSize2_04.eps} &
\includegraphics[width=60mm]{DistSize2_05.eps} \\
Query Threshold 0.4 & Query Threshold 0.5 \\
\includegraphics[width=60mm]{DistSize2_06.eps} &
\includegraphics[width=60mm]{Percentage3.13.eps}\\
Query Threshold 0.6 & Gain percentage
\end{tabular}
\end{center}
\end{minipage}
} \caption{Comparisons for randomly generated 10-dimensional
vectors varying the data set size from 100,000 to 500,000 objects.
Each picture shows the number of distance calculations for
thresholds 0.4 to 0.6 comparing the Antipole Tree, the MVP-tree,
the M-tree, and the List of Clusters, with the respective gain
percentage (percentage of distances saved) of the Antipole Tree
w.r.t.\ the other three structures.}
\label{difference}\label{UNIFORM2}\end{center}
\end{figure}
\begin{figure}[ht!]
\begin{center}
\fbox{
\begin{minipage}{0.85\textwidth}
\begin{center}
\begin{tabular}{cc}
\includegraphics[width=60mm]{DistClustered.eps} & \includegraphics[width=60mm]{PercentageCluster.eps}\\
(a) & \\
\includegraphics[width=60mm]{distEDict.eps}     & \includegraphics[width=60mm]{PercentageEDict.eps}\\
(b) & \\
\includegraphics[width=60mm]{distHist.eps}      & \includegraphics[width=60mm]{PercentageHist.eps}\\
(c) &
\end{tabular}
\end{center}
\end{minipage}
} \caption{(a) Performance in a clustered space of the Antipole
Tree vs.\ the MVP-tree, the M-tree, and the List of Clusters in 20
dimensions, varying the query threshold, with cluster radius 2.
(b) Antipole Tree vs.\ MVP-tree under an editing distance metric
with cluster radius 5. (c) Antipole Tree vs.\ MVP-tree, M-tree,
and List of Clusters on a set of image histograms with cluster
radius 0.4. The right column reports the gain percentage under the
various metrics.}\label{editing_clu}\label{gaingenericmetric}
\end{center}
\end{figure}
The next set of experiments, in Fig.~\ref{editing_clu}, was
designed to compare the four data structures in different metric
spaces: the clustered Euclidean space $\Real^{20}$, a string space
under an editing distance metric, and an image histogram space
with an $L_2$ distance metric. The corresponding data sets are:
100,000 clustered points, 45,000 strings from the Linux
dictionary, and 42,000 image histograms from the Corel image
database,\footnote{Obtained from the UCI Knowledge Discovery in
Databases Archive, http://kdd.ics.uci.edu/} respectively.\\
The results show a 30\% savings in distance computations.

Since the List of Clusters reportedly works well in high
dimensions, in Fig.~\ref{} we show a range search comparison using
the VISTEX~\cite{VisTex} texture database, which is a very high
dimensional data set.




\subsection{K-nearest neighbor comparisons}
In Fig.~\ref{KNNgaingenericmetric} we present a set of experiments
in which the $\knn$ algorithm is compared with the M-tree and the
List of Clusters. We compared the Antipole Tree with just these
two structures because $k$-nearest neighbor search is not
discussed for the MVP-tree (see \cite{BO99}). As described in
Section~\ref{RSA}, we chose uniform and clustered data in
$\Real^{10}$ and $\Real^{20}$. Each data set has size 100,000. We
ran the $\knn$ algorithm with $k \,=\, 1,2,4,6,8,10,15,20$, using
one hundred queries for each experiment, half belonging to the
data structure and half not. Using the Antipole Tree we save up to
85\% of the distance computations.
\begin{figure}[ht!]
\begin{center}
\fbox{
\begin{minipage}{0.85\textwidth}
\begin{tabular}{cc}
\includegraphics[width=60mm]{knnDim10.eps} & \includegraphics[width=60mm]{knnDim10Perc.eps}\\
(a) & \\
\includegraphics[width=60mm]{knnDim20.eps} & \includegraphics[width=60mm]{knnDim20Perc.eps}\\
(b) & \\
\includegraphics[width=60mm]{knnDim32.eps} & \includegraphics[width=60mm]{knnDim32Perc.eps}\\
(c) &
\end{tabular}
\end{minipage}
} \caption{$K$-nearest neighbor comparisons in number of distance
computations (left column) with the respective gain percentage
(right column). (a) 100,000 uniformly generated points in
$[0,1]^{10}$; (b) 100,000 cluster-generated points in
$\Real^{20}$; (c) the image histogram
database.}\label{KNNgaingenericmetric}
\end{center}
\end{figure}


\section{Approximate K-Nearest Neighbor Search via the Antipole Tree}\label{approxknn}
When the dimension of the space becomes very high (say $> 50$),
all existing data structures perform poorly on exact range and
$k$-nearest neighbor searches. This is due to the well known
problem of the {\em curse of dimensionality} \cite{IM98}. Lower
bounds~\cite{C94} show that the search complexity grows
exponentially with the space dimension. For generic metric spaces,
following~\cite{CN01,CNBM01}, we introduce the concept of {\em
intrinsic} dimensionality:
\begin{defn}
Let $(X,d)$ be a metric space, and let $S \, \subseteq \, X$. The
intrinsic dimension of $S$ is $\rho \, = \, \frac{\mu^2}{2
\sigma^2}$, where $\mu$ and $\sigma^{2}$ are the mean and the
variance of its histogram of distances. \end{defn}
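As a minimal illustration, the intrinsic dimension can be
estimated from a random sample of pairwise distances (a Python
sketch; \texttt{dist} and the sample size are assumptions):
\begin{verbatim}
import random

def intrinsic_dimension(data, dist, samples=10000):
    # estimate rho = mu^2 / (2 sigma^2) from sampled
    # pairwise distances
    ds = [dist(*random.sample(data, 2)) for _ in range(samples)]
    mu = sum(ds) / len(ds)
    var = sum((x - mu) ** 2 for x in ds) / len(ds)
    return mu * mu / (2 * var)
\end{verbatim}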
A promising approach to at least alleviate the curse of
dimensionality is to consider approximate and probabilistic
algorithms for $k$-nearest neighbor search; in some applications
such algorithms give acceptable results. Several interesting
algorithms have been proposed in the
literature~\cite{CN01,CP00,M97,GG91}. One of the most successful
data structures seems to be Tree-Structured Vector Quantization
(TSVQ). Here we show how to use the Antipole Tree to design a
suitable approximate search algorithm for nearest neighbor
searching. A first simple algorithm, called $\bp$, follows the
best path in the tree from the root to a leaf and returns the
centroid stored in that leaf node. This algorithm uses the
strategy of TSVQ to find an approximate nearest neighbor of a
query object quickly. \\
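The descent itself is straightforward; a Python sketch follows
(the node fields \texttt{is\_leaf}, \texttt{left}, \texttt{right},
\texttt{antipole\_a}, \texttt{antipole\_b}, and \texttt{centroid}
are illustrative names, not those of our implementation):
\begin{verbatim}
def best_path(node, q, dist):
    # follow the best path from the root to a leaf, moving
    # toward the antipole endpoint closer to the query, and
    # return the centroid stored in that leaf
    while not node.is_leaf:
        if dist(q, node.antipole_a) <= dist(q, node.antipole_b):
            node = node.left
        else:
            node = node.right
    return node.centroid
\end{verbatim}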
In what follows we present a set of experiments in which TSVQ and
the Antipole Tree are compared. The experiments refer to uniformly
generated objects in spaces whose dimension ranges from $10$ to
$50$. For each input data set one hundred queries were executed.
In order to evaluate the quality of the results we first ran the
exact search. \\ Then the error $\delta$ is computed in the
following way:
\[
\delta \, = \, \frac{|\dist(O_{opt},q) -
\dist(O_{TSVQ/Antipole},q)|}{\dist(O_{opt},q)}.
\]
\begin{figure}[ht!]
\begin{center}
\fbox{
\begin{minipage}{0.85\textwidth}
\begin{center}
\begin{tabular}{cc}
\includegraphics[width=60mm]{APA_TSVQ_GEN.eps}
&\includegraphics[width=60mm]{APA_TSVQ_GEN_DIST.eps} \\
\includegraphics[width=60mm]{APA_TSVQ_R20_CLUSTER_FINALE.eps}
&\includegraphics[width=60mm]{APA_TSVQ_R20_CLUSTER_FINALE_DIST.eps} \\
(a) & (b)
\end{tabular}
\end{center}
\end{minipage}
} \caption{A comparison between the approximate Antipole search
and the TSVQ search. (a) shows the average error introduced by the
two algorithms on uniformly generated points with $\sigma = 0.5$
and various space dimensions (upper picture), and on clustered
points (lower picture) in space dimension 20 with various cluster
radii $\sigma$. (b) shows the number of distances computed by the
two algorithms to perform the search.}\label{APvsTSVQ}
\end{center}
\end{figure}
Fig.~\ref{APvsTSVQ} (a) depicts the errors introduced by the two
approximate algorithms on a uniformly generated set of points
(upper figures) and on a clustered set of points (lower figures).
Fig.~\ref{APvsTSVQ} (b) shows the number of distances computed by
the two algorithms. \\
The experiments clearly show that the Antipole Tree improves on
TSVQ. We believe this is due to the better position of the
Antipole pairs.\\
A more sophisticated approximation algorithm for the $k$-nearest
neighbor problem can be obtained by using the $\knn$ algorithm.
The idea is the following: for each cluster reached during the
search, the algorithm compares the query object with the cluster
centroid only, without taking into consideration the objects
inside the cluster. \\ This search is slower than $\bp$ but more
precise, and it can be used to perform $k$-nearest neighbor
search. Fig.~\ref{knnAP} (a) shows a set of experiments done in
uniform spaces in dimension 30 with radius $\sigma$ set to 1 and
1.5.
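A rough sketch of this centroid-based variant (Python; the
best-first order and the \texttt{max\_clusters} budget are
illustrative choices standing in for the pruned traversal of the
$\knn$ algorithm):
\begin{verbatim}
import heapq

def approx_knn(root, q, k, dist, max_clusters=50):
    # best-first traversal guided by antipole distances; the
    # query is compared with cluster centroids only, never
    # with the objects stored inside a cluster
    frontier = [(0.0, id(root), root)]
    scored = []                  # (distance to centroid, centroid)
    visited = 0
    while frontier and visited < max_clusters:
        _, _, node = heapq.heappop(frontier)
        if node.is_leaf:
            scored.append((dist(q, node.centroid), node.centroid))
            visited += 1
        else:
            for child in (node.left, node.right):
                # rank a child by its closer antipole endpoint
                p = min(dist(q, child.antipole_a),
                        dist(q, child.antipole_b))
                heapq.heappush(frontier, (p, id(child), child))
    scored.sort(key=lambda t: t[0])
    return [c for _, c in scored[:k]]
\end{verbatim}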
\begin{figure}[ht!]
\begin{center}
\fbox{
\begin{minipage}{0.85\textwidth}
\begin{center}
\begin{tabular}{cc}
\includegraphics[width=60mm]{KNN_APPROX.eps}
&\includegraphics[width=60mm]{GainR30.eps} \\
\end{tabular}
\end{center}
\end{minipage}
} \caption{An experiment with the approximate $k$-nearest neighbor
algorithm in dimension 30: (a) the average error; (b) the gain
percentage in the number of distance computations.}\label{knnAP}
\end{center}
\end{figure}
Precision and recall are important metrics in approximate
matching. Following~\cite{} we call the $k$ nearest neighbor
elements of a query $q$ the $k$ {\em golden} results. The {\em
recall} is the number of golden elements retrieved divided by $k$,
given a bound, called the {\em quota}, on the number of distances
that may be computed during the search. The {\em precision} is the
number of golden elements retrieved divided by the number of
distances computed. Alternatively, if the recall $r$ is fixed
(e.g.\ 50\%), one can measure the number of distance calculations
needed to reach such a recall. We conducted such experiments with
our Antipole data structure and with the approximate version of
the List of Clusters~\cite{BN2002}. The experiments in Fig.~\ref{}
refer to 100,000 elements of dimension 30.
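For concreteness, the two measures as a Python sketch
(\texttt{golden} is the set of $k$ exact answers;
\texttt{retrieved} and \texttt{distances\_computed} come from the
approximate search; all names are illustrative):
\begin{verbatim}
def recall(golden, retrieved):
    # golden elements retrieved, divided by k
    return len(set(golden) & set(retrieved)) / len(golden)

def precision(golden, retrieved, distances_computed):
    # golden elements retrieved, divided by the number of
    # distance computations spent to retrieve them
    return len(set(golden) & set(retrieved)) / distances_computed
\end{verbatim}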

\section{A comparison with linear scan}\label{linearScan}
In this section we present a set of experiments in which we
compare the proposed data structure with a naive linear scan. We
used a set of very high dimensional Euclidean data sets obtained
from textures in the VISTEX database~\cite{VisTex}. Starting from
a given texture, the data sets of tuples were built in the
following way: for each pixel $p$ in the texture we considered,
for each color channel, half of its $k \times k$ neighborhood
(see~\cite{Wey01} for more details). We obtain data sets of
dimension ranging from $63$ to $267$. The results, plotted in
Fig.~\ref{linearscan}, show that the proposed data structure
outperforms the linear scan on such high dimensional data sets.
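The baseline is the obvious one; a Python sketch (\texttt{dist} is
an assumed distance function):
\begin{verbatim}
import heapq

def linear_scan_knn(data, q, k, dist):
    # naive baseline: compute the distance from the query to
    # every object in the data set and keep the k smallest
    return heapq.nsmallest(k, data, key=lambda x: dist(x, q))
\end{verbatim}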
\begin{figure}[ht!]
\begin{center}
\fbox{
\begin{minipage}{0.85\textwidth}
\begin{center}
\begin{tabular}{cc}
\includegraphics[width=60mm]{linearScan/KNN/2809punti_R267/knnDim267.eps} & \includegraphics[width=60mm]{linearScan/RANGE/2809_R267/DistSize267.eps}\\
(a) & \\
\includegraphics[width=60mm]{linearScan/KNN/3249punti_R147/knnDim147.eps}     & \includegraphics[width=60mm]{linearScan/RANGE/3249punti_R147/DistSize147.eps}\\
(b) & \\
\includegraphics[width=60mm]{linearScan/KNN/3721punti_R63/knnDim63.eps}      & \includegraphics[width=60mm]{linearScan/RANGE/3721punti_R63/DistSize63.eps}\\
(c) & \\
\includegraphics[width=60mm]{linearScan/KNN/Color_Histogram/knnDim32.eps}      & \includegraphics[width=60mm]{linearScan/RANGE/Color_Histogram/DistSize32.eps}\\
(d) & \\
\end{tabular}
\end{center}
\end{minipage}
} \caption{Comparing the Antipole Tree and the linear scan w.r.t.\ $k$-nearest neighbor search (left side) and range search (right side) in
(a) $\Real^{267}$, (b) $\Real^{147}$, (c) $\Real^{63}$, (d) $\Real^{32}$.}\label{linearscan}
\end{center}
\end{figure}
\section{Conclusion}\label{concl}
We extended the ideas of the most successful best match retrieval
data structures, such as the M-tree, the MVP-tree, the FQ-tree,
and the List of Clusters, by introducing pivots based on the
farthest pairs (antipoles) in a data set. The resulting Antipole
Tree is a bisector tree using pivot-based clustering with bounded
diameter. Both range and $k$-nearest neighbor searches are
performed by eliminating those clusters which cannot contain the
result of the query. Antipoles are found by playing a linear time
randomized tournament among the elements of the input set.
Experiments show that such a pseudo-diameter pair gives good
splitting characteristics on both real and random data. \\
A data-dependent aspect of the algorithm is the control of cluster
proliferation through the introduction of a suitable diameter
threshold. To properly define such a threshold, we propose a
statistical analysis of the set of pairwise distances. Moreover,
to decide when to split, we need an estimate of the ratio between
the pseudo-diameter (antipole length) and the real diameter. Since
no guaranteed approximation algorithm for diameter computation in
general metric spaces can exist, we use the approximation ratio
given by a very efficient algorithm for diameter computation in
Euclidean spaces. This heuristic gives an estimate of the
dimension of the metric space. \\
We ran experiments on both synthetic and real data sets with
normal and clustered distributions. All the experiments show that
the proposed structure outperforms the most successful data
structures for best match search by a factor of 1.5--2.5.
\newcommand{\lncs}[1]{volume #1 of \textit{Lecture Notes in Computer Science}}
\newcommand{\ciac}{\textit{the 4th Italian Conference on Algorithms and Complexity (CIAC 2000)}}
\newcommand{\stoc}[1]{\textit{the #1 Annual ACM Symposium on Theory of Computing}}
\newcommand{\cpm}[1]{\textit{the #1 Combinatorial Pattern Matching}}
\newcommand{\tods}{\textit{ACM Transactions on Database Systems}}
\newcommand{\soda}[1]{\textit{the #1 Annual ACM-SIAM Symposium on Discrete Algorithms}}
\newcommand{\icde}[1]{\textit{the IEEE #1 International Conference on Data Engineering}}
\newcommand{\edbt}[1]{\textit{the #1 International Conference on Extending Database Technology}}
\newcommand{\socg}[1]{\textit{the #1 Symposium on Computational Geometry}}
\newcommand{\commACM}{\textit{Communications of the ACM}}
\newcommand{\tois}{\textit{ACM Transactions on Information Systems}}
\newcommand{\vldb}[1]{\textit{the #1 International Conference on Very Large Data Bases}}
\newcommand{\focs}[1]{\textit{the #1 Annual IEEE Symposium on Foundations of Computer Science}}
\bibliographystyle{plain}
\bibliography{XBib}
%%----------------------------------------------------------------------------
%%----------------------------------------------------------------------------
%\newpage
\appendix
\noindent \textbf{\Large APPENDIX}
\section{Diameter computation on Euclidean spaces}\label{ediam}
In this appendix we describe an approximation algorithm for
diameter computation on the Euclidean plane. Recall that the
Antipole Tree uses the approximation ratio of this algorithm to
estimate the ratio between the pseudo-diameter (antipole length)
and the real diameter, which is needed to decide when to split a
cluster. Several studies in the
literature~\cite{AMS92,BHP99,HP01,C02} have presented efficient
algorithms for approximate diameter computation in
multidimensional Euclidean spaces; our approach is the binary
search version of~\cite{AMS92}. For the sake of simplicity we
start with a finite set $S$ of points in the plane. We perform the
antipole search as follows. Let ($P_{X_m},P_{X_M}$),
($P_{Y_m},P_{Y_M}$) be the four points of $S$ having minimum and
maximum Cartesian coordinates, determining the so-called minimum
area bounding box $BBox$. Notice that these four points belong to
the convex hull of $S$, and all of $S$ is included in a rectangle
of dimensions $(P_{X_M}.x-P_{X_m}.x)$ and
$(P_{Y_M}.y-P_{Y_m}.y)$. The two endpoints $(A,B)$ of the diameter
of these four points constitute our Antipole pair. Since the
diameter of $S$ is at most the diagonal of this rectangle, while
the two extreme points along its longer side are at distance at
least equal to that side, which is at least $Diagonal/\sqrt{2}$,
the Antipole distance (the pseudo-diameter) is not less than
$Diagonal/\sqrt{2}$. This yields $\frac{Antipole}{Diameter} \geq
\frac{1}{\sqrt{2}}$, proving that the approximation error in the
plane is at most $1 - 1/\sqrt{2}$. \\
We now describe a generalization of this method that yields an
approximation algorithm achieving an arbitrarily low approximation
factor $\delta$ for the real diameter, with exponentially fast
convergence. We perform a $\pi/4$ rotation of our Cartesian
coordinates, which bisects the axes, and compute the maximum and
minimum coordinate points along the two new axes, obtaining 8
points. Let $A,B$ be the diameter of this set. It is easy to see
(middle picture in Fig.~\ref{fig:worstCasePolygons}) that
$\dist(A,B)/ \cos{\frac{\pi}{8}} \, > \, diameter(S)$. By
iterating this bisecting process $d$ times we get $\dist(A,B)/
\cos{\frac{\pi}{2^{d+2}}} \, > \, diameter(S)$ (see the pseudo
code in Fig.~\ref{EuclideanAntipole}).
\begin{figure}[tbp]
\begin{center}
\fbox{
\includegraphics[width=3in]{regularpolygons.eps}
} \caption{Worst cases in the first three iterations of the
algorithm.}
\end{center} \label{fig:worstCasePolygons}
\end{figure}
Therefore the error introduced by the algorithm is:
\[
\delta \, = \, \frac{|Diameter-Pseudo\_Diameter|}{Diameter} \,\leq \,
\left|1-\cos{\frac{\pi}{2^{(d+2)}}}\right|
\]
So we can conclude that:
\begin{thm}
Let $S$ be a set of points in the plane and let $0 \,< \, \delta \,\leq\, 1-\sqrt{2}/2$.
Iterating the algorithm for $i \,=\, 0,\ldots, \lceil \log_2 \left(
\frac{\pi}{\arccos(1- \delta)}\right) -2 \rceil$ yields an Antipole
pair $A,B$ (the Pseudo-Diameter) which approximates the diameter
with an error less than $\delta$. \eod
\end{thm}
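To make the bisecting iteration concrete, here is a small Python
sketch for the planar case (points are 2-tuples; the explicit
iteration count \texttt{d} stands in for the bound of the theorem,
and the pseudo code of Fig.~\ref{EuclideanAntipole} remains the
authoritative version):
\begin{verbatim}
import math

def extreme_indices(pts):
    # indices of the four min/max coordinate points, i.e. the
    # corners touching the axis-aligned bounding box
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return {xs.index(min(xs)), xs.index(max(xs)),
            ys.index(min(ys)), ys.index(max(ys))}

def pseudo_diameter(points, d):
    # double the set of probe directions d times by rotating
    # every existing copy of the point set by pi/2^(i+1);
    # collect bounding-box extremes, then return the farthest
    # pair among those candidates (the Antipole pair)
    copies = [points]
    candidates = set(extreme_indices(points))
    for i in range(1, d + 1):
        theta = math.pi / 2 ** (i + 1)
        c, s = math.cos(theta), math.sin(theta)
        rotated = [[(c * x - s * y, s * x + c * y)
                    for x, y in ps] for ps in copies]
        copies += rotated          # 2^d copies: keep d small
        for ps in rotated:
            candidates |= extreme_indices(ps)
    cand = [points[j] for j in candidates]
    return max(((a, b) for a in cand for b in cand),
               key=lambda ab: math.dist(*ab))
\end{verbatim}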
Experiments in $n$-dimensional Euclidean spaces show that the
antipole pair distance, for tournaments with subset size $t \,=\,
n+1$, is fully comparable with the first iteration of the above
pseudo-diameter algorithm. This suggests that one could calculate
the intrinsic dimension $n$ of a metric space $S$ and then use
tournaments of size $t \,=\, n+1$.
\begin{figure}[tbp]
\begin{center}
\fbox{ {\scriptsize
\begin{minipage}[c]{426pt}
%\mbox{}\hrulefill \vspace{-.3em}
\begin{tabbing}
$\diag$($S$) \\
\ \ \ \ \ \ \= \+\\
1 \' \= Let $BBox \,=\, \{P_{X_m}, P_{X_M}, P_{Y_m}, P_{Y_M}\}$ \\
  \' \> be the minimum bounding box of $S$;\\
2 \' \> $V \, \leftarrow \, \{\{S\}\};$ \\
3 \' \= {\bf for} \= $i \, = \, 1$ {\bf to} $\lceil\frac{\pi}{4 \times \arccos(1-\delta)} -1 \rceil$ {\bf do} \\
4 \'    \>        \> $V' \,  = \, \rot\left(V,\frac{\pi}{2^{i+1}}\right)$; \\
5 \'    \>        \> Let $BBox_{\frac{\pi}{2^{i+1}}} \,=\, \{P_{X_m}, P_{X_M}, P_{Y_m}, P_{Y_M}\}$ \\
  \'    \>        \> be the minimum bounding boxes of the rotated sets in  $V'$;\\
6 \'    \>        \> $V \, \leftarrow \,$ Set catalog of $V'$; \\
7 \'    \>        \> $BBox \, = \, BBox \, \cup \, BBox_{i};$\\ %\frac{\pi}{2^{i+1}}}$\\
8 \'    \> {\bf end for} \\
9 \'    \= {\bf return} $\ap(BBox)$; \\
10 \'    \= \textsc{end $\diag$}
\end{tabbing}
%\vspace{-.6em} \mbox{}\hrulefill
\end{minipage}}
}\caption{Pseudo-Diameter Computation.} \label{EuclideanAntipole}
\end{center}
\end{figure}
\end{document}
