The idea here is to show the mapping from a lock-based implementation to a compareandset implementation for various frequently needed operations which I call Basic Moves. Basic move I: Inserting a newly allocated node after a given node n on a list. Rules on any sequential or concurrent algorithm: (i) If the list is ordered, then the algorithms should never be possible to update a key value in the list. Deleting is ok, updating the value associated with a key is ok, but changing the key value is not. (ii) the node n should not be logically deleted. This is reflected in a valid bit. addnewnode(n): lock(n) if n.valid == 1 create a new node m // this is private to the process so need not be locked put information in m // for example, a key-value pair m.next := n.next n.next := m unlock(n) else unlock(n) spliceout(n) // see next basic move end if --> In the lockfree version, each node n has a nodestruct consisting of the valid bit and a next pointer. These must be held together in order to be an argument for compareandset. addnewnode(n): keeplooping:= True while keeplooping tmp:= nodestruct(n) if tmp.valid == 1 create a new node m put information in m nodestruct(m).next:= tmp.next flag:= compareandset(if tmp == nodestruct(n).next then nodestruct(n).next:= m) if flag == True then return 'success' else // n is not valid so must be spliced out spliceout(n) end if end while ==== Basic move II: splicing out n from a list, given that n is already logically deleted so cannot be modified in any way. find the predecessor of n, call it pred spliceout(n) lock(pred) if pred.next == n and pred.valid == 1 then pred.next := n.next unlock(pred) else if pred.valid == 0 unlock(pred) spliceout(pred) end if --> Assume that every node has a nodestruct with two fields: valid and next. spliceout(n) find the predecessor of n, call it pred tmp:= pred if tmp.next == n and tmp.valid == 1 then compareandset(if tmp == pred then pred.next := n.next) else if tmp.valid == 0 spliceout(pred) spliceout(n) end if Note that if the compareandset fails, someone else must have spliced n out. ==== Basic move III: performing an insert, delete, or update of a key-value pair in a node n that holds many values. add(n,k,v): lock(n) modify n unlock(n) --> When n is created, allocate an array. Include an index to that array initialized to 0 and with largest index value maxindex. The argument to compareandset will be a 64 bit integer nodestruct that has several fields: valid bit when 0, indicates the node is logically deleted upsertallowed when 1, indicates that upserts are allowed index to upsert array pointer to next node add(n,k,v): keeplooping:= True while(keeplooping): tmp:= nodestruct(n) if (tmp.upsertsallowed == 0) or (tmp.index == maxindex) then split(n) // now start over from the root and find the node where to add (k,v) return 'done' end if tmp2:= tmp tmp2.index := tmp.index + 1 flag:= compareandset(if tmp == nodestruct(n) then nodestruct(n) := tmp2) if flag == True then put (k,v) into location i of the array keeplooping:= False end if end while Note: Nodes internally will be multicopy. Note: upsertallowed set to 0 ensures that upserts can take place only when a split is not taking place, but any thread can perform the split, so if one thread dies, there is no livelock. ==== Basic move IV: split a node having an array of upserts (this would be basically a half-split in a B-tree context) halfsplit(n): lock(n) allocate m move some values from n to m m.next := n.next n.text := m unlock(n) ---> halfsplit(n) find the predecessor of node n, call it pred tmppred:= nodestruct(pred) tmpn := nodestruct(n) if tmppred.valid == 1 and tmppred.upsertsallowed == 1 and tmpn.valid == 1 // For pred, don't want to change the pointer of a logically deleted node // and want permission to change the next pointer. // For n, want it to be valid otherwise the split has already occurred. nodestruct(n).upsertsallowed := 0 // while we're doing this there should // be no upserts to n. allocate two new nodes n1 and n2 ensure that every location of the array in n up to the index value is occupied (otherwise there might be some upsert that is still in progress) (might have to kill processes) read the array of n up to its index value and compress the upserts (take only the last value associated with each key k and ignore those keys whose last value is a delete tombtone). copy the lower half of the keys of n and their values from n to n1 and set nodestruct(n1).index appropriately copy the higher half of the keys of n and their values from n to n2 and set nodestruct(n2).index appropriately nodestruct(n1).next:= n2 nodestruct(n2).next:= n.next nodestruct(n1).valid:= 1 nodestruct(n1).upsertsallowed:= 1 nodestruct(n2).valid:= 1 nodestruct(n2).upsertsallowed:= 1 tmppred2:= tmppred tmppred2.next:= n1 compareandset(if tmppred == nodestruct(pred) then nodestruct(pred):= tmppred2) // whether or not the compareandset succeeds, pred now points to a // replacement for n. We have no more use for n. nodestruct(n).next:= n1 // before this occurs a search might access n // and then miss any upserts to n1 and n2. This is like search recency. nodestruct(n).valid:= 0 end if Note 1: When nodestruct(n).upsertallowed == 0, searches for any key can proceed without change (need no locks); upserts however may not proceed for any key from the smallest key in the array to the largest. Note 2: Once valid == 0, then the node is logically deleted. Note 3: Because the same key may appear multiple times within a node, it may be that a single node may be enough to replace n in which case an optimization would be to allocate only a single node to replace n. === Basic Move V: half-split and complete-split in B-tree setting Add a field to each node called keyrange. keyrange -- a pair (a,b) such that any key k in the node must lie between a and b. Specifically: a <= k < b. The half-split is like the split of the previous basic move except that the keyrange must be split between the two successor nodes. So, n1 might have [a,b1) and n2 might have [b1,b) The full split goes to the parent and adds in a separator at (a,n1,b1) and then another one (b1,n2,b). Then the separator for (a,n,b) can be upserted to a tombstone. ==== Note on the half-split: ensure that every location of the array up to the index value is occupied (otherwise there might be some upsert that is still in progress) Actually can kill dead processes and if no process is present since the time the half split started, then that empty location is not a problem. ==== BW tree https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/bw-tree-icde2013-final.pdf The nodes at top level are organized as a B+-tree with all key-value pairs in the leaves. The contents of each node of the BW-tree is spread all over memory (so this structure can NOT be used in any environment where memory locality is helpful (locality is helpful almost in all technologies) and where the overhead of following pointers is low compared to scanning an array). The location of the pointer to each node is in a certain location in a directory array. To handle an upsert on a leaf node whose array location (PID index in their parlance) is i, place the upsert in some random memory location (which could be some shared page location where all recent upserts regardless of the node are being placed), then do a compareandset with the pointer at location i. To compress a node corresponding to i, simply create the compressed node in new memory and do a compareandset with the pointer at array location at i. Split of node P works as follows: allocate a new node Q Find an appropriate separator key K_P from P that provides a balanced split. Move the records fro P with keys greate than K_P. Also set Q.next := P.next (While all this is going on, there could be upserts to P but if so then the CAS below will fail.) Now CAS the pointer to the contents of P with a "split delta record". This contains the separator K_P and a pointer to Q. The parent update inserts a "index term delta record" containing separator information for both Q and P and a pointer to R as well as a separator for Q. This might seem to lead to an error if P is split again (creating a new node Q'), but in that case, a search may incorrectly go to P instead of Q' but then will follow the side pointer to Q'. Summary: Works only when locality and pointer-following does not hurt performance (approximately, never). Does in-place splits but with large likelihood of doing most of the work but the compareandset could fail.