To: rosskaffenberger Subject: my little confusion Dear Ross, In the example I presented, I didn't understand one query: I'm proceeding based on the schema that Alex and I presented to Ross on December 20. I repeat it here (with a few spelling corrections, e.g. LangPropVal is now LingPropVal). Ling(id, name, parentid, depth, groupid, creatorid, timestamps) -- parentid relates different level objects, e.g. the parent of -- sentence may be the speaker who speaks it. -- A ling that has no parent is at depth 0, a ling whose parent -- has no parent is at depth 1 etc. Ross doesn't think we need this, but -- it's convenient for query processing. Property(id, name, type, groupid, creatorid, timestamps) -- Type is something like linguistic/demographic/phone. -- Will be drop-down for known types of a group. Only admin can add new types. LingPropVal(lingid, propid, value, creatorid, timestamps) -- Value is just a string -- User will put in names for lingid/propid, not ids -- One constraint is that a propid can be associated only to lingids -- of a single depth (e.g. demographic properties can't be associated -- to a sentence of depth 1 and a speaker of depth 0) Example(id, lingid, groupid, creatorid, timestamps) ExampleLingPropVal(id, exampleid, LingPropValID, groupid, creatorid, timestamps) -- User chooses property-values from the LingPropVal table -- that pertain to this example. That will go into ExampleLingPropVal -- So a given lang, prop, val can be associated with many examples. -- Many examples can pertain to a lang-prop-val. Attributes(foreignid,foreign_table, name, value, groupid, creatorid, timestamps) -- User fills in attributes of examples of a given -- ling or fills in attributes of a ling. -- Ross thinks the two should be separate. Alex is thinking about this. -- I'm going to separate these into LingAttributes and ExampleAttributes LingAttributes(foreignid, name, value, groupid, creatorid, timestamps) ExampAttributes(foreignid, name, value, groupid, creatorid, timestamps) Search Page Design On the top we have "Show clickboxes" for lingids of various depths and for each property. I think that properties should have depth too so we can group the properties with the lings. Here is an example two depth design with speaker at depth 0 and sentence at depth 1. sentence |_| property |_| value |_| example |_| speaker |_| property |_| value |_| The selector boxes are projections from LingPropVal joined with the Ling and Property table so that strings are presented with a selection on group. Considerations on Queries: For each depth/type combination, we have a separate box. Within a box, we always have an "any" (disjunction) option and may have an "all" (conjunction) option. The semantics of a selection box work like this: if it is disjunction ("any") then we just identify the rows in ling, propid, val that satisfy any of the selections in the box. A keyword search "abc" is assumed to be a wildcard search *abc* though the user may select "Match case" "word-initial" "whole word" The query goes in several stages: i) fold keyword searches into selector boxes ii) take care of the "any" and keyword search on properties and lings. iiia) take care of examples iiib) take care of all iv) Advanced functions. We never do both iiia and iiib The user will get a "This combination is disallowed" message. The semantics are too unclear (do we take the result of all and then select the examples corresponding to those?). The semantics are too murky. Detailed Flow of a Query: Check for invalid combinations, e.g. all on some selector box and Example clicked. That is invalid. A) Fold any keyword searches on lingpropval into their appropriate selector box, e.g. a keyword search on word initial "G" for ling name will be a selection on Greek and German among others (if lings are languages). This will be intersected with any other selections in that selector box to give a net set of selections. B) Perform disjunctions ("any") from selection boxes. (If there are conjunctions ("all") treat these as disjunctions ("any") for the purposes of this step. This saves work for later.) C) Each such disjunction gives a set of LingPropVal rows. They may be at different depths, e.g. sentence-property-value, speaker-property-value. So we need to tie these together. C1) First intersect rows having lingids at the same depth. This gives a set of LingPropVal rows at each depth. C2) Tie together the rows at different depths based on Ling.parentid. This can give a join result having more than three columns. In the result, a row concerning a sentence survives only if the speaker of that sentence survives the demographic query and a row concerning a speaker survives only if a sentence survies the linguistic query. D1) If examples are clicked, then join with the results of the LingPropVal queries based on LingPropValID and filter further based on keyword searches on example attributes. D2) If all is clicked for at least one selection box, then perform a group by on the fields of the ShowClickboxes. Call those fields the ShowFields. This group by is unusual in that there are no aggregates. Instead, for each distinct value of the ShowFields (i.e. each distinct combination of values), we associate a vector of values for each field F on which there is an all. So, for each such distinct value combination c, we have a vector v of F values. We check whether the selected F values in the selector box is a subset of v. If so, then c passes into the solution. Otherwise, it doesn't. E) Link up with attributes table based on lingid and possibly exampleid as appropriate. Order by language-property. F) Advanced Features are mutually exclusive and can all be handled as a postprocessing step. 1. Cross between P1 and P2: Just take the result and enumerate all possible values for P1 and P2. For each combination of P1.value and P2.value, take the maximum depth lings. The list of lings associated with each P1-v1, P2-v2 pair is the result of the cross. Find the latlongs for those to send to the mapper. 2. To compare languages L1 and L2, just take the results and produce a list of common property-values as well as the property values that one has but the other doesn't. 3. To create a similarity tree, take result and convert it to hierarchical clustering (works for yes/no/NA using hamming distance as currently implemented, but could also work using a Jaccard measure (not yet implemented)). 4. The only tricky one is implications. If we choose "both" then we can handle as a pure postprocessing step using whichever LingPropVal result we get from C2. Otherwise, we have to do a second query. The basic algorithm is that we have one set of ling-prop-vals L1 and another set of ling-prop-vals L2. (They are the same if the user selects the both option, but not otherwise. In particular, if the user selects Antecedent, then the result of step C2 is L1 and L2 is the entire LingPropVal table. If the user selects Consequent, then the result of step C2 is L2 and L1 is the entire LingPropVal table.) An implication P1.v1 ==> P2.v2 holds if whenever P1.v1 is associated with a ling (of any depth) and P2 is also associated with the ling, then P2 is associated with exactly one value v. We are most interested in such implications that happen very often and those that happen never. We also want to retain the Here is the basic query form: for each ling x in L2, if property P2 has exactly one value v2, then for every other property-value pair P1.v1 associated with ling in L1, add x to the list associated with P1.v1, P2.v2. If there is ever a "contradiction", i.e. a language where P1.v1, P2.v2' holds and v2 != v2', then just mark the lists of both P1.v1, P2.v2 and P1.v1, P2.v2' as invalid and stop appending to those lists. At the end, sort by length of list of valid pairs and present just the number of languages for which each implications holds. If that number is clicked, the languages should appear. 5. Another advanced feature that we don't have is to save the results of queries and to do AND/OR/DIFFERENCE on them if they have the same columns. This is straightforward. Examples: Speaker at depth 0 and sentence at depth 1 sentence |_| sentenceproperty |_| sentencevalue |_| example |_| speaker |_| speakerproperty |_| speakervalue |_| sentence selector box sentence1_speaker1 sentence2_speaker1 ... sentence1_speaker2 sentenceproperty P1 P2 P3 sentencepropertyvalue P1.v1a P1.v1b P1.v1c P2.v2 ... speaker selector box speaker1 speaker2 ... speakerproperty D1 D2 D3 ... speakerpropertyvalue D1.q1a D1.q1b ... keyword search search on sentence, sentenceproperty, speaker, speakerproperty Can do several keyword searches on different ones of these. They can be MatchCase, fullword, wordinitial, arbitrary infix Default is arbitrary infix without matching case. Example 1: Suppose that we have the following query. ShowClickbox: everything is clicked except example keyword search on sentence property: "verb" keyword search on speaker property: "country" selection on speakers is any: speaker1, speaker2, speaker3 selection on sentences is any: sentence1_speaker1, sentence2_speaker1, sentence5_speaker4 Processing: A) for the keyword searches on verb and country, translate those into clicks on the selctor boxes for sentence properties and speaker properties B) Each disjunction on the sentence, sentenceproperty, speaker, and speakerproperty gives a set of LingPropVal rows. C1) Intersect those rows having to do with sentence and sentenceproperty. That is the depth1result. Intersect those having to do with speaker and speakerproperty. That is the depth0result. C2) insert res from select depth1result.ling as sentence, depth1result.property as sentenceproperty, depth1result.value as sentencevalue, depth0result.ling as speaker, depth0result.property as speakerproperty, depth0result.value as speakervalue from depth0result, depth1result where depth1result.parentid = depth0result.lingid order by sentence, sentenceproperty insert res2 select res.lingid, LingAttribute.name, LingAttribute.value from res, LingAttribute res.lingid = lingattribute.foreignkey This will give the attributes of each lingid. Attributes are properties that are presented in query results but cannot be queried on. Example 2: Same as example 1 except that example is also clicked and there is an additional keyword search on example gloss: "the" whole word All processing is the same up to C2 where the query becomes a bit more complex: insert res from select depth1result.ling as sentence, depth1result.property as sentenceproperty, depth1result.value as sentencevalue, depth0result.ling as speaker, depth0result.property as speakerproperty, depth0result.value as speakervalue from depth0result, depth1result where depth1result.parentid = depth0result.lingid order by sentence, sentenceproperty, sentencevalue insert exampres select distinct res.sentence, res.sentenceproperty, res.value, ExampleLingPropVal.id, ExampAttributes.name, ExampAttributes.value from res, ExampleLingPropVal, ExampAttributes, LingPropVal where LingPropVal.property = res.sentenceproperty and LingPropVal.value = res.sentencevalue and LingPropVal.lingid = res.sentence and LingPropVal.id = ExampleLingPropVal.LingPropValID and ExampAttributes.foreignkey = ExampleLingPropVal.exampleid filter to see that exampres.value matches the keyword constraint "the" for gloss Example 3: (all query) Suppose that we have the following query. ShowClickbox: speaker, speakerproperty, speakervalue selection on speakers is any: speaker1, speaker2, speaker3, speaker4 selection on sentencepropertyvalue is all: P1.v1a, P2.v2b, P3.v3b A) no keyword search, so nothing to do B) disjunctive query on speakers and on sentencepropertyvalue The query on speakers is depth0result. The query on sentencpropertyvalues gives depth1result. Similar to C2 query as above but projecting onto fewer attributes and using a group by. insert res from select depth0result.ling as speaker, depth0result.property as speakerproperty, depth0result.value as speakervalue, depth1result.property-depth1result.value pairs as vector from depth0result, depth1result where depth1result.parentid = depth0result.lingid group by depth0result.ling, depth0result.property, depth0result.value having vector contains P1.v1a, P2.v2b, P3.v3b order by speaker, speakerproperty, speakervalue Of course such a having clause doesn't work, but that is the semantics we would want. Example 4: (two all queries) Suppose that we have the following query. ShowClickbox: speaker, speakerproperty, speakervalue selection on speakers is any: speaker1, speaker2, speaker3, speaker4 selection on sentencepropertyvalue is all: P1.v1a, P2.v2b, P3.v3b selection on speakerpropertyvalue is all D1.q1a, D2.q2a A) no keyword search, so nothing to do B) disjunctive query on speakers and on sentencepropertyvalue and speakerpropertyvalue The query on speakers and speakerpropertyvalue yields depth0result. The query on sentencpropertyvalues gives depth1result. Similar to query as above but projecting onto fewer attributes and using a group by. insert res from select depth0result.ling as speaker, depth0result.property as speakerproperty, depth0result.value as speakervalue, depth1result.property-depth1result.value pairs as vector v1 depth0result.property-depth0result.value pairs as vector v2 from depth0result, depth1result where depth1result.parentid = depth0result.lingid group by depth0result.ling, depth0result.property, depth0result.value having v1 contains P1.v1a, P2.v2b, P3.v3b and v2 contains D1.q1a, D2.q2a order by speaker, speakerproperty, speakervalue Of course such a having clause doesn't work, but that is the semantics we would want.