The Forum for Discussion about The Third Manifesto and Related Matters


Java JOIN


Partly inspired by dandl's recent TTM-in-C# efforts, and partly by my own work-in-progress "Java for Data" and datasheet efforts, I've created two variants on a JOIN operator in Java. One is a useful generic JOIN, the other is a hard-wired JOIN for comparison purposes.

The hard-wired JOIN only accepts the canonical S and P example relations (implemented using Java's generic HashSet) and joins them on their common city attributes.

The generic JOIN accepts any two collection instances and joins items in the collections based on the result of evaluating a user-defined expression. The expression must return true for every pair of items that should be included in the JOIN result, and return false for every pair of items that should not be included in the JOIN result.

Both JOIN operators employ an unusual way of representing their results. Rather than emitting a collection or Stream of a new tuple type whose attributes are the union of the attributes in both operands, they emit a Stream of PairS, where each Pair consists of a left item and a right item. The left item is from the left operand; the right item is the matching item from the right operand.

Thus, given S emitted as:

STuple[sNum=S1, sName=Smith, status=20, city=London]
STuple[sNum=S5, sName=Adams, status=30, city=Athens]
STuple[sNum=S3, sName=Blake, status=30, city=Paris]
STuple[sNum=S4, sName=Clark, status=20, city=London]
STuple[sNum=S2, sName=Jones, status=10, city=Paris]

And given P emitted as:

PTuple[pNum=P5, pName=Cam, color=Blue, weight=2.0, city=Paris]
PTuple[pNum=P2, pName=Bolt, color=Green, weight=7.0, city=Paris]
PTuple[pNum=P3, pName=Screw, color=Blue, weight=7.0, city=Oslo]
PTuple[pNum=P4, pName=Screw, color=Red, weight=4.0, city=London]
PTuple[pNum=P6, pName=Cog, color=Red, weight=9.0, city=London]
PTuple[pNum=P1, pName=Nut, color=Red, weight=12.0, city=London]

The result of S JOIN P is:

Pair[left=STuple[sNum=S1, sName=Smith, status=20, city=London], right=PTuple[pNum=P4, pName=Screw, color=Red, weight=4.0, city=London]]
Pair[left=STuple[sNum=S1, sName=Smith, status=20, city=London], right=PTuple[pNum=P6, pName=Cog, color=Red, weight=9.0, city=London]]
Pair[left=STuple[sNum=S1, sName=Smith, status=20, city=London], right=PTuple[pNum=P1, pName=Nut, color=Red, weight=12.0, city=London]]
Pair[left=STuple[sNum=S3, sName=Blake, status=30, city=Paris], right=PTuple[pNum=P5, pName=Cam, color=Blue, weight=2.0, city=Paris]]
Pair[left=STuple[sNum=S3, sName=Blake, status=30, city=Paris], right=PTuple[pNum=P2, pName=Bolt, color=Green, weight=7.0, city=Paris]]
Pair[left=STuple[sNum=S4, sName=Clark, status=20, city=London], right=PTuple[pNum=P4, pName=Screw, color=Red, weight=4.0, city=London]]
Pair[left=STuple[sNum=S4, sName=Clark, status=20, city=London], right=PTuple[pNum=P6, pName=Cog, color=Red, weight=9.0, city=London]]
Pair[left=STuple[sNum=S4, sName=Clark, status=20, city=London], right=PTuple[pNum=P1, pName=Nut, color=Red, weight=12.0, city=London]]
Pair[left=STuple[sNum=S2, sName=Jones, status=10, city=Paris], right=PTuple[pNum=P5, pName=Cam, color=Blue, weight=2.0, city=Paris]]
Pair[left=STuple[sNum=S2, sName=Jones, status=10, city=Paris], right=PTuple[pNum=P2, pName=Bolt, color=Green, weight=7.0, city=Paris]]

This approach requires no new type to be defined to represent the JOIN result and integrates nicely with existing Java Streams capability.

Although the JOIN shown here plus Streams is not an implementation of the relational model per se, it provides equivalent facility.

Notably, it requires no reflection, is fully statically type-safe, and the performance of the generic JOIN is very close to that of the hard-wired equivalent.

The complete code (for Java 14, as the new Java record mechanism is used), including the JOIN method definitions, example data, sample data generators, and performance comparisons:

package org.reldb.experiments;

import java.math.BigDecimal;
import java.util.Collection;
import java.util.HashSet;
import java.util.stream.IntStream;
import java.util.stream.LongStream;
import java.util.stream.Stream;

public class JoinDemo {

  /**
   *  The following methods are used to create random data for performance tests. 
   */
  
  private static String getRandomItem(String[] items) {
    return items[(int)(Math.random() * (double)(items.length))];		
  }
  
  private static String getRandomColor() {
    final String[] items = {"Red", "Green", "Blue"};
    return getRandomItem(items);
  }

  private static String getRandomPart() {
    final String[] items = {"Nut", "Bolt", "Screw", "Cam", "Cog"};
    return getRandomItem(items);
  }

  private static String getRandomCity() {
    final String[] items = {"London", "Paris", "Oslo", "Athens"};
    return getRandomItem(items);
  }

  /** A generic Pair of references. */
  static record Pair<L, R>(L left, R right) {};

  /**
   * Specify the matching expression for generic JOIN.
   *
   * @param <L> - the left operand's type
   * @param <R> - the right operand's type
   */
  public static interface Matcher<L, R> {
    public boolean isMatchingIn(L l, R r);
  }
  
  /**
   * Perform a generic JOIN of a collection l to a collection r using a lambda evaluating to 'true' for every tuple match.
   * 
   * The algorithm here is sub-optimal.
   * 
   * @param <L> - the left collection's item type
   * @param <R> - the right collection's item type
   * @param l - the left collection
   * @param r - the right collection
   * @param matcher - a Matcher (typically a lambda expression) that
   *       returns true for each pair of left and right items that should be included in the result.
   * @return a Stream of joined L-to-R PairS.
   */
  public static <L, R> Stream<Pair<L, R>> join(Collection<L> l, Collection<R> r, Matcher<L, R> matcher) {
    return l
        .stream()
        .flatMap(lTuple -> r
            .stream()
            .filter(rTuple -> matcher.isMatchingIn(lTuple, rTuple))
            .map(rTuple -> new Pair<>(lTuple, rTuple)));		
  }
  
  /** Define Parts heading. */
  static record PTuple(String pNum, String pName, String color, BigDecimal weight, String city) {};
  
  /** Define Suppliers heading. */
  static record STuple(String sNum, String sName, int status, String city) {};
  
  /**
   * Perform a hard-wired join of collection of STupleS to a collection of PTupleS by the city attribute of both. 
   *  
   * The algorithm here is sub-optimal.
   * 
   * @param s - Collection of STupleS.
   * @param p - Collection of PTupleS.
   * @return a Stream of joined STuple-to-PTuple PairS.
   */
  public static Stream<Pair<STuple, PTuple>> joinSCityPCity(Collection<STuple> s, Collection<PTuple> p) {
    return s
        .stream()
        .flatMap(sTuple -> p
            .stream()
            .filter(pTuple -> pTuple.city.equals(sTuple.city))
            .map(pTuple -> new Pair<>(sTuple, pTuple)));
  }
  
  /**
   * A stopwatch to time a specified operation.
   * 
   * @param code - the operation to time
   * @return - the duration in milliseconds
   */
  private static long time(Runnable code) {
    long startTime = System.nanoTime();
    code.run();
    long endTime = System.nanoTime();
    return (endTime - startTime) / 1000000;		
  }

  /** 
   * Run a specified operation a given number of times and obtain the average duration.
   * 
   * @param trials - number of trials to run
   * @param code - the operation to time
   * @param prompt - text prompt associated with each trial
   * @return - average duration in milliseconds
   */
  private static double averageTime(int trials, String prompt, Runnable code) {
    long totalTime = LongStream.range(1, trials + 1)
      .map(trial -> {
        System.out.print("Trial " + trial + ":\t" + prompt);
        var duration = time(code);
        System.out.println(" took " + duration + " milliseconds.");				
        return duration;
      })
      .sum();
    return (double)totalTime / (double)trials;
  }

  public static void main(String[] args) {
    // Create canonical S relation
    var s = new HashSet<STuple>();
    s.add(new STuple("S1", "Smith", 20, "London"));
    s.add(new STuple("S2", "Jones", 10, "Paris"));
    s.add(new STuple("S3", "Blake", 30, "Paris"));
    s.add(new STuple("S4", "Clark", 20, "London"));
    s.add(new STuple("S5", "Adams", 30, "Athens"));
    
    // Create canonical P relation
    var p = new HashSet<PTuple>();
    p.add(new PTuple("P1", "Nut", "Red", new BigDecimal("12.0"), "London"));
    p.add(new PTuple("P2", "Bolt", "Green", new BigDecimal("7.0"), "Paris"));
    p.add(new PTuple("P3", "Screw", "Blue", new BigDecimal("7.0"), "Oslo"));
    p.add(new PTuple("P4", "Screw", "Red", new BigDecimal("4.0"), "London"));
    p.add(new PTuple("P5", "Cam", "Blue", new BigDecimal("2.0"), "Paris"));
    p.add(new PTuple("P6", "Cog", "Red", new BigDecimal("9.0"), "London"));
    
    // Show S.
    System.out.println();
    System.out.println("====== s ======");
    s.forEach(System.out::println);
    
    // Show P.
    System.out.println();
    System.out.println("====== p ======");
    p.forEach(System.out::println);
    
    // Show result of hard-wired S JOIN P.
    System.out.println();
    System.out.println("====== s JOIN p (hard-wired join) ======");
    joinSCityPCity(s, p).forEach(System.out::println);
    
    // Show result of generic S JOIN P.
    System.out.println();
    System.out.println("====== s JOIN p (generic join) ======");
    join(s, p, (sTuple, pTuple) -> sTuple.city.equals(pTuple.city)).forEach(System.out::println);
        
    // Compare time performance of hard-wired JOIN vs generic JOIN.
    System.out.println();
    System.out.println("====== Hard-wired vs Generic JOIN ======");
    var speedTestS = new HashSet<STuple>();
    IntStream.range(0, 10000)
      .forEach(i -> speedTestS.add(new STuple("S" + i, "Name" + i, (int)(Math.random() * 30.0), getRandomCity())));
    var speedTestP = new HashSet<PTuple>();
    IntStream.range(0, 20000)
      .forEach(i -> speedTestP.add(new PTuple("P" + i, getRandomPart(), getRandomColor(), new BigDecimal(Math.random() * 20.0), getRandomCity())));
    final int trials = 10;
    var averageHardwiredTime = averageTime(trials, "hard-wired speedTestS JOIN speedTestP",
        () -> joinSCityPCity(speedTestS, speedTestP).forEach(result -> {}));
    var averageGenericTime = averageTime(trials, "generic speedTestS JOIN speedTestP",
        () -> join(speedTestS, speedTestP, (sTuple, pTuple) -> sTuple.city.equals(pTuple.city)).forEach(result -> {}));
    System.out.println();
    System.out.println("Average hard-wired join time is " + averageHardwiredTime + " milliseconds.");
    System.out.println("Average    generic join time is " + averageGenericTime + " milliseconds.");
  }
}

 

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org
Quote from Dave Voorhis on May 15, 2020, 8:47 pm

 

Both JOIN operators employ an unusual way of representing their results. Rather than emitting a collection or Stream of a new tuple type whose attributes are the union of the attributes in both operands, they emit a Stream of PairS, where each Pair consists of a left item and a right item. The left item is from the left operand; the right item is the matching item from the right operand.

I wouldn't describe that as 'unusual'. An (ordered) pair is the usual mathematical result from a Cartesian Product. So what you have is a Cartesian Pair subject to an equi-restriction. You go on to say it needs no reflection. So how is the query determining there's an attribute name in common? Its name and the attribute type? What if same-named attributes turn out to be different types/incomparable?

This approach requires no new type to be defined to represent the JOIN result and integrates nicely with existing Java Streams capability.

Although the JOIN shown here plus Streams is not an implementation of the relational model per se, it provides equivalent facility.

Hmm? Now JOIN SP on to your result. You're presumably going to get a collection of Pairs of a tuple with a Pair of tuples PairS(SP, PairS(S, P)). So the right operand is not a 'flat' relation but a collection of PairS. How does the join restriction take the S# from the first of the Pair and test against the S# from the first of the second of the Pair? And contrariwise for P#?

Now apply that for a typical normalised schema design: SP JOIN S JOIN P, join to reference data for STATUS name, CityID, ColourID, Units-of-measure name, warehouse picking location, ... There's a very good reason JOIN doesn't produce pairs/does produce 'flattened' relations.

Notably, it requires no reflection, is fully statically type-safe,

D'uh well yes, because the type of the result is just a pairing of the types of its arguments. I'm not seeing you've achieved anything that wasn't blimmin' obvious already. Inferring the Heading of the result from relational operations is the main/only hard part for static type-safe inference in GP languages (of enough other sophistication).

and the performance of the generic JOIN is very close to that of the hard-wired equivalent.

The complete code (for Java 14, as the new Java record mechanism is used), including the JOIN method definitions, example data, sample data generators, and performance comparisons:

Good to see you've used the new Java records: how ergonomic was it to code?

P.S. I apologise if the above is rather dismissive. People in Haskell are forever producing pairs like this as if they've 'solved' SQL-alike SELECT/JOINs. They typically map a SELECT-alike over the collection of pairs to produce a 'flat' relation -- as you could do here presumably. And it's type-safe because the SELECT name-list specifies the Heading of the result. But they need to specify each attribute type of the result, and the compiler can statically check that it is the same type as in the pair. That's type-checking, not type inference (automatically deriving the attribute type without needing to spell it out). So I'm disagreeing this is "equivalent functionality" to SQL SELECT, let alone to NATURAL JOIN.

Quote from Dave Voorhis on May 15, 2020, 8:47 pm

Partly inspired by dandl's recent TTM-in-C# efforts, and partly by my own work-in-progress "Java for Data" and datasheet efforts, I've created two variants on a JOIN operator in Java. One is a useful generic JOIN, the other is a hard-wired JOIN for comparison purposes.

The hard-wired JOIN only accepts the canonical S and P example relations (implemented using Java's generic HashSet) and joins them on their common city attributes.

The generic JOIN accepts any two collection instances and joins items in the collections based on the result of evaluating a user-defined expression. The expression must return true for every pair of items that should be included in the JOIN result, and return false for every pair of items that should not be included in the JOIN result.

Both JOIN operators employ an unusual way of representing their results. Rather than emitting a collection or Stream of a new tuple type whose attributes are the union of the attributes in both operands, they emit a Stream of PairS, where each Pair consists of a left item and a right item. The left item is from the left operand; the right item is the matching item from the right operand.

That's a straightforward theta join: Cartesian Product followed by a compare operation, as defined by Codd. Now you rely on the caller to do the final step of projecting the pair of tuples onto a single tuple, to remove excess attributes and duplicates.
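
For illustration, that caller-side projection step might look something like this in Dave's terms (a sketch only: SPCityTuple is a made-up flattened record; Pair, STuple, PTuple and joinSCityPCity are from the code above):

// Hypothetical flattened result type -- not part of Dave's code.
record SPCityTuple(String sNum, String sName, String pNum, String pName, String city) {}

var flattened = joinSCityPCity(s, p)
    .map(pair -> new SPCityTuple(
        pair.left().sNum(), pair.left().sName(),
        pair.right().pNum(), pair.right().pName(),
        pair.left().city()))       // city is the common attribute, so either side will do
    .distinct()                    // records supply equals/hashCode, so duplicates collapse
    .collect(java.util.stream.Collectors.toSet());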

This will behave really badly in cases where the final cardinality is low. Say you have 1 thousand tuples on each side: typical joins will return anywhere from a few thousand tuples down to just a few or none, but you have to generate 1 million pairs to get there.

It will also perform poorly if the end result is a semijoin or antijoin, but you can fix that more easily.

This approach requires no new type to be defined to represent the JOIN result and integrates nicely with existing Java Streams capability.

Although the JOIN shown here plus Streams is not an implementation of the relational model per se, it provides equivalent facility.

The theta join is OK; it's the lack of the final project that is non-RA.

Notably, it requires no reflection, is fully statically type-safe, and the performance of the generic JOIN is very close to that of the hard-wired equivalent.

I'm sure I could come up with sample data that would destroy that observation. Try 1000 suppliers and 1000 parts, with exactly one of each with CITY="Sydney". Rows processed: 1,000,000. Rows in final result: 1.
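
Something along these lines would demonstrate it, reusing the records and generic join from Dave's code (a sketch; the city names other than "Sydney" are made up):

var worstS = new HashSet<STuple>();
var worstP = new HashSet<PTuple>();
// 999 suppliers and 999 parts in cities that never match...
IntStream.range(0, 999).forEach(i -> {
  worstS.add(new STuple("S" + i, "Name" + i, 10, "SCity" + i));
  worstP.add(new PTuple("P" + i, "Part" + i, "Red", new BigDecimal("1.0"), "PCity" + i));
});
// ...plus exactly one of each in Sydney.
worstS.add(new STuple("S999", "SydneySupplier", 10, "Sydney"));
worstP.add(new PTuple("P999", "SydneyPart", "Red", new BigDecimal("1.0"), "Sydney"));

// Roughly 1,000,000 matcher evaluations to produce a single Pair.
var matches = join(worstS, worstP, (st, pt) -> st.city().equals(pt.city())).count();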

 

 

 

Andl - A New Database Language - andl.org

Having implemented generic relational join myself, unless I'm missing something, your implementation is missing a simple but important key detail if you want it to be performant.

You need to explicitly index on the attributes being joined, and not the tuples as a whole, except where those are one and the same.

How I implemented this, in principle, is that rather than using a HashSet I used a Dictionary, where the values are the whole tuples and the keys are a projection on the attributes being joined.

When you do this, the common equal-join is just order of N rather than order of N-squared, because it can iterate over the smaller-cardinality argument and for each one do an order-of-one lookup in the index for the other argument (building the index is also order of N). So you could join, say, a thousand to a thousand or whatever in reasonable time, barring when you are explicitly doing a Cartesian product.

The above assumes a hash index.  I never bothered with the complexity of a tree or something that would support doing certain other kinds of theta join quickly.
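
In Dave's Java terms, a rough sketch of that approach might look like this (my reading of it only, reusing his Pair, STuple and PTuple; it also needs java.util.Map, java.util.List and java.util.stream.Collectors):

public static Stream<Pair<STuple, PTuple>> indexedJoinOnCity(Collection<STuple> s, Collection<PTuple> p) {
  // Build the index once: city -> every PTuple with that city. Order of N.
  Map<String, List<PTuple>> index = p.stream()
      .collect(Collectors.groupingBy(PTuple::city));
  // Probe the index for each STuple: order of N probes, each order of 1 on average.
  return s.stream()
      .flatMap(st -> index.getOrDefault(st.city(), List.of()).stream()
          .map(pt -> new Pair<>(st, pt)));
}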

Quote from Darren Duncan on May 16, 2020, 4:04 am

Having implemented generic relational join myself, unless I'm missing something, your implementation is missing a simple but important key detail if you want it to be performant.

You need to explicitly index on the attributes being joined, and not the tuples as a whole, except where those are one and the same.

There's a guy on StackOverflow who never passes over the opportunity -- no matter how remotely connected to the question -- to warn against (SQL NATURAL) JOIN on the grounds inter alia that it should only be allowed where bridging a Foreign Key reference/constraint.

One of his complaints is that (NATURAL) JOIN on anything other than a key has horrible performance. I suppose I could counter that explicit FROM S INNER JOIN P ON S.CITY = P.CITY has just as bad performance; but he'd only raise another load of objections about 'accidental' capture of same-named attributes that the query-writer didn't intend.

For a practical D, I notice Tutorial D:

  • as well as JOIN has TIMES and INTERSECT to support the query-writer explicitly telling the compiler about attributes in common.
  • as well as UNION, MINUS has D_UNION, I_MINUS to support the query-writer explicitly telling the compiler about tuples (not) in common.

Then could there be a

  • SP JOIN_ON {S#} S to support the query-writer explicitly telling the compiler about attributes expected to be in common?
    And to help the compiler figure out access paths/indexing. Even infer the heading of the result?
  • Or even SP FK_JOIN_ON {S#} S, to explicitly tell the compiler to expect the attributes in common to be a key to one of the operands?
  • Or SP FK_JOIN S, to explicitly tell the compiler the attributes in common are a key to one of the operands?

Caveat: this is going to be disergonomic for programming, because it's not an associative operator:

  • (SP FK_JOIN_ON {S#} S) FK_JOIN_ON {P#} (P RENAME {PCITY := CITY})
  • /≡ SP FK_JOIN_ON {S#} (S FK_JOIN_ON {P#} (P RENAME {PCITY := CITY})) -- that's not valid, try:
  • SP FK_JOIN_ON {S#, P#} (S FK_JOIN_ON {} (P RENAME {PCITY := CITY}))  -- not valid either: no foreign key bridging S, P
  • FK_JOIN is going to fail similarly.

So there's no hope for an n-adic FK_JOIN_ON.

Really this needs an implementation of RM VSS 6 generic relational operators.

Anthony, I should explain further that my hash-based technique for getting reasonable performance in a technically simple manner was not actually specific to Codd's join operator, which implicitly joins on the common subset of attributes (and which was implemented for the N-ary case by just repeated binary applications). The technique was a useful generic indexing technique used for a variety of things: simple element lookup (order of 1), eliminating or counting duplicates (order of N), all the usual set operations such as union/intersect/minus (order of N), and semijoin, antijoin or natural join (all order of N). Indexes are created automatically behind the scenes on first use, so they support lazy operations. Indexes take order of N to create, and the technique scales linearly to relations of any size.

For my implementation, the only other important part is that I had a custom first-pass hashing function which deterministically serialized any distinct value of the type system into a distinct character string -- a carefully chosen serialization of the value -- and then I could just use the underlying programming language's standard built-in hashing mechanisms, which know how to use character strings as keys, to take care of the rest. So in Java/C# my system was based on a dictionary of (String, Object), where the String is the serialization of the value derived from the Object that we want to perform our equality tests on.

My solution may not be the absolute most performant, but it was technically simple and easy to implement, and at least decently performant, not exceeding order of N for anything.
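
In Java terms the shape of it is roughly this (a sketch; serialize stands in for my deterministic value-to-String serialization, PTuple is borrowed from Dave's code, plus the obvious java.util imports):

// Index values by a deterministic String serialization of the attribute(s)
// we want to compare on -- the language's String hashing does the rest.
static Map<String, List<PTuple>> buildCityIndex(Collection<PTuple> parts) {
  Map<String, List<PTuple>> index = new HashMap<>();
  for (PTuple t : parts) {
    String key = serialize(t.city());                      // hypothetical helper
    index.computeIfAbsent(key, k -> new ArrayList<>()).add(t);
  }
  // Non-key attributes: several tuples can share a key, so each value is a collection.
  // Key attributes: each collection holds exactly one tuple.
  return index;
}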

Quote from AntC on May 15, 2020, 10:38 pm
Quote from Dave Voorhis on May 15, 2020, 8:47 pm

 

Both JOIN operators employ an unusual way of representing their results. Rather than emitting a collection or Stream of a new tuple type whose attributes are the union of the attributes in both operands, they emit a Stream of PairS, where each Pair consists of a left item and a right item. The left item is from the left operand; the right item is the matching item from the right operand.

I wouldn't describe that as 'unusual'.

Unusual compared to usual implementations of the relational model.

A (ordered) pair is the usual mathematical result from a Cartesian Product. So what you have is a Cartesian Pair subject to an equi-restriction.

Yes.

You go on to say it needs no reflection. So how is the query determining there's an attribute name in common? Its name and the attribute type? What if same-named attributes turn out to be different types/incomparable?

It's checked and determined statically, at compile-time, per compilation of the JOIN expression (modulo cheating a bit for expediency in my example code by using Java 'equals' *mumble*Java*mumble*.) Reflection is a run-time mechanism.

This approach requires no new type to be defined to represent the JOIN result and integrates nicely with existing Java Streams capability.

Although the JOIN shown here plus Streams is not an implementation of the relational model per se, it provides equivalent facility.

Hmm? Now JOIN SP on to your result. You're presumably going to get a collection of Pairs of a tuple with a Pair of tuples PairS(SP, PairS(S, P)). So the right operand is not a 'flat' relation but a collection of PairS. How does the join restriction take the S# from the first of the Pair and test against the S# from the first of the second of the Pair? And contrariwise for P#?

Indeed, the result of a JOIN is not a flat relation. The operands don't have to be relations either. Per my post, this is not an implementation of the relational model.

...Which is my point, really. Trying to fit the relational model of any kind, TTM or otherwise, into Java or C# or other popular imperative programming languages inevitably involves undesirable tradeoffs. Dispensing with strict adherence to the relational model results in a different set of tradeoffs, but ones I find less undesirable. However, your mileage may vary, etc.

The JOIN restriction is an explicit expression, so performing a JOIN on the result of a prior JOIN requires that the expression be something like spTuple.l.city == spjTuple.l.r.city or whatever. Type information is statically preserved, so this works.
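
For example (a sketch only: SPTuple and sp are made up here; everything else is from my posted code, plus java.util.stream.Collectors):

record SPTuple(String sNum, String pNum, int qty) {}
var sp = new HashSet<SPTuple>();       // hypothetical SP relation, populated elsewhere

// Materialise the Pair results of S JOIN P so they can be a JOIN operand again.
var sJoinP = join(s, p, (st, pt) -> st.city().equals(pt.city()))
    .collect(Collectors.toSet());

// JOIN SP onto that result: the matcher has to reach inside the nested PairS.
var spJoinAll = join(sp, sJoinP,
    (spt, pair) -> spt.sNum().equals(pair.left().sNum())
                && spt.pNum().equals(pair.right().pNum()));
// spJoinAll is a Stream<Pair<SPTuple, Pair<STuple, PTuple>>> -- nested, not flat.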

Now apply that for a typical normalised schema design: SP JOIN S JOIN P, join to reference data for STATUS name, CityID, ColourID, Units-of-measure name, warehouse picking location, ... There's a very good reason JOIN doesn't produce pairs/does produce 'flattened' relations.

Notably, it requires no reflection, is fully statically type-safe,

D'uh well yes, because the type of the result is just a pairing of the types of its arguments. I'm not seeing you've achieved anything that wasn't blimmin' obvious already.

Yes, exactly!

That's what I like about it. In the context of typical Java, it's intuitive, effective, and obvious.

Inferring the Heading of the result from relational operations is the main/only hard part for static type-safe inference in GP languages (of enough other sophistication).

and the performance of the generic JOIN is very close to that of the hard-wired equivalent.

The complete code (for Java 14, as the new Java record mechanism is used), including the JOIN method definitions, example data, sample data generators, and performance comparisons:

Good to see you've used the new Java records: how ergonomic was it to code?

Saved me having to write the usual equals() and hashCode() implementations. That was nice.

P.S. I apologise if the above is rather dismissive. People in Haskell are forever producing pairs like this as if they've 'solved' SQL-alike SELECT/JOINs. They typically map a SELECT-alike over the collection of pairs to produce a 'flat' relation -- as you could do here presumably. And it's type-safe because the SELECT name-list specifies the Heading of the result. But they need to specify each attribute type of the result, and the compiler can statically check that it is the same type as in the pair. That's type-checking, not type inference (automatically deriving the attribute type without needing to spell it out). So I'm disagreeing this is "equivalent functionality" to SQL SELECT, let alone to NATURAL JOIN.

It's neither, nor is it the relational model, but the combination of this JOIN + Java Streams provides an equivalent facility to the relational model in general; arguably (though subjectively) more so than attempts to shoehorn a "true" relational algebra (TTM or otherwise) into Java (or C#, or whatever).
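
For instance, the rest of the usual operations fall out of ordinary Streams usage (a sketch using the S relation from my code and java.util.stream.Collectors; a multi-attribute projection would need a small made-up record as the target type):

// RESTRICT: suppliers with status > 15.
var restricted = s.stream().filter(st -> st.status() > 15).collect(Collectors.toSet());

// PROJECT over {city}: map to the attribute; collecting to a set removes duplicates.
var cities = s.stream().map(STuple::city).collect(Collectors.toSet());

// UNION: concatenate two streams of the same tuple type and de-duplicate.
var unioned = Stream.concat(s.stream(), restricted.stream()).distinct().collect(Collectors.toSet());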

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org
Quote from AntC on May 16, 2020, 7:52 am
Quote from Darren Duncan on May 16, 2020, 4:04 am

Having implemented generic relational join myself, unless I'm missing something, your implementation is missing a simple but important key detail if you want it to be performant.

You need to explicitly index on the attributes being joined, and not the tuples as a whole, except where those are one and the same.

There's a guy on StackOverflow who never passes over the opportunity -- no matter how remotely connected to the question -- to warn against (SQL NATURAL) JOIN on the grounds inter alia that it should only be allowed where bridging a Foreign Key reference/constraint.

One of his complaints is that (NATURAL) JOIN on anything other than a key has horrible performance.

...

Which is notionally true I suppose, but a completely useless point when divorced from requirements.

The requirement is to do a non-key (NATURAL) JOIN on two tables of roughly ten rows each, once a month, and it has to complete within an eighteen-hour maintenance window and normally takes, oh, a few dozen milliseconds?

I think we're ok with that horrible performance.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org
Quote from Darren Duncan on May 16, 2020, 4:04 am

Having implemented generic relational join myself, unless I'm missing something, your implementation is missing a simple but important key detail if you want it to be performant.

You need to explicitly index on the attributes being joined, and not the tuples as a whole, except where those are one and the same.

How I implemented this, in principle, is that rather than using a HashSet I used a Dictionary, where the values are the whole tuples and the keys are a projection on the attributes being joined.

When you do this, the common equal-join is just order of N rather than order of N-squared, because it can iterate over the smaller-cardinality argument and for each one do an order-of-one lookup in the index for the other argument (building the index is also order of N). So you could join, say, a thousand to a thousand or whatever in reasonable time, barring when you are explicitly doing a Cartesian product.

The above assumes a hash index.  I never bothered with the complexity of a tree or something that would support doing certain other kinds of theta join quickly.

Note lines 51 and 79 of my example code -- the "The algorithm here is sub-optimal" comments. My JOIN is hacked together for illustration -- and non-performance-critical use -- using the usual idiomatic Java Streams "join".

Indeed, a performant implementation should use indexing.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org

Addendum to the system I described before: in case anyone read my description as the solution only working when the index is on a key of both operands, that isn't the case for the general solution. The general solution works like any non-unique index. When a join etc. is not on a key, multiple tuples yield the same hash key for the equality comparison. When that happens, the associated value is a collection of tuples sharing the same hash, and a join encountering that hash is locally a Cartesian product -- or the whole join can degenerate into one if all source tuples have the same values for those attributes. Conversely, each collection has exactly one element when the attributes are a key, and the join is just order of N; if the system knows this is the case, it could skip the collection and store the single element directly.

Also, my system is adaptable to either automatically using common attributes or alternately using whatever derived value the user decides they want to join on, as long as the operation is ultimately an equality comparison.
