Exponential runtime complexity for Expr.substitute #835

@fjetter

Let's assume for the sake of simplicity that we have an expression graph where every expression depends on D other expressions, except for the root/source expressions, which do not depend on any other expression. In total, we have N expressions.

Simplifying by ignoring Fused, literals and removing boilerplate, substitute currently looks like

def substitute(self, old, new):
    new_exprs = []
    for op in self.operands:  # D operands per expression
        if isinstance(op, Expr):
            # Recurse and keep the substituted operand
            new_exprs.append(op.substitute(old, new))
        else:
            new_exprs.append(op)
    return type(self)(*new_exprs)  # O(D) with caching of tokenization/names

The runtime of this is then T(N) = D * T(M) + O(D), where N is the total number of expressions in the graph and M is the number of expressions each individual operand depends on. In the very simple case of a tree-like structure, M is approximately N/D, which gives us T(N) = D * T(N / D) + O(D).
Using the master theorem for recurrences (a = b = D; c_crit = log_b(a) = 1), we get T(N) = O(N), which is fine.
However, if the subproblem size M is not reduced by a factor but only by a constant amount, e.g. M = N - 1, we get T(N) = D * T(N - 1) + O(D), which (by induction) reduces to T(N) = O(D^N). This is exponential growth, which is catastrophic (whether the constant is 1 or something else doesn't matter).
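For reference, the induction step can be written out by unrolling the recurrence. This is only a sketch under two assumptions that are not stated above: the O(D) term is written as c·D for some constant c, and a single source expression costs T(1) = O(1).

```latex
\begin{align*}
T(N) &= D\,T(N-1) + cD \\
     &= D^{2}\,T(N-2) + cD^{2} + cD \\
     &\;\;\vdots \\
     &= D^{N-1}\,T(1) + c\sum_{k=1}^{N-1} D^{k} \\
     &= D^{N-1}\,T(1) + c\,\frac{D^{N} - D}{D - 1} \\
     &= O(D^{N}) \qquad \text{for } D \ge 2 .
\end{align*}
```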

This may sound artificial, but since we're dealing with generic DAGs, this condition is not impossible and not even uncommon. Whenever there is a cycle in our graph structure (in the undirected sense, i.e. an expression that is reachable from the root via more than one path), the reduction is only by a constant amount. Assuming that our cycles are often diamond-like structures (i.e. D = 2, two branches) and that C is the number of cycles in an expression graph, this gives us a worst-case runtime of O(2^C).
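To make the diamond argument concrete, here is a minimal sketch that counts how many calls a naive recursive traversal makes on a chain of C diamonds, compared to a traversal that visits every node only once. `Node`, `diamond_chain`, `count_naive` and `count_memoized` are hypothetical names for this illustration; none of this is dask-expr code.

```python
class Node:
    """Hypothetical stand-in for an expression; not the dask-expr Expr class."""

    def __init__(self, name, *operands):
        self.name = name
        self.operands = operands


def diamond_chain(c):
    """Build a chain of `c` diamonds: two branches that share the same child."""
    node = Node("source")
    for i in range(c):
        left = Node(f"left-{i}", node)
        right = Node(f"right-{i}", node)
        node = Node(f"join-{i}", left, right)
    return node


def count_naive(node):
    """Number of calls made by a recursive traversal without any caching."""
    return 1 + sum(count_naive(op) for op in node.operands)


def count_memoized(node, seen=None):
    """Number of distinct nodes processed when each node is visited at most once."""
    seen = set() if seen is None else seen
    if id(node) in seen:
        return 0
    seen.add(id(node))
    return 1 + sum(count_memoized(op, seen) for op in node.operands)


if __name__ == "__main__":
    for c in (1, 5, 10, 15):
        root = diamond_chain(c)
        print(c, count_naive(root), count_memoized(root))
```

In this toy graph the naive count follows 2^(C+2) - 3 while the number of distinct nodes is only 3C + 1, so every additional diamond roughly doubles the work of the naive traversal.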

That this is not just a theoretical problem but also a practical one can be seen in Query 21 of the TPCH benchmark suite as it is currently implemented in the coiled/benchmarks repo, where the substitution takes a relatively long time to run; see also #798 (comment). If my math checks out, adding another filter to the query would double the optimize runtime, but I haven't tested this yet.

I instrumented the code and could measure ~13.4M invocations of Expr.substitute. In this particular example it seems that these cycles are introduced by Filter expressions. We had a similar problem in the past with Assign expressions.
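For anyone who wants to reproduce this kind of measurement, one possible way to count invocations is to wrap the method with a counting decorator. This is a generic sketch and not the instrumentation that was actually used; `ToyExpr`, `count_calls` and `call_counts` are hypothetical names, and `ToyExpr` only imitates the simplified `substitute` from above (with an added identity check so the substitution does something).

```python
import functools
from collections import Counter

call_counts = Counter()


def count_calls(func):
    """Wrap `func` so that every invocation is tallied under its qualified name."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        call_counts[func.__qualname__] += 1
        return func(*args, **kwargs)

    return wrapper


class ToyExpr:
    """Hypothetical stand-in; not the dask-expr Expr class."""

    def __init__(self, *operands):
        self.operands = operands

    def substitute(self, old, new):
        if self is old:
            return new
        return ToyExpr(
            *(
                op.substitute(old, new) if isinstance(op, ToyExpr) else op
                for op in self.operands
            )
        )


# Patch the method in place; recursive calls go through the wrapper as well.
ToyExpr.substitute = count_calls(ToyExpr.substitute)

# One diamond: both branches share the same child expression.
shared = ToyExpr(1)
root = ToyExpr(ToyExpr(shared), ToyExpr(shared))
root.substitute(shared, ToyExpr(2))
print(call_counts["ToyExpr.substitute"])  # 5 calls for only 4 distinct expressions
```

The shared child is visited once per path that leads to it, which is the effect that compounds with every additional diamond.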
