Aggregations
Aggregation is the point where a table stops being only a list of rows and starts producing summaries.
In Lutra, aggregation happens in three stages:
- group rows and partitions with key and values,
- switch values to a columnar view with
aggregateormap_columnar, - reduce one column, for example with
sumorcount.
Count rows¶
count returns the number of items in an array.
const movies = [
{id = 1: int32, title = "Arrival"},
{id = 2: int32, title = "Dune"},
]
func main() -> movies | count()
count([]) is 0.
Sum a column¶
A common pattern is:
- map rows to one field,
- aggregate that field.
const sales = [
{category = "books", amount = 10: int32},
{category = "games", amount = 25: int32},
{category = "books", amount = 15: int32},
]
func main() -> (
sales
| map(x -> x.amount)
| sum()
)
sum returns zero on an empty array.
min, max, and mean¶
min and max return an option-like result because an empty array has no
minimum or maximum.
func main() -> {
min([5, 3, 8: int32]),
max([5, 3, 8: int32]),
}
func main() -> min([]: [int32])
That returns none.
mean always returns float64.
func main() -> mean([5, 3, 8: int32])
On an empty array, mean returns NaN.
Boolean aggregates¶
Use all, any, and contains for boolean-style questions.
func main() -> {
all([true, true, true]),
any([false, false, true]),
contains(["a", "b", "c"], "b"),
}
Useful edge cases:
all([])istrueany([])isfalsecontains([], x)isfalse
Group rows and summarize them with group_map¶
group_map is the primary grouped-aggregation tool.
It is often the closest Lutra equivalent to SQL GROUP BY.
const sales = [
{category = "books", amount = 10: int32},
{category = "games", amount = 25: int32},
{category = "books", amount = 15: int32},
]
func main() -> (
sales
| group_map(
x -> x.category,
func (category, rows) -> {
category = category,
total_amount = rows | map(r -> r.amount) | sum(),
}
)
)
The flow is:
- choose a grouping key,
- gather all rows for that key,
- compute one summary row for that group.
Compute several metrics per group¶
You can return as many summary fields as you need.
func main() -> (
sales
| group_map(
x -> x.category,
func (category, rows) -> {
category = category,
total_amount = rows | map(r -> r.amount) | sum(),
count = rows | count(),
}
)
)
This is often more convenient than using group first and then writing another
map step.
Aggregate a whole relation into one row¶
You can summarize the whole relation by grouping everything under the same key.
func main() -> (
sales
| group_map(
_ -> true,
func (_, rows) -> {
total_amount = rows | map(r -> r.amount) | sum(),
count = rows | count(),
}
)
)
This returns an array with one summary row.
Sometimes you want the summary row itself rather than a one-item array. A common pattern is:
func main() -> (
sales
| group_map(
_ -> true,
func (_, rows) -> {
total_amount = rows | map(r -> r.amount) | sum(),
count = rows | count(),
}
)
| index(0)
| option::or_default()
)
This is also useful when the input relation may be empty.
Use group directly when you need the grouped rows¶
group returns rows of the shape {key, values}.
func main() -> (
sales
| group(x -> x.category)
)
Use group directly when you want access to the grouped rows themselves.
Use group_map when you already know you want one summary row per group.
Use fold and scan for custom reductions¶
fold lets you build your own accumulator.
func main() -> fold(
[1, 2, 3, 4, 5]: [int64],
{sum = 0: int64, count = 0: int64},
func (s, n: int64) -> {sum = s.sum + n, count = s.count + 1}
)
scan is similar, but it returns every intermediate accumulator.
func main() -> scan(
[1, 2, 3, 4, 5]: [int64],
0,
func (s: int64, n: int64): int64 -> n + s
)
Use fold when you want only the final result. Use scan when you want the
running results.
Think in columns with aggregate¶
Most of the time, Lutra code works row by row. Sometimes it is more natural to work column by column instead.
aggregate converts an array of rows into a columnar shape and passes it to a
closure.
const rel: [{sales: int64, refunds: int64}] = [
{sales = 5, refunds = 3},
{sales = 65, refunds = 1},
{sales = 3, refunds = 2},
]
func main() -> aggregate(rel, func (x: {sales: [int64], refunds: [int64]}) -> {
min_sales = min(x.sales),
min_refunds = min(x.refunds),
})
The closure receives columns, not rows. That means it can naturally express operations that are easier to write over full columns.
Understand to_columnar, from_columnar, and map_columnar¶
These helper functions expose the same idea more explicitly.
to_columnar turns rows into columns:
func main() -> to_columnar([
{sales = 5: int16, refunds = 3: int16},
{sales = 65, refunds = 1},
{sales = 3, refunds = 2},
])
from_columnar turns columns back into rows:
func main() -> from_columnar({
sales = [5: int16, 65, 3],
refunds = [3: int16, 1, 2],
})
map_columnar lets you transform a relation through a columnar closure and then
convert back to rows.
func main() -> map_columnar(
[
{sales = 5: int16, refunds = 3: int16},
{sales = 65, refunds = 1},
{sales = 3, refunds = 2},
],
func (x: {sales: [int16], refunds: [int16]}) -> {
sales = lag(x.sales, 1),
refunds = lead(x.refunds, 1),
}
)
A useful rule of thumb is:
- use
mapandgroup_mapfor row-oriented work, - use
aggregateormap_columnarwhen the problem is naturally column-oriented.
See also¶
- Tabular data basics for row-oriented transformations.
- Pipelines for the general left-to-right programming style.
- Reference: Expressions for exact call and pipe rules.
- Reference: Types for tuples, arrays, and grouped result shapes.