Pyarrow - Nate's Notes

# Pyarrow

* View api
  * public user interface
  * get head of view/dataframe
  * write to parquet
* want DS to not even know it exists
* motivation
* kernel
* how do we do that work in that space for this particular task of things we want to do
* mapto kernel that is unique to dask, daskarrow, and sql
* engine represents
* idea of repartitioning doesn't exist in fabric
* engine.py, schema.py, mapto.py
* `%%prun -T big_meas.txt -D big`
* `gprof2dot`
* every partition is an arrow table
* everything else is the same
* Arrow gains
  * strong typing
  * faster than dask
  * take one sorted
  * memory mapping?
* Things that suck about arrow
  * arrow tables don't have some common things
    * no table.groupby() - we wrote our own
    * no pd.map()
* engineers tend to interact with aca (apply concat apply)
  * agg is a mask for aca
* demo notebook for arrow
* pair with reed
* don't get pandas for free, but do get numpy for free (then we get all of numpy)
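
Profiling sketch: the `%%prun -T big_meas.txt -D big` cell magic saves a text report and a binary stats dump. Outside a notebook, roughly the same two files can be produced with the standard library. `build_partition` is a made-up stand-in for the real per-partition work, not something from engine.py or mapto.py.

```python
import cProfile
import pstats


def build_partition(n):
    """Hypothetical workload; replace with the code being measured."""
    return sorted((i * 7919) % 1013 for i in range(n))


profiler = cProfile.Profile()
profiler.enable()
for _ in range(20):
    build_partition(5_000)
profiler.disable()

# Binary stats dump, the same kind of file `%%prun -D big` writes.
profiler.dump_stats("big")

# Text report, comparable to what `%%prun -T big_meas.txt` saves.
with open("big_meas.txt", "w") as report:
    stats = pstats.Stats("big", stream=report)
    stats.sort_stats("cumulative").print_stats(10)
```

The dump can then be turned into a call graph with `gprof2dot -f pstats big | dot -Tpng -o big.png` (needs Graphviz for `dot`).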
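
ACA sketch: a minimal pure-Python illustration of apply-concat-apply, with a grouped sum written as a thin wrapper over it (the "agg is a mask for aca" idea). Partitions here are plain lists of `(key, value)` rows so it runs anywhere; in the real engine each partition would be an arrow table. The names `apply_concat_apply`, `sum_by_key` and `grouped_sum` are invented for this example and are not the project's API.

```python
from collections import defaultdict


def apply_concat_apply(partitions, chunk, aggregate):
    """Run `chunk` per partition, concatenate the partials, run `aggregate` once."""
    partials = [chunk(part) for part in partitions]           # apply
    combined = [row for part in partials for row in part]     # concat
    return aggregate(combined)                                # apply


def sum_by_key(rows):
    """Sum values per key and return sorted (key, total) rows."""
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return sorted(totals.items())


def grouped_sum(partitions):
    # Sums can be re-summed, so the same function works for both stages.
    return apply_concat_apply(partitions, chunk=sum_by_key, aggregate=sum_by_key)


partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("c", 5)],
]
result = grouped_sum(partitions)
print(result)
```

Aggregations that cannot be re-applied to their own output (a mean, for instance) would need a different `chunk` (per-partition sum and count) and `aggregate` (combine, then divide).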