What's the difference between feather and parquet in terms of usage? I get the design philosophy, but how would you use them differently?
https://stackoverflow.com/questions/48083405/what-are-the-di...
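For a concrete feel, here's a rough pyarrow sketch (file names and toy data are made up). Feather v2 is essentially the Arrow IPC format on disk, good for fast local interchange, while Parquet trades some speed for compression and encodings better suited to long-term storage:

    import pyarrow as pa
    import pyarrow.feather as feather
    import pyarrow.parquet as pq

    table = pa.table({"id": [1, 2, 3], "price": [9.5, 10.25, 11.0]})

    # Feather: minimal encoding, very fast to write/read locally.
    feather.write_feather(table, "prices.feather")
    local = feather.read_table("prices.feather")

    # Parquet: heavier encoding + compression, column pruning on read.
    pq.write_table(table, "prices.parquet")
    archived = pq.read_table("prices.parquet", columns=["price"])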
I like Arrow for its type system. It's efficient, complete, and does not have "infinite precision decimals". Considering Postgres's decimal encoding, using i256 as the backing type is a much saner approach.
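For reference, a tiny pyarrow sketch of that type (precision, scale, and values here are made up). decimal256 is fixed precision/scale backed by a 256-bit integer, so values stay exact without an arbitrary-precision encoding:

    from decimal import Decimal
    import pyarrow as pa

    # Fixed precision/scale, 256-bit backing integer (precision up to 76).
    ty = pa.decimal256(60, 10)
    arr = pa.array([Decimal("123456789012345.0123456789")], type=ty)
    print(arr.type)  # decimal256(60, 10)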
If I could go back to 2015 me, who had just found the feather library and was using it to power my unhinged topic modeling for PowerPoint slides work, and explain what feather would become (Arrow) and the impact it would have on the data ecosystem, 2015 me would have looked at 2026 me like he was a crazy person.
Yet today I feel it was 2016 dataders who was the crazy one lol
Indeed. feather was a library to exchange data between R and pandas dataframes. People tend to bash pandas, but its creator (Wes McKinney) has changed the data ecosystem for the better with the lessons learned from pandas.
I know pandas has a lot of technical warts and shortcomings, but I'm grateful for how much it empowered me early in my data/software career, and the API still feels more ergonomic to me due to the years of usage - plus GeoPandas layering on top of it.
Really, though, I prefer DuckDB SQL these days for anything that needs to perform well, and I feel like SQL is easier to grok than Python code most of the time.
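As a rough illustration (assuming a recent duckdb Python package; the frame and column names are invented), DuckDB can query a pandas DataFrame that's in scope, with no export step:

    import duckdb
    import pandas as pd

    df = pd.DataFrame({"sym": ["A", "B", "A"], "px": [10.0, 11.5, 10.4]})

    # DuckDB picks up the local DataFrame by name (replacement scan)
    # and returns the result as another DataFrame.
    out = duckdb.sql("SELECT sym, avg(px) AS avg_px FROM df GROUP BY sym").df()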
Do people bash pandas? If so, it reminds me of Bjarne's quip that the two types of programming languages are the ones people complain about and the ones nobody uses.
polars people do - although I wouldn't call polars something that nobody uses.
I also use Polars in new projects. I think Wes McKinney also uses it; if I remember correctly, I saw him commenting on some Polars memory-related issues on GitHub. But a good chunk of Polars' success can be attributed to Arrow, which McKinney co-created. All the gripes people have with pandas, he had them too, and he built something powerful to overcome them.
I saw Wes speak in the early days of Pandas, in Berkeley. He solved problems that others had just worked around for decades. His solutions are quirky, but the work was very solid. His career advanced a lot, IMHO, for substantial reasons: Wes personally marched through swamps and reached the other side, while others complain and do what they have always done. I personally agree with the criticisms of the syntax, but Pandas is real, and it was not easy to build.
It's nice to see useful, impactful interchange formats getting the attention and resources they need, and ecosystems converging around them. Optimizing serialization/deserialization might seem like a "trivial" task at first, but when you're moving petabytes of data it quickly becomes the bottleneck. With common interchange formats, the benefits of these optimizations are shared across stacks. Love to see it.
We use Apache Arrow at my company and it's fantastic. The performance is so good. We have terabytes of time-series financial data and use Arrow to store and process it.
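A minimal sketch of that kind of workflow with pyarrow (file name and toy data invented): write a table in the Arrow IPC file format, then memory-map it back so record batches are read from the mapped pages instead of being copied into the Python heap up front:

    import pyarrow as pa

    # Toy stand-in for a tick table; real data would be millions of rows.
    table = pa.table({"ts_ns": [1, 2, 3], "px": [100.0, 100.5, 99.8]})

    # Write it in the Arrow IPC file format (a.k.a. Feather v2).
    with pa.ipc.new_file("ticks.arrow", table.schema) as writer:
        writer.write_table(table)

    # Memory-map it back; batches reference the mapped pages rather than
    # being eagerly copied.
    with pa.memory_map("ticks.arrow") as source:
        ticks = pa.ipc.open_file(source).read_all()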
Stumbled upon it recently while optimizing Parquet writes. It worked flawlessly and 10-20x'd my throughput.
I had to look up what Arrow actually does, and I might have to run some performance comparisons vs sqlite.
It's very neat for some types of data to have columns contiguous in memory.
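A tiny pyarrow sketch of what that contiguity buys you (toy values): a primitive column without nulls can be handed to NumPy as a view, with no copy:

    import pyarrow as pa

    col = pa.array([1.0, 2.5, 3.25, 4.0])

    # The values live in one contiguous buffer, so NumPy can view them directly.
    view = col.to_numpy(zero_copy_only=True)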
>> some performance comparisons vs sqlite.
That's not really the purpose; it's really a language-independent format, so the same data doesn't need to change shape between, say, a Python dataframe and R. It's columnar because for analytics (where you do lots of aggregations and filtering) this is way more performant; the data is intentionally stored so the target columns are contiguous. You probably already know, but the analytics equivalent of SQLite is DuckDB. Arrow can also eliminate the need to serialize/de-serialize data when sharing (ex: a high performance data pipeline) because different consumers / tools / operations can use the same memory representation as-is.
> Arrow can also eliminate the need to serialize/de-serialize data when sharing (ex: a high performance data pipeline) because different consumers / tools / operations can use the same memory representation as-is.
Not sure if I misunderstood, but what are the chances those different consumers / tools / operations are running in your memory space?
Not an expert, so I could be wrong, but my understanding is that you could just copy those bytes directly from the wire to your memory and treat those as the Arrow payload you're expecting it to be.
You still have to transfer the data, but you remove the need for a transformation before writing to the wire, and a transformation when reading from the wire.
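Roughly, with pyarrow (toy table, and an in-memory buffer standing in for the wire):

    import pyarrow as pa

    table = pa.table({"id": [1, 2, 3], "px": [9.5, 10.25, 11.0]})

    # Producer: write the table as an Arrow IPC stream (in practice this
    # would go over a socket, shared memory, Flight, ...).
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    wire_bytes = sink.getvalue()

    # Consumer: interpret the received bytes as Arrow directly; no row-by-row
    # decode into a different in-memory layout is needed.
    received = pa.ipc.open_stream(wire_bytes).read_all()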
Thanks! This is all probably me using the familiar sqlite hammer where I really shouldn't.
If I recall, Arrow is more or less a standardized representation in memory of columnar data. It tends to not be used directly I believe, but as the foundation for higher level libraries (like Polars, etc.). That said, I'm not an expert here so might not have full info.
You can absolutely use it directly, but it is painful. The USP of Arrow is that you can pass bits of memory between Polars, DataFusion, DuckDB, etc. without copying. It's Parquet but for memory.
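A minimal sketch of that hand-off with two of those engines (toy data; whether a given column is truly zero-copy depends on the dtype and library versions):

    import duckdb
    import polars as pl
    import pyarrow as pa

    trades = pa.table({"sym": ["A", "B", "A"], "px": [10.0, 11.5, 10.4]})

    # Polars wraps the same Arrow buffers (zero-copy for most dtypes)...
    df = pl.from_arrow(trades)

    # ...and DuckDB can scan the in-scope Arrow table directly and hand
    # the result back as Arrow.
    out = duckdb.sql("SELECT sym, avg(px) AS avg_px FROM trades GROUP BY sym").arrow()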
This is true, and as a result IME the problem space is much smaller than Parquet's, but it can be really powerful. The reality is most of us don't work in environments where Arrow is needed.
Take a look at parquet.
You can also store Arrow on disk, but it is mainly used as an in-memory representation.
Yeah, not necessarily compute (though it does have compute kernels)!
It's actually many things: an IPC format, a wire protocol, a database connectivity spec, etc.
In reality it's about an in-memory tabular (columnar) representation that enables zero-copy operations between languages and engines.
And, IMHO, it all really comes down to standard data types for columns!
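For example, a schema sketched with pyarrow (field names and types invented here) means the same thing to pandas, Polars, DuckDB, R, Rust, and so on:

    import pyarrow as pa

    # One schema, interpreted identically across languages and engines.
    schema = pa.schema([
        ("ts", pa.timestamp("ns", tz="UTC")),
        ("sym", pa.dictionary(pa.int32(), pa.string())),
        ("px", pa.decimal128(18, 4)),
        ("qty", pa.int64()),
    ])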