The saddest part about this article being from 2014 is that the situation has arguably gotten worse.
We now have even more layers of abstraction (Airflow, dbt, Snowflake) applied to datasets that often fit entirely in RAM.
I've seen startups burning $5k/mo on distributed compute clusters to process <10GB of daily logs, purely because setting up a 'Modern Data Stack' is what gets you promoted, while writing a robust bash script is seen as 'unscalable' or 'hacky'. The incentives are misaligned with efficiency.
When I worked as a data engineer, I rewrote some Bash and Python scripts into C# that were previously processing gigabytes of JSON at 10s of MB/s - creating a huge bottleneck.
By applying some trivial optimizations, like streaming the parsing, I essentially managed to get it to run at almost disk speed (1GB/s on an SSD back then).
Just how much data do you need before these sorts of clustered approaches really start to make sense?
> I rewrote some Bash and Python scripts into C# that were previously processing gigabytes of JSON
Hah, incredibly funny, I remember doing the complete opposite about 15 years ago: some beginner developer had set up a whole interconnected system with multiple processes and whatnot in order to process a bunch of JSON, and it took forever. Got replaced with a bash script + Python!
> Just how much data do you need before these sorts of clustered approaches really start to make sense?
I dunno exactly what thresholds others use, but I usually say if it'd take longer than a day to process (efficiently), then you probably want to figure out a better way than just running a program on a single machine to do it.
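For a rough sense of where that one-day threshold lands, here is a back-of-envelope sketch. It assumes the single-machine rates mentioned upthread (about 1GB/s after optimization, and 10 MB/s as a stand-in for "10s of MB/s"); real throughput depends heavily on the workload, so treat the numbers as illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60

MB = 10**6
GB = 10**9
TB = 10**12


def bytes_per_day(throughput_bytes_per_s: float) -> float:
    """Data volume one machine gets through in 24h at a sustained rate."""
    return throughput_bytes_per_s * SECONDS_PER_DAY


# Assumed rates, taken loosely from the comments above.
fast_tb = bytes_per_day(1 * GB) / TB   # roughly 86 TB per day
slow_tb = bytes_per_day(10 * MB) / TB  # a bit under 1 TB per day

print(f"1 GB/s  -> {fast_tb:.1f} TB/day")
print(f"10 MB/s -> {slow_tb:.3f} TB/day")
```

So under these assumptions the "longer than a day" rule only kicks in somewhere in the tens of terabytes per run for an efficient pipeline, and about two orders of magnitude earlier for a slow one.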
How do you stream parse json? I thought you need to ingest it whole to ensure it is syntactically valid, and most parsers don't work with inchoate or invalid json? Or at least it doesn't seem trivial.
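One possible answer, as a minimal sketch rather than what the commenter above actually did in C#: if the input is a stream of many top-level JSON objects or arrays (newline-delimited or simply concatenated, which is an assumption about the data), you can validate and emit each record as soon as it is complete, so only one record plus a read buffer sits in memory. The helper name `iter_json_values` is made up for this example; `json.JSONDecoder.raw_decode` is the standard-library call doing the work. A single enormous top-level array needs an event-based parser instead, which this sketch does not cover.

```python
import io
import json
from typing import Iterator, TextIO


def iter_json_values(stream: TextIO, chunk_size: int = 1 << 16) -> Iterator[object]:
    """Yield top-level JSON objects/arrays from a stream, one at a time.

    Hypothetical helper for illustration. Assumes every top-level value is
    an object or an array; bare scalars are rejected because they cannot be
    split safely at chunk boundaries.
    """
    decoder = json.JSONDecoder()
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        at_eof = not chunk
        buffer += chunk
        pos = 0
        while True:
            # raw_decode does not skip leading whitespace, so do it here.
            while pos < len(buffer) and buffer[pos].isspace():
                pos += 1
            if pos == len(buffer):
                break
            try:
                value, end = decoder.raw_decode(buffer, pos)
            except json.JSONDecodeError:
                if at_eof:
                    raise  # truncated or malformed input
                break  # record probably spans a chunk boundary: read more
            if not isinstance(value, (dict, list)):
                raise ValueError("top-level scalars are not supported in this sketch")
            yield value
            pos = end
        # Drop what has been consumed; keep any incomplete tail.
        buffer = buffer[pos:]
        if at_eof:
            return


if __name__ == "__main__":
    sample = io.StringIO('{"level": "info"}\n{"level": "error", "codes": [1, 2]}\n')
    for record in iter_json_values(sample, chunk_size=8):
        print(record)
```

Syntax errors still surface, just at the record where they occur rather than up front. The trade-off in this sketch is that a malformed record in the middle of the stream is only reported once the rest of the input has been buffered, and a record much larger than `chunk_size` gets re-parsed on every read.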
A selection of times it's been previously posted:
(2018, 222 comments) https://news.ycombinator.com/item?id=17135841
(2022, 166 comments) https://news.ycombinator.com/item?id=30595026
(2024, 139 comments) https://news.ycombinator.com/item?id=39136472 - by the same submitter as this post.
Great article. Hadoop (and other similar tools) are for datasets so huge they don't fit on one machine.
https://www.scylladb.com/2019/12/12/how-scylla-scaled-to-one...
I like this one where they put a dataset on 80 machines, only for someone to then put the same dataset on 1 Intel NUC and outperform it in query time.
https://altinity.com/blog/2020-1-1-clickhouse-cost-efficienc...
Datasets never become big enough…
Well yeah, but that's a _very_ different engineering decision with different constraints; it's not fully apples to apples.
Materialised views add insert load for every view. So if you want to slice your data in a way that wasn't predicted, or in a way that would have pushed ingress load beyond what you've got to spare (say, finding all devices with a specific model and year+month because there's a dodgy lot), you'll really wish you were on a DB that can actually run that query instead of only being able to return your _precalculated_ results.
>Datasets never become big enough…
Not only is this a contrived non-comparison, but the statement itself is readily disproven by the limitations basically _everyone_ using single-instance ClickHouse runs into if they actually have a large dataset.
Spark and Hadoop have their place, maybe not in rinky-dink startup land, but definitely in the world of petabyte and exabyte data processing.
And we can have pretty fucking big single machines right now
This has been a recurring theme for ages, with a few companies taking it to extremes; there are people transpiling COBOL to bash too…
Hadoop, blast from the past
And now with things like DuckDB and clickhouse-local you won't have to worry about data processing performance ever again. Just kidding, but especially with ClickHouse it's so much easier to handle large data volumes than it used to be, and even a single beefy server is often enough to satisfy all data analytics needs for a moderate-to-large company.