Digression: Nowadays when RAM is expensive good old zram is gaining popularity ;) Try to check on trends.google.com . Since 2025-09 search for it doubled ;)
> Peak memory consumption is 1.3 MB. At this point you might want to stop reading and make a guess on how much memory a native code version of the same functionality would use.
I wish I knew the input size when attempting to estimate, but I suppose part of the challenge is also estimating the runtime's startup memory usage too.
> Compute the result into a hash table whose keys are string views, not strings
If the file is mmap'd, and the string view points into that, presumably decent performance depends on the page cache having those strings in RAM. Is that included in the memory usage figures?
Nonetheless, it's a nice optimization that the kernel chooses which hash table keys to keep hot.
The other perspective on this is that we sought out languages like Python/Ruby because the development cost was high, relative to the hardware. Hardware is now more expensive, but development costs are cheaper too.
The take away: expect more push towards efficiency!
>> Peak memory consumption is 1.3 MB. At this point you might want to stop reading and make a guess on how much memory a native code version of the same functionality would use.
At this point I'd make two observations:
- how big is the text file? I bet it's a megabyte, isn't it? Because the "naive" way to do it is to read the whole thing into memory.
- all these numbers are way too small to make meaningful distinctions. Come back when you have a gigabyte. It gets more interesting when the file doesn't fit into RAM at all.
> all these numbers are way too small to make meaningful distinctions. Come back when you have a gigabyte.
I have to disagree. Bad performance is often a result of a death of a thousands cuts. This function might be one among countless similarly inefficient library calls, programs and so on.
Well, we can use memoryview for the dict generation avoiding creation of string objects until the time for the output:
import re, operator
def count_words(filename):
with open(filename, 'rb') as fp:
data= memoryview(fp.read())
word_counts= {}
for match in re.finditer(br'\S+', data):
word= data[match.start(): match.end()]
try:
word_counts[word]+= 1
except KeyError:
word_counts[word]= 1
word_counts= sorted(word_counts.items(), key=operator.itemgetter(1), reverse=True)
for word, count in word_counts:
print(word.tobytes().decode(), count)
String views were a solid addition to C++. Still underutilized. It does not matter which language you are using when you make thousands of tiny memory allocations during parsing. https://en.cppreference.com/w/cpp/string/basic_string_view.h...
Digression: Nowadays when RAM is expensive good old zram is gaining popularity ;) Try to check on trends.google.com . Since 2025-09 search for it doubled ;)
Nice!
> Peak memory consumption is 1.3 MB. At this point you might want to stop reading and make a guess on how much memory a native code version of the same functionality would use.
I wish I knew the input size when attempting to estimate, but I suppose part of the challenge is also estimating the runtime's startup memory usage too.
> Compute the result into a hash table whose keys are string views, not strings
If the file is mmap'd, and the string view points into that, presumably decent performance depends on the page cache having those strings in RAM. Is that included in the memory usage figures?
Nonetheless, it's a nice optimization that the kernel chooses which hash table keys to keep hot.
The other perspective on this is that we sought out languages like Python/Ruby because the development cost was high, relative to the hardware. Hardware is now more expensive, but development costs are cheaper too.
The take away: expect more push towards efficiency!
>> Peak memory consumption is 1.3 MB. At this point you might want to stop reading and make a guess on how much memory a native code version of the same functionality would use.
At this point I'd make two observations:
- how big is the text file? I bet it's a megabyte, isn't it? Because the "naive" way to do it is to read the whole thing into memory.
- all these numbers are way too small to make meaningful distinctions. Come back when you have a gigabyte. It gets more interesting when the file doesn't fit into RAM at all.
The state of the art here is : https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times... , wherein our hero finds the terrible combination of putting the whole file in a single string and then running strlen() on it for every character.
> all these numbers are way too small to make meaningful distinctions. Come back when you have a gigabyte.
I have to disagree. Bad performance is often a result of a death of a thousands cuts. This function might be one among countless similarly inefficient library calls, programs and so on.
Well, we can use memoryview for the dict generation avoiding creation of string objects until the time for the output:
We could also use `mmap.mmap`.> how much memory a native code version of the same functionality would use.
native to what? how c++ is more native than python?
I think py version can be shortened as:
from collections import Counter
stats = Counter(x.strip() for l in open(sys.argv[1]) for x in l)
Would that decrease memory usage though?
"copyright infringement factories"