> It checks contributor patterns too. If 90% of commits come from one person who hasn’t pushed in months, that’s flagged.
IMO this is a slight green flag; not red.
I have to agree - the highest quality libraries in my experience are the ones maintained by one dedicated person as their pet project. There's no glory, no money, no large community, no Twitter followers - just a person with a problem to solve, making the solution open source for the benefit of others.
Also, isn't that just 99% of OSS projects out there? I've maintained a project for the past 7+ years, and despite 1 million downloads and tens of thousands of monthly active users, it's still mostly me maintaining and committing. Yes, there is a bus factor, but it's a common and known problem in open source. It would be better to try to improve the situation than to just flag all such projects. It's hard enough to find people ready to help and work on something outside their working hours on a regular basis...
Fair take—it's definitely context-dependent. In some cases, solo-maintainer projects can be great, especially if they’re stable or purpose-built. But from a trust and maintenance standpoint, it’s worth flagging as a signal: if 90% of commits are from one person who’s now inactive, it could mean slow responses to bugs or no updates for security issues. Doesn’t mean the project is bad—just something to consider alongside other factors.
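For concreteness, here's a minimal sketch of what that heuristic could look like against the public REST API via `requests`; the 90% cutoff and six-month window are illustrative, not the tool's actual values:

```python
# Sketch: flag repos where one author dominates commits and has gone quiet.
# Thresholds are illustrative, not the tool's actual implementation.
from collections import Counter
from datetime import datetime, timedelta, timezone

import requests

def contributor_risk(owner: str, repo: str, token: str) -> bool:
    headers = {"Authorization": f"token {token}"}
    url = f"https://api.github.com/repos/{owner}/{repo}/commits"
    # Last 100 commits only; a real check would paginate further back.
    commits = requests.get(url, headers=headers, params={"per_page": 100}).json()

    authors = Counter(
        (c.get("author") or {}).get("login", "unknown") for c in commits
    )
    _, top_count = authors.most_common(1)[0]
    dominance = top_count / max(1, len(commits))

    last_push = max(
        datetime.fromisoformat(c["commit"]["committer"]["date"].replace("Z", "+00:00"))
        for c in commits
    )
    stale = datetime.now(timezone.utc) - last_push > timedelta(days=180)

    return dominance >= 0.9 and stale  # one person, long silent -> flag
```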
Heuristics are never perfect and it's all iterative, but the point is understanding the underlying assumptions and weighing what you get out of it against your own context. I could probably enhance it slightly with an LLM pass and a prompt, but I prefer to keep things purely statistical for now.
It could also mean that the project is stable. Since you only look at the one repository's commit activity, a stable project with a maintainer who's still active on GitHub in other places would be "less trustworthy" than a project that's a work in progress.
I agree. I have a popular-ish project on GitHub that I haven't touched in like a decade. I would if needed, but it's basically "done". It works. It does everything it needs to, and no one's reported a bug in many, many years.
You could etch that thing into granite as far as I can tell. The only thing left to do is rewrite it in Rust.
Not a bad idea tbh - maybe an additional metric for how long issues are left open would be a good idea. Though yeah, that's why I was contemplating not highlighting the actual number and instead showing a range, e.g. 80-100 is Good, 50-70 is Moderate, and so on.
Be careful with this. Each project has different practices which could lead to false positives and false negatives. You may also create the wrong incentives, depending on how you measure and report things.
The signal here is how many unpatched vulnerabilities there are, maybe multiplied by how long they've been out there. Purely statistical. And an actual signal.
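As a toy version of that formula, exposure could simply be the sum of days each unpatched advisory has been open; the field names here are hypothetical:

```python
# Toy "exposure" score: unpatched advisories weighted by how long
# they've been open. The "published" field name is hypothetical.
from datetime import date

def exposure_score(open_advisories: list[dict]) -> int:
    return sum(
        (date.today() - adv["published"]).days for adv in open_advisories
    )

# e.g. two CVEs open for 400 and 30 days -> score of 430
```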
The problem is your audience is:
> CTOs, security teams, and VCs automate open-source due diligence in seconds.
The people that probably have fewer brain cells than the average programmer to understand the nuance in the flagging.
Lol, yeah, tbh I just made it without really thinking of an audience - I was just looking for a project to work on until I saw the paper and figured it would be cool to check it out on some repositories out there. That part is just me asking GPT to make the README better.
It's gonna flag most of the Clojure ecosystem.
Yep, and it's not just Clojure. This will end up flagging projects across all non-mainstream ecosystems. Whether it's Vim plugins, niche command-line tools, academic research code, or hobbyist libraries for things like game development or creative coding, they'll likely get flagged simply because they're often maintained by individual developers. These devs build the projects, iterate quickly in the early stages, and eventually reach a point where the code is stable and no longer needs frequent updates.
It's a shame that this tool penalizes such projects, which I think are vital to a healthy open source ecosystem.
It's a nice project otherwise. But flagging stable projects from solo developers really sticks out like a sore thumb. :(
It would still count as "trustworthy", it just wouldn't come out to 100/100 :(.
Ironically your chance of getting a PR through is about 10x higher on smaller one-man-show repos than more heavily trafficked corporate repos that require all manner of hoops to be jumped through for a PR.
Not sure if this is a red flag.
Very nice! I'm personally looking into bot account detection for my own service and have come up with very similar heuristics (albeit simpler ones since I'm doing this at scale) so I will provide some additional ones that I have discovered:
1. Fork-to-stars ratio. I've noticed that several of the "bot" repos have the same number of forks as stars (or rather, most ratios are above 0.5). Typically a project doesn't have nearly as many forks as stars. (A rough sketch of this check follows the example below.)
2. Fake repo owners clone real projects and push them directly to their account (not fork) and impersonate the real project to try and make their account look real.
Example bot account with both strategies employed: https://github.com/algariis
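Heuristic 1 as code, roughly; the 0.5 cutoff is just the ratio from the observation above:

```python
# A fork count suspiciously close to the star count is a bot-farm tell.
# The 0.5 cutoff comes from the observation above, not from any standard.
import requests

def suspicious_fork_ratio(owner: str, repo: str) -> bool:
    data = requests.get(f"https://api.github.com/repos/{owner}/{repo}").json()
    stars, forks = data["stargazers_count"], data["forks_count"]
    return stars > 0 and forks / stars > 0.5
```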
Crazy how far people go for these things tbh.
Why care about stars in the first place? GitHub is a repo of source repos; using it like social media is pretty silly. If I like a project, it goes into a folder in my bookmarks - that's the 'star' everyone should use. For VCs? What, are you looking to make an open source todo app into a multi-million-dollar B2B SaaS? VCs are the almighty gods of the world, and we humble peons do not need to lend our assistance to helping them lose money :-)
Outside of that, neat project.
Dependencies: PyPI, Maven, Go, Ruby
This looks like a cool project, but why on earth would it need Python, Java, Go, AND Ruby?
It doesn't need them, it parses SBOMs and manifests from their ecosystems. I think you misunderstood this section of the README.
> Dependencies | SBOM / manifest parsing across npm, PyPI, Maven, Go, Ruby; flags unpinned, shadow, or non-registry deps.
The project seems like it only requires Python >= 3.9!
I think these are just the package managers that it supports parsing dependencies for. The actual script seems to just be a single python file.
It does seem like the repo is missing some files though; make is mentioned in the README, but there's no Makefile and no list of Python dependencies for the script that I can see.
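For what it's worth, the kind of manifest check that README line describes is simple to sketch. This just flags unpinned PyPI requirements and is not the project's actual code:

```python
# Sketch of one manifest check: flag requirements.txt entries that
# aren't pinned to an exact version. Not the project's actual code.
import re

PINNED = re.compile(r"^[A-Za-z0-9._-]+==[^=]+$")

def unpinned_requirements(requirements_txt: str) -> list[str]:
    flagged = []
    for line in requirements_txt.splitlines():
        line = line.split("#")[0].strip()  # drop comments and blanks
        if line and not PINNED.match(line):
            flagged.append(line)
    return flagged

print(unpinned_requirements("requests>=2.0\nflask==2.3.2\n# dev only\nnumpy"))
# -> ['requests>=2.0', 'numpy']
```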
Yeah, to be fair, I need to clean it up. I was stuck testing different strategies and making it work, and I just wanted to get feedback ASAP before moving on to the next step (I didn't want to spend too much time on something and find out I was badly wrong about it). Next step is to get it all cleaned up.
> they still shape hiring decisions, VC term sheets, and dependency choices
This is nuts to me. A star is a "like". It carries no signal of quality, and even as a popularity proxy it's quite weak. I can't remember the last time I looked at stars and considered them meaningful.
The difference means getting funded or not. People fake testimonials and put logos of large companies on their SaaS too.
Some people even buy residential proxies and create accounts on communities that can steer people towards specific actions, like "hey, let's short squeeze this company, let me sell you my call option" etc.
There's no incentive to be honest. I know two founders: one cheated with fake accounts and GitHub stars, and exited. The other ultimately gave up and went to work in another field.
The old saying "if you lie to those who want to be lied to, you will become wealthy" rings true.
However, at the end of the day it is dishonest, and money earned through deception is bad.
> the difference means getting funded or not.
This speaks more to the incompetence of VC than anything. How can you justify deploying hundreds of thousands or millions of dollars on the basis of "stars"?
I have a very hard time blaming the people who pull off this scam. Money is money and taking from VCs is morally (nearly) free.
It would be interesting if there were an AI tool to analyze the growth pattern of an OSS project. The tool would work from star info in the GitHub API and perform some web searches based on that info.
For example: the project gets 1,000 stars on 2024-07-23 because it was posted on Hacker News and received 100 comments (<link>). Below are the stats of stargazers during this period: ...
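A sketch of the data-gathering half of that idea, at least; the `starred_at` timestamps require the `application/vnd.github.star+json` media type, and correlating spikes with HN posts would sit on top of this:

```python
# Sketch: pull starred_at timestamps and bucket by day. A day whose count
# dwarfs the median is a "spike" worth explaining (HN post? bot burst?).
from collections import Counter
from statistics import median

import requests

def daily_star_counts(owner: str, repo: str, token: str) -> Counter:
    headers = {
        "Authorization": f"token {token}",
        "Accept": "application/vnd.github.star+json",  # exposes starred_at
    }
    counts: Counter = Counter()
    page = 1
    while True:
        batch = requests.get(
            f"https://api.github.com/repos/{owner}/{repo}/stargazers",
            headers=headers, params={"per_page": 100, "page": page},
        ).json()
        if not batch:
            break
        counts.update(s["starred_at"][:10] for s in batch)  # YYYY-MM-DD
        page += 1
    return counts

def spike_days(counts: Counter, factor: int = 10) -> list[str]:
    base = median(counts.values()) or 1
    return [day for day, n in counts.items() if n > factor * base]
```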
Yeah I thought about this and maybe down the line, but wanted to start with the pure statistics part as the base so it's as little of a black box as possible.
Great idea. This should be done by Github though. I'm surprised Github hasn't been sued for serving malware.
> I'm surprised Github hasn't been sued for serving malware.
do you want a world where people can randomly sue you for any random damages they suffer or do you want nice things like free code hosting?
In the US people can already randomly sue you for any random damages. I could sue github right now even if I'd never previously heard of or interacted with the site.
> do you want a world where people can randomly sue you for any random damages they suffer
Isn't that already a thing? In the US, at least, if not the entire world.
I’m not sure if you’re being sarcastic but if the claim of damages is likely to win then I’d like someone to hear it.
Yeah to be fair would be great, sometimes just giving a nudge and showing people want these features is the first step to getting an official integration.
I approve! It would be cool to have customizable and transparent heuristics. That way if you know for example that a burst of stars was organic, or you don’t care and want to look at other metrics, you can, or you can at least see a report that explains the reasoning.
CTOs don't care about github stars. They are behind tons of screening processes.
Believe me, CTOs of startups do.
I believe you, throwaway314155!!
Frankly, I think this program is AI generated.
1. There are hallucinatory descriptions in the README (make test), and also in the code, such as the rate limit set at line 158, which is the wrong number.
2. All commits are done in the GitHub web UI; checking the signatures confirms this. (A quick way to verify this is sketched after this list.)
3. Overly verbose function names, and a 2,000-line Python file.
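Point 2 is easy to check yourself: commits made in the web UI are recorded with GitHub's own "web-flow" committer, which the commits API surfaces. A rough version, assuming unauthenticated access to a public repo:

```python
# Rough check for point 2: web UI commits have GitHub's "web-flow"
# account as committer, visible via the commits API.
import requests

def webui_commit_fraction(owner: str, repo: str) -> float:
    commits = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/commits",
        params={"per_page": 100},
    ).json()
    via_web = sum(
        1 for c in commits
        if (c.get("committer") or {}).get("login") == "web-flow"
    )
    return via_web / max(1, len(commits))  # 1.0 -> everything via web UI
```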
I don't have a complaint about AI as such, but the code quality clearly needs improvement: the license detection only lists a few common examples, the thresholds for detection seem to be set randomly, and the entire _get_stargazers_graphql function is commented out and performs no action - it says "Currently bypassed by get_stargazers". Did you generate the code without even reading through it?
Bad code like this gets over 100 stars; it seems like you're doing satirical fake-star performance art.
I checked your past submissions, and yes, they are also AI generated.
I know it's the age of AI, but one should do a little checking oneself before posting AI generated content, right? Or at least know how to use git and write meaningful commit messages?
It's a project I'm making purely for myself, and I like to share what I make. Sorry I didn't put much effort into the commit messages; I won't do that again.
Don’t apologize. You didn’t do anything wrong. It’s your repo, use it how you wish. You don’t owe that guy anything.
Well, I initially planned to use GraphQL and started to implement it, but switched to REST for now since the GraphQL path still isn't fully complete - just to keep things simpler while I iterate, and because it isn't required currently. I'll bring GraphQL back once I've got key cycling in place and things are more stable. As for the rate limit, I've been tweaking things manually to avoid hitting it constantly, which I did to an extent; that's actually why I want to add key rotation... And I am allowed to leave comments for myself in a work in progress, no? Or does everything have to be perfect from day one?
You would assume that if it were purely AI generated it would have the correct rate limit in the comments and the code... but honestly I don't care, and yeah, I ran the README through GPT to 'prettify' it. Arrest me.
You probably should have put "v0.1" or "alpha/beta" in the post title or description - currently it reads like it's already been polished up IMO.
How does it differentiate between organic (like project posted on HN) and inorganic star spikes?
Just spitballing, but assuming the fake stars are added in a "naive" manner (i.e. as fast as possible, no breaks) you could distinguish the two by looking for the long tail usually associated with organic traffic spikes.
Of course, the problem with that is the adversary could easily simulate the same effect by mixing together some fall-off functions and a bit of randomness.
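Given daily counts like the ones sketched further up, the tail test could be as simple as: did the days after the spike still carry elevated traffic, or did it fall off a cliff? Thresholds here are invented:

```python
# Naive tail test: organic spikes (HN front page etc.) decay over days;
# naive bot bursts drop straight back to baseline. Thresholds invented.
def looks_organic(daily: list[int], spike_idx: int) -> bool:
    baseline = sum(daily[:spike_idx]) / max(1, spike_idx)
    tail = daily[spike_idx + 1 : spike_idx + 4]  # the 3 days after
    elevated = [d for d in tail if d > 2 * baseline]
    return len(elevated) >= 2  # lingering interest -> likely organic

print(looks_organic([3, 2, 4, 900, 160, 45, 12], 3))  # True: long tail
print(looks_organic([3, 2, 4, 900, 3, 2, 2], 3))      # False: cliff
```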
For each spike it samples the users from that spike (I set the sample size high enough that it essentially gets all of them for 99.99% of repos - that should be optimised so it's faster, but I figured I'd just grab every single one for now whilst building it). It then checks the users behind the spike for signs of being "fake accounts".
I love the idea! How feasible would it be to turn it into a browser extension?
Could you add support for PHP via composer.json? Accept a patch?
I haven't done that before so it would be a small learning curve for me to figure that out. Feel free to make a pull request.
What is a license trap? This "AGPL sneaking into a repo claiming MIT"? Isn't that just a plain old license violation?
Basically what I mean by it is, for example, a repository that appears to be under a permissive license like MIT, Apache, or BSD, but actually includes code governed by a much stricter or viral license, like GPL or AGPL, often buried in a subdirectory, dependency, or embedded snippet. The problem is, if you reuse or build on that code assuming it's fully permissive, you could end up violating the terms of the stricter license without realising it. It's a trap because the original authors might have mixed incompatible licenses, knowingly or not, and the legal risk then falls on downstream users. So yeah, essentially a plain old license violation, which is relatively easy to miss or not think about.
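A crude scan for that situation might look like this; it assumes nothing about the tool's real implementation, just walks the tree looking for copyleft markers in a repo whose top-level LICENSE claims to be permissive:

```python
# Crude "license trap" scan: top-level LICENSE says MIT/BSD/Apache,
# but copyleft headers lurk in subdirectories or vendored code.
from pathlib import Path

COPYLEFT = ("GNU GENERAL PUBLIC LICENSE", "GNU AFFERO", "GPL-3.0", "AGPL-3.0")

def license_traps(repo_root: str) -> list[Path]:
    hits = []
    for path in Path(repo_root).rglob("*"):
        # Check source, text, and extensionless files (LICENSE, COPYING).
        if path.is_file() and path.suffix in {".py", ".txt", ".md", ""}:
            try:
                head = path.read_text(errors="ignore")[:2000]
            except OSError:
                continue
            if any(marker in head for marker in COPYLEFT):
                hits.append(path)
    return hits
```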
Oh interesting, you've put a word on it. Most of the VC-funded FOSS "open core" apps/SaaS that have popped up in the past few years are like this.
The /ee folders are a disgrace.
They get around it by licensing only specific packages/parts of the codebase differently.
Of course, GitHub could just drop the stars, but everything has to enshittify towards "engagement" and add social network features.
Or users could ignore the stars and go old school and you know, research their dependencies before they rely on them.
Stars are just a signal. When I'm looking at multiple libraries that do the same thing, I'm going to trust a repo with 200 stars more than one with 0. It's not perfect, but I don't have the time to go through the entire codebase and try it out. If the repo works for me, I'll star it to contribute to the signal.
I use stars for bookmarking purposes. I wouldn't care if they went private, but I would miss the feature.
Same along with lists. I've got more than a thousand starred repos by now.
Sadly, lists have a hard cap at 32 or 36 or something like that... I was too eager early on with my specificity (I have lists with 1 repo), and now I can't make new ones (I'd need to delete others).
lol
Found a couple of non-maintained projects for managing them:
https://github.com/astralapp/astral https://github.com/gkze/gh-stars
If that works for you, great. I don't do that. I don't even check how many stars it has.
I check the docs, features, and sometimes the code quality. Sometimes I check the date of the last commit.
I tend to pay more attention to repos with 15-75 (ish) stars. Fewer is something obscure or unproven, maybe; above ~500 is much more likely to be BS/hype.
Github was a "social network" from its very beginning. The whole premise was geared around git hosting and "social coding". I don't think it became enshittified later since that was the entire value proposition from day 1.
Funny, I'm pretty sure I paid them just so I don't have to maintain my own git hosting.
I never even noticed the stupid stars until they started being mentioned on HN.
There are tons of places you can use for simple git hosting. The only reason to use github over the others is due to the social factors. Because everyone already has an account on it so they can easily file issues, PRs, etc. For simple git hosting, github leaves a lot to be desired.
See the tagline under the logo, May 14, 2008: https://web.archive.org/web/20080514210148/http://github.com...
I'm sorry, but I never read the main github site. I only spend a few seconds on it when my login expires and I need my repository list.
99.99999% of my interaction is via git pull and push :)
You may like Drew Devault’s https://sr.ht more