We've been using Aspose.PDF for the last 10 years or so in our C# platform, and paying for the license. It's expensive and buggy and has shite support, so a year or so back I decided to see if there was some other library or combination of libraries that could meet our needs. Basically, we needed:
* HTML to PDF
* Compress PDF
* Manual PDF generation
* Text extraction
* No browser engine or other weird dependencies
I researched every library I could find, and downloaded, integrated and tested anything that looked remotely promising.
At the end of all that, I reluctantly handed my company credit card back to Aspose. There simply wasn't any open-source or even just cheaper PDF library that I could actually make work, and all the other paid ones that did work were even more expensive.
If you are looking for a solution to generate PDF reports, I highly recommend using Typst.
> obviously needs to be a PDF
I've been making my reports in self-contained HTML files[0] and it works out so much better than PDF. It is not constrained by paper sizes, and it lets me add some nifty features. For example, I recently added support for hiding columns in a table using exclusively CSS. The only downside is browsers can render things slightly differently, but for my use cases I don't need pixel-perfect identical rendering.
[0] Images are inlined base64-encoded, CSS/JS embedded with style and script tags. No external assets / no http requests.
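For anyone curious what that footnote looks like in practice, here is a minimal Python sketch of the idea, not the commenter's actual code. The function names `data_uri` and `build_report` and the fixed page layout are made up for illustration; it only uses the standard library.

```python
import base64
import html


def data_uri(payload: bytes, mime_type: str) -> str:
    """Encode raw bytes as a base64 data: URI."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_report(title: str, css: str, image_bytes: bytes,
                 image_mime: str = "image/png") -> str:
    """Return a single HTML document with the CSS and one image embedded.

    Nothing in the result points at an external asset, so the file can be
    mailed around or opened from disk without any HTTP requests.
    """
    return (
        "<!DOCTYPE html>\n"
        '<html lang="en">\n<head>\n<meta charset="utf-8">\n'
        f"<title>{html.escape(title)}</title>\n"
        f"<style>\n{css}\n</style>\n"
        "</head>\n<body>\n"
        f"<h1>{html.escape(title)}</h1>\n"
        f'<img alt="chart" src="{data_uri(image_bytes, image_mime)}">\n'
        "</body>\n</html>\n"
    )
```

Writing the returned string to a `.html` file gives you the whole report in one file. Scripts would go into a `script` element the same way the CSS goes into `style`.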
You can also use media queries for print-specific styling, so you can hide things a user doesn't need to print out:
https://developer.mozilla.org/en-US/docs/Web/CSS/Guides/Medi...
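A rough sketch of that, assuming the report is assembled as a string in Python. The `no-print` class name and the `with_print_styles` helper are hypothetical; the only real piece is the `@media print` rule itself.

```python
# Rules inside this block only apply when the page is printed
# (or saved as PDF through the browser's print dialog).
PRINT_CSS = """
@media print {
  /* Hide anything the author marked as screen-only. */
  .no-print { display: none; }
}
"""


def with_print_styles(html_doc: str, css: str = PRINT_CSS) -> str:
    """Insert a <style> element holding `css` just before </head>."""
    head_end = html_doc.find("</head>")
    if head_end == -1:
        raise ValueError("document has no </head> to anchor the style block")
    return html_doc[:head_end] + f"<style>{css}</style>" + html_doc[head_end:]
```

Buttons, filters and other interactive bits then get `class="no-print"` and simply disappear on paper.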
Being constrained by page sizes is “a feature, not a bug” in most contexts. If I’m calling out numbers on the 3rd line of page 38 of a report, it helps if that’s consistent.
Unless you can embed fonts [into the page itself] you aren’t beating PDF
Not only can you embed the fonts, but you can make it interactive and output a PDF if you really wanted to. The HTML might grow if you embed enough JS, but on the other hand... some PDFs are insanely large.
Not a problem with data: URIs. But then, a report may not need fancy fonts if HTML is acceptable.
You can embed fonts into an HTML page. For example, place an @font-face with the src:url being a base64-encoded blob, in a style element.
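Roughly like this, as a small Python sketch that builds the rule. The helper name `font_face_rule` is made up, and WOFF2 is just an assumed default; the resulting text goes inside a `style` element.

```python
import base64


def font_face_rule(family: str, font_bytes: bytes,
                   font_format: str = "woff2") -> str:
    """Build an @font-face rule whose src is a base64 data: URI."""
    encoded = base64.b64encode(font_bytes).decode("ascii")
    return (
        "@font-face {\n"
        f'  font-family: "{family}";\n'
        f'  src: url("data:font/{font_format};base64,{encoded}") '
        f'format("{font_format}");\n'
        "}\n"
    )
```

Pass it the bytes of the font file and a family name, then refer to that family in the rest of the CSS as usual. Keep in mind that base64 adds about a third to the font's size.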
The only reason PDFs still have a job is: pixel perfect consistency; the built-in validity stuff (ensuring the document wasn't altered, etc.); or the customer doesn't need the other things, but isn't open to alternatives. Otherwise, PDF is just a major headache.
Also page-level consistency, and laying things out in a printable format in general.
Even with the same Word document opened only in various MS Word versions (web, desktop, etc.) you won't get consistent page numbers. And HTML tables work great on screen but don't print very well if they span more than what fits on a single sheet of paper.
The wider .NET ecosystem is lacking when trying to step outside the mainline. I don't bother hunting for unused, partially implemented .NET libraries anymore and just call out to a process or an API when I need to get something done.
It's not ideal, but when a good option isn't available in .NET it's usually available in Python/npm. Typically I'll use background jobs when calling out of process for added resiliency/replayability and observability.
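To make the "call out to a process" part concrete, here is a minimal sketch of the kind of wrapper a background job might run. It is written in Python rather than C# purely for illustration, and `run_tool`, the retry count and the backoff are assumptions, not the commenter's setup.

```python
import subprocess
import sys
import time


def run_tool(cmd: list[str], attempts: int = 3, timeout_s: float = 60.0,
             backoff_s: float = 1.0) -> str:
    """Run an external command and return its stdout.

    A non-zero exit code, a timeout, or a missing executable counts as a
    failed attempt; each failure is logged to stderr and retried.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            result = subprocess.run(cmd, capture_output=True, text=True,
                                    timeout=timeout_s, check=True)
            return result.stdout
        except (subprocess.CalledProcessError,
                subprocess.TimeoutExpired, OSError) as exc:
            last_error = exc
            print(f"attempt {attempt}/{attempts} failed: {exc}", file=sys.stderr)
            if attempt < attempts:
                time.sleep(backoff_s * attempt)  # linear backoff between tries
    raise RuntimeError(f"{cmd[0]} failed after {attempts} attempts") from last_error
```

A real job queue would usually own the retries and the logging itself; the point is only that the timeout, the retry and the log line live around the process boundary.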
Not sure I agree. It also depends on the domain. The Python ecosystem is of course a lot richer for anything AI. But try to open, manipulate and export spreadsheets. In Python you pretty much need a different library for every Excel file format (xls, xlsx, etc.), and usually the more file formats a library can handle, the less capable it is (e.g. pandas). In .NET you have libraries like SpreadsheetGear that are super powerful, including their own Excel calculation engine. I see nothing remotely close in Python.
There is hardly anything that isn't available in .NET; the main problem is being willing to pay for tooling.
This looks like ChatGPT. There are PLENTY of alternatives in the post.
Python and other ecosystems have similar issues and limitations of their own.
It wouldn't be a quest if there were lots of good options; a few good options are better than lots of unused/unmaintained ones.
When I used PdfSharp about 9 years ago, it wasn't really designed to import arbitrary PDFs; it crashed or hung on many less common constructs or invalid PDF files. It was really only designed to either create PDFs or edit PDFs created by itself (or MigraDoc, which used it); that it could also import some other PDFs was considered a bonus by its maintainers. I submitted some patches back then to fix the most egregious problems. Hopefully it has improved.
We needed a library to read arbitrary PDF files (although I forgot what exactly we needed to read from them; it wasn't for full rendering) and ended up using PdfSharp, because iText did not respond to our pricing request.
My favorite approach for PDF rasterization was to interop with a simple, custom Java console application that leveraged Apache PDFBox.
This lasted until the Log4j exploit, at which point we had to abandon it altogether because our customers (banks) had a complete meltdown over it at the time.
It's probably still a really good option. I would definitely go back to it in a different context.
I create PDF files from C# using LaTeX as an intermediate format. This works very reliably but sometimes takes a bit of tinkering until everything fits.
People here on HN recently recommended Typst as a replacement for LaTeX, but I haven't tried it myself yet.
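The LaTeX-as-intermediate-format approach boils down to two steps: generate a `.tex` file with user data safely escaped, then shell out to a TeX engine. Here is a minimal sketch in Python (the commenter does this from C#). The names `latex_escape`, `render_tex` and `compile_pdf` are made up, and it assumes `pdflatex` is installed and on the PATH.

```python
import shutil
import subprocess
from pathlib import Path

# Characters that have special meaning in LaTeX and their safe replacements.
_LATEX_ESCAPES = {
    "\\": r"\textbackslash{}",
    "&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#", "_": r"\_",
    "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}


def latex_escape(text: str) -> str:
    """Escape user-supplied text one character at a time."""
    return "".join(_LATEX_ESCAPES.get(ch, ch) for ch in text)


def render_tex(title: str, body: str) -> str:
    """Build a tiny LaTeX document around escaped title and body text."""
    return "\n".join([
        r"\documentclass{article}",
        r"\begin{document}",
        r"\section*{" + latex_escape(title) + "}",
        latex_escape(body),
        r"\end{document}",
        "",
    ])


def compile_pdf(tex_source: str, workdir: Path) -> Path:
    """Write report.tex into workdir, run pdflatex, return the PDF path."""
    if shutil.which("pdflatex") is None:
        raise FileNotFoundError("pdflatex not found on PATH")
    workdir.mkdir(parents=True, exist_ok=True)
    (workdir / "report.tex").write_text(tex_source, encoding="utf-8")
    subprocess.run(
        ["pdflatex", "-interaction=nonstopmode", "-halt-on-error", "report.tex"],
        cwd=workdir, check=True, capture_output=True, timeout=120,
    )
    return workdir / "report.pdf"
```

Escaping character by character avoids double-escaping the backslashes that the replacements themselves introduce. The "tinkering until everything fits" part is in the template, which this sketch keeps deliberately trivial.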
>Naturally, I first started looking for permissively licensed libraries, which could be used free of charge and without additional license requirements.
There is a lot of work in a good PDF library; expecting to get it for free feels unreasonable to me.
In my eyes, PdfSharpCore¹ is now the "canonical" version of pdfcore.
IMHO the list is incomplete without it.
1: https://github.com/ststeiger/PdfSharpCore
It seems the PDFSharp rabbit hole goes even deeper than I've realized!
The latest MigraDoc & PDFSharp seem to have been updated and ported to .NET 6 after a lot of the forks happened, so it was unclear to me whether there's merit in looking at other, mostly abandoned forks.
I might add PdfSharpCore, though the use of SixLabors.ImageSharp and SixLabors.Fonts leads to a disqualification from the "quest", given their custom split license [1].
Edit: Actually, the license seems to turn into an Apache 2.0 license when used with an open-source-licensed project, and also when used as a transitive dependency. Certainly a confusing license.
[1] https://github.com/SixLabors/ImageSharp/blob/main/LICENSE
Edit: PSA - PdfSharpCore uses older releases of SixLabors.ImageSharp v1.0.4 and Fonts-1.0.0-beta17 which both were (and are still) distributed under plain Apache-2.0.
https://web.archive.org/web/20251104163604/https://codeload....
Good to know, thank you!
Though it makes me wonder how much "old code" this is then collecting...
I needed this post a year ago when I was looking for this exact thing. I did end up going with Puppeteer because I needed it for something else that I couldn't avoid. I use a large list of flags with it to launch the most minimal version of headless Chrome that I can.
I am going to look into switching to MigraDoc and see if I can drop Puppeteer.
Thanks for this great research!
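Oh yeah, PDF. In a past project I created a monster solution:
* Scriban to fill in templates (LaTeX)
* Custom Angular SSR to reuse frontend components (charts etc)
* Playwright to convert SSR output to PDF
* LuaLaTeX to convert LaTeX document + stuff to PDF
Super slow, but very high quality results. Do not try this at home!
Scriban is totally awesome though.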
Thanks for this post! I've wanted to create such a post for a long while but never got around to it. Yours is fantastic!
At work we were using I think it was GDPicture? Which is now called Nutrient. They started out with a flat fee, royalty free, then their pricing scheme became more hostile over time (per-developer, per-application licensing, and I don't recall if they wanted to know how many users - which is crazy unless it's a SaaS). I have friends (former coworkers) and family who ask me for advice on which software libraries to use for what, since they know I'm a hyper nerd for that sort of thing. Last time a former coworker asked what PDF library to use, I told them to avoid Nutrient like the plague. There's wanting to be sustainable and then there's greed.
So yeah, I too was looking for permissive licensing. The worst part is that now it's drastically harder for me to suggest any paid alternatives, because we don't know that the alternative won't hike up prices on us. It's a really awful spot to be in.
PDFlib - I've used it since 2001. Their pricing is stable, and they've been flexible over the years as computing models have shifted.
The wrappers around wkhtmltopdf look like the best candidates to me.
For which use cases is needing Qt WebKit an issue?
wkhtmltopdf is unmaintained and deprecated though.