Extract structured data from PDF invoices

· 151 words · 1 minute read

Most invoices exist in electronic format. They are generated from structured data and need to be entered as structured data. It’s a shame that we still need humans to manually extract data points, like amount, date or issuer from it.

In the last days, I tried a few online invoicing solutions, like shoeboxed, but none of them does a good job at automatically recognizing new invoices. Some do it manually and charge accordingly.

Currently I don’t see a way to automatically get the data. PDFs are simply not made for this. the best we can do is to add templates for a specific invoice format and use that to extract the data. I have created a proof of concept library, which is open source on github.

If you have any thoughts on what to improve or would like to extend this to use it in a production accounting, let me know.