Extract structured data from PDF invoices

Most invoices exist in electronic format. They are generated from structured data and need to be entered as structured data. It’s a shame that we still need humans to manually extract data points, like amount, date or issuer, from them.

Over the last few days, I tried a few online invoicing solutions, like Shoeboxed, but none of them does a good job of automatically recognizing new invoices. Some do it manually and charge accordingly.

Currently I don’t see a way to get the data fully automatically. PDFs are simply not made for this. The best we can do is to add a template for each specific invoice format and use it to extract the data. I have created a proof-of-concept library, which is open source on GitHub.
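To make the template idea concrete, here is a minimal sketch in Python. It is not the code of the library mentioned above. It assumes the PDF has already been converted to plain text (for example with a tool like pdftotext), and the template layout, the issuer, the field names and the `extract_invoice` helper are all made up for illustration.

```python
import re
from datetime import datetime
from decimal import Decimal

# Hypothetical template for one invoice layout. The issuer, keywords and
# patterns are made up for illustration.
TEMPLATE = {
    "issuer": "Example Hosting Ltd",
    "keywords": ["Example Hosting Ltd", "Invoice No"],
    "fields": {
        "invoice_number": r"Invoice No\.?\s*:?\s*(\w+)",
        "date": r"Date\s*:?\s*(\d{4}-\d{2}-\d{2})",
        "amount": r"Total\s*:?\s*(\d+\.\d{2})",
    },
}

# Made-up text, standing in for the output of a PDF-to-text step.
SAMPLE_TEXT = """Example Hosting Ltd
Invoice No: A1024
Date: 2013-01-15
Total: 42.50 EUR
"""


def extract_invoice(text, templates):
    """Return the fields of the first template that fully matches, else None.

    Assumes every template defines at least a "date" and an "amount" field.
    """
    for template in templates:
        # A template only applies if all of its keywords occur in the text.
        if not all(keyword in text for keyword in template["keywords"]):
            continue
        fields = {"issuer": template["issuer"]}
        for name, pattern in template["fields"].items():
            match = re.search(pattern, text)
            if match is None:
                break  # layout differs, try the next template
            fields[name] = match.group(1)
        else:
            # All patterns matched: convert the raw strings into typed values.
            fields["amount"] = Decimal(fields["amount"])
            fields["date"] = datetime.strptime(fields["date"], "%Y-%m-%d").date()
            return fields
    return None


if __name__ == "__main__":
    print(extract_invoice(SAMPLE_TEXT, [TEMPLATE]))
```

The keywords decide which template applies, and the regular expressions pull out the individual fields. A new invoice format then only needs a new template entry, not new code.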

If you have any thoughts on what to improve, or would like to extend this for use in a production accounting system, let me know.

Shanghai PM2.5 update

I’m back in Shanghai and faced with the pollution problem once again. Here is a quick update on the last few months. You can clearly see a reduction in pollution around Chinese New Year, when factories shut down. With warmer weather the readings also seem to be lower. I don’t know the reason, but anecdotal evidence suggests the same effect for Beijing. An expert from Vienna University of Economics is currently analyzing the data and correlating it with weather observations. I’ll give an update when some results emerge.

Attack from below – Oracle vs the rest

Threats rarely come from above; most of the time they come from below. Small and flexible companies start in niches and keep improving their performance until they become a threat to the established player. In this case we see the S-curve model playing out against Oracle.

These migrations indicate that after years of development, many open source or low-cost databases have now attained performance that is either roughly equivalent to parts of Oracle, such as Postgres, or have developed capabilities that while irrelevant to much of the database market, are far in advance of any technology operated by the Red Borg, such as MongoDB or Riak or Cassandra.

via Hungry termites nibbling at Oracle’s foundation

Personally I’m a fan of MongoDB and Redis. I also tried CouchDB, but didn’t find the project very active.

M/Monit preparing new monitoring tool

Since my webserver broke down while I was stuck on a ship to Japan, I have relied on the excellent monit to keep an eye on all my important services.

Currently its developers, who give the client version away for free, are working on a remarkable evolution of their M/Monit tool, a solution for keeping track of multiple monit instances. It used to only raise alarms and show events. Now it will also record your system load and memory usage.

If you already have monit installed, this is a great complement. Find out about the beta-version here.

Happy New Year 2013!

Good luck in 2013 to everyone. I hope that it will be quieter than 2012 and we get some time to consolidate some of the big trends that started in 2012. My favourite ones are the Raspberry Pi and ownCloud.

I also applied the latest 3.5 update to WordPress. This brought some changes in media management. So if some pictures are missing, just check for the correct paths in your posts. Generally most things should still work.

OwnCloud 4.5 released

This morning version 4.5 of the excellent ownCloud package was released. This is welcome competition for all those Apples and Googles that fight to take control of your personal data. Finally there is a decent solution to host all of this yourself.

The calendar, including calendar sharing, works flawlessly. For address books, there seems to be a flaw in Mac OS X that prevents the Address Book app from seeing more than one of them. It just picks the first one it sees. If someone shares an address book with you, it can even cover up your own contacts. This is quite annoying and I recommend not using this feature for now. Hopefully a workaround will be found soon.

The rest should work fine. You can install all settings via profiles as usual. Old installations will continue to work as well. In case you experience any problems, just reinstall your configuration profile and let it sync again.

Data Retention Coming to Austria

From Sunday on, all connection data for telephone and internet connections will be retained for 6 months. This might sound harmless and people will say that they don’t mind, because they have “nothing to hide”. No matter what, once this infrastructure is in place, it can be used for all kinds of things and should therefore be opposed from the start. Consider this: before contacting someone, you need to think about whether you want this person to be associated with you or not (because this information will be saved). If your friend is a drug dealer or pimp (and you don’t know about it), you might end up under surveillance as well.

For doctors and lawyers this new measure brings another problem. They can’t guarantee the confidentiality of their client correspondence any more and will have to resort to sending letters again. A detailed explanation can be found here.

If you worry about government surveillance or want to protect sensitive data, contact us for consulting and secure offshore hosting services.

Recent Updates

Webmail was updated to version 0.7 with a new skin. I also removed the last remaining MySQL dependencies from the email system. This makes everything simpler to administer and more stable.

Apple iOS 5

Apple’s latest operating system for mobile devices has been out for a few weeks now. The upgrade was mostly an evolutionary one and didn’t add too many new features. One thing Apple has done, though, was to tighten its grip on devices after they have been sold, by integrating them more tightly into its iCloud service. If customers wish, they can now upload their pictures, calendar, address book, bookmarks, notes, documents or location to Apple’s servers. Since the firm’s own data center in North Carolina isn’t finished yet, extra capacity was rented from Microsoft and Amazon. This is problematic, because now we don’t even know which company is handling our data.

This is one reason why I want to remind people that almost all of iCloud’s functionality can be realized by using a simple Unix-server as well. This includes email and notes by simply using IMAP. Contacts, calendars and reminders are based on CalDAV and CardDAV. For bookmarks, documents and photos one could use WebDAV.
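Since CalDAV and CardDAV both build on WebDAV, listing what a self-hosted server offers comes down to a single PROPFIND request. Here is a minimal Python sketch that only builds such a request without sending it. The server URL and the `build_propfind` helper are hypothetical, and authentication is left out.

```python
import urllib.request

# Minimal PROPFIND body asking for the name and type of each resource.
PROPFIND_BODY = b"""<?xml version="1.0" encoding="utf-8"?>
<d:propfind xmlns:d="DAV:">
  <d:prop>
    <d:displayname/>
    <d:resourcetype/>
  </d:prop>
</d:propfind>"""


def build_propfind(url, depth=1):
    """Build (but do not send) a WebDAV PROPFIND request for a collection."""
    return urllib.request.Request(
        url,
        data=PROPFIND_BODY,
        method="PROPFIND",
        headers={
            # Depth 1 means: the collection itself plus its direct children.
            "Depth": str(depth),
            "Content-Type": "application/xml; charset=utf-8",
        },
    )


# Hypothetical collection URL; a real server would also need credentials.
request = build_propfind("https://dav.example.org/calendars/alice/")
```

Passing the request to `urllib.request.urlopen` would send it. A WebDAV server normally answers with a 207 Multi-Status XML document describing the calendars, address books or files in that collection.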

When using the open version of iCloud, you not only keep your data under control, but can also use it from non-Apple systems such as Android, Windows and Linux.

Contacts and Calendar

A hosted calendar and contacts service is now available for all email users. It’s based on CardDAV/CalDAV and should work out of the box with all newer Apple devices, as well as most open source clients. For Android there are apps available.

Leaving Lighty for Apache

In the early days of my server career, I had to use lighttpd for RAM reasons. Over time the limitations of lighttpd mounted, e.g. no .htaccess files, no authentication with certificates, bad DAViCal integration and inconsistent LDAP filters. Moreover, lighttpd uses up more and more memory the longer it runs.

So for now I’m just using both of them to compare performance and overall ease of use.


It is happening. Conventional IPs will run out in less than 2 months from now. After that we might get by for another year by using the IPs we have more efficiently. Still, the only long-term solution is to switch to the new IP protocol, version 6.

I’m proud to announce that as of today the snapdragon servers are fully IPv6-ready. That covers the webserver, mailserver and IMAP. You shouldn’t notice any change, since IPv4 is still on as well.

If you would like to see whether you’re viewing a v4 or v6 address, you could install this Firefox plugin, start Chrome with the option --prefer-family=IPv6, or simply browse to the IPv6 version of Google.
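The same check can be scripted. Here is a minimal Python sketch using only the standard library. The helper names are made up for illustration, the addresses in the usage note come from the ranges reserved for documentation, and `resolved_versions` needs working DNS to return anything useful.

```python
import ipaddress
import socket


def ip_version(address):
    """Return 4 or 6 for a textual IP address."""
    return ipaddress.ip_address(address).version


def resolved_versions(host, port=80):
    """Return the set of IP versions (4 and/or 6) a host name resolves to."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    # Each entry starts with the address family of one resolved address.
    return {6 if info[0] == socket.AF_INET6 else 4 for info in infos}
```

For example, `ip_version("192.0.2.1")` gives 4 and `ip_version("2001:db8::1")` gives 6. If `resolved_versions` on a host name includes 6, that host publishes an IPv6 address, although whether your own connection actually uses it still depends on your network.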