New Release of invoice2data

Thanks to some awesome contributors, there is a new release for invoice2data. This Python package allows you to get structured data from PDF invoices. Major enhancements:

  • powerful YAML-based template format for new invoice issuers
  • improved date parsing thanks to dateparser
  • improved PDF conversion thanks to a new feature in xpdf
  • better testing and CI
  • option to add multiple keywords and regexes to each field
  • option to define currency and date format (day or month first?)

All details and the download are on GitHub.

Unit testing for Jupyter (iPython) notebooks

At Quantego, we do most high-level work that supports energy analysts in Jupyter Notebooks. This allows us to pull several Java and Python packages together for a highly productive work environment.

Sample notebooks are hosted on GitHub and distributed with our Docker images. Naturally, we want them to work when people run them. Because they operate at a very high level and exercise almost all available features, they also uncover potential problems.

If you have a similar setup – for example in a data analytics-driven environment – the following could work for you as well:

  1. Make sure your notebooks run correctly with "Run All". While developing you may try different things and run cells out of order; for automated testing, the cells must run in sequence from top to bottom.
  2. Test locally with
    jupyter nbconvert --to=html --ExecutePreprocessor.enabled=True my-notebook.ipynb

    This converts your notebook to HTML. We're not interested in the output, only in errors that occur during conversion. This only works with Jupyter and IPython >=4; earlier versions simply ignore errors.

  3. Next, run the same command in an isolated Docker container or in a CI step:
    docker run my_container /bin/sh -c \
      "/usr/local/bin/jupyter nbconvert \
          --to=html --ExecutePreprocessor.enabled=True \
          --ExecutePreprocessor.timeout=3600 \
          samples/my-sample.ipynb"

    A full working example for CircleCI can be found in our sample-repo.
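
The steps above can also be wrapped in a small script that checks a whole folder of notebooks. This is only a sketch: the function names are made up for illustration, and it assumes that nbconvert exits with a non-zero status when a cell raises an error.

```python
import subprocess
from pathlib import Path


def nbconvert_command(notebook, timeout=3600):
    """Build the nbconvert call from step 2 for a single notebook."""
    return [
        "jupyter", "nbconvert",
        "--to=html",
        "--ExecutePreprocessor.enabled=True",
        f"--ExecutePreprocessor.timeout={timeout}",
        str(notebook),
    ]


def failing_notebooks(folder, timeout=3600):
    """Execute every notebook in `folder` and return the names that failed."""
    failed = []
    for notebook in sorted(Path(folder).glob("*.ipynb")):
        # Assumption: a non-zero return code means a cell raised an error.
        result = subprocess.run(nbconvert_command(notebook, timeout))
        if result.returncode != 0:
            failed.append(notebook.name)
    return failed
```

In a CI step you could call failing_notebooks("samples") and fail the build if the returned list is not empty.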

Shell Function to Remove all Metadata from PDF

A handy function to remove all metadata from a PDF file. When done, it shows all the remaining metadata for inspection. Needs pdftk, exiftool, qpdf and pdfinfo installed.

Combines commands from here and here. Good job, guys.

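clean_pdf() {
 # Blank every InfoValue entry in the PDF's Info dictionary.
 pdftk "$1" dump_data | \
  sed -e 's/\(InfoValue:\)\s.*/\1\ /g' | \
  pdftk "$1" update_info - output "clean-$1"

 # Strip all remaining tags, then list what is left.
 exiftool -all:all= "clean-$1"
 exiftool -all:all "clean-$1"
 exiftool -extractEmbedded -all:all "clean-$1"
 # Rewrite the file so the tags deleted by exiftool cannot be recovered.
 qpdf --linearize "clean-$1" "clean2-$1"

 # Show the remaining metadata for inspection.
 pdftk "clean2-$1" dump_data
 exiftool "clean2-$1"
 pdfinfo -meta "clean2-$1"
}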

After adding this snippet to ~/.profile, or copying and pasting it into the shell, you can just run

clean_pdf my-unclean.pdf
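
To see what the sed step in the function does, here is a rough Python equivalent applied to a made-up excerpt of pdftk's dump_data output. The function name and the sample values are only for illustration.

```python
import re


def blank_info_values(dump_data):
    """Empty every InfoValue line and leave all other lines untouched."""
    return re.sub(r"^(InfoValue:)[ \t].*$", r"\1 ", dump_data, flags=re.MULTILINE)


# Made-up sample in the InfoBegin / InfoKey / InfoValue layout.
sample = "InfoBegin\nInfoKey: Author\nInfoValue: Jane Doe\nNumberOfPages: 3"
cleaned = blank_info_values(sample)
print(cleaned)
```

The keys stay in place with empty values, which is the form pdftk update_info then writes back into the file.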

Incremental FTP backups

If you happen to only have FTP access to a server or account (cPanel) you're looking after, LFTP is an efficient tool to keep incremental backups. The script below hard-links the previous backup and then updates it, so only changed files are copied and stored.

#!/usr/bin/env bash
username='xxx'
password='xxx'
host='ftp.host'
localBackupDir='/backups/host'
remoteDir='/public_html/'
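cd "$localBackupDir" || exit 1
# Rotate: drop the oldest snapshot and shift the others up by one.
rm -rf backup.3
mv backup.2 backup.3
mv backup.1 backup.2
mv backup.0 backup.1
# Hard-link the previous snapshot (-al); use -r for full copies instead.
cp -al backup.1 backup.0
lftp -e "set ssl:verify-certificate no; \
         mirror --only-newer --parallel=4 $remoteDir $localBackupDir/backup.0;\
         exit"\
     -u "$username,$password" "$host"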
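
The rotation part of the script (everything before the lftp call) can be sketched in Python as well. This is an illustration rather than a replacement: rotate_backups and its keep parameter are names invented here, and the demo runs in a temporary directory instead of a real backup folder.

```python
import os
import shutil
import tempfile
from pathlib import Path


def rotate_backups(base, keep=4):
    """Shift backup.N to backup.N+1 and hard-link backup.1 into a new backup.0."""
    base = Path(base)
    oldest = base / f"backup.{keep - 1}"
    if oldest.exists():
        shutil.rmtree(oldest)
    for i in range(keep - 2, -1, -1):
        snapshot = base / f"backup.{i}"
        if snapshot.exists():
            snapshot.rename(base / f"backup.{i + 1}")
    previous = base / "backup.1"
    if previous.exists():
        # Like `cp -al`: recreate the directories, hard-link the files.
        shutil.copytree(previous, base / "backup.0", copy_function=os.link)
    else:
        (base / "backup.0").mkdir(parents=True)


# Small demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "backup.0"
    first.mkdir()
    (first / "index.html").write_text("hello")
    rotate_backups(tmp)
    linked = os.path.samefile(Path(tmp) / "backup.0" / "index.html",
                              Path(tmp) / "backup.1" / "index.html")
    print(linked)
```

Because unchanged files are hard links to the same data, each extra snapshot only costs disk space for the files that actually changed.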

Extend Pandas DataFrame with custom functions and attributes

At Quantego.com we love working with Pandas DataFrames. We use them to store and analyze results from simulation runs. On top of our data matrix and a multi-level index we also need to accommodate custom plotting functions and attributes from the previous simulation run.

Subclassing pandas.DataFrame for this task was a no-brainer. The new version 0.16.1 (to be released in the next few days) includes some fixes that make working with subclasses of complex DataFrames (DF) easier. Here is an example of what can be done. First define two new classes, one for pandas.Series (a single-column DF) and one for pandas.DataFrame. You can define new functions or attributes as needed.

import pandas

class CustomSeries(pandas.Series):
    @property
    def _constructor(self):
        return CustomSeries

    def custom_series_function(self):
        return 'OK'

class CustomDataFrame(pandas.DataFrame):
    "My custom dataframe"
    def __init__(self, *args, **kw):
        super(CustomDataFrame, self).__init__(*args, **kw)

    @property
    def _constructor(self):
        return CustomDataFrame

    _constructor_sliced = CustomSeries

    def custom_frame_function(self):
        return 'OK'

Notice _constructor and _constructor_sliced. They make sure you get the correct class back when slicing the DF.

Via self you have convenient access to all Pandas functions and can even roll your own.
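
A quick way to check that slicing really returns the custom classes is a sketch like the following. It repeats shortened versions of the two classes so that it runs on its own; the column names and values are made up.

```python
import pandas


class CustomSeries(pandas.Series):
    @property
    def _constructor(self):
        return CustomSeries

    def custom_series_function(self):
        return 'OK'


class CustomDataFrame(pandas.DataFrame):
    @property
    def _constructor(self):
        return CustomDataFrame

    _constructor_sliced = CustomSeries

    def custom_frame_function(self):
        return 'OK'


df = CustomDataFrame({'price': [1.0, 2.0], 'volume': [10, 20]})

column = df['price']                 # one column, built via _constructor_sliced
subset = df[['price', 'volume']]     # a frame, built via _constructor

print(type(column).__name__, type(subset).__name__)
print(column.custom_series_function(), subset.custom_frame_function())
```

Without the two constructor hooks, both results would fall back to plain pandas objects and the custom functions would be gone after the first slice.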

Regex to find phone numbers in every format

I couldn't find a truly universal regular expression (regex) to match phone numbers, no matter from which country and in which format. They all seemed to be limited in some way. Even named entity extraction APIs require you to set a country to find phone numbers.

In the end I rolled my own regex. It simply looks for a certain number of digits, combined with the characters generally used to make phone numbers human-readable. If you are looking to match longer or shorter numbers, you can just change the quantifiers. Some examples it will match:

540 297 1860
+886-2-8663-8287
0870882993
0090 530 229 12 04
+66 (0) 28340463
058 218 0600
(2014-2015
062-21-8608888
886-5-2781880
011-81-27372-9341
03- 7722 5012
+886-2-8663-8287
+62 – 21 – 5694 2002
+34 918 380 082
+90 532 643 34 34
+7 495 228 3513
+ 7 702 270 38 13 + 7 777

While not matching:

9:00-17:00
2015(15:00

And here the regex:

(?!.*[a-zA-Z\,:])(?=(\D*\d){7,14})([\+\d\(]{1,2}.{6,23}\d)

To use it in Python

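import re

# Raw string, so the backslashes are passed to the regex engine unchanged.
rex = r'(?!.*[a-zA-Z\,:])(?=(\D*\d){7,14})([\+\d\(]{1,2}.{6,23}\d)'
# The pattern has two groups, so findall returns tuples; the number is the second item.
numbers = [groups[1] for groups in re.findall(rex, str_with_phone_numbers)]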

 

Let me know if this is useful for you or if you see room for improvement. Currently the biggest issue I see is that the two ranges, one for the number of digits and one for the total number of characters, are unrelated. Because numbers can contain many filler characters, higher values are needed for the total length. Those can lead to false negatives. Best test it for yourself.
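
To test it for yourself, a small harness like this one runs the regex over a few of the examples from above. The helper name find_phone_numbers is made up for this sketch; it uses finditer and reads the second capture group, which holds the number.

```python
import re

# Raw string, so the backslashes reach the regex engine unchanged.
PHONE_REX = re.compile(r'(?!.*[a-zA-Z\,:])(?=(\D*\d){7,14})([\+\d\(]{1,2}.{6,23}\d)')


def find_phone_numbers(text):
    """Return the matched numbers (capture group 2) found in `text`."""
    return [match.group(2) for match in PHONE_REX.finditer(text)]


samples = ['540 297 1860', '+886-2-8663-8287', '+34 918 380 082',
           '9:00-17:00', '2015(15:00']
for sample in samples:
    print(repr(sample), '->', find_phone_numbers(sample))
```

Adding your own problem cases to samples is the quickest way to see how a change to the quantifiers affects the results.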

Python clipboard access

I was using Python and Jinja2 to generate some tables with 100+ rows for WordPress. This package saved me the extra step to open a file and copy+paste it from there.

There should be many other uses to integrate it into semi-automated workflows.

Check it out here.

Extract structured data from PDF invoices

Most invoices exist in electronic format. They are generated from structured data and need to be entered as structured data. It's a shame that we still need humans to manually extract data points like amount, date or issuer from them.

Over the last few days, I tried a few online invoicing solutions, like Shoeboxed, but none of them does a good job at automatically recognizing new invoices. Some do it manually and charge accordingly.

Currently I don't see a way to get the data fully automatically. PDFs are simply not made for this. The best we can do is to add a template for each specific invoice format and use it to extract the data. I have created a proof-of-concept library, which is open source on GitHub.

If you have any thoughts on what to improve, or would like to extend this for use in a production accounting setup, let me know.

Scalable Docker Monitoring with Fluentd, Elasticsearch and Kibana 4


Docker is a great set of technologies. Once you are comfortable using it, you are presented with a set of challenges you didn't have before. To name some:

  • log consolidation: How to retrieve log files from dozens of containers?
  • monitoring: How much RAM and CPU is each container using?

There are a few articles on this topic out there. After reading them, none of the solutions fully convinced me, but they all had some nice features, which I chose to combine here.

Linksnappy Command Line Downloader (Python)

Simple Python script to download files via Linksnappy.

#!/usr/bin/env python

import json
import sys

import requests

USERNAME = 'my username'
PASSWORD = 'my pass'

# The link to resolve is passed as the first command-line argument.
params = {'link': sys.argv[1],
          'type': '',
          'username': USERNAME,
          'password': PASSWORD}

resp = requests.post('http://gen.linksnappy.com/genAPI.php',
                     data={'genLinks': json.dumps(params)})

url = json.loads(resp.text)['links'][0]['generated']

local_filename = url.split('/')[-1]

# http://stackoverflow.com/questions/16694907/how-to-download-large-file-in-python-with-requests-py
r = requests.get(url, stream=True)
with open(local_filename, 'wb') as f:
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:  # filter out keep-alive new chunks
            f.write(chunk)
            f.flush()
print(local_filename)

SSLv3 no longer supported

I have had SSLv3 disabled for HTTP for quite some time. In light of recent events, it is now also disabled for IMAP and SMTP. If you run into any trouble, let us know or update your clients.

Online iPython Notebook Viewer

We recently started using the slide function of IPython notebooks. Basically, it allows you to partition your notebook into slides, slide fragments and subslides. Those can be exported to reveal.js.

There is already a great viewer for notebooks on http://nbviewer.ipython.org. To save some steps in exporting, converting and adding reveal.js, I took the idea and added a slide viewer. Anyone can use it to link to their slides on Github, Gist or any other place. We even support Basic Auth. Check it out at:

https://slides.quantego.com

 

Access Docker container attributes in Ansible

Ansible is a great automation solution. I mainly use it to provision servers and launch Docker instances on them. Sometimes I need container attributes, like PID or Port to configure Nginx or monitoring tools.

While the Ansible documentation gives you some hints, I didn't find it obvious how to solve this. Basically, all your newly created containers end up in a list called docker_containers. Each entry has the same structure as the output of docker inspect.

For the PID:

docker_containers[0]['State']['Pid']

For the host port:

docker_containers[0]['NetworkSettings']['Ports']['8888/tcp'][0]['HostPort']

So you could add a PID-file for a container like this:

 - copy: dest="/var/run/{{ image_name }}.pid"
   content="{{docker_containers[0]['State']['Pid']}}"
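
If the nesting is hard to picture, here is the same lookup on a plain Python structure. The values are invented and only the keys used above are shown; a real entry from docker inspect contains many more fields.

```python
# Hypothetical, heavily shortened entry of docker_containers.
docker_containers = [{
    'State': {'Pid': 4321, 'Running': True},
    'NetworkSettings': {
        'Ports': {'8888/tcp': [{'HostIp': '0.0.0.0', 'HostPort': '49153'}]},
    },
}]

pid = docker_containers[0]['State']['Pid']
host_port = docker_containers[0]['NetworkSettings']['Ports']['8888/tcp'][0]['HostPort']
print(pid, host_port)
```

Note that the port entry is a list with one mapping per binding, hence the extra [0] before 'HostPort'.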

Also read the full docs here.