Email is a terrible way to communicate and should be avoided where possible. Unfortunately it is also the lowest common denominator on the web and will continue to be for the near future.
In the early days of the internet it was easy to run your own mail server. Due to the absurd quantity of spam, this task has become increasingly difficult, and many tech-savvy people have given up and switched to Gmail or other services. This is a pity, because a decentralized email infrastructure is harder to surveil, subpoena or shut down. I encourage everyone to run their own mail service if possible.
In this guide I will summarize the steps needed to get an effective SpamAssassin (SA) setup.
Online marketers rely on statistics about their visitors to constantly adapt offers and learn about traffic sources. Google Analytics is the current tool of choice for most of them. Unfortunately, GA suffers from two major issues that won’t go away any time soon.
When it launched, Amazon Glacier was applauded for providing a super-cheap long-term storage solution. While there are no surprises when uploading and storing files, retrieving them can get expensive. The pricing reflects the fact that Amazon needs to retrieve your files from tape, which is expensive and takes a long time. Several users reported high charges after retrieving their backups. In its defence, Amazon published a very detailed FAQ on this topic.
The key to getting your files back on the cheap is time. You can retrieve 5% of your stored data for free each month, but that allowance is calculated per hour, rather than per month or day. Example: if you keep 500 GB in Glacier, 500 GB × 5% / 30 / 24 ≈ 36 MB/hour.
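To make the numbers concrete, here is the same back-of-the-envelope calculation in a few lines of Python:

```python
# Hourly free-retrieval allowance for the 500 GB example above
stored_gb = 500
free_share_per_month = 0.05          # 5% of stored data per month
hours_per_month = 30 * 24

allowance_mb_per_hour = stored_gb * 1024 * free_share_per_month / hours_per_month
print(f"{allowance_mb_per_hour:.0f} MB/hour")   # ~36 MB/hour
```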
That’s great to know, but how can you keep retrieving 36 MB per hour for days or months without doing it manually? If you’re on OS X or Linux you can use mt-aws-glacier, a Perl script to track and retrieve your Glacier files. Using mtglacier for slow retrieval isn’t straightforward. There is an issue to improve it, but for now let’s work with what’s available.
The tool has two commands to restore files. One initiates the retrieval job on Amazon’s side; after the job completes (which takes about four hours), the files can be downloaded with the second. With that in mind, we need to tell mtglacier to request only a few files and then sleep for another hour. There is no way to limit the size of the files requested, but the number of files can be limited. As long as you have many small files, this works great.
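A rough sketch of such a throttled loop, assuming mtglacier’s `restore` and `restore-completed` commands and its `--max-number-of-files` option (check the project README for the exact flags); config, vault, journal and target directory are placeholders:

```python
import subprocess
import time

# Placeholder config, vault, journal and restore directory
common = ["--config", "glacier.cfg", "--vault", "my-backup",
          "--journal", "journal.log", "--dir", "restore/"]

while True:
    # Initiate retrieval jobs for only a few files to stay near the free tier
    subprocess.run(["mtglacier", "restore", "--max-number-of-files", "5", *common])
    # Download files whose jobs (started in earlier iterations) have completed
    subprocess.run(["mtglacier", "restore-completed", *common])
    time.sleep(3600)  # wait an hour before requesting the next batch
```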
At Quantego, we do most high-level work that supports energy analysts in Jupyter Notebooks. This allows us to pull several Java and Python packages together for a highly productive work environment.
Sample notebooks are hosted on GitHub and distributed with our Docker images. Of course we prefer our sample notebooks to work when people run them. They also uncover potential problems, because they run at a very high level and thus exercise almost all available features.
If you have a similar setup – for example in a data analytics-driven environment – the following could work for you as well:
Make sure your notebooks run correctly with “Run All”. When testing you may try different things and run cells out of order; for automated testing to work, all cells should run in sequence.
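One way to do this is nbconvert’s execute-and-convert step, e.g. something like `jupyter nbconvert --to html --execute *.ipynb` on the command line. Here is a rough equivalent using the Python API (the notebook file name is a placeholder):

```python
import nbformat
from nbconvert import HTMLExporter
from nbconvert.preprocessors import ExecutePreprocessor

# Execute all cells in order; a failing cell raises an error
nb = nbformat.read("sample_notebook.ipynb", as_version=4)
ExecutePreprocessor(timeout=600).preprocess(nb, {"metadata": {"path": "."}})

# Convert the executed notebook to HTML (the output itself is discarded)
body, _ = HTMLExporter().from_notebook_node(nb)
```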
This will convert your notebooks to HTML. We’re not interested in the output, only in potential errors during conversion. This only works with Jupyter and IPython >= 4; previous versions simply ignore errors.
Next you could run the same command in an isolated Docker container or in a CI step.
If you happen to only have FTP access to a server or account (cPanel) you’re looking after, LFTP is an efficient tool to keep incremental backups. The idea is to make hard links of the previous backup and then update it, so only changed files are copied and stored.
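A minimal sketch of this, assuming `lftp` and GNU `cp` are available; host, credentials and paths are placeholders (the very first run needs a full mirror instead of the hard-link step):

```python
import subprocess
from datetime import date

remote = "ftp://user:password@example.com"     # placeholder credentials
remote_dir = "public_html"
prev = "/backups/site/latest"                  # previous snapshot
cur = f"/backups/site/{date.today().isoformat()}"

# Hard-link the previous snapshot: unchanged files take up no extra space
subprocess.run(["cp", "-al", prev, cur], check=True)

# Mirror only files that are newer on the server into the new snapshot
subprocess.run(
    ["lftp", "-e", f"mirror --only-newer {remote_dir} {cur}; quit", remote],
    check=True,
)
```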
At Quantego.com we love working with Pandas Dataframes. We use them to store and analyze results from simulation runs. On top of our data matrix and a multi-level index we also need to accommodate custom plotting functions and attributes from the previous simulation run.
Using pandas.DataFrame for this task was a no-brainer. The new version 0.16.1 (to be released in the next few days) includes some fixes that make working with subclasses of complex data frames (DF) easier. Here is an example of what can be done. First, define two new classes for pandas.Series (a single-column DF) and pandas.DataFrame. You can define new functions or attributes as needed.
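A minimal sketch of such a pair of subclasses (the class names, the extra attribute and the example method are illustrative, not taken from our actual code):

```python
import pandas as pd


class ResultSeries(pd.Series):
    """Single-column slice of ResultFrame."""

    @property
    def _constructor(self):
        return ResultSeries


class ResultFrame(pd.DataFrame):
    """My custom dataframe"""

    # custom attributes pandas should try to carry over to copies and slices
    _metadata = ["run_info"]

    @property
    def _constructor(self):
        return ResultFrame

    @property
    def _constructor_sliced(self):
        return ResultSeries

    def normalized(self):
        # roll your own method on top of the full pandas API via self
        return self / self.max()
```

With this in place, slicing a column of a ResultFrame returns a ResultSeries instead of a plain pandas.Series.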
"My custom dataframe"
_constructor_sliced . They make sure you get the correct class back, when slicing the DF.
self you have convenient access to all Pandas functions and can even roll your own.
I couldn’t find a truly universal regular expression (regex) to match phone numbers, no matter from which country and in which format. They all seemed to be limited in some way. Even named entity extraction APIs require you to set a country to find phone numbers.
In the end I rolled my own regex. It simply looks for a certain number of digits, plus the characters generally used to make phone numbers human-readable. If you are looking to match longer or shorter numbers, you can just change the quantifiers. A sketch of the approach, along with a few formats it matches, is shown below.
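The pattern below is an illustrative reconstruction of that approach rather than the exact original: a lookahead requires a minimum number of digits, while the character class bounds the total length, including separators.

```python
import re

# Lookahead: 7-14 digits overall; character class: 8-20 total characters
# made up of digits and the usual separators ( ) - . / and spaces.
PHONE_RE = re.compile(r"\+?(?=(?:[-()\s./]*\d){7,14})[-()\s./\d]{8,20}")

for text in ["+1 (555) 123-4567", "0043 664 1234567", "555.123.4567"]:
    print(PHONE_RE.search(text).group())
```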
Let me know if this is useful for you or if you find room for improvement. Currently the biggest issue I see is that the allowed ranges for the number of digits and for total characters are unrelated. Because of the many filler characters, higher values are needed, and those can lead to false negatives. Best test it for yourself.
Most invoices exist in electronic format. They are generated from structured data and need to be entered as structured data. It’s a shame that we still need humans to manually extract data points, like amount, date or issuer, from them.
In the last few days I tried a few online invoicing solutions, like Shoeboxed, but none of them does a good job of automatically recognizing new invoices. Some do it manually and charge accordingly.
Currently I don’t see a way to get the data automatically. PDFs are simply not made for this. The best we can do is to add templates for a specific invoice format and use them to extract the data. I have created a proof-of-concept library, which is open source on GitHub.
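As a rough illustration of the template idea (not the library’s actual API; the issuer name and field patterns are made up), each template pairs a recognition keyword with a set of regexes applied to the text extracted from the PDF, e.g. via pdftotext:

```python
import re

TEMPLATES = {
    "ACME Utilities": {                                    # hypothetical issuer
        "keyword": "ACME Utilities Ltd.",
        "amount": r"Total due:\s*EUR\s*([\d.,]+)",
        "date": r"Invoice date:\s*(\d{2}\.\d{2}\.\d{4})",
    },
}

def extract_fields(text):
    """Return structured data for the first template whose keyword matches."""
    for issuer, tpl in TEMPLATES.items():
        if tpl["keyword"] not in text:
            continue
        amount = re.search(tpl["amount"], text)
        date = re.search(tpl["date"], text)
        return {
            "issuer": issuer,
            "amount": amount.group(1) if amount else None,
            "date": date.group(1) if date else None,
        }
    return None
```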
If you have any thoughts on what to improve, or would like to extend this for use in a production accounting setup, let me know.