A week or so ago, I had to come up with a way to export a cleaner version of the data stored in our application database. We use Postgres and rely on its JSON and HSTORE key-value storage for our application needs. But when it comes to exporting our data for analysis, columns holding JSON and HSTORE-encoded key-value pairs are not convenient. So I decided to take the opportunity to dive into Python. After all, Python is quickly becoming the de facto language for data science.
However, I am a noob when it comes to Python. As such, this blog post is intended for noobs like me. I will try to explain how to set up a development environment for Python and give some tips on where to get started in learning Python for data science. I am also assuming that this is not your first language: you are probably a professional programmer who is fluent in one or more languages and comfortable with the command line. If so, let’s start:
What to install?
When you start googling around, you will run into many alternatives. Your first decision is whether to use Python 2.7 or 3.3; it turns out that Python 3 is not backwards compatible with 2.7. Python 3.3 definitely sounds like the better choice, and after 5 years it is probably time to move. It looks like most popular packages, including the machine learning package scikit-learn, have made the move to Python 3.3+ this year, so we should be good.
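To give a feel for what "not backwards compatible" means in practice, here is a minimal sketch of one well-known behavior change, written so that the same file runs on both 2.7 and 3.x. The variable names are only for illustration.

```python
import sys

major = sys.version_info[0]

# Division of two integers: truncates on Python 2, true division on Python 3.
quotient = 7 / 2

# Floor division behaves the same on both versions.
floor_quotient = 7 // 2

# A print call with a single parenthesized argument works on both versions;
# the bare Python 2 form (print "text") is a syntax error on Python 3.
print("Python %d: 7 / 2 = %r, 7 // 2 = %r" % (major, quotient, floor_quotient))
```

Small differences like these are why a script written for one version may fail, or quietly give different numbers, on the other.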
The second decision is choosing a package manager/distribution. The most common choice is pip. But as I was searching around, I found out about another alternative specifically geared towards scientific computing. It is called Conda, and it sounded like a pretty convenient way to quickly set up my system with a set of data science related packages. Even though the default installation comes with 2.7, they also support Python 3.2+. Also, once you start using Conda, you don’t have to worry about leaving pip behind: you can still run pip inside a Conda environment when Conda does not have the package you are looking for. This Stack Overflow thread has more on that.
The third decision is your development environment: which text editor to use, and so on. I am a big fan of Sublime Text and I decided to keep using that. My terminal of choice is iTerm, my shell is zsh, and as the debugger/REPL environment I settled on iPython. More on that later…
If you like IDEs better, JetBrains has a version of their IDE for Python development, called PyCharm. There is a free community edition as well as a paid one, which comes with support for various frameworks, including Django. I also recently bumped into a brand new OS X native Python IDE called Exedore, which claims to be inspired by Xcode. I have not had a chance to try it out myself, but if you are interested, there is a free trial.
The Anaconda distribution contains all the interesting packages as well as iPython and iPython Notebook. It is free and probably the most straightforward way to get started. I picked the GUI package installer from the choices for Mac OS X. When you first install it, you need to update your PATH so that Python is picked up from the anaconda/bin directory instead of the system default.
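A quick way to check that the path change took effect is to ask Python which interpreter is actually running. This is only a sketch: the check for "anaconda" in the path assumes the installer's default directory name, which may differ on your machine.

```python
import sys

# Full path of the interpreter executing this script (may be empty in rare setups).
interpreter_path = sys.executable or ""

# Assumption: the default install location contains "anaconda" in its path.
uses_anaconda = "anaconda" in interpreter_path.lower()

print("Interpreter: %s" % interpreter_path)
print("Version: %s" % sys.version.split()[0])
print("Looks like Anaconda: %s" % uses_anaconda)
```

If the interpreter path still points at the system Python, the PATH change has probably not been picked up by your shell yet.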
REPL Environment - iPython
Being used to the Ruby goodies, including the Ruby REPL called pry, my first instinct was to look for a good REPL environment. And iPython delivers on that. REPL driven development is very productive, especially when you are learning a new language or testing out some quick thoughts. Plus, you can also use iPython for debugging your scripts.
Let’s say you have a simple Hello World script like the one below in a file called hello_world.py:
1 from IPython.core.debugger import Tracer; breakpoint = Tracer()
2
3 output = "Hello World"
4 breakpoint()
5 print output
Then if you open up your ipython shell and call the magic command %run hello_world.py, execution stops right after the breakpoint() call on line 4, with line 5 as the next line to run:
```
In [5]: %run hello_world.py
> /Users/foo/PyTest/hello_world.py(5)<module>()
3 output = "Hello World"
4 breakpoint()
----> 5 print output
```
To learn more about iPython, you can read the tutorial on the official iPython site.
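If iPython is not available on a machine, the standard library's pdb module offers a similar breakpoint. The sketch below uses pdb rather than the Tracer approach shown above; the greet function and its debug flag are made up for illustration, so that the script only pauses when you ask it to.

```python
import pdb


def greet(name, debug=False):
    """Build and print a greeting, optionally pausing in the debugger first."""
    output = "Hello " + name
    if debug:
        # Drops into the interactive pdb prompt at this point.
        pdb.set_trace()
    print(output)
    return output


# Runs straight through; pass debug=True to stop inside the function.
greet("World")
```

At the pdb prompt, commands such as n (next line), p output (print a variable) and c (continue) cover most day-to-day needs.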
iPython Notebook
Going through the iPython documentation, I also bumped into another great tool called the iPython Notebook. Let’s say you come up with a cool analysis of your data and you would like to share the code, the math behind it, an inline explanation of what your code does, and finally the full output of running your code, including graphs. That is where iPython Notebook comes in handy. You can use the public viewer to share your notebooks with other people if you don’t mind them being publicly available. This is especially great for public presentations, where you can go through your code and its output, explain it to your audience, and then make it available to them afterwards. You can also set up a private internal server. If you would rather host it in the cloud, there are some good step-by-step instructions on how to do that on AWS. Or you can try out some third-party hosted solutions for privately sharing your notebooks.
To get this set up on your computer, all you need to do is run ipython notebook, which starts a new server locally and fires up your browser. In the browser session, you can then create new notebooks, load existing ones and save them for later use. The notebook file format is a single JSON file with the .ipynb extension. To share your notebooks publicly, all you need to do is create a new public gist on GitHub and copy its Gist ID to the page you see at http://nbviewer.ipython.org. You can also convert your notebooks to other formats for non-interactive viewing using the nbconvert tool.
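Because a notebook is plain JSON, you can inspect one with nothing but the standard library. The sketch below writes a tiny hand-made notebook to a temporary file and reads it back. It assumes the version 4 layout of the notebook format, where cells sit in a top-level "cells" list; older notebooks nest them under "worksheets" instead. The file name and cell contents are made up.

```python
import json
import os
import tempfile

# A minimal hand-written notebook, assuming the nbformat 4 layout.
notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Export analysis"]},
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["print('hello')"],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 0,
}

# Save it with the .ipynb extension, then load it like any other JSON file.
path = os.path.join(tempfile.mkdtemp(), "example.ipynb")
with open(path, "w") as handle:
    json.dump(notebook, handle)

with open(path) as handle:
    loaded = json.load(handle)

cell_types = [cell["cell_type"] for cell in loaded["cells"]]
print(cell_types)
```

The same approach works for quick scripts that, say, count code cells or pull the markdown text out of a shared notebook.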
Learning Python
Now that you have set up your environment and learned how to share your code with your colleagues and friends, it is time to do some actual coding. There is a list of Python learning resources on the python.org wiki, but my favorite is the Google Python Class. Another, even shorter intro for experienced developers is the 10-minute Intro to Python.
Once you get a quick grasp of Python, if you are doing some data science work, I highly recommend this tutorial for using Pandas. After reading that tutorial and once you get used to using iPython and iPython Notebook, you will probably stop using Excel for many of your basic data manipulations.
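To tie this back to the export problem from the beginning of the post, here is a rough sketch of how a JSON-encoded column could be spread out into regular columns with Pandas. The column names and sample rows are invented, and it assumes the JSON values arrive as strings; HSTORE columns would need their own parsing step, which is not shown.

```python
import json

import pandas as pd

# Made-up rows standing in for a database export with a JSON-encoded column.
rows = pd.DataFrame({
    "id": [1, 2],
    "properties": ['{"plan": "free", "seats": 1}', '{"plan": "pro", "seats": 5}'],
})

# Decode each JSON string into a dict, then turn the dicts into columns.
decoded = rows["properties"].apply(json.loads)
expanded = pd.DataFrame(decoded.tolist(), index=rows.index)

# Replace the encoded column with the expanded ones.
flat = pd.concat([rows.drop(columns=["properties"]), expanded], axis=1)
print(flat)
```

From here, flat.to_csv("export.csv", index=False) would write the flattened table to a file that analysis tools can read directly.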
If you have access to the Safari Books Online Bookshelf, I would also recommend adding Python for Data Analysis and Programming Collective Intelligence to your bookshelf to get going with Data Science using Python. Another really cool resource is an online book called Probabilistic Programming and Bayesian Methods for Hackers. The whole book is interactive and is rendered with the iPython Notebook Viewer!
And one final tip: if you are looking for an interactive alternative to Matplotlib, check out the up-and-coming library Bokeh.