Now that you’ve got your feet wet and have Python installed and tested, follow me on this journey to prove to yourself how easy, fast and fun it is to analyse data with Python.
We’re going to be working with the PIMA Indians diabetes data. This is an open data source that you can get from the UCI Machine Learning Repository, and it is downloaded directly as part of each little program we’re going to run below to amaze ourselves with what Python can do. But you don’t have to search for the file on the UCI server. I’ve provided it for you here. Right click THIS LINK and save the file to a directory on your computer. I suggest you create a directory named DNA-Python and save it there. We’ll use this directory in future classes, so it’ll be useful to create it right now.
Once you’ve downloaded it, open it with a spreadsheet application. In most cases all you have to do is right click the file, choose <open with> and pick your spreadsheet app (Microsoft Excel, OpenOffice, Numbers, whatever you have on your machine). If you don’t have a spreadsheet app, open it with a text editor (Notepad, gedit, TextEdit, Sublime, Atom, Emacs, Vim, whatever you like).
Here is the file open in various apps on my machine:
And here it is open in OpenOffice.
Now that you have the file open, scroll through and have a good look at it. There are 9 columns, A to I, across the top and 768 rows. Each row is one patient, so that’s 9 values recorded for each of 768 patients.
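If you’d rather let Python do the counting, here is a minimal sketch that uses only the standard library. The folder and file name (`DNA-Python/pima-indians-diabetes.csv`) are assumptions about where you saved the download, so change them to match your machine, and `count_rows_and_columns` is just a helper name made up for this example.

```python
import csv
from pathlib import Path


def count_rows_and_columns(csv_path):
    """Return (rows, columns) for a comma-separated file, skipping blank lines."""
    with open(csv_path, newline="") as handle:
        rows = [row for row in csv.reader(handle) if row]
    columns = len(rows[0]) if rows else 0
    return len(rows), columns


# Assumed location and file name: adjust to wherever you saved the download.
csv_path = Path("DNA-Python") / "pima-indians-diabetes.csv"

if csv_path.exists():
    print(count_rows_and_columns(csv_path))
else:
    print(f"No file at {csv_path.resolve()}; edit csv_path above.")
```

If the two numbers match what you counted in the spreadsheet (768 and 9), you know Python is looking at the same file you are.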
What else can you say about this data? Can you tell what the columns are depicting? And how do we find out more about this dataset, to decide whether it interests us and whether we can use it to learn something about data science?
Let’s try a Google search. I put <pima Indians diabetes dataset> into Google and found that it is on Kaggle and that I can source it through my free account on data.world, so I headed there and this is what I got >>
So, a little bit more information about the dataset. We’re getting somewhere. If I scroll around and click on various links, just browsing, playing with the tools and gleaning a little here and there about this dataset, I find this >>
Aha! Here are the 9 columns. So now I have some idea what I can call my columns as I analyse the data with Python.
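For reference, the widely used Kaggle copy of this dataset labels its nine columns with the names below. Here is a quick sketch of how those names pair up with one row of values. The row is only an illustration, and `label_row` is a helper name made up for this example.

```python
# Column names as used in the Kaggle copy of the dataset.
COLUMN_NAMES = [
    "Pregnancies",
    "Glucose",
    "BloodPressure",
    "SkinThickness",
    "Insulin",
    "BMI",
    "DiabetesPedigreeFunction",
    "Age",
    "Outcome",
]


def label_row(values, names=COLUMN_NAMES):
    """Pair each value in a row with its column name."""
    if len(values) != len(names):
        raise ValueError(f"expected {len(names)} values, got {len(values)}")
    return dict(zip(names, values))


# Illustrative row of nine values (not guaranteed to match your file).
example_row = [6, 148, 72, 35, 0, 33.6, 0.627, 50, 1]

for name, value in label_row(example_row).items():
    print(f"{name:>26}: {value}")
```

The last column, Outcome, is the odd one out: it records whether the patient has diabetes rather than a measurement.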
Now here is an important thing for you to do to follow along with me and learn some data science. Fire up a Jupyter notebook. Forgot how to? Go back and read this article once more, then come back here and start a Jupyter notebook. Make sure you’re in cell 1 of your Python 3 notebook and TYPE in all the code you see below. Don’t try to understand any of it right now, just type it in. My goal here is to show you what Python can do, not to teach you Python the language. We’ll get to that later. So in cell 1 type the code you see below and press control-enter when you’re done. There is one stumbling block here, and that is figuring out the path to the PIMA file you downloaded. I can’t know what your path should be, so this is one of the steps you have to ferret out yourself. So pull out that detective’s hat, Sherlock, and get it to run. Here’s what it looks like on my machine >>
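In case the screenshot doesn’t come through, here is a minimal sketch of a program along these lines. It is an approximation under stated assumptions, not a copy of the original cell: it assumes the file has no header row, that you saved it as `DNA-Python/pima-indians-diabetes.csv` (a placeholder path you will need to change), that the Kaggle column names are acceptable labels, and that histograms are the plots you want. It is a little longer than strictly necessary because it tells you where it looked if the path is wrong.

```python
from pathlib import Path

import pandas as pd

# Assumed labels (Kaggle naming); the file is assumed to have no header row.
COLUMN_NAMES = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin",
    "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]


def load_pima(csv_path):
    """Read the headerless CSV into a DataFrame with named columns."""
    return pd.read_csv(csv_path, header=None, names=COLUMN_NAMES)


def plot_histograms(data):
    """Draw one histogram per column."""
    import matplotlib.pyplot as plt  # imported here so loading works without it

    data.hist(figsize=(10, 8))
    plt.tight_layout()
    plt.show()


# Placeholder path: change this to where the file lives on your machine.
csv_path = Path("DNA-Python") / "pima-indians-diabetes.csv"

if csv_path.exists():
    plot_histograms(load_pima(csv_path))
else:
    print(f"Looked for {csv_path.resolve()} and found nothing.")
    print(f"The notebook is running in {Path.cwd()}")
```

If you see the "found nothing" message, compare the two printed paths with where you actually saved the file and edit `csv_path` until they agree.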
Some of you are complaining that you got an error message saying something like matplotlib not found or pandas not found. If you did, here is what you have to do. Open up a terminal (or a command prompt on Windows) and type this >>
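The usual command for this is `pip install pandas matplotlib`. If you aren’t sure whether you need it, here is a small sketch you can run in a notebook cell first. It checks whether Python can find the two libraries and, if not, prints an install command for the same Python that is running your notebook. `missing_packages` is a helper name made up for this example.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the packages from `names` that Python cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(["pandas", "matplotlib"])
if missing:
    # Run the printed command in a terminal, then restart the notebook kernel.
    print(f"{sys.executable} -m pip install {' '.join(missing)}")
else:
    print("pandas and matplotlib are already installed.")
```

Using `python -m pip` with the full path from `sys.executable` is a precaution for machines with more than one Python installed, so the libraries land where your notebook can see them.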
When I do this on my machine I get this >>
That’s a good thing: it tells me I already have these libraries installed. Remember the libraries we talked about on this page? These are just two of them, and they give us the functions you typed in earlier.
OK. Go back to the cell you keyed in. The one that looks like this >>
And press control-enter again. Did you get this …
If you did, CONGRATULATIONS! You’ve done your first analysis of a dataset using Python. Now we can look at the plots and try to figure out what our data is telling us. I don’t want to get into the data science part of this course just yet. At this point in our explorations I just want you to see what an amazing tool Python is.
Make sure you get this working. Make sure you KEY IN ALL the code I’ve listed above. C’mon, it’s only 7 lines of code. KEY IT IN and press control-enter. Get it to work. This is the grunt work of a data scientist working in Python. Not too hard, is it? A little more knowledge under your belt and your bosses or partners will be calling you a genius.
Ready for the journey of a lifetime? Read on or reach out to me and let’s talk about what heights you want to take your life to.
Now, in the same intuitive way, I’m going to lead you through a neural network. Yes! You read that right. We’re going into the big leagues and building a neural network right there in your Jupyter Notebook! Click here to visit that page. It’s a work in progress, so you should come back often to keep up with the updates I’ll be making over the next few days.