Making Sense of Data with Python & Pandas

 

“Those who do not learn history are doomed to repeat it.”

~George Santayana

It’s important for mankind to learn from the past in order to survive. This notion is true both in personal and professional aspects of live. Every decision we make should be based on some concrete data. This idea seems more relevant for business decisions. So with these things in mind let’s see how to get vital insights from data gathered over time. We will use a dataset and answer some questions using it.

If you know the basics and good to go, skip to the questions.

Before we begin please note that it’s not a traditional tutorial where all basic Pandas functions are explained. Here we will learn functions as and when we need them, with reference to detailed documentation wherever necessary. I will be using a dataset from Kaggle, which has 20 years of computer games history like ratings, platform, editor’s choice etc. You can find it here.

Firsts things first, make sure you have Python (version > 3) on your system along with Pandas (needed to process the dataset), matplotlib (to visualize the data). A quick Google search would fetch you plenty of posts with step by step guide to do that. Now import all the libraries needed :

import pandas as pd
import matplotlib.pyplot as plt

Since the dataset is quite big, to be able to view all columns in the terminal do the following :

pd.set_option('max_columns',None)
pd.options.display.width = None

The above statements remove the upper limit on, maximum number of columns that can be shown and display width . A full list of all supported attributes can be found here. Now that we are all set let’s read the data using Pandas. It’s as easy as this :

data = pd.read_csv("/path_to_file/ign.csv")

This reads the data and stores it in “DataFrame” variable. For those who are new to Pandas, DataFrame is one of the data structures in Pandas which store data in a tabular form or 2D array as we will see later. The other two data structures used in Pandas are Series and Panel explained here. Let’s see what the data looks like :

print(data.iloc[::3][["title","score","score_phrase","genre","platform","editors_choice"]])

DataFrame “iloc” is an attribute that let’s you select rows based on integer index values. For your reference index is the leftmost column in below image. “[::3]” chooses every 3rd row of dataframe. You can view 10th to 50th row (for instance) using “[10:50]” instead. In the next square brackets we pass an array of column labels to show.

This prints below output :

SampleData

You can remove the columns section and execute to see all the columns. Now let’s try to answer some questions based on this data.

Q1 : What platform has better rated games, PlayStation or Xbox ?

Let’s say you want to a gaming console and you are torn between PS or Xbox. One the important criteria is what games are available for each of them. Thanks to these datasets you don’t have to make a blind decision. Here’s how we are going to do it : group the dataset by “score” column and find the count of PS and Xbox for each group of score. If you look at the “score” column it consists of decimal values which may lead to too many groups and meaning less insight. So first we need to round up the “score” column and store in a new column “score_round”.

data["score_round"] = data["score"].round(0)

Parameter passed to round function tells, up to what decimal places we want to round the decimal number. Please note “round” converts a decimal to nearest number. If decimal is midway (eg. 8.5) it would be rounded to nearest even number (i.e. 8). So 7.5 would be 8 and 8.5 would be 8 too.   To visualize our insight in an easy to understand way we must plot the result on bar graph. To do that we store result in a new data frame with 3 columns, ratings, playstation_count, xbox_count.

ratings = sorted(data["score_round"].unique())
data_pf_ratings = pd.DataFrame({"ratings" : ratings})
data_pf_ratings["playstation_count"] = data[data["platform"].str.contains("PlayStation")]["score_round"].value_counts()
data_pf_ratings["xbox_count"] = data[data["platform"].str.contains("Xbox")]["score_round"].value_counts()

First we got all the unique and sorted values of score_round column in an array. Then we created a new data frame “data_pf_ratings” with ratings column only. Next we created new column “playstation_count” count of PS games for each scores (1,2,3 etc). value_counts() groups by and counts the value for each group. Now plot the graph like this :

data_pf_ratings.plot(kind="bar", title="PlayStation vs Xbox",y=["playstation_count","xbox_count"],edgecolor="black",linewidth=1.2)
plt.show()

output :

PSvsXbox

As you can see PS has a lot more games under 7-10 rating category. Go PlayStation!

Q2 : Is rating of a game dependent on editor’s choice?

This question may seem difficult to answer but thanks to Pandas it’s one of the easiest if you know the concept of “correlation”.

Correlation is a statistical technique that can show whether and how strongly pairs of variables are related.

Pandas has inbuilt function to find correlation :

DataFrame["Col1"].corr(DataFrame["Col2"])

This would return a float value. To give some idea if you use same column in both places in above statement you would get a value of 1.0. So that means a column is completely dependent on itself. But both columns have to be numerical type. The columns that we need to consider in our dataset are “score” and “editors_choice”. So let’s convert editors_choice to numbers and save it in new column “editors_choice_int” :

data.loc[data["editors_choice"] == "Y","editors_choice_int"] = 1
data.loc[data["editors_choice"] == "N","editors_choice_int"] = 0

Now find correlation :

print("Correlation between score and editors_choice")
print(data["score"].corr(data["editors_choice_int"]))
print(data.loc[1:50,["score","editors_choice_int"]].corr())
print(data.corr())

You can use any one of the above ways to get correlation between score and editors_choice columns. Their respective outputs are shown below :

Screen Shot 2018-06-25 at 9.27.00 PM

As you can see from 3rd command result all the completely unrelated columns have very less correlation value (except for unnamed 0 & release_year) but score and editors_choice have value of 0.540672. So score dependents on editor’s choice to some extent.

Q3 : Get all games in 1996 by playstation with score > 7?

print(data.loc[(data["platform"].str.contains("PlayStation")) & (data["release_year"] == 1996) & (data["score"] > 7),["title","score","platform","genre","editors_choice","release_year"]])

output :

Screen Shot 2018-06-25 at 9.31.00 PM

Q4 : Find the change in game genre (Action/Fight) with year?

Now let’s understand how popular action games have been over the decades. Let’s get a subset of real data which has only action and fight genre games :

data2 = data.loc[(data["genre"].str.contains("Action", case=False, na=False)) | data["genre"].str.contains("Fight", case=False, na=False)]

Now let’s find action/fight game counts for each year :

df_year_to_genre = data2.groupby("release_year")["title"].agg("count").to_frame("count").reset_index()
df_year_to_genre.set_index("release_year").plot(kind="bar", title="Action games per year")

This can also be done using following statement but result won’t be sorted by release_year :

data2["release_year"].value_counts().plot(kind="bar", title="Action games per year")
plt.show()

output :

ActionGamesPerYear

As you can see action game genre has been on the decline since 2010, contrary to what we may believe.

Q5 : What genre gets best ratings (7-10)?

Suppose you have few ideas for developing games under different genres. Which one should you choose. You would obviously want to know what type of games, players like the most. Here’s how :

data_7_10 = data.loc[(data["score"] >= 7)]
dataTemp = data_7_10["genre"].value_counts()
dataTemp[:5].plot(kind="bar", title="Top 5 rated game genres")
plt.show()

It shows top 5 genres like this :

TopGenres

As you can see “Action” is the winner by huge margin.

Q6 : What ratings to Action games usually get?

This question has similar ring to it. It’s just another way to make sure your above decision to delve into action game developement.

print("\nNumber of Action games per rating score")
print(data.loc[(data["genre"].str.contains("Action", case=False, na=False)) | data["genre"].str.contains("Fight", case=False, na=False)].groupby("score_round")["genre"].agg("count"))

“data[“genre”]” returns a series and str is an attribute of series to access contains function.

output :

Number of Action games per rating score

score_round

1.0       12

2.0       94

3.0      148

4.0      514

5.0      509

6.0     1202

7.0     1115

8.0     2022

9.0      710

10.0     139

So most of the action games get ratings of 6-8.

Q6 : What platform do most developers choose?

data_highest_release_pf = data["platform"].value_counts()
data_highest_release_pf[:10].plot(kind="bar", title="Highest release platforms")
plt.show()

output :

TopPlatforms

Q7 : During what part of year, most games are released?

You went ahead with your idea for action game and developed it with your team. Now comes the time to release it. What is the most suitable time for game release? When do most companies release their games?

data_month_grp = data["release_month"].value_counts().to_frame("count").reset_index()

The above statement finds the count of number of games released in each of year for complete 20 years data and adds it to data frame as new column “count” and resets index of the frame. Why do we need to call reset_index? if we use only

data["release_month"].value_counts().to_frame("count")

it gives a data frame like this :

     count

11   2657

10   2310

9    1707

3    1573

12   1505

6    1483

8    1338

2    1327

4    1264

7    1190

5    1143

1    1128

It gives counts in new column with months of year as index. If we need to use months as a column we should have them in a separate column.

data_month_grp = data["release_month"].value_counts().to_frame("count").reset_index()

This does that. Now dataframe is :

     index  count

0      11   2657

1      10   2310

2       9   1707

3       3   1573

4      12   1505

5       6   1483

6       8   1338

7       2   1327

8       4   1264

9       7   1190

10      5   1143

11      1   1128
data_month_grp = data_month_grp.sort_values(by="index")
data_month_grp.index = ["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"]

Now sort data frame by index column and change the inbuilt index to name of months so that on graph we can see month names below each bar to make it intuitive.

data_month_grp.plot(kind="bar", title="Month wise release count")

output :

MonthWiseRelease

So most of the games are released during year end; mostly during October and November. Is it coincidence or there is actually some reason behind it? We need more data to answer that.

Well then, whatcha waiting for? Find more data and get cracking!

Advertisements

Android : How to write/draw using finger

Sometimes typing gets tedious, or at least I feel that way. So few days back I made a small application to draw or write on Android device screen using my finger. And as it turned out it’s rather more fun than necessity. So here is how I did it.
This is the layout of the main activity. As you can see we will be using SurfaceView which provides a drawing surface.

Please notice the TextView that is below SurfaceView. This is used to clear the surface view once you have written something.

Next we initialize the above views and paint object that is used to draw in MainActivity’s onCreate method.

    SurfaceView surfaceView;
    SurfaceHolder holder;
    TextView tvClear;
    Paint trailPaint;
    Path path;
    float startX,startY,lastX,lastY;
    Bitmap mBitmap;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);

        requestWindowFeature(Window.FEATURE_NO_TITLE);
        getWindow().setFlags(WindowManager.LayoutParams.FLAG_FULLSCREEN,
                WindowManager.LayoutParams.FLAG_FULLSCREEN);

        setContentView(R.layout.activity_main);

        surfaceView = (SurfaceView) findViewById(R.id.surface_view);
        surfaceView.getHolder().addCallback(this);
        tvClear = (TextView) findViewById(R.id.tv_clear);

        tvClear.setOnClickListener(new View.OnClickListener() {
            @Override
            public void onClick(View v) {
                Canvas canvas = new Canvas(mBitmap);
                Paint blackPaint = new Paint();
                blackPaint.setColor(Color.BLACK);
                blackPaint.setStyle(Paint.Style.FILL);
                canvas.drawPaint(blackPaint);
                canvas = holder.lockCanvas();
                canvas.drawColor(Color.TRANSPARENT, PorterDuff.Mode.CLEAR);
                holder.unlockCanvasAndPost(canvas);
            }
        });

        trailPaint = new Paint();
        trailPaint.setAntiAlias(true);
        trailPaint.setDither(true);
        trailPaint.setColor(Color.GREEN);
        trailPaint.setStyle(Paint.Style.STROKE);
        trailPaint.setStrokeJoin(Paint.Join.ROUND);
        trailPaint.setStrokeCap(Paint.Cap.ROUND);
        trailPaint.setStrokeWidth(12);

        surfaceView.setOnTouchListener(this);
    }
Below are the surface view holder callback methods :
    @Override
    public void surfaceCreated(SurfaceHolder holder) {
        this.holder = holder;
        //this.holder.addCallback(this);
        showLog("surfCreate","holder made");
    }

    @Override
    public void surfaceChanged(SurfaceHolder holder, int format, int width, int height) {
        this.holder = holder;
        this.holder.addCallback(this);
        mBitmap = Bitmap.createBitmap(width, height, Bitmap.Config. ARGB_8888);
        showLog("surfChanged","holder made");
    }

    @Override
    public void surfaceDestroyed(SurfaceHolder holder) {

    }

Notice how we are re-initializing mBitmap object in surfaceChanged method. This method is called when SurfaceView is created first time or when it’s dimensions change like when screen orientation changes.
Below are the methods used in this program (explained later in this post) :
private void drawTrail(float x, float y){
    if (mBitmap == null) return;  // not ready yet
    Canvas canvas = new Canvas(mBitmap);
    canvas.drawPath(path, trailPaint);

    canvas = holder.lockCanvas();
    canvas.drawBitmap(mBitmap,0,0,null);
    holder.unlockCanvasAndPost(canvas);
}

private void startPath(float x, float y){
    startX= x;
    startY = y;
    lastX = x;
    lastY = y;
    path = new Path();
    path.moveTo(x,y);
}

private void endPath(){
    path = new Path();
}

private void showLog(String tag, String msg){
    Log.e(tag,msg);
}

But the core of the application is yet to come. It’s the onTouch method called for surface view.
@Override
public boolean onTouch(View v, MotionEvent event) {
    String TAG = "onTouch";
    int x = (int)event.getRawX();
    int y = (int)event.getRawY();
    //Log.e(TAG,"imgWH:"+imgView.getWidth()+","+imgView.getHeight()+",bitWH:"+bill.getWidth()+","+bill.getHeight());
    /*int x = (int)event.getX();
    int y = (int)event.getY();*/
    //Log.e(TAG,"x:"+event.getRawX()+",y:"+event.getRawY());
    switch (event.getAction()) {
        case MotionEvent.ACTION_DOWN:
            TAG = "ACTION_DOWN";
            showLog(TAG,TAG);
            startPath(x,y);
            break;
        case MotionEvent.ACTION_MOVE:
            TAG = "ACTION_MOVE";
            showLog(TAG,TAG);
            path.quadTo(lastX,lastY,(lastX+event.getX())/2,(lastY+event.getY())/2);
            //path.lineTo(event.getRawX(),event.getRawY());
            drawTrail(event.getRawX(),event.getRawY());
            lastX = event.getX();
            lastY = event.getY();
            break;
        case MotionEvent.ACTION_UP:
            TAG = "ACTION_UP";
            showLog(TAG,TAG);
            endPath();
            break;
    }
    return true;
}

How Does It Work?

It’s quite simple actually. When application starts and onCreate method is called we initialize surface view, paint and add callbacks for surface view holder. So once surface view is ready it’s “surfaceChanged” method is called and we initialize the bitmap variable mBitmap. Now the obvious question is why is bitmap used? What happens here is we draw on bitmap and then draw that bitmap on surface view. So why not draw directly on surface view? I tried it but it caused problem while redrawing after clearing the canvas. What happens in the line

     Canvas canvas = new Canvas(mBitmap);

is a Canvas object canvas is created with specified bitmap object. Canvas class holds the draw calls and Bitmap class holds the pixels. As per Canvas documentation :

"The Canvas class holds the "draw" calls. To draw something, you need 4 basic components: 
A Bitmap to hold the pixels, a Canvas to host the draw calls (writing into the bitmap), 
a drawing primitive (e.g. Rect, Path, text, Bitmap), 
and a paint (to describe the colors and styles for the drawing)."

Now let’s continue with our explanation. When you put your finger of the screen the onTouch method of the surface view is called with action “ACTION_DOWN”, so startPath method is called which stores the initial x, y coordinates of touch and adds them to “path” variable.

Next when move your finger onTouch is called with “ACTION_MOVE”. Here the new touch coordinates are added to Path “path” using quadTo method which creates smooth curves. Then drawTrail method is called where we draw the path to mBitmap, then we draw the mBitmap on surfaceView and post it.
When you have finally drawn and lift your finger endPath is called which clears the “path” variable.
If you want to clear the screen click the “Clear” view at bottom which fills black paint on mBitmap hence clears previous contents.
Here is how it looks!

i am awsome

Screenshot_20170325-191015

That’s all there is to it. Enjoy!

Brainf**k tamed! (Tutorial and Interpreter)

If you are in a hurry or don’t want to hear me yapping, you can skip straight to tutorial section below or my Brainfuck interpreter.

Prologue

Few weeks back I was talking to some of my friends and somehow the discussion took a turn towards some esoteric programming languages. At that moment I realized that I had never tried my hand at any of those. So I decided to pick one and went ahead with Brainf**k since I had heard about it before and the name itself seemed formidable. Who doesn’t want a good brain teaser for weekend fun! Thus began my journey to conquer arguably one of the most difficult programming languages. I was prepared for some serious mental bout with a stranger opponent. But to my surprise it didn’t take much time to learn the language! For which I don’t deserve any credits. Brainfuck has only 8 operators. But don’t get too excited yet. Writing programs in this language is a different ball game. To give you an idea, following is the Brainfuck program for printing “Hello World!” :

++++++++++[>+++++++>++++++++++>+++>+<<<<-]>++.>+.+++++++..+++.>++.<<+++++++++++++++.>.+++.------.--------.>+.>.

Anyways once I started writing small programs I needed a compiler. I found a few online. But for some programs they gave weird outputs or differed among themselves. Like for program to multiply 2 numbers. I spent an hour trying to figure out what’s wrong with my program only to realize it’s the compiler which had issues. I thought why not write my own compiler for a change. So in this blog I will be sharing a short (since there are many other fantastic tutorials out there) tutorial and the interpreter.

Brainf**k Tutorial

Brainf**k has only 8 operators namely :

< > [ ] . , + -

The language operates on cells starting at index 0. we can’t use any variables like other languages. Each cell holds a value that we can operate on. A pointer is maintained which holds current cell index… the cell on which program is operating currently. Current cell can be changed using operators “<“ (previous cell) or “>” (next cell). Value of current cell can be incremented/decremented using operators “+” and “-“ respectively. “.” operator returns the value which current cell holds. “,” accepts and stores an input to current cell. “[“ and “]” operators are while loop equivalent of Brainf**k. “[“ checks if current cell value is greater than 0 (in which case next operator of program should be read by compiler) or less than 0 (in this case compiler should go to corresponding “]” in program). “]” checks if current cell value is greater 0 (if true go to corresponding “[“) or less than 0 (read next operator in program). This would be more clear when discuss an example.

To summarise the functions of operators :

  1. "<" - decrement cell pointer i.e. all the next program operators would operate on value in cell preceding the current cell.
  2. ">" - increments cell pointer. Now pointer points to next cell.
  3. "+" - increments the current cell value.
  4. "-" - decrements the current cell value.
  5. "." - returns the value held at current cell.
  6. "," - accepts and stores an input value at current cell position.
  7. "[" - starts a loop if current cell value is greater that 0.
  8. "]" - if current cell value greater than 0 go to corresponding "[" and continue with loop.

Now let’s make things clearer with some examples. With any other language the first example would have been “Hello World!” program but I won’t start with it since it’s quite big, looks messy and doesn’t involve any loops. Let’s look at the program to add two numbers and store the result in third cell.

,>,<[>>+<<-]>[>+<-]>.

At the beginning of program all the cells hold value 0 with pointer pointing to 0 index cell as shown below.

[0] [0] [0] [0] .... 
 ^
  1. When compiler reads the first operator of program “,” it accepts an input from user and stores it at current cell i.e. 0 index. Let’s say we input 2. Then cells would now look like :
[2] [0] [0] [0] .... 
 ^

2. Next the compiler moves to next operator in program “>” and thus increments the current cell pointer (let’s call it ccp from now on for brevity).

[2] [0] [0] [0] .... 
     ^

3. Next operator is again “,”. So compiler reads and stored user input but this time ccp is 1. Thus it’s store in cell number 2. Let’s assume you entered 3.

[2] [3] [0] [0] .... 
     ^

4. Then “<” operator moves the ccp to first cell.

[2] [3] [0] [0] .... 
 ^

5. “[” operator checks if value of current cell is greater than 0. Since in this case it is (as current cell holds 2), compiler moves on to operator next to “[“. Cell diagram remains same.

6. Following 2 operators are “>”, hence the ccp is moved to 3rd cell. The reason being we would like store result in 3rd cell.

[2] [3] [0] [0] .... 
         ^

7. “+” operator increments the value at current cell which is 0 at 3rd cell. Now cells look something like this :

[2] [3] [1] [0] .... 
         ^

8. Next two operators are “<“. So ccp points to cell number 0 now.

[2] [3] [1] [0] .... 
 ^

9. The succeeding “-” operator decrements value of current cell by 1.

[1] [3] [1] [0] .... 
 ^

10. “]” then checks if value at current cell is greater than 0 (here 1). So compiler jumps to corresponding “[” operator.

Now steps from 5 to 10 are repeated only this time value to first cell is 1 instead of 2. So by the time we reach step 10 in next iteration value at first cell would be 0. So at step 10 instead of moving to corresponding “]” operator compiler moves to next operator in program. So essentially we have copied the value of 1st cell into 3rd cell. Now the cells look like this :

[0] [3] [2] [0] ....  
 ^

11. “>” operator next to “]” is processed since value at first cell is 0.

[0] [3] [2] [0] ....  
     ^

12. “[” checks if value at ccp (2nd cell) is greater than 0. Since value is 3 here compiler moves to next operator in program. Cell diagram remains same.

13. “>” operator moves pointer to 3rd cell.

 [0] [3] [2] [0] ....  
          ^

14. “+” increments value at 3rd cell.

[0] [3] [3] [0] ....  
         ^

15. “<” operator moves pointer to previous cell.

[0] [3] [3] [0] ....  
     ^

16. “-” decrements current cell value by 1.

[0] [2] [3] [0] ....  
     ^

17. “]” checks if current cell (2nd cell) value is greater than 0. So compiler processes corresponding “[” operator.

Steps 12 to 17 are repeated till 2nd cell value is 0. And thus value of 3rd cell is increment that many times. Now the value of 3rd cell represents the sum of 2 inputs.

18. By the end of step 17 ccp points at 2nd cell. To get result we need to point to 3rd cell. This is achieved by “>”

[0] [0] [5] [0] ....  
         ^

19. Now all we have to do is fetch the result in 3rd cell. “.” comes to our rescue which returns value at current cell.

Phew!!! right? This is one of the simplest programs. How would you multiply two numbers? You would say “multiplication of a by b is simple addition of b, a times. So I can use above concept and do addition“. Correct. But on taking a closer look you would realize that after first iteration i.e. after adding b once using loop shown above you would have consumed it’s value and cell value would be 0 at the end of loop. How would you persist the value through iterations? I would suggest you try it yourself.

Here is my implementation of multiplication in Brainfuck :

,>,<[>[>+>+<<-]>>[<<+>>-]<<<-]>>.

This program has 2 nested loops. Cell 1 holds 1st number and Cell 2 holds 2nd number. Outer loop for 1st number and inner loop for 2nd number. But the trick is for 2nd number in each iteration when I decrement it’s value, I increment value at another cell. In other words in each iteration I decrement and copy the value of 2nd number. Once value of cell 2 becomes 0 and it comes out of loop I copy the value back to 2nd cell using this section “[<<+>>-]”. Thus preserving it’s value.

Although you won’t find any practical applications for this language writing programs or compiler for it would give you a good understanding of how compilers work and how difficult the things are that we take for granted in our daily programming. You can write other programs and try out my interpreter for checking outputs. You can find the source code in my github repo here. Please feel free to reach out to me if you hit a roadblock or if you have any suggestions.

Adios till next time.