Basic Statistical Analysis of CGPA Data of VIT Students using Python

I couldn't post much over the past two months or so, as I was preoccupied with internship work.
Just as I finished my final presentation earlier this morning, I stumbled on a list of over 2000 VIT students, along with their registration numbers and CGPAs, publicly available on a web portal that the placement office uses.
The statistics-obsessed nerd in me now had something to spend the rest of the day on before logging off; data like this had been hard to get hold of. Here's what I did and learnt...

Note that this data is not a randomly selected subset of all students. It is the set of students who voluntarily applied to sit for the selection process of Amazon, which is very selective; consequently, the average CGPA of this set is probably somewhat higher than that of the larger set of all students. Further, the set includes students from all branches, even many mechanical engineering students; these are presumably students inclined towards coding and are thus not properly representative of their branch. Take all results of this analysis with a grain of salt.

The following was written in Python 3.x using Jupyter Notebook (formerly known as IPython Notebook). If you are not familiar with it, do check it out, as it's pretty handy for working on little things like this. It isn't strictly required, though, if you have Python on your system.

# In[1]:

from bs4 import BeautifulSoup as bs
import re
from numpy import percentile, mean
from scipy.stats import skew
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(color_codes=True)
print("imported libs")

imported libs

Before this, make sure you have the above packages installed; just type "sudo pip3 install <package name>" into your terminal to do so. (The bs4 import comes from the beautifulsoup4 package.)

# In[2]:

"Getting The Data"

with open("/home/rohan/Desktop/ci.html", 'rb') as f:
    print("opened file")
    soup=bs(f.read(),'html')
    #table=soup.find('table')
    headers=["name","branch",'regno','cgpa','campus']
    #tbody=table.find('tbody')
    #rows=tbody.find_all('tr')
    rows=soup.find_all('tr')
    data=[]
    for row in rows:
        eles=row.find_all('td')
        # skip header or malformed rows that lack the five expected cells
        if len(eles)<5:
            continue
        dic={a:b for a,b in zip(headers, [ele.text.strip('\r\n') for ele in eles])}
        dic['cgpa']=float(dic['cgpa'])
        # values above 10 are presumably missing a decimal point; scale them back into range
        while dic['cgpa']>10:
            dic['cgpa']/=10
        # collapse stray whitespace and punctuation in the text fields
        for a in ["name","branch",'regno','campus']:
            dic[a]=re.sub(r'\W+', ' ', dic[a]).strip()
        dic['regno']=dic['regno'].upper()
        data.append(dic)
    print("parsed "+str(len(data))+" entries")

opened file
parsed 2654 entries

Before this, I just saved the webpage as ci.html, opened it, and removed everything other than the HTML table to keep things simple. The Python code then opens that file and parses it using Beautiful Soup, a really convenient library for any web mining projects you may want to do.
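
The per-row cleaning done inside that loop can be tried in isolation. The following is a minimal sketch rather than the exact notebook code: `clean_row` is a name made up for this illustration, and the sample row is invented, not taken from the portal. It assumes, as the loop does, that a CGPA greater than 10 is a value whose decimal point was lost (for example 856 standing for 8.56).

```python
import re

def clean_row(raw):
    """Normalise one scraped row (a dict of strings), mirroring the parsing loop."""
    row = dict(raw)
    row["cgpa"] = float(row["cgpa"])
    # Assumption: values above 10 lost their decimal point, so scale them back.
    while row["cgpa"] > 10:
        row["cgpa"] /= 10
    # Collapse runs of non-word characters (newlines, punctuation) into one space.
    for key in ("name", "branch", "regno", "campus"):
        row[key] = re.sub(r"\W+", " ", row[key]).strip()
    row["regno"] = row["regno"].upper()
    return row

# Invented sample row, for illustration only.
sample = {"name": " A.\r\nStudent ", "branch": "CSE", "regno": "13bce0001",
          "cgpa": "856", "campus": "Vellore\n"}
cleaned = clean_row(sample)
print(cleaned)
```

Note that the repeated division is a blunt fix: it would also quietly "repair" a genuinely wrong entry, so it is worth eyeballing any rows it touches.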

# In[11]:

"Making Some Simple Calculations using numpy"

IGNORE_MASTERS=True
IGNORE_NON_MAIN_CAMPUS_IN_BRANCHWISE=True

branchWise={}
campusWise={}

for i in data:
    # the third character of the registration number marks the programme ('M' = masters)
    if IGNORE_MASTERS and i['regno'][2].upper()=='M':
        continue
    if i['campus'] not in campusWise.keys():
        campusWise[i['campus']]={'data': []}
    campusWise[i['campus']]['data'].append(i)
    # keep only the main (Vellore) campus in the branch-wise grouping
    if IGNORE_NON_MAIN_CAMPUS_IN_BRANCHWISE and i['campus'].lower()[0]!='v':
        continue
    # characters 3 to 5 of the registration number form the branch code, e.g. 'BCE'
    if i['regno'][2:5] not in branchWise.keys():
        branchWise[i['regno'][2:5]]={'data': []}
    branchWise[i['regno'][2:5]]['data'].append(i)

def printDetails(dataDict):
    # print one row of summary statistics per group and store them in dataDict
    print("-"*69)
    print("type\tn\tAvg\t1stQtl\tMedian\t3rdQtl\tSkew\tavg-median")
    print("-"*69)
    for i in sorted(dataDict.keys()):
        thisData=[j['cgpa'] for j in dataDict[i]['data']]
        dataDict[i]['quartiles']=percentile(thisData, [25,50,75])
        dataDict[i]['avg']=mean(thisData)
        dataDict[i]['skewness']=skew(thisData)
        print(str(i), end="\t")
        print(len(dataDict[i]['data']), end="\t")
        print(round(dataDict[i]['avg'],3), end="\t")
        print(round(dataDict[i]['quartiles'][0],3), end="\t")
        print(round(dataDict[i]['quartiles'][1],3), end="\t")
        print(round(dataDict[i]['quartiles'][2],3), end="\t")
        print(round(dataDict[i]['skewness'],3), end="\t")
        print(round(dataDict[i]['avg']-dataDict[i]['quartiles'][1],3), end="\n")
    print("\n\n")

printDetails(branchWise)
printDetails(campusWise)
printDetails({'Total':{'data': data}})

Here I have printed out statistical measures for the CGPA of the students, viz. number of entries, arithmetic mean, first quartile, median, third quartile, skewness and mean minus median; first branch-wise, then campus-wise and finally for all the data.
We can observe that across all branches the mean CGPA stays very close to 8.50, with major deviations occurring only for branches with very few students in the dataset.
This shows that there is some consistency across the schools in terms of grading, thanks to the relative grading system. (However, according to that system the mean should lie around 8.0, not 8.5, which is interesting.)
The 1st and 3rd quartiles seem to be equidistant from the mean (about 0.35 away), and the skewness is very close to 0 but slightly negative; a negative skew was expected, though I had expected it to be larger.
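
To make the columns of that table concrete, here is a small sketch that computes the same measures by hand using only the standard library. The five CGPA values are invented for illustration, and `describe` is a hypothetical helper, not part of the notebook. As far as I know, `statistics.quantiles(..., method="inclusive")` interpolates the same way as numpy's default `percentile`, and the skewness formula below is the biased moment ratio that `scipy.stats.skew` returns by default; treat both as assumptions if exact agreement matters.

```python
from statistics import fmean, quantiles

def describe(values):
    """Return mean, quartiles, skewness and mean-minus-median for a list of numbers."""
    n = len(values)
    avg = fmean(values)
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    # Biased sample skewness: third central moment over the 1.5th power of the second.
    m2 = sum((v - avg) ** 2 for v in values) / n
    m3 = sum((v - avg) ** 3 for v in values) / n
    skewness = m3 / m2 ** 1.5
    return {"n": n, "avg": avg, "quartiles": (q1, median, q3),
            "skewness": skewness, "avg_minus_median": avg - median}

# Invented values with a longer tail on the low side.
stats = describe([7.0, 8.0, 8.5, 9.0, 9.5])
print(stats)
```

A negative skewness together with a mean below the median is what a longer tail towards low CGPAs looks like, which is why the table prints both.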

Now, in the following few steps, I generate a histogram of the CGPA of all B.Tech students at VIT - Main Campus, overlaid with a Kernel Density Estimate (wiki: Kernel_density_estimation). Then I repeat the same for a few specific branches of engineering. If you aren't familiar with the nomenclature, it is as follows:

BCE= Bachelors in Technology: Computer Science and Engineering
BCI= Bachelors in Technology: Computer Science and Engineering with Specialisation in Information Security (gimmick)
BIT= Bachelors in Technology: Information Technology
BME= Bachelors in Technology: Mechanical Engineering
etc.
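
These codes are read straight off the registration number: in the code in this post, the third character marks the programme (`B` for B.Tech, `M` for masters) and characters three to five form the branch code. A minimal sketch of that slicing follows; `branch_code` and `is_btech` are names invented here, and the registration numbers (including the `MCS` code) are made up for illustration.

```python
def branch_code(regno):
    """Return the three-letter branch code, e.g. 'BCE', from a registration number."""
    return regno.upper()[2:5]

def is_btech(regno):
    """True when the programme letter (third character) is 'B'."""
    return regno.upper()[2] == "B"

# Made-up registration numbers, for illustration only.
for regno in ["13BCE0001", "13bit0042", "14MCS0007"]:
    print(regno, branch_code(regno), is_btech(regno))
```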

# In[4]:

sns.distplot([i['cgpa'] for i in data if i['regno'][2]=='B'],  label='All')
plt.legend()

Fig. 1

sns.distplot([i['cgpa'] for i in data if i['regno'][2:5]=='BCI'], bins=20, label='BCI')
plt.legend()

Fig. 2

sns.distplot([i['cgpa'] for i in data if i['regno'][2:5]=='BCE'], bins=50, label='BCE')
plt.legend()

Fig. 3

sns.distplot([i['cgpa'] for i in data if i['regno'][2:5]=='BIT'], bins=50, label='BIT')
plt.legend()

Fig. 4
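
For anyone who has not met a kernel density estimate before, the smooth curve drawn over each histogram can be approximated by hand: place a small Gaussian bump on every data point and average them. The sketch below does this with the standard library only. The function name `gaussian_kde_at`, the bandwidth of 0.2 and the sample CGPAs are all assumptions made for illustration; seaborn picks its bandwidth automatically, so its curve will not match this one exactly.

```python
import math

def gaussian_kde_at(x, samples, bandwidth):
    """Estimate the density at x as the average of Gaussian bumps centred on each sample."""
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))
    total = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return norm * total / len(samples)

# Invented CGPAs clustered around 8.5.
samples = [8.1, 8.3, 8.4, 8.5, 8.5, 8.6, 8.7, 9.0]
grid = [6.0 + 0.01 * i for i in range(501)]  # 6.00 to 11.00 in steps of 0.01
density = [gaussian_kde_at(x, samples, bandwidth=0.2) for x in grid]

# A density should integrate to roughly 1 over a wide enough range (Riemann sum).
area = sum(density) * 0.01
peak_x = grid[density.index(max(density))]
print(round(area, 3), round(peak_x, 2))
```

A smaller bandwidth gives a bumpier curve that follows individual students, while a larger one smooths real features away, so apparent extra peaks in a small sample should be read with that in mind.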

Next, I plot the histogram and Kernel Density Estimate for students of each of the major B.Tech branches in our dataset against a KDE for the entire dataset of all B.Tech students.

# In[8]:

for branch in branchWise.keys():
    if len(branchWise[branch]['data'])<100:
        continue
    plt.figure()
    sns.kdeplot([i['cgpa'] for i in data], label="All")
    sns.distplot([i['cgpa'] for i in branchWise[branch]['data']], bins=30, label=branch)
    plt.legend()

Next I plot the KDE for the CGPA of Vellore Campus B.Tech Students against that of their Chennai campus counterparts.
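
The following is a minimal sketch of how such a campus comparison could be drawn; it is not necessarily the cell that produced the plot. It assumes records shaped like the `data` list built earlier, with the campus stored as a plain label such as "Vellore" or "Chennai" (the exact strings on the portal may differ). The helper names and the sample records are invented for illustration.

```python
from collections import defaultdict

def btech_cgpas_by_campus(records):
    """Group the CGPAs of B.Tech students (third regno character 'B') by campus."""
    groups = defaultdict(list)
    for rec in records:
        if rec["regno"][2].upper() == "B":
            groups[rec["campus"]].append(rec["cgpa"])
    return dict(groups)

def plot_campus_kdes(groups):
    """Overlay one KDE curve per campus; needs seaborn and matplotlib installed."""
    import matplotlib.pyplot as plt
    import seaborn as sns
    plt.figure()
    for campus, cgpas in sorted(groups.items()):
        sns.kdeplot(cgpas, label=campus)
    plt.legend()

# Invented records, for illustration only.
records = [
    {"regno": "13BCE0001", "campus": "Vellore", "cgpa": 8.6},
    {"regno": "13BIT0002", "campus": "Vellore", "cgpa": 8.2},
    {"regno": "13BCE1003", "campus": "Chennai", "cgpa": 8.9},
    {"regno": "14MCS0004", "campus": "Vellore", "cgpa": 9.1},  # masters, skipped
]
groups = btech_cgpas_by_campus(records)
print(groups)
# With the real data: plot_campus_kdes(btech_cgpas_by_campus(data))
```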

There are some interesting observations to be made from these graphs...

As we observed earlier in the data table, all branches have similar means and variances. This consistency among the various schools yields a very clean histogram of the CGPA of all the B.Tech students; it is not a multi-modal distribution, as one might speculate. (See Fig. 1)
Contrary to the previous observation, the plot for BCI (CS with InfoSec) has a very clear bimodal distribution, which is very odd; the same is observed to a much lesser degree for BCE (CS) students. (See Fig. 2, 3) Some hypotheses I can come up with are as follows (these are not substantiated by anything and I don't intend to change that in this article, so take them with a pinch of salt):

Although grading is relative within each class at VIT (so that consistency exists), the peculiarities of individual professors can significantly alter the distribution of their class compared to other classes. This can play a role for the BCI students, since there are only a few BCI batches and hence only a few BCI-specific professors. Such varied behaviour could be, among other things, a tendency to give marks such that the absolute-marks criterion for a perfect S grade is met by fewer or more people than in other classes. But this alone means nothing; the anomaly could entirely be a fluke, as the dataset had only 106 BCI students.

The graph for the Chennai campus is even sharper than that of the Vellore campus; this may just be because there are fewer students and fewer individual schools' histograms superimposing to form a sort of plateau.
The average CGPA is much higher than expected from looking at the grading system, where a perfectly mediocre student should ideally get an 8.0 CGPA. This puts me and a lot of my pretty competent friends in the average/below-average category, which is either a rather humbling fact or a disillusioning and unsettling one, depending on how much or how little you believe in the system and in yourself; I'll let you decide. For me, it's just a mildly interesting observation, and that's all I was looking for.

Hope you enjoyed the read, do drop feedback/queries in the comments.
