Projects
1. West Nile Virus incidence rate prediction across the US
- The main aim of this project was to predict the West Nile Virus (WNV) incidence rate (IR) for every county in the US. This prediction would help city authorities spray for mosquitoes at the right time, killing the larvae of the disease-carrying mosquitoes
- Based on our research, several factors were found to affect the WNV IR, including population, socio-economic features, mosquito trap data, and weather
- The data was collected from a variety of sources, including NARR, the US Census Bureau, and NASA (for weather). All of the large-scale, multidimensional datasets were cleaned and processed using Python scripts and OpenRefine
- Missing data points in the socio-economic features were imputed using a Vector Autoregression (VAR) model
- Exploratory data analysis was conducted to check the effect of entropy, cumulative degree days, temporal lag, and neighbouring-county cases on the incidence rate of the current county, and also to understand outliers
- The packages used for visualization were matplotlib, Bokeh, seaborn, Plotly, and ipywidgets
- Several models were built to forecast the WNV IR, including Random Forest, Zero-Inflated Poisson Regression, Long Short-Term Memory (LSTM) networks, and Seasonal ARIMA; the LSTM model achieved an accuracy of more than 90%
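
Two of the features explored above, cumulative degree days and temporal lag, can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the project's actual formula and thresholds are not given here, so the simple averaging method for degree days, the base temperature of 15.0, and the function names `cumulative_degree_days` and `lagged` are all hypothetical, and the temperatures and case counts are made-up values.

```python
from itertools import accumulate


def daily_degree_days(tmin, tmax, base):
    """Degree days for one day using the simple averaging method (an assumption)."""
    mean_temp = (tmin + tmax) / 2
    return max(0.0, mean_temp - base)


def cumulative_degree_days(tmins, tmaxs, base):
    """Running total of daily degree days over a sequence of days."""
    daily = [daily_degree_days(lo, hi, base) for lo, hi in zip(tmins, tmaxs)]
    return list(accumulate(daily))


def lagged(values, lag):
    """Shift a series forward by `lag` steps; the first `lag` entries become None."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    values = list(values)
    return ([None] * lag + values)[:len(values)]


# Made-up daily minimum and maximum temperatures for one county.
tmins = [10, 12, 16, 18]
tmaxs = [20, 24, 28, 30]
cdd = cumulative_degree_days(tmins, tmaxs, base=15.0)  # hypothetical base temperature

# Made-up weekly case counts, lagged by one week for use as a predictor.
cases = [5, 2, 0, 7]
cases_lag1 = lagged(cases, 1)

print(cdd)
print(cases_lag1)
```

The lagged series lines up each week with the previous week's count, which is the form a model needs in order to use past cases as an input for the current period.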
2. Predicting mortality by Acute Lower Respiratory Diseases in the Americas
- The main aim of this project was to understand the patterns of mortality caused by Acute Lower Respiratory Diseases and to predict the mortality rate for the year 2020 in the Americas
- The primary data was provided to us by the WHO and included mortality data for all diseases. This data was cleaned and processed using Python scripts to keep only Acute Lower Respiratory Diseases
- Several other indicators, such as population, number of physicians per 1000, and GDP, were downloaded from data.world, processed with Python, and integrated with the mortality data
- Created an Auto-ARIMA model to check the time-series validity of the indicators
- Built a Vector Autoregression model to perform multivariate time series forecasting and predicted the death count for countries in the Americas with a Mean Absolute Percentage Error of 0.0822
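
For readers unfamiliar with the Mean Absolute Percentage Error (MAPE) mentioned above, the following is a minimal sketch of how it is commonly computed. It assumes the score is expressed as a fraction, in which case 0.0822 would correspond to an average error of roughly 8.2%; the function name `mape` and the death counts below are illustrative only and are not the project's data.

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error, returned as a fraction (0.1 means 10%)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return total / len(actual)


# Made-up yearly death counts and forecasts for a single country.
actual_deaths = [100, 200, 400]
forecast_deaths = [110, 180, 400]

score = mape(actual_deaths, forecast_deaths)
print(f"MAPE: {score:.4f}")
```

Because each error is divided by the actual value, MAPE is comparable across countries with very different death counts, but it cannot be computed for periods where the actual count is zero.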
—
Forecast example for Canada:
—
3. Oscar predictions
- The main aim of this project was to predict which movie will win an Oscar in a particular year based on other awards it has won
- Data from online movie databases such as IMDb, Rotten Tomatoes, and MovieLens was downloaded and processed. Missing data points were web-scraped and cleaned for modelling using Python scripts
- Exploratory analysis was conducted using Python and Jupyter notebooks to identify trends and outliers, detect anomalies, and understand how other awards, critic ratings, and user ratings relate to a movie winning the Oscar
- Implemented Logistic Regression, Naive Bayes, Decision Tree, and Random Forest algorithms using the scikit-learn (sklearn) package to compare different models, and achieved an accuracy of 90% in the resulting predictions
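
The model comparison described above comes down to scoring each model's predictions against the true outcomes and ranking the results. The sketch below shows that step in plain Python rather than with scikit-learn. Everything in it is hypothetical: the labels, the predictions, the helper names `accuracy` and `rank_models`, and the `always_no_win` baseline, which is included only as a reference point because most nominees do not win.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def rank_models(y_true, predictions):
    """Return (model name, accuracy) pairs sorted from best to worst."""
    scores = {name: accuracy(y_true, preds) for name, preds in predictions.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up labels: 1 means the movie won, 0 means it did not.
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

# Made-up predictions from three hypothetical models.
predictions = {
    "logistic_regression": [1, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    "naive_bayes": [1, 0, 1, 1, 0, 0, 1, 1, 0, 0],
    "always_no_win": [0] * 10,
}

ranking = rank_models(y_true, predictions)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Comparing against a trivial baseline like this helps show how much of a reported accuracy comes from the model itself rather than from the class imbalance in the data.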
—