Machine Learning Developer, Data Scientist, Data Engineer, Data Analyst
I develop end-to-end Machine Learning solutions and perform statistical analysis and econometric modelling. I have worked on many projects in various domains such as healthcare, fintech, telco, IoT and more. I have engaged in every step of the Machine Learning lifecycle, from understanding the business problem and customer needs at the start to product deployment or visualization at the finish. As an Applied Mathematics graduate with a Master's degree, I have a strong background in statistics, optimization, quantitative finance, econometrics and computer science.
Please look at my CV here or my GitHub profile for examples of my work, or contact me for more details on my experience.
Summary of projects
Here is a brief summary of the projects I have worked on. The deliverables range from dockerized applications to presentations of results. They are ordered starting with the most recent. In addition, there are several smaller projects in my GitHub repositories or in the blog section of this page that are not listed here.
Detection of cut documents
Specific personal identification documents are invalidated by cutting their corners. We trained a CNN segmentation model to localize the document in the image and employed computer vision techniques to validate all corners of the document. We had to account for varying photo quality and any kind of perspective rotation. We implemented a Python microservice using gRPC and dockerized it.
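As an illustration of the corner-validation step, here is a minimal sketch, assuming the CNN has already produced a binary document mask; the function name and tolerance are hypothetical. A cut corner shows up as an extra vertex in a polygon approximation of the document outline.

```python
import cv2
import numpy as np

def corners_intact(mask: np.ndarray, eps_frac: float = 0.02) -> bool:
    """Return True if the document outline approximates a clean quadrilateral.

    `mask` is a binary uint8 segmentation output. A cut corner adds an extra
    vertex to the polygon approximation, so more than four vertices marks
    the document as suspect.
    """
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return False
    doc = max(contours, key=cv2.contourArea)     # largest blob = the document
    eps = eps_frac * cv2.arcLength(doc, True)    # approximation tolerance
    approx = cv2.approxPolyDP(doc, eps, True)
    return len(approx) == 4
```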
Air-quality and road meteorology IoT sensors
I worked on numerous data analyses and hypothesis tests of sensor accuracy. I designed a process to calibrate air-quality sensors using regression models and evaluated the metrics in several environments (statsmodels). I implemented a method to spatially interpolate road temperature from several static stations (geopandas, pykrige, rasterio, …). I created and managed a TimescaleDB database to store measurements and results and visualized them in Grafana deployed on an in-house cloud (docker, jenkins, DCOS).
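As a sketch of the calibration idea: fit a regression of reference measurements on raw sensor readings plus environmental covariates, then apply it to new readings. The column names and toy values below are hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Raw low-cost sensor readings co-located with a reference station (toy data).
df = pd.DataFrame({
    "pm25_raw": [12.1, 15.3, 9.8, 20.4, 18.7, 14.2],
    "humidity": [40.0, 55.0, 38.0, 70.0, 65.0, 50.0],
    "pm25_ref": [10.5, 13.9, 9.1, 17.2, 16.0, 12.8],
})

# Linear calibration with humidity as a correction covariate.
model = smf.ols("pm25_ref ~ pm25_raw + humidity", data=df).fit()
df["pm25_calibrated"] = model.predict(df)
print(model.params)
```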
Spatial estimation of traffic density
Some roads are manually monitored only every 5 years. To estimate the rest, I extracted publicly available map data features (population density, proximity to buildings, schools, factories, …) and used regression kriging to spatially estimate traffic density for every unaccounted road (pykrige, xgboost, geopandas, geospark, scipy.spatial for spatial indexing).
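A minimal sketch of regression kriging with pykrige, assuming arrays of map features, projected coordinates and measured traffic densities; the data below is synthetic.

```python
import numpy as np
from pykrige.rk import RegressionKriging
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
features = rng.random((200, 3))      # e.g. population density, proximity features
coords = rng.random((200, 2)) * 100  # projected x/y coordinates of road segments
traffic = rng.random(200) * 1000     # measured traffic density

# Regression on the map features, then kriging of the residuals.
rk = RegressionKriging(regression_model=XGBRegressor(n_estimators=50),
                       variogram_model="spherical", n_closest_points=10)
rk.fit(features, coords, traffic)

# Estimate traffic for unmonitored roads from their features and locations.
pred = rk.predict(rng.random((10, 3)), rng.random((10, 2)) * 100)
```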
Uplift modelling - TELCO
We created a demo for estimating uplift, the incremental effect of giving an offer to a specific customer, so that in the end we can choose the offer with the highest uplift for each customer. To estimate uplift we used causal modelling techniques: propensity score matching/weighting and causal trees.
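As one concrete piece of the approach, here is a minimal sketch of an inverse-propensity-weighted effect estimate on synthetic data; the causal-tree part is not shown.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((1000, 5))           # customer features
treated = rng.integers(0, 2, 1000)  # 1 = customer received the offer
outcome = rng.integers(0, 2, 1000)  # 1 = customer converted

# Propensity of receiving the offer given the features.
ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# Inverse-propensity-weighted estimate of the average treatment effect.
ate = np.mean(treated * outcome / ps) - np.mean((1 - treated) * outcome / (1 - ps))
print(f"estimated uplift (ATE): {ate:.3f}")
```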
Cancer incidence rates prediction
The National Cancer Registry periodically publishes reports on cancer statistics. Cancer incidence is one of the important statistics, describing the number of new cases, and is often used to describe historical trends. Predicting incidence rates for future years can help plan countermeasures for a specific population. We used the Age-Period-Cohort model framework, dropping insignificant factors, to predict long-term incidence rates. Since cancer incidence data lag several years behind, we used external data for better short-term prediction.
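A common way to fit an age-period-cohort style model is a Poisson GLM on case counts with log-population as an offset. The sketch below is a simplified illustration on synthetic data, with the cohort term omitted to keep the toy example identifiable; the full APC framework includes it and drops insignificant factors.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Toy incidence table: new cases and population by age group and period.
df = pd.DataFrame({
    "age":    [50, 50, 60, 60, 70, 70],
    "period": [2010, 2015, 2010, 2015, 2010, 2015],
    "cases":  [120, 130, 340, 360, 500, 540],
    "pop":    [100_000, 110_000, 90_000, 95_000, 70_000, 72_000],
})

# Poisson GLM on counts with a log(population) offset models the incidence rate.
model = smf.glm("cases ~ C(age) + C(period)", data=df,
                family=sm.families.Poisson(),
                offset=np.log(df["pop"])).fit()
print(model.params)

# Fitted incidence rates per 100 000 for the observed cells.
df["fitted_rate"] = model.fittedvalues / df["pop"] * 100_000
```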
Spoof detection
Spoof detection in face biometrics is the ability to detect whether a face is real or fake (a face printed on paper or displayed on a screen). The detection had to work on a single static photo. Using OpenCV and dlib, I extracted normalized face images without distorting the texture or artifacts that are present when the photo is fake. I trained three different computer vision models for spoof detection. The first uses an SVM to classify fakes based on texture, reflection, color and sharpness features. The second is a CNN classifier with random patch-region extraction to prevent overfitting due to limited training data. The last is a fully convolutional network trained to estimate 3D face depth. Both convolutional models performed reasonably well considering the input is just a static photo.
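To illustrate the first model, here is a minimal sketch with two hand-crafted features, sharpness via Laplacian variance and colour-saturation statistics, standing in for the full texture/reflection/color/sharpness feature set; the faces below are synthetic stand-ins for normalized crops.

```python
import cv2
import numpy as np
from sklearn.svm import SVC

def face_features(face_bgr: np.ndarray) -> np.ndarray:
    """Sharpness and colour statistics of a normalized face crop."""
    gray = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2GRAY)
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()  # prints/screens tend to blur
    hsv = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2HSV)
    return np.array([sharpness, hsv[..., 1].mean(), hsv[..., 1].std()])

# Synthetic stand-ins for face crops; labels: 1 = real, 0 = spoof.
rng = np.random.default_rng(0)
faces = [rng.integers(0, 256, (128, 128, 3)).astype(np.uint8) for _ in range(20)]
labels = [i % 2 for i in range(20)]

X = np.stack([face_features(f) for f in faces])
clf = SVC(kernel="rbf").fit(X, labels)
```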
Classify fraud benefit claims
Due to the sensitive topic and data, I provide only a basic description. We trained an xgboost classifier on highly categorical big data. The categories have a hierarchical structure, which is important for accounting for similarities between different categories. Pyspark was used for feature engineering.
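One way to expose a category hierarchy to a tree model, sketched below on hypothetical codes, is to encode each level of the hierarchy as its own feature so that similar leaf categories share their parent codes; this is an illustration, not the exact encoding used in the project.

```python
import pandas as pd
from xgboost import XGBClassifier

# Hypothetical hierarchical category codes, e.g. "chapter.group.item".
df = pd.DataFrame({
    "code":  ["A.1.x", "A.1.y", "A.2.x", "B.1.x", "B.2.y", "A.2.y"],
    "fraud": [0, 0, 1, 0, 1, 1],
})

# One feature per hierarchy level: similar leaves share their parent codes.
levels = df["code"].str.split(".", expand=True)
for i in range(levels.shape[1]):
    df[f"level_{i}"] = levels[i].astype("category").cat.codes

X = df[[c for c in df.columns if c.startswith("level_")]]
clf = XGBClassifier(n_estimators=50, max_depth=3).fit(X, df["fraud"])
```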
Banking Process Mining
Each application for a bank product undergoes an approval process. We used pyspark to extract features for each stage of the process and trained an LSTM neural network (tensorflow) to predict the next steps of the process.
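A minimal sketch of the next-step model, assuming each case is represented as a fixed-length sequence of integer-encoded process stages; the shapes and data below are illustrative.

```python
import numpy as np
import tensorflow as tf

n_stages, seq_len = 12, 20
# Synthetic cases: the past `seq_len` stage ids and the next stage to predict.
X = np.random.randint(0, n_stages, size=(500, seq_len))
y = np.random.randint(0, n_stages, size=(500,))

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(n_stages, 16),  # learn a vector per process stage
    tf.keras.layers.LSTM(32),                 # summarize the history of the case
    tf.keras.layers.Dense(n_stages, activation="softmax"),  # next-stage distribution
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(X, y, epochs=2, batch_size=32, verbose=0)
```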
Predict the demand for mortgages
Mortgage applications create load on the bank's back office - risk approval, real estate checks, … Predicting the demand for mortgages, i.e. the number of submitted applications, several weeks ahead would help with back-office resource management and planning.
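As an illustration of the forecasting task, here is a minimal sketch using an ARIMA model from statsmodels on a synthetic weekly series of application counts; the project's actual model may differ.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Synthetic weekly series of submitted mortgage applications.
rng = np.random.default_rng(0)
weeks = pd.date_range("2023-01-01", periods=104, freq="W")
applications = pd.Series(
    100 + 10 * np.sin(np.arange(104) / 8) + rng.normal(0, 5, 104), index=weeks
)

# Fit a simple ARIMA and forecast the load four weeks ahead.
model = SARIMAX(applications, order=(2, 1, 1)).fit(disp=False)
print(model.forecast(steps=4))
```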
Customer Relationship Management data in TELCO
We looked for valuable insights in the Customer Relationship Management (CRM) data of a telecommunications company. Using nltk and NLP techniques (tokenization, lemmatization and word embeddings), we extracted features from the free-text commentary on customers' interactions with the company's employees. Using t-SNE dimensionality reduction, we visualized the customers in clusters.
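A minimal sketch of the pipeline shape, using TF-IDF in place of the word embeddings for brevity; the comments below are synthetic.

```python
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)  # needed by newer nltk releases

# Synthetic free-text comments from customer interactions.
comments = [
    "customer complained about roaming charges",
    "asked to upgrade the data plan",
    "cancelled contract after a billing issue",
    "happy with the new family tariff",
]

tokens = [" ".join(nltk.word_tokenize(c.lower())) for c in comments]
vectors = TfidfVectorizer().fit_transform(tokens)

# Project the customers to 2D for cluster visualization.
coords_2d = TSNE(n_components=2, perplexity=2.0).fit_transform(vectors.toarray())
```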
Efficiency of healthcare providers based on economic data
We calculated the economic efficiency of healthcare providers using Data Envelopment Analysis (DEA). DEA models estimate an efficiency frontier by optimizing over the input and output matrices for each subject in the analysis. DEA was done using numpy and scipy.optimize, data manipulation with pyspark.
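For illustration, an input-oriented CCR DEA model can be solved as one small linear program per subject: minimize θ such that a non-negative combination of all subjects uses at most θ times the subject's inputs while producing at least its outputs. A minimal sketch on synthetic data:

```python
import numpy as np
from scipy.optimize import linprog

X = np.array([[20.0, 300], [30, 200], [40, 100], [20, 200], [10, 400]])  # inputs
Y = np.array([[100.0], [80], [90], [60], [70]])                          # outputs
n = len(X)

def efficiency(o: int) -> float:
    """Input-oriented CCR score of subject o (1.0 = on the frontier)."""
    c = np.r_[1.0, np.zeros(n)]                           # minimize theta
    A_in = np.hstack([-X[[o]].T, X.T])                    # X @ lam <= theta * x_o
    A_out = np.hstack([np.zeros((Y.shape[1], 1)), -Y.T])  # Y @ lam >= y_o
    res = linprog(c, A_ub=np.vstack([A_in, A_out]),
                  b_ub=np.r_[np.zeros(X.shape[1]), -Y[o]],
                  bounds=[(None, None)] + [(0, None)] * n)
    return res.fun

print([round(efficiency(o), 3) for o in range(n)])
```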
Bill of Materials (BOM) checking
Learning patterns from correct examples to find mistakes in new designs. Read more
Geospatial analytics of industrial machinery and vehicles
A company developing IoT devices that collect data from industrial machines, including GPS coordinates, was looking to expand its data analytics. We clustered the location points to classify whether a vehicle is working, to detect construction sites, and more. Read more
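As an illustration of the clustering step, here is a minimal sketch using DBSCAN with a haversine metric to group dense GPS pings into candidate work sites; the coordinates and thresholds are hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic GPS pings of one machine around a hypothetical site centre.
rng = np.random.default_rng(0)
site = np.array([48.15, 17.11])             # lat/lon in degrees
pings = site + rng.normal(0, 0.0005, (200, 2))

EARTH_RADIUS_M = 6_371_000
eps_m = 50                                  # pings within ~50 m form a cluster
db = DBSCAN(eps=eps_m / EARTH_RADIUS_M, min_samples=5, metric="haversine")
labels = db.fit_predict(np.radians(pings))  # -1 = noise, e.g. transit between sites
```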