Bruno Barreto

Data Scientist · bruno@barreto.us

I am a data scientist with a background in bioinformatics and a passion for designing large-scale, future-proof solutions to modern problems. My methodical and measured approach to solving novel problems allows me to create intuitive and polished machine learning pipelines that effectively clean real-world datasets, visualize underlying trends, and optimize finely-tuned models with an eye towards future considerations and model improvements.


Portfolio

Accident Severity Predictor



The goal of this project was to build a model that can predict the severity of a flight accident from a formal report detailing the events that led up to it. The model would then be used to determine the risk factors that lead to lethal accidents, so that future regulations and aviation guidelines could be directed towards addressing them.

To accomplish this, a dataset of 30,000 U.S. flight accident reports and associated metadata was assembled from the National Transportation Safety Board. This data was cleaned of missing or incomplete entries and underwent natural language processing to extract key feature information and remove self-referential terms that would otherwise distort the model's results.

The highest-performing predictive model had its feature weights analyzed to determine the terms that best predicted accident severity. From this, we were able to determine that the improper installation and maintenance of key components such as the airframe or engine transmission were disproportionately important causes of high-severity flight accidents.

BTS Aviation Delay Model



The goal of this project was to work with the Bureau of Transportation Statistics to design a model of commercial flight delays in the U.S. that could be used to identify the most important data points for delay prediction and direct the Bureau's future data collection efforts towards the highest-priority information.

Data on 6,000,000 commercial flights collected by the BTS across 2022 was obtained and used to train multiple predictive classifier models. All data was exhaustively cleaned of missing or irrelevant information. Final model performance highlighted the uncomfortable reality that the BTS' flight delay dataset has no meaningful connection with flight delays and cannot be used to predict them. A cursory review of the scientific literature was coupled with model results to direct the BTS on what data should be added to improve the predictive power of its information.

Subreddit Community NLP Classifier



The goal for this project was to design a model using natural language processing techniques that could determine whether a given post originated from the r/playstation or r/xbox subreddits and identify the differences in discussion topics that make these two communities distinct from one another.

Utilizing the Pushshift API, 10,000 subreddit posts from both r/playstation and r/xbox were collected. These posts were then cleaned of missing or self-referential data that could allow the model to circumvent its task and subsequently vectorized for model training.

A logistic regression model with 85% accuracy was chosen as the highest-performing model and analyzed to determine the topics within each community. As the lower model accuracy shows, the two communities are very similar overall and primarily differ in the platform-specific features they prioritize discussing. The r/playstation subreddit focuses on PlayStation-exclusive titles and network features, while r/xbox discusses Xbox-exclusive titles and GamePass offers.

Attention-Based Movie Review Classifier



The goal of this project was to design a concise attention-based natural language model that could predict the rating given by movie reviews taken from IMDb using only the text of the movie review.

Eclipse Frequency Analysis



The goal of this project was to determine if there was any relationship between the location of a country on Earth and the frequency of eclipses that occur there. In addition, this project also aimed to determine if any patterns existed in the incidence of different types of eclipses within a given region.

Wikipedia User Analysis by Platform



The goal of this project was to determine if there are any quantifiable differences in pageview trends for Wikipedia users depending on which platform (desktop or mobile) they use to view the site. In particular, this project aimed to determine if the peak or average pageview counts significantly differed between platforms.


Skills

Languages & Tools

- Python
- Jupyter Notebook
- Git
- GitHub
- Streamlit
- SQL
- TensorFlow
- PyTorch
- Azure
- PySpark
- Java

Data Analysis & Modeling

- Data collection
- Exploratory data analysis
- Feature engineering
- Regression modelling
- Natural language processing
- Classification modelling
- Deep learning
- Computer vision
- Clustering models
- Generative modelling


Education

University of Washington

Master of Science in Data Science
September 2023 - April 2025

General Assembly

Data Science Immersive
October 2022 - February 2023

University of Washington

Bachelor of Science in Bioengineering with Data Science
September 2018 - June 2022

Interests

As a data scientist, I enjoy reading up on the most cutting-edge developments in the world of machine learning and seeing how they can be applied to my work. In my free time, I'm currently working on improving a Stable Diffusion image generator model for use in audio and protein modelling.

Outside of my work in data science, I can often be found enjoying my ever-growing collection of books and games, hiking across the trails of the Midwest, or searching for yet another restaurant with a knack for preparing fried rice. Despite all these options, there is a good chance you'll instead find me playing Minecraft for the 10,000th time.