Predicting flood events in Louisiana

First Springboard Capstone Project - June 2020

Purpose and Goal

This was my first capstone project for the Springboard Data Science bootcamp. The goal was to build a model capable of taking in forecast data and returning the likelihood of a flood event occurring in Louisiana. The major phases were data cleaning, exploratory analysis, and training the model.

This page contains a high-level overview, but a more detailed report can be found here.

Stack

Python (pandas, NumPy, pandarallel, seaborn, matplotlib, BeautifulSoup, requests, SciPy, scikit-learn), SQLite


Data Sources

NWS provided storm data containing statistics on personal injuries and damage estimates from 1950 to present. There were 34 different storm event types, including various kinds of floods, hurricanes, thunderstorms, hail, etc. There were 51 columns, covering damage, injuries, deaths, and more. I used Python scripts to download all 213 CSV files, create a SQLite database, and ingest the data into it.
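A minimal sketch of that ingestion step, assuming the CSVs are linked from an NCEI index page and loaded into a single SQLite table (the URL and table name here are illustrative):

```python
import sqlite3

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Index page listing the storm events CSVs (illustrative URL).
INDEX_URL = "https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/"

# Scrape the index page for links to the compressed CSV files.
soup = BeautifulSoup(requests.get(INDEX_URL).text, "html.parser")
csv_links = [a["href"] for a in soup.find_all("a")
             if a.get("href", "").endswith(".csv.gz")]

# Load each file and append it to a single SQLite table.
conn = sqlite3.connect("storm_events.db")
for link in csv_links:
    df = pd.read_csv(INDEX_URL + link, compression="gzip", low_memory=False)
    df.to_sql("storm_events", conn, if_exists="append", index=False)
conn.close()
```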

To analyze potential correlations, I supplemented the storm data with historical meteorological data. NOAA's National Centers for Environmental Information provides Daily Summaries data from numerous stations across the United States, including air temperature, precipitation, and wind speed.

A map of weather stations in Louisiana

Exploratory Data Analysis

There were 668 flash floods, 132 floods, and 11 coastal floods in the dataset. NWS provided documentation defining the storm event types recorded in the database:

  • flash flood: life-threatening, rapid rise of water into a normally dry area beginning within minutes to multiple hours of the causative event (e.g., intense rainfall, dam failure, ice jam)
  • flood: any high flow, overflow, or inundation by water which causes damage
  • coastal flood: flooding of coastal areas due to vertical rise above normal water level caused by strong, persistent onshore wind, high astronomical tide, and/or low atmospheric pressure

The first feature I wanted to check out was precipitation, which I expected to be the most important feature for the model.

A box plot showing the distributions of precipitation by flood type

Wind speed was another anticipated feature of interest because coastal floods are driven by coastal processes such as waves, tides, and storm surges, which are strongly influenced by wind.

A box plot showing the distributions of wind speed by flood type
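As a rough illustration, box plots like these could be produced with seaborn along the following lines, assuming a joined DataFrame `df` with one row per flood event (the column names `event_type`, `prcp`, and `avg_wind_spd` are illustrative):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# `df` is assumed to hold one row per flood event with the meteorological
# readings for that parish and date (column names are illustrative).
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

sns.boxplot(data=df, x="event_type", y="prcp", ax=axes[0])
axes[0].set_title("Precipitation by flood type")

sns.boxplot(data=df, x="event_type", y="avg_wind_spd", ax=axes[1])
axes[1].set_title("Wind speed by flood type")

plt.tight_layout()
plt.show()
```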

The following image is the correlation matrix. The grey cells indicate self-correlations (e.g., `avg_wind_spd` vs. `avg_wind_spd`), pairings that are nonsensical (e.g., region vs. season), or duplicate pairings (each pair, such as `prcp` vs. `avg_wind_spd`, is shown only once). Any relationships that were not statistically significant were also greyed out.

A correlation matrix
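A sketch of how such a masked correlation matrix can be built, assuming the same illustrative column names and a 0.05 significance threshold:

```python
import numpy as np
import seaborn as sns
from scipy.stats import pearsonr

# Numeric features to correlate (column names are illustrative).
cols = ["prcp", "avg_wind_spd", "max_temp", "min_temp"]
corr = df[cols].corr()

# Mask the diagonal and upper triangle (self-correlations and repeated
# pairs), plus any pair whose correlation is not statistically significant.
mask = np.triu(np.ones_like(corr, dtype=bool))
for i, a in enumerate(cols):
    for j, b in enumerate(cols):
        if i <= j:
            continue
        pair = df[[a, b]].dropna()
        _, p_value = pearsonr(pair[a], pair[b])
        if p_value >= 0.05:
            mask[i, j] = True

sns.heatmap(corr, mask=mask, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
```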

Model

Two classification algorithms were used: logistic regression and random forest. Both are generally easy to interpret, which makes communicating results to stakeholders easier. Logistic regression was selected because it is the go-to method for binary classification problems; random forest was selected because it handles categorical features and high-dimensional data well. Hyperparameter tuning was done with cross-validated grid search. Random forest performed best, with an F-2 score of 0.839 compared to logistic regression's 0.836.

|              | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| no flood     | 0.98      | 0.95   | 0.96     | 1247    |
| flood        | 0.74      | 0.87   | 0.80     | 203     |
| accuracy     |           |        | 0.94     | 1450    |
| macro avg    | 0.86      | 0.91   | 0.88     | 1450    |
| weighted avg | 0.94      | 0.94   | 0.94     | 1450    |
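A minimal sketch of the cross-validated grid search with an F-2 scorer, assuming a feature matrix `X` and binary flood labels `y` (the hyperparameter grid is illustrative, not the one actually used):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split

# F-2 weights recall more heavily than precision, which suits flood
# prediction: missing a flood is costlier than a false alarm.
f2_scorer = make_scorer(fbeta_score, beta=2)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Illustrative hyperparameter grid.
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 20]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring=f2_scorer,
    cv=5,
)
search.fit(X_train, y_train)

print(classification_report(y_test, search.predict(X_test),
                            target_names=["no flood", "flood"]))
```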

Issues and Potential Improvements

Data cleaning and joining the two data sets together were the trickiest parts of this project.

  • The storm events data used parish to indicate the location of each event, but several entries were not exact parish names and had to be cleaned to match one of the state's 64 parishes
  • The meteorological data did not include the parish for each station, which was the key I used to join it to the storm events data. I used a reverse-geocoding API to look up the parish from each station's coordinates. The station data was then aggregated to represent each parish as a whole by taking the mean of all readings from stations within that parish for each date (a sketch of this step follows the list)
  • The data was sparse; when the storm events and meteorological data were joined, there were only 2,202 flood events, of which only 812 data points had complete meteorological data. A better data set would improve the model's capability to predict flood events
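A rough sketch of the parish lookup and aggregation described above, using the FCC Area API as a stand-in for the reverse-geocoding service actually used, with illustrative column and DataFrame names:

```python
import requests

def parish_for(lat, lon):
    """Reverse-geocode station coordinates to a parish (county) name.

    Uses the FCC Area API as an example; the project may have used a
    different service.
    """
    resp = requests.get(
        "https://geo.fcc.gov/api/census/area",
        params={"lat": lat, "lon": lon, "format": "json"},
    )
    return resp.json()["results"][0]["county_name"]

# Look up the parish for every station (column names are illustrative).
stations["parish"] = stations.apply(
    lambda row: parish_for(row["latitude"], row["longitude"]), axis=1
)

# Attach the parish to each daily reading, then average all stations in a
# parish on a given date so one row represents the whole parish.
daily = daily.merge(stations[["station_id", "parish"]], on="station_id")
parish_daily = (
    daily.groupby(["parish", "date"])[["prcp", "avg_wind_spd", "max_temp"]]
    .mean()
    .reset_index()
)
```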

Further data exploration could be done to find other data sources, and additional feature engineering could strengthen the model. Predictive weather models are exceptionally complex, and there is likely much more that can be done to develop a stronger model with a higher F-2 score. Additionally, a web application could be developed to give stakeholders easy access to the model.

Let's work together!

Feel free to reach out if you think I'd be a good fit for an open role, have a question, or just want to connect. :)