Data pipelines can improve efficiency and consistency while reducing the cognitive load of analysis. After identifying the data that was most commonly pulled and processed, I created a widely used dataset through a data pipeline and then automated the process. The result was an easy-to-access dataset that was updated automatically each day. Previously, similar datasets were built only for the most important products because it was difficult and slow for research scientists to pull and join the data themselves.
Much of the data processing and transformation done for analysis is duplicated effort and extremely time consuming for research scientists. Working with the researchers, I identified the most redundant and time-consuming steps. Iterating on what they needed, I built a data pipeline that produced that data, then automated it to run on a schedule they found ideal.
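In practice, a pipeline like this can be as simple as a scheduled script that runs the shared joins once and publishes the result for everyone to read. The sketch below is a minimal illustration using Python with pandas and SQLAlchemy; the connection string, table names, and columns are hypothetical stand-ins, not the actual Snapshot Wisconsin schema or implementation.

```python
# Minimal sketch of a daily pipeline step: pull, join, and publish a ready-to-use dataset.
# All table names, columns, and credentials below are hypothetical placeholders.
import pandas as pd
from sqlalchemy import create_engine

ENGINE = create_engine("postgresql://user:password@host:5432/snapshot")  # placeholder connection

def build_daily_dataset() -> pd.DataFrame:
    """Run the shared joins once so analysts don't have to repeat them in ad hoc SQL."""
    query = """
        SELECT t.trigger_id,
               t.camera_site_id,
               t.trigger_datetime,
               c.species,
               c.animal_count,
               s.county
        FROM   triggers        t
        JOIN   classifications c ON c.trigger_id = t.trigger_id
        JOIN   camera_sites    s ON s.camera_site_id = t.camera_site_id
    """
    return pd.read_sql(query, ENGINE)

def publish(df: pd.DataFrame) -> None:
    """Overwrite the shared table that research scientists read from."""
    df.to_sql("daily_detections", ENGINE, if_exists="replace", index=False)

if __name__ == "__main__":
    publish(build_daily_dataset())
```

Scheduling a script like this with cron or a workflow orchestrator is what keeps the dataset refreshed each day without any manual steps.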
The end result was a new dataset, refreshed each day, that research scientists could access whenever they wished. Being able to pull a ready-to-use dataset, rather than rebuilding it themselves through SQL queries, drastically improved the efficiency and accuracy of their work. The improvement to their workflow also seemed to create an environment more conducive to new ideas: instead of thinking about the limitations of the database and how tables were joined, they could focus on new ideas and products.
This project transformed the way research scientists use Snapshot Wisconsin data and opened the door to further conversations about how data engineering and data science practices can and should be incorporated into their data product workflows.