B.
Back to projects

Football Analytics Engine

ERD showing connections between Player, Team and Manager entities

Overview

Built a Java-based data ingestion pipeline to ingest CSV-based football datasets into a normalised MySQL-based relational database. The project converted flat CSV rows containing repeated player, team and manager information into a Boyce-Codd Normal Form (a stricter form of 3NF) database that could be queried, validated and protected with referential integrity constraints. Entity Relationship Diagram above shows my schema.

What it does

  • Creates a MySQL database from a custom designed DDL schema.
  • Ingests CSV flatfiles and splits flat rows into player data, team data and manager data using a custom ingestion pipeline using Data Manipulation Language (DML).
  • Avoids attempting to insert duplicate team/manager inserts using HashSet-based deduplication logic.
  • Preserves relationships using Primary and Foreign keys.
  • Runs requested complex SQL queries.

Why it’s interesting

This project was an end-to-end analysis project that involved working with multiple flat,raw data files and ingesting them into MySQL database using JDBC and Java and then running verification and analytical queries. In this project I learned about schema design and the different normalisation levels and where they are most appropriate.

Key Technical Points

  • Multi-file ingestion layer: Read and parsed multiple CSV files, mapped columns into structured entities and populated MySQL database using Java & JDBC.
  • Consistent & Efficient storage: Highly normalised data to BCNF/3NF to reduce redundancy and hence reduce space requirements. Normalisation also helped me remove insertion and deletion errors.
  • Query validation: Used SQL deletion tests, aggregate queries, GROUP BY/HAVING checks and relational algebra reasoning to verify that the schema enforced referential integrity and produced sensible football summaries.
  • Integrity-first loading: Built the ingestion pipeline around the schema’s foreign key constraints by using HashSet-based deduplication to ensure shared teams and managers were always inserted before a dependent player’s record was added.

Tech Stack

Languages: Java, SQL
Backend: JDBC, MySQL