Writing a great story for data science projects - Spring 2025

This is a Quarto report template

Author

Student names (Advisor: Dr. Cohen)

Published

January 14, 2025

Slides: slides.html (go to slides.qmd to edit)

Important

Remember: Your goal is to make your audience understand and care about your findings. By crafting a compelling story, you can effectively communicate the value of your data science project.

Carefully read this template, since it contains instructions and tips on writing!

Introduction

The introduction should:

  • Develop a storyline that captures attention and maintains interest.

  • Remember that your audience is your peers.

  • Clearly state the problem or question you’re addressing.

  • Explain why it is relevant.

  • Provide an overview of your approach.

Example of writing that includes citing references:

This is an introduction to ….. regression. Kernel regression is a non-parametric estimator of the conditional expectation of one random variable given another; its goal is to discover the non-linear relationship between two random variables. Kernel estimation, also called kernel smoothing, is the main non-parametric method for estimating this curve, and its weight function is known as the kernel function (Efromovich 2008). Cite this paper (Bro and Smilde 2014). The GEE (Wang 2014). The PCA (Daffertshofer et al. 2004). Topology can be used in machine learning (Adams and Moy 2021).
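To make the kernel smoothing idea concrete, here is a minimal sketch using base R's ksmooth() on simulated data; the sine-shaped signal, sample size, and bandwidth are illustrative assumptions, not part of the template.

Code
# kernel regression (kernel smoothing) on simulated data -- illustrative sketch
set.seed(1)
x <- sort(runif(200, 0, 10))            # random design points
y <- sin(x) + rnorm(200, sd = 0.3)      # non-linear signal plus noise

# Gaussian kernel smoother; the bandwidth value is an assumption
fit <- ksmooth(x, y, kernel = "normal", bandwidth = 1)

plot(x, y, col = "grey60", pch = 16)
lines(fit, col = "blue", lwd = 2)       # estimated non-linear relationship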

This is my work and I want to add more work…

Methods

  • Detail the models or algorithms used.

  • Justify your choices based on the problem and data.

The common non-parametric regression model is \(Y_i = m(X_i) + \varepsilon_i\), where \(Y_i\) is the value of the unknown regression function \(m(\cdot)\) at \(X_i\) plus an error term \(\varepsilon_i\). This definition suggests estimating \(m(x)\) by local averaging: average the \(Y_i\) whose \(X_i\) lie near \(x\). In other words, we trace a curve through the data by letting the surrounding data points determine its value at each \(x\). The estimation formula is given below (R Core Team 2019):

\[ M_n(x) = \sum_{i=1}^{n} W_n (X_i) Y_i \tag{1} \]

Here \(W_n(X_i)\) denotes a real-valued weight assigned to observation \(i\); the weights are non-negative and become small when \(X_i\) is far from \(x\).
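To connect Equation (1) to code, the sketch below builds the weights \(W_n(X_i)\) from a Gaussian kernel and normalizes them to sum to one, so observations with \(X_i\) far from \(x\) get little weight; the function name, kernel choice, and bandwidth are assumptions for illustration.

Code
# local-averaging estimator M_n(x) = sum_i W_n(X_i) Y_i -- illustrative sketch
set.seed(1)
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, sd = 0.3)

local_average <- function(x0, x, y, h = 0.5) {
  w <- dnorm((x - x0) / h)     # Gaussian kernel weight: small when X_i is far from x0
  w <- w / sum(w)              # normalize the weights so they sum to one
  sum(w * y)                   # weighted average of the Y_i
}

grid  <- seq(0, 10, length.out = 100)
m_hat <- sapply(grid, function(x0) local_average(x0, x, y))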

Another equation:

\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]
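This simple linear regression can be fit in R with lm(); the simulated data and the chosen true coefficients below are assumptions used only to illustrate the syntax.

Code
# fitting y_i = beta_0 + beta_1 x_i + e_i on simulated data -- illustrative sketch
set.seed(2)
x <- runif(100, 0, 5)
y <- 2 + 3 * x + rnorm(100)      # assumed true values: beta_0 = 2, beta_1 = 3

fit <- lm(y ~ x)
summary(fit)$coefficients        # estimated intercept and slope with standard errors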

Analysis and Results

Data Exploration and Visualization

  • Describe your data sources and collection process.

  • Present initial findings and insights through visualizations.

  • Highlight unexpected patterns or anomalies.

A study was conducted to determine how…

Code
# loading packages 
library(tidyverse)
library(knitr)
library(ggthemes)
library(ggrepel)
library(dslabs)
Code
# preview the first rows of the murders data (from the dslabs package)
kable(head(murders))
state       abb  region  population  total
Alabama     AL   South      4779736    135
Alaska      AK   West        710231     19
Arizona     AZ   West       6392017    232
Arkansas    AR   South      2915918     93
California  CA   West      37253956   1257
Colorado    CO   West       5029196     65
Code
# scatter plot of total murders vs. population, colored by region (log-log scale)
ggplot1 <- murders %>%
  ggplot(mapping = aes(x = population / 10^6, y = total))

ggplot1 +
  geom_point(aes(col = region), size = 4) +
  geom_text_repel(aes(label = abb)) +
  scale_x_log10() +
  scale_y_log10() +
  geom_smooth(formula = y ~ x, method = "lm", se = FALSE) +
  xlab("Populations in millions (log10 scale)") +
  ylab("Total number of murders (log10 scale)") +
  ggtitle("US Gun Murders in 2010") +
  scale_color_discrete(name = "Region") +
  theme_bw()

Modeling and Results

  • Explain your data preprocessing and cleaning steps.

  • Present your key findings in a clear and concise manner (a minimal modeling sketch follows this list).

  • Use visuals to support your claims.

  • Tell a story about what the data reveals.
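As one way to carry out these steps with the murders data explored above, the sketch below fits a simple linear model on the log10 scale, mirroring the trend line in the scatter plot; treating this as the intended model for the template is an assumption.

Code
# simple model for the murders data on the log10 scale (matches the plot's trend line)
fit <- lm(log10(total) ~ log10(population / 10^6), data = murders)
summary(fit)$coefficients   # a slope near 1 would mean murders scale roughly in proportion to population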

Conclusion

  • Summarize your key findings.

  • Discuss the implications of your results.

References

Adams, Henry, and Michael Moy. 2021. “Topology Applied to Machine Learning: From Global to Local.” Frontiers in Artificial Intelligence 4: 668302.
Bro, Rasmus, and Age K Smilde. 2014. “Principal Component Analysis.” Analytical Methods 6 (9): 2812–31.
Daffertshofer, Andreas, Claudine JC Lamoth, Onno G Meijer, and Peter J Beek. 2004. “PCA in Studying Coordination and Variability: A Tutorial.” Clinical Biomechanics 19 (4): 415–28.
Efromovich, S. 2008. Nonparametric Curve Estimation: Methods, Theory, and Applications. Springer Series in Statistics. Springer New York. https://books.google.com/books?id=mdoLBwAAQBAJ.
R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org.
Wang, Ming. 2014. “Generalized Estimating Equations in Longitudinal Data Analysis: A Review and Recent Developments.” Advances in Statistics 2014.