Simple Linear Regression Using Ruby

In the following post I am going to walk you through the basics of linear regression, and show you how you can perform simple linear regression using Ruby.

While Ruby is not commonly recognized as a tool for statistical analysis, there are times at Sharethrough when we need to perform basic statistical modeling in our web application, which is written using Ruby on Rails. In addition, Ruby’s elegant syntax makes computational regression very approachable.

Let's begin with a little mathematical review. Simple linear regression is a mathematical technique used to model the relationship between a dependent variable (y) and an independent variable (x). Since we are attempting to find a linear relationship between a dependent variable and a single independent variable, the basic equation is the familiar equation of a line.

$$ y = \beta_{0} + \beta_{1} x $$

Linear regression is finding the best values for $\beta_0$ and $\beta_1$. Finding these values will take up the remainder of this post.

The best values for $\beta_0$ and $\beta_1$ will minimize the error between our line and the dataset. Unless you have a perfectly linear dataset, which almost never occurs in the real world, you will never find perfect values for $\beta_0$ and $\beta_1$. Therefore, we’ll estimate the best possible values for $\beta_0$ and $\beta_1$. These estimates will be denoted $\hat{\beta_0}$ and $\hat{\beta_1}$.

We can now define the regression equation as:

$$ \hat{r}(x) = \hat\beta_{0} + \hat\beta_{1}x $$

The error between our model and the data is calculated using the residual sum of squares, which is defined as:

$$ \sum_{i=1}^{n}\hat{\varepsilon}_i^{2} = \sum_{i=1}^{n}(y_i - (\hat\beta_0 + \hat\beta_1 x_i))^{2} $$

The goal is to minimize the sum of squared errors. Setting the partial derivatives of this sum with respect to $\hat{\beta_{0}}$ and $\hat{\beta_{1}}$ to zero and solving gives the equations for $\hat{\beta_{0}}$ and $\hat{\beta_{1}}$.

$$ \hat\beta_0 = \bar{Y} - \hat\beta_1\,\bar{X} $$
$$ \hat\beta_1 = \frac{ \sum_{i=1}^{n} (X_{i}-\bar{X})(Y_{i}-\bar{Y}) }{ \sum_{i=1}^{n} (X_{i}-\bar{X})^2 } $$
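For readers who want the intermediate step, here is a short sketch of where these two estimates come from, writing the observations as $(X_i, Y_i)$. Differentiating the residual sum of squares with respect to each coefficient and setting the result to zero gives:

$$ \frac{\partial}{\partial \hat\beta_0} \sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 X_i)^2 = -2 \sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 X_i) = 0 $$

$$ \frac{\partial}{\partial \hat\beta_1} \sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 X_i)^2 = -2 \sum_{i=1}^{n} X_i (Y_i - \hat\beta_0 - \hat\beta_1 X_i) = 0 $$

Dividing the first condition by $n$ yields $\hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{X}$. Substituting that into the second condition gives:

$$ \sum_{i=1}^{n} X_i \left( (Y_i - \bar{Y}) - \hat\beta_1 (X_i - \bar{X}) \right) = 0 \quad\Longrightarrow\quad \hat\beta_1 = \frac{ \sum_{i=1}^{n} X_i (Y_i - \bar{Y}) }{ \sum_{i=1}^{n} X_i (X_i - \bar{X}) } $$

Because $\sum_{i=1}^{n} (Y_i - \bar{Y}) = 0$ and $\sum_{i=1}^{n} (X_i - \bar{X}) = 0$, replacing $X_i$ with $X_i - \bar{X}$ in the numerator and the denominator changes neither sum, which gives the centered form of $\hat\beta_1$ shown above.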

Now that we have the equations, let’s write some Ruby that solves them numerically.

We’ll start by attacking the simplest part of the equations: $\bar{X}$ and $\bar{Y}$. These symbols represent the mean of the x and y variables in the dataset. In Ruby we can write the following function to compute the mean:
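A minimal sketch of such a function, assuming the data arrive as a non-empty array of numbers (the name `mean` and the empty-array check are illustrative choices, not necessarily the original listing):

```ruby
# Arithmetic mean of an array of numbers.
# Summing from 0.0 forces floating-point division even for integer input.
def mean(values)
  raise ArgumentError, "values must not be empty" if values.empty?

  values.sum(0.0) / values.length
end
```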

Now that we have the easy part out of the way, let's tackle the equation for $\beta_1$. To simplify, break the equation into two parts: the numerator and the denominator. The numerator becomes:

$$ \sum_{i=1}^{n} (x_{i}-\bar{x})(y_{i}-\bar{y}) $$

What this equation says is “for every pair of values in x and y, multiply the difference between the observed x and the mean of x by the difference between the observed y and the mean of y, then add up all of those products.” In Ruby, this would be:
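One way to sketch the numerator, assuming `xs` and `ys` are equal-length arrays of numbers (the method name `slope_numerator` is hypothetical):

```ruby
# Sum over all observations of (x_i - mean(x)) * (y_i - mean(y)).
def slope_numerator(xs, ys)
  x_mean = xs.sum(0.0) / xs.length
  y_mean = ys.sum(0.0) / ys.length

  # zip pairs each x with its y; the block receives one pair at a time.
  xs.zip(ys).sum(0.0) { |x, y| (x - x_mean) * (y - y_mean) }
end
```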

Once we have the numerator we can compute the denominator of our equation for $\beta_1$. The equation for the denominator is:

$$ \sum_{i=1}^{n} (x_{i}-\bar{x})^2 $$

Writing Ruby to compute this value is also pretty easy:
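A sketch of the denominator under the same assumptions (the name `slope_denominator` is hypothetical):

```ruby
# Sum over all observations of (x_i - mean(x))^2.
def slope_denominator(xs)
  x_mean = xs.sum(0.0) / xs.length

  xs.sum(0.0) { |x| (x - x_mean)**2 }
end
```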

With the numerator and the denominator identified, we can put them together into a Ruby function that estimates slope:
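A possible version of the combined function is sketched below. The method names and the input checks are assumptions added for illustration; in particular, the guard against a zero denominator covers the case where every x value is identical and no slope can be estimated.

```ruby
def mean(values)
  values.sum(0.0) / values.length
end

# Least-squares estimate of the slope: numerator divided by denominator.
def slope(xs, ys)
  raise ArgumentError, "xs and ys must have the same length" unless xs.length == ys.length

  x_mean = mean(xs)
  y_mean = mean(ys)

  numerator   = xs.zip(ys).sum(0.0) { |x, y| (x - x_mean) * (y - y_mean) }
  denominator = xs.sum(0.0) { |x| (x - x_mean)**2 }
  raise ArgumentError, "xs must contain at least two distinct values" if denominator.zero?

  numerator / denominator
end
```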

Having solved for $\beta_1$, we can tackle the solution for $\beta_0$, which is the y-intercept of our regression line. The equation for $\beta_0$ is simply:

$$ \hat{\beta_0} = \bar{Y} - \hat{\beta_1} \bar{X} $$

Translated into English, this equation says “the y-intercept can be estimated as the average of y minus the product of the slope of our line and the average of x.” In Ruby this is:
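A sketch of the intercept calculation, assuming the slope has already been estimated and is passed in as an argument (the signature is an illustrative choice):

```ruby
def mean(values)
  values.sum(0.0) / values.length
end

# Intercept estimate: mean(y) - slope * mean(x).
def y_intercept(xs, ys, slope)
  mean(ys) - slope * mean(xs)
end
```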

Now that you have both the slope and the y-intercept, you’ve written all the Ruby necessary to perform simple linear regression. Putting it all together, we end up with the following Ruby class:
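A sketch of what such a class could look like is shown here. The class name `SimpleLinearRegression`, its method names, and the input validation are assumptions made for illustration and may differ from the original listing.

```ruby
class SimpleLinearRegression
  def initialize(xs, ys)
    raise ArgumentError, "xs and ys must have the same length" unless xs.length == ys.length
    raise ArgumentError, "need at least two observations" if xs.length < 2

    @xs = xs
    @ys = ys
  end

  # Estimated intercept: mean(y) - slope * mean(x).
  def y_intercept
    mean(@ys) - slope * mean(@xs)
  end

  # Estimated slope: sum of cross-deviations over sum of squared x deviations.
  def slope
    x_mean = mean(@xs)
    y_mean = mean(@ys)

    numerator   = @xs.zip(@ys).sum(0.0) { |x, y| (x - x_mean) * (y - y_mean) }
    denominator = @xs.sum(0.0) { |x| (x - x_mean)**2 }
    raise ArgumentError, "xs must contain at least two distinct values" if denominator.zero?

    numerator / denominator
  end

  private

  def mean(values)
    values.sum(0.0) / values.length
  end
end
```

As a quick sanity check, the points (1, 3), (2, 5), (3, 7) lie exactly on the line y = 2x + 1, so `SimpleLinearRegression.new([1, 2, 3], [3, 5, 7])` should report a slope of 2.0 and a y-intercept of 1.0.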

Let’s try out our simple linear regression class on a sample dataset: video views versus the number of days a video has been online. The dataset, available here, is a good example of the data we analyze at Sharethrough.

We can use the code below to run our Ruby-based regression on the sample dataset:

Our Ruby-based regression says the best fit line for our sample dataset is:

$$ y = 2463.53x + 25071.51 $$

This Ruby-based solution corresponds to the regression line generated using the following R code:

You can find all the code used in this blog post on GitHub.

Stay tuned for part 2 where we will look at confidence intervals, model error, and the predictive power of our simple linear regression model.

If you’re interested in working on hard data problems in a dynamic, collaborative environment, we’re hiring!

Ryan Weald is a Data Scientist at Sharethrough. You can follow him on Twitter @rweald.