Data Analysis - Predicting Kickstarter Campaign Success¶

Author: John Salisbury / Last Updated: Apr 11, 2023




Summary¶

In this project, I clean and analyze data on over 250,000 Kickstarter crowdfunding campaigns that took place in the United States between 2009 and 2018, using logistic regression to identify factors that predict campaign success.

In this particular notebook, I run and interpret a logistic regression model, allowing me to determine if certain factors in our dataset can predict whether or not Kickstarter campaigns succeed. To view the exploration and cleaning of this dataset, visit this link (or click on "View Data Exploration/Cleaning" above).

--

Table of Contents:
Import Modules and Data
Pre-Model Checks
Run a Logistic Regression
Model Interpretation

--

Find the data for this project on Kaggle:
Kickstarter Projects


View Data Exploration/Cleaning


Import Modules and Data¶

To begin our analysis, we first import a number of common Python modules (e.g., NumPy and Pandas) into our project. We also import the statsmodels module, which allows us to run a logistic regression and easily interpret the beta coefficients of the final model:

In [1]:
#IMPORT MODULES ------

#Stats modules
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

#Helper modules
import matplotlib.pyplot as plt   #Plotting support
import numpy as np   #Array support
import pandas as pd   #DataFrame support
import seaborn as sns   #Plotting support

#Suppress warnings in Jupyter 
import warnings
warnings.filterwarnings('ignore')

#IPython - HTML notebook formatting
from IPython.core.display import HTML

We'll also add some CSS to improve the look and feel of our notebook:

In [2]:
HTML("""
<style>
.jp-needs-light-background {
    display: block;
    margin: auto;
}

.jp-OutputArea-output.jp-RenderedHTMLCommon table {
    margin: 2em auto;
    background: #eae9e9;
    border: 1px solid #000;
    font-size: 12px;
}

.toc {
    font-size: 16px;
}

.nest-one {
    margin-left: 1em;
    font-style: italic;
    font-size: 14px;
}

.nest-two {
    margin-left: 3em;
    font-style: italic;
    font-size: 12px;
}

.faint {
    opacity: 0.2;
}
</style>
""")
Out[2]:

And finally, we import our cleaned data as a dataframe:

In [3]:
#IMPORT DATA ------

#Import CSV as DataFrame
data = pd.read_csv("../data/cleaned_data.csv")

Let's just look at the head of our dataframe to verify that everything looks OK:

In [4]:
data.head()
Out[4]:
                                                Name   Goal  Backers  State  fund_days  name_len  years_since  s_winter  s_spring  s_summer  s_fall
0  Grace Jones Does Not Give A F$#% T-Shirt (limi...   1000       30      0  39.123056        59            0         0         1         0       0
1                     CRYSTAL ANTLERS UNTITLED MOVIE  80000        3      0  87.994525        30            0         0         1         0       0
2                                drawing for dollars     20        3      1   8.088854        19            0         0         1         0       0
3                       Offline Wikipedia iPhone app     99       25      1  79.266424        28            0         0         1         0       0
4                                         Pantshirts   1900       10      0  28.409271        10            0         0         1         0       0

Looks good! As a reminder, our response variable is State, a categorical variable that represents the outcome of each Kickstarter campaign. State has two levels: 0 ("Failed") and 1 ("Successful"). Additionally, we have the following explanatory variables that we may decide to integrate into our logistic regression model:

  • Goal (fundraising goal; continuous)
  • Backers (number of individuals who donated; continuous)
  • fund_days (duration of campaign in days; continuous)
  • name_len (length of campaign name in characters; discrete)
  • years_since (years since Kickstarter's launch in 2009; discrete)
  • Season (represented by four columns with binary outcomes; 0 indicates that the campaign was not launched in the given season, whereas 1 indicates that the campaign was launched in that season)
    • s_winter (indicates if campaign was launched during winter months; categorical)
    • s_spring (indicates if campaign was launched during spring months; categorical)
    • s_summer (indicates if campaign was launched during summer months; categorical)
    • s_fall (indicates if campaign was launched during fall months; categorical)

Let's move on with our analysis below.


Pre-Model Checks¶

The next step in our analysis is to conduct a few pre-model checks and ensure that our data meets some of the basic conditions necessary to use it in a logistic regression model.

Response variable has binary outcome¶

The first condition our data needs to meet is that the response variable in our model has a binary outcome. As described in the section above, our response variable is State, which is categorical and has values of 0 (indicating a failed Kickstarter campaign) and 1 (indicating a successful campaign). Since State has only two possible outcomes, we can confirm that this condition has been met.
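
As a quick sanity check (a minimal sketch, assuming the data DataFrame loaded above), we can confirm that State takes exactly two values:

#Confirm that the response variable is binary
print(data["State"].unique())   #Expect exactly two values: [0 1]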


Observations are independent¶

The next condition our data needs to meet is that the observations in our dataset are independent of each other. Our dataset represents over 250,000 unique Kickstarter campaigns from across the United States that were launched at different points between 2009 and 2018. There are no duplicate records or repeated measurements, and there is no nested or hierarchical structure in the data that would necessitate a more complex type of model. Thus, we can reasonably assume that the observations in our dataset are independent of each other.
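
If we wanted to back the no-duplicates claim with code, a simple check along these lines would work (a sketch, using only the data DataFrame from above):

#Count fully duplicated rows; a result of 0 supports the independence assumption
print(data.duplicated().sum())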


No collinearity among explanatory variables¶

Next, we need to verify that there is no collinearity between our numeric explanatory variables. If collinearity is present, we will need to remove correlated variables until it is no longer an issue. To check for collinearity, we can (1) pull these variables into a dataframe and (2) create a correlation matrix that displays correlation coefficients for all possible pairs of these variables:

In [5]:
#Investigate collinearity between potential explanatory variables

#Isolate numeric variables
cols = ["Goal", "Backers", "fund_days", "name_len", "years_since"]
expl = data[cols]

#Create correlation matrix
expl.corr()
Out[5]:
                 Goal   Backers  fund_days  name_len  years_since
Goal         1.000000  0.006245   0.020847 -0.006465     0.015745
Backers      0.006245  1.000000  -0.001251  0.020870     0.025893
fund_days    0.020847 -0.001251   1.000000  0.017794    -0.197098
name_len    -0.006465  0.020870   0.017794  1.000000    -0.059036
years_since  0.015745  0.025893  -0.197098 -0.059036     1.000000

Table 1. Correlation matrix of numeric explanatory variables. Each value represents a correlation coefficient between a pair of variables.



We can also make this correlation matrix easier to read by plotting it:

In [6]:
#Plot correlation matrix as a color-coded table (Styler.format replaces the deprecated set_precision)
expl.corr().style.background_gradient(cmap='coolwarm', axis=None).format(precision=2)
Out[6]:
              Goal  Backers  fund_days  name_len  years_since
Goal          1.00     0.01       0.02     -0.01         0.02
Backers       0.01     1.00      -0.00      0.02         0.03
fund_days     0.02    -0.00       1.00      0.02        -0.20
name_len     -0.01     0.02       0.02      1.00        -0.06
years_since   0.02     0.03      -0.20     -0.06         1.00

Figure 1. Plot of correlation matrix for explanatory variables listed in Table 1. Values indicate correlation coefficients between variables.



In Table 1 and Figure 1 above, we can see that correlations between our explanatory variables are very weak. The strongest is $r=-0.197$, between the variables years_since and fund_days; its magnitude is well below the $|r|=0.7$ threshold that is commonly used as a rule of thumb for determining whether collinearity is present between two variables. Thus, we will keep both of these variables for the moment and move on to our assessment of multicollinearity below.
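
For larger variable sets, scanning the matrix by eye gets tedious. A small helper like the one below (hypothetical, not part of the original analysis) flags any pair whose correlation magnitude exceeds the rule-of-thumb threshold:

#Flag variable pairs whose absolute correlation exceeds a threshold
corr = expl.corr().abs()

#Keep only the upper triangle, skipping the diagonal and duplicate pairs
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

#List pairs above |r| = 0.7 (the result is empty for this dataset)
pairs = upper.stack()
print(pairs[pairs > 0.7])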


No multicollinearity among explanatory variables¶

Above, we determined that there is no evidence of collinearity between any pair of our numeric explanatory variables. However, multicollinearity can still occur even when no single pair of variables is highly correlated. Thus, we should test for multicollinearity by examining the variance inflation factor (VIF) for each variable.

We can test for multicollinearity with the variance_inflation_factor() function from the statsmodels module, which returns a VIF value for each numeric explanatory variable:

In [7]:
#Compute a VIF for each numeric explanatory variable
vif = [variance_inflation_factor(expl.values, i) for i in range(len(expl.columns))]
print(pd.DataFrame(vif, expl.columns, columns=["VIF"]))
                  VIF
Goal         1.002171
Backers      1.015185
fund_days    4.494515
name_len     4.357918
years_since  4.089828

The statsmodels documentation suggests that VIF values greater than five indicate multicollinearity. As seen above, each of our VIF values is below five, suggesting that multicollinearity is not present among our explanatory variables.


Sufficient sample size¶

According to Bujang et al. (2018), studies that utilize logistic regression should have a minimum sample size of $n = 500$, or a sample size of $n = 100 + 50i$ (where $i$ is the number of explanatory variables in the model). As described in the first section of this analysis, we have nine explanatory variables of interest in our dataset. So, using the rule of thumb above, we would need a sample size of $n = 100 + 50(9) = 550$ observations.
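
Expressed in code, the required sample size is trivial to verify (with $i = 9$):

#Minimum sample size per Bujang et al. (2018)
i = 9   #Number of explanatory variables of interest
required_n = 100 + 50 * i
print(required_n)   #550 (which also satisfies the blanket minimum of 500)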

Let's take a look at the number of observations in our dataset below:

In [8]:
print(data.shape[0])
261358

We have 261,358 unique observations in our sample. Thus, it appears that we have a sample size that is sufficient for use in a logistic regression model.


Run a Logistic Regression¶

We have verified that our data meets the basic requirements for logistic regression, and are now ready to actually fit our logistic regression model!

First, we'll break up our response and explanatory variables into separate dataframes:

In [9]:
#Split dataset into 'features' and 'target variable'
feature_cols = ['Backers', 'fund_days', 'Goal', 'name_len', 's_spring', 's_summer', 's_fall', 'years_since']
x = data[feature_cols]
y = data['State']

Note that we omit the s_winter column from our explanatory variables. This makes winter the reference (baseline) season, so the model coefficients for s_spring, s_summer, and s_fall are interpreted relative to winter; keeping all four dummy columns would also make them perfectly collinear with the intercept.
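
To make the baseline explicit, the model we are about to fit has the linear predictor

$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1\,\text{Backers} + \beta_2\,\text{fund\_days} + \beta_3\,\text{Goal} + \beta_4\,\text{name\_len} + \beta_5\,\text{s\_spring} + \beta_6\,\text{s\_summer} + \beta_7\,\text{s\_fall} + \beta_8\,\text{years\_since}$

where $p$ is the probability of success. For a winter campaign, s_spring = s_summer = s_fall = 0, so the seasonal terms vanish and winter is absorbed into the intercept $\beta_0$.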

Next, we'll create a logistic regression model using the Logit() function from the statsmodels module and display the results below:

In [10]:
#Add an intercept (i.e., a column of 1's) to x
x = sm.add_constant(x)

#Describe the model (statsmodels.discrete.discrete_model.Logit)
model = sm.Logit(endog=y, exog=x, missing='none')

#Fit the model
result = model.fit()

#Print model results (beta coefficients, p-values, and confidence intervals)
res = result.summary2()
print(res)
Optimization terminated successfully.
         Current function value: 0.325689
         Iterations 12
                          Results: Logit
==================================================================
Model:              Logit            Pseudo R-squared: 0.521      
Dependent Variable: State            AIC:              170260.9627
Date:               2023-04-11 14:08 BIC:              170355.2255
No. Observations:   261358           Log-Likelihood:   -85121.    
Df Model:           8                LL-Null:          -1.7765e+05
Df Residuals:       261349           LLR p-value:      0.0000     
Converged:          1.0000           Scale:            1.0000     
No. Iterations:     12.0000                                       
-------------------------------------------------------------------
              Coef.   Std.Err.      z      P>|z|    [0.025   0.975]
-------------------------------------------------------------------
const        -0.1337    0.0304    -4.4001  0.0000  -0.1933  -0.0741
Backers       0.0563    0.0003   209.1841  0.0000   0.0558   0.0568
fund_days    -0.0134    0.0005   -28.3361  0.0000  -0.0143  -0.0125
Goal         -0.0002    0.0000  -167.6861  0.0000  -0.0002  -0.0002
name_len      0.0048    0.0004    12.3536  0.0000   0.0040   0.0055
s_spring      0.0301    0.0175     1.7184  0.0857  -0.0042   0.0644
s_summer     -0.1262    0.0175    -7.2137  0.0000  -0.1605  -0.0919
s_fall       -0.0431    0.0179    -2.4067  0.0161  -0.0782  -0.0080
years_since  -0.0889    0.0031   -28.7278  0.0000  -0.0949  -0.0828
==================================================================

The final step in this process is to back-transform the beta coefficients and confidence intervals by exponentiating them, which converts them into odds ratios:

In [11]:
#Define coefficient table from results summary
coef_table = res.tables[1]

#Drop the intercept row and strip whitespace from column names
cols = coef_table.columns
coef_table = coef_table[1:]
coef_table.columns = cols.astype(str).str.strip(" ")

#Add back-transformed columns for beta coefficients and confidence intervals
coef_table["OR"] = np.exp(coef_table["Coef."].astype(float))
coef_table["CI_Lower"] = np.exp(coef_table["[0.025"].astype(float))
coef_table["CI_Higher"] = np.exp(coef_table["0.975]"].astype(float))

#Display table
coef_table = pd.concat([coef_table.iloc[:,-3:], coef_table.iloc[:,3]], axis=1)
coef_table
Out[11]:
                   OR  CI_Lower  CI_Higher          P>|z|
Backers      1.057927  1.057369   1.058485   0.000000e+00
fund_days    0.986670  0.985755   0.987587  1.240029e-176
Goal         0.999760  0.999757   0.999763   0.000000e+00
name_len     1.004784  1.004024   1.005546   4.657128e-35
s_spring     1.030556  0.995777   1.066550   8.572709e-02
s_summer     0.881415  0.851699   0.912168   5.444942e-13
s_fall       0.957812  0.924775   0.992031   1.609632e-02
years_since  0.914976  0.909446   0.920540  1.717033e-181

Table 2. Odds ratios, back-transformed confidence intervals, and p-values for each explanatory variable in our logistic regression model.



In Table 2 above, we have odds ratios as well as lower and upper bounds of our confidence intervals and associated p-values in the OR, CI_Lower, CI_Higher, and P>|z| columns, respectively. We are now ready to interpret our model.


Model Interpretation¶

Using the information contained in Table 2, we can make the following determinations:

Campaigns with more backers have higher odds of success. For every 10 additional backers that support a campaign, the odds of that campaign's success increase by 75.61% (p < 0.001; 95% CI [74.69%, 76.54%]).

Campaigns with higher fundraising goals have lower odds of success. For every $1,000 USD increase in a campaign's fundraising goal, the odds of that campaign's success decrease by 21.34% (p < 0.001; 95% CI [-21.57%, -21.10%]).

Longer campaign duration is associated with lower odds of success. For every additional week in a campaign's fundraising period, the odds of that campaign's success decrease by 8.97% (p < 0.001; 95% CI [-9.56%, -8.37%]).

Campaigns launched in winter have higher odds of success than campaigns launched in summer or fall. The odds of success are 11.86% lower for campaigns launched during summer instead of winter (p < 0.001; 95% CI [-14.83%, -8.78%]), whereas the odds of success are 4.22% lower for campaigns launched during fall instead of winter (p = 0.02; 95% CI [-7.52%, -0.80%]). The odds of success for campaigns launched in spring were not significantly different from those for campaigns launched in winter.

Newer campaigns have lower odds of success than those launched closer to when Kickstarter was founded. For every year since 2009 - the year in which Kickstarter initially launched - the odds of campaign success decrease by 8.50% (p < 0.001; 95% CI [-9.06%, -7.95%]).

The effect of campaign name length on odds of success is significant but negligible. For each character in a campaign's name, the odds of success increase by 0.48% (p < 0.001; 95% CI [0.40%, 0.55%]). This effect size is substantially smaller than those of the other explanatory variables in our model.
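
For transparency, here is a minimal sketch of how the multi-unit percentages above can be reproduced from the fitted model (it assumes the result object from the modeling cell is still in scope):

#Percent change in odds for a k-unit increase in a predictor: (e^(k*beta) - 1) * 100
def pct_change_in_odds(beta, k=1):
    return (np.exp(beta * k) - 1) * 100

betas = result.params   #Fitted beta coefficients, indexed by variable name

print(pct_change_in_odds(betas["Backers"], k=10))    #~+75.61% per 10 additional backers
print(pct_change_in_odds(betas["Goal"], k=1000))     #~-21.34% per $1,000 of goal
print(pct_change_in_odds(betas["fund_days"], k=7))   #~-8.97% per additional week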

So, according to our model, we see that Kickstarter campaigns with a high number of backers, low fundraising goals, and short fundraising periods have the highest odds of success. Additionally, campaigns that are launched in winter months may have higher odds of success than those launched in other seasons (i.e., summer and fall). These factors appear to be particularly important in predicting campaign success, whereas other factors - such as campaign name length - appear to be less important.

Our results align with common sense - popular campaigns with many supporters and relatively modest fundraising goals are more likely to succeed than unpopular campaigns with lofty goals. Still, our results could be useful to individuals who are planning to launch Kickstarter campaigns and want to maximize their odds of success!
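
As a closing illustration (not part of the original analysis), the fitted model can also score a hypothetical campaign; every feature value below is invented purely for demonstration:

#Estimate the success probability of a hypothetical campaign
new_campaign = pd.DataFrame({
    "const": [1.0],       #Intercept column, matching sm.add_constant() above
    "Backers": [100],     #Hypothetical: 100 backers
    "fund_days": [30],    #Hypothetical: 30-day fundraising period
    "Goal": [5000],       #Hypothetical: $5,000 USD goal
    "name_len": [25],     #Hypothetical: 25-character name
    "s_spring": [0],
    "s_summer": [0],
    "s_fall": [0],        #All season dummies 0 -> a winter launch
    "years_since": [9],   #Hypothetical: launched nine years after 2009
})

#result.predict() returns the predicted probability that State = 1
print(result.predict(new_campaign[x.columns]))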



To explore how this data was cleaned prior to analysis, visit the link below:

View Data Exploration/Cleaning