Community service is one of the largest government-run practices in law enforcement and criminal justice. It took root in the states, inspired by ancient legal systems based on restitution, in which offenders labored to compensate their victims for injury or loss. It is often described as a “productive form of punishment”, and rightfully so. Over time, robust systems were built to ensure that appropriate formats for community service, which vary from state to state, were in place. These gave offenders a genuine shot at redemption and a path back to civil life.
The current state of affairs, however, suggests that community service is proving counterproductive: studies show that the ratio of individuals on parole who successfully reintegrate into society to those who fail to clear the probation period is very low. According to justice authorities, more than 40% of prison admissions are the result of parole violations. This is unfortunate for both the individual and the community, as the loss incurred in effort and time is substantial. The failures create a loop, referred to as “revolving door” admissions, in which the individual is detained and returns to serve a prison sentence. This phenomenon costs the state a staggering 9.3 billion dollars annually.
Community service officials fulfill a complex set of duties: applying the right amount of surveillance to individuals on parole to safeguard the community, while simultaneously directing them to the most suitable rehabilitative programs to help them overcome challenges. Any factor in the parole routine that is not scaled appropriately can put re-entry at risk and fail to prevent recidivism. Evidently, it’s a sensitive balance that needs intricate design to promote re-entry. Hence the critical question boils down to determining the recidivism risk of an individual, which opens a window for data analytics to provide the necessary insights.
The Recidivism Forecasting Challenge conducted by the NIJ is an open challenge to the citizens and businesses of the country. The aim of the NIJ is to safeguard and improve communities by reducing recidivism. The results of the challenge are expected to provide critical information to community corrections departments that may help facilitate more successful reintegration into society for people previously incarcerated and on parole.
The data for this challenge is released by the NIJ and state authorities, under terms and conditions restricting it to research purposes, with the identity of each individual anonymized.
The data contains 53 fields for every prisoner, corresponding to various aspects of the prison sentence, personal details, and probation.
Download the data here
The challenge is to predict in which of the three years of community supervision an individual recidivates. The target variable in the training data is the year in which the individual recidivated.
Exploratory Data Analysis
Structural details of the data are as follows:
- The data comprises a total of 53 variables, of which 3 are the target variables (Recidivism-Year 1, 2 & 3)
- The data types present in the data are: Boolean: 20, Float: 8, Integer: 2, Object: 23
- The following figure shows the percentage of the missing values in the columns.
Data imputation has been done to fill the missing values.
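A minimal imputation sketch in pandas, where the column names are hypothetical stand-ins for the NIJ fields: categorical (object) gaps are filled with the column mode, numeric gaps with the median.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the NIJ data; the real
# column names differ. Object columns get the mode, numerics the median.
df = pd.DataFrame({
    "Supervision_Risk_Score": [3.0, np.nan, 7.0, 5.0],
    "Prison_Offense": ["Drug", None, "Drug", "Violent"],
})

for col in df.columns:
    if df[col].dtype == object:
        df[col] = df[col].fillna(df[col].mode()[0])
    else:
        df[col] = df[col].fillna(df[col].median())

print(df.isna().sum().sum())  # 0: no missing values remain
```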
- Race: [‘BLACK’, ‘WHITE’], Gender: [‘M’, ‘F’]
Count: 18028
Black frequency: 10313
White frequency: 7715
Male frequency: 15811
Female frequency: 2217
Approximately 58% of the inmates are Black and 42% are White.
There is no significant bias in the racial distribution of the data, whereas there is a significant gender bias.
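These percentages can be reproduced with pandas `value_counts`; the toy frame below is a small stand-in with roughly the same proportions, not the real 18,028-row data:

```python
import pandas as pd

# Toy stand-in mirroring the proportions reported above
demo = pd.DataFrame({
    "Race": ["BLACK"] * 58 + ["WHITE"] * 42,
    "Gender": ["M"] * 88 + ["F"] * 12,
})

race_pct = (demo["Race"].value_counts(normalize=True) * 100).round(1)
gender_pct = (demo["Gender"].value_counts(normalize=True) * 100).round(1)
print(race_pct.to_dict())    # {'BLACK': 58.0, 'WHITE': 42.0}
print(gender_pct.to_dict())  # {'M': 88.0, 'F': 12.0}
```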
- An important factor to investigate is the type of distribution the data follows, or the closest match, to help us understand its characteristics. The given data does not fit any known probability density function.
- The next step is to identify the independent and correlated features, to ensure that an optimal number of features is included in the training data, avoiding information sparsity and reducing dimensionality.
The following figure shows a correlation graph between all the features on a color-coded scale.
Fig: Red indicates the highest correlation, grey the lowest
As the plot suggests, there aren’t any significant correlations between any two given features from the data.
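A sketch of how such a correlation check can be run with pandas, using random stand-in features (independent by construction) in place of the encoded NIJ columns:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Random stand-in for the encoded feature matrix
X = pd.DataFrame(rng.normal(size=(500, 5)),
                 columns=[f"feat_{i}" for i in range(5)])

corr = X.corr()  # Pearson correlation matrix, values in [-1, 1]

# Flag any strongly correlated off-diagonal pair (|r| > 0.8)
mask = ~np.eye(len(corr), dtype=bool)
strong = (corr.abs().values > 0.8) & mask
print(strong.any())  # False: no pair of features is strongly correlated
```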
Principal Component Analysis
PCA is a method for reducing the dimensionality of the data. It helps reveal whether any features in the given dataset do not contribute towards predicting the target variable.
The number of components (i.e., features) to retain can be determined by plotting the explained variance of the dataset against the number of components.
The above plot indicates that the cumulative variance of the dataset is only explained when almost all the features are included, so dimensionality cannot be reduced much without losing information.
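A sketch with scikit-learn's `PCA` on a stand-in matrix of independent features, which reproduces the same behaviour: explained variance spreads nearly evenly across components, so the cumulative curve only approaches 1 when almost all components are kept.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Stand-in matrix: 10 independent features, so variance spreads evenly
X = rng.normal(size=(1000, 10))

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)

# The first half of the components explains well under the total variance,
# so no component can be discarded cheaply.
print(round(float(cumvar[-1]), 6))  # 1.0
print(bool(cumvar[4] < 0.7))        # True
```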
Initially, we approached this as a classification problem, applying various supervised learning (classification) algorithms to the dataset. Three separate models were built, one with each year as the target variable.
Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piece-wise constant approximation.
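A minimal sketch of the per-year setup with scikit-learn's `DecisionTreeClassifier`, on synthetic data; the target names mirror the three yearly labels from the dataset, but the feature matrix and labels here are random stand-ins:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
# Synthetic stand-in: 300 rows, 5 encoded features
X = rng.normal(size=(300, 5))
targets = {f"Recidivism_Year{k}": rng.integers(0, 2, size=300)
           for k in (1, 2, 3)}

# One independent classifier per target year
models = {name: DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)
          for name, y in targets.items()}

print(len(models))  # 3
```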
The evaluation metric is the Brier score:
BS = (1/N) Σ_t (f_t - o_t)^2
in which f_t is the probability that was forecast, o_t the actual outcome of the event at instance t (0 if it does not happen and 1 if it does happen), and N is the number of forecasting instances. In effect, it is the mean squared error of the forecast.
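The metric takes a few lines of NumPy; the helper name below is our own:

```python
import numpy as np

def brier_score(forecast, outcome):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    forecast = np.asarray(forecast, dtype=float)
    outcome = np.asarray(outcome, dtype=float)
    return float(np.mean((forecast - outcome) ** 2))

print(brier_score([1.0, 0.0], [1, 0]))           # 0.0: perfect forecast
print(round(brier_score([0.9, 0.2], [1, 0]), 3))  # 0.025
```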
The given data is highly imbalanced with respect to the target variable.
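One common remedy for such imbalance (an assumption here, not necessarily the method used in the challenge) is to reweight classes inversely to their frequency, sketched below on synthetic data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 4))
y = (rng.random(400) < 0.15).astype(int)  # roughly 15% positives: imbalanced

# "balanced" weights are inversely proportional to class frequency
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
clf = DecisionTreeClassifier(class_weight={0: weights[0], 1: weights[1]},
                             random_state=0).fit(X, y)

print(bool(weights[1] > weights[0]))  # True: the minority class weighs more
```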