Lab members: Kitakado, Ndiaye, Lurkpranee, Hiwatashi, Nagatsuka (leader), Mitsuyu, Tsubosaka
Guests: Sheng-Ping Wang, Wen-Pei Tsai, Sung Il Lee, Hee Won Park, Jung-Hyun Lim, Matsuzaka
The binomial distribution is a discrete probability distribution that describes the number of successes in a fixed number of independent Bernoulli trials, each with the same probability of success. It is widely used in statistics and various fields to model binary outcomes, such as pass/fail, yes/no, or success/failure.
Mathematically, a binomial distribution with parameters \(n\) (number of trials) and \(p\) (probability of success in each trial) is defined by the probability mass function (PMF):
\[ P(Y = y) = \binom{n}{y} p^y (1-p)^{n-y} \]
where \(\binom{n}{y} = \frac{n!}{y!(n-y)!}\) is a binomial coefficient, and \(Y\) is the random variable representing the number of successes.
The moment generating function can be derived using the binomial theorem (where \(q = 1 - p\)):
\[ M(t) = \mathbb{E}[e^{tY}] = \sum_{k=0}^{n} \binom{n}{k} (pe^t)^k q^{n-k} = (pe^t + q)^n \]
Thus, the characteristic function is
\[ \phi(t) = (pe^{it} + q)^n \]
Let \(X \sim \text{Bin}(n_1, p)\) and \(Y \sim \text{Bin}(n_2, p)\) be independent. Then,
\[ \phi_{X+Y}(t) = \phi_X(t) \phi_Y(t) = (pe^{it} + q)^{n_1} (pe^{it} + q)^{n_2} = (pe^{it} + q)^{n_1 + n_2} \]
Thus, \(X + Y \sim \text{Bin}(n_1 + n_2, p)\), demonstrating the reproductive property of the binomial distribution under convolution.
Detection of defective products: Suppose you inspect 100 products, each independently having a 1% chance of being defective. The number of defective products then follows a binomial distribution with \(n = 100\) and \(p = 0.01\).
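For instance, the probability of exactly zero defectives, or of at most two, can be computed directly in R (a minimal sketch of the PMF above):

n <- 100; p <- 0.01
dbinom(0, size = n, prob = p) # P(Y = 0) = 0.99^100, about 0.366
pbinom(2, size = n, prob = p) # P(Y <= 2), about 0.92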
The Poisson distribution is, along with the binomial distribution, one of the most commonly used discrete distributions. One major difference is that while the binomial distribution has a lower limit (0) and an upper limit (\(N\)) for the number of occurrences within \(N\) trials, the Poisson distribution, while similarly taking values of 0 or more, has no specific upper limit. This difference is easy to understand by considering the observation procedures. In the sampling of an organism, a binomial observation is a breakdown of "how many males out of \(N\) sampled", whereas a Poisson observation is an open-ended count, such as "how many individuals are caught" or "how many males are caught"; the quantity of interest differs. Thus, the Poisson distribution is a discrete probability distribution describing the number of occurrences of rare events in a given time interval or spatial domain. It is particularly applicable when events occur independently and the average number of occurrences per time interval or spatial domain is constant. It is widely used in statistics and in various fields to model randomly occurring events, such as the number of incoming telephone calls, the volume of car traffic, and the number of natural disasters.
The probability mass function (PMF) of the Poisson distribution is expressed by the following equation:
\[ P(Y = y) = e^{-\lambda} \frac{\lambda^y}{y!}, \quad y = 0, 1, 2, \ldots \quad (\lambda > 0) \]
Here, \(y\) is the number of observed events, and \(\lambda\) is the average rate of occurrence (the average number of events per unit time). The Poisson distribution has a mean of \(\lambda\) and a variance of \(\lambda\).
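As a quick numerical check that the mean and variance are both \(\lambda\), a minimal simulation sketch (\(\lambda = 7\) is an arbitrary choice, anticipating the seabird example below):

set.seed(1)
y <- rpois(1e5, lambda = 7)
mean(y) # close to 7
var(y)  # also close to 7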
The expected value is obtained as follows
\[ \begin{array}{ll} E[Y] &= \sum_{y=0}^{\infty} y \cdot P(Y = y) \\ &= \sum_{y=1}^{\infty} y \cdot e^{-\lambda} \frac{\lambda^y}{y!} \quad (\text{When } y=0, \text{ Since } y \cdot P(Y = y) = 0 ) \\ &= \lambda \cdot e^{-\lambda} \sum_{y=1}^{\infty} \frac{\lambda^{y-1}}{(y-1)!} \\ &= \lambda \cdot e^{-\lambda} \sum_{z=0}^{\infty} \frac{\lambda^z}{z!} \quad (\text{Let } z = y-1, \text{ So } z=0,1,\ldots) \\ &= \lambda \cdot e^{-\lambda} \cdot e^{\lambda} \\ &= \lambda \end{array} \]
The expected value \(E[Y]\) is \(\lambda\).
The variance is obtained as follows, using a well-known decomposition.
\[ \begin{array}{ll} V[Y] &= E[(Y - E[Y])^2] \\ &= E[Y^2 - 2YE[Y] + E[Y]^2] \\ &= E[Y^2] - 2E[Y] \cdot E[Y] + E[Y]^2 \\ &= E[Y^2] - E[Y]^2 \\ &= E[Y(Y - 1) + Y] - E[Y]^2 \\ &= E[Y(Y - 1)] + E[Y] - E[Y]^2 \end{array} \] By a calculation similar to that for \(E[Y]\), we obtain \(E[Y(Y - 1)]\) as follows.
\[ \begin{array}{ll} E[Y(Y - 1)] &= \sum_{y=0}^{\infty} y(y - 1) \cdot P(Y = y) \\ &= \sum_{y=2}^{\infty} y(y - 1) \cdot e^{-\lambda} \frac{\lambda^y}{y!} \quad (\text{When } y=0,1, \text{ Since } y(y-1) \cdot P(Y = y) = 0 ) \\ &= \lambda^2 \cdot e^{-\lambda} \sum_{y=2}^{\infty} \frac{\lambda^{y-2}}{(y-2)!} \\ &= \lambda^2 \cdot e^{-\lambda} \sum_{z=0}^{\infty} \frac{\lambda^z}{z!} \quad (\text{Let } z = y-2, \text{ So } z=0,1,\ldots) \\ &= \lambda^2 \cdot e^{-\lambda} \cdot e^{\lambda} \\ &= \lambda^2 \end{array} \]
Therefore, the variance is
\[ \begin{array}{ll} V[Y] &= E[Y(Y - 1)] + E[Y] - E[Y]^2 \\ &= \lambda^2 + \lambda - \lambda^2 \\ &= \lambda \end{array} \] Thus, in the Poisson distribution, the expectation and variance are equal, which is one of the characteristics of the Poisson distribution.
In a binomial distribution with parameters \(n\) and \(p = \lambda / n\), if we keep \(\lambda\) constant and let \(n\) approach infinity, the distribution approaches a Poisson distribution with mean \(\lambda\). That is,
\[ \lim_{{\lambda=np, n \to \infty}} \binom{n}{k} p^k (1 - p)^{n - k} = \frac{\lambda^k e^{-\lambda}}{k!} \]
This is known as the Poisson limit theorem. This theorem shows that the Poisson distribution is derived as the limit of the binomial distribution.
The detailed derivation is shown below. The following relation is used in the calculation:
\[ \lim_{{n \to \infty}} \left(1 - \frac{\lambda}{n}\right)^n = e^{-\lambda} \]
Here, if \(p = \frac{\lambda}{n}\), then
\[ \begin{array}{ll} \lim_{{n \to \infty}} P(Y = k) &= \lim_{{n \to \infty}} \binom{n}{k} p^k (1 - p)^{n - k} \\ &= \lim_{{n \to \infty}} \frac{n!}{(n - k)! k!} \left(\frac{\lambda}{n}\right)^k \left(1 - \frac{\lambda}{n}\right)^{n - k} \\ &= \lim_{{n \to \infty}} \frac{n (n - 1) (n - 2) \cdots (n - k + 1)}{n^k} \cdot \frac{\lambda^k}{k!} \left(1 - \frac{\lambda}{n}\right)^n \left(1 - \frac{\lambda}{n}\right)^{-k} \end{array} \]
Since \(\frac{n(n-1)\cdots(n-k+1)}{n^k} \to 1\), \(\left(1 - \frac{\lambda}{n}\right)^n \to e^{-\lambda}\), and \(\left(1 - \frac{\lambda}{n}\right)^{-k} \to 1\), the limit exists and becomes
\[ \frac{\lambda^k e^{-\lambda}}{k!} \]
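A quick numerical illustration of this limit; \(\lambda = 3\) and the values of \(n\) are arbitrary choices:

lambda <- 3; k <- 0:10
for (n in c(10, 100, 10000)) {
  # maximum gap between Bin(n, lambda/n) and Poisson(lambda) probabilities
  print(max(abs(dbinom(k, size = n, prob = lambda/n) - dpois(k, lambda))))
}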
Let \(Y_1 \sim \text{Poisson}(\lambda_1)\) and \(Y_2 \sim \text{Poisson}(\lambda_2)\) be independent, and let \(k \ge 0\) be an integer. Then \[ \begin{array}{ll} P(Y_1 + Y_2 = k) &= \sum_{n=0}^{k} P(Y_1 = n, Y_2 = k-n) \\ &= \sum_{n=0}^{k} P(Y_1 = n)P(Y_2 = k-n) \\ &= \sum_{n=0}^{k} \frac{\lambda_1^n}{n!} e^{-\lambda_1} \frac{\lambda_2^{k-n}}{(k-n)!} e^{-\lambda_2} \\ &= \sum_{n=0}^{k} \frac{1}{n!(k-n)!} \lambda_1^n \lambda_2^{k-n} e^{-(\lambda_1 + \lambda_2)} \end{array} \]
By the binomial theorem,
\[ \begin{array}{ll} (\lambda_1 + \lambda_2)^k &= \sum_{n=0}^{k} \frac{k!}{n!(k-n)!} \lambda_1^n \lambda_2^{k-n} \end{array} \]
Therefore,
\[ \sum_{n=0}^{k} \frac{1}{n!(k-n)!} \lambda_1^n \lambda_2^{k-n} e^{-(\lambda_1 + \lambda_2)} = \frac{(\lambda_1 + \lambda_2)^k}{k!} e^{-(\lambda_1 + \lambda_2)} \]
Hence, \(P(Y_1 + Y_2 = k) = \frac{(\lambda_1 + \lambda_2)^k}{k!} e^{-(\lambda_1 + \lambda_2)}\), and we can see that \(Y_1 + Y_2 \sim \text{Poisson}(\lambda_1 + \lambda_2)\).
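The additivity can also be verified by simulation; \(\lambda_1 = 2\) and \(\lambda_2 = 5\) are arbitrary values:

set.seed(1)
y <- rpois(1e5, 2) + rpois(1e5, 5)
mean(y); var(y) # both should be close to lambda1 + lambda2 = 7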
Suppose that on a certain coastline, seabird dives are observed an average of 7 times per hour. Answer the following questions in this situation:
set.seed(2024)
# Average seabird dive observation counts
lambda <- 7
# Generate observation data for 1000 hours
random_data <- rpois(1000, lambda)
# Plot the histogram
hist(random_data, breaks = 30, main = "Histogram of Poisson Distribution (Seabird Dive Observations)", xlab = "Dive Counts", ylab = "Frequency", col = "skyblue")
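The code producing the two probabilities below is not shown in the original; a minimal reconstruction consistent with the printed values (assuming the questions ask for the probability of observing no dives in an hour and of observing at least 10 dives):

# P(X = 0): probability of observing no dives in an hour
dpois(0, lambda)
# P(X >= 10) = 1 - P(X <= 9)
1 - ppois(9, lambda)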
## [1] 0.000911882
## [1] 0.1695041
\[ P(3 \leq X \leq 8) = P(X \leq 8) - P(X \leq 2) \]
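In R (again a reconstruction matching the printed value):

ppois(8, lambda) - ppois(2, lambda) # P(3 <= X <= 8)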
## [1] 0.6994551
The Poisson distribution parameter (the rate parameter) represents both the mean and the variance.
In nearly all practical problems, however, a count variable's variance is larger than its mean, a feature known as over-dispersion.
As a result, when using the Poisson distribution, we often need a way to model a response variance that is larger than the mean.
The negative binomial distribution is often used in ecological studies to model over-dispersed count data.
The negative binomial distribution can be seen as an extension of the Poisson distribution.
Fisher et al. [1943] derived the negative binomial distribution as a natural extension of the Poisson distribution by using a \(\chi^{2}\) distribution to model the Poisson distribution parameter.
The negative binomial distribution has three different parameterizations. The initial definition of the distribution is through independent and identical Bernoulli trials (each with the same probability of success of p) as in the binomial distribution.
In the binomial distribution, we consider the number of successes out of n trials as the random variable. In the negative binomial distribution, the random variable \(Y\) is the number of failures before r successes are achieved.
As the last trial is the \(r\)th success, the probability of \(Y = y\) can be derived through the binomial distribution:
\[ P(Y=y \vert r,p)=\binom{y+r-1}{r-1}p^r (1-p)^{y} \]
This parameterization is also known as the Pascal distribution. It is a generalization of the geometric distribution (when \(r = 1\)). When the definition of \(r\) is extended to real values (not limited to integer values as in the Pascal distribution), the negative binomial is also known as the Polya distribution. When \(r\) takes a real value, the binomial coefficient is extended to its real value definition through the gamma function:
\[ P(Y=k \vert r,p)=\dfrac{\Gamma(k+r)}{k!\Gamma(r)} p^r (1-p)^{k} \]
When the negative binomial distribution is used in a generalized linear model, we want to relate predictor variables to the mean of the distribution, which is \(\mu = \dfrac{r(1-p)}{p}\). Hence, the distribution is reparameterized by replacing \(p\) with \(\mu\), that is, setting \(p = \dfrac{r}{\mu+r}\) and \(1-p = \dfrac{\mu}{\mu+r}\), resulting in the third parameterization
\[ P(Y=k \vert r,\mu)=\dfrac{\Gamma(k+r)}{k!\Gamma(r)} \left(\dfrac{r}{\mu+r}\right)^r \left(\dfrac{\mu}{\mu+r}\right)^{k} \]
Because the gamma distribution is the conjugate prior of the Poisson rate parameter, we can also define the negative binomial through the gamma distribution parameters \(\alpha\) and \(\beta\), by setting \(r = \alpha\) and \(\mu = \alpha/\beta\):
\[ P(Y=k \vert \alpha,\beta)=\dfrac{\Gamma(\alpha+k)}{k!\Gamma(\alpha)} \left(\dfrac{\beta}{\beta+1}\right)^\alpha \left(\dfrac{1}{\beta+1}\right)^{k} \]
For example, suppose that, based on past experience, we know that, on average, 3 fish are caught. Using the Negative Binomial distribution we can calculate the probability that \(Y = 0\) fish are caught or that \(Y = 1\) fish are caught. The distribution function allows us to enter values for \(\mu\) and \(k\) to calculate the probability of any value of \(Y\). Let us assume for the present that \(\mu = 3\) and \(k = 1\).
Calculating the probability for every possible \(Y\) value forms the distribution function.
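In R these probabilities come from dnbinom(), which supports the mean parameterization directly (mu = 3 and size = k = 1 as assumed above):

mu <- 3; k <- 1
dnbinom(0, mu = mu, size = k) # P(Y = 0) = (k/(mu + k))^k = 0.25
dnbinom(1, mu = mu, size = k) # P(Y = 1) = 0.1875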
In the following we plot negative binomial distributions for three \(\mu\) and \(k\) values. Note that for large \(k\) values, the negative binomial distribution becomes a Poisson distribution. For small values of \(k\) the negative binomial distribution allows for a large number of zeros.
NOTE: The negative binomial distribution can only be used for a response variable that has values equal to, or greater than, 0 and is discrete.
As mentioned above, there are various ways of presenting the negative binomial distribution. Its density function is given by
\[ f(y; k,\mu)=\dfrac{\Gamma(y+k)}{\Gamma(k)\Gamma(y+1)}\times \left(\dfrac{k}{\mu+k}\right)^{k} \left(1-\dfrac{k}{\mu+k}\right)^{y} \] The distribution function has the parameters \(\mu\) and \(k\). The symbol \(\Gamma\) denotes the gamma function, which generalizes the factorial: \(\Gamma(y + 1) = y!\) for integer \(y\). The first component in the density function is a normalizing constant; the other two components are power functions of the two unknown parameters. The mean and variance of \(Y\) of a negative binomial GLM are given by
\[ E(Y)=\mu ~~~~~~~ \text{var}(Y)=\mu + \dfrac{\mu^2}{k} \] If \(k\) is large relative to \(\mu^2\), the term \(\dfrac{\mu^2}{k}\) is close to 0 and the variance of \(Y\) is \(\mu\). In such cases the negative binomial converges to the Poisson distribution.
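This convergence is easy to check numerically; here mu = 3 and a very large k are arbitrary choices:

mu <- 3; y <- 0:15
max(abs(dnbinom(y, mu = mu, size = 1e6) - dpois(y, mu))) # essentially zero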
Hilbe (2011) uses a different notation for the variance: \[ \text{var}(Y) = \mu + \alpha \times \mu^2 \]
This notation is slightly simpler, as \(\alpha = 0\) means that the quadratic term disappears. Under this parameterization, which is used in most commercial statistics packages (e.g. SAS, Stata, SPSS, Genstat, Limdep) a negative binomial model with a dispersion or heterogeneity (scale) parameter equal to zero, \(\alpha = 0\), is Poisson.
Larger values of \(\alpha\) indicate increasing levels of variability, or dispersion, over what is expected given the Poisson distribution.
When the negative binomial distribution upon which the negative binomial GLM is based is parameterized with \(\alpha =\dfrac{1}{k}\) such that its variance is \(\mu + \alpha \times \mu^2\), there is a direct relationship between \(\mu\) and the dispersion parameter, \(\alpha\).
When the parameterization of the negative binomial results in a variance of \(\mu+\dfrac{\mu^2}{k}\), the relationship is indirect.
The coefficients, standard errors, and fitted values of the direct and indirect parameterizations are the same; only the dispersion parameter and statistics using it differ.
Recall that the dispersion statistic is defined as:
\[ \text{dispersion statistic}=\dfrac{\chi^{2}}{\text{residual degrees of freedom}} \]
The residual degrees of freedom is equal to the number of observations minus the number of parameters, including the intercept. We also write this as \(N - p\).
For both parameterizations, values of the dispersion statistic greater than 1 indicate overdispersion. An ideally fitted Poisson or negative binomial GLM has a dispersion statistic of 1.0.
Do not confuse the dispersion statistic with the dispersion parameter, \(\alpha\).
A new R function called nbinomial displays negative binomial GLM results based on the direct parameterization. It is in the msme package on CRAN.
But for now we will continue using the indirect parameterization of the dispersion statistic in our discussion of the negative binomial, since it is used in glm.nb, which comes with the standard R software.
Suppose a start-up is looking for 5 investors. They ask investors one after another, where each independently has a \(20\%\) chance of saying yes. Let \(X\) = total number of investors they ask, and note that \(X\) itself does not follow a negative binomial distribution. Find \(f(x)\) and \(f(10)\).

Solution:

Let \(Y\) = number who say no before 5 say yes. Then \[ Y \sim NB(5, 0.2) \] and \(X = Y + 5\), so for \(x = 5, 6, \ldots\) \[ f(x) = P(X = x) = P(Y = x - 5) = \binom{x-1}{4} (0.2)^5 (0.8)^{x-5}, \] and \(f(10) = \binom{9}{4}(0.2)^5(0.8)^5 \approx 0.0132\).
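The value of \(f(10)\) can be checked in R through \(P(Y = 5)\), using the base parameterization of dnbinom() by the number of failures:

dnbinom(5, size = 5, prob = 0.2) # choose(9,4) * 0.2^5 * 0.8^5, approximately 0.0132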
The data used in this workshop were presented in Bailey et al. (2008), who investigated the impact of commercial fishing on deep-sea fish populations. They analysed data of a bottom trawl survey conducted in the northeast Atlantic Ocean from 1979 to 1989 and from 1997 to 2002.
head(Fish,5)
## Site TotAbund Dens MeanDepth Year Period Xkm Ykm SweptArea
## 1 1 76 0.002070281 804 1978 1 98.75575 -57.46692 36710.00
## 2 2 161 0.003519799 808 2001 2 76.80388 178.64798 45741.25
## 3 3 39 0.000980515 809 2001 2 103.79283 -50.05184 39775.00
## 4 4 410 0.008039216 848 1979 1 91.53227 146.44797 51000.00
## 5 5 177 0.005933375 853 2002 2 107.14419 -37.07544 29831.25
The underlying biological question is whether the stress imposed by commercial fishing on a portion of this depth gradient has affected species assemblages throughout the entire depth gradient.
Following Bailey et al. (2009), we use total abundance, defined as the total number of fish per site.
We investigate the relationship between total fish abundance and mean depth for each of the time periods.
Before applying any statistical model, we present a scatterplot (Figure below).
The graph shows a pattern of decline in total abundance at greater depths.
M1 <- glm(TotAbund ~ MeanDepth,
data = Fish,
family = poisson(link = "log"))
summary(M1)
##
## Call:
## glm(formula = TotAbund ~ MeanDepth, family = poisson(link = "log"),
## data = Fish)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.64334 0.01273 521.70 <2e-16 ***
## MeanDepth -0.62870 0.00670 -93.84 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 27779 on 145 degrees of freedom
## Residual deviance: 15770 on 144 degrees of freedom
## AIC: 16741
##
## Number of Fisher Scoring iterations: 5
Given a Poisson GLM, the dispersion statistic may be obtained by

#Checking for overdispersion
E1 <- resid(M1, type = "pearson") #Pearson residuals (see the validation section below)
N <- nrow(Fish)
p <- length(coef(M1))
Dispersion <- sum(E1^2) / (N - p)
Dispersion
## [1] 115.5856
The dispersion statistic = 115.5856
There is substantial overdispersion in the data, possibly to the extent that even a negative binomial GLM will not sufficiently adjust for the variation. We will need to apply the negative binomial GLM to determine if that is the case.
Simulation tests show that a true or perfectly fitted Poisson model has a dispersion statistic of 1.0. Values greater than 1 indicate likely overdispersion; values smaller than 1 indicate underdispersion. Overdispersion, which is common in count data, occurs when the variation in the data exceeds the theoretical or expected amount of variability based on Poisson distribution assumptions.
If our Poisson GLM has a dispersion statistic greater than 1, we should check the following possible sources before concluding that the model is truly overdispersed.
If we have eliminated all these possibilities, and the model still shows a dispersion statistic greater than 1, the model is genuinely overdispersed.
There is a variety of methods for handling Poisson overdispersed data. One may simply adjust the standard errors by scaling, which is what the quasipoisson family does.
However, only the standard errors will be adjusted, not the coefficients. In many cases overdispersion affects model coefficients. The most common method used by statisticians to adjust for Poisson overdispersion is the negative binomial GLM.
The negative binomial distribution contains a second parameter called the dispersion or heterogeneity parameter, which is used to accommodate Poisson overdispersion.
The quasi-Poisson distribution only modifies the standard errors of the parameters in the Poisson GLM and not the parameter estimates. The mechanism behind overdispersion is likely to affect the regression parameters, and therefore quasi-Poisson is less useful as a solution for overdispersion.
We are using the same fishery data to determine whether negative binomial GLM will adequately adjust for the Poisson overdispersion in the data. We use the following model:
\[ TotAbund_i \sim NB(\mu_i,k) \]
\[ \text{E}(TotAbund_i) =\mu_i ~~~~~~~~~~~ \text{var}(TotAbund_i) = \mu_i +\dfrac{\mu_i^2}{k} \]
\[ \text{log}(\mu_i)=\eta_i \]
\[ \eta_i = \beta_1 + \beta_2 \times MeanDepth_i+ \beta_3 \times Period_i + \beta_4 \times MeanDepth_i\times Period_i+ \text{log}(SweptArea_i) \]
The same covariates are used as for the Poisson GLM. The only difference is that we now use a negative binomial distribution.
Fish$fPeriod <- factor(Fish$Period)
Fish$LogSA <- log(Fish$SweptArea)
M4 <- glm.nb(TotAbund ~ MeanDepth * fPeriod + offset(LogSA),
data = Fish)
summary(M4)
##
## Call:
## glm.nb(formula = TotAbund ~ MeanDepth * fPeriod + offset(LogSA),
## data = Fish, init.theta = 1.950355727, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.25515 0.16064 -20.263 <2e-16 ***
## MeanDepth -1.03764 0.06032 -17.202 <2e-16 ***
## fPeriod2 -0.61216 0.27395 -2.235 0.0254 *
## MeanDepth:fPeriod2 0.07571 0.10209 0.742 0.4583
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(1.9504) family taken to be 1)
##
## Null deviance: 461.62 on 145 degrees of freedom
## Residual deviance: 158.32 on 142 degrees of freedom
## AIC: 1754.7
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 1.950
## Std. Err.: 0.219
##
## 2 x log-likelihood: -1744.729
Let’s check first for overdispersion:
E4 <- resid(M4, type = "pearson")
N <- nrow(Fish)
p <- length(coef(M4)) + 1 #"+1' is due to k
Dispersion <- sum(E4^2) / (N - p)
Dispersion
## [1] 1.001618
The '+1' used for determining the number of parameters (\(p\)) is due to \(k\). There is no overdispersion. Recall that the Poisson GLM had an AIC of 16741 and a dispersion statistic of 115.5856; here we have an AIC of 1754.7 and a dispersion statistic of 1.0016, so it is easy to conclude that the negative binomial is a far superior model to the Poisson.
As with linear regression, we need to use residuals in a model validation.
Here we use Pearson residuals. These are standardized ordinary residuals and are calculated by
\[ \varepsilon_i=\dfrac{TotAbund_i−\text{E}(TotAbund_i)}{\sqrt{\text{var}({TotAbund_i})}} \] - Deviance residuals, used by default by some R functions, quantify the contribution of each observation to the residual deviance.
NOTE: It should be noted that we are not looking for normality in the Pearson or deviance residuals. The issue is lack of fit, and we look for patterns in the Pearson residuals.
E1 <- resid(M1, type = "pearson")
Now that we have obtained the residuals, we plot them against fitted values, against each covariate in the model, and each covariate not in the model. If sampling took place at different time points or spatial points, we also need to investigate temporal or spatial correlation in the residuals. The auto-correlation function and variogram functions can be used for this.
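A minimal sketch of these plots, using the Pearson residuals E1 and the Poisson GLM M1 from above (acf() gives a quick check of temporal correlation):

plot(fitted(M1), E1, xlab = "Fitted values", ylab = "Pearson residuals")
abline(h = 0, lty = 2)
plot(Fish$MeanDepth, E1, xlab = "Mean depth", ylab = "Pearson residuals")
abline(h = 0, lty = 2)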
We refit the model without the interaction term and present a portion of the numerical output.
#drop1(M4, test = "Chi")
M5 <- glm.nb(TotAbund ~ MeanDepth + fPeriod + offset(LogSA),
data = Fish)
#drop1(M5, test = "Chi")
summary(M5)
##
## Call:
## glm.nb(formula = TotAbund ~ MeanDepth + fPeriod + offset(LogSA),
## data = Fish, init.theta = 1.943752383, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.31134 0.13703 -24.164 < 2e-16 ***
## MeanDepth -1.01372 0.04876 -20.788 < 2e-16 ***
## fPeriod2 -0.43027 0.12650 -3.401 0.000671 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(1.9438) family taken to be 1)
##
## Null deviance: 460.09 on 145 degrees of freedom
## Residual deviance: 158.28 on 143 degrees of freedom
## AIC: 1753.2
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 1.944
## Std. Err.: 0.218
##
## 2 x log-likelihood: -1745.204
We use this model for the validation step. We need to extract Pearson residuals and plot them against fitted values and also versus each covariate in the model and not in the model (Figure below). There are no obvious problems.
We also plot the Pearson residuals versus Mean Depth per Period (Figure below).
We also need to inspect the Pearson residuals for spatial correlation. To do this we can make a variogram of the Pearson residuals (e.g. per Period).
As in linear regression, the Akaike Information Criterion (AIC) can be used to find the optimal model.
In linear regression we use the total sum of squares and the residual sum of squares to calculate \(R^2\), which informs us of the explained variance. In a GLM we use the deviance to gain this information.
Null deviance: the maximum likelihood equivalent of the total sum of squares.
Residual deviance: the maximum likelihood equivalent of the residual sum of squares.
We do not have an \(R^2\) in GLM models; the closest we can get is the explained deviance. From the output above (null deviance: 461.62 on 145 degrees of freedom; residual deviance: 158.32 on 142 degrees of freedom), \[ 100\times \dfrac{\text{null deviance}-\text{residual deviance}}{\text{null deviance}}=100\times\dfrac{461.62-158.32}{461.62}=65.70\% \] So the explanatory variables explain \(65.70\%\) of the variation in total fish abundance.
Next let’s talk about the estimated parameters.
For each regression parameter we have an estimate, standard error, \(z\)-value (the ratio of the estimate to the standard error), and a \(p\)-value.
Under the null hypothesis of no effect (\(H_0: \beta = 0\)), we expect a \(z\)-value of \(0\).
The larger this value in absolute terms, the less plausible the null hypothesis.
The \(p\)-value gives the probability of obtaining a statistic at least this extreme if the null hypothesis were true.
In this case both the intercept and slopes are significantly different from 0 at the 5% level.
The estimated model is given by
\[ TotAbund_i \sim NB(\mu_i,1.94) \]
\[ \text{E}(TotAbund_i) =\mu_i ~~~~~~~~~~~ \text{var}(TotAbund_i) = \mu_i +\dfrac{\mu_i^2}{1.94} \]
\[ \begin{array}{rcl} \text{Period 1:}~~ \mu_i & = & e^{\eta_i} \\[0.2cm] \eta_i & =& -3.31 - 1.01\times MeanDepth_i + \text{log}(SweptArea_i) \\[0.2cm] \text{Period 2:}~~ \mu_i & = & e^{\eta_i} \\[0.2cm] \eta_i & =& -3.74 - 1.01\times MeanDepth_i + \text{log}(SweptArea_i) \\[0.2cm] \end{array} \]
Hence we have two exponential curves, each with the same slope for Mean Depth but with different intercepts. The best way to demonstrate the effects of covariates on the fitted values is to plot the fitted values relative to the mean depth. A practical problem is that the model also contains the log of Swept Area as an offset. To avoid plotting 3-dimensional graphs, we substitute the log of the mean Swept Area values for the offset. Using R we calculated the two fitted lines (one per period).
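A sketch of how these fitted lines can be computed with predict(), fixing the offset at the mean log swept area (model M5 and the variables defined above):

MD <- seq(min(Fish$MeanDepth), max(Fish$MeanDepth), length = 100)
nd1 <- data.frame(MeanDepth = MD, fPeriod = "1", LogSA = mean(Fish$LogSA))
nd2 <- data.frame(MeanDepth = MD, fPeriod = "2", LogSA = mean(Fish$LogSA))
plot(Fish$MeanDepth, Fish$TotAbund, xlab = "Mean depth", ylab = "Total abundance")
lines(MD, predict(M5, newdata = nd1, type = "response"), lwd = 2)
lines(MD, predict(M5, newdata = nd2, type = "response"), lwd = 2, lty = 2)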
The Delta-model is a two-component model used to handle data that contains a large number of zeros and positive continuous values. This structure is particularly useful in ecological and environmental studies, as well as in other fields where zero-inflated data is common.
The Delta-model can be categorized as a mixture model as it is divided into two separate but related modeling steps:
Delta Distribution (Discrete Part): This component models the occurrence of zero values in the data. It uses a binary distribution (such as the Bernoulli distribution) to represent the probability of observing a zero versus a non-zero value.
To represent the binary indicator for zeroes and non-zeroes:
\[ X = \begin{cases} 1 & \text{if non-zero} \\ 0 & \text{otherwise} \end{cases} \] This means \(X\) takes the value 1 when a non-zero value occurs, and 0 otherwise. This is a deterministic way of defining \(X\) based on the occurrence of the event.
Bernoulli Distribution
The Bernoulli distribution is a discrete probability distribution for a random variable which takes the value 1 with probability \(p\) and the value 0 with probability \(1-p\). It is denoted as:
\[ X \sim \text{Bernoulli}(p) \]
Lognormal Distribution (Continuous Part): This component models the positive continuous values. Positive values are assumed to follow a lognormal distribution, which is appropriate for non-negative data that is right-skewed.
Given that the value is positive (the indicator equals 1), the positive values are assumed to follow a lognormal distribution: \[ \log(X) \sim \mathcal{N}(\mu, \sigma^2) \quad (X > 0) \] Finally, an overall distribution is obtained by multiplying the outcomes of the first (probability) and second (lognormal) steps.
Parameter Estimation
To fit the Delta-Lognormal model with the data, the following parameters need to be estimated:
\(p\): The probability \(p\) of observing a non-zero value is estimated as the proportion of non-zero observations in the dataset.
\(\mu\) and \(\sigma\): The parameters \(\mu\) and \(\sigma\) of the lognormal distribution are estimated from the log-transformed positive values. This can be done using maximum likelihood estimation (MLE) or other appropriate statistical techniques, as sketched below.
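A minimal sketch of both estimation steps; the data vector y here is hypothetical:

y <- c(0, 0, 1.2, 3.5, 0, 0.8, 2.1) # hypothetical data containing zeros and positive values
p_hat <- mean(y > 0)                # proportion of non-zero observations
logpos <- log(y[y > 0])             # log of the positive values
mu_hat <- mean(logpos)              # MLE of mu
sigma_hat <- sqrt(mean((logpos - mu_hat)^2)) # MLE of sigma (divides by n, not n - 1)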
Probability Density Function (PDF)
The probability density function (pdf) of the Delta-Lognormal model can be expressed as a mixture of a discrete probability mass at zero and a continuous lognormal distribution for positive values:
\[ f(x) = \begin{cases} 1 - p & \text{if } x = 0 \\ p \cdot \frac{1}{x \sigma \sqrt{2\pi}} \exp\left(-\frac{(\log x - \mu)^2}{2 \sigma^2}\right) & \text{if } x > 0 \end{cases} \]
where \(\mu_{LN}\) and \(\sigma^2_{LN}\) are the mean and variance of the log-normal component. The overall mean and variance are then
\[ E[X] = p\cdot\mu_{LN} \]
\[ V[X] = p\cdot\sigma_{LN}^2+p(1-p)\mu_{LN}^2 \]
We denote a random variable \(X\) following a log-normal distribution with parameters \(\mu\) and \(\sigma^2\) as:
\[ X \sim \ln N(\mu, \sigma^2) \]
The probability density function of the log-normal distribution is given by:
\[ f_X(x) = \frac{1}{x\sigma\sqrt{2\pi}} \cdot \exp\left[-\frac{(\ln x - \mu)^2}{2\sigma^2}\right] \]
A log-normally distributed random variable is defined as the exponential function of a normal random variable:
\[ Y \sim N(\mu, \sigma^2) \Rightarrow X = \exp(Y) \sim \ln N(\mu, \sigma^2) \]
For a univariate lognormal distribution, where \(y = \{y_1, \ldots, y_n\}\) are lognormally distributed observations, the maximum likelihood estimates of \(\mu\) and \(\sigma\) are given by:
\[ \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} \log(y_i) \] and
\[ \hat{\sigma} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\log(y_i) - \hat{\mu})^2} \ \]
The probability density function of the normal distribution is:
\[ f_Y(y) = \frac{1}{\sigma\sqrt{2\pi}} \cdot \exp\left[-\frac{(y - \mu)^2}{2\sigma^2}\right] \]
The change of variables technique involves the inverse function:
\[ X = g(Y) = \exp(Y) \] with the inverse function \[ Y = g^{-1}(X) = \ln(X) \] Because the derivative of \(\exp(Y)\) is always positive, \(g(Y)\) is strictly increasing and we can calculate the probability density function of a strictly increasing function as \[ f_X(x) = \begin{cases} {f_Y(g^{-1}(x))}\frac{dg^{-1}(x)}{dx}, & \text{if } x \in X \\ 0, & \text{if } x \notin X \end{cases} \]
where \(X = \{x = g(y) : y \in Y\}\). We use the probability density function of the normal distribution together with the derivative of the inverse function with respect to \(x\):
\[ \frac{dg^{-1}(x)}{dx} = \frac{1}{x} \]
Substituting these into the normal distribution’s PDF gives us the log-normal distribution’s PDF:
\[ \begin{array}{lcl} f_X(x) &=& f_Y(g^{-1}(x)) \cdot \frac{dg^{-1}(x)}{dx} \\ &=& \frac{1}{\sqrt{2\pi}\sigma} \cdot \exp\left[-\frac{1}{2}\left(\frac{g^{-1}(x) - \mu}{\sigma}\right)^2\right]\cdot \frac{dg^{-1}(x)}{dx} \\ &=& \frac{1}{\sqrt{2\pi}\sigma} \cdot \exp\left[-\frac{1}{2}\left(\frac{\ln x - \mu}{\sigma}\right)^2\right]\cdot \frac{d(\ln x)}{dx} \\ &=& \frac{1}{\sqrt{2\pi}\sigma} \cdot \exp\left[-\frac{(\ln x - \mu)^2}{2\sigma^2}\right]\cdot \frac{1}{x} \\ &=& \frac{1}{x\sigma\sqrt{2\pi}} \cdot \exp\left[-\frac{(\ln x - \mu)^2}{2\sigma^2}\right] \end{array} \]
which is the log-normal PDF stated above.
Soch, Joram, et al. (2024). StatProofBook/StatProofBook.github.io: The Book of Statistical Proofs (Version 2023). Zenodo. https://doi.org/10.5281/ZENODO.4305949
Generalized additive model (GAM) is a statistical modeling technique which allows non-linear relationships between the response and predictor variables, while maintaining additivity, following the formula:
\[ y_i = f_1(x_{i1}) + \cdots + f_p(x_{ip}) + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2) \quad (i=1,2,...,n) \] where \(f_1, \ldots, f_p\) are spline functions allowing a flexible and accurate modeling of the relationships between the response and predictor variables.
Step 1: Binomial Model for Probability of Non-Zero Values
Objective: Estimate the probability that the response variable is non-zero.
Method: Use a Generalized Additive Model (GAM) with a binomial family to model the probability of a non-zero response based on predictor variables.
The probability \(p_i\) that the response \(Y_i\) is non-zero is modeled using a binomial distribution by the logistic function applied to a linear predictor involving smooth functions of the covariates:
\[ \text{logit}(p_i) = \eta_i = \beta_0 + s_1(x_{i1}) + s_2(x_{i2}) + \cdots + s_k(x_{ik}) \]
where \(\beta_0\) is the intercept and \(s_1, \ldots, s_k\) are smooth (spline) functions of the covariates \(x_{i1}, \ldots, x_{ik}\).
The model can be fitted using a Bernoulli distribution:
\[ Y_i \sim \text{Bernoulli}(p_i) \]
Step 2: Lognormal Model for Positive Values
For the positive response values \(Y_i\), the log-transformed values are modeled as: \[ \log(Y_i) = \mu_i + \epsilon_i \] Where:
\[ \mu_i = \beta_0 + s_1(x_{i1}) + s_2(x_{i2}) + \cdots + s_k(x_{ik}) \] And: \[ \epsilon_i \sim \mathcal{N}(0, \sigma^2) \]
where, as in Step 1, the \(s_k\) are smooth functions of the covariates, and \(\sigma^2\) is the residual variance on the log scale.
In the context of a GAM, the inverse link function is used to transform the linear predictor back to the scale of the response variable. For the logit link of the binomial part and the log link of the lognormal part, the inverse links are as follows:
The logit link relates the probability \(p_i\) to the linear predictor \(\eta_i\): \[ \text{logit}(p_i)=\eta_i = \log\left(\frac{p_i}{1 - p_i}\right) \] Solving for \(p_i\) gives the inverse logit:
\[ \begin{array}{lcl} e^{\eta_i} &=& \frac{p_i}{1-p_i} \\ p_i &=& \frac{e^{\eta_i}}{1 + e^{\eta_i}} \end{array} \]
For the lognormal distribution, the link function involves the natural logarithm. The log link function is defined as: \[ \log(Y_i) = \mu_i \] The inverse log link function, which transforms the linear predictor \(\eta_i\) back to the original response scale \(Y_i\), is given by: \[ Y_i = e^{\mu_i} \]
To obtain the overall expected value \(E[Y_i]\), the probability of being non-zero is combined with the expected value of the positive values:
\[ E[Y_i] = p_i \cdot E[Y_i \mid Y_i > 0] \] Substituting the expressions: \[ E[Y_i] = p_i \cdot e^{\mu_i} \] (Strictly, the mean of a lognormal variable is \(e^{\mu_i + \sigma^2/2}\); using \(e^{\mu_i}\) back-transforms the median, a common simplification.)
Multiplying the probability from the logistic model by the values from the lognormal model yields the complete distribution of \(Y_i\). This approach integrates both the discrete and continuous components of the Delta-Lognormal model.
# Load necessary library
library(ggplot2)
# Set the seed for reproducibility
set.seed(123)
# Generate synthetic data
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(1 * x - 0.25))
#rbinom(n, size, prob)
#plogis(x) has consequently been called the ‘inverse logit’.
# Create a data frame
data <- data.frame(x, y)
# Plot the data and the logistic regression curve
ggplot(data, aes(x = x, y = y)) +
geom_point(alpha = 0.3,color = "black") +
geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE, color = "red") +
labs(title = "Logistic Regression", x = "x", y = "Probability of Success (y)") +
theme_minimal()
# Load necessary library
library(ggplot2)
# Set the seed for reproducibility
set.seed(123)
# Generate synthetic data
n <- 1000
x <- rlnorm(n) # Generate x from a lognormal distribution
y <- rlnorm(n, meanlog = 0, sdlog = 1) # Generate y from a lognormal distribution
# Create a data frame
data <- data.frame(x, y)
# Plot density plot of y
density_y <- ggplot(data, aes(x = y)) +
geom_density(fill = "lightblue", color = "black") +
labs(title = "Density Plot of y (Lognormal Distribution)", x = "y", y = "Density") +
theme_minimal()
density_y
# Load necessary libraries
library(MASS)
library(mgcv)
# Set the seed for reproducibility
set.seed(123)
# Number of data points
n <- 1000
# Generate covariates
SST <- runif(n, min = 10, max = 35)
MLD <- runif(n, min = 5, max = 300)
CHL <- runif(n, min = 0.1, max = 2)
# Define coefficients for the covariates
beta_SST <- 0.7
beta_MLD <- -0.02
beta_CHL <- 1.5
intercept <- -8
# Step 1: Generate binary indicator for zero and non-zero values
# Logistic function to determine probability
logit_p <- intercept + beta_SST * SST + beta_MLD * MLD + beta_CHL * CHL
prob_nonzero <- exp(logit_p) / (1 + exp(logit_p))
zero_indicator <- rbinom(n, 1, prob_nonzero)
# Define coefficients for Step 2
beta_mu_SST <- 0.1
beta_mu_MLD <- -0.002
beta_mu_CHL <- 0.2
intercept_mu <- 1
# Step 2: Generate lognormal values for the non-zero cases
# Parameters for lognormal distribution
mu <- intercept_mu + beta_mu_SST * SST - beta_mu_MLD * MLD + beta_mu_CHL * CHL#
#+0.05 * SST^2 - 0.001 * MLD^2 + 0.1 * log(CHL) # Adding non-linear terms
sigma <- 0.7
lognormal_values = exp(rnorm(n, mu, sigma))
# Combine into a single dataset
data <- zero_indicator * lognormal_values
# Create a data frame
data_df <- data.frame(response = data, SST = SST, MLD = MLD, CHL = CHL)
# Step 3: Fit the Delta-Lognormal model using GAM
# Fit the binomial part (probability of non-zero)
binary_val <- ifelse(data_df$response > 0, 1, 0)
binomial_model <- gam(binary_val ~ s(SST) + s(MLD) + s(CHL), family = binomial, data = data_df)
# Fit the lognormal part (positive values)
pos.data <- subset(data_df, response > 0)
pos.data$log_response <- log(pos.data$response)
lognormal_model <- gam(log_response ~ s(SST) + s(MLD) + s(CHL), family = gaussian, data = pos.data)
# Display the summary of the models
summary(binomial_model)
##
## Family: binomial
## Link function: logit
##
## Formula:
## binary_val ~ s(SST) + s(MLD) + s(CHL)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.3857 0.4547 11.84 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(SST) 1.002 1.004 126.44 < 2e-16 ***
## s(MLD) 1.347 1.615 84.24 < 2e-16 ***
## s(CHL) 1.557 1.929 33.46 6.21e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.64 Deviance explained = 64.5%
## UBRE = -0.65903 Scale est. = 1 n = 1000
summary(lognormal_model)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log_response ~ s(SST) + s(MLD) + s(CHL)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.93060 0.02351 167.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(SST) 1.905 2.389 308.645 <2e-16 ***
## s(MLD) 1.000 1.000 47.461 <2e-16 ***
## s(CHL) 1.000 1.000 3.435 0.0642 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.504 Deviance explained = 50.6%
## GCV = 0.4578 Scale est. = 0.45507 n = 823
# Predicting using the models
# Predict the probability of non-zero
data_df$pred_prob <- predict(binomial_model, type = "response")
# Predict the lognormal values for non-zero cases
data_df$pred_lognormal <- exp(predict(lognormal_model, newdata = data_df))
# Combine the predictions
data_df$predicted_values <- data_df$pred_prob * data_df$pred_lognormal
#or
#data_df$predicted_values <- (1/(1+exp(-predict(binomial_model,data_df)))) * exp(predict(lognormal_model,data_df))
# Display the first few rows of the dataset with predictions
head(data_df)
## response SST MLD CHL pred_prob pred_lognormal
## 1 13.63902 17.18944 85.71871 0.4033806 0.9329159 21.15832
## 2 96.72935 29.70763 180.19075 0.3745801 0.9997597 89.35591
## 3 16.85298 20.22442 52.25452 0.3834427 0.9942429 26.15406
## 4 327.01686 32.07544 256.76192 1.0774251 0.9998741 142.39534
## 5 179.55899 33.51168 255.08305 1.0363719 0.9999467 165.76603
## 6 0.00000 11.13891 145.97661 1.2710513 0.2447985 15.18996
## predicted_values
## 1 19.738929
## 2 89.334438
## 3 26.003485
## 4 142.377416
## 5 165.757188
## 6 3.718479
library(mgcViz)
binomial_model.plot <- getViz(binomial_model)
print(plot(binomial_model.plot), pages = 1)
lognormal_model.plot <- getViz(lognormal_model)
print(plot(lognormal_model.plot), pages = 1)
# Calculate deviance explained for both parts
deviance_explained_binomial <- 1 - (binomial_model$deviance / binomial_model$null.deviance)
deviance_explained_lognormal <- 1 - (lognormal_model$deviance / lognormal_model$null.deviance)
# Print deviance explained
cat("Deviance explained (binomial model):", deviance_explained_binomial, "\n")
## Deviance explained (binomial model): 0.6452954
cat("Deviance explained (lognormal model):", deviance_explained_lognormal, "\n")
## Deviance explained (lognormal model): 0.5062341
#check the residuals: actual response minus predicted values
resi <- data_df$response-(data_df$predicted_values)
plot(resi)
Zero-inflated Poisson (ZIP) regression is a model for count data with excess zeros. As explained in chapter 2.2, a Poisson distribution with parameter \(\lambda\) is used to model many naturally occurring events, where \(Y\) represents the number of events per unit of time or space.
The normal Poisson distribution can be expressed as follows:
\[ P(Y = y_i) = \frac{e^{-\lambda} \cdot \lambda^{y_i}}{y_i!} \]
However, the Poisson distribution may not be useful when \(Y\) contains more zeros than the distribution allows for.
In such a case it fits the data badly and generates poor-quality predictions; a modified Poisson distribution, the zero-inflated Poisson (ZIP) introduced below, then becomes useful.
So, let’s proceed!
In order to understand the difference between Normal Poisson distribution and ZIP, here are the probability distributions respectively.
Note that \(\omega\) is the structural zero probability, and \(\lambda\) is the mean in underlying Poisson distribution.
plot(0:12, dpois(0:12, lambda=3),
type='h',
main='ZIP(ω=0, λ = 3) = Poi(λ=3)',
ylab='Probability',
xlab ='',
lwd=10)
library(VGAM) # provides dzipois()
plot(0:12, dzipois(0:12, lambda=3, pstr0 = 0.3, log = FALSE),
type='h',
main='ZIP(ω=0.3, λ = 3)',
ylab='Probability',
xlab ='',
lwd=10)
See that the parameter \(\omega\) gives the extra probability at the value 0, and when \(\omega = 0\) the ZIP reduces to \(Poi(λ)\).
The variable takes the value zero or a positive value with the following probabilities:
\[ y_i = 0 \quad with \; probability \; ω_i\\ y_i >0 \quad with \; probability \; 1-ω_i\\ \]
Thus the ZIP distribution, denoted \(ZIP(\omega, \lambda)\), has the two parameters \(\omega\) and \(\lambda\) and the following probability function:
\[ P(Y=y_i) = ω_i+(1-ω_i)\exp(-λ_i) \quad if \; y_i=0\\ P(Y=y_i) = (1-ω_i)\exp(-λ_i)λ_i^{y_i}/y_i! \quad if \; y_i>0\\ \]
Note that the variable 0 can be either a structural zero with probability \(\omega\) or from underlying Poisson distribution with probability \(1-\omega\). When the variable is a positive value, then it is only from the underlying Poisson distribution with probability \(1-\omega\).
But, how do you get to know the probability \(\omega\)? Well, we introduce a logistic regression for this purpose.
The probability parameter \(\omega (\omega_1, \omega_2,... \omega_n)\) satisfies:
\[ \rm{logit(ω_i)}=\log( \frac{ω_i}{1-ω_i})=β_0+β_1*x_i \]
where \(x_i\) is a covariate and \(\beta_0, \beta_1\) are regression coefficients.
And, the Poisson mean \(\lambda (\lambda_1, \lambda_2,... \lambda_n)\) satisfies:
\[ \log(λ_i)=α_0+α_1*x_i \]
where \(\alpha_0\) and \(\alpha_1\) are regression coefficients.
Note that the covariates used in the logistic part and in the Poisson part may or may not be the same, depending on the analysis.
We are going to follow the Maximum Likelihood Estimation(MLE) for parameter estimation.
The likelihood function for ZIP is defined as:
\[ L_1 = \prod_{\left\{i;\, y_i=0 \right\}}\left(\frac{\exp{(\beta_0+\beta_1 x_i)}}{1 + \exp{(\beta_0+\beta_1 x_i)}}+\frac{\exp(-e^{\alpha_0+\alpha_1 x_i})}{1 + \exp{(\beta_0+\beta_1 x_i)}}\right)\\ L_2 = \prod_{\left\{i;\, y_i>0 \right\}}\left(\frac{1}{1 + \exp{(\beta_0+\beta_1 x_i)}}\right)\left(\frac{\exp(-e^{\alpha_0+\alpha_1 x_i})\exp(y_i(\alpha_0+\alpha_1 x_i))}{y_i!}\right) \]
where \(L_1\) is the likelihood for the variables \(y_i\) which are zero, and \(L_2\) is the likelihood for the variables \(y_i\) which are positive integers.
Then, taking the log yields the following log-likelihood:
\[ \ell(\alpha,\beta;y) = \sum_{y_i=0}\log\left(\exp{(\beta_0+\beta_1 x_i)}+\exp{(-e^{\alpha_0+\alpha_1 x_i})}\right)+\sum_{y_i>0}\left(y_i(\alpha_0+\alpha_1 x_i)-\exp(\alpha_0+\alpha_1 x_i)\right)-\sum_{i=1}^{n}\log(1+\exp(\beta_0+\beta_1 x_i))-\sum_{y_i>0}\log(y_i!) \]
Now let’s see how ZIP performs.
First of all, data are generated from a Poisson distribution with mean \(\exp(0.5x + 2)\), where \(x\) is the covariate.
set.seed(1)
N <- 1000
X <- runif(N, 0, 3)
lambda <- exp(0.5*X + 2)
y <- rpois(N, lambda=lambda)
plot(X,y, main="scatter plot for Normal poisson distribution")
hist(y)
Let’s estimate parameters both by normal Poisson and by ZIP.
#negative log likelihood for Poisson
NLL.P <- function(params){
lambda = exp(params[1]*X + params[2])
NLL = -sum(dpois(y, lambda, log=T))
return(NLL)
}
#negative log likelihood for ZIP
NLL.ZIP <- function(params){
lambda = exp(params[1]*X + params[2])
omega <- 1/(1 + exp(-(params[3]*X + params[4])))
NLL = -sum(dzipois(y, lambda=lambda, pstr0 = omega, log = T))
return(NLL)
}
init_params = c(0, 0, 1, 3)
res.P = optim(init_params, NLL.P, method="BFGS")
res.ZIP = optim(init_params, NLL.ZIP, method="BFGS")
#compare the result
true_values <- c(0.5, 2, 1, -2)
pred_by_poi <- c(matrix(res.P$par)[1:2], NA, NA)
pred_by_ZIP <- c(matrix(res.ZIP$par))
#the upper two for poisson mean predicts, the lower for logistic predicts
df_result <- data.frame(true_values, pred_by_poi, pred_by_ZIP)
df_result
## true_values pred_by_poi pred_by_ZIP
## 1 0.5 0.4933441 0.49336548
## 2 2.0 2.0038842 2.00386552
## 3 1.0 NA 0.03157685
## 4 -2.0 NA -9.77710742
#AIC for both models, Poisson and ZIP (AIC = 2*NLL + 2*number of parameters)
print(res.P$value*2 + 2*2)
## [1] 5627.762
print(res.ZIP$value*2 + 2*4)
## [1] 5631.881
Then, let's replace some of the data with zeros using the logistic function.
Notice that many zeros now appear.
omega <- 1/(1 + exp(-(X - 2)))
Z <- rbinom(N, 1, omega)
y[as.logical(Z)] <- 0
plot(X,y, main="scatter plot for ZIP")
hist(y)
In this case, the percentage of zeros in the data is as shown below:
head(y, n=100)
## [1] 11 0 12 29 11 36 22 0 16 2 12 0 0 0 24 0 17 39 10 23 0 8 0 13 14
## [26] 15 8 0 0 12 13 17 0 8 23 0 0 13 20 15 0 9 0 0 11 0 7 0 0 0
## [51] 0 0 0 0 12 7 11 0 0 0 25 10 0 14 0 18 23 0 7 0 18 0 0 11 16
## [76] 0 0 11 0 0 20 24 0 17 0 11 0 6 12 7 9 14 0 0 0 22 0 12 0 0
sum(y==0)/1000*100
## [1] 40.9
Let’s again estimate parameters both by normal Poisson and by ZIP.
#negative log likelihood for Poisson
NLL.P <- function(params){
lambda = exp(params[1]*X + params[2])
NLL = -sum(dpois(y, lambda, log=T))
return(NLL)
}
#negative log likelihood for ZIP
NLL.ZIP <- function(params){
lambda = exp(params[1]*X + params[2])
omega <- 1/(1 + exp(-(params[3]*X + params[4])))
NLL = -sum(dzipois(y, lambda=lambda, pstr0 = omega, log = T))
return(NLL)
}
init_params = c(0, 0, 1, 3)
res.P = optim(init_params, NLL.P, method="BFGS")
res.ZIP = optim(init_params, NLL.ZIP, method="BFGS")
#compare the result
true_values <- c(0.5, 2, 1, -2)
pred_by_poi <- c(matrix(res.P$par)[1:2], NA, NA)
pred_by_ZIP <- c(matrix(res.ZIP$par))
#the upper two for poisson mean predicts, the lower for logistic predicts
df_result <- data.frame(true_values, pred_by_poi, pred_by_ZIP)
df_result
## true_values pred_by_poi pred_by_ZIP
## 1 0.5 0.1054291 0.5048724
## 2 2.0 2.0027277 1.9843436
## 3 1.0 NA 0.9879240
## 4 -2.0 NA -1.9115003
#AIC for both models, Poisson and ZIP
print(res.P$value*2+2*2)
## [1] 13901.81
print(res.ZIP$value*2+2*4)
## [1] 4441.978
So far, we have discussed zero-inflated data in detail, but there also exists data where samples with zero values are underrepresented (zero-deflated data). In particular, data where the value zero is never observed is called zero-truncated data. Here, we will discuss probability distributions related to such data.
The zero-truncated Poisson (ZTP) is a discrete probability distribution whose support is the positive integers only. It is applied in cases where zero counts cannot be observed, such as the size of a school of fish. Below, I will explain using the zero-truncated Poisson distribution as an example.
The distribution reallocates the probability mass of a Poisson-distributed random variable \(Y\) being \(Y=0\) to the probability mass of \(Y>0\).
Below is the difference between a Poisson distribution and a ZTP with \(\lambda=3\).
The probability mass function is defined as follows,
\[ P(Y=k \mid Y>0)=\frac{Poisson(k;\lambda)}{1-Poisson(0;\lambda)}=\frac{\lambda^k e^{-\lambda}}{k!(1-e^{-\lambda})} \]
Derivation of the mean and variance.
Since we are performing non-trivial operations such as redistributing the probability mass at zero, we need to derive them from the definition.
Mean:
\[ \begin{array}{lcl} E[Y] &=& \sum_{k=1}^\infty k\frac{\lambda^k e^{-\lambda}}{k!(1-e^{-\lambda})} \\ &=& \frac{\lambda}{(1-e^{-\lambda})}\sum_{\textbf{k=1}}^\infty \frac{\lambda^{k-1} e^{-\lambda}}{(k-1)!} \\ \text{let }k'&=&k-1\\ E[Y] &=& \frac{\lambda}{(1-e^{-\lambda})}\sum_{\textbf{k'=0}}^\infty \frac{\lambda^{k'} e^{-\lambda}}{k'!} \\ &=& \frac{\lambda}{(1-e^{-\lambda})}\\ \end{array} \]
Variance:
\[ \begin{array}{lcl} V[Y] &=& E[Y^2]-(E[Y])^2 \\ &=& \sum_{k=1}^\infty {k^2\frac{\lambda^k e^{-\lambda}}{k!(1-e^{-\lambda})}} - \frac{\lambda^2}{(1-e^{-\lambda})^2}\\ &=& \sum_{k=1}^\infty {(k(k-1)+k)\frac{\lambda^k e^{-\lambda}}{k!(1-e^{-\lambda})}} - \frac{\lambda^2}{(1-e^{-\lambda})^2}\\ &=& \sum_{k=1}^\infty {k(k-1)\frac{\lambda^k e^{-\lambda}}{k!(1-e^{-\lambda})}}+ \sum_{k=1}^\infty {k\frac{\lambda^k e^{-\lambda}}{k!(1-e^{-\lambda})}} - \frac{\lambda^2}{(1-e^{-\lambda})^2}\\ &=& \frac{\lambda^2}{(1-e^{-\lambda})}+\frac{\lambda}{(1-e^{-\lambda})} - \frac{\lambda^2}{(1-e^{-\lambda})^2}\\ &=& \frac{\lambda^2+\lambda}{(1-e^{-\lambda})}- \frac{\lambda^2}{(1-e^{-\lambda})^2}\\ \end{array} \]
Let's analyze data (\(n=1000\)) on discount rates and the number of items purchased by each customer.
We consider the following data.
## num_items
## discount_rate 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 5 96 57 21 5 0 0 0 0 0 0 0 0 0 0 0
## 10 54 63 22 9 0 2 0 0 0 0 0 0 0 0 0
## 15 33 52 41 21 6 3 2 0 0 0 0 0 0 0 0
## 20 12 29 45 30 26 15 10 4 1 0 0 0 0 0 0
## 25 2 13 21 35 34 31 18 16 5 4 0 1 0 0 0
## 30 0 0 3 5 17 21 21 22 20 19 14 6 7 1 5
What can be seen at a quick review of the data is that no zero counts appear and that the number of items tends to increase with the discount rate.
The analysis is performed on this data using the usual Poisson distribution and ZTP under the following assumptions.
Assumptions:
\[ \begin{array}{lcl} \lambda_i &=& \text{exp}(\beta_0+\beta_1 D_i)\\ Model1 &:& N_i\sim Poisson(\lambda_i)\\ Model2 &:& N_i\sim ZTP(\lambda_i)\\ \end{array} \]
where \(D_i\) and \(N_i\) are the discount rate and number of items for the \(i\)-th sample.
The analysis proceeds as follows.
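The estimation code is not shown in the original; the following is a minimal sketch consistent with the objects res.P and res.ZTP used below, assuming vectors discount_rate and num_items as generated later in this section:

#negative log likelihood for Poisson (Model 1)
NLL.P <- function(params){
  lambda <- exp(params[1] + params[2] * discount_rate)
  -sum(dpois(num_items, lambda, log = TRUE))
}
#negative log likelihood for ZTP (Model 2): divide each Poisson mass by 1 - P(0)
NLL.ZTP <- function(params){
  lambda <- exp(params[1] + params[2] * discount_rate)
  -sum(dpois(num_items, lambda, log = TRUE) - log(1 - exp(-lambda)))
}
res.P <- optim(c(0, 0), NLL.P, method = "BFGS")
res.ZTP <- optim(c(0, 0), NLL.ZTP, method = "BFGS")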
print(res.P$value*2+4) #AIC for Model1
## [1] 3688.319
print(res.ZTP$value*2+4) #AIC for Model2
## [1] 3473.479
A discrete probability distribution whose support is the integers greater than or equal to a specific baseline.
It is a shifted version of the standard Poisson distribution by a shift amount \(s, s\in \mathbb{Z}\).
Below is the difference between a Poisson distribution, the shifted Poisson distribution, and the ZTP with \(\lambda=3,\ s=1\).
As shown below, similar to ZTP, it excludes values of zero from the standard Poisson distribution, but its shape is different from ZTP.
The probability mass function is defined as follows,
\[ P(Y=k) = Poisson(k-s;\lambda) = \frac{\lambda^{k-s} e^{-\lambda}}{(k-s)!}, \quad k = s, s+1, \ldots \]
Derivation of the mean and variance.
Let \(Y\sim Shifted\text{-}Poisson(\lambda, s)\), \(Z\sim Poisson(\lambda)\), then \(Y=Z+s\).
Therefore, it is clear that \(E[Y] = \lambda + s\) and \(V[Y] = \lambda\).
It can also be derived directly from the definition as follows.
Mean:
\[ \begin{array}{lcl} E[Y] &=& \sum_{k=s}^\infty k\frac{\lambda^{(k-s)} e^{-\lambda}}{(k-s)!} \\ \text{let } k'&=&k-s\\ E[Y] &=& \sum_{k'=0}^\infty (k'+s)\frac{\lambda^{k'} e^{-\lambda}}{k'!} \\ &=& \sum_{k'=0}^\infty k'\frac{\lambda^{k'} e^{-\lambda}}{k'!} + s\sum_{k'=0}^\infty \frac{\lambda^{k'} e^{-\lambda}}{k'!}\\ &=& \lambda + s \end{array} \]
Variance:
\[ \begin{array}{lcl} V[Y] &=& E[Y^2]-(E[Y])^2 \\ &=& \sum_{k=s}^\infty {k^2\frac{\lambda^{(k-s)} e^{-\lambda}}{(k-s)!}} - (\lambda+s)^2\\ \text{let } k'&=&k-s\\ V[Y] &=& \sum_{k'=0}^{\infty}{\left(k'(k'-1)+k'+2k's+s^2\right)\frac{\lambda^{k'} e^{-\lambda}}{k'!}} - (\lambda+s)^2\\ &&\sum_{k'=0}^{\infty}{\left(k'(k'-1)+k'\right)\frac{\lambda^{k'} e^{-\lambda}}{k'!}}=\lambda^2+\lambda\\ &&2s\sum_{k'=0}^{\infty}{k'\frac{\lambda^{k'} e^{-\lambda}}{k'!}}=2s\lambda\\ &&s^2\sum_{k'=0}^{\infty}{\frac{\lambda^{k'} e^{-\lambda}}{k'!}}=s^2\\ V[Y]&=&\lambda^2+\lambda+2s\lambda+s^2-(\lambda+s)^2 = \lambda \end{array} \]
Consider applying Shifted-Poisson to the data used during the ZTP example. The data were as follows.
## num_items
## discount_rate 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 5 96 57 21 5 0 0 0 0 0 0 0 0 0 0 0
## 10 54 63 22 9 0 2 0 0 0 0 0 0 0 0 0
## 15 33 52 41 21 6 3 2 0 0 0 0 0 0 0 0
## 20 12 29 45 30 26 15 10 4 1 0 0 0 0 0 0
## 25 2 13 21 35 34 31 18 16 5 4 0 1 0 0 0
## 30 0 0 3 5 17 21 21 22 20 19 14 6 7 1 5
Here the minimum number of items is 1, and we can naturally assume \(s=1\). So consider the following model (Model 3).
The likelihood calculation and parameter estimation for this model amount to fitting a Poisson model to the shifted counts \(N_i - 1\).
The analysis below takes this into account.
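Again the fitting code is omitted in the original; a sketch consistent with the object res.SP used below:

#negative log likelihood for Shifted-Poisson (Model 3): fit num_items - 1 as Poisson
NLL.SP <- function(params){
  lambda <- exp(params[1] + params[2] * discount_rate)
  -sum(dpois(num_items - 1, lambda, log = TRUE))
}
res.SP <- optim(c(0, 0), NLL.SP, method = "BFGS")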
print(res.P$value*2+4) #AIC for Model1(Poisson)
## [1] 3688.319
print(res.ZTP$value*2+4) #AIC for Model2(ZTP)
## [1] 3473.479
print(res.SP$value*2+4) #AIC for Model3(Shifted-Poisson)
## [1] 3458.384
Model 3 has the lowest AIC. In fact, the data were generated as shown in the following code.
set.seed(19990725)
n = 1000
discount_rate = as.integer(runif(n)*30)%/%5 * 5 + 5
lambda = exp(discount_rate/10 - 1)
num_items = rpois(n, lambda = lambda) + 1 #Value shift!
This code generated random numbers based on the Poisson distribution and then added 1 to all values.
This means that the Shifted-Poisson assumption is consistent with the true data-generating model, and the model selected by AIC is indeed the Shifted-Poisson.
In the context of habitat analysis, the “zero-inflated distribution” tries to understand the meaning of “0” as “the site is a Not Good habitat and fish are absent” + “the site is a Good habitat but fish are not found”. On the other hand, in the hurdle model, the zero probability is simply expressed as a single additional parameter, and the positive probability is expressed using a standard distribution.
For example, the zero-altered Poisson (ZAP) is as follows: \[ \begin{array}{lcl} Pr(Y=0) &=& \theta \\ Pr(Y=y) &=& (1-\theta) \cdot e^{-\lambda} \frac{\lambda^y}{y!} \cdot \frac{1}{1-e^{-\lambda}} \quad (y=1,2....) \end{array} \] On the contrary, the ZIP is \[ \begin{array}{lcl} Pr(Y=0) &=& \theta + (1-\theta) \cdot e^{-\lambda}\\ Pr(Y=y) &=& (1-\theta) \cdot e^{-\lambda} \frac{\lambda^y}{y!} \quad (y=1,2,...) \end{array} \]
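As a small numerical check that the ZAP probabilities sum to one; \(\theta = 0.3\) and \(\lambda = 2\) are arbitrary values:

theta <- 0.3; lambda <- 2
p_zap <- c(theta, (1 - theta) * dpois(1:20, lambda) / (1 - exp(-lambda)))
sum(p_zap) # approximately 1 (truncating the support at 20)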
This type of distribution is sometimes used to extend basic distributions by accounting for heterogeneity in the parameters, assuming the randomness of an underlying parameter.
Let \(f(y|\theta, \phi)\) denote a basic probability distribution for a random variable \(Y\), parameterized by two different parameters, \(\theta\) and \(\phi\). Suppose that \(\theta\) can vary by sampling, treating \(\theta\) as a latent random variable.
To clarify the model, we shall assume the following models.
\[ \begin{array}{lcl} Y_i|\theta_i &\sim& f(\cdot|\theta_i, \phi) \quad (i=1,2,\dots,n) \\ \theta_i &\sim& f(\cdot|\lambda) \quad (i=1,2,\dots,n) \\ \end{array} \] Then the marginal distribution of \(Y_i\) can be expressed as follows: \[ f(y_i|\phi, \lambda) = \int f(y_i|\theta_i, \phi) \cdot f(\theta_i|\lambda) d\theta_i \]
To simplify the expression, we define the mixture distribution as follows:
\[ f(y|\phi, \lambda) = \int f(y|\theta, \phi) \cdot f(\theta|\lambda) d\theta \]
For example, we can examine:
- the mean and variance of \(Y\)
- the reproductive property, i.e. whether the sum of independent variables keeps the same distributional assumption (e.g. \(Y_i \sim Po(\lambda_i)\) then \(\sum_{i=1}^n Y_i \sim Po(\sum_{i=1}^n \lambda_i)\))

and so on.
\[ \begin{array}{lcl} Y|\lambda &\sim& Poisson (\lambda) \\ \lambda &\sim& Gamma(\alpha,\beta) \\ \Rightarrow && \\ f(y) &=& \displaystyle \frac{\Gamma(\alpha+y)}{\Gamma(\alpha) y!} \left( \frac{1}{1+\beta} \right)^\alpha \left( \frac{\beta}{1+\beta} \right)^y \end{array} \]
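This marginal is the negative binomial distribution. A quick Monte Carlo sketch checks the formula (here \(\beta\) is the gamma scale parameter, matching the PMF above):

# Poisson-gamma mixture vs. the negative binomial PMF
set.seed(1)
alpha <- 2; beta <- 3
lambda <- rgamma(1e5, shape = alpha, scale = beta)
y <- rpois(1e5, lambda)
rbind(empirical = as.numeric(table(factor(y, levels = 0:5)))/1e5,
      negbin    = dnbinom(0:5, size = alpha, prob = 1/(1 + beta)))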
\[ \begin{array}{lcll} Y|I=i &\sim& N(\mu_i, \sigma_i^2) &(i=1,2,\dots,K) \\ Pr(I=i) &=& \omega_i & (i=1,2,\dots,K) \\ \Rightarrow && \\ f(y) &=& \sum_{i=1}^K \omega_i f(y|\mu_i, \sigma_i^2) \end{array} \]
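For instance, the density of a two-component normal mixture can be plotted directly (a small sketch with arbitrary parameter values):

# Density of a two-component normal mixture (K = 2)
omega <- c(0.4, 0.6); mu <- c(-2, 3); sigma <- c(1, 1.5)
dmix <- function(y)
  omega[1]*dnorm(y, mu[1], sigma[1]) + omega[2]*dnorm(y, mu[2], sigma[2])
curve(dmix(x), from = -6, to = 8, ylab = "density", main = "Normal mixture (K = 2)")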
Assume that \(N\) is a random variable with a Poisson distribution \[ N \sim \operatorname{Poisson}(\lambda) \] and that \(X_1, X_2, \dots, X_N\) are independent and identically distributed random variables. Then the probability distribution of \[ Y = \sum_{i=1}^{N} X_i \] is called a compound Poisson distribution. It’s also called a stopped-sum distribution.
Since \(Pr(N=0)=e^{-\lambda}\), the compound Poisson distribution puts positive probability mass at zero. If the \(X_i\) are strictly positive (e.g. gamma-distributed), then \(Pr(Y=0)=e^{-\lambda}\) exactly; if the \(X_i\) can themselves be zero, \(Pr(Y=0)\) is larger than \(e^{-\lambda}\).
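A short simulation illustrates this point mass when the \(X_i\) are gamma-distributed (the helper name rCompPois is mine):

# Compound Poisson-gamma: check Pr(Y = 0) = exp(-lambda)
rCompPois <- function(n, lambda, shape, rate) {
  N <- rpois(n, lambda)
  sapply(N, function(k) if (k == 0) 0 else sum(rgamma(k, shape = shape, rate = rate)))
}
set.seed(1)
y <- rCompPois(1e4, lambda = 1, shape = 2, rate = 1)
c(mean(y == 0), exp(-1)) # both should be close to 0.368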
The details are omitted, but when the properties of a probability distribution are not easy to derive directly, we can work with:

- the “probability generating function (pgf)”, particularly useful for discrete distributions: \[ G(z) = E[z^Y], \quad Pr(Y=n) = \frac{G^{(n)}(0)}{n!}, \quad E[Y(Y-1)\cdots(Y-n+1)] = G^{(n)}(1); \]
- the “moment generating function (mgf)”, a Laplace transformation: \[ M(t) = E[e^{tY}], \quad E[Y^n] = M^{(n)}(0); \]
- the “cumulant generating function (cgf)”, the log of the mgf: \[ K(t) = \log M(t), \quad K'(0)=M'(0)=E[Y], \quad K''(0)=V[Y]; \]
- the “characteristic function (cf)”, a Fourier transformation: \[ \phi(t) = E[e^{itY}], \quad E[Y^n] = \phi^{(n)}(0)/i^n. \]
Each of these functions is in one-to-one correspondence with the distribution, so the distribution of a compound sum can be identified through them. For example, \[ G_Y(z) = E[z^Y] = E[z^{\sum_{i=1}^{N} X_i}] = E[E[\prod_{i=1}^N z^{X_i} | N]] = E[G_X(z)^N] = e^{\lambda (G_X(z)-1)} \]
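This relation can be checked by Monte Carlo; a sketch with \(X_i \sim Bernoulli(p)\), so that \(G_X(z) = 1 - p + pz\):

# Check G_Y(z) = exp(lambda*(G_X(z) - 1)) for a compound Poisson-Bernoulli
set.seed(1)
lambda <- 2; p <- 0.3; z <- 0.7
N <- rpois(1e5, lambda)
Y <- rbinom(1e5, size = N, prob = p) # sum of N Bernoulli(p) variables
c(empirical = mean(z^Y),
  theory    = exp(lambda*((1 - p + p*z) - 1)))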
The Tweedie is a continuous distribution with a positive probability mass at zero (for \(1<p<2\)). Because of this special characteristic, it is often used for catch, CPUE, biomass, etc. The derivation is a bit complicated, but for fisheries applications it is enough to know that the Tweedie arises as a special compound Poisson distribution with gamma-distributed summands. The Tweedie is also characterized by a special mean-variance relationship, \[ V[Y] = \phi V(\mu) = \phi \mu^p \] Using the notation in the above relation, the Tweedie distribution is denoted as \[ Y \sim TW_p(\mu, \phi). \]
If \(p=0\), \(V[Y]=\phi\) and \(Y \sim TW_0(\mu,\phi) = N(\mu, \phi)\)
If \(p=1\), \(V[Y]=\phi \mu\) and \(Y/\phi \sim Po(\mu / \phi)\)
If \(p=2\), \(V[Y]=\phi \mu^2\) and \(Y \sim TW_2(\mu,\phi) = Ga(\mu, \phi)\)
…
The Tweedie distribution is normally characterized by the cumulant function. Before defining the Tweedie, let’s define a natural exponential family as \[ f(y; \theta) = \exp \{y \theta - \kappa(\theta) \} h(y) \] In this model, \[ \int f(y; \theta) dy = \int \exp \{y \theta - \kappa(\theta) \} h(y) dy = 1 \] \[ M(t;\theta) = E[e^{tY}] = \exp\{ \kappa(\theta + t) - \kappa(\theta) \} \\ K(t;\theta) = \log M(t,\theta) = \kappa(\theta + t) - \kappa(\theta) \]
The (reproductive) exponential dispersion model is defined as a scaled transformed natural exponential family as \[ f(y; \theta, \phi) = \exp \left\{\frac{y \theta - \kappa(\theta)}{\phi} \right\} g(\phi) h(y) \]
\[ M(t;\theta, \phi) = E[e^{tY}] = \exp \left\{ \frac{\kappa(\theta + \phi t) - \kappa(\theta)}{\phi} \right\} \\ K(t;\theta, \phi) = \log M(t;\theta, \phi) = \frac{\kappa(\theta + \phi t) - \kappa(\theta)}{\phi} \]
Here, \[ K'(0) = \kappa'(\theta) = \tau(\theta) = \mu \\ K''(0) = \phi \kappa''(\theta) = \phi \tau'(\theta) = \phi V(\mu) \] and we incorporate Taylor’s power law \(V(\mu)=\mu^p\) into the variance function to define the cumulant function \(\kappa(\theta)\) for the Tweedie. So the equation to be solved is the following differential equation: \[ \tau'(\theta) = \mu^p \Leftrightarrow \tau'(\theta) = \tau^p(\theta) \]
If \(p=0\), then \(\tau(\theta)=\theta\) and \(\kappa(\theta)=\theta^2/2\) (normal).
If \(p=1\), then \(\tau(\theta)=e^\theta\) and \(\kappa(\theta)=e^\theta\) (Poisson).
If \(p=2\), then \(\tau(\theta)= -1/\theta \ (\theta<0)\) and \(\kappa(\theta)=-\log (-\theta)\) (gamma).
If \(p \neq 0, 1, 2\), \[ \begin{array}{rcl} \tau'(\theta) &=& \tau^p(\theta) \\ \Rightarrow \tau(\theta) &=& \left\{ (1-p)\theta \right\}^\frac{1}{1-p} = \mu\\ \Rightarrow \kappa(\theta) &=& \frac{1}{2-p} \left\{ (1-p)\theta \right\}^\frac{2-p}{1-p} = \frac{1}{2-p} \mu^{2-p}\\ \end{array} \] Here \(\theta=\frac{\mu^{1-p}}{1-p}\), and therefore the shape of the Tweedie distribution is essentially \[ \begin{array}{rcl} f(y; \theta, \phi, p) &=& \exp \left\{ \frac{\theta y - \kappa(\theta)}{\phi} \right\} g(\phi) h(y) \\ &=& \exp \left\{ \frac{1}{\phi} \left[ y \frac{\mu^{1-p}}{1-p} - \frac{1}{2-p} \mu^{2-p} \right] \right\} g(\phi) h(y), \end{array} \] where the normalizing constant \(g(\phi) h(y)\) (which has no closed form in general) is left implicit.
The Tweedie can be generated through a compound Poisson–gamma construction; the details are omitted here.
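A minimal simulation sketch of that construction, assuming \(1 < p < 2\) and using the standard compound Poisson–gamma parameter mapping (the helper name rTweedieCP is mine):

# Tweedie(mu, phi, p) via its compound Poisson-gamma representation (1 < p < 2)
rTweedieCP <- function(n, mu, phi, p) {
  lambda <- mu^(2 - p) / (phi * (2 - p)) # Poisson rate
  shape  <- (2 - p) / (p - 1)            # gamma shape
  scl    <- phi * (p - 1) * mu^(p - 1)   # gamma scale
  N <- rpois(n, lambda)
  sapply(N, function(k) if (k == 0) 0 else sum(rgamma(k, shape = shape, scale = scl)))
}
set.seed(1)
y <- rTweedieCP(1e5, mu = 2, phi = 1, p = 1.5)
c(mean = mean(y), var = var(y), phi.mu.p = 1 * 2^1.5) # V[Y] = phi * mu^p
mean(y == 0) # positive mass at zero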
The following example uses the ParasiteCod data; data and book information: https://github.com/romunov/AED?tab=readme-ov-file and https://www.highstat.com/index.php/our-books?view=article&id=16&catid=18
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::collapse() masks nlme::collapse()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::select() masks MASS::select()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
ParasiteCod <- read.table("ParasiteCod.txt", header=T)
head(ParasiteCod, 10)
## Sample Intensity Prevalence Year Depth Weight Length Sex Stage Age Area
## 1 1 0 0 1999 220 148 26 0 0 0 2
## 2 2 0 0 1999 220 144 26 0 0 0 2
## 3 3 0 0 1999 220 146 27 0 0 0 2
## 4 4 0 0 1999 220 138 26 0 0 0 2
## 5 5 0 0 1999 220 40 17 0 0 0 2
## 6 6 0 0 1999 220 68 20 0 0 0 2
## 7 7 0 0 1999 220 52 19 0 0 0 2
## 8 8 0 0 1999 194 3848 77 0 0 0 3
## 9 9 0 0 1999 194 2576 67 0 0 0 3
## 10 10 0 0 1999 194 1972 60 0 0 0 3
tail(ParasiteCod, 10)
## Sample Intensity Prevalence Year Depth Weight Length Sex Stage Age Area
## 1245 1245 50 1 2001 228 484 38 2 1 3 4
## 1246 1246 55 1 2001 228 704 45 1 1 4 4
## 1247 1247 75 1 2001 140 884 46 2 1 3 4
## 1248 1248 84 1 2001 260 910 48 2 1 4 2
## 1249 1249 89 1 2001 260 1414 56 1 1 6 2
## 1250 1250 90 1 2001 228 224 31 1 1 2 4
## 1251 1251 104 1 2001 140 690 43 2 1 3 4
## 1252 1252 125 1 2001 140 754 44 2 1 3 4
## 1253 1253 128 1 2001 140 1270 55 2 4 7 4
## 1254 1254 257 1 2001 228 370 35 2 1 3 4
dim(ParasiteCod)
## [1] 1254 11
str(ParasiteCod)
## 'data.frame': 1254 obs. of 11 variables:
## $ Sample : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Intensity : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Prevalence: int 0 0 0 0 0 0 0 0 0 0 ...
## $ Year : int 1999 1999 1999 1999 1999 1999 1999 1999 1999 1999 ...
## $ Depth : int 220 220 220 220 220 220 220 194 194 194 ...
## $ Weight : int 148 144 146 138 40 68 52 3848 2576 1972 ...
## $ Length : int 26 26 27 26 17 20 19 77 67 60 ...
## $ Sex : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Stage : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Age : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Area : int 2 2 2 2 2 2 2 3 3 3 ...
ParasiteCod$fArea <- factor(ParasiteCod$Area)
ParasiteCod$fYear <- factor(ParasiteCod$Year)
I1 <- is.na(ParasiteCod$Intensity) | is.na(ParasiteCod$Length)
ParasiteCod2 <- ParasiteCod[!I1, ]
plot(table(ParasiteCod2$Intensity), ylab = "Frequencies", xlab = "Observed intensity values")
dim(ParasiteCod2)
## [1] 1191 13
# Number of 0 data
sum(ParasiteCod2$Intensity==0)
## [1] 651
y <- ParasiteCod2$Intensity
NLL.pois <- function(par) (-1.0)*sum(dpois(y, exp(par), log=T)) # Poisson NLL, lambda = exp(par)
Res.pois <- optim(0, NLL.pois, method="BFGS")
Res.pois
## $par
## [1] 1.826011
##
## $value
## [1] 13859.25
##
## $counts
## function gradient
## 32 6
##
## $convergence
## [1] 0
##
## $message
## NULL
lambda.pois <- exp(Res.pois$par)
lambda.pois
## [1] 6.209067
domain <- 0:max(y+1)
hist(y, freq=F, breaks=domain, main="Fitting to Pois")
points(domain, dpois(domain, lambda.pois), col="red")
Res.IID.pois <- glm(y ~ 1, family = "poisson", data=ParasiteCod2)
summary(Res.IID.pois)
##
## Call:
## glm(formula = y ~ 1, family = "poisson", data = ParasiteCod2)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.82601 0.01163 157 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 25797 on 1190 degrees of freedom
## Residual deviance: 25797 on 1190 degrees of freedom
## AIC: 27721
##
## Number of Fisher Scoring iterations: 7
y <- ParasiteCod2$Intensity
Res.glm.pois <- glm(y ~ 0 + fYear*fArea + Length, family = "poisson", data=ParasiteCod2)
summary(Res.glm.pois)
##
## Call:
## glm(formula = y ~ 0 + fYear * fArea + Length, family = "poisson",
## data = ParasiteCod2)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## fYear1999 2.9985075 0.0565707 53.005 < 2e-16 ***
## fYear2000 3.2079397 0.0789588 40.628 < 2e-16 ***
## fYear2001 0.6014964 0.1840927 3.267 0.00109 **
## fArea2 -0.5189996 0.0711025 -7.299 2.89e-13 ***
## fArea3 -1.0343800 0.0722101 -14.325 < 2e-16 ***
## fArea4 1.0659077 0.0453673 23.495 < 2e-16 ***
## Length -0.0287971 0.0008796 -32.740 < 2e-16 ***
## fYear2000:fArea2 -0.3399908 0.1329193 -2.558 0.01053 *
## fYear2001:fArea2 2.6265442 0.2010904 13.062 < 2e-16 ***
## fYear2000:fArea3 1.5536092 0.1029887 15.085 < 2e-16 ***
## fYear2001:fArea3 2.5895857 0.2018590 12.829 < 2e-16 ***
## fYear2000:fArea4 0.3906449 0.0815499 4.790 1.67e-06 ***
## fYear2001:fArea4 2.2480938 0.1870790 12.017 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 40395 on 1191 degrees of freedom
## Residual deviance: 18430 on 1178 degrees of freedom
## AIC: 20378
##
## Number of Fisher Scoring iterations: 7
AIC(Res.glm.pois)
## [1] 20377.86
y <- ParasiteCod2$Intensity
logit <- function(p) log(p/(1-p))
ilogit <- function(x) 1.0/(1.0+exp(-x))
NLL.nb <- function(par) (-1.0)*sum(dnbinom(y, prob=ilogit(par[1]), size=exp(par[2]), log=T)) # NB NLL: prob on logit scale, size on log scale
Res.nb <- optim(c(0,0), NLL.nb, method="BFGS")
## Warning in dnbinom(y, prob = ilogit(par[1]), size = exp(par[2]), log = T): NaNs
## produced
Res.nb
## $par
## [1] -3.657231 -1.831582
##
## $value
## [1] 2608.355
##
## $counts
## function gradient
## 31 11
##
## $convergence
## [1] 0
##
## $message
## NULL
prob.nb <- ilogit(Res.nb$par[1])
size.nb <- exp(Res.nb$par[2])
AIC.nb <- 2*Res.nb$value + 2*2
AIC.nb
## [1] 5220.709
domain <- 0:max(y+1)
hist(y, freq=F, breaks=domain, main="Fitting to NB")
points(domain, dnbinom(domain, prob=prob.nb, size=size.nb), col="red")
library(MASS)
Res.IID.nb <- glm.nb(y ~ 1, data=ParasiteCod2)
summary(Res.IID.nb)
##
## Call:
## glm.nb(formula = y ~ 1, data = ParasiteCod2, init.theta = 0.1601896019,
## link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.82601 0.07333 24.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(0.1602) family taken to be 1)
##
## Null deviance: 999.77 on 1190 degrees of freedom
## Residual deviance: 999.77 on 1190 degrees of freedom
## AIC: 5220.7
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 0.16019
## Std. Err.: 0.00837
##
## 2 x log-likelihood: -5216.70900
y <- ParasiteCod2$Intensity
Res.glm.nb <- glm.nb(y ~ 0 + fYear*fArea + Length, data=ParasiteCod2)
summary(Res.glm.nb)
##
## Call:
## glm.nb(formula = y ~ 0 + fYear * fArea + Length, data = ParasiteCod2,
## init.theta = 0.2153077102, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## fYear1999 2.864463 0.305823 9.366 < 2e-16 ***
## fYear2000 2.982258 0.408627 7.298 2.92e-13 ***
## fYear2001 0.423865 0.396433 1.069 0.28498
## fArea2 -0.442768 0.295274 -1.500 0.13374
## fArea3 -1.153175 0.255511 -4.513 6.39e-06 ***
## fArea4 0.921721 0.282246 3.266 0.00109 **
## Length -0.023835 0.004729 -5.040 4.66e-07 ***
## fYear2000:fArea2 -0.387188 0.528897 -0.732 0.46413
## fYear2001:fArea2 2.506942 0.485119 5.168 2.37e-07 ***
## fYear2000:fArea3 1.492543 0.466798 3.197 0.00139 **
## fYear2001:fArea3 2.614731 0.445608 5.868 4.42e-09 ***
## fYear2000:fArea4 0.503002 0.511290 0.984 0.32522
## fYear2001:fArea4 2.263310 0.468251 4.834 1.34e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(0.2153) family taken to be 1)
##
## Null deviance: 2786.8 on 1191 degrees of freedom
## Residual deviance: 1013.1 on 1178 degrees of freedom
## AIC: 5030.7
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 0.2153
## Std. Err.: 0.0120
##
## 2 x log-likelihood: -5002.6730
AIC(Res.glm.nb)
## [1] 5030.673
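Given that more than half of the observations are zeros, a natural next step is to compare zero-inflated and hurdle (zero-altered) negative binomial models, as discussed above. A sketch using the pscl package (assuming it is installed; the same covariates are used in both model parts):

# Zero-inflated and hurdle negative binomial fits
library(pscl)
Res.zinb <- zeroinfl(Intensity ~ fYear * fArea + Length | fYear * fArea + Length,
                     dist = "negbin", data = ParasiteCod2)
Res.hnb  <- hurdle(Intensity ~ fYear * fArea + Length | fYear * fArea + Length,
                   dist = "negbin", data = ParasiteCod2)
AIC(Res.zinb, Res.hnb, Res.glm.nb)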
Questions:
Findings: