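Name | Equation |
---|---|
Identity | $f(x) = x$ |
Binary step | $f(x) = \begin{cases} 0 & \text{for } x < 0 \\ 1 & \text{for } x \ge 0 \end{cases}$ |
Logistic | $f(x) = \frac{1}{1 + e^{-x}}$ |
TanH | $f(x) = \tanh(x) = \frac{2}{1 + e^{-2x}} - 1$ |
Rectified linear unit (ReLU)^{[9]} | $f(x) = \begin{cases} 0 & \text{for } x < 0 \\ x & \text{for } x \ge 0 \end{cases}$ |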

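Name | Derivative (with respect to $x$) |
---|---|
Identity | $f'(x) = 1$ |
Binary step | $f'(x) = 0$ for $x \ne 0$; undefined at $x = 0$ |
Logistic | $f'(x) = f(x)\,(1 - f(x))$ |
TanH | $f'(x) = 1 - f(x)^2$ |
Rectified linear unit (ReLU)^{[9]} | $f'(x) = \begin{cases} 0 & \text{for } x < 0 \\ 1 & \text{for } x > 0 \end{cases}$; undefined at $x = 0$ |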
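As a minimal sketch, the five activation functions named in the tables above and their derivatives can be written in plain Python using their standard textbook definitions. The function names and the helper `numeric_derivative` are illustrative and not from the source. The values returned at the non-differentiable point $x = 0$ (`nan` for the binary step, `0.0` for ReLU) are conventions assumed here.

```python
import math

# Activation functions (standard definitions).
def identity(x):
    return x

def binary_step(x):
    return 1.0 if x >= 0 else 0.0

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def relu(x):
    return max(0.0, x)

# Derivatives with respect to x.
def d_identity(x):
    return 1.0

def d_binary_step(x):
    # Zero everywhere except at the jump, where the derivative is undefined.
    return float("nan") if x == 0 else 0.0

def d_logistic(x):
    s = logistic(x)
    return s * (1.0 - s)

def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2

def d_relu(x):
    # Assumption: use 0.0 at x == 0, where the derivative is undefined.
    return 1.0 if x > 0 else 0.0

def numeric_derivative(f, x, h=1e-6):
    """Central finite difference, used only to sanity-check the formulas."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

PAIRS = [
    (identity, d_identity),
    (binary_step, d_binary_step),
    (logistic, d_logistic),
    (tanh, d_tanh),
    (relu, d_relu),
]

if __name__ == "__main__":
    for f, df in PAIRS:
        for x in (-1.3, 0.7):
            print(f.__name__, x, df(x), numeric_derivative(f, x))
```

Away from $x = 0$, the analytic and finite-difference values should agree to within rounding error, which is a quick way to check a derivative formula before using it in training code.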

- Advantages:
  - Training is very fast
  - Easy to learn complex functions over few variables
  - Can give back confidence intervals in addition to the prediction
  - *Often wins* if you have enough data
- Disadvantages:
  - Slow at query time
  - Query answering complexity depends on the number of instances
  - *Easily fooled by irrelevant attributes* (for most distance metrics)
  - "Inference" is not possible
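The list above does not name the method. Assuming it describes an instance-based learner such as 1-nearest-neighbour with Euclidean distance, the following sketch illustrates the "fooled by irrelevant attributes" point. The data and the function `nearest_label` are hypothetical and chosen only for illustration.

```python
import math

def nearest_label(train, query):
    """Return the label of the training instance closest to `query`.

    `train` is a list of (feature_tuple, label) pairs. Every query scans
    all stored instances, so the cost grows with the number of instances.
    """
    features, label = min(train, key=lambda inst: math.dist(inst[0], query))
    return label

# Hypothetical data: the label depends only on the first attribute.
train_relevant = [((0.8,), 1), ((0.1,), 0)]

# Same instances with an extra, irrelevant attribute on a larger scale.
train_with_noise = [((0.8, 100.0), 1), ((0.1, 1.0), 0)]

if __name__ == "__main__":
    print(nearest_label(train_relevant, (0.9,)))        # neighbour is 0.8
    print(nearest_label(train_with_noise, (0.9, 0.0)))  # neighbour changes
```

With only the relevant attribute, the query at 0.9 is matched to the instance at 0.8. Once the irrelevant attribute is included, it dominates the distance and the other instance becomes the nearest one.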

$x$ = HasKids | $y$ = OwnsDoraVideo |
---|---|
Yes | Yes |
Yes | Yes |
Yes | Yes |
Yes | Yes |
No | No |
No | No |
Yes | No |
Yes | No |

- From the table, we can estimate $P(Y=\mathrm{Yes}) = 0.5 = P(Y=\mathrm{No})$.
- Thus, we estimate $H(Y) = 0.5 \log_2 \frac{1}{0.5} + 0.5 \log_2 \frac{1}{0.5} = 1$ bit.
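The entropy estimate above can be reproduced with a short sketch. It assumes base-2 logarithms (entropy in bits) and uses the eight OwnsDoraVideo values from the table. The helper name `entropy` is illustrative.

```python
import math
from collections import Counter

def entropy(values):
    """Empirical entropy in bits: sum over outcomes of p * log2(1 / p)."""
    n = len(values)
    counts = Counter(values)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

# The y column (OwnsDoraVideo) of the table, row by row.
owns_dora_video = ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"]

if __name__ == "__main__":
    print(entropy(owns_dora_video))  # 1.0
```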

*Specific conditional entropy* is the uncertainty in $Y$ given a particular $x$ value. E.g., for the HasKids / OwnsDoraVideo table, where 6 of the 8 rows have $X=\mathrm{Yes}$ and 4 of those have $Y=\mathrm{Yes}$:

- $P(Y=\mathrm{Yes}|X=\mathrm{Yes}) = \frac{2}{3}$, $P(Y=\mathrm{No}|X=\mathrm{Yes})=\frac{1}{3}$
- $H(Y|X=\mathrm{Yes}) = \frac{2}{3}\log_2 \frac{1}{2/3} + \frac{1}{3}\log_2 \frac{1}{1/3} \approx 0.9183$.

- *The conditional entropy, $H(Y|X)$*, is the average specific conditional entropy of $y$ given the values for $x$: $$H(Y|X)=\sum_x P(X=x)H(Y|X=x)$$
- $H(Y|X=\mathrm{Yes}) = \frac{2}{3}\log_2 \frac{1}{2/3} + \frac{1}{3}\log_2 \frac{1}{1/3} \approx 0.9183$
- $H(Y|X=\mathrm{No}) = 0 \log_2 \frac{1}{0} + 1 \log_2 \frac{1}{1} = 0$, using the convention $0 \log_2 \frac{1}{0} = 0$.
- $H(Y|X) = H(Y|X=\mathrm{Yes})P(X=\mathrm{Yes}) + H(Y|X=\mathrm{No})P(X=\mathrm{No}) = 0.9183 \cdot \frac{3}{4} + 0 \cdot \frac{1}{4} \approx 0.6887$
- Interpretation: the expected number of bits needed to transmit $y$ if both the emitter and the receiver know the possible values of $x$ (but before they are told $x$'s specific value).
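The specific and average conditional entropies above can be checked numerically. This is a sketch assuming base-2 logarithms and the eight (HasKids, OwnsDoraVideo) rows of the table. The function names are illustrative.

```python
import math
from collections import Counter

# (HasKids, OwnsDoraVideo) rows of the table.
DATA = [
    ("Yes", "Yes"), ("Yes", "Yes"), ("Yes", "Yes"), ("Yes", "Yes"),
    ("No", "No"), ("No", "No"),
    ("Yes", "No"), ("Yes", "No"),
]

def entropy(values):
    """Empirical entropy in bits. Outcomes that never occur are not
    counted, which matches the convention 0 * log2(1/0) = 0."""
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in Counter(values).values())

def specific_conditional_entropy(pairs, x_value):
    """H(Y | X = x_value): entropy of y restricted to rows with that x."""
    return entropy([y for x, y in pairs if x == x_value])

def conditional_entropy(pairs):
    """H(Y | X) = sum over x of P(X = x) * H(Y | X = x)."""
    n = len(pairs)
    x_counts = Counter(x for x, _ in pairs)
    return sum(
        (count / n) * specific_conditional_entropy(pairs, x_value)
        for x_value, count in x_counts.items()
    )

if __name__ == "__main__":
    print(round(specific_conditional_entropy(DATA, "Yes"), 4))  # 0.9183
    print(round(specific_conditional_entropy(DATA, "No"), 4))   # 0.0
    print(round(conditional_entropy(DATA), 4))                  # 0.6887
```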