# Learning What’s in a Name with Graphical Models

An overview of the structure and statistical assumptions behind linear-chain graphical models — effective and interpretable approaches to sequence labeling problems such as named entity recognition.

Which of these words refer to a named entity?

Starting prediction server…

## I. Probabilistic Graphical Models

### Factorizing Joint Distributions

$p(a, b) = p(a) \cdot p(b|a)$

Notation: the shorthand p(a) means p(A = a), that is, the probability of variable A taking value a.

$p(a, b, c, d) = p(a) \cdot p(b|a) \cdot p(c) \cdot p(d|a, b, c)$

### Directed Acyclic Graphs

Sampled Population

## II. Hidden Markov Models

### The Hidden Layer

$p(s_i | s_1, s_2,…, s_{i-1}) = p(s_i | s_{i-1}) \\ \footnotesize\textrm{for all i\,{\scriptscriptstyle \in}\,\{2,…, N\}}$
$p(s_i | s_{i-1}) = p(s_{i+1} | s_i) \\ \footnotesize\textrm{for all i\,{\scriptscriptstyle \in}\,\{2,…, N-1\}}$

### The Observed Layer

$p(o_i | s_1, s_2,…, s_N) = p(o_i | s_i) \\ \footnotesize\textrm{for all i\,{\scriptscriptstyle \in}\,\{1, 2,…, N\}}$

### Representing Named Entities

Representation: rather than labeling each node using the name of the variable it represents (X₁, Y₁) as we have until this point, we'll instead display the value of that variable (“O”, “Great”). This helps make the graphs easier to read.

### Results

Name tag predictions by HMM:

Starting prediction server…

Accuracy

90.1%

Precision

64.2%

Recall

55.8%

F₁ Score

59.7%

Precision

TagNo OOV1+ OOV
ORG0.80.21
PER0.850.62
LOC0.870.06
MISC0.780.12
ALL0.840.39

Recall

TagNo OOV1+ OOV
ORG0.640.33
PER0.580.59
LOC0.710.05
MISC0.540.06
ALL0.640.41

### Limitations

TagEntity Length 1Entity Length 2Entity Length 3Entity Length 4Entity Length 5
ORG0.610.680.30.120.28
PER0.960.670.7
LOC0.880.36
MISC0.90.460.240.12
ALL0.770.610.310.120.29
Entity LengthOOV Rate 0OOV Rate 0.2OOV Rate 0.25OOV Rate 0.33OOV Rate 0.4OOV Rate 0.5OOV Rate 0.6OOV Rate 0.66OOV Rate 0.75OOV Rate 0.8OOV Rate 1
10.860.3
20.80.50.52
30.780.150.060
40.420.22000
50.670.2200.14

## III. Maximum Entropy Markov Models

### Word Features

$b(o_t) = \begin{cases} \textrm{1 if o_t has shape “Xxx”} \\ \textrm{0 otherwise} \end{cases}$
$f_{\langle b, s \rangle}(o_t, s_t) = \begin{cases} \textrm{1 if b(o_t) = 1 and s_t = s} \\ \textrm{0 otherwise} \end{cases}$

### State Transitions

$p_{s\prime}(s|o) = \frac{1}{Z(o, s\prime)} \exp\left(\sum_{a}{\lambda_a \, f_a(o, s)}\right)$

Calculating pO(B-LOC | “UK”)

With weights retrieved from a MEMM trained on CoNLL-2003 data. Numbers are rounded to 3 decimal places for clarity.
Feature-State Pair (a)λafaλafa

pO(B-LOC | “UK”) = eSUM(λafa)  / Z
≈ e0 / 1
≈ 0

### Results

Most Informative Features when Previous State is

Current Word FeatureCurrent StateWeight
word='germany'B-LOC11.492
word='van'B-PER8.972
word='wall'B-ORG8.525
word='della'B-PER7.86
lowercase='della'B-PER7.86
is_not_title_caseB-PER-6.949
word='de'B-PER6.781
shape='X.X.'O-6.713
shape='xxxx'B-ORG-6.642
word='CLINTON'B-ORG6.456
Name tag predictions by MEMM:

Starting prediction server…

Accuracy

93.1%

Precision

72.9%

Recall

63.5%

F₁ Score

67.9%

Precision

TagNo OOV1+ OOV
ORG0.810.36
PER0.820.8
LOC0.820.17
MISC0.740.14
ALL0.80.54

Recall

TagNo OOV1+ OOV
ORG0.680.12
PER0.720.57
LOC0.890.29
MISC0.780.02
ALL0.790.37

Entity LengthOOV Rate 0OOV Rate 0.2OOV Rate 0.25OOV Rate 0.33OOV Rate 0.4OOV Rate 0.5OOV Rate 0.6OOV Rate 0.66OOV Rate 0.75OOV Rate 0.8OOV Rate 1
10.830.34
20.760.720.55
30.60.560.590.5
40.160.20
50.60.5
TagEntity Length 1Entity Length 2Entity Length 3Entity Length 4Entity Length 5
ORG0.760.690.840.360.8
PER0.590.910.660.25
LOC0.80.330.350
MISC0.820.570.290.18
ALL0.770.70.580.160.43

## IV. Conditional Random Fields

### Markov Random Fields

$\phi(X, Y) = \begin{cases} \textrm{3 if X = 1 and Y = 1} \\ \textrm{2 if X = 0 and Y = 0} \\ \textrm{1 otherwise} \end{cases}$
$p(A,B,C) = \frac{1}{Z}\,\phi(A,B)\,\phi(B,C)\,\phi(C,A) \\ \footnotesize\textrm{where Z is a normalization factor}$
$p(x_1, x_2,…) = \frac{1}{Z}\prod_{c\,{\scriptscriptstyle \in}\,C}{\phi_c(x_c)} \\ \footnotesize\textrm{where Z is a normalization factor} \\ \footnotesize\textrm{and C is the set of cliques in \mathcal{G}}$

Calculating p(A, B, C, D, E, F)

p(1, 1, 1, 0, 0, 0)
=1/Z
⨯ ɸABC(1, 1, 1)
⨯ ɸAB(1, 1)
⨯ ɸBC(1, 1)
⨯ ɸAC(1, 1)
⨯ ɸCD(1, 0)
⨯ ɸDEF(0, 0, 0)
⨯ ɸDE(0, 0)
⨯ ɸEF(0, 0)
⨯ ɸDF(0, 0)

=1 / 28,915
⨯ 3
⨯ 3
⨯ 3
⨯ 3
⨯ 1
⨯ 2
⨯ 2
⨯ 2
⨯ 2

0.0448

ɸABC = ɸAB = ɸ(x) = 3 if x1 = x2 = … = 1
2 if x1 = x2 = … = 0
1 otherwise

### Conditional Form

$p(y|x) = \frac{1}{Z(x)}\prod_{c\,{\scriptscriptstyle \in}\,C}{\phi_c(y_c, x_c)} \\ \footnotesize\textrm{where Z is a normalization factor} \\ \footnotesize\textrm{and C is the set of cliques in the} \\ \footnotesize\textrm{graph \mathcal{G} representing the labels y}$
Linear-chain CRF where the hidden layer depends on the current, previous, and future observations.

### Exponential Factors

$\phi_c(y_c, x_c) = \exp\left(\sum_{a}{\lambda_a \, f_a(y_c, x_c)}\right) \\ \footnotesize\textrm{where f_a is a feature function defined for clique c} \\ \footnotesize\textrm{and \lambda_a is the weight parameter for f_a}$

## References

1. MUC-7 Named Entity Task Definition (Version 3.5) PDF
Nancy Chinchor. 1998. In Seventh Message Understanding Conference (MUC-7).
2. Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning
Daphne Koller and Nir Friedman. 2009. The MIT Press.
3. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference
Judea Pearl. 1988. Morgan Kaufmann Publishers Inc, San Francisco, CA, USA.
4. Genes, Themes and Microarrays: Using Information Retrieval for Large-Scale Gene Analysis
Hagit Shatkay, Stephen Edwards, W John Wilbur, and Mark Boguski. 2000. In Proceedings of the  International Conference on Intelligent Systems for Molecular Biology, 317–328.
5. Information Extraction Using Hidden Markov Models
Timothy Robert Leek. 1997. Master’s Thesis, UC San Diego.
6. Information Extraction with HMMs and Shrinkage PDF
Dayne Freitag and Andrew McCallum. 1999. In Papers from the AAAI-99 Workshop on Machine Learning for Information Extraction (AAAI Techinical Report WS-99-11), 31–36.
7. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition
Lawrence R Rabiner. 1989. Proceedings of the IEEE 77, 2: 257–286. https://doi.org/10.1109/5.18626
8. An Algorithm that Learns What’s in a Name PDF
Daniel M. Bikel, Richard Schwartz, and Ralph M. Weischedel. 1999. Machine Learning 34, 1: 211–231. https://doi.org/10.1023/A:1007558221122
9. Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm
A. Viterbi. 1967. IEEE Transactions on Information Theory 13, 2: 260–269.
10. Appendix A.4 — Decoding: The Viterbi Algorithm PDF
Daniel Jurafsky and James H. Martin. 2021. In Speech and Language Processing. 8–10.
11. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition PDF
Erik F. Tjong Kim Sang and Fien De Meulder. 2003. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, 142–147.
12. Maximum Entropy Markov Models for Information Extraction and Segmentation PDF
Andrew McCallum, Dayne Freitag, and Fernando C. N. Pereira. 2000. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML ’00), 591–598.
13. Maximum Entropy Models for Antibody Diversity Link
Thierry Mora, Aleksandra M. Walczak, William Bialek, and Curtis G. Callan. 2010. Proceedings of the National Academy of Sciences 107, 12: 5405–5410. https://doi.org/10.1073/pnas.1001705107
14. Human Behavior Modeling with Maximum Entropy Inverse Optimal Control PDF
Brian Ziebart, Andrew Maas, J. Bagnell, and Anind Dey. 2009. In Papers from the 2009 AAAI Spring Symposium, Technical Report SS-09-04, Stanford, California, USA, 92–97.
15. On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes PDF
Andrew Ng and Michael Jordan. 2001. In Advances in Neural Information Processing Systems.
16. Inducing Features of Random Fields
S. Della Pietra, V. Della Pietra, and J. Lafferty. 1997. IEEE Transactions on Pattern Analysis and Machine Intelligence 19, 4: 380–393. https://doi.org/ 10.1109/34.588021
17. Une Approche théorique de l’Apprentissage Connexionniste: Applications à la Reconnaissance de la Parole
Léon Bottou. 1991. Université de Paris X.
18. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data PDF
John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML ’01), 282–289.
19. The Label Bias Problem Link
Awni Hannun. 2019. Awni Hannun — Writing About Machine Learning.
20. Discriminative Probabilistic Models for Relational Data Link
Ben Taskar, Pieter Abbeel, and Daphne Koller. 2013. https://doi.org/10.48550/ARXIV.1301.0604
21. Accurate Information Extraction from Research Papers using Conditional Random Fields Link
Fuchun Peng and Andrew McCallum. 2004. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, 329–336.
22. Discriminative Fields for Modeling Spatial Dependencies in Natural Images PDF
Sanjiv Kumar and Martial Hebert. 2003. In Advances in Neural Information Processing Systems.
23. Multiscale Conditional Random Fields for Image Labeling Link
Xuming He, R.S. Zemel, and M.A. Carreira-Perpinan. 2004. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004, CVPR 2004, II–II. https://doi.org/10.1109/CVPR.2004.1315232
24. Conditional Random Fields as Recurrent Neural Networks Link
Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip H. S. Torr. 2015. In 2015 IEEE International Conference on Computer Vision (ICCV). https://doi.org/10.1109/iccv.2015.179
25. Convolutional CRFs for Semantic Segmentation Link
Marvin T. T. Teichmann and Roberto Cipolla. 2018. https://doi.org/10.48550/arxiv.1805.04777
26. RNA Secondary Structural Alignment with Conditional Random Fields Link
Kengo Sato and Yasubumi Sakakibara. 2005. Bioinformatics 21: ii237–ii242. https://doi.org/10.1093/bioinformatics/bti1139
27. Protein Fold Recognition Using Segmentation Conditional Random Fields (SCRFs)
Yan Liu, Jaime Carbonell, Peter Weigele, and Vanathi Gopalakrishnan. 2006. J. Comput. Biol. 13, 2: 394–406.
28. Introduction to Markov Random Fields
Andrew Blake and Pushmeet Kohli. 2011. In Markov Random Fields for Vision and Image Processing. The MIT Press.