Parsing
TL;DR
Representing and Analyzing Sentence Structure
Phrase structure grammar
- Context free grammar
- Problems:
- Ambiguities: PP attachment
Traditional Approaches
Statistical Parsing
Probabilistic Context Free Grammar
CYK Algorithm
Transition-based parsing
Grammaticality
Common approach in statistical natural language processing: n-gram Language Model
E.g., tri-gram
$$ P\left(w_{1}, \ldots, w_{n}\right) = P\left(w_{1}\right) \cdot P\left(w_{2} \mid w_{1}\right) \cdot P\left(w_{3} \mid w_{1} w_{2}\right) \cdots \approx \prod_{i=1}^{n} P\left(w_{i} \mid w_{i-2} w_{i-1}\right) $$
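A minimal count-based sketch of this trigram factorization (the toy corpus and the unsmoothed MLE estimates are illustrative assumptions):

```python
from collections import Counter

# Toy corpus; a real model would be trained on millions of sentences.
corpus = [["<s>", "<s>", "i", "saw", "the", "boy", "</s>"],
          ["<s>", "<s>", "i", "saw", "him", "</s>"]]

trigrams, bigrams = Counter(), Counter()
for sent in corpus:
    for i in range(2, len(sent)):
        trigrams[tuple(sent[i-2:i+1])] += 1
        bigrams[tuple(sent[i-2:i])] += 1

def p_trigram(w, u, v):
    """MLE estimate of P(w | u v); no smoothing, so unseen trigrams get 0."""
    return trigrams[(u, v, w)] / bigrams[(u, v)] if bigrams[(u, v)] else 0.0

def p_sentence(words):
    """Approximate P(w_1 ... w_n) as a product of trigram probabilities."""
    padded = ["<s>", "<s>"] + words
    p = 1.0
    for i in range(2, len(padded)):
        p *= p_trigram(padded[i], padded[i-2], padded[i-1])
    return p

print(p_sentence(["i", "saw", "him", "</s>"]))  # 0.5 on this toy corpus
```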
Problems of Language Models
- Generalization: even with a very long context, there are sentences you cannot model with an n-gram language model
- Overall sentence structure is not captured
How can we model what a grammatically correct sentence is?
- Need arbitrarily long context
- Use a grammar describing the generation of the sentence
Phrase structure grammar
Describe sentence structure by grammar (Constituency relation)
Phrase structure organizes words into nested constituents (can represent the grammar with CFG rules)
Units in the grammar: constituents
- Can be moved around
- I saw you today
- Today, I saw you
- expand/contract
- I saw the boy
- I saw him
- I saw the old boy
In syntactic analysis, a constituent is a word or a group of words that function as a single unit within a hierarchical structure. The constituent structure of sentences is identified using tests for constituents.
A phrase is a sequence of one or more words (in some theories two or more) built around a head lexical item and working as a unit within a sentence. A word sequence is shown to be a phrase/constituent if it exhibits one or more of the behaviors discussed above.
Phrase structure rules
- Describe syntax of language
- Example
- S –> NP VP (a sentence consists of a noun phrase and a verb phrase)
- NP –> Det N (a noun phrase consists of a determiner and a noun)
- Only looking at the syntax
- No semantics
Wiki:
In linguistics, phrase structure grammars are all those grammars that are based on the constituency relation, as opposed to the dependency relation associated with dependency grammars; hence, phrase structure grammars are also known as constituency grammars
The fundamental trait that these frameworks all share is that they view sentence structure in terms of the constituency relation.
Example: Constituency relation vs. Dependency relation
Context Free Grammar
Constituency = phrase structure grammar = context-free grammars (CFGs)
Introduced by Chomsky
4-tuple:
$$ G = (V, \Sigma, R, S) $$
- $V$: finite set of non-terminals
- variables describing the phrases (NP, VP, …)
- $\Sigma$: finite set of terminals
  - the content of the sentence
  - all words in the grammar
- $R$: finite relation from $V$ to $(V \cup \Sigma)^{*}$
- Rules defining how non-terminals can be replaced
- E.g.: S –> NP VP
- $S$: start symbol ($S \in V$)
Example
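A sketch of such a grammar with NLTK's CFG tools (the toy rules and lexicon are assumptions, not the lecture's grammar):

```python
import nltk

# Toy grammar in the 4-tuple sense: non-terminals, terminals, rules, start symbol S.
grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> Det N | 'I'
    VP -> V NP
    Det -> 'the'
    N  -> 'boy'
    V  -> 'saw'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("I saw the boy".split()):
    print(tree)  # (S (NP I) (VP (V saw) (NP (Det the) (N boy))))
```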
Dependency Structure
Different approach to describe sentence structure
Identify semantic relations!
Idea:
- Which words depend on which words
- Which word modifies which word
Example:
The (finite) verb is taken to be the structural center of clause structure. All other syntactic units (words) are either directly or indirectly connected to the verb in terms of the directed links, which are called dependencies.
A dependency structure is determined by the relation between a word (a head) and its dependents. Dependency structures are flatter than phrase structures in part because they lack a finite verb phrase constituent, and they are thus well suited for the analysis of languages with free word order, such as Czech or Warlpiri.
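A quick illustration of such head-dependent relations using spaCy (an assumption for illustration, not part of the lecture; requires the en_core_web_sm model to be installed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I saw the boy")

# Each word points to its head; the finite verb is the root of the clause.
for token in doc:
    print(f"{token.text:>5} --{token.dep_}--> {token.head.text}")
```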
Difficulties
Ambiguities!!!
E.g.: Prepositional phrase attachment ambiguity
Parsing
Automatically generate parse tree for sentence
Given:
Grammar
Sentence
Find: hidden structure
Idea: Search for different parses
Applications
Question answering
Named Entity extraction
Sentiment analysis
Sentence Compression
Traditional approaches
Hand-defined rules: restrict the rules by hand so that, ideally, only one parse tree is possible
🔴 Problems
- Many parses for the same sentence
- Coverage problem (many sentences cannot be parsed)
- Time and cost intensive
Statistical parsing
Use machine learning techniques to distinguish probable and less probable trees
- Automatically learn rules from training data
- Hand-annotated text with parse trees
- still many parse trees for one sentence 🤪
- But the probabilities determine the most probable one
- Tasks
- Training: learn possible rules and their probabilities
- Search: find most probable parse tree for sentence
Annotated Data
Treebank:
human-annotated sentences with structure
Words
POS Tags
Phrase structure
👍 Advantages:
Reusable
High coverage
Evaluation
Example
Probabilistic Context Free Grammar
Extension to Context Free Grammar
Formal definition: 5-tuple
$$ G = (V, \Sigma, R, S, P) $$
- $V, \Sigma, R, S$: same as for a Context Free Grammar
- $P$: set of probabilities on the production rules
- E.g.: S –> NP VP with probability 0.5
Properties
Probability of a derivation is the product over all applied rules
$$ P(D)=\prod_{r \in D} P(r) $$
Probabilities of all rules replacing one non-terminal sum to one
$$ \sum_{\beta} P(A \rightarrow \beta)=1 $$
Probabilities of all derivations sum to one
$$ \sum_{D} P(D)=1 $$
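A tiny worked sketch of the product rule, with made-up rule probabilities for one derivation:

```python
import math

# Hypothetical rule probabilities for one derivation of "I saw the boy".
rules = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("I",)): 0.3,
    ("VP", ("V", "NP")): 0.6,
    ("V", ("saw",)): 0.2,
    ("NP", ("Det", "N")): 0.7,
    ("Det", ("the",)): 0.8,
    ("N", ("boy",)): 0.1,
}

# P(D) = product of the probabilities of all rules used in the derivation.
p = math.prod(rules[r] for r in rules)
print(p)
```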
Training
Input: Annotated training data (E.g.: Treebank)
Training
- Rule extraction: Extract possible rules from the trees of the training data
- Probability estimation
- Assign probabilities of the rules
- Maximum-likelihood estimation
Example:
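A minimal sketch of maximum-likelihood estimation from extracted rule occurrences (the rule list and its encoding are assumptions):

```python
from collections import Counter

# Rules extracted from the training trees, one entry per occurrence.
extracted = [("S", ("NP", "VP")), ("NP", ("Det", "N")),
             ("NP", ("Det", "N")), ("NP", ("PRP",)), ("VP", ("V", "NP"))]

rule_counts = Counter(extracted)
lhs_counts = Counter(lhs for lhs, _ in extracted)

# Maximum likelihood: P(A -> beta) = count(A -> beta) / count(A)
probs = {rule: c / lhs_counts[rule[0]] for rule, c in rule_counts.items()}
print(probs[("NP", ("Det", "N"))])  # 2/3
```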
Search
Find possible parses of the sentence
Statistical approach: Find all/many possible parse trees
Return most probable one
Strategies:
Top-Down
Bottom up
Shift-reduce algorithm
- Shift: advances in the input stream by one symbol. That shifted symbol becomes a new single-node parse tree.
- Reduce: applies a completed grammar rule to some of the recent parse trees, joining them together as one tree with a new root symbol.
Example
Dynamic Programming
CYK Parsing
Avoid repeated work
Use Dynamic Programming
- Transform the grammar into Chomsky normal form
- Store the best trees for subphrases
- Combine trees from the best trees of the subphrases
All rules must have one of the following forms
- A –> BC
  - A, B, C non-terminals
  - B, C not the start symbol
- A –> a
  - A non-terminal
  - a terminal
- S –> $\epsilon$
  - produces the empty string, if the empty string is in the language
Every context-free grammar can be transformed into one in Chomsky normal form
Binarization
Only rules with two non-terminals
Idea:
Introduce an additional non-terminal
Replace a rule with three non-terminals on the right-hand side by two rules with two non-terminals each
Example
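For instance (a generic illustration, not necessarily the lecture's example): the ternary rule
VP –> V NP PP
is replaced by the two binary rules
VP –> V @VP-V
@VP-V –> NP PP
where @VP-V is the newly introduced non-terminal.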
Remove unaries
- Remove intermediate rules
Problems
- Very strong independence assumption
- The non-terminal label is an information bottleneck
Grammar
Analyse the sentence “she eats a fish with a fork” with the CYK algorithm:
Result:
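Since the grammar figure is not reproduced here, the following recognizer sketch assumes a small toy CNF grammar that covers the sentence; it fills the chart bottom-up over increasing span widths, as described above:

```python
from itertools import product

# Toy CNF grammar for the example sentence (an assumption, not the
# lecture's grammar from the figure above).
binary = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}, ("VP", "PP"): {"VP"},
          ("Det", "N"): {"NP"}, ("NP", "PP"): {"NP"}, ("P", "NP"): {"PP"}}
lexical = {"she": {"NP"}, "eats": {"V"}, "a": {"Det"},
           "fish": {"N"}, "fork": {"N"}, "with": {"P"}}

def cyk(words):
    n = len(words)
    # chart[i][j]: set of non-terminals deriving words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(lexical.get(w, ()))
    for span in range(2, n + 1):              # width of the subphrase
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):         # split point
                for B, C in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= binary.get((B, C), set())
    return "S" in chart[0][n]

print(cyk("she eats a fish with a fork".split()))  # True
```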
Transition-based Dependency Parsing
Model Dependency structure
Predict a transition sequence: transitions between configurations
Arc-standard System
Configuration
- Stack
- Buffer
- Set of Dependency Arcs
- Initial configuration: $([\text{Root}],\ [w_1, \dots, w_n],\ \{\})$
- All words are in the buffer
- The stack is empty
- The dependency graph is empty
- Terminal configuration
- The buffer is empty
- The stack contains a single word
Example
Transitions
Left-arc
$$ ([\sigma|i|j], B, A) \Rightarrow ([\sigma|j], B, A \cup \{(j, l, i)\}) $$
- Add a dependency with label $l$ from the top element to the second-top element of the stack to the arcs
- Remove the second-top element from the stack
Right-arc
$$ ([\sigma|i|j], B, A) \Rightarrow ([\sigma|i], B, A \cup \{(i, l, j)\}) $$
- Add a dependency with label $l$ from the second-top element to the top element of the stack to the arcs
- Remove the top element from the stack
Shift: Move the first element of the buffer to the stack
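A minimal sketch of the arc-standard system as defined above (using word strings instead of indices, and a hypothetical three-word sentence, for brevity):

```python
def shift(stack, buffer, arcs):
    """Move the first element of the buffer to the stack."""
    return stack + [buffer[0]], buffer[1:], arcs

def left_arc(stack, buffer, arcs, label):
    """Top of the stack becomes head of the second-top element."""
    i, j = stack[-2], stack[-1]
    return stack[:-2] + [j], buffer, arcs | {(j, label, i)}

def right_arc(stack, buffer, arcs, label):
    """Second-top element of the stack becomes head of the top element."""
    i, j = stack[-2], stack[-1]
    return stack[:-2] + [i], buffer, arcs | {(i, label, j)}

# Initial configuration: ([Root], all words in the buffer, no arcs).
config = (["Root"], ["she", "eats", "fish"], frozenset())
config = shift(*config)              # [Root, she], [eats, fish]
config = shift(*config)              # [Root, she, eats], [fish]
config = left_arc(*config, "nsubj")  # she <-- eats
config = shift(*config)              # [Root, eats, fish]
config = right_arc(*config, "obj")   # eats --> fish
config = right_arc(*config, "root")  # Root --> eats
print(config)  # terminal: buffer empty, single element on the stack
```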
Example
- Initial configuration
- Shift
- Shift
- Left arc
- Shift
- Shift
- Left arc
- Shift
- Right arc
- Right arc
Problems
- Sparsity
- Incompleteness
- Expensive computation
Neural Network-based prediction
Feed forward neural network to predict operation
Input
- Set of words $S^w$, POS tags $S^t$ and labels $S^l$
- Fixed number
- Map to continuous space
Output
- Operation (transition)
- $2N_l + 1$ classes: a left-arc and a right-arc for each of the $N_l$ labels, plus shift
Example structure
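A sketch of such a feed-forward predictor in PyTorch, in the spirit of Chen & Manning (2014); the feature counts (18 words, 18 tags, 12 labels) follow that paper, while the dimensions are illustrative and ReLU stands in for the paper's cube activation:

```python
import torch
import torch.nn as nn

class TransitionClassifier(nn.Module):
    """Feed-forward network predicting the next parser operation."""
    def __init__(self, n_words, n_tags, n_labels, dim=50, hidden=200):
        super().__init__()
        # Map words, POS tags and arc labels into a continuous space.
        self.word_emb = nn.Embedding(n_words, dim)
        self.tag_emb = nn.Embedding(n_tags, dim)
        self.label_emb = nn.Embedding(n_labels, dim)
        # Fixed number of features: 18 words, 18 tags, 12 labels.
        self.hidden = nn.Linear((18 + 18 + 12) * dim, hidden)
        # One class per transition: 2 * N_l arcs plus shift.
        self.out = nn.Linear(hidden, 2 * n_labels + 1)

    def forward(self, words, tags, labels):
        # words: (batch, 18), tags: (batch, 18), labels: (batch, 12)
        x = torch.cat([self.word_emb(words).flatten(1),
                       self.tag_emb(tags).flatten(1),
                       self.label_emb(labels).flatten(1)], dim=1)
        return self.out(torch.relu(self.hidden(x)))  # logits over transitions
```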
Evaluation
Label precision/recall
Describe the tree as a set of triples (label, start, end)
Calculate precision/recall/F-score between reference and hypothesis
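A small sketch of this evaluation over (label, start, end) triples (the example triples are hypothetical):

```python
def prf(reference, hypothesis):
    """Labeled-bracket precision/recall/F1 over (label, start, end) triples."""
    ref, hyp = set(reference), set(hypothesis)
    correct = len(ref & hyp)
    p = correct / len(hyp) if hyp else 0.0
    r = correct / len(ref) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Hypothetical triples for one sentence; hypothesis misattaches the VP span.
ref = {("S", 0, 7), ("NP", 0, 1), ("VP", 1, 7), ("NP", 2, 4), ("PP", 4, 7)}
hyp = {("S", 0, 7), ("NP", 0, 1), ("VP", 1, 4), ("NP", 2, 4), ("PP", 4, 7)}
print(prf(ref, hyp))  # (0.8, 0.8, 0.8)
```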