Troubleshooting OSPF Without Crying (Much)

A Network Architect’s Guide to Diagnosing Problems Before They Become Resume-Generating Events

Welcome to the grand finale.
You’ve survived the theory, the design, the LSAs, the SPF math, and the war stories.
Now it’s time for the part every network engineer eventually faces:

Troubleshooting OSPF when it decides to set itself on fire at 2:13 AM.

This post is your practical, real-world playbook for finding the root cause fast, before your routing table starts looking like a Jackson Pollock painting.

Let’s go.

1. Start With the Basics: Is the Neighbor Even There?

Before jumping into LSAs, SPF, and conspiracy theories about ghost packets…

Check the fundamentals:
• Interface up/up?
• IP addressing correct?
• Mask correct?
• Link is actually connected to the device you think it is?
• Layer 1/2 errors?
• MAC/ARP behaving?

In practice, a large share of OSPF “problems” turn out to be Layer 1–3 problems wearing an OSPF costume.
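The addressing and mask checks above are easy to script. Here is a minimal sketch using Python’s standard ipaddress module; the function name and the sample addresses are made up for illustration:

```python
import ipaddress

def same_subnet(local: str, remote: str) -> bool:
    """Return True when both interface addresses (CIDR form) share one subnet and mask."""
    a = ipaddress.ip_interface(local)
    b = ipaddress.ip_interface(remote)
    # Networks compare equal only if both the prefix and the mask length match.
    return a.network == b.network

print(same_subnet("10.0.12.1/30", "10.0.12.2/30"))  # True
print(same_subnet("10.0.12.1/30", "10.0.12.2/24"))  # False: mask mismatch
```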

2. Verify Hello Packets: The OSPF Handshake Test

Use packet captures or debugging:
• Are Hellos being sent?
• Are Hellos being received?
• Do they match? (Critical!)

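Check these values for mismatches:
• Hello interval
• Dead interval
• Area ID
• Network mask (checked on broadcast and NBMA segments)
• Network type (not carried in the Hello itself, but a mismatch usually surfaces as mismatched timers or a missing DR)
• Stub/NSSA flags (the E and N option bits)
• Authentication type and key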

OSPF is picky—one mismatch and it ghosts the relationship.
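If you can pull the Hello parameters from both sides (from a capture or from the interface output), comparing them is mechanical. A rough sketch follows; the HelloParams class and its field names are a simplified, hypothetical model, not a real packet parser:

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class HelloParams:
    # Simplified subset of what two neighbors must agree on (illustrative names).
    hello_interval: int
    dead_interval: int
    area_id: str
    network_mask: str
    stub_area: bool
    auth_type: int

def hello_mismatches(local: HelloParams, remote: HelloParams) -> list[str]:
    """Return the names of every parameter that differs between the two sides."""
    return [
        f.name
        for f in fields(local)
        if getattr(local, f.name) != getattr(remote, f.name)
    ]

local = HelloParams(10, 40, "0.0.0.0", "255.255.255.252", False, 0)
remote = HelloParams(30, 120, "0.0.0.0", "255.255.255.252", False, 0)
print(hello_mismatches(local, remote))  # ['hello_interval', 'dead_interval']
```

An empty list means the Hello parameters agree, so the problem is somewhere else.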

3. Stuck in 2-Way?

Check whether you’re on:
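• Broadcast or NBMA → 2-Way is normal between two DROTHERs, which only go Full with the DR and BDR
• Point-to-point → 2-Way = BAD

If a link you think of as point-to-point never gets beyond 2-Way, look for:
• An interface still running the default broadcast network type (common on Ethernet) instead of point-to-point
• OSPF priority 0 on every router, so no DR is ever elected
• Network type mismatch between the two ends

Duplex problems, filtered Hellos, and unidirectional links usually show up earlier as Down or Init, and MTU mismatches show up later in ExStart/Exchange.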
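The “is 2-Way actually a problem?” question boils down to one rule, sketched here. The function name and the string labels for network types and roles are assumptions chosen for illustration:

```python
def two_way_is_expected(network_type: str, local_role: str, neighbor_role: str) -> bool:
    """Return True when staying in 2-Way is normal for this neighbor pair.

    On multi-access segments, routers form full adjacencies only with the
    DR and BDR, so two DROTHERs legitimately stay in 2-Way with each other.
    """
    multi_access = network_type.lower() in {"broadcast", "nbma"}
    both_drother = local_role.upper() == "DROTHER" and neighbor_role.upper() == "DROTHER"
    return multi_access and both_drother

print(two_way_is_expected("broadcast", "DROTHER", "DROTHER"))       # True
print(two_way_is_expected("broadcast", "DROTHER", "DR"))            # False
print(two_way_is_expected("point-to-point", "DROTHER", "DROTHER"))  # False
```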

4. Stuck in ExStart/Exchange? MTU Is Your Culprit

The #1 cause of ExStart/Exchange purgatory:

MTU mismatch

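Routers exchange Database Description (DD) packets that carry the sender’s interface MTU. If one side advertises more than the other can accept, the DD packets get dropped and…
they stare at each other in passive-aggressive contempt forever.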

Fix by:
• Matching interface MTUs on both ends
• Using ip ospf mtu-ignore only if you must, and understanding that you’ve now created a future outage with your name on it
• Checking tunnels, subinterfaces, and VPNs, which often shrink the usable MTU
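To see why tunnels are a classic trap, here is a small sketch of the MTU arithmetic. It assumes a plain GRE tunnel over IPv4 with no optional GRE fields (20 bytes of outer IP header plus 4 bytes of GRE); the function names are illustrative, and real platforms may derive tunnel MTU differently:

```python
IPV4_HEADER_BYTES = 20
GRE_BASE_HEADER_BYTES = 4  # assumes no key, sequence, or checksum options

def gre_tunnel_mtu(physical_mtu: int) -> int:
    """Usable IP MTU inside a basic GRE-over-IPv4 tunnel."""
    return physical_mtu - IPV4_HEADER_BYTES - GRE_BASE_HEADER_BYTES

def accepts_dd(local_mtu: int, advertised_mtu: int) -> bool:
    """A router rejects a DD packet that advertises a larger MTU than it can receive."""
    return advertised_mtu <= local_mtu

def adjacency_can_progress(mtu_a: int, mtu_b: int) -> bool:
    """Both sides must accept each other's DD packets to leave ExStart/Exchange."""
    return accepts_dd(mtu_a, mtu_b) and accepts_dd(mtu_b, mtu_a)

tunnel = gre_tunnel_mtu(1500)
print(tunnel)                                # 1476
print(adjacency_can_progress(1500, tunnel))  # False
print(adjacency_can_progress(1476, tunnel))  # True
```

Only one side has to reject the other’s DD packets for the adjacency to stall, which is why any mismatch is enough.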

5. Missing Routes? Follow the Path Backward

When a route is missing:
• Is it in the LSDB?
• Is the LSA valid?
• Is the LSA too old? (LSAs are flushed when they reach MaxAge, 3600 seconds)
• Is the ABR summarizing it away?
• Is it filtered? (distribute-list, prefix-list, route-map)
• Is the originating router actually acting as an ASBR/ABR?
• Is it seen in the correct area?

Remember: if it’s not in the LSDB, it was never real.

If it is in the LSDB but not in the routing table, either SPF could not use it or the routing table preferred something else.

Possible causes:
• Better route via another protocol
• Route suppressed by area type
• Path is technically unreachable
• Next-hop missing
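The “better route via another protocol” case comes down to administrative distance. The sketch below uses the commonly documented Cisco IOS defaults as an assumption; other vendors use different values and names (for example “preference”), and any of these can be changed in configuration:

```python
# Assumed Cisco IOS default administrative distances (lower wins).
DEFAULT_DISTANCE = {
    "connected": 0,
    "static": 1,
    "ebgp": 20,
    "eigrp": 90,
    "ospf": 110,
    "isis": 115,
    "rip": 120,
    "ibgp": 200,
}

def rib_winner(candidates: list[str]) -> str:
    """Return the protocol whose route gets installed for a single prefix."""
    return min(candidates, key=lambda proto: DEFAULT_DISTANCE[proto])

print(rib_winner(["ospf", "eigrp"]))  # eigrp
print(rib_winner(["ospf", "rip"]))    # ospf
```

If OSPF loses here, the LSA and SPF are fine; the route simply never makes it into the table.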

6. LSA Storms and SPF Thrashing

Symptoms:
• High CPU
• Frequent SPF runs
• Log messages about topology changes
• Massive LSDB churn
• Unstable neighbors

Typical causes:
• Flapping interfaces
• Misbehaving ABRs
• Redistributing unstable external routes
• Too many LSAs
• Giant, flat areas
• Poor summarization

Fix the root instability, not the symptoms.
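One way to put a number on “frequent SPF runs” is to count how many land inside a sliding time window. This is a rough sketch; it assumes you have already extracted SPF run timestamps (in seconds) from your logs, and the window size and alert threshold are yours to pick:

```python
from collections import deque

def max_events_in_window(timestamps: list[float], window_seconds: float) -> int:
    """Largest number of events that fall within any window of the given length."""
    window: deque[float] = deque()
    worst = 0
    for t in sorted(timestamps):
        window.append(t)
        # Drop events that are a full window (or more) older than the newest one.
        while t - window[0] >= window_seconds:
            window.popleft()
        worst = max(worst, len(window))
    return worst

spf_runs = [0, 5, 8, 9, 300, 600]
print(max_events_in_window(spf_runs, 10))  # 4
```

Four SPF runs inside ten seconds points at something flapping; one run every few minutes does not.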

7. DR/BDR Election Problems

Common issues:
• DR is a low-powered device
• DR keeps changing
• BDR stuck in weird states
• Too many routers on the segment
• Non-broadcast networks with wrong neighbor configuration

Best fixes:
• Set explicit OSPF priorities
• Reduce broadcast domain size
• Use point-to-point network type where possible
• Stabilize the DR once elected (or force your choice)

8. Virtual Link Troubles? Blame Design.

Issues include:
• Wrong transit area type (cannot be stub/NSSA)
• MTU problems along the transit path
• Non-contiguous backbone
• Flapping transit links
• Authentication mismatches

Virtual links are fragile.
If one breaks, assume your design wants to be refactored.

9. External Route Chaos (Type 5/7 Issues)

If you have:
• Missing external routes
• Excessive external LSAs
• Type 7 not converting to Type 5
• ASBR unreachable

Check:
• NSSA flags
• ABR placement
• Redistribution policies
• Route-maps
• Filters
• Summary boundaries

External routes get messy fast. If you don’t document redistribution, you’re planting a landmine for Future You.

10. The Systematic Troubleshooting Checklist

Here is the ultimate checklist used in real troubleshooting war rooms:

Step 1: Layer 1/2
• Interface up?
• Errors?
• Duplex?
• Speed?

Step 2: IP + Mask
• Correct network?
• Mask mismatch?
• Point-to-point vs broadcast?

Step 3: OSPF Basics
• Correct area ID?
• Same network type?
• Same timers?
• Same authentication?
• Stub flags aligned?

Step 4: Neighbor State
• Down → no Hellos
• Init → unidirectional
• 2-Way → normal between DROTHERs; otherwise a DR/BDR election or network type problem
• ExStart → MTU
• Exchange → MTU
• Loading → LSA requests failing
• Full → happy router
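The state-to-suspect mapping in Step 4 works well as a lookup table in a script. A small sketch; it assumes neighbor states arrive as strings such as "FULL/DR" or "2-Way" (a common STATE/ROLE display style), and the hint text simply restates the list above:

```python
STATE_HINTS = {
    "down": "no Hellos seen",
    "init": "unidirectional: our Hellos are not getting through",
    "2way": "normal between DROTHERs, otherwise DR/BDR or network type problem",
    "exstart": "MTU mismatch",
    "exchange": "MTU mismatch",
    "loading": "LSA requests failing",
    "full": "happy router",
}

def hint_for(raw_state: str) -> str:
    """Map a neighbor state string to its most likely cause."""
    # Keep only the state part of "STATE/ROLE" and normalize spelling like "2-Way".
    key = raw_state.split("/")[0].strip().lower().replace("-", "")
    return STATE_HINTS.get(key, "unrecognized state")

print(hint_for("EXSTART/DR"))  # MTU mismatch
print(hint_for("2-Way"))       # normal between DROTHERs, otherwise DR/BDR or network type problem
```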

Step 5: LSDB
• LSA present?
• Is the LSA valid?
• Is the LSA checksum correct?
• Scope correct?

Step 6: SPF
• SPF log entries
• Overloaded CPU
• SPF timers triggering frequently

Step 7: Routing Table
• Route present?
• Next-hop reachable?
• Competing protocols?
• Redistribution loops?

Step 8: Architecture
• Summaries?
• ABR placement?
• Area type mismatches?
• Bad topology?

Architect’s Corner: What Experience Teaches You

• OSPF rarely breaks “randomly.” There is almost always a reason.
• Most OSPF outages trace back to MTU, timers, bad design choices, overly large areas, redistribution, or a forgotten config from 2009.
• Always capture Hellos first. They answer most adjacency questions.
• Debugs are useful, but packet captures don’t lie.
• If SPF is running constantly, fix the instability before you fix OSPF.
• Good design prevents most troubleshooting.
• And finally:
Never assume OSPF will work automatically just because you typed “network x.x.x.x” on one router.

You Made It — OSPF Mastered.

This concludes our 7-part OSPF series.
You now understand:
• Neighbor formation
• Area design
• LSA structure
• SPF calculation
• Real-world architecture
• Troubleshooting like a pro
