Troubleshooting
Common issues, diagnostics, and solutions for entropyDB
Overview
Troubleshooting areas:
- Connection Issues: Can't connect to the database
- Performance Problems: Slow queries, high latency
- Deadlocks: Transaction conflicts
- Disk Space: Running out of storage
- Replication Lag: Replica behind primary
Connection Issues
-- Problem: Can't connect to database
-- Error: "FATAL: no pg_hba.conf entry for host"
-- Solution: Update pg_hba.conf

# Add client IP address
host    all    all    192.168.1.0/24    md5

# Reload configuration
SELECT pg_reload_conf();

-- Problem: "FATAL: too many connections"
-- Check current connections
SELECT count(*) FROM pg_stat_activity;

-- Check max_connections setting
SHOW max_connections;

-- Kill idle connections
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < NOW() - INTERVAL '1 hour';

-- Increase max_connections (requires restart)
ALTER SYSTEM SET max_connections = 300;
-- Restart database
-- Better: Use connection pooling (PgBouncer)

-- Problem: "FATAL: password authentication failed"
-- Check user exists
SELECT usename FROM pg_user WHERE usename = 'myuser';

-- Reset password
ALTER USER myuser WITH PASSWORD 'newpassword';

-- Check authentication method
SELECT * FROM pg_hba_file_rules;
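Beyond server-side PgBouncer, capping connections from the application also works. The core idea is small enough to sketch; this is a minimal queue-based pool (names like `SimplePool` are illustrative, not a real library API), shown with a stand-in factory so it runs without a database:

```python
import queue

class SimplePool:
    """Minimal connection-pool sketch: hand out at most max_size
    connections and reuse them instead of reconnecting per request."""

    def __init__(self, factory, max_size=5):
        self._factory = factory  # callable that opens one connection
        self._pool = queue.LifoQueue(maxsize=max_size)
        for _ in range(max_size):  # open all connections up front
            self._pool.put(factory())

    def acquire(self, timeout=5.0):
        # Blocks (raising queue.Empty on timeout) instead of opening a
        # new connection, so the server never sees more than max_size
        # sessions from this process.
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)

# Usage with a stand-in factory; a real app would pass e.g. psycopg2.connect:
pool = SimplePool(factory=object, max_size=3)
conn = pool.acquire()
pool.release(conn)
```

A real pool would also validate connections on release and reopen broken ones; the point here is only the bounded-reuse pattern that keeps you under max_connections.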
Performance Diagnostics
-- Problem: Slow queries
-- Find slow queries
SELECT
  query,
  calls,
  mean_exec_time,
  max_exec_time,
  total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Check if query uses indexes
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM users WHERE email = 'user@example.com';

-- Look for:
-- - "Seq Scan" (bad for large tables)
-- - "Index Scan" (good)
-- - High "Buffers" read vs hit ratio

-- Problem: High CPU usage
-- Check active queries
SELECT
  pid,
  usename,
  state,
  query,
  NOW() - query_start AS duration
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;

-- Kill long-running query (substitute the target pid)
SELECT pg_cancel_backend(pid);    -- Graceful
SELECT pg_terminate_backend(pid); -- Force

-- Problem: High memory usage
-- pg_backend_memory_contexts only reports on the current backend
-- (it has no pid column), so inspect one session at a time:
SELECT
  name,
  parent,
  pg_size_pretty(total_bytes) AS memory
FROM pg_backend_memory_contexts
ORDER BY total_bytes DESC
LIMIT 10;

-- Problem: Cache hit ratio low
SELECT
  sum(heap_blks_read) AS heap_read,
  sum(heap_blks_hit) AS heap_hit,
  round(sum(heap_blks_hit) * 100.0 /
        NULLIF(sum(heap_blks_hit) + sum(heap_blks_read), 0), 2) AS cache_hit_ratio
FROM pg_statio_user_tables;
-- Should be > 99%. If lower, increase shared_buffers
Deadlock Detection
-- Problem: "ERROR: deadlock detected"
-- Enable deadlock logging
ALTER SYSTEM SET log_lock_waits = on;
ALTER SYSTEM SET deadlock_timeout = '1s';

-- Find blocked queries
SELECT
  blocked_locks.pid AS blocked_pid,
  blocked_activity.usename AS blocked_user,
  blocking_locks.pid AS blocking_pid,
  blocking_activity.usename AS blocking_user,
  blocked_activity.query AS blocked_statement,
  blocking_activity.query AS blocking_statement
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity
  ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
  ON blocking_locks.locktype = blocked_locks.locktype
  AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
  AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
  AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
  AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
  AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
  AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
  AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
  AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
  AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
  AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity
  ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;

-- Kill blocking query (use the blocking_pid from the query above)
SELECT pg_terminate_backend(blocking_pid);

-- Prevention: Always acquire locks in same order
-- BAD: Transaction 1 locks A then B, Transaction 2 locks B then A
-- GOOD: Both lock A then B

-- Use lock timeouts
SET lock_timeout = '5s';
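"Always acquire locks in the same order" is easiest to enforce in application code by sorting the keys before locking. A hedged sketch (the `accounts` table and helper names are hypothetical) that builds a single `FOR UPDATE` statement locking rows in canonical order:

```python
def lock_order(ids):
    """Return ids in a canonical (sorted, deduplicated) order so every
    transaction acquires row locks in the same sequence and two
    transfers between the same accounts cannot deadlock."""
    return sorted(set(ids))

def transfer_lock_sql(account_ids):
    # Hypothetical accounts table; lock all rows in canonical order
    # with one FOR UPDATE before touching any balance.
    ordered = lock_order(account_ids)
    placeholders = ", ".join("%s" for _ in ordered)
    sql = (f"SELECT id FROM accounts WHERE id IN ({placeholders}) "
           "ORDER BY id FOR UPDATE")
    return sql, ordered

# Both directions of a transfer now lock account 7 before 42:
sql, params = transfer_lock_sql([42, 7])
```

The `ORDER BY id FOR UPDATE` matters too: it makes the server itself take the row locks in id order, matching the order the parameters are sorted in.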
Disk Space Issues
-- Problem: "ERROR: could not extend file... No space left on device"
-- Check database size
SELECT
  pg_database.datname,
  pg_size_pretty(pg_database_size(pg_database.datname)) AS size
FROM pg_database
ORDER BY pg_database_size(pg_database.datname) DESC;

-- Check table sizes
SELECT
  schemaname,
  tablename,
  pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size,
  pg_size_pretty(pg_relation_size(schemaname||'.'||tablename)) AS table_size,
  pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)
                 - pg_relation_size(schemaname||'.'||tablename)) AS index_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 10;

-- Check for table bloat (pg_stat_user_tables exposes relname, not tablename)
SELECT
  schemaname,
  relname,
  pg_size_pretty(pg_total_relation_size(relid)) AS size,
  n_dead_tup,
  n_live_tup,
  round(n_dead_tup * 100.0 / NULLIF(n_live_tup, 0), 2) AS dead_pct
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC;

-- Solution: VACUUM FULL (requires exclusive lock)
VACUUM FULL bloated_table;

-- Or pg_repack (no downtime), from the shell:
--   pg_repack -t bloated_table -d mydb

-- Check WAL directory size
SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir();

-- Archive old WAL files
SELECT pg_switch_wal();

-- Delete old logs, from the shell in the data directory:
--   find pg_log -name "*.log" -mtime +30 -delete
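When driving VACUUM decisions from a monitoring script rather than autovacuum alone, the bloat query's filter is easy to mirror in code. A small sketch (thresholds are illustrative, not entropyDB defaults) that flags a table once dead tuples pass both an absolute and a relative threshold:

```python
def needs_vacuum(n_dead_tup, n_live_tup, min_dead=1000, dead_pct=20.0):
    """Decide whether a table from pg_stat_user_tables is a VACUUM
    candidate: at least min_dead dead tuples AND dead tuples at least
    dead_pct percent of live ones. A table with zero live tuples but
    many dead ones is treated as fully bloated."""
    if n_dead_tup < min_dead:
        return False
    pct = n_dead_tup * 100.0 / n_live_tup if n_live_tup else float("inf")
    return pct >= dead_pct

# 5,000 dead vs 10,000 live (50%) -> vacuum; 2,000 dead vs 100,000 live
# (2%) -> leave it to routine autovacuum.
```

Requiring both thresholds avoids churning tiny tables (high percentage, few rows) and huge tables with trivial relative bloat.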
Replication Issues
-- Problem: Replication lag
-- Check lag on primary
SELECT
client_addr,
state,
sent_lsn,
write_lsn,
flush_lsn,
replay_lsn,
pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes,
pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS lag_pretty
FROM pg_stat_replication;
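The lag_bytes column above is plain LSN arithmetic, which is handy to reproduce in monitoring scripts that only receive the raw LSN strings. A minimal sketch of the same computation (an LSN like '16/B374D848' is a high and a low 32-bit hex half):

```python
def lsn_to_bytes(lsn):
    """Parse a PostgreSQL LSN such as '16/B374D848' into an absolute
    byte position: (high 32 bits << 32) | low 32 bits."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def lag_bytes(sent_lsn, replay_lsn):
    """Same arithmetic as pg_wal_lsn_diff(sent_lsn, replay_lsn)."""
    return lsn_to_bytes(sent_lsn) - lsn_to_bytes(replay_lsn)

# e.g. lag_bytes("16/B374D848", "16/B374D800") is 0x48 = 72 bytes
```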
-- Check lag on replica
SELECT
NOW() - pg_last_xact_replay_timestamp() AS replication_lag;
-- Causes of lag:
-- 1. Network issues
-- 2. Slow disk on replica
-- 3. Heavy queries on replica blocking apply
-- Find queries blocking replication
SELECT
pid,
usename,
application_name,
state,
query
FROM pg_stat_activity
WHERE state = 'active'
AND backend_type = 'client backend';
-- Kill blocking queries on replica
SELECT pg_terminate_backend(pid);
-- Problem: Replication slot full
-- Check replication slots
SELECT
slot_name,
slot_type,
active,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
-- Drop unused slot
SELECT pg_drop_replication_slot('slot_name');
Error Code Reference
Common Error Codes
- 23505: Unique constraint violation
- 23503: Foreign key violation
- 40001: Serialization failure (retry transaction)
- 40P01: Deadlock detected
- 53300: Too many connections
- 53400: Configuration limit exceeded
- 57P03: Database in recovery mode
- 58000: System error (disk full, etc.)
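Codes 40001 and 40P01 are transient: the documented fix is to retry the whole transaction. A hedged retry sketch with exponential backoff and jitter; `pgcode` is how psycopg2 exposes the SQLSTATE on its exceptions (adjust the attribute for other drivers):

```python
import random
import time

RETRYABLE = {"40001", "40P01"}  # serialization failure, deadlock detected

def run_with_retry(txn, attempts=5, base_delay=0.05):
    """Re-run a transaction callable on transient SQLSTATEs, backing
    off exponentially with jitter; re-raise anything else (e.g. a
    23505 unique violation) immediately."""
    for attempt in range(attempts):
        try:
            return txn()
        except Exception as exc:
            code = getattr(exc, "pgcode", None)  # psycopg2-style SQLSTATE
            if code not in RETRYABLE or attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

The callable must begin and end its own transaction, so each retry replays the full unit of work rather than half of one.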
Best Practices
Prevention
- Monitor metrics continuously
- Set up alerts for issues
- Run regular maintenance (VACUUM)
- Use connection pooling
Diagnosis
- Check logs first
- Use EXPLAIN for slow queries
- Monitor system resources
- Test fixes in a staging environment