PgSQL &#183; 特性分析 &#183; PostgreSQL Aurora方案与DEMO

2024-07-21 02:51:35

字体：大中小

来源：转载

供稿：网友

前言

亚马逊推出的Aurora数据库引擎，支持一份存储，一主多读的架构。这个架构和Oracle RAC类似，也是共享存储，但是只有一个实例可以执行写操作，其他实例只能执行读操作。相比传统的基于复制的一主多读，节约了存储和网络带宽的成本。

我们可以使用PostgreSQL的hot standby模式来模拟这种共享存储一主多读的架构，但是需要注意几点，hot standby也会对数据库有写的动作，例如recovery时，会修改控制文件，数据文件等等，这些操作是多余的。另外很多状态是存储在内存中的，所以内存状态也需要更新。

还有需要注意的是：

pg_xlogpg_logpg_clogpg_multixactpostgresql.confrecovery.confpostmaster.pid最终实现一主多备的架构，需要通过改PG内核来实现：
这些文件应该是每个实例对应一份。postgresql.conf, recovery.conf, postmaster.pid, pg_controlhot standby不执行实际的恢复操作，但是需要更新自己的内存状态，如当前的OID，XID等等，以及更新自己的pg_control。在多实例间，要实现主到备节点的OS脏页的同步，数据库shared buffer脏页的同步。模拟过程
不改任何代码，在同一主机下启多实例测试，会遇到一些问题。(后面有问题描述，以及如何修改代码来修复这些问题)
主实例配置文件：
 # vi postgresql.conflisten_addresses='0.0.0.0'port=1921max_connections=100unix_socket_directories='.'ssl=onssl_ciphers='EXPORT40'shared_buffers=512MBhuge_pages=trymax_PRepared_transactions=0max_stack_depth=100kBdynamic_shared_memory_type=posixmax_files_per_process=500wal_level=logicalfsync=offsynchronous_commit=offwal_sync_method=open_datasyncfull_page_writes=offwal_log_hints=offwal_buffers=16MBwal_writer_delay=10mscheckpoint_segments=8archive_mode=offarchive_command='/bin/date'max_wal_senders=10max_replication_slots=10hot_standby=onwal_receiver_status_interval=1shot_standby_feedback=onenable_bitmapscan=onenable_hashagg=onenable_hashjoin=onenable_indexscan=onenable_material=onenable_mergejoin=onenable_nestloop=onenable_seqscan=onenable_sort=onenable_tidscan=onlog_destination='csvlog'logging_collector=onlog_directory='pg_log'log_truncate_on_rotation=onlog_rotation_size=10MBlog_checkpoints=onlog_connections=onlog_disconnections=onlog_duration=offlog_error_verbosity=verboselog_line_prefix='%ilog_statement='none'log_timezone='PRC'autovacuum=onlog_autovacuum_min_duration=0autovacuum_vacuum_scale_factor=0.0002autovacuum_analyze_scale_factor=0.0001datestyle='iso,timezone='PRC'lc_messages='C'lc_monetary='C'lc_numeric='C'lc_time='C'default_text_search_config='pg_catalog.english' # vi recovery.donerecovery_target_timeline='latest'standby_mode=onprimary_conninfo = 'host=127.0.0.1 port=1921 user=postgres keepalives_idle=60' # vi pg_hba.conflocal   replication     postgres                                trusthost    replication     postgres 127.0.0.1/32            trust启动主实例。
postgres@digoal-> pg_ctl start启动只读实例，必须先删除postmaster.pid，这点PostgreSQL新版本加了一个PATCH，如果这个文件被删除，会自动关闭数据库，所以我们需要注意，不要使用最新的PGSQL，或者把这个patch干掉先。
postgres@digoal-> cd $PGDATApostgres@digoal-> mv recovery.done recovery.confpostgres@digoal-> rm -f postmaster.pidpostgres@digoal-> pg_ctl start -o "-c log_directory=pg_log1922 -c port=1922"查看当前控制文件状态，只读实例改了控制文件，和前面描述一致。
postgres@digoal-> pg_controldata |grep stateDatabase cluster state:               in archive recovery连到主实例，创建表，插入测试数据。
psql -p 1921postgres=# create table test1(id int);CREATE TABLEpostgres=# insert into test1 select generate_series(1,10);INSERT 0 10在只读实例查看插入的数据。
postgres@digoal-> psql -h 127.0.0.1 -p 1922postgres=# select * from test1; id----  1  2  3  4  5  6  7  8  9 10(10 rows)主实例执行检查点后，控制文件状态会改回生产状态。
psql -p 1921postgres=# checkpoint;CHECKPOINTpostgres@digoal-> pg_controldata |grep stateDatabase cluster state:               in production但是如果在只读实例执行完检查点，又会改回恢复状态。
postgres@digoal-> psql -h 127.0.0.1 -p 1922psql (9.4.4)postgres=# checkpoint;CHECKPOINTpostgres@digoal-> pg_controldata |grep stateDatabase cluster state:               in archive recovery注意到，上面的例子有1个问题，用流复制的话，会从主节点通过网络拷贝XLOG记录，并覆盖同一份已经写过的XLOG记录的对应的OFFSET，这是一个问题，因为可能会造成主节点看到的数据不一致（比如一个数据块改了多次，只读实例在恢复时将它覆盖到老的版本了，在主实例上看到的就会变成老版本的BLOCK，后面再来改这个问题，禁止只读实例恢复数据）。
另一方面，我们知道PostgreSQL standby会从三个地方（流、pg_xlog、restore_command）读取XLOG进行恢复，所以在共享存储的环境中，我们完全没有必要用流复制的方式，直接从pg_xlog目录读取即可。修改recovery.conf参数，将以下注释
 # primary_conninfo = 'host=127.0.0.1 port=1921 user=postgres keepalives_idle=60'重启只读实例。
pg_ctl stop -m fastpostgres@digoal-> pg_ctl start -o "-c log_directory=pg_log1922 -c port=1922"重新测试数据一致性。主实例：
postgres=# insert into test1 select generate_series(1,10);INSERT 0 10postgres=# insert into test1 select generate_series(1,10);INSERT 0 10postgres=# insert into test1 select generate_series(1,10);INSERT 0 10postgres=# insert into test1 select generate_series(1,10);INSERT 0 10只读实例：
postgres=# select count(*) from test1; count-------    60(1 row)问题分析和解决
截至目前，有几个问题未解决：
standby还是要执行recovery的操作，recovery产生的write操作会随着只读实例数量的增加而增加。另外recovery有一个好处，解决了脏页的问题，主实例shared buffer中的脏页不需要额外的同步给只读实例了。recovery还会带来一个严重的BUG，回放可能和当前主节点操作同一个data page；或者回放时将块回放到老的状态，而实际上主节点又更新了这个块，造成数据块的不一致。如果此时只读实例关闭，然后立即关闭主实例，数据库再起来时，这个数据块是不一致的；standby还是会改控制文件；在同一个$PGDATA下启动实例，首先要删除postmaster.pid；关闭实例时，已经被删除postmaster.pid的实例，只能通过找到postgres主进程的pid，然后发kill -s 15, 2或3的信号来关闭数据库；
 static void set_mode(char *modeopt) {         if (strcmp(modeopt, "s") == 0 || strcmp(modeopt, "smart") == 0)         {                 shutdown_mode = SMART_MODE;                 sig = SIGTERM;         }         else if (strcmp(modeopt, "f") == 0 || strcmp(modeopt, "fast") == 0)         {                 shutdown_mode = FAST_MODE;                 sig = SIGINT;         }         else if (strcmp(modeopt, "i") == 0 || strcmp(modeopt, "immediate") == 0)         {                 shutdown_mode = IMMEDIATE_MODE;                 sig = SIGQUIT;         }         else         {                 write_stderr(_("%s: unrecognized shutdown mode /"%s/"/n"), progname, modeopt);                 do_advice();                 exit(1);         } }当主节点删除rel page时，只读实例回放时，会报invalid xlog对应的rel page不存在的错误，这个也是只读实例需要回放日志带来的问题。非常容易重现这个问题，删除一个表即可。
 2015-10-09 13:30:50.776 CST,,,2082,,561750ab.822,20,,2015-10-09 13:29:15 CST,1/0,0,WARNING,01000,"page 8 of relation base/151898/185251 does not exist",,,,,"xlog redo clean: rel 1663/151898/185251; blk 8 remxid 640632117",,,"report_invalid_page, xlogutils.c:67","" 2015-10-09 13:30:50.776 CST,,,2082,,561750ab.822,21,,2015-10-09 13:29:15 CST,1/0,0,PANIC,XX000,"WAL contains references to invalid pages",,,,,"xlog redo clean: rel 1663/151898/185251; blk 8 remxid 640632117",,,"log_invalid_page, xlogutils.c:91",""这个报错可以先注释这一段来绕过，从而可以演示下去。
 src/backend/access/transam/xlogutils.c /* Log a reference to an invalid page */ static void log_invalid_page(RelFileNode node, ForkNumber forkno, BlockNumber blkno,                                  bool present) {   //////         /*          * Once recovery has reached a consistent state, the invalid-page table          * should be empty and remain so. If a reference to an invalid page is          * found after consistency is reached, PANIC immediately. This might seem          * aggressive, but it's better than letting the invalid reference linger          * in the hash table until the end of recovery and PANIC there, which          * might come only much later if this is a standby server.          */         //if (reachedConsistency)         //{         //      report_invalid_page(WARNING, node, forkno, blkno, present);         //      elog(PANIC, "WAL contains references to invalid pages");         //}由于本例是在同一个操作系统中演示，所以没有遇到OS的dirty page cache的问题，如果是不同主机的环境，我们需要解决OS dirty page cache 的同步问题，或者消除dirty page cache，如使用direct IO。或者集群文件系统如gfs2。如果要产品化，至少需要解决以上问题。
先解决Aurora实例写数据文件、控制文件、检查点的问题。
增加一个启动参数，表示这个实例是否为Aurora实例（即只读实例）
  # vi src/backend/utils/misc/guc.c /******** option records follow ********/ static struct config_bool ConfigureNamesBool[] = {         {                 {"aurora", PGC_POSTMASTER, CONN_AUTH_SETTINGS,                         gettext_noop("Enables advertising the server via Bonjour."),                         NULL                 },                 &aurora,                 false,                 NULL, NULL, NULL         },新增变量
 # vi src/include/postmaster/postmaster.h extern bool aurora;禁止Aurora实例更新控制文件
 # vi src/backend/access/transam/xlog.c #include "postmaster/postmaster.h" bool aurora; void UpdateControlFile(void) {         if (aurora) return;禁止Aurora实例启动bgwriter进程
 # vi src/backend/postmaster/bgwriter.c #include "postmaster/postmaster.h" bool  aurora; /*  * Main entry point for bgwriter process  *  * This is invoked from AuxiliaryProcessMain, which has already created the  * basic execution environment, but not enabled signals yet.  */ void BackgroundWriterMain(void) {   //////         pg_usleep(1000000L);         /*          * If an exception is encountered, processing resumes here.          *          * See notes in postgres.c about the design of this coding.          */         if (!aurora && sigsetjmp(local_sigjmp_buf, 1) != 0)         {   //////                 /*                  * Do one cycle of dirty-buffer writing.                  */                 if (!aurora) {                 can_hibernate = BgBufferSync();   //////                 }                 pg_usleep(1000000L);         } }禁止Aurora实例启动checkpointer进程
 # vi src/backend/postmaster/checkpointer.c #include "postmaster/postmaster.h" bool  aurora;   ////// /*  * Main entry point for checkpointer process  *  * This is invoked from AuxiliaryProcessMain, which has already created the  * basic execution environment, but not enabled signals yet.  */ void CheckpointerMain(void) {   //////         /*          * Loop forever          */         for (;;)         {                 bool            do_checkpoint = false;                 int                     flags = 0;                 pg_time_t       now;                 int                     elapsed_secs;                 int                     cur_timeout;                 int                     rc;                 pg_usleep(100000L);                 /* Clear any already-pending wakeups */                 if (!aurora)  ResetLatch(&MyProc->procLatch);                 /*                  * Process any requests or signals received recently.                  */                 if (!aurora) AbsorbFsyncRequests();                 if (!aurora && got_SIGHUP)                 {                         got_SIGHUP = false;                         ProcessConfigFile(PGC_SIGHUP);                         /*                          * Checkpointer is the last process to shut down, so we ask it to                          * hold the keys for a range of other tasks required most of which                          * have nothing to do with checkpointing at all.                          *                          * For various reasons, some config values can change dynamically                          * so the primary copy of them is held in shared memory to make                          * sure all backends see the same value.  We make Checkpointer                          * responsible for updating the shared memory copy if the                          * parameter setting changes because of SIGHUP.                          */                         UpdateSharedMemoryConfig();                 }                 if (!aurora && checkpoint_requested)                 {                         checkpoint_requested = false;                         do_checkpoint = true;                         BgWriterStats.m_requested_checkpoints++;                 }                 if (!aurora && shutdown_requested)                 {                         /*                          * From here on, elog(ERROR) should end with exit(1), not send                          * control back to the sigsetjmp block above                          */                         ExitOnAnyError = true;                         /* Close down the database */                         ShutdownXLOG(0, 0);                         /* Normal exit from the checkpointer is here */                         proc_exit(0);           /* done */                 }                 /*                  * Force a checkpoint if too much time has elapsed since the last one.                  * Note that we count a timed checkpoint in stats only when this                  * occurs without an external request, but we set the CAUSE_TIME flag                  * bit even if there is also an external request.                  */                 now = (pg_time_t) time(NULL);                 elapsed_secs = now - last_checkpoint_time;                 if (!aurora && elapsed_secs >= CheckPointTimeout)                 {                         if (!do_checkpoint)                                 BgWriterStats.m_timed_checkpoints++;                         do_checkpoint = true;                         flags |= CHECKPOINT_CAUSE_TIME;                 }                 /*                  * Do a checkpoint if requested.                  */                 if (!aurora && do_checkpoint)                 {                         bool            ckpt_performed = false;                         bool            do_restartpoint;                         /* use volatile pointer to prevent code rearrangement */                         volatile CheckpointerShmemStruct *cps = CheckpointerShmem;                         /*                          * Check if we should perform a checkpoint or a restartpoint. As a                          * side-effect, RecoveryInProgress() initializes TimeLineID if                          * it's not set yet.                          */                         do_restartpoint = RecoveryInProgress();                         /*                          * Atomically fetch the request flags to figure out what kind of a                          * checkpoint we should perform, and increase the started-counter                          * to acknowledge that we've started a new checkpoint.                          */                         SpinLockAcquire(&cps->ckpt_lck);                         flags |= cps->ckpt_flags;                         cps->ckpt_flags = 0;                         cps->ckpt_started++;                         SpinLockRelease(&cps->ckpt_lck);                         /*                          * The end-of-recovery checkpoint is a real checkpoint that's                          * performed while we're still in recovery.                          */                         if (flags & CHECKPOINT_END_OF_RECOVERY)                                 do_restartpoint = false;   //////                         ckpt_active = false;                 }                 /* Check for archive_timeout and switch xlog files if necessary. */                 if (!aurora) CheckArchiveTimeout();                 /*                  * Send off activity statistics to the stats collector.  (The reason                  * why we re-use bgwriter-related code for this is that the bgwriter                  * and checkpointer used to be just one process.  It's probably not                  * worth the trouble to split the stats support into two independent                  * stats message types.)                  */                 if (!aurora) pgstat_send_bgwriter();                 /*                  * Sleep until we are signaled or it's time for another checkpoint or                  * xlog file switch.                  */                 now = (pg_time_t) time(NULL);                 elapsed_secs = now - last_checkpoint_time;                 if (elapsed_secs >= CheckPointTimeout)                         continue;                       /* no sleep for us ... */                 cur_timeout = CheckPointTimeout - elapsed_secs;                 if (!aurora && XLogArchiveTimeout > 0 && !RecoveryInProgress())                 {                         elapsed_secs = now - last_xlog_switch_time;                         if (elapsed_secs >= XLogArchiveTimeout)                                 continue;               /* no sleep for us ... */                         cur_timeout = Min(cur_timeout, XLogArchiveTimeout - elapsed_secs);                 }                 if (!aurora) rc = WaitLatch(&MyProc->procLatch,                                            WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,                                            cur_timeout * 1000L /* convert to ms */ );                 /*                  * Emergency bailout if postmaster has died.  This is to avoid the                  * necessity for manual cleanup of all postmaster children.                  */                 if (rc & WL_POSTMASTER_DEATH)                         exit(1);         } }   ////// /* SIGINT: set flag to run a normal checkpoint right away */ static void ReqCheckpointHandler(SIGNAL_ARGS) {         if (aurora)            return;         int                     save_errno = errno;         checkpoint_requested = true;         if (MyProc)                 SetLatch(&MyProc->procLatch);         errno = save_errno; }   ////// /*  * AbsorbFsyncRequests  *              Retrieve queued fsync requests and pass them to local smgr.  *  * This is exported because it must be called during CreateCheckPoint;  * we have to be sure we have accepted all pending requests just before  * we start fsync'ing.  Since CreateCheckPoint sometimes runs in  * non-checkpointer processes, do nothing if not checkpointer.  */ void AbsorbFsyncRequests(void) {         CheckpointerRequest *requests = NULL;         CheckpointerRequest *request;         int                     n;         if (!AmCheckpointerProcess() || aurora)                 return;   //////禁止Aurora实例手工调用checkpoint命令
 # vi src/backend/tcop/utility.c #include "postmaster/postmaster.h" bool  aurora;   ////// void standard_ProcessUtility(Node *parsetree,                                                 const char *queryString,                                                 ProcessUtilityContext context,                                                 ParamListInfo params,                                                 DestReceiver *dest,                                                 char *completionTag) {   //////                 case T_CheckPointStmt:                    if (!superuser() || aurora)                                 ereport(ERROR,                                                 (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),                                                  errmsg("must be superuser to do CHECKPOINT")));改完上面的代码，重新编译一下，现在接近一个DEMO了。现在Aurora实例不会更新控制文件，不会写数据文件，不会执行checkpoint，是我们想要的结果。启动只读实例时，加一个参数aurora=true，表示启动Aurora实例。
pg_ctl start -o "-c log_directory=pg_log1922 -c port=1922 -c aurora=true"不过要产品化，还有很多细节需要考虑，这只是一个DEMO。阿里云RDS的小伙伴们加油！
还有一种更保险的玩法，共享存储多读架构，需要存储两份数据。其中一份是主实例的存储，它自己玩自己的，其他实例不对它做任何操作；另一份是standby的，这部作为共享存储，给多个只读实例来使用。
参考
https://aws.amazon.com/cn/rds/aurora/src/backend/access/transam/xlog.c
 /*  * Open the WAL segment containing WAL position 'RecPtr'.  *  * The segment can be fetched via restore_command, or via walreceiver having  * streamed the record, or it can already be present in pg_xlog. Checking  * pg_xlog is mainly for crash recovery, but it will be polled in standby mode  * too, in case someone copies a new segment directly to pg_xlog. That is not  * documented or recommended, though.  *  * If 'fetching_ckpt' is true, we're fetching a checkpoint record, and should  * prepare to read WAL starting from RedoStartLSN after this.  *  * 'RecPtr' might not point to the beginning of the record we're interested  * in, it might also point to the page or segment header. In that case,  * 'tliRecPtr' is the position of the WAL record we're interested in. It is  * used to decide which timeline to stream the requested WAL from.  *  * If the record is not immediately available, the function returns false  * if we're not in standby mode. In standby mode, waits for it to become  * available.  *  * When the requested record becomes available, the function opens the file  * containing it (if not open already), and returns true. When end of standby  * mode is triggered by the user, and there is no more WAL available, returns  * false.  */ static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,                                                         bool fetching_ckpt, XLogRecPtr tliRecPtr) {   //////         static pg_time_t last_fail_time = 0;         pg_time_t       now;         /*-------          * Standby mode is implemented by a state machine:          *          * 1. Read from either archive or pg_xlog (XLOG_FROM_ARCHIVE), or just          *        pg_xlog (XLOG_FROM_XLOG)          * 2. Check trigger file          * 3. Read from primary server via walreceiver (XLOG_FROM_STREAM)          * 4. Rescan timelines          * 5. Sleep 5 seconds, and loop back to 1.          *          * Failure to read from the current source advances the state machine to          * the next state.          *          * 'currentSource' indicates the current state. There are no currentSource          * values for "check trigger", "rescan timelines", and "sleep" states,          * those actions are taken when reading from the previous source fails, as          * part of advancing to the next state.          *-------          */








上一篇：Redis安装及常用命令介绍


下一篇：100亿数据1万属性数据架构设计














发表评论
共有条评论






用户名:

密码:



验证码:

 

匿名发表


















学习交流
更多





练就火眼金睛 十一种常见电脑


打印机共享提示“操作无法完成






如何查找有故障的配件
如何查找有故障的配件...






绿毒越狱软件 完美越狱4.21教程12-21

htcg15怎么刷机？htcg15刷机教程12-21

HTC hd7刷安卓2.2全教程12-21

mpkg软件下载和卸载的方法12-21




网站后台系统设置的网站地址格式写法07-09

防止重复提交js按钮变灰07-09

不同时间显示不同的广告代码07-09

iframe框架调用高度自适应方法07-09

备份数据库出现问题的解决方法07-09







热门图片
更多




回眸一笑百魅生，六宫粉黛无颜色


岁月静美，剪一影烟雨江南




芜湖有个“松鼠小镇”


小满：小得盈满，一切刚刚好！




一串串晶莹剔透的葡萄，像一颗颗宝石挂在藤


正宗老北京脆皮烤鸭




人逢知己千杯少，喝酒搞笑图集


搞笑试卷，学生恶搞答题







猜你喜欢的新闻


最牛同桌！我考了696分我同桌考了703分

650分！高二女生考入北大：遗憾不能上高三

男生估分600只考了397 妈妈：高考虽重要，但不

唐尚珺回应是否会直播带货：有人出100万想和

名校抢人名场面：清华、北大太拼了！

男生高考语文满分！网友：第一次听说

第16次参加高考！唐尚珺今年成绩超600分

语文145分学霸笔记公开字迹清秀：多阅读，该背

数学满分！四川绵阳一男生高考710分 理综仅扣

阿里全球数学竞赛选手：决赛专业性太强，6道题





猜你喜欢的关注


技术点滴：关于BIOS设置和CMOS设置

技术点滴：进入系统去迟迟不见反应

技术点滴：菜鸟必学的53个计算机英语词汇

技术点滴：让你的MAC地址随意转变

技术点滴：让Windows百毒不侵

技术点滴：CMD命令集锦

6类网线的完美接法

web.config中存储自定义对象的正确方法

文件映射成WINDOW进程通信的首选方法（权释wi

添加路由让多网卡网关工作正常











新闻热点





最牛同桌！我考了696分我同桌考了703分
2024-06-26 22:28:41






650分！高二女生考入北大：遗憾不能上高三
2024-06-26 22:26:16






男生估分600只考了397 妈妈：高考虽重要，但不代表所有
2024-06-26 22:23:01






唐尚珺回应是否会直播带货：有人出100万想和他合作！
2024-06-25 19:29:23






名校抢人名场面：清华、北大太拼了！
2024-06-25 19:22:14






男生高考语文满分！网友：第一次听说
2024-06-25 19:19:15











疑难解答




如何查找有故障的配件

计算机启动故障的判断

练就火眼金睛 十一种常见电脑故障判断

了解电脑启动过程，解决启动问题

电脑也怕冷，温度低引发的故障

电脑各种中英文信息对照及错误信息总

新安装计算机无法启动的故障原因

电脑噪音大的解决办法

怎样减少机箱内的电磁干扰？

常见电脑故障维修－声卡音箱篇







图片精选




6类网线的完美接法



使用VMM2008 R2管理Hyper-V之1-实



开机黑屏与内存的原因导致DDR2内存



Lua中基本的数据类型、表达式与流











网友关注




三星i9250去除虚拟按键教程

绿毒越狱软件 完美越狱4.21教程

htcg15怎么刷机？htcg15刷机教程

HTC hd7刷安卓2.2全教程

mpkg软件下载和卸载的方法

华为c8860刷机教程

美萌手账更换手账背景方法

用ipad mini可以打电话吗？怎么打？

苹果ipad2使用教程

三星s6352怎么样设置上网？

PgSQL &amp;#183; 特性分析 &amp;#183; PostgreSQL Aurora方案与DEMO

前言

模拟过程

问题分析和解决

参考

PgSQL · 特性分析 · PostgreSQL Aurora方案与DEMO