Optimizer Statistics: The Brain of PostgreSQL
45sExplains the critical role of statistics in query planning, a fundamental concept for performance tuning.
▶ Play ClipThis episode of Let's Talk About Data Show focuses on performance tuning of PostgreSQL databases. Hosts Ibrahim Amara and Domenico introduce the topic, covering optimizer statistics, index types, and design patterns. They demonstrate a live demo based on a workshop available on AWS Workshop Studio, emphasizing the importance of accurate statistics for query optimization.
Ibrahim and Domenico introduce the topic of performance tuning for PostgreSQL databases, noting that it will be covered in multiple sessions.
Domenico shows the workshop available on AWS Workshop Studio, which includes labs on database efficiency, optimizer statistics, index types, and design patterns.
The PostgreSQL planner relies on statistics (number of rows, distinct values, etc.) to calculate execution plan costs. Statistics are gathered via ANALYZE, VACUUM ANALYZE, or during DDL operations like CREATE INDEX.
pg_class stores table-level statistics (row count, page count). pg_stats provides column-level statistics: null_frac, n_distinct, avg_width, most_common_vals, most_common_freqs, histogram_bounds, and correlation.
The default_statistics_target parameter (default 100) controls the number of histogram buckets. Increasing it improves accuracy but increases ANALYZE time and storage. Can be set per column.
Correlation (0-1) indicates how well the logical order of column values matches physical row order. A value of 1 means perfect correlation, which can affect index scan efficiency.
When columns are correlated (e.g., country and city), the planner may misestimate row counts. Extended statistics (CREATE STATISTICS) with dependencies, ndistinct, and mcv can improve estimates.
Use monitoring tools like CloudWatch Database Insights or pg_stat_statements to find top SQL by execution time. Then use EXPLAIN ANALYZE to examine execution plans.
After bulk loads, run ANALYZE on affected tables. Monitor statistics freshness via pg_stat_all_tables (last_analyze, last_autoanalyze). Consider custom scripts to refresh stats based on change percentage.
PostgreSQL offers B-tree (default), GiST, GIN, BRIN, and other index types. BRIN indexes are useful for time-series data. GIN is for full-text search. Detailed coverage in next session.
Accurate optimizer statistics are crucial for PostgreSQL query performance. Use ANALYZE regularly, consider extended statistics for correlated columns, and leverage monitoring tools to identify and tune slow queries.
"Title accurately reflects content: the video delivers a solid introduction to PostgreSQL tuning with statistics and indexing strategies."
What command is used to calculate statistics in PostgreSQL?
ANALYZE
05:00
What is the default value of default_statistics_target?
100
12:00
What does the correlation column in pg_stats indicate?
How well the logical order of column values matches the physical row order in data pages.
15:00
What are the three types of extended statistics available in PostgreSQL?
dependencies (functional dependencies), ndistinct (number of distinct values), and mcv (most common values).
18:00
Which extension provides per-query execution statistics?
pg_stat_statements
22:00
What is the recommended action after a bulk data load?
Run ANALYZE on the affected tables to refresh statistics.
25:00
Which index type is recommended for time-series data?
BRIN (Block Range INdex)
28:00
Statistics are the brain of the planner
Clear analogy emphasizing that the optimizer relies heavily on statistics to choose execution plans.
05:00Correlation changes after random updates
Live demo showing how updating a column with random values changes correlation from 1 to near 0, illustrating physical vs logical ordering.
15:00Extended statistics fix misestimates
Demonstration of how extended statistics dramatically improve row count estimates for correlated columns, from huge discrepancy to near perfect.
18:00[music] Hello and welcome everyone and thanks for joining us in a new episode of Let's Talk About Data Show. It's our weekly show that we broadcast on Monday and today we have a new episode for you. Before that, let me introduce myself. I'm Ibrahim Amara, one of your regular hosts. I'm a database specialist solutions architect here at AWS. and I'm based out of London and for today we have an interesting topic to dis to discuss which
is the performance tuning of Paul's grace database and for that um I'm really pleased to have my colleague my teammate Dominico to to teach you and then learn together how you can optimize the performance of the postgrad database. Dominico would like to introduce yourself please. >> Uh yeah great. Uh, hi Bra, I uh, hi folks. Uh, this is Dominico. Uh, I'm based out of Italy and I'm a teammate of Ibraim like he said. Uh, I'm in
AWS since u almost five years ago now. And in in AWS just like Ibraim, I'm a database specialist solution architect uh mainly focused on the um you know the um relational database domain uh and with open source engines specific uh posgress. So I'm very happy to be here today. >> Thanks Domo. For our viewers please let us know where from where you are tuning in today. And yeah, and surprisingly it's sunny here today in London. I'm
not sure how is the weather your side in Milan Dominico. >> Oh, that's great. It's sunny, too. Yeah. >> Okay. >> We're approaching, you know, the spring hopefully. [clears throat] >> Yeah, that's really nice. And on chat, we have Jeff and who are going to support you. So, if you have any questions, please don't hesitate. Just send them through. Okay, perfect. And yeah so Domino the performance tuning of postgress database when it comes to relational database
I think to any database performance tuning is critical task and most of the times not the most straightforward one because of the different parameters that you have the different variables that you have and for that I believe we will not be able to cover that in one session. So we are planning to have multiple sessions of those to cover different aspects of tuning or performance the tuning the performance of the postgrades database. So >> we are
starting today with a set of topics right Dominico. So what are we going to cover today? >> Yeah great. So great question. Um so you should be able to see my screen uh right now and >> yeah so basically um we will base our discussion and it's it's a discussion around you know a demo a live demo let's say uh that is based on uh this very interesting um workshop available on uh our platform that is
workshop studio um so well and the reason why I'm showing you know this page is that because basically you this lab environment is also available uh for you. It's not necessarily something that you will um um you know benefit from um being in a workshop um that is led by us with us providing you um you know temporary accounts. You can also set up your lab environment. As you can see here on the left, there is
a link where basically you can go and uh you know get all the instructions. You can download a yam yam file with uh containing you know the um cloud formation uh instructions to build the entire uh lab environment uh in just your lab account to be able to uh go through all the labs that you have here on the left. you know this is basically the list of all the labs that are available um including you
know the instructions. So coming back to your question Ibrahim uh basically today what we will focus on is um you know this um subsection that is the database efficiency workshop particularly we will start from you know the fundamentals optimizer statistics and then we will go through um a little bit you know what are the index types available in posgrql um index design patterns and you know a general discussion uh mainly um you know around around indexes
and how to optimize um your you know your workload your queries using the right indexes and uh indexing strategies with postsql. >> Perfect. Thank you so much Dominico. Before we go ahead we have several folks tuning in. We have um measure from Florida. We have uh people from India. Hi, how are you doing? and Daniel from Brazil and also we have people from Atlanta and use postgra that's perfect and uh Nairobi Congo so from all over
the world that's great okay >> yeah so Tony um has shared the link to the workshop so you can access and you can also uh follow uh the steps that we are going to do if you want to have if you want to have some hands-on uh experience as well. So let's start with the first topic which is the statistics. >> Yeah. Yeah. Yeah. Optimizer statistics. Yeah, I mean just you know introducing it it's uh uh
when it comes you know to performances um you know engines uh relational database engines like posgress scale but not only they he rely on uh statistics uh that is you know number of rows or um number of distinct values and so many other statistics uh to calculate the cost because you know the uh planner that is basically the um uh functionality within you know The engine that decides what is the execution plan for your query relies
heavily relies on the statistics. >> Perfect. So my understanding so for all those relation databases whenever you are running a query against the database so it will go through different stages. First it will be passed and then the syntax and semantic of the the query will be checked and then we move to the planning phase where the planner or the optimizer or whatever the component responsible for coming up with the execution plan of the query will
kick in. So based on that execution plan it's like the brain of the database where it which will decide how I'm going to run this query. So the planner and or the optimizer is dependent on those statistics right to decide which plan to use. Okay perfect. Yeah, exactly. And in posgrql um these phases that you mentioned um they are basically accomplished by the uh backend process. And what is the back end process is the uh server
process that is basically attached. It's a oneonone you know correspondence between you know the server process and the client process. So uh your application. >> Perfect. We still have people tuning in. We have people from Brazil, Trinidad and Tobago, India, Tamil, USA, India. Okay, that's perfectly as >> Atlanta to India. It's >> Yeah. Yeah, that's amazing. >> Great. Yes, it is. >> Okay. And you have we have Alex from UK as well. That's great. So statistics.
So how you are going or how these statistics are being gathered or where they are being stored? >> Yeah. Yeah. Yeah. So basically uh you see here um just you know just as a really quick overview in posgrql the way you can calculate statistics is uh through uh a bunch of uh options like uh you know the first and foremost important is uh the uh command that is analyze that's you know basically what we will also
do and switching you know to the to this other um uh window here uh this other tab where basically I have my VS uh code environment uh that provides me the shell into the EC2 machine because you know this lab is based on an EC2 machine where we have installed the uh client to be able to connect uh to our um version 17 uh server right it's a posgrql version 17 so what we have done yeah
>> yeah sorry James let me just reiterate that all of that is part of the workshop that we have shared so if you use the YAML file to provision the resources. As part of that, the VS code will be provisioned as well and you'll be able to use it and the instructions is part of the workshop. >> Yeah, exactly. This is a VS code server running into uh a provisioned easy to machine. Uh so you will
get access to uh it you know through the uh workshop as soon as you have uh built you know this workshop in your lab account. So what I have done here is basically um you know just for the this first part of you know the um um of this uh session it's you know to set up use the pgbench um you know client command to basically set up a very easy environment that contains four tables right
this is just for educational purpose so I'm connected to uh this uh version 17 um posgressql data Uh the name of the database is test and these are the tables that are actually contained into this uh database and here you can also see uh the sides of uh each and every table right. Um so well this is you know the first command analyze um you know to basically uh ask the engine to calculate the statistics for
the entire database. Uh so here I've not specified any uh table or any other option um you know with this command. Uh but this is not the only way to calculate the statistics because you know in postgrql there is also the vacuum process. This will be part of uh another uh session um you know in the following uh episodes. Uh so basically as soon as we are vacuuming our database so we are reading the data pages
from disk it is also convenient sometimes you know to um run vacuum analytes to also calculate the statistics in parallel and this is what is basically done by the autovacuum uh demon right um internally yeah >> yeah I'm sorry for interruption okay just one thing for our viewers just to remind you we will have another show the week after and we will talk about vacuum in more details. So please tune in. Uh and also there's another
question from one of our viewers asking do I need any form of subscription to practice this or it's for it's free >> this uh workshop well as soon as uh you know the um lab environment is uh basically your own account there is no subscription so um you can access this link download the uh yam file containing you know all the uh cloud form instructions and then you can set up uh this lab environment. Yeah, [snorts]
>> thanks Tiko. >> Yeah, welcome. Welcome. Absolutely. >> Okay, so we have the vacuum analyze that can also gather the statistics and the analy >> because it is convenient. Yeah, definitely. Uh plus in uh during you know few DDL operations like um create index or uh reindex to rebuild an index statistics are also automatically calculated for the uh new uh newly created object. Right. So switching to the optimizer statistics basically uh what are we talking uh
about and going to talk about is um you know uh once we have our tables and we have calculated the statistics um there is first of all one important very important um you know uh object uh coming from the PG catalog >> um that basically contains an entry for uh let's A pretty every object that in the database is column based like for example the tables but not only the tables like also the indexes um sequences
uh materialized views and few other objects right >> so I'm just you know selecting here a few columns out of uh you know uh this uh view catalog view just to get the real name real names you know real stands for relation and in postgrql um you know terminology relation is a synonym for table right >> okay >> and similarly you have you know rail kind of relation R stands for relation so table or I stands
for index so indexes so you can get you know pretty um easily uh the number of tpples so rows um that have been uh you know calculated as part of the analyze process that is you know the command that we have run and also the uh number of um data pages from you know the storage perspective um number of 8 kilobytes data pages. So these are uh you know the first few things um in terms of
you know uh statistics that the uh optimizer looks into. So how big is um our object in terms of uh space occupied and uh you know tpples um contained. >> So I think Dominic it was mentioning that those views as you mentioned are or those tables are being updated when you run the analyze command. So between the running the analyze commands. So there might be a different you might getting different values from the PG class for
the number of uh pupils or the number of rows and the actual number of rows of the table itself. Right? So if you are looking for the >> actual actual number of rows you need to select directly from the table itself. >> Yeah. Yeah. Yeah. Exactly. Exactly. Or there are a few other options. Um like for example um there is an extension PG statup >> that provides you know yeah procedures and functions that uh uh you
know likely you can use to uh analyze either in a way that is approximate or um complete uh to have an idea of uh you know what's the status of your objects in terms of I don't know life taples and no >> yeah definitely and also there is uh another very very important um view from you know the catalog that you can look into is uh you know we we talked about PG class now we talk
about PG stats >> and again I'm just uh you know um getting few of the attributes uh from this view uh but basically here you have the um you know this column is attribute name attribute is of course a signon for column name in postgrql language so these are um you know the columns that are part of uh this table pgbench accounts right so if for example I go and describe you know this table we can
easily check that you know these are the columns the data type and you know the type of storage if it is nullable or not and eventually um you know a few other um details like um the indexes that are created you know for this table and in this case we only have a primary key that by default it's a B3 index type and this is automatically created by the engine when creating you know the um table
itself you know with the create table command specifying that this is a primary key then this means that uh a unique primary key index is created automatically >> by the engine. Um so here you know coming back to the PG stat you [clears throat] see that we have a few uh other uh details available like um the average uh width of um you know uh the taples. This is of course a value that is approximated the
number of distinct values. This is important for cardinality. And when you uh get you know minus one means for posgress that the uh values um in this column that is a id that is the primary key they are distinct. Oh >> they unique they are all distinct. Yeah. >> Yeah. >> And here null frack is basically the uh percentage of the um null values within the column. Right. And of course you know well this is just
a case that it is zero for all the columns but you know for a primary key since it is not nullable by default. Um and by definition uh it is zero in any case for the primary key right. Um yeah so um you know having you know going more dive deep on this basically um there is also another uh very important view like the uh PG start u old tables that basically will give you um well
this is only what I have extracted you know these three columns specific for statistics you have again the real name you know the table name um uh when it was last analyzed and It was uh analyzed you know by me with the analyze command at the beginning of this lecture or eventually when it is you know last out analyze it was um automatically analyzed by the auto vacuum that is based on you know some thresholds and
well we will talk about this uh in uh one of the other episodes. So here you have pretty uh a good look of uh you know the uh um foundations in terms of statistics. Where are you know the um uh some of the um details that the costbased optimizer the planner looks into when calculating uh an execution plan for a query. Um let's go more deep uh like again on PG stats. Um well with this query
uh what we get is um uh well you know with this notation we are just um visualizing the row instead of um you know horizontally we are visualizing you know the uh row vertically right so you have again the attribute name the number of distinct values and I've uh you know managed to get three additional columns well in this case since this is basically um the primary key it's not really uh relevant I mean the um
uh content of these two columns that is the most uh common values uh and uh the frequency the um um you know the percentage um where occurrence you know for these values within uh these columns you know in this case it's null but if we just you know uh focus on in a a different column like a balance for example. >> Mhm. >> Um yeah. Well uh it is not a balance. It's another one. Just let
me check. Um because we can have a look at the uh content of the table too. It's the um PGBench accounts. Yeah. Well, in this case, you know, for the AB balance, basically, you know, we have uh the value that is always zero. >> Yeah. >> Um, so in terms of, you know, most common val uh you have zero. >> It is the only value and in terms of frequency uh you know the frequency is a
value that is between zero and one. Okay. So the percentage is um um um the value of the percentage is um a value that is basically between zero and one. one means 100%. So the value zero >> uh is um there you know in the a balance column 100% of times. So um we will have an update later on just to show you um how these uh values will change you know in um uh the PG
stats um table uh just you know updating you know the the value of the a balance column with a more randomic um you know way. So but let's let's first move to another concept that is really important because basically this is related to the uh how the histograms are um calculated by posgrql. So by default there is this parameter default statistics target that is a parameter that can go from um uh one to uh 10,000 and
default is uh 100. So basically it defines per column um um basically the number of buckets >> um with which we can calculate the um distribution of the data within the column. So we have basically three main dimensions that are uh most common values. The percentage related to uh these most uncommon values plus for um those values that are not not part of the most common values. Postgrql will store the uh histograms right in terms of
number of buckets and data distribution within a few ranges right I will show you what does this means. Um so uh but before that there is um >> so does this mean that the more the buckets or the more the statistics target the more accuracy that you'll be getting? >> Yeah likely and indirectly this is true because basically uh it's like posgress uses some um horistic >> uh for the sampling when calculating the statistics. >> Yeah.
Um basically there is no direct way um to you know tell posgress please um when you are sampling read the entire table and then the statistics right so it's sampling basically uh but uh tuning this parameter increasing you know the number of buckets it's like we are telling the uh engine uh you can go more dive deep and read more data pages >> more samples more accurate statistics definitely And so well this is also related to
the number of buckets how skewed and um you know um the number of distinct values that you may have in your column right >> so yeah for example if you have like a table with only like 5% of the data >> um it has this skewed data so if you do the sampling for just few number of pages then the statistics will not actually represent the actual data within the table. >> Yeah. Can be misleading sometimes.
>> Exactly. Yeah, >> that's the point. And we will see some cases where >> this is actually and clearly visible. Right. >> Okay. >> Um there is uh before you know switching to the update I mentioned there is another uh really important concept that is the um correlation of uh you know the data. Basically in this case you know this is uh another column that you can get from the PG stat >> and uh it's again
a value that can go from zero to one and a value of one like you know we have here in this example means that basically uh the logical ordering um of the column values um uh is the same as the physical row ordering in the data pages. M >> so it's you know basically you know this uh table is newly created and we have created um you know um with [clears throat] with the specific order um
you know all the rows in the table itself that matches the uh physical row order right this is also the reason why when selecting like I did without any or specific order by um I've got you know um uh well in this case it's just yeah 10 rows I've got you know the uh first And >> yeah, this is not guaranteed every time. It's only based on how they are being stored within the blocks or the
pages. Okay, perfect. So just look at you know this value for a balance because now we are going to uh update you know this column uh with some uh random values like um you know set a balance uh you know uh randomically we are getting some values and again we are looking at first the correlation >> well update analyze to update you know the statistics in the uh uh catalog And plus we will look at the
correlation as well as um you know again the um the histograms and most common values. Okay. >> Okay. So uh you see here since you know the uh correlation of the data is uh changed uh well this is because looking you know at the >> a balance column now we have uh you know random values. So >> the um logical order does not correspond you know like it was before with the physical order of the rules.
uh plus uh looking at you know at this query again um so just a moment so the column is a balance you see you remember before we had just you know the value of zero with most common f it was uh frequence was one now we have the most common values plus the uh most common frequency so basically you see you have um a list of uh values uh separated by um comma and basically any value
corresponds to the um you know this is the first value in the frequencies it corresponds to the first value in the common values and it is in descending order. So this is a descending order right and by this you can understand um uh you know with the value that is represents a percentage you know the uh how many times in percentage you know this value um compares you know it's it's uh it's there in the uh
column and then this value this value and all the others and all the other values that are not part of the most common values you have here they're um basically the histogram bounds. So all the values that goes from 0 to 97 are in this first bucket and then from 97 to 189 are in this bucket and so on and so forth. >> Okay. >> So it's like you have um 101 numbers here that will represent
the bounds for 100 buckets. >> Yeah. >> For all the values that are not part of the most common values. Right. This is how possess you know manages the things. Um so uh you know just uh coming back to what you mentioned uh Ibraim that is important in my opinion uh you can tweak you know the value of default statistics target to get you know more accurate statistics and it's likely that you know the analyze itself
will uh inventory last longer as well as you will need more space in the catalog to store the data but on the other end um you will have you know likely uh better cardalities and estimates uh by the planner and possibly better execution plans. >> So usually you start with the default value and then you adjust accordingly whether that for some specific edge cases some tables yes you need more accuracy for more accurate statistics then for
that you can adjust the parameter. >> Yeah exactly. So this tuning can be done column level. So it's not that you um necessarily have you know to modify this parameter per the entire database. It can be uh you know definitely per column level for those columns that you need uh you you know that you need u more accurate statistics definitely >> um I hope you know this was clear um because now we can switch you know
quickly to the uh multicolumn statistics uh how you know posgress deals with this kind of um you know statistics so basically um to support you know this discussion we are creating another table that is uh you know sales data uh with this structure basically um and then what we do is we insert some um data within this it's basically 100,000 rows and we analyze right >> so it's likely you know for um for example country and
city um we are you know um using these values so USA Canada UK and so on and for city we are uh specifying Los Angeles, Vancouver and so on right so it's like we are creating a correlation between um country and city within the sales data >> right >> uh and for this of course we generate some uh random values and uh also we um analyze you know uh in this case I specified you know the
name of the table differently of you know what I did um at the meaning plus the um verbose um you know option to get you know um some insights on what the analyz is doing. So basically it scanned 700 almost you know 700 pages containing uh a number of live rows uh zero dead rows and you know so on and so forth. So you can get uh very interesting data and insights about that. Yeah, Dominico, we
have um a valid and interesting question. Interesting question from Omar. >> Uh first, we need to know how you will identify the tables which are causing issues or busy. I think this is a valid one, right? When you are uh when you are investigating a case of a performance issue and you're troubleshooting it, you would want to identify the tables involved. >> Yeah. Yeah. Well, this is uh actually a great question because you know when it
comes to I'm switching you know quickly to [laughter] some slides but it's sure of the discussion but you know query tuning methodology it's a process so basically uh in postgrql generally speaking and then of course uh with uh our managed services like RDS and the Aurora um apart from uh a methodology and the process that you can have uh uh you that is um shown here in this slide where basically you can have your um active
session monitoring either through uh any kind of monitoring tool available out there. mentioning you know our Cloudatch database insights for example uh where you have you know the um um you know the active session um average active session top activity you know graph to uh quickly visualize you know what's the load in your database uh with the relevant top SQL uh and top weight events for example right so it's like um depending on the point of
view and the tool that you are using you can look at what's what's happening in your database >> and then identify likely you know the top SQL statements and then going down you know the methodology you have explain explain analyze uh you know with buffers uh and you know there are many other tools out there that you know can help you um analyzing you know the um output of u these tools including um auto analyze that
is a module that you can eventually um load and use to have your engine uh dump in the posgrql log file the uh queries that are I don't know long running or they're just you know triggering uh this um module basically to dump you know the uh execution plans into the posgress.log block I'm just you know mentioning few of the options out there but then as soon as you have identified what are the top queries and
execution plan within the execution plan it will be uh likely quite clear you know what are the uh bottlenecks and the objects tables or indexes where you know the um engine is spending um you know the worker is spending um most of the time yeah definitely and then you have you know the tuning >> that is what we uh what we will have a look right now because there is an interesting case so uh right we
have created you know this table with a few columns and two of these columns they are correlated in some way right so we have country and city and what we do is um you know like with explain I mentioned before um in this case it's explain analytes I will you know describe what does this means in a moments but basically you know this is the query that we are trying to uh analyze and tune eventually um
and this is the calculated execution plan. So explain is uh the uh basically the command that we use in posgrql to provide a query that can be um of course it it can be a query or it can be even a DML insert update or delete. Um but you know what is important for you to know first is this is just one of the options analyze that we can use with explain. We will uh you know
look at few other options uh later on. But basically this means that explain is not only asking the planner to calculate the plan and provide the um cost estimation for each and every node in the execution plan. But also it will actually execute the statement to provide the um you know the runtime execution plan as well as the um actual timing cost, number of rows uh retrieved by each and every node in the execution plan, the
number of loops and so on and so forth. Right? So pay attention if it is a DML like insert, update or delete, you are eventually actually uh modifying the data and uh you want you may want to um enclose you know this um explain analyze uh within you know begin transaction and uh roll back eventually. >> Yeah to avoid modifying the data. Okay. >> Yeah. And you know actually you know using explain analytes it's really interesting
here because well we are selecting from sales data you see um filtering using country and city and what we have is this is the estimation of number of rows you know our query will retrieve while this is the actual number of rows that it retrieved uh when you know executing it uh over the data right um So you see it's it's it's a huge difference between you know the uh estimation and uh the actual uh rows
that we retrieved right and this is because likely you know the uh planner well um I'm just putting here an additional comment but basically um this is what the the planner does right so uh since it does not know that there is a correlation between the two columns it assume assumes that there is independence for the country and city column and this is the formula that it uses to calculate the selectivity. So basically to calculate this
number >> cardality so so unless you instruct it or to unless you clarify that there's a correlation between those two columns it will not consider that. >> Yeah. Yeah. Definitely. And this is what we do with the concept of extended statistics in posgrql. This is also available with other engines. Uh but you know with possess this is the syntax. So basically it's like us we are defining um well this is a name that we just you
know give to this uh construct um that is you know uh hey engine please calculate the statistics for um you know these couple of columns >> that are part of this um table uh and it's like you know uh u we are telling the engine these columns they are correlated >> I see And these are the kind of statistics that it will calculate. So we uh this is the um all the values that we can specify
here in this you know um uh enclosure you know this parenthesis. So the most common values the number of distinct values and um you know the kind of dependency that is between these two columns. This is just the definition. then actually uh the uh statistics you know the extended statistics will be calculated when we analyze the table >> okay >> like I've done this here [clears throat] right so uh now looking at the query first what
we can you know observe is that the cardality estimated is uh different and now it looks like it's it's very close you know to the Yeah, >> of course in this case, you know, since this is a very very easy uh and simple table, I would say it's just a sequential scan. So, we are reading the entire table full. Um, so there are no other options uh at this moment. Um, because we don't have an index,
you know, for these two columns a composite index. So, it's going for a sequential table scan. But for complex queries maybe with joins and uh multiple uh access path and join uh methods um you know this this can help a lot the planner avoiding you know suboptimal uh execution plans. So well this is you know in my opinion very important and in posgrql we can get you know details about the uh extended statistics just looking at
few additional um views exposed by the uh catalog. So in this case we have the name of the extended statistics that is the one we specified the table name the columns uh I will show you you know um in a few moments what does this means? Well, it's let's let's do it right now. It's you know describing this table two and three is basically column two column 3 you see >> and this is the kind of
extended stats. So DFM means dependencies >> okay >> distinct number of distinct and um most common values. >> Yeah. Uh plus we also have well you will find you know these um useful queries in the instructions of the lab. So you can just you know copy and store it in uh in uh you know um because the these are very useful again in this in this uh you know with this column with this query sorry again
looking at the pg statistic_ext we not only have the uh name of the extended stats the table name but also instead of the column ID now we have the uh column name right >> um yeah So I believe the key takeaway from this one uh Dominico if you agree with me is that when you are uh troubleshooting a query performance issue and you notice that okay the maybe the optimizer or the planner is not picking up
the most optimum plan is to check when the data or the table has been u last analyzed. So from the PG stat or tables check for this table. Okay. When when was the last analyze of that table then based on that you can maybe you can find out that okay you have loaded data after that time and then the table has not been analyzed since then and you need to analyze it. And if you want to
go deeper and see how the data are being distributed um within the or from perspective uh the the statistics perspective then you can do that as well from the PG stat. >> Exactly. Exactly. And you know this is the kind of data that uh then you know the engine is going to calculate. >> Okay. I have a couple of questions here. Uh actually Louise asked a Louis asked a couple of question. What are the key metrics
to monitor when assessing uh postgrace performance? I think this more general question. Uh what are the metrics? >> Yeah. Um uh well in my opinion um you know um as soon as um you have identified what are your top queries metrics in my opinion uh the you know for this task the pgstat statements it's another extension that eventually you can install um you know for your database you can uh create and use for your databases uh
you will have a bunch of additional uh statistics for um um you know your the query And you know the statements that are running in your system such as the uh number of executions um the average execution time uh and additional um you know um dimensions like you know the number of rows retrieved average and you know there are many many many um statistics for that and well probably we can even p statements have a look
at that um you know you can track statistics of your uh SQL statements that are running in your system. You see this is the uh content uh uh you know the the u columns that you can find within u you know this uh view that is PG stat statements and this is also exposed by uh cloudatch database insights and performance insights when you are you know running um uh our services. Uh I was going to mention
that if you're running on RDS then the database and size will give you this overview and then you can uh have an idea about what's running on the database what are the bottlenecks the top weight events and so on in addition to the other monitor the other metrics as well like the server utilization the CPU memory utilization IO utilization all of that. So based on all of these um uh observations you can find out or can
pinpoint where the bottleneck where is the bottleneck and then you start working on it. >> Exactly. >> Perfect. And another question how to solve a memory hike issue. I think also uh again start to find out where this where exactly is in the memory is being consumed. Finding out the processes that are consuming the memory and then what those processes are doing the queries they are running and there are some other database parameters I believe that
can be used. >> Yeah. Yeah. Yeah. Definitely in that case you know the uh metrics like buffer cache ratio this is also important because you know it will uh point out if your working set fits in memory or not uh of obviously you know um metric like this should be close as much as possible to the 100%. Sometimes it's impossible. Um but you know a value that is greater than u 90% 95% is likely um a
good value. If you see you know consistent drops in buffer cache ratio this means that likely either your um you know the um database uh the shed buffers in the case of posgress uh is the memory is not enough you know for your working set uh or there can be some nonoptimized queries that are running in the system. So again you will go you know digat statements or cloudatch database insights or any other monitoring tool that
um you have you know implemented to identify what are the top SQL statements and then analyze um if it is possible to tune those statements and tuning can mean you know um better statistics that is you know the topic for this session uh or um indexes you know maybe there are missing indexes uh tune the uh join methods join in order parameters um I don't know >> common table expressions there are so many ways of >>
exactly >> tuning you know um that is part of the uh well we have 12 minutes late um um remaining probably this will be part of the uh next sessions yeah >> perfect yeah I believe so because uh yeah this I was going to mention that indexing indexing we are going to cover them we were hoping to cover the this uh this time but it seems we don't have enough time for that. So we'll have a
dedicated session for that. Uh unless you have >> maybe we can just introduce you know um you know to prepare for the next one we can introduce um you know the index types that we have in um available in posgress. >> Yeah but before that I'm sorry but I'm sorry to but we have a related question about the statistics. Is there a job or task that predictly updates the stats? >> I believe you mentioned. Yeah. >>
Yeah. Yeah. Yeah. Of course, it's the uh auto vacuum. Uh so this will be part of the uh not the next episode but the um uh week after uh where you know there will be colleagues talking about vacuum and out of vacuum. But basically think of uh out of occas it's it's a background um uh let's say uh worker that is running in the system right and based on some parameters and formulas it will identify those
tables that require um the statistics to be uh refreshed and eventually it will go and refresh the statistics right uh but it is also um you Um in my opinion worth mentioning that sometimes there are best practices around statistics like um uh you you see here well uh it's you know here but basically uh let's think for example you are uh massively um loading uh your tables. It's a good practice. It's a best practice. calculate the
statistics for the impacted objects just at the end of your uh load, right? With analytes and specifying the uh name of the tables, right? Or few other best practices like um this this one for example is the um statement the command that you will use to specify a different value for default uh statistics target. You know when we mentioned you know this parameter this is the um you know the um syntax so uh you can go
you know um modifying this uh parameter for the table >> specific column overriding the uh default value that is 100 right um yeah or you can monitor the freshness of the statistics in your system just probing the pgat all tables just looking at you know the um for example the uh tables where the last analytes or last out analytes so the analyze performed the auto by the auto vacuum is older than um one day starting from
now one day in the past or eventually um there are you know the uh the other views in the catalog that uh will show um you know the um number of inserts, number of updates, number of deletes um that uh you had you know in on your table running ce since last analytes. And basically this you can consider you know this as the number of changes that happened till then. And based on this number you can
calculate you know the percentage against the total number of rows in your table and assume that for example with a 5% or 10% changes you can uh run a manual uh analyze eventually. Um so eventually you can have a custom shell you know that you can build to calculate you know these percentages and then based on these values then uh you run um I don't know maybe during a maintenance window let's say during the night you
can have you know this job uh runs run this refresh the statistics for your team. >> Yeah that makes sense. Okay cool. So uh we only have like three four minutes. Uh do you like to just give a brief intro about indexes? Oh man >> yeah very very very quickly uh promise um so there are few index types you know that posgrql uh offers uh out of the blue they are in the engine and of course
you know the um uh well B3 index is the default one I'm not going through you know all the details >> yeah we will dedicate yeah a show for this >> but it's worth mentioning that and I guess this is uh in the best practices uh you have few um um you know best practices to help you decide what type of index is uh you know the most suitable for your use case. There should also be
um well no it's not here but I think it's probably here in this lab. Uh well um no worries we will show you um during the next episode but there is a table you know explaining what are the index types and uh pros and cons and use cases right >> um but it's just worth mentioning that for example the bring indexes these are really really valuable when you are dealing with time series data or data that
is naturally ordered right >> but we will show you this means and what are the benefits in the next >> yeah because each index of those each index type is suitable for different case different use case >> yeah like G index is uh full text search for example >> yeah right >> yeah exactly exactly okay before we end our show for today I'd like to show you u something related to because some people were asking about
certificates uh or course certificates we have on skills AWS skill builder we have this course postgrades for Amazon auroran are the core concept learning path and the advanced one both learning path they [snorts] you can go through you will have more details about the postgrad starting from um I believe Dominica you were part of the team who developed the content for this uh course. So it's starting I think from from early beginning how you can install
a database, how you can run it and all of that and then it take you through the different stages of provisioning the uh the the database uh or posgress database uh or mastering it. Uh so we have two learning paths. We have the core one and we have the advanced one. The performance tuning and optimization that we were that we covered today is part of the advanced one. As you can see, you can you will learn
about the Postgress architecture, the uh bottlenecks, the performance bottlenecks, query processing, the explain and analyze, the optimization techniques. All of these features and options you will learn about if you go through the AWS skill builder uh training or course. And once you finish this and you complete it, you will get a certificate as well. Uh so yeah and not only that, not only the performance, there are some other uh options or sorry other uh parts are
being covered as as well like the uh I think we are covering the extensions, we are covering the maintenance, tuning, optimization. Yeah. So this one >> comprehensive. Yeah. >> Yeah. Yeah. It's it is. And it has labs as well. >> It has labs. And you see it's uh you know roughly 12 hours almost 12 hours of uh you know learning with the core concepts. This is not to scare you but you can you know have your
pace and go through all the labs uh you know as per your convenience. And it's 20 hours of um you know labs and you know lectures the advanced concepts >> uh plus uh there is at the end of you know this you can also take um yeah you see you have also the extensions that you mentioned >> um and there is also um the assessment yes that is this one well I've completed it but basically going
through you know this it's one hour in the al uh you will have um I I think it's 60 questions right just like sort of a certification it's not an official certification of course um for posgrsql it's um our way of um you know uh >> acknowledging that you have level that that level of um of knowledge about the the postgress database but yeah I uh as you were saying >> edge >> exactly exactly >> eventually
and you're linking eventually. >> Exactly. And and as you mentioned Dominica, the the training of the course is so thorough. It's it's so comprehensive and uh it starts from the beginning till the till you become like a professional in the post grace and aurora uh databases. So we really highly encourage you to go through it. Let us know uh next time if you had the time to go through it and yeah if you manage to get
the badge. Dominico, thank you so much for your time today. >> It's a pleasure >> and looking for the pleasure is mine as well. Looking forward to see you next time and to continue the series that we have started today. As I mentioned the week after we'll be talking about the vacuum and auto vacuum. I believe we'll have vegetable and Steve to uh to discuss that. But next week we are going to talk about a bit
different topic which is the niptune databases. So, uh, next week, same time, and we the 8 a.m. PT 4 p.m. C, we'll be talking about the AI pod graph building with Amazon Neptune. So, tune in next week and see you. Thank you so much for our viewers. Thanks Dominico. >> Thank you so much. >> Thank you. Bye. Bye. >> Cheers. Bye.
⚡ Saved you time reading this? Transcribe any YouTube video for free — no signup needed.