Debug slow fetch call on Postgres

I am running into a confusing issue: we have a fetch call that on some days completes in a few seconds and on other days times out. When it times out, the console never returns (no error), Splunk shows no errors, and the action (per Cluster.actionDump()) eventually completes after 20+ minutes.

Here is the call:

c3Grid(FinishedGood.fetch({
    "include":"legacyId, sapId, rlng12moGrossSales, description,item.globalSalesCode.profitCenter.divisionName, item.globalSalesCode.profitCenter.name, item.globalSalesCode.productGroup2Name, item.globalSalesCode.name",
    "filter":"exists(globalSalesCode) && globalSalesCode != \"???\"  && exists(rlng12moGrossSales)",
    "limit":20,
    "offset":0,
    "order":"empty(rlng12moGrossSales),descending(rlng12moGrossSales)"
}))

I have narrowed the problem down to the combination of the order and the include of multiple nested FK tables. Here are those results:

Completes in less than a second:

c3Grid(FinishedGood.fetch({
    limit: 100,
    order: "descending(rlng12moGrossSales)"
}))

Completes in less than a second:

c3Grid(FinishedGood.fetch({
    include: "item.globalSalesCode, item.globalSalesCode.name",
    limit: 100
}))

Completes in 10+ minutes:

c3Grid(FinishedGood.fetch({
    include: "item.globalSalesCode, item.globalSalesCode.name",
    limit: 100,
    order: "descending(rlng12moGrossSales)"
}))

The tables do not contain very large data sets either:

FinishedGood.fetchCount()
434817
ItemMaster.fetchCount() // "item" in the include statement
3741810
GlobalSalesCode.fetchCount() // "globalSalesCode" in the include statement
8752
ProfitCenterMaster.fetchCount() // "profitCenter" in the include statement
902

Any thoughts on why this could be? Let me know if more information would help.

@brett.thompson Generally that’s a sign that the DB is under duress. You can specify “explain: true” in the FetchSpec (you’ll have to remove the c3Grid to examine the result) to see the number of queries that are executed as well as the execution plan for each. If the plans aren’t good (e.g. they do table scans), the query may still perform well when the DB isn’t doing anything else, but it can slow down significantly when it is.
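
For example, something along these lines (a minimal sketch reusing the fetch spec from your tests; note this still executes the queries, so inspect the returned object in the console for the plan details rather than relying on a particular field name):

// Same fetch as the slow case, but with explain enabled and without the c3Grid
// wrapper so the returned object (carrying the explain output) can be inspected.
var result = FinishedGood.fetch({
    include: "item.globalSalesCode, item.globalSalesCode.name",
    limit: 100,
    order: "descending(rlng12moGrossSales)",
    explain: true
});
result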

Unfortunately I am unable to get that call to return anymore, even after actionDump says there are no more actions.

As for DB stress, I’m not sure that’s the case here, as I have made sure to do my testing when there are no other actions or users on the Cluster. I have looked at the explain output for similar queries (going less deep into the FK links, with the order commented out, etc.) without finding any problems. Would it help to post those here, or is that not useful since we really need the explain for the bad query itself?

@brett.thompson Instead of using “explain: true”, try “validateOnly: true”. This will return the queries that would be issued, along with their explain plans, without actually issuing them. It is generally a good tool for investigating issues like this.

Good to know! Thanks!

Command:

FinishedGood.fetch({
    include: "item.globalSalesCode, item.globalSalesCode.name",
    limit: 100,
    order: "descending(rlng12moGrossSales)",
    validateOnly: true
})

Plan:

Limit  (cost=718441.38..718441.64 rows=101 width=73)
  ->  Sort  (cost=718441.38..719635.65 rows=477708 width=73)
        Sort Key: t.rlng12mogrosssales_d DESC, t.id DESC
        ->  Nested Loop Left Join  (cost=0.85..700149.44 rows=477708 width=73)
              Join Filter: ((t3.tenant_tag_id = t2.tenant_tag_id) AND ((t3.id)::text = (t2.globalsalescode_r)::text))
              ->  Nested Loop Left Join  (cost=0.56..691787.15 rows=477708 width=54)
                    ->  Seq Scan on c3_2_finishedgood t  (cost=0.00..64077.51 rows=428369 width=39)
                          Filter: (tenant_tag_id = '12'::numeric)
                    ->  Index Scan using c3_2_itemmaster_pk on c3_2_itemmaster t2  (cost=0.56..1.46 rows=1 width=20)
                          Index Cond: ((tenant_tag_id = t.tenant_tag_id) AND (tenant_tag_id = '12'::numeric) AND ((id)::text = (t.item_r)::text))
              ->  Materialize  (cost=0.29..2.41 rows=1 width=29)
                    ->  Index Scan using c3_2_gbl_sales_code_pk on c3_2_gbl_sales_code t3  (cost=0.29..2.40 rows=1 width=29)
                          Index Cond: (tenant_tag_id = '12'::numeric)

Query:

SELECT T.ITEM_R, T2.GLOBALSALESCODE_R T2$GLOBALSALESCODE_R, T3.NAME T3$NAME, T3.ID T3$ID, T2.ID T2$ID, T.ID, T.VERSION
FROM C3_2_FINISHEDGOOD T
  LEFT OUTER JOIN C3_2_ITEMMASTER T2 ON T2.TENANT_TAG_ID = T.TENANT_TAG_ID AND T2.ID = T.ITEM_R
  LEFT OUTER JOIN C3_2_GBL_SALES_CODE T3 ON T3.TENANT_TAG_ID = T2.TENANT_TAG_ID AND T3.ID = T2.GLOBALSALESCODE_R
WHERE T.TENANT_TAG_ID=:v1
ORDER BY T.RLNG12MOGROSSSALES_D DESC, T.ID DESC
LIMIT 101

@trothwein From my research I don’t see anything inherently wrong or “bad” about the plan above. I also ran the call without the sort; here is what that plan looks like:

Limit  (cost=1.27..138.68 rows=101 width=67)
  ->  Nested Loop Left Join  (cost=1.27..705067.60 rows=518257 width=67)
        Join Filter: ((t3.tenant_tag_id = t2.tenant_tag_id) AND ((t3.id)::text = (t2.globalsalescode_r)::text))
        ->  Nested Loop Left Join  (cost=0.99..695995.70 rows=518257 width=48)
              ->  Index Scan using c3_2_finishedgood_pk on c3_2_finishedgood t  (cost=0.43..62832.64 rows=432126 width=33)
                    Index Cond: (tenant_tag_id = '12'::numeric)
              ->  Index Scan using c3_2_itemmaster_pk on c3_2_itemmaster t2  (cost=0.56..1.46 rows=1 width=20)
                    Index Cond: ((tenant_tag_id = t.tenant_tag_id) AND (tenant_tag_id = '12'::numeric) AND ((id)::text = (t.item_r)::text))
        ->  Materialize  (cost=0.29..2.41 rows=1 width=29)
              ->  Index Scan using c3_2_gbl_sales_code_pk on c3_2_gbl_sales_code t3  (cost=0.29..2.40 rows=1 width=29)
                    Index Cond: (tenant_tag_id = '12'::numeric)

From what I can tell, the largest cost in each plan is the scan on c3_2_finishedgood, yet the call without the sort returns in under a second. The main difference I see is that the Limit has a cost of 1.27..138.68 in the second plan compared to 718441.38..718441.64 in the first. I don’t understand how the limit can run before the left join finishes, though, so I may not be reading it correctly.

@brett.thompson That first plan is “bad”, as indicated by the ‘718441.38’ minimum. I believe adding an index on ‘FinishedGood.rlng12moGrossSales’ will address your problem, as the planner will then use the index for the outer scan, eliminating that sort.
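
Purely as an illustration of the kind of index the planner could use here (raw DDL for the sketch only; on the platform the index would presumably be declared through the FinishedGood type metadata rather than created by hand in Postgres):

-- Illustrative sketch only: a composite index matching the tenant filter and the
-- sort key (rlng12mogrosssales_d DESC, id DESC) from the plan above, which would
-- let Postgres satisfy the ORDER BY with an index scan instead of a Sort node.
CREATE INDEX ix_finishedgood_rlng12mo_gross_sales
    ON c3_2_finishedgood (tenant_tag_id, rlng12mogrosssales_d DESC, id DESC);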

I guess what I’m confused about is that the node with the highest cost is the same with or without the sort. It seems the Limit is able to start at a much lower cost without the sort, which makes it faster, but I’m still trying to fully understand that since the join beneath it still costs 700000+.

As for indexing, I have not done that yet because this data is displayed in a table that can be sorted on any of its columns, and every one of them causes this problem. If we end up indexing all of the columns, I assume we will end up with the same issue.

@brett.thompson You have to know how to read the explain plans :). In your first example, the outermost node has a minimum (startup) cost of 718441.38; the second plan has a minimum of 1.27 and a maximum of 138.68, far less than the first. The startup cost is what the node must spend before it can return its first row: the Sort has to consume the entire join before it can emit anything, so the Limit cannot stop early. Without the sort, the Limit stops after 101 rows and the join’s large total cost is never actually paid, which is why that plan returns in under a second.