  • It can't be as simple as a date range, because we want to be inclusive of smaller communities.

    1. Paging is a consideration. 1000 posts per community would allow 50 pages of 20 posts.
    2. Small communities are defined as those with 1000 or fewer posts, regardless of age.
    3. Large communities would focus on recency; their 1000 posts would be the most recently created or edited.
    4. Edited is trickier; either skip it for now, or focus on how to keep some kind of mass edit from crowding out newly published posts.

     

    Also a good time to be reminded that the published date isn't reliable, for a couple of reasons:

    1. Problems in the field have been seen with incoming federation data carrying future published dates on content. kbin is an easy example, but it isn't limited to kbin.
    2. Federation can lag due to server overload, problems with the paths between specific servers, ISP issues, etc. It is rather common for a post to be received hours after its published date. Lemmy currently does not track the 'received' date of content.
  • ok, experimenting on a massive test data set of over 5 million posts... this PostgreSQL query works pretty well:

    SELECT COUNT(ranked_recency.*) AS post_row_count
    FROM
      (
         SELECT id, community_id, published,
            rank() OVER (
               PARTITION BY community_id
               ORDER BY published DESC, id DESC
               )
         FROM post_aggregates) ranked_recency
    WHERE rank <= 1000
    ;
    
      

    This limits any one community to 1000 posts, picking the most recently created posts. This gives a way to age out older data in very active communities without removing any posts at all for small communities.
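
    Purely as a sketch (not the actual Lemmy ORM output), the same window function could also gate an ordinary listing query, so a page of 20 never has to scan past the newest 1000 posts of any community:

    -- illustrative only: apply the per-community recency cap to a plain listing query
    SELECT p.id, p.community_id, p.published
    FROM post_aggregates p
    JOIN (
       SELECT id,
          rank() OVER (
             PARTITION BY community_id
             ORDER BY published DESC, id DESC
             ) AS recency_rank
       FROM post_aggregates) ranked_recency
      ON ranked_recency.id = p.id
    WHERE ranked_recency.recency_rank <= 1000
    ORDER BY p.published DESC
    LIMIT 20 OFFSET 0;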

    • I don't understand what you are trying to do here. What problem are you trying to solve? Is something wrong with having an index on that rank function, then just using a simple WHERE? Why do you want to limit to 1000 posts? If I want to look through 1000 posts in a community, I probably want to look at more than 1000.

      • What problem are you trying to solve?

        Reproducible, regular server crashes from queries taking tens of seconds, because the whole logic is built on a WHERE clause with no real selectivity. The server overloads in the field have been happening every single day that I've been here testing the big servers, since May 2023.

        If I want to look through 1000 posts in a community, I probably want to look at more than 1000.

        I'm well aware of the pushback. Everyone chimes in saying they want counting to be real time, the developers seem to avoid caching at all costs, and out of desperation I'm trying to build some kind of basic sanity logic into the system so it doesn't plow through 5 million rows to do a LIMIT 10 query.

        Right now Lemmy works perfectly fine with no personalization. For anonymous users, it works great. If you want to read a million posts, it works great. Start blocking specific users, adding NSFW filters, cherry-picking a blend of communities, etc., and the problems show up. The ORM logic is difficult to follow, based on a massive JOIN of every field there is across many tables, and at certain data thresholds with per-account preferences engaged it goes off the rails into the pile of over 1 million posts (taking 40 seconds to list page = 1 of a LIMIT 20 query, even for a single community).

        The programmers who have built the code for over 4 years don't seem to think it is an urgent problem, so I'm chipping in. I personally have never worked with this ORM, and I find it painful compared to the hand-crafted SQL I've done on major projects. I'm doing this because I feel like nobody else has for months.

    • 3 hours later... I put it into code and am experimenting with it. Some proof of concept results: https://github.com/LemmyNet/lemmy/files/12373819/auto_explain_list_post_community_0_18_4_dullbananas_with_inclusion_run0a.txt

    • An even less intrusive approach is to not add any new fields to existing tables. Establish a reference table, say called include_range. There is already an ENUM value for each sort type, so the include_range table would have these columns: sort_type ENUM, lowest_id BigInt, highest_id BigInt.

      Run a variation of this to populate that table:

      SELECT MIN(ranked_recency.id) AS lowest_id, MAX(ranked_recency.id) AS highest_id
      FROM
        (
           SELECT id, community_id, published,
              rank() OVER (
                 PARTITION BY community_id
                 ORDER BY published DESC, id DESC
                 )
           FROM post_aggregates) ranked_recency
      WHERE rank <= 1000
      
        

      Against every sort order, including OLD. Capture only two BigInt results, the MIN(id) and the MAX(id); that will give a range over the whole table. Then every SELECT on the post_aggregates / post table includes a WHERE id >= lowest_id AND id <= highest_id.

      That would put in a basic sanity check that ages out content, and it would be right against the primary key!
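
      A rough sketch of that reference table and its refresh; the table/column names, the plain-text sort value, and the upsert handling here are assumptions, not the existing Lemmy schema:

      -- hypothetical reference table; columns follow the comment above
      CREATE TABLE include_range (
          sort_type  varchar NOT NULL PRIMARY KEY,  -- one row per sort ENUM value
          lowest_id  bigint  NOT NULL,
          highest_id bigint  NOT NULL
      );

      -- refresh for one sort type: the id range spanned by the newest 1000 posts
      -- of every community (small communities contribute all of their posts)
      INSERT INTO include_range (sort_type, lowest_id, highest_id)
      SELECT 'New', MIN(ranked_recency.id), MAX(ranked_recency.id)
      FROM
        (
           SELECT id, community_id, published,
              rank() OVER (
                 PARTITION BY community_id
                 ORDER BY published DESC, id DESC
                 )
           FROM post_aggregates) ranked_recency
      WHERE rank <= 1000
      ON CONFLICT (sort_type) DO UPDATE
         SET lowest_id = EXCLUDED.lowest_id, highest_id = EXCLUDED.highest_id;

      -- every listing query then adds the range check against the primary key
      SELECT p.*
      FROM post_aggregates p, include_range r
      WHERE r.sort_type = 'New'
        AND p.id BETWEEN r.lowest_id AND r.highest_id
      ORDER BY p.published DESC
      LIMIT 20;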

      • A core design consideration for either approach is that server operators should be able to modify how this data is built without needing to modify or restart the lemmy_server Rust code.

        Using a smallint also gives some flexibility (or a new field, if going with the id min/max approach): if the requested page is greater than 10 for a particular sort, go to include > 1 and fall into tiers.
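
        As a rough illustration of the tier idea only (the include column, its tier values, and the page cutoff are assumptions, not existing schema):

        -- hypothetical smallint tier column on post_aggregates:
        -- include = 1 for the newest ~1000 posts per community, 2 for the next tier, etc.
        -- shallow pages (page <= 10) read only tier 1; deeper pages open up higher tiers
        SELECT p.*
        FROM post_aggregates p
        WHERE p.include <= 1    -- use <= 2 (or higher) when the requested page is > 10
        ORDER BY p.published DESC
        LIMIT 20 OFFSET 0;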
