lemmy_server uses the Diesel ORM, which automatically generates SQL statements. There were serious performance problems in June and July 2023 preventing Lemmy from scaling. Topics include caching, PostgreSQL extensions for troubleshooting, client/server code, SQL data, server operator apps, and a server operator API (performance and storage monitoring), etc.
Lemmy is unusual in its stance of not using Redis, Memcached, Dragonfly, or something similar, relying instead on CPU cores and RAM alone for what is reported this week as 57K active users across over 1,200 instance servers.
Why no Redis, Memcached, or Dragonfly? These are staples of scaling an API.
Anyway, Reddit also started with PostgreSQL and was open source at the time.
"and growing Reddit to 7.5 million users per month"
Lesson 5: Memcache
The essence of this lesson is: memcache everything.
They store everything in memcache:
1. Database data
2. Session data
3. Rendered pages
4. Memoizing (remembering previously calculated results) internal functions
5. Rate-limiting user actions, crawlers
6. Storing pre-computed listings/pages
7. Global locking.
They store more data now in Memcachedb than in Postgres. It's like memcache but stores to disk. Very fast. […]
The post primary key has gaps in it; is the sequence being consumed by transactions that are later rolled back, or by some kind of orphaned posts? This is observable with incoming federated posts from other instances.
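As a generic illustration of why gaps appear (not specific to Lemmy's schema): PostgreSQL sequences are not transactional, so a rolled-back INSERT still consumes an id.

```sql
-- Throwaway demo table; a rolled-back INSERT still consumes a sequence value,
-- which is one way an id column ends up with gaps.
CREATE TABLE demo_post (id bigserial PRIMARY KEY, body text);

BEGIN;
INSERT INTO demo_post (body) VALUES ('will be rolled back');
ROLLBACK;

INSERT INTO demo_post (body) VALUES ('committed');
SELECT id, body FROM demo_post;  -- the committed row has id 2; id 1 is a permanent gap
```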
comment_aggregates has its own unique id column, separate from the comment table's id column, even though there is a one-to-one row relationship. Can the logic be reworked to eliminate the extra column and the related INDEX overhead?
Related to the previous point: the same issue probably exists in the post_aggregates table and other tables that have one-to-one join relationships.
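A quick way to sanity-check the one-to-one assumption before any schema change; this is only a sketch and assumes the aggregates table links back through a comment_id column, as in the 0.18.x schema:

```sql
-- Zero on both counts supports the claim that comment_aggregates is strictly 1:1
-- with comment, i.e. the surrogate id adds nothing but index overhead.
SELECT
  (SELECT count(*) FROM comment c
    WHERE NOT EXISTS (SELECT 1 FROM comment_aggregates ca WHERE ca.comment_id = c.id))
      AS comments_without_aggregates,
  (SELECT count(*) FROM comment_aggregates ca
    WHERE NOT EXISTS (SELECT 1 FROM comment c WHERE c.id = ca.comment_id))
      AS aggregates_without_comments;
```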
It is possible to instruct PostgreSQL to log any query that takes over a certain amount of time; 2.5 seconds seems like a useful starting point, keeping logging activity limited to only the queries causing the most serious issues.
"Possibly the most generally useful log setting for troubleshooting performance, especially on a production server. Records only long-running queries for analysis; since these are often your "problem" queries, these are the most useful ones to know about. Used for pg_fouine." - https://postgresqlco.nf/doc/en/param/log_min_duration_statement/
I think it would really help if we could get lemmy.world or some other big site to turn on this logging and share the output, so we can try to better reproduce the performance overloads on development/testing systems.
I thought some people were out there in June creating stress-testing scripts, but I haven't seen anything materialize or show results in recent weeks?
I think it would be useful to have an API client that establishes a baseline performance number and can be run before each new release of Lemmy, to at least ensure there is no performance regression.
The biggest problem I have had since day 1 is not being able to reproduce the data that lemmy.ml holds. There is a lot of older content stored there that does not get replicated, etc.
The site_aggregates UPDATE statement lacking a WHERE clause, hitting 1,500 rows (one per known Lemmy instance) instead of 1 row, is exactly the kind of data-centered problem that has slipped through the cracks. It was generating a ton of extra PostgreSQL I/O for every new comment and post from a local user.
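A minimal sketch of the problem shape, with simplified column names (comments, site_id) that may not match the real trigger exactly:

```sql
-- Problem pattern: with no WHERE clause, every row in site_aggregates
-- (one per known instance) is rewritten for each new local comment.
UPDATE site_aggregates SET comments = comments + 1;

-- Constrained to the local site's row, only one row is written.
UPDATE site_aggregates SET comments = comments + 1
WHERE site_id = 1;  -- or however the local site row is actually identified
```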
The difficult things to take on:
Simulating 200 instances instead of the 5 that the current API testing code uses. […]
How lemmy.ca went about finding why PostgreSQL was at 100% CPU and crashing the Lemmy website this past weekend: PostgreSQL auto_explain.
IIRC, it was a full copy of lemmy.ca's live data that was used (the copy was loaded on a development system, not the live server, if I'm following). Shared on GitHub on Saturday, July 22, was this procedure:
...
Notable are the auto_explain SQL activation statements:
LOAD 'auto_explain';
SET auto_explain.log_min_duration = 0;
SET auto_explain.log_analyze = true;
SET auto_explain.log_nested_statements = true;
This technique would be of great use to developers making changes and studying PostgreSQL activity. Thank you!
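The statements above only affect the current psql session. To capture plans from Lemmy's own pooled connections, the module can be preloaded for all sessions; a possible setup (superuser required, new connections pick it up after a reload). Note that log_analyze = on times every plan node and adds overhead on a busy server, so this suits a development copy better than production:

```sql
LOAD 'auto_explain';  -- makes the auto_explain.* parameters known to this session
ALTER SYSTEM SET session_preload_libraries = 'auto_explain';
ALTER SYSTEM SET auto_explain.log_min_duration = 250;  -- in ms; only log slower plans
ALTER SYSTEM SET auto_explain.log_analyze = on;
ALTER SYSTEM SET auto_explain.log_nested_statements = on;
SELECT pg_reload_conf();  -- new connections pick up the preloaded library
```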
A fragment of the Diesel-generated WHERE clause for listing posts:
WHERE (((((((((("community"."removed" = $9) AND ("community"."deleted" = $10)) AND ("post"."removed" = $11)) AND ("post"."deleted" = $12)) AND (("community"."hidden" = $13) …
Note that a community can be hidden or deleted; those are separate fields. The query also has logic to check whether the creator of the post is banned from the community:
LEFT OUTER JOIN "community_person_ban" ON (("post"."community_id" = "community_person_ban"."community_id") AND ("community_person_ban"."person_id" = "post"."creator_id"))
And there is both a deleted boolean (end-user delete) and a removed boolean (moderator removal) on a post.
Much of this also applies to comments, which are owned by a post, which in turn is owned by a community.
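A hand-written approximation of how those filters combine (not the exact Diesel output; column names partly assumed from the snippets above):

```sql
-- Posts are listed only when neither the post nor its community is removed,
-- deleted, or hidden; the ban join lets the view flag creators banned from
-- the community.
SELECT p.id,
       p.name,
       (cpb.person_id IS NOT NULL) AS creator_banned_from_community
FROM post p
JOIN community c ON c.id = p.community_id
LEFT OUTER JOIN community_person_ban cpb
       ON cpb.community_id = p.community_id
      AND cpb.person_id    = p.creator_id
WHERE c.removed = false AND c.deleted = false AND c.hidden = false
  AND p.removed = false AND p.deleted = false;
```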
Linear execution of these massive changes to votes/comments/posts, with concurrency awareness. There is also a layer of social awareness: the impact on a community when a bunch of content is black-holed.
An entire site being deleted from federation, or a dead server, would also fall under this umbrella of mass data change, with the potential need for new content ownership, etc.
Over a short period of time, this is my incoming federation activity for new comments; pg_stat_statements output is shown. It is interesting to note that these two INSERT statements on comments differ only in the DEFAULT value of the language column. Also note that the average execution time is way higher (4.3 vs. 1.28) when the language value is set; I assume this is due to INDEX updates on the column? Or possibly a TRIGGER?
About half of the comments coming in from other servers have the default language value.
WRITES are heavy, even if it is only an INDEX that has to be revised, so INSERT and UPDATE statements are important to scrutinize.
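For anyone wanting to pull the same kind of numbers on their own instance, a query along these lines works against the pg_stat_statements view (the extension must be installed; the column names shown are the PostgreSQL 13+ ones, older releases use mean_time/total_time):

```sql
-- Top write statements by total time spent.
SELECT calls,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       round(total_exec_time::numeric, 2) AS total_ms,
       left(query, 80)                    AS query_start
FROM pg_stat_statements
WHERE query ILIKE 'insert%'
   OR query ILIKE 'update%'
   OR query ILIKE 'delete%'
ORDER BY total_exec_time DESC
LIMIT 20;
```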
Given how frequently these records are created (one for every vote by a user), I think it is important to study and review how this works.
The current design of lemmy_server 0.18.3 is to issue a SQL DELETE before (almost?) every INSERT of a new vote, even though the INSERT already carries an upsert (ON CONFLICT ... DO UPDATE) clause.
This is one of the few places in Lemmy where a SQL DELETE statement actually takes place. We have to be careful that triggers do not fire multiple times, such as decreasing the vote count only to immediately increase it again with the INSERT statement that follows.
For the insert of a comment, Lemmy doesn't seem to routinely run a DELETE before the INSERT, so why was this design chosen for votes? The likely reason is that a user can "undo" a vote and have any record of them ever voting removed from the database. Is that the actual behavior in testing?
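For reference, the upsert on its own looks roughly like this; the column names and the (comment_id, person_id) unique constraint are assumptions based on the comment_like table, not the exact Diesel output. With the conflict clause in place, a changed vote is handled without any prior DELETE, which is what makes the current design worth questioning:

```sql
-- Illustrative values; an existing vote by this person on this comment is
-- updated in place rather than inserted twice.
INSERT INTO comment_like (person_id, comment_id, post_id, score, published)
VALUES (10, 200, 42, 1, now())
ON CONFLICT (comment_id, person_id)
DO UPDATE SET score     = EXCLUDED.score,
              published = EXCLUDED.published;
```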
pg_stat_statements from an instance doing almost entirely incoming federation activity of posts/comments from other instances:
This will VASTLY decrease PostgreSQL I/O load, as the mistaken code writes ~1,700 rows (one per known Lemmy instance in the database) on every single comment and post creation. Because these are writes, it also creates record-locking issues, which are harsh on the system. Once this is fixed, some site operators will be able to downgrade their hardware! ;)
## Improving Beehaw
BLUF: The operations team at Beehaw has worked to increase site performance and uptime. This includes proactive monitoring to prevent problems from escalating and planning for future likely events.
Problem: Emails only sent to approved users, not denials; denied users can't reapply with the same username
Solution: Made it so denied users get emails and their usernames freed up to re-use
Details:
Disabled Docker postfix container; Lemmy runs on a Linux host that can use postfix itself, without any overhead
Modified various postfix components to accept localhost (same system) email traffic only
Created two different scripts to:
Check the Lemmy database once in a while for denied users, send them an email, and delete the user from the database (see the SQL sketch after this list)
User can use the same username to register again!
Send out emails to those users (and also, make the other Lemmy emails look nicer)
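A rough sketch of the kind of query such a script might start from; the table and column names (registration_application.deny_reason, local_user, person) are assumptions based on the 0.18.x schema and should be verified before deleting anything:

```sql
-- List applicants whose registration was denied, with the email address to notify.
SELECT p.name AS username,
       lu.email,
       ra.deny_reason
FROM registration_application ra
JOIN local_user lu ON lu.id = ra.local_user_id
JOIN person p      ON p.id  = lu.person_id
WHERE ra.deny_reason IS NOT NULL          -- application was denied
  AND lu.accepted_application = false;    -- and never subsequently accepted
```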
Add a server for serving Prometheus metrics. Include a configuration block in the config file. Provide HTTP metrics on the API, along with process-level metrics and DB pool metrics.
Another day, another update. More troubleshooting was done today. What did we do:
- Yesterday evening @phiresky@lemmy.world [https://lemmy.world/u/phiresky] did some SQL troubleshooting with some of the lemmy.world admins. After that, phiresky submitted some PRs to GitHub.
- @cetra3@lemmy.m...
Furthermore, this leads to data loss, since there is no other consistency mechanism. I think it might be a high-priority issue, taking into account the current momentum behind the growth of Lemmy...
I spent several hours tracing in production (updating the code a dozen times with extra logging) to identify the actual path the lemmy_server code uses for outbound federation of votes to subscribed servers.
Major popular servers (Beehaw, Lemmy.world, Lemmy.ml) have a large number of instance servers subscribing to their communities to get copies of every post/comment. Comment votes/likes are the most common activity, and it is proposed that, during the PERFORMANCE CRISIS, outbound vote/like sharing be turned off by these overwhelmed servers.