This is where you get started: https://github.com/oobabooga/text-generation-webui
This is where you get models (it is like the GitHub of open source offline AI):
https://huggingface.co
Oobabooga Textgen WebUI is about the easiest tool that sits in the grey chasm between users and developers. It doesn't really require any code, but it is not a polished, foolproof end-user product where everything is oversimplified and spelled out. The default settings will work for a solid start.
The only initial setting I would change for NSFW is the preset profile, from Divine Intellect to Shortwave. Divine Intellect (DI) is ideal for AI-assistant-like behavior, while Shortwave is more verbose and chatty.
Every model is different. Even quantized versions of the same model can differ substantially, depending on how the different neural layers are reduced to a lower number of bits and how much information is lost in the process. Pre-quantized models are how you run larger models on a computer that could not run them otherwise. For example, I love a 70B model. The number means the model has 70 billion parameters (the learned weights of the network). Most of these models store each parameter in 2 bytes (16-bit floats), so it would take a computer with about 140 gigabytes of RAM to load this model without quantization. If the model loader only works on a GPU... yeah, good luck with that. Fortunately, one of the best model families is Llama2, and its model loader, llama.cpp, can run on the CPU, on the GPU, or split across CPU+GPU.
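The memory arithmetic above can be sketched in a few lines. This is a rough estimate only, assuming the size is simply weight count times bits per weight; real quantized files run somewhat larger because of format overhead and mixed precision, and the context cache needs memory on top of this. The function name `model_memory_gb` is a made-up helper for illustration.

```python
def model_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate size of the weights alone, in gigabytes (10**9 bytes).

    Assumption: every weight uses exactly `bits_per_param` bits.
    Real quantization formats add some overhead on top of this.
    """
    total_bytes = n_params * bits_per_param / 8
    return total_bytes / 1e9


# 16 bits is the unquantized 2-bytes-per-weight case; the rest are
# typical quantization levels.
for bits in (16, 8, 5, 4, 3):
    size = model_memory_gb(70e9, bits)
    print(f"70B at {bits:>2} bits per weight: ~{size:.1f} GB")
```

The 16-bit row reproduces the 140 GB figure, and the 3-bit row shows why a 70B model becomes reachable on a machine with 64GB of RAM plus a 16GB GPU.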
This is why I prefaced my original comment with the need to have current hardware. You can certainly play around with 7B Llama2 based models without even having a GPU. This is about like chatting with a pre-teen who is prone to lying. With a small GPU that has 8GB or less, you might get a quantized 13B model working. This is about like talking to a teenager who is not very bright. Once you get up to ~30B, you're likely to find something around the knowledge level of a college grad with no experience. At this point I experienced ~80-85% accuracy in practice, meaning a general model is capable of generating a working Python snippet around this much of the time. I mean that I tried to use it in practice, not some random benchmark of a few problems for comparing models. I have several nonconventional tests I do, like asking the model about prefix, postfix, and infix notation math problems, and I ask about Forth (an ancient programming language) because models are barely trained on Forth, if at all. (I'm looking at overconfidence and how the model deals with something it does not know.)
In a nutshell, a ~30B general model is only able to generate code snippets as mentioned. To clarify, I mean that when its code errors and it is then prompted with the error message, it can resolve the problem ~80-85% of the time. That is still not good enough to keep you from chasing your tail and wasting hours in the process. A general 70B model steps this up to ~90-95%, even as a 3-5 bit quantized model. This is when things become really useful.
Why all the bla bla bla about code? To give more context in a more tangible way. When you do roleplaying, the scale of the problems is similar. The AI alignment problem is HARD to identify in many ways. There are MANY times you could ask the model a question like "What is 3 + 3?" and it will answer "6", but if you ask it to show you the logical process behind that conclusion it will say (hyperbole): "the number three looks like cartoon breasts and four breasts and two balls equals 6, therefore 3 + 3 = 6." Once this has been generated and is in the chat dialog context history, it is now a 'known fact', and that means the model will build off this logic in the future. That example was extremely hyperbolic. In practice, the ways the model hallucinates are much more subtle and harder to notice. The smaller the model, the harder it is to spot the ways it tries to diverge from your intended conversation. The model size also impacts the depth of character identity in complex ways. Smaller models really need proper nouns (character names) in most sentences, especially when multiple characters are interacting. Larger models can better handle several characters at one time and a more natural use of generic pronouns. This also impacts gender fluidity greatly.
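Why a bad answer becomes a 'known fact' is easier to see in code. This is a minimal sketch, assuming a chat front end that simply replays the whole dialog as the prompt on every turn; `build_prompt`, `chat_turn`, and the stand-in `generate` callables are hypothetical names for illustration, not how any particular tool is implemented.

```python
def build_prompt(history, user_message):
    """Flatten the whole dialog so far, plus the new message, into one prompt."""
    lines = [f"{speaker}: {text}" for speaker, text in history]
    lines.append(f"User: {user_message}")
    lines.append("AI:")
    return "\n".join(lines)


def chat_turn(history, user_message, generate):
    """Run one turn and store both sides of it in the history."""
    reply = generate(build_prompt(history, user_message))
    history.append(("User", user_message))
    history.append(("AI", reply))  # right or wrong, the reply is now context
    return reply


# Stand-in "models" that return fixed text, just to show the data flow.
history = []
chat_turn(history, "What is 3 + 3?", lambda prompt: "6")
chat_turn(history, "Show your reasoning.", lambda prompt: "Threes are lucky, so 6.")

# The nonsense explanation is now part of every future prompt.
print(build_prompt(history, "What is 4 + 4?"))
```

Editing or deleting a bad reply in the chat history before continuing removes it from what the model sees on the next turn.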
You don't need an enthusiast-level computer to make this work, but you do need one to make this work really well. Hopefully I have made it clearer what I mean by that. That was my real goal. I can barely make a 70B run at a tolerable streaming pace with a 3 bit quantization on a 12th gen i7 laptop that has a 3080Ti GPU (the "Ti" is critical, as the laptop 3080Ti is the 16GB version whereas there are laptop "3080" cards that are 8GB). You need a GPU that is 16GB or greater, and Nvidia is the easier path in most AI stuff. On the AMD side, only the 7-series (RX 7000) and newer is relevant to AI in particular. The older AMD GPUs are for gaming only and are not actively supported by HIP, which is the CUDA-like API translation layer that is relevant to AI. Basically, for AI the kernel driver is the important part, and that is totally different from the gaming/user space software.
Most AI tools are made to run in a web browser, served by a local server (localhost) on your network. This means it is better to run a tower PC than a laptop. You'll find it is nice to have the AI on your network and available to all of your devices. Maybe don't get a laptop, but if you absolutely must, several high-end 2022 laptop models can be found if you search for 3080Ti. These are the only 16GB GPU laptops that can be found for a reasonable price (under $2k shipped). This is what I have. I wish I had gotten a 24GB card in a desktop with an i9 instead of an i7, and something with 256GB of addressable memory. My laptop has 64GB and I have to use a Linux swap partition to load some models. You need max speed DDR5 too. The main CPU bottleneck is the L1 to L2 cache bus when you're dealing with massively parallel tensor math. Offloading several neural network layers onto the GPU can help.
Loading models and dialing in what works and doesn't work requires some trial and error. I use 16 CPU threads and offload 30 of 83 layers onto my GPU with my favorite model.
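The trial and error can start from a rough estimate of how many layers fit on the GPU. This is a minimal sketch, assuming every layer is about the same size and that some VRAM has to stay free for the context cache and the desktop; `layers_that_fit`, the 30 GB file size, and the 4 GB reserve are hypothetical illustration values, not measured settings.

```python
def layers_that_fit(model_gb, n_layers, vram_gb, reserve_gb=4.0):
    """Estimate how many layers can be offloaded to the GPU.

    Assumptions: all layers are the same size, and `reserve_gb` of
    VRAM is kept free for the context cache and everything else.
    """
    if n_layers <= 0 or model_gb <= 0:
        raise ValueError("model_gb and n_layers must be positive")
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    # Round down, and never report more layers than the model has.
    return min(n_layers, int(usable_gb / per_layer_gb))


# Hypothetical example: a 30 GB quantized file with 83 layers on a 16 GB GPU.
print(layers_that_fit(model_gb=30.0, n_layers=83, vram_gb=16.0))
```

Treat the result as a first guess: if loading runs out of VRAM, lower the number, and if there is headroom, raise it. In llama.cpp itself the two knobs are the `--threads` and `--n-gpu-layers` options, and the web UI exposes the same two settings for its llama.cpp loader.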
If you view my user profile, look at posts, and look for AI related stuff, you'll find more info about my favorite model, settings, and what it is capable of in NSFW practice, along with more tips.