Markus Zimmermann

@zimmskal

939
Followers
856
Following
165
Media
3,264
Statuses

CTO and Founder at Symflower. Benchmarking LLMs to check how well they write quality code.

Linz
Joined November 2010
Pinned Tweet
@zimmskal
Markus Zimmermann
11 days
Crowning our new king 👑 @deepseek_ai's Coder-V2, king of cost-effectiveness! - 🥈 @AnthropicAI's Claude 3.5 Sonnet (and Claude 3 Opus) have similar functional scores - 🍂 Our old king @Meta's Llama 3 70B has lost much ground but is still a “small” great model! - 🐰 @GoogleAI's Gemini
Tweet media one
3
16
108
@zimmskal
Markus Zimmermann
2 months
Llama 3 70B takes the king’s crown 👑 from GPT-4 Turbo - 100% in coverage and 70% in quality of code - 💵 most cost-effective inference - open-weight We tested @OpenAI's GPT 3.5 & 4, @Meta's Llama 3, @Google's Gemini 1.5 Pro, @cohere's Command R+ and 130+ other LLMs
Tweet media one
20
85
568
@zimmskal
Markus Zimmermann
2 months
We tested @OpenAI's GPT 3.5 & 4, @Meta's Llama 3, @Google's Gemini 1.5 Pro, @cohere's Command R+ and 130+ other LLMs on writing code. Deep dive blog post soon 🏇 In the meantime: a screenshot of the results (more points is better)
Tweet media one
38
40
446
@zimmskal
Markus Zimmermann
12 days
DeepSeek-Coder-V2 might be taking the king’s crown 👑 from Llama 3 70B soon - 📈 Shares almost the same score (19980) as Claude 3 Opus (19954), a leap over GPT-4o (19236) - 💵 However, FAR CHEAPER: $0.42 vs $90 vs $20 - 🔓 Open-weight and allows commercial use This is an excerpt
Tweet media one
9
34
214
@zimmskal
Markus Zimmermann
2 months
@mitchellh More FPS means less CPU spent on single frames, right? Scaling that to 100ks of users means less energy used in general. Every watt counts. Doing your part for the <2°C goal. Or am I too optimistic here about why optimizations matter? 😉
3
2
76
@zimmskal
Markus Zimmermann
2 years
@golang I count "not discussing which language features to use while implementing an application" as my top reason.
1
2
38
@zimmskal
Markus Zimmermann
2 years
@moyix Makes efforts like reverse engineering Super Mario 64 even more impressive. is old but gold to read. There is even someone optimizing for speed... for a while now 🤓
2
2
32
@zimmskal
Markus Zimmermann
2 years
Wow. The only "negative feedback" for @Tailscale is a list of bugs that can be fixed and that I am sure everyone is already working on. Truly a product to admire and learn from. @apenwarr any tips on how you guys made it _that_ good? Any product management learnings maybe?
1
3
28
@zimmskal
Markus Zimmermann
2 years
@_JacobTomlinson Granularity of commits helps on so many levels it is absurd to not want it. Faster reviewing is just one massive advantage but I give you two other words: git bisect
1
0
26
@zimmskal
Markus Zimmermann
2 months
GPT-4o is 1.55x faster than GPT-4 Turbo - 🌬️ Faster than @AnthropicAI's Claude 3 Haiku, but slower than @Meta's Llama 3 70B 🐌 - 🐰 Speed is important for interactive use cases (e.g. autocomplete) and workloads that are time critical Cheers to @FireworksAI_HQ for being the
Tweet media one
@zimmskal
Markus Zimmermann
2 months
GPT-4o is drastically more cost-effective than GPT-4 Turbo - 💸 Half the price of GPT-4 Turbo - 💯 Highest score in the DevQualityEval v0.4.0 - 📉 Still not as cost-effective as @Meta's Llama 3 70B or @AnthropicAI's Claude 3 Haiku (see thread 👇)
Tweet media one
5
2
14
1
6
22
@zimmskal
Markus Zimmermann
3 years
@RandallKanna I find it ridiculous that we are still in a world where people are ashamed of failing or shame others. What is the point in that? Failing just means that you can learn something or that you can teach and help someone else. One should embrace getting better together...
1
0
20
@zimmskal
Markus Zimmermann
2 months
You can read even more details and findings in the full deep dive on Did you see other findings and problems that are interesting? Please let us know how you like this form of article. We will do at least one dive per eval version.
1
1
18
@zimmskal
Markus Zimmermann
2 years
@RayGesualdo Have you heard about CI/CD with automated testing, our lord and savior?
2
0
19
@zimmskal
Markus Zimmermann
2 months
Llama 3 70B found a non-obvious constructor test case in Java. Surprising, but valid! It also writes high-quality test code 70% of the time.
Tweet media one
2
1
17
@zimmskal
Markus Zimmermann
2 months
💵 Cost matters! GPT-4 (150 score at $40 per million tokens) and Claude 3 Opus (142 at $90) are good but 25 to 55 times more expensive than Llama, Wizard and Haiku 😱 Compare on a logarithmic scale:
Tweet media one
1
1
16
@zimmskal
Markus Zimmermann
3 years
@gunnarmorling Using less existing code means reinventing the wheel again and again. Hence, the same and more bugs and LOADS of development time wasted on the same features. Also learning even more APIs. I'd rather spend time helping get rid of the remaining issues, helping everyone.
0
0
15
@zimmskal
Markus Zimmermann
2 years
@mitsuhiko I see it almost as an ORM layer for clients. If you need "joins" over multiple tables regularly, I think it is a great fit. Better than REST.
2
0
16
@zimmskal
Markus Zimmermann
3 years
Finally upgraded all our hosts, VMs, ... to @openSUSE 15.3. Except for libvirt (firewall and Vagrant) and lots of repository conflicts it was smooth as butter. Already enjoying the new performance and features 🥳 Thanks! #openSUSE #devops
1
4
13
@zimmskal
Markus Zimmermann
2 months
GPT-4o is drastically more cost-effective than GPT-4 Turbo - 💸 Half the price of GPT-4 Turbo - 💯 Highest score in the DevQualityEval v0.4.0 - 📉 Still not as cost-effective as @Meta's Llama 3 70B or @AnthropicAI's Claude 3 Haiku (see thread 👇)
Tweet media one
5
2
14
@zimmskal
Markus Zimmermann
2 months
With linear scale:
Tweet media one
1
0
11
@zimmskal
Markus Zimmermann
2 months
Fine-tuning makes a big difference in the quality of test code: WizardLM-2 8x22B beats Mixtral 8x22B-Instruct with a 30% better score.
1
0
12
@zimmskal
Markus Zimmermann
3 years
@editingemily Wait... You find leap days strange? You are in for a treat: take a look at leap seconds. They lead to some serious problems, e.g. search for MySQL and leap second. Handling leap years is kind of a no-brainer in comparison to handling leap seconds. I bet that most validations are wrong
2
1
12
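The leap-second gotcha from the tweet above is easy to demonstrate: a minimal Python sketch (my own illustration, unrelated to the MySQL incident mentioned) showing that common datetime handling cannot even represent a leap second.

```python
from datetime import datetime

# Python's datetime requires second to be in 0..59, so the real UTC
# leap second 2016-12-31T23:59:60 is rejected outright with ValueError.
try:
    datetime(2016, 12, 31, 23, 59, 60)
    leap_second_accepted = True
except ValueError:
    leap_second_accepted = False

print(leap_second_accepted)  # False
```

Any validation built on such types silently assumes `:60` never happens, which is exactly why "most validations are wrong".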
@zimmskal
Markus Zimmermann
2 years
@_JacobTomlinson If your experience is to have such commits, I am sorry. But there is a clear way to incrementally build up a PR with isolated commits. We do that daily. No problems, just advantages. Tooling then also makes a lot automatable in comparison e.g.
2
1
12
@zimmskal
Markus Zimmermann
2 months
Output prices are only one half of the story when using an API provider. Here is a graph of input+output prices for @Meta's Llama 3 70B across all available providers vs @OpenAI's GPT-4 Turbo vs @AnthropicAI's Claude 3 Opus & Haiku. Costs matter. Taking inference speed as another axis would be cool!
Tweet media one
@awnihannun
Awni Hannun
2 months
This is an important chart for LLMs. $/token for high quality LLMs will probably need to fall rapidly. @GroqInc leading the way.
Tweet media one
25
73
615
2
2
12
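The "input+output prices" point above can be sketched in a few lines: a blended cost per request combines both token prices. This is my own illustration, not the eval's code, and the per-million-token prices in the table are placeholders, not current quotes from any provider.

```python
# Illustrative (input $, output $) prices per 1M tokens -- placeholders.
PRICES_PER_MTOK = {
    "provider-a/llama-3-70b": (0.59, 0.79),
    "gpt-4-turbo": (10.00, 30.00),
    "claude-3-opus": (15.00, 75.00),
}

def blended_cost(model: str, in_tok: int, out_tok: int) -> float:
    """Total $ for a request, combining input and output token prices."""
    p_in, p_out = PRICES_PER_MTOK[model]
    return (in_tok * p_in + out_tok * p_out) / 1_000_000

# Example: a coding prompt of 3000 tokens with a 1000-token completion.
for model in PRICES_PER_MTOK:
    print(f"{model}: ${blended_cost(model, 3000, 1000):.4f}")
```

Because completions for coding tasks are often long, the output price can dominate, so comparing providers on output price alone is misleading.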
@zimmskal
Markus Zimmermann
24 days
We are testing @OpenAI's GPT-4o, @Meta's Llama 3, @Google's Gemini 1.5 Flash, @AnthropicAI's Claude 3 Haiku, @SnowflakeDB's Arctic and 150+ other LLMs on writing code. Deep dive blog post soon 🏇 In the meantime, screenshots of an interesting finding (more points is better): choose
Tweet media one
Tweet media two
0
0
11
@zimmskal
Markus Zimmermann
2 years
@gunnarmorling IIRC the scheduler then works smarter to distribute the load. You also do not want a CPU/memory hog that makes your node unusable for other pods. Happens to us all the time. We have a specific CI job that would eat all the CPU and memory in the world if we allowed it.
2
0
11
@zimmskal
Markus Zimmermann
2 years
Finally read by @apenwarr. The @Tailscale blog is always worth reading but I especially enjoy what Avery writes. Clear content, vision and mission, so much to learn from, every time. Also heard only good things. Truly a manager I admire & aspire to. Thank you!
1
4
11
@zimmskal
Markus Zimmermann
2 months
Models with smaller parameter size like Gemma 7B, Llama 3 8B and WizardLM 2 7B (take a look at this case 👇) have a problem finding compilable code, but Mistral 7B does well.
Tweet media one
1
0
10
@zimmskal
Markus Zimmermann
2 months
We looked at 138 LLMs but about 80 do not even provide reliable test generation for simple cases. Score below 85 == model did not do well
Tweet media one
2
0
9
@zimmskal
Markus Zimmermann
3 years
@anammostarac @visakanv Know two grandparents that lived in total isolation. Only groceries from their children were the connection to the outside. They died of covid. Lowering the probability of getting sick is not a waste of time.
0
0
10
@zimmskal
Markus Zimmermann
2 years
@bradfitz Seeing this i have just one conclusion: I have been writing user stories totally wrong all the time!
0
0
10
@zimmskal
Markus Zimmermann
3 months
@mitsuhiko Totally agree. Even slightly more complicated setups are now super easy. MUCH easier than a decade ago. There is also "the move" of @dhh with to make more complicated setups work. I have been riding the Hetzner train for almost forever. Still necessary. $$$
0
0
10
@zimmskal
Markus Zimmermann
2 months
Only 44.78% of the code responses actually compiled. Responses that did not compile were strong hallucinations, but some were really close to compiling. 95.05% of all compilable code reached 100% coverage. With more tasks and cases we argue that this will get worse fast.
1
0
7
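The two headline rates in the tweet above (share of responses that compile, and share of compilable code reaching 100% coverage) can be sketched as a simple aggregation. The record format below is a hypothetical illustration, not the actual DevQualityEval data model.

```python
# Hypothetical per-response eval records: did the generated code compile,
# and what statement coverage did its tests reach?
results = [
    {"compiles": True,  "coverage": 100.0},
    {"compiles": True,  "coverage": 80.0},
    {"compiles": False, "coverage": 0.0},   # hallucinated code
    {"compiles": False, "coverage": 0.0},
]

compiled = [r for r in results if r["compiles"]]
compile_rate = len(compiled) / len(results) * 100
full_coverage_rate = (
    sum(r["coverage"] == 100.0 for r in compiled) / len(compiled) * 100
)

print(f"{compile_rate:.2f}% compiled")             # 50.00% compiled
print(f"{full_coverage_rate:.2f}% full coverage")  # 50.00% full coverage
```

Note that the second rate is conditional on compiling, which is why a high coverage number can coexist with a low compile rate.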
@zimmskal
Markus Zimmermann
3 years
@openSUSE It's well deserved! Tumbleweed is pretty great, especially because of all the automatic testing that is going on. It often feels like SUSE is one step ahead here, or is that because other distros are not blogging about testing? Btw, why is there growth beginning with 2021-03?
1
0
9
@zimmskal
Markus Zimmermann
2 months
@realshojaei @OpenAI @Meta @Google @cohere Twitter image handling sucks. You need to open it, and then open the image in your browser as a new tab/window: there you can now zooooom. Enjoy. Will upload an SVG with the blog post though, so you can zoom in as long as you want
1
0
9
@zimmskal
Markus Zimmermann
2 months
GPT-4 Turbo by default writes obvious comments in test code, which must be avoided in quality code.
Tweet media one
2
0
8
@zimmskal
Markus Zimmermann
2 years
@moyix Found the optimization video again. If you played the game (or still do 👋) or are just curious how weird the N64 platform is, I can recommend it. "Vroom vroom"
0
0
9
@zimmskal
Markus Zimmermann
2 years
@RayGesualdo Do NOT anger them, they just haunt you back with failing CI Pipelines and lots of flaky tests. Please them with more automation
1
0
8
@zimmskal
Markus Zimmermann
11 days
@natfriedman what is your take on why current benchmarks are not-enough/bad? Just the score ceiling? ... asking for a friend 🙃
1
0
8
@zimmskal
Markus Zimmermann
3 years
@vlad_mihalcea IMHO I either know the names I want to go for right away, or I use some longer name that describes the behavior from my POV, and then almost all the time someone during review suggests a much better name. #teamwork
0
0
8
@zimmskal
Markus Zimmermann
12 days
Congratulations @deepseek_ai for having the highest functional score for the DevQualityEval. Another benchmark! - Still checking the quality of the results, so there might be a surprise there - Coder is much chattier than GPT-4o (+46% more characters) - DeepSeek-Coder is much slower
1
0
8
@zimmskal
Markus Zimmermann
3 years
@euronews @esa Strange how something can be at the same time awesome to look at and scary and worrisome...
0
0
8
@zimmskal
Markus Zimmermann
2 years
@DrewTumaABC7 What is the reason that San Francisco has 65 and Portland 88? And then again Palm Springs has 113?
13
0
7
@zimmskal
Markus Zimmermann
2 years
@QuinnyPig Mark it as "beta". Problem solved.
0
0
8
@zimmskal
Markus Zimmermann
2 months
@HochreiterSepp Congratulations. Can you give me API access for a bit so I can add it to this evaluation benchmark?
@zimmskal
Markus Zimmermann
2 months
Llama 3 70B takes the king’s crown 👑 from GPT-4 Turbo - 100% in coverage and 70% in quality of code - 💵 most cost-effective inference - open-weight We tested @OpenAI's GPT 3.5 & 4, @Meta's Llama 3, @Google's Gemini 1.5 Pro, @cohere's Command R+ and 130+ other LLMs
Tweet media one
20
85
568
2
0
8
@zimmskal
Markus Zimmermann
3 years
No more out-of-sync web tests for us, no more flaky ... OK ... some flaky tests, but it's a lot better than before. Check out why we chose @Cypress_io for some of our web frontend tests. 👇 #cypress
@symflower
Symflower
3 years
@Cypress_io offers an efficient workflow for writing UI E2E tests. Take a look for more reasons why we have chosen Cypress over Jasmine: #cypress #testing #angular
0
0
1
0
2
6
@zimmskal
Markus Zimmermann
4 years
@ddprrt Standardisation over the whole stack and tooling + conventions about formatting, linting, ... (if you follow the Angular devs). I mean it can still be improved in lots of areas but getting rid of decisions and focusing on the requirements leads to faster iterations and onboarding
1
0
7
@zimmskal
Markus Zimmermann
26 days
Finally! The new DevQualityEval v0.5.0 got released. We have some juicy 🧃 additions for the whole LLM community: - Ollama 🦙 support - Support for any OpenAI API inference endpoint 🧠 - Mac and Windows 🖥️ support - More complex "write test" task cases 🔢 for both Java and Go -
2
0
7
@zimmskal
Markus Zimmermann
2 years
@maciejwalkowiak Why a book? We have onboarded almost everyone here with this list; it adds more context and conventions with every link. What don't you like? Maybe I can help
1
2
7
@zimmskal
Markus Zimmermann
2 years
@lukaseder Double-negatives...
Tweet media one
0
1
7
@zimmskal
Markus Zimmermann
2 years
@KevinNaughtonJr At this point I am really wondering if it is mostly reading code and text. And thinking about code and text. I have a mail and review day today; most of the time I am staring at red and green shaded characters. 🤨
0
0
6
@zimmskal
Markus Zimmermann
2 years
Just gave @github's #codespaces a try with @symflower: works out of the box! And it is really fast 🍻 It is "just" #vscode with @ubuntu. However, we will add it to our environments with a default setup so you can play around with Symflower 😍
Tweet media one
0
0
7
@zimmskal
Markus Zimmermann
3 years
Yes, the current #Log4j CVE is amazingly bad but what I can't stand is the hatred towards the developers. It's an OPEN SOURCE project! Most likely with almost no full-time paid people, as is usual with OSS. So: instead, help them out! #Java #OpenSource
0
1
7
@zimmskal
Markus Zimmermann
2 years
An additional benefit of Avery's content is the constant stream of jokes. Just a short quote: "We are Canadians, we hate fighting, it feels like work." 😃
0
1
7
@zimmskal
Markus Zimmermann
1 year
@kentcdodds *old dude voice* ... and soo began the great spaces to tabs migration of Kent et al...
0
0
7
@zimmskal
Markus Zimmermann
2 years
@maciejwalkowiak Sounds better suited as part of an integration test. But if you really need to have a unit test, maybe you can just generate them with Would be happy to hear your feedback
0
0
7
@zimmskal
Markus Zimmermann
2 months
If you use GPT-4 for coding, switch to GPT-4o right NOW 🚀: GPT-4o writes BETTER code 📈 cheaper & faster! Here is a peek at a more complex scenario of the next DevQualityEval version. Guess which version is GPT-4 Turbo, and which is GPT-4o? Which one do you think is better?
Tweet media one
Tweet media two
1
1
7
@zimmskal
Markus Zimmermann
2 years
#linux is so much better for UI system testing than #macos and #windows . I can parallelize jobs like crazy using Kubernetes. I can just boot up X and a window manager in a container. No weird "must be logged in" or "screensaver" problems. No weird permission problems. And and and
2
3
6
@zimmskal
Markus Zimmermann
2 years
Want to show your software testing magic? We have a challenge for you! And if you beat @symflower , there is also a t-shirt 👕 for you in it 🥰 #100daysofcode #codenewbie #learntocode #java
@symflower
Symflower
2 years
Heard of Fizz buzz, but have already solved it? 🤔 We have a challenge for you! If you can find a test case that gives you more coverage than the existing test cases, we’ll surprise you with a t-shirt! 👕 #100daysofcode #WomenWhoCode #codenewbie #learntocode #java
Tweet media one
Tweet media two
1
4
5
0
5
5
@zimmskal
Markus Zimmermann
1 month
@maximelabonne Working on a better eval for software-development-related tasks (not just code gen). Right now we let the model write tests and we score how well they do that; next version we are adding more assessments (obvious comments, dead code, ...) and tasks
1
2
7
@zimmskal
Markus Zimmermann
2 years
@dascandy42 We looked into that and IMO it is (was?) bad for the user's experience. If you do not have WSL set up, what then? Rather have our users using the apps in one click.
0
0
6
@zimmskal
Markus Zimmermann
26 days
@huybery Congrats! Blog post reads extremely promising. I am looking forward to running it in our evaluation, maybe it can dethrone our current king model: Llama3-70B 🏇
1
0
6
@zimmskal
Markus Zimmermann
2 years
Pro tip: if you are waiting for a command in your console to finish, take the microsecond to press enter to actually run it 🫣 definitely going to bed now ...
0
0
5
@zimmskal
Markus Zimmermann
6 months
batman-POW! @golang just announced the first 1.22 release candidate and it looks to me like they squeezed two major versions into one! It is truly packed 🌋 Congratulations to everyone involved making #go more awesome!
1
0
6
@zimmskal
Markus Zimmermann
3 years
@luca_cloud Thanks for posting. Will be my good night story today. How did you get access to these teams? On what level did you communicate? IMHO people tell different things on different levels.
1
0
6
@zimmskal
Markus Zimmermann
3 months
If you are interested in LLMs/AI-in-general, code-generation or code-quality: Just released "An evaluation benchmark for LLMs and friends to compare and evolve code quality of code generation." The first release v0.1.0 is just there to ...
2
1
5
@zimmskal
Markus Zimmermann
2 months
@mitchellh @davidcrawshaw That is what I would expect idle to do (OK, I would expect 0% CPU when nothing is happening, but 1% is good for now). Compare it with web pages: I have one Reddit page open where right now no change happens. Takes a **whole** core. 100%. No idea why. The climate is crying, Reddit.
1
0
6
@zimmskal
Markus Zimmermann
12 days
@GoogleAI 's Gemini Pro 1.5 is doing well, but Flash 1.5 is doing much better! - 16% better score for Flash than Pro - Flash is much better for Java (+23%) but roughly the same for Go (+3%) @OfficialLoganK was surprised by Flash. Shouldn’t Pro be better?
1
0
6
@zimmskal
Markus Zimmermann
24 days
@gwenshap Using @OpenRouterAI for most of our evals. @FireworksAI_HQ and @GroqInc are also very nice (for Llama 3 at least). Currently looking at @huggingface for models that are not available somewhere else. Some parts with @ollama / MLX on rented servers for CPU-only evals.
0
0
6
@zimmskal
Markus Zimmermann
2 months
How to optimize Ollama for CPU-only inference? I am wondering if anyone has documentation, blog posts, ideas, ... really anything(!) about optimizing inference for CPU-only. The reason is that most people do not have nice GPU(s) in their notebooks and rely on API providers or
2
0
6
@zimmskal
Markus Zimmermann
3 years
1/7 At @symflower we have multiple principles that we are applying every day to everything we do. One of them is "The Beyonce Rule" which got its name from the EXCELLENT book "Software Engineering at @Google "... #devops #programming #coding #conventions
1
6
6
@zimmskal
Markus Zimmermann
2 years
Saying goodbye to an old friend the only way I truly can: with a #devmeme. We refactored and deleted lots of code to make the @symflower CLI more efficient. My friend, FillDependencies (2016-2022), had to go to make room for a more memory-efficient future. #golang #coding
1
2
6
@zimmskal
Markus Zimmermann
2 years
@apenwarr @Tailscale `Tailscale raises $100M to finally move the internet to IPv6 with people thinking they are on IPv4. "It's high time to give this idea another go", says a fake quote by their CEO.`
0
1
6
@zimmskal
Markus Zimmermann
3 years
@badamczewski01 Wait until you have been doing Go for a few years and try to read some of your old code. Found that most of the time I wouldn’t change a thing. And that I can still read the stuff I wrote! Maintainability is amazing when you don't have lots of choices but lots of power in functionality 🥳
1
0
6
@zimmskal
Markus Zimmermann
1 year
@ID_AA_Carmack I'm sorry, Dave, errrrr John! I'm afraid I need to reboot.
0
0
6
@zimmskal
Markus Zimmermann
2 years
If you are doing #golang or #java , give a try. Evelyn demonstrates 👇 a full workflow on how to work with generated unit tests, how to validate your code and debug problems. If you are writing tests by hand, you will be surprised, I promise 😉
@symflower
Symflower
2 years
Ever wished you could just generate your #unittests instead of painfully writing them? In this video 👇 Evelyn demonstrates how to use to speed up your daily development workflow 🚀✨ #golang #java
0
0
5
1
0
4
@zimmskal
Markus Zimmermann
2 years
So close 😍 If you haven't tried our #vscode extension for generating #java and #golang unit tests yet, give it a try! Let me know how it went and if you are missing something. We really appreciate your feedback 🥳
Tweet media one
1
0
6
@zimmskal
Markus Zimmermann
2 years
Looking forward to adding more convenience functionality to @symflower like this progress indicator. Let me know if you are missing any additional functionality to be more productive in #golang or #java . Looking forward to announcing some exciting #debugging features next
@symflower
Symflower
2 years
Due to popular request we added a progress indicator to #vscode while is generating tests. This feature is coming to #intellij as well, together with more detailed progress reports. Give it a whirl and let us know how it works for you! #coding #devops
Tweet media one
0
0
4
0
0
5
@zimmskal
Markus Zimmermann
2 years
@maciejwalkowiak We bought it for everyone who wanted it. Heard only good feedback; the examples look IMO nice, but I did not walk through it
0
0
5
@zimmskal
Markus Zimmermann
3 years
@francesc @Goodwine Author here: WOW, am I dreaming? Francesc, thanks! Would be great if you could do an episode on mutation testing, IMHO one of the most important kinds of test coverage. Hope the tool works for your examples. Currently busy building another product, hope to get back to go-mutesting in Oct.
1
0
5
@zimmskal
Markus Zimmermann
8 months
@tagir_valeev Chesterton's Fence at its best. When I see something like that, I usually add a comment right there and a test to never have such a "fix" merged, aka doing the Beyonce rule. At this point I think such tests are the best documentation.
1
0
5
@zimmskal
Markus Zimmermann
12 days
Just when I thought that we could close down our model list for the eval, @AnthropicAI pulls one more out of the hat 🎩🐰🪄 LOOK at those coding scores! Let's benchmark 🏇
@zimmskal
Markus Zimmermann
12 days
DeepSeek-Coder-V2 might be taking the king’s crown 👑 from Llama 3 70B soon - 📈 Shares almost the same score (19980) as Claude 3 Opus (19954), a leap over GPT-4o (19236) - 💵 However, FAR CHEAPER: $0.42 vs $90 vs $20 - 🔓 Open-weight and allows commercial use This is an excerpt
Tweet media one
9
34
214
1
0
5
@zimmskal
Markus Zimmermann
2 years
This is @symflower's CI/CD pipeline. When I started, we had one job that did everything at once. Now we have multiple editors, OSes, services, ... it is wild how much automation is in that one screenshot. Care to share your CI? Any learnings? #devops #gitlab
Tweet media one
5
0
5
@zimmskal
Markus Zimmermann
3 years
💉💉💉 third one is Pfizer 🥳
2
0
5
@zimmskal
Markus Zimmermann
2 years
@brunoborges At my last conference I stayed almost all the time at our booth (which was not cheap), saw three talks (the first, the last and one with my colleague) and I held my own talk. Your suggestion makes that XP completely terrible because I cannot even watch what I missed. Pay to miss
1
0
5
@zimmskal
Markus Zimmermann
3 years
@RandallKanna ... and that just means that something was not "perfect" before. What a dull world it would be if people wouldn't build on failures to produce something better ... Hope to lead by example here that not everyone is shaming failure
0
0
5
@zimmskal
Markus Zimmermann
3 years
@starbuxman Lol, I am old enough to remember when people thought XML was a good idea 😂
1
0
5
@zimmskal
Markus Zimmermann
2 years
@grashalm_ Where is that?
3
0
5
@zimmskal
Markus Zimmermann
1 year
We are looking for software engineers who are interested in developing tools and source code analyses for software developers. DM me if you want to know more! #jobs #jobsearch #jobhunt #hiring #career
@symflower
Symflower
1 year
We’re looking for Algorithm Developers! Are you familiar with programming, logic, and formal methods? 🤓 Join the Symflower team and help us make developers’ lives easier! 👇
1
0
3
3
1
5
@zimmskal
Markus Zimmermann
2 years
@jasongorman Would be interesting to know why you are removing the code. Has the specification changed? When I do TDD I only write tests for the behavior I am expecting with the current step; the rest I try to generate. Nothing changes, unless the API|specification changes or I clean up something
0
0
4
@zimmskal
Markus Zimmermann
2 years
@moyix Welcome back fellow bad programmer! We have tears, ice cream and beer to get you started 🍻
1
0
5
@zimmskal
Markus Zimmermann
2 years
@tagir_valeev Was that a user report or did a fuzzer/symbolic execution find it?
1
0
5
@zimmskal
Markus Zimmermann
3 years
@katharineCodes It's fine. Leave everything behind. Join the Gopher side, join #golang
0
1
5
@zimmskal
Markus Zimmermann
11 days
@VictorTaelin It is true that this is the case for the non-functional metrics. But I can always quote the reason I got into benchmarking LLMs, "costs matter": look at Even with that simple version of the eval, it was clear that GPT-4 was far more cost-effective. But yeah, now
@zimmskal
Markus Zimmermann
2 months
Llama 3 70B takes the king’s crown 👑 from GPT-4 Turbo - 100% in coverage and 70% in quality of code - 💵 most cost-effective inference - open-weight We tested @OpenAI's GPT 3.5 & 4, @Meta's Llama 3, @Google's Gemini 1.5 Pro, @cohere's Command R+ and 130+ other LLMs
Tweet media one
20
85
568
1
0
5
@zimmskal
Markus Zimmermann
3 years
@robertbalazsi @jones_spencera Problem is that it takes enormously more time when you want to add it afterwards. Doing it right from the start does not make you slower; it makes you much more productive. You do not have to create perfect code: some linting and tests are enough to *have* quality
2
0
5
@zimmskal
Markus Zimmermann
2 months
@mboehme_ Both have value, unless they are truly always consistently and deterministically outperformed. Horse vs car. There are lots of cases where (some) LLMs fail. The more one looks into that the more you see problems that can be (and should be) improved. E.g.
1
0
4
@zimmskal
Markus Zimmermann
2 years
@iospaulz Wow, pretty hard. How does it work out? Did you delete something that is still used? I think you could call that the fearless version of the "Beyoncé rule"😀 We enforce that rule at @symflower . Wrote about it here It works pretty well, lots of tests added
@zimmskal
Markus Zimmermann
3 years
1/7 At @symflower we have multiple principles that we are applying every day to everything we do. One of them is "The Beyonce Rule" which got its name from the EXCELLENT book "Software Engineering at @Google "... #devops #programming #coding #conventions
1
6
6
0
1
5
@zimmskal
Markus Zimmermann
3 years
1/2 One little-known feature of @gitlab that helps us every day is using `<details>...</details>` to hide content in descriptions/comments. We often have longer examples that need to be broken down into smaller parts for ... #gitlab #coding #programming
1
5
5