Markus Zimmermann

@zimmskal

939
Followers
856
Following
165
Media
3,264
Statuses

CTO and Founder at Symflower. Benchmarking LLMs to check how well they write quality code.

Linz
Joined November 2010
Pinned Tweet
@zimmskal
Markus Zimmermann
11 days
Crowning our new king 👑 @deepseek_ai's Coder-V2, king of cost-effectiveness! - 🥈 @AnthropicAI's Claude 3.5 Sonnet (and Claude 3 Opus) have similar functional scores - 🍂 Our old king @Meta's Llama 3 70B has lost much ground but is still a “small” great model! - 🐰 @GoogleAI's Gemini
Tweet media one
3
16
108
@zimmskal
Markus Zimmermann
2 months
Llama 3 70B takes the king’s crown 👑 from GPT-4 Turbo - 100% in coverage and 70% in quality of code - 💵 most cost-effective inference - open-weight We tested @OpenAI's GPT 3.5 & 4, @Meta's Llama 3, @Google's Gemini 1.5 Pro, @cohere's Command R+ and 130+ other LLMs
Tweet media one
20
85
568
@zimmskal
Markus Zimmermann
2 months
We tested @OpenAI's GPT 3.5 & 4, @Meta's Llama 3, @Google's Gemini 1.5 Pro, @cohere's Command R+ and 130+ other LLMs on writing code. Deep dive blog post soon 🏇 In the meantime: a screenshot of the results (more points is better)
Tweet media one
38
40
446
@zimmskal
Markus Zimmermann
12 days
DeepSeek-Coder-V2 might be taking the king’s crown 👑 from Llama 3 70B soon - 📈 Shares almost the same score (19980) as Claude 3 Opus (19954), a leap over GPT-4o (19236) - 💵 However, FAR CHEAPER: $0.42 vs $90 vs $20 - 🔓 Open-weight and allows commercial use This is an excerpt
Tweet media one
9
34
214
@zimmskal
Markus Zimmermann
2 months
@mitchellh More FPS means less CPU spent on single frames, right? Scaling that to 100ks of users means less energy used in general. Every watt counts. Doing your part for the <2°C goal. Or am I too optimistic here about why optimizations matter? 😉
3
2
76
@zimmskal
Markus Zimmermann
2 years
@golang I count "not discussing which language features to use while implementing an application" as my top reason.
1
2
38
@zimmskal
Markus Zimmermann
2 years
@moyix Makes efforts like reverse engineering Super Mario 64 even more impressive. is old but gold to read. There is even someone optimizing for speed... for a while now 🤓
2
2
32
@zimmskal
Markus Zimmermann
2 years
Wow. The only "negative feedback" for @Tailscale is a list of bugs that can be fixed and that I am sure everyone is already working on. Truly a product to admire and learn from. @apenwarr any tips on how you guys made it _that_ good? Any product management learnings maybe?
1
3
28
@zimmskal
Markus Zimmermann
2 years
@_JacobTomlinson Granularity of commits helps on so many levels it is absurd to not want it. Faster reviewing is just one massive advantage but I give you two other words: git bisect
1
0
26
@zimmskal
Markus Zimmermann
2 months
GPT-4o is 1.55x faster than GPT-4 Turbo - 🌬️ Faster than @AnthropicAI's Claude 3 Haiku, but slower than @Meta's Llama 3 70B 🐌 - 🐰 Speed is important for interactive use cases (e.g. autocomplete) and workloads that are time critical Cheers to @FireworksAI_HQ for being the
Tweet media one
@zimmskal
Markus Zimmermann
2 months
GPT-4o is drastically more cost-effective than GPT-4 Turbo - 💸 Half the price of GPT-4 Turbo - 💯 Highest score in the DevQualityEval v0.4.0 - 📉 Still not as cost-effective as @Meta's Llama 3 70B or @AnthropicAI's Claude 3 Haiku (see thread 👇)
Tweet media one
5
2
14
1
6
22
@zimmskal
Markus Zimmermann
3 years
@RandallKanna I find it ridiculous that we are still in a world where people are ashamed of failing or shame others. What is the point in that? Failing just means that you can learn something or that you can teach and help someone else. One should embrace getting better together...
1
0
20
@zimmskal
Markus Zimmermann
2 months
You can read even more details and findings in the full deep dive on Did you see other findings and problems that are interesting? Please let us know how you like this form of article. We will do at least one dive per eval version.
1
1
18
@zimmskal
Markus Zimmermann
2 years
@RayGesualdo Have you heard about CI/CD with automated testing, our lord and savior?
2
0
19
@zimmskal
Markus Zimmermann
2 months
Llama 3 70B found a non-obvious constructor test case in Java. Surprising, but valid! It also writes high-quality test code 70% of the time.
Tweet media one
2
1
17
@zimmskal
Markus Zimmermann
2 months
💵 Cost matters! GPT-4 (150 score at $40 per million tokens) and Claude 3 Opus (142 at $90) are good but 25 to 55 times more expensive than Llama, Wizard and Haiku 😱 Compare on a logarithmic scale:
Tweet media one
1
1
16
@zimmskal
Markus Zimmermann
3 years
@gunnarmorling Using less existing code means reinventing the wheel again and again. Hence, the same and more bugs and LOADS of development time wasted on the same features. Also learning even more APIs. I'd rather spend time helping get rid of the remaining issues, helping everyone.
0
0
15
@zimmskal
Markus Zimmermann
2 years
@mitsuhiko I see it almost as an ORM layer for clients. If you need "joins" over multiple tables regularly, I think it is a great fit. Better than REST.
2
0
16
@zimmskal
Markus Zimmermann
3 years
Finally upgraded all our hosts, VMs, ... to @openSUSE 15.3. Except for libvirt (firewall and Vagrant) and lots of repository conflicts it was smooth as butter. Already enjoying the new performance and features 🥳 Thanks! #openSUSE #devops
1
4
13
@zimmskal
Markus Zimmermann
2 months
GPT-4o is drastically more cost-effective than GPT-4 Turbo - 💸 Half the price of GPT-4 Turbo - 💯 Highest score in the DevQualityEval v0.4.0 - 📉 Still not as cost-effective as @Meta's Llama 3 70B or @AnthropicAI's Claude 3 Haiku (see thread 👇)
Tweet media one
5
2
14
@zimmskal
Markus Zimmermann
2 months
With linear scale:
Tweet media one
1
0
11
@zimmskal
Markus Zimmermann
2 months
Fine-tuning makes a big difference in the quality of test code: WizardLM-2 8x22B beats Mixtral 8x22B-Instruct with a 30% better score.
1
0
12
@zimmskal
Markus Zimmermann
3 years
@editingemily Wait... You find leap days strange? You are in for a treat: take a look at leap seconds. They lead to some serious problems, e.g. search for MySQL and leap second. Handling leap years is kind of a no-brainer in comparison to handling leap seconds. I bet that most validations are wrong
2
1
12
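The leap-second gotcha from the tweet above is easy to demonstrate: a minimal Python sketch (my own illustration, unrelated to the MySQL incident mentioned) showing that common datetime handling cannot even represent a leap second.

```python
from datetime import datetime

# Python's datetime requires second to be in 0..59, so the real UTC
# leap second 2016-12-31T23:59:60 is rejected outright with ValueError.
try:
    datetime(2016, 12, 31, 23, 59, 60)
    leap_second_accepted = True
except ValueError:
    leap_second_accepted = False

print(leap_second_accepted)  # False
```

Any validation built on such types silently assumes `:60` never happens, which is exactly why "most validations are wrong".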
@zimmskal
Markus Zimmermann
2 years
@_JacobTomlinson If your experience is to have such commits, I am sorry. But there is a clear way to incrementally build up a PR with isolated commits. We do that daily. No problems, just advantages. Tooling then also makes a lot automatable in comparison e.g.
2
1
12
@zimmskal
Markus Zimmermann
2 months
Output prices are only one half of the story when using an API provider. Here is a graph of input+output prices for @Meta's Llama 3 70B across all available providers vs @OpenAI's GPT-4 Turbo vs @AnthropicAI's Claude 3 Opus & Haiku. Costs matter. Taking inference speed as another axis would be cool!
Tweet media one
@awnihannun
Awni Hannun
2 months
This is an important chart for LLMs. $/token for high quality LLMs will probably need to fall rapidly. @GroqInc leading the way.
Tweet media one
25
73
615
2
2
12
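The "input+output prices" point above can be sketched in a few lines: a blended cost per request combines both token prices. This is my own illustration, not the eval's code, and the per-million-token prices in the table are placeholders, not current quotes from any provider.

```python
# Illustrative (input $, output $) prices per 1M tokens -- placeholders.
PRICES_PER_MTOK = {
    "provider-a/llama-3-70b": (0.59, 0.79),
    "gpt-4-turbo": (10.00, 30.00),
    "claude-3-opus": (15.00, 75.00),
}

def blended_cost(model: str, in_tok: int, out_tok: int) -> float:
    """Total $ for a request, combining input and output token prices."""
    p_in, p_out = PRICES_PER_MTOK[model]
    return (in_tok * p_in + out_tok * p_out) / 1_000_000

# Example: a coding prompt of 3000 tokens with a 1000-token completion.
for model in PRICES_PER_MTOK:
    print(f"{model}: ${blended_cost(model, 3000, 1000):.4f}")
```

Because completions for coding tasks are often long, the output price can dominate, so comparing providers on output price alone is misleading.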
@zimmskal
Markus Zimmermann
24 days
We are testing @OpenAI's GPT-4o, @Meta's Llama 3, @Google's Gemini 1.5 Flash, @AnthropicAI's Claude 3 Haiku, @SnowflakeDB's Arctic and 150+ other LLMs on writing code. Deep dive blog post soon 🏇 In the meantime, screenshots of an interesting finding (more points is better): choose
Tweet media one
Tweet media two
0
0
11
@zimmskal
Markus Zimmermann
2 years
@gunnarmorling IIRC the scheduler then works smarter to distribute the load. You also do not want a CPU/memory hog that makes your node unusable for other pods. Happens to us all the time. We have a specific CI job that would eat all the CPU and memory in the world if we allowed it.
2
0
11
@zimmskal
Markus Zimmermann
2 years
Finally read by @apenwarr. The @Tailscale blog is always worth reading but I especially enjoy what Avery writes. Clear content, vision and mission, so much to learn from, every time. Also heard only good things. Truly a manager I admire & aspire to. Thank you!
1
4
11
@zimmskal
Markus Zimmermann
2 months
Models with smaller parameter size like Gemma 7B, Llama 3 8B and WizardLM 2 7B (take a look at this case 👇) have a problem finding compilable code, but Mistral 7B does well.
Tweet media one
1
0
10
@zimmskal
Markus Zimmermann
2 months
We looked at 138 LLMs but about 80 do not even provide reliable test generation for simple cases. Score below 85 == model did not do well
Tweet media one
2
0
9
@zimmskal
Markus Zimmermann
3 years
@anammostarac @visakanv Know two grandparents that lived in total isolation. Only groceries from their children were the connection to the outside. They died of covid. Lowering the probability of getting sick is not a waste of time.
0
0
10
@zimmskal
Markus Zimmermann
2 years
@bradfitz Seeing this i have just one conclusion: I have been writing user stories totally wrong all the time!
0
0
10
@zimmskal
Markus Zimmermann
3 months
@mitsuhiko Totally agree. Even slightly more complicated setups are now super easy. MUCH easier than a decade ago. There is also "the move" of @dhh with to make more complicated setups work. I have been riding the Hetzner train for almost forever. Still necessary. $$$
0
0
10
@zimmskal
Markus Zimmermann
2 months
Only 44.78% of the code responses actually compiled. Responses that did not compile were strong hallucinations, but some were really close to compiling. 95.05% of all compilable code reached 100% coverage. With more tasks and cases we argue that this will get worse fast.
1
0
7
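The two headline rates in the tweet above (share of responses that compile, and share of compilable code reaching 100% coverage) can be sketched as a simple aggregation. The record format below is a hypothetical illustration, not the actual DevQualityEval data model.

```python
# Hypothetical per-response eval records: did the generated code compile,
# and what statement coverage did its tests reach?
results = [
    {"compiles": True,  "coverage": 100.0},
    {"compiles": True,  "coverage": 80.0},
    {"compiles": False, "coverage": 0.0},   # hallucinated code
    {"compiles": False, "coverage": 0.0},
]

compiled = [r for r in results if r["compiles"]]
compile_rate = len(compiled) / len(results) * 100
full_coverage_rate = (
    sum(r["coverage"] == 100.0 for r in compiled) / len(compiled) * 100
)

print(f"{compile_rate:.2f}% compiled")             # 50.00% compiled
print(f"{full_coverage_rate:.2f}% full coverage")  # 50.00% full coverage
```

Note that the second rate is conditional on compiling, which is why a high coverage number can coexist with a low compile rate.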
@zimmskal
Markus Zimmermann
3 years
@openSUSE It's well deserved! Tumbleweed is pretty great, especially because of all the automatic testing that is going on. It often feels like SUSE is one step ahead here, or is that because other distros are not blogging about testing? Btw, why is there growth beginning with 2021-03?
1
0
9
@zimmskal
Markus Zimmermann
2 months
@realshojaei @OpenAI @Meta @Google @cohere Twitter image handling sucks. You need to open it, and then open the image in your browser as a new tab/window: there you can now zooooom. Enjoy. Will upload an SVG with the blog post though, so you can zoom in as long as you want
1
0
9
@zimmskal
Markus Zimmermann
2 months
GPT-4 Turbo by default writes obvious comments in test code, which must be avoided in quality code.
Tweet media one
2
0
8
@zimmskal
Markus Zimmermann
2 years
@moyix Found the optimization video again. If you played the game (or still do 👋) or are just curious how weird the N64 platform is, I can recommend it. "Vroom vroom"
0
0
9
@zimmskal
Markus Zimmermann
2 years
@RayGesualdo Do NOT anger them, they just haunt you back with failing CI Pipelines and lots of flaky tests. Please them with more automation
1
0
8
@zimmskal
Markus Zimmermann
11 days
@natfriedman what is your take on why current benchmarks are not-enough/bad? Just the score ceiling? ... asking for a friend 🙃
1
0
8
@zimmskal
Markus Zimmermann
3 years
@vlad_mihalcea IMHO I either know the names I want to go for right away, or I use some longer name that describes the behavior from my POV, and then almost all the time someone during review suggests a much better name. #teamwork
0
0
8
@zimmskal
Markus Zimmermann
12 days
Congratulations @deepseek_ai for having the highest functional score for the DevQualityEval. Another benchmark! - Still checking the quality of the results, so there might be a surprise there - Coder is much chattier than GPT-4o (+46% more characters) - DeepSeek-Coder is much slower
1
0
8
@zimmskal
Markus Zimmermann
3 years
@euronews @esa Strange how something can be at the same time awesome to look at and scary and worrisome...
0
0
8
@zimmskal
Markus Zimmermann
2 years
@DrewTumaABC7 What is the reason that San Francisco has 65 and Portland 88? And then again Palm Springs has 113?
13
0
7
@zimmskal
Markus Zimmermann
2 years
@QuinnyPig Mark it as "beta". Problem solved.
0
0
8
@zimmskal
Markus Zimmermann
2 months
@HochreiterSepp Congratulations. Can you give me API access for a bit so I can add it to this evaluation benchmark?
@zimmskal
Markus Zimmermann
2 months
Llama 3 70B takes the king’s crown 👑 from GPT-4 Turbo - 100% in coverage and 70% in quality of code - 💵 most cost-effective inference - open-weight We tested @OpenAI's GPT 3.5 & 4, @Meta's Llama 3, @Google's Gemini 1.5 Pro, @cohere's Command R+ and 130+ other LLMs
Tweet media one
20
85
568
2
0
8
@zimmskal
Markus Zimmermann
3 years
No more out-of-sync web tests for us, no more flaky ... OK ... some flaky tests, but it's a lot better than before. Check out why we chose @Cypress_io for some of our web frontend tests. 👇 #cypress
@symflower
Symflower
3 years
@Cypress_io offers an efficient workflow for writing UI E2E tests. Take a look for more reasons why we have chosen Cypress over Jasmine: #cypress #testing #angular
0
0
1
0
2
6
@zimmskal
Markus Zimmermann
4 years
@ddprrt Standardisation over the whole stack and tooling + conventions about formatting, linting, ... (if you follow the Angular devs). I mean it can still be improved in lots of areas but getting rid of decisions and focusing on the requirements leads to faster iterations and onboarding
1
0
7
@zimmskal
Markus Zimmermann
26 days
Finally! The new DevQualityEval v0.5.0 got released. We have some juicy 🧃 additions for the whole LLM community: - Ollama 🦙 support - Support for any OpenAI API inference endpoint 🧠 - Mac and Windows 🖥️ support - More complex "write test" task cases 🔢 for both Java and Go -
2
0
7
@zimmskal
Markus Zimmermann
2 years
@maciejwalkowiak Why a book? We have onboarded almost everyone here with this list; it adds more context and conventions with every link. What don't you like? Maybe I can help
1
2
7
@zimmskal
Markus Zimmermann
2 years
@lukaseder Double-negatives...
Tweet media one
0
1
7
@zimmskal
Markus Zimmermann
2 years
@KevinNaughtonJr At this point I am really wondering if it is mostly reading code and text. And thinking about code and text. I have a mail and review day today; most of the time I am staring at red and green shaded characters. 🤨
0
0
6
@zimmskal
Markus Zimmermann
2 years
Just gave @github's #codespaces a try with @symflower: works out of the box! And it is really fast 🍻 It is "just" #vscode with @ubuntu. However, we will add it to our environments with a default setup so you can play around with Symflower 😍
Tweet media one
0
0
7
@zimmskal
Markus Zimmermann
3 years
Yes, the current #Log4j CVE is amazingly bad but what I can't stand is the hatred towards the developers. It's an OPEN SOURCE project! Most likely with almost no full-time paid people, as is usual with OSS. So: instead, help them out! #Java #OpenSource
0
1
7
@zimmskal
Markus Zimmermann
2 years
An additional benefit of Avery's content is the constant stream of jokes. Just a short quote: "We are Canadians, we hate fighting, it feels like work." 😃
0
1
7
@zimmskal
Markus Zimmermann
1 year
@kentcdodds *old dude voice* ... and soo began the great spaces to tabs migration of Kent et al...
0
0
7
@zimmskal
Markus Zimmermann
2 years
@maciejwalkowiak Sounds better suited as part of an integration test. But if you really need to have a unit test, maybe you can just generate them with Would be happy to hear your feedback
0
0
7
@zimmskal
Markus Zimmermann
2 months
If you use GPT-4 for coding, switch to GPT-4o right NOW 🚀: GPT-4o writes BETTER code 📈 cheaper & faster! Here is a peek at a more complex scenario of the next DevQualityEval version. Guess which version is GPT-4 Turbo, and which is GPT-4o? Which one do you think is better?
Tweet media one
Tweet media two
1
1
7
@zimmskal
Markus Zimmermann
2 years
#linux is so much better for UI system testing than #macos and #windows . I can parallelize jobs like crazy using Kubernetes. I can just boot up X and a window manager in a container. No weird "must be logged in" or "screensaver" problems. No weird permission problems. And and and
2
3
6
@zimmskal
Markus Zimmermann
2 years
Want to show your software testing magic? We have a challenge for you! And if you beat @symflower , there is also a t-shirt 👕 for you in it 🥰 #100daysofcode #codenewbie #learntocode #java
@symflower
Symflower
2 years
Heard of Fizz buzz, but have already solved it? 🤔 We have a challenge for you! If you can find a test case that gives you more coverage than the existing test cases, we’ll surprise you with a t-shirt! 👕 #100daysofcode #WomenWhoCode #codenewbie #learntocode #java
Tweet media one
Tweet media two
1
4
5
0
5
5
@zimmskal
Markus Zimmermann
1 month
@maximelabonne Working on a better eval for software-development-related tasks (not just code gen). Right now we let the model write tests and we score how well they do that; next version we are adding more assessments (obvious comments, dead code, ...) and tasks
1
2
7
@zimmskal
Markus Zimmermann
2 years
@dascandy42 We looked into that and IMO it is (was?) bad for the user's experience. If you do not have WSL set up, what then? Rather have our users using the apps in one click.
0
0
6
@zimmskal
Markus Zimmermann
26 days
@huybery Congrats! Blog post reads extremely promising. I am looking forward to running it in our evaluation, maybe it can dethrone our current king model: Llama3-70B 🏇
1
0
6
@zimmskal
Markus Zimmermann
2 years
Pro tip: if you are waiting for a command in your console to finish, take the microsecond to press enter to actually run it 🫣 definitely going to bed now ...
0
0
5
@zimmskal
Markus Zimmermann
6 months
batman-POW! @golang just announced the first 1.22 release candidate and it looks to me like they squeezed two major versions into one! It is truly packed 🌋 Congratulations to everyone involved making #go more awesome!
1
0
6
@zimmskal
Markus Zimmermann
3 years
@luca_cloud Thanks for posting. Will be my good night story today. How did you get access to these teams? On what level did you communicate? IMHO people tell different things on different levels.
1
0
6
@zimmskal
Markus Zimmermann
3 months
If you are interested in LLMs/AI-in-general, code-generation or code-quality: Just released "An evaluation benchmark for LLMs and friends to compare and evolve code quality of code generation." The first release v0.1.0 is just there to ...
2
1
5
@zimmskal
Markus Zimmermann
2 months
@mitchellh @davidcrawshaw That is what I would expect idle to do (OK, I would expect 0% CPU when nothing is happening, but 1% is good for now). Compare it with web pages: I have one Reddit page open where right now no change happens. Takes a **whole** core. 100%. No idea why. The climate is crying, Reddit.
1
0
6
@zimmskal
Markus Zimmermann
12 days
@GoogleAI 's Gemini Pro 1.5 is doing well, but Flash 1.5 is doing much better! - 16% better score for Flash than Pro - Flash is much better for Java (+23%) but roughly the same for Go (+3%) @OfficialLoganK was surprised by Flash. Shouldn’t Pro be better?
1
0
6
@zimmskal
Markus Zimmermann
24 days
@gwenshap Using @OpenRouterAI for most of our evals. @FireworksAI_HQ and @GroqInc are also very nice (for Llama 3 at least). Currently looking at @huggingface for models that are not available somewhere else. Some parts with @ollama / MLX on rented servers for CPU-only evals.
0
0
6
@zimmskal
Markus Zimmermann
2 months
How to optimize Ollama for CPU-only inference? I am wondering if anyone has documentation, blog posts, ideas, ... really anything(!) about optimizing inference for CPU-only. The reason is that most people do not have nice GPU(s) in their notebooks and rely on API providers or
2
0
6
@zimmskal
Markus Zimmermann
3 years
1/7 At @symflower we have multiple principles that we are applying every day to everything we do. One of them is "The Beyonce Rule" which got its name from the EXCELLENT book "Software Engineering at @Google "... #devops #programming #coding #conventions
1
6
6
@zimmskal
Markus Zimmermann
2 years
Saying goodbye to an old friend the only way I truly can: with a #devmeme. We refactored and deleted lots of code to make the @symflower CLI more efficient. My friend, FillDependencies (2016-2022), had to go to make room for a more memory-efficient future. #golang #coding
1
2
6
@zimmskal
Markus Zimmermann
2 years
@apenwarr @Tailscale `Tailscale raises $100M to finally move the internet to IPv6 with people thinking they are on IPv4. "It's high time to give this idea another go", says a fake quote by their CEO.`
0
1
6
@zimmskal
Markus Zimmermann
3 years
@badamczewski01 Wait until you have been doing Go for a few years and try to read some of your old code. Found that most of the time I wouldn’t change a thing. And that I can still read the stuff I wrote! Maintainability is amazing when you don't have lots of choices but lots of power in functionality 🥳
1
0
6
@zimmskal
Markus Zimmermann
1 year
@ID_AA_Carmack I'm sorry, Dave, errrrr John! I'm afraid I need to reboot.
0
0
6
@zimmskal
Markus Zimmermann
2 years
If you are doing #golang or #java , give a try. Evelyn demonstrates 👇 a full workflow on how to work with generated unit tests, how to validate your code and debug problems. If you are writing tests by hand, you will be surprised, I promise 😉
@symflower
Symflower
2 years
Ever wished you could just generate your #unittests instead of painfully writing them? In this video 👇 Evelyn demonstrates how to use to speed up your daily development workflow 🚀✨ #golang #java
0
0
5
1
0
4
@zimmskal
Markus Zimmermann
2 years
So close 😍 If you haven't tried our #vscode extension for generating #java and #golang unit tests yet, give it a try! Let me know how it went and if you are missing something. We really appreciate your feedback 🥳
Tweet media one
1
0
6
@zimmskal
Markus Zimmermann
2 years
Looking forward to adding more convenience functionality to @symflower like this progress indicator. Let me know if you are missing any additional functionality to be more productive in #golang or #java . Looking forward to announcing some exciting #debugging features next
@symflower
Symflower
2 years
Due to popular request we added a progress indicator to #vscode while is generating tests. This feature is coming to #intellij as well, together with more detailed progress reports. Give it a whirl and let us know how it works for you! #coding #devops
Tweet media one
0
0
4
0
0
5
@zimmskal
Markus Zimmermann
2 years
@maciejwalkowiak We bought it for everyone who wanted it. Heard only good feedback; the examples look IMO nice, but I did not walk through it
0
0
5
@zimmskal
Markus Zimmermann
3 years
@francesc @Goodwine Author here: WOW, am I dreaming? Francesc, thanks! Would be great if you could do an episode on mutation testing, IMHO one of the most important kinds of test coverage. Hope the tool works for your examples. Currently busy building another product, hope to get back to go-mutesting in Oct.
1
0
5
@zimmskal
Markus Zimmermann
8 months
@tagir_valeev Chesterton's Fence at its best. When I see something like that, I usually add a comment right there and a test to never have such a "fix" merged, aka doing the Beyonce rule. At this point I think such tests are the best documentation.
1
0
5
@zimmskal
Markus Zimmermann
12 days
Just when I thought that we could close down our model list for the eval, @AnthropicAI pulls one more out of the hat 🎩🐰🪄 LOOK at those coding scores! Let's benchmark 🏇
@zimmskal
Markus Zimmermann
12 days
DeepSeek-Coder-V2 might be taking the king’s crown 👑 from Llama 3 70B soon - 📈 Shares almost the same score (19980) as Claude 3 Opus (19954), a leap over GPT-4o (19236) - 💵 However, FAR CHEAPER: $0.42 vs $90 vs $20 - 🔓 Open-weight and allows commercial use This is an excerpt
Tweet media one
9
34
214
1
0
5
@zimmskal
Markus Zimmermann
2 years
This is @symflower's CI/CD pipeline. When I started, we had one job that did everything at once. Now we have multiple editors, OSes, services, ... it is wild how much automation is in that one screenshot. Care to share your CI? Any learnings? #devops #gitlab
Tweet media one
5
0
5
@zimmskal
Markus Zimmermann
3 years
💉💉💉 third one is Pfizer 🥳
2
0
5
@zimmskal
Markus Zimmermann
2 years
@brunoborges At my last conference I stayed almost all the time at our booth (which was not cheap), saw three talks (the first, the last and one with my colleague) and I held my own talk. Your suggestion makes that XP completely terrible because I cannot even watch what I missed. Pay to miss
1
0
5
@zimmskal
Markus Zimmermann
3 years
@RandallKanna ... and that just means that something was not "perfect" before. What a dull world it would be if people wouldn't build on failures to produce something better ... Hope to lead by example here that not everyone is shaming failure
0
0
5
@zimmskal
Markus Zimmermann
3 years
@starbuxman Lol, I am old enough to remember when people thought XML was a good idea 😂
1
0
5
@zimmskal
Markus Zimmermann
2 years
@grashalm_ Where is that?
3
0
5
@zimmskal
Markus Zimmermann
1 year
We are looking for software engineers who are interested in developing tools and source code analyses for software developers. DM me if you want to know more! #jobs #jobsearch #jobhunt #hiring #career
@symflower
Symflower
1 year
We’re looking for Algorithm Developers! Are you familiar with programming, logic, and formal methods? 🤓 Join the Symflower team and help us make developers’ lives easier! 👇
1
0
3
3
1
5
@zimmskal
Markus Zimmermann
2 years
@jasongorman Would be interesting to know why you are removing the code. Has the specification changed? When I do TDD I only write tests for the behavior I am expecting with the current step; the rest I try to generate. Nothing changes, unless the API|specification changes or I clean up something
0
0
4
@zimmskal
Markus Zimmermann
2 years
@moyix Welcome back fellow bad programmer! We have tears, ice cream and beer to get you started 🍻
1
0
5
@zimmskal
Markus Zimmermann
2 years
@tagir_valeev Was that a user report or did a fuzzer/symbolic execution find it?
1
0
5
@zimmskal
Markus Zimmermann
3 years
@katharineCodes It's fine. Leave everything behind. Join the Gopher side, join #golang
0
1
5
@zimmskal
Markus Zimmermann
11 days
@VictorTaelin It is true that this is the case for the non-functional metrics. But I can always quote the reason I got into benchmarking LLMs, "costs matter": look at Even with that simple version of the eval, it was clear that GPT-4 was far more cost-effective. But yeah, now
@zimmskal
Markus Zimmermann
2 months
Llama 3 70B takes the king’s crown 👑 from GPT-4 Turbo - 100% in coverage and 70% in quality of code - 💵 most cost-effective inference - open-weight We tested @OpenAI's GPT 3.5 & 4, @Meta's Llama 3, @Google's Gemini 1.5 Pro, @cohere's Command R+ and 130+ other LLMs
Tweet media one
20
85
568
1
0
5
@zimmskal
Markus Zimmermann
3 years
@robertbalazsi @jones_spencera Problem is that it takes enormously more time when you want to add it afterwards. Doing it right from the start does not make you slower; it makes you much more productive. You do not have to create perfect code: some linting and tests are enough to *have* quality
2
0
5
@zimmskal
Markus Zimmermann
2 months
@mboehme_ Both have value, unless they are truly always consistently and deterministically outperformed. Horse vs car. There are lots of cases where (some) LLMs fail. The more one looks into that the more you see problems that can be (and should be) improved. E.g.
1
0
4
@zimmskal
Markus Zimmermann
2 years
@iospaulz Wow, pretty hard. How does it work out? Did you delete something that is still used? I think you could call that the fearless version of the "Beyoncé rule"😀 We enforce that rule at @symflower . Wrote about it here It works pretty well, lots of tests added
@zimmskal
Markus Zimmermann
3 years
1/7 At @symflower we have multiple principles that we are applying every day to everything we do. One of them is "The Beyonce Rule" which got its name from the EXCELLENT book "Software Engineering at @Google "... #devops #programming #coding #conventions
1
6
6
0
1
5
@zimmskal
Markus Zimmermann
3 years
1/2 One little-known feature of @gitlab that helps us every day is using `<details>...</details>` to hide content in descriptions/comments. We often have longer examples that need to be broken down into smaller parts for ... #gitlab #coding #programming
1
5
5