Protocol to PRISMA: A Live Systematic Review with Elicit

6,785 1:01:44 May 21, 2026

Description

A walkthrough of completing an end-to-end systematic review in Elicit in less than an hour. Recorded as a webinar on May 13, 2026.

Details

Published: May 21, 2026
Views: 6,785
Duration: 1:01:44

Transcript

Okay, I think we'll go ahead and get started. Thank you everyone for joining. So, a little context on what we're doing and why we're doing it. Last year we introduced our first version of the systematic review workflow on Elicit. Since then, we've been doing a lot to improve it. And last week we announced that it now supports PRISMA 2020 standards, so you can do very rigorous SLRs on Elicit and have the auditability and the traceability, along with the accuracy and performance, that you need to take your systematic review and feel confident submitting it to a regulatory body or as a research article. We have statistics, we have numbers on how well it works, but I think the best way to show it is a live demo. Can we do a systematic review, which usually takes 6-8 months and hundreds and hundreds of manual hours, in less than 1 hour? That's the purpose of what we're going to do today. Along the way, I'm going to show all the different features that we've added to Elicit systematic review. And I want to leave plenty of time for any questions that you might have. When we run certain phases, for example extraction, which will take a little bit longer, I'll tackle the Q&A that you have. So, if you want to add questions as they come up, go into the Q&A tab and add them there. And if you want to be anonymous, totally fine. Just make sure you hit that toggle. Without further ado, I'm going to go ahead and get started. Okay. Cool. So, this is the Elicit homepage. If you haven't used Elicit before, Elicit is trying to be the home where you can do evidence synthesis, where you can come and do fully rigorous systematic workflows as well as the quicker "help me find research in this area" or "is there any evidence that contradicts this," and everything beyond that in the life sciences.
So, we're going to focus on the systematic review workflow today. And the way we're going to do it is I'm going to take the protocol that I have in this tab, which I defined beforehand, and I'm going to put it in and we're going to run it in Elicit. So, to start with, the way that I would use Elicit here is I would take this research question that I have, go back to the home page, and then just paste that in. So, I have: among adults with type 2 diabetes initiating a GLP-1 receptor agonist in routine clinical practice, I'm really looking for persistence and adherence in real-world evidence over 12 to 24 months. Do they actually adhere to it in the real world? So, I'm going to start with that. And that's going to load up the review. And this is a new page we've added to the Elicit systematic review, where essentially you define and set up your protocol before you go into the actual stages. There are a lot of stages, there are a lot of features. This is the one place to go in and say, "This is what I want to do, and here's all the context I need." So, I put in the research question, as you can see. I'll come back to the additional context in a second. And then I will also come back to dual review, but we've added a feature where now you can use Elicit as decision support for two reviewers, or you can have Elicit be one reviewer and a human be the other. But what I really want to focus on here is the step configuration. A systematic review has many, many stages. And here we lay out all the stages that you can do in Elicit, and you can configure them. So, first we're going to start off with gathering all of our sources, then we're going to go into title and abstract screening. After that will be full-text screening. We'll extract the data from the papers that made it through all the screening, and finally we'll generate a narrative synthesis from the data. I want to call out a few toggles here. One is strict criteria.
This is also new, or newer, on Elicit. And essentially what this is: for a rigorous systematic review, if a paper fails a single eligibility criterion, then we have to fail it. I think that's the way that systematic reviews are done, and so we've added that. So, when you toggle this on, any paper that fails even a single criterion will automatically be excluded. For more rapid reviews, scoping reviews, and so on, you can turn this off, and papers that fail a couple but are still important or useful will make it through. And same with full-text screening, you can turn on strict criteria there. And then, in terms of extraction, a feature that we have on Elicit that we're very proud of is extracting from figures. So, if there are tables, charts, graphs, Kaplan-Meier curves, and so on, Elicit can look at the data in those charts and images, and pull out and estimate data for your extraction. I'm going to come back up here to the research question. So, additional context: this is actually where you can fit in all of your protocol, and then Elicit will help figure out the rest. So, okay, you have your PICO criteria or your PICO definitions. From that, what should the eligibility criteria be? There's a toggle here where Elicit will auto-suggest columns. If you use those in tandem, Elicit can take your protocol and come up with the criteria, and then you can modify those. So, that's exactly what I'm going to do. I'm going to go back to here. I have my PICO, I have my eligibility criteria. I'm just going to copy and paste those directly in. It's a bit of a text dump right now, but Elicit will understand this and will use it in the actual review. With that, let's go ahead and get started. Okay. So, now we're in the first stage, which is where we gather all the papers. Right now, this is doing a semantic, question-based search.
I'm just taking my research question and looking inside of our own internal corpus for all the papers that might be relevant. Elicit's internal corpus has about 138 million papers. So, I'm sure we'll find some great results. But for rigorous systematic reviews, keyword search really is the name of the game. So, we have the ability to do that as well. And so, what I'm going to do is add a new tab, and there are a couple of options here. I can do a semantic search, I can do another keyword search, or I can upload papers. I'm going to add papers from keyword search. And one thing to note is we've also added other databases. So, we have Elicit's internal research papers corpus, but we also have PubMed and we have ClinicalTrials.gov. For this systematic review, I'm going to focus on PubMed, and I have the specific search strategy also in the protocol doc, which I'm just pulling over. I can put that in. Search papers. It'll take a few seconds to go get the papers from PubMed and pull them in here. Now, one thing to note is that you can do multiple searches that all flow into this one tab, where you have all your papers gathered together, and we can actually support up to 40,000 papers per systematic review. And so, that's where you would start with them all here, whether that's from your semantic searches over Elicit, your keyword searches, or, if you want to do something on Embase or another database that we don't have in product yet, you can always go there, do the searches, and then upload papers: you can do a bib file, a RIS file, or if you have the direct PDFs, you can upload those as well. So, this is going to take a few more seconds to run here. I'm going to go ahead and delete this "among adults" one. It returned about 10,000 results. I'm sure many are irrelevant, but I would like to stick with the PubMed one for now.
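For readers who export results from a database that is not built in, it can help to sanity-check the RIS file before uploading it. The sketch below is an editor's illustration, not part of Elicit: `parse_ris` and the sample records are hypothetical, and it assumes the common RIS line layout of a two-character tag, two spaces, a hyphen, and a value, with `ER` closing each record.

```python
import re

# One RIS line: two-character tag, two spaces, a hyphen, optional value.
RIS_LINE = re.compile(r"^([A-Z][A-Z0-9])  - ?(.*)$")

def parse_ris(text):
    """Parse RIS text into a list of {tag: [values]} dicts (hypothetical helper)."""
    records, current = [], {}
    for line in text.splitlines():
        match = RIS_LINE.match(line.rstrip())
        if not match:
            continue  # skip blank or non-tag lines
        tag, value = match.group(1), match.group(2).strip()
        if tag == "ER":  # end of record
            if current:
                records.append(current)
            current = {}
        else:
            current.setdefault(tag, []).append(value)
    return records

# Made-up sample export with two records.
SAMPLE = "\n".join([
    "TY  - JOUR",
    "TI  - Hypothetical study one",
    "AU  - Doe, J.",
    "AU  - Roe, R.",
    "PY  - 2021",
    "ER  - ",
    "TY  - JOUR",
    "TI  - Hypothetical study two",
    "PY  - 2022",
    "ER  - ",
])

records = parse_ris(SAMPLE)
print(len(records), "records;", "first title:", records[0]["TI"][0])
```

Counting records this way gives a number to compare against what the source database reported before the file goes into the review.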
Okay. And one statistic to note on the search: when we do Elicit's semantic search, we've tested it at about 5,000 and 10,000 papers, and at 5,000 we get about 95% of the papers that you'd want in your systematic review. But our keyword search has loaded, and we can see all the papers over in here, 984. And if you copy and paste this keyword query directly into PubMed, you'll get pretty much the exact same number of results. Okay? That's the search phase. We've gathered the papers that I'd like to use in this systematic review. Now, we're going to go on to the screening phase. So, on screening, of course, the objective is to take all of our papers and remove all the ones that we know we definitely do not want in our data analysis and extraction. Here, we're focusing on the title and abstract screening phase. And one thing we really focused on with Elicit is we try to make it as inclusive as possible, to get really high sensitivity. As I think is common knowledge, a false negative is probably the worst thing you can get in a systematic review. If you miss a paper that's really important, that could change your results drastically, versus false positives: you still don't want them, but they're a little bit easier to remove and segregate out. So, we've really erred towards making sure we get all the right papers in. The way screening works in Elicit is we have essentially eligibility criteria questions. You can add your own. You can see Elicit is suggesting some based off of the protocol I added in earlier. But the way it works is I'm going to ask a question like, is this study focusing on adults? And we can see a couple here. Appropriate population focus: is the study population not primarily composed of pediatric patients and some other groups that we don't want to see? Cohort design: is the study using... you know, these are compound questions that we can do in here.
Does the study investigate an approved GLP-1 receptor agonist? And the way it works is, you'll see it's already starting to populate. Each criterion is evaluated against the paper, and each criterion returns a yes, a no, or a maybe. If any of these are a no, for example this short-term cost paper, since it has a no in cohort design, we know it's just going to be removed entirely from the actual systematic review. So, I'm going to go ahead and evaluate screening and then I'll come back and explain more. So now it's taking those criteria and running them against all 984 abstracts. The previous screen, what it's doing is it's actually selected 100 randomly from the 984. And the purpose here is like a pilot phase. Essentially, what this is useful for is: we have these criteria, where are the maybes occurring? We can then go and see, okay, it's on the appropriate population focus. Is our criterion clear enough for Elicit to understand? Are there any areas where I disagree with it? For example, this no, actually I think it should be a yes based off of me reading the paper. Then I would go and see, okay, based off of that, do I need to modify Elicit's criterion? You modify it, you update, and you can do that until you feel comfortable and satisfied with how Elicit is performing. Okay. So now Elicit is going against all the abstracts. It's about 12% complete. It'll take a couple of minutes. So in this phase, we've done the pilot, and now it's running against all of them. That's how the screening works in the title and abstract phase. We've done an evaluation of this as well, which we published when we announced our release last week. So let me switch over to show you what we've found. Okay. So we evaluated title and abstract screening in Elicit against 108 Cochrane reviews, which in total comprised 6,093 individual papers that needed to be screened in or out.
And using our system, we got to a sensitivity (of all the papers that should be screened in, how many of them did we get?) of 96.9%, and a specificity (of all the papers that should be left out, how many did we leave out?) of 92.5%. And we think these are pretty good numbers. We compared it to a dual-review human benchmark that we found from Gartner et al. 2020 of about 97.5% sensitivity and 68.7% specificity for two humans doing systematic reviews. So we think we compare favorably on this dimension, but of course, all evaluations differ. All right. Any questions at this point while we wait for this to finish? Cool. I'll continue, but of course, feel free to add questions in the Q&A. So, I've talked a little bit about how the screening works in the title and abstract phase. I've shown some statistics showing we've got pretty good results in terms of sensitivity and specificity, but of course, the main thing is we want to make it really easy for researchers and reviewers to verify every single decision recommendation that Elicit makes, and then over time that will build trust. So, there are a couple of ways we do that. One is, for every single paper, if you click into it, we have every single criterion that it's evaluated against, and with every single criterion, we also show the rationale for why it either passed or didn't pass that criterion, and we have source quotes. And we also have a reading mode, because this is more of a table view, but if you really want to look in depth at every single data point, we have a reading mode here. This paper, for example, appropriate population focus, okay. Elicit says that the study focus is adults with type 2 diabetes. How do I know that? I click on the source quote here, and it will immediately highlight within the abstract the specific quote that gave Elicit the confidence to either pass or fail this criterion. And you can do this with every single criterion and every single paper.
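As a reference for the sensitivity and specificity figures quoted above, here is a minimal sketch of how the two numbers are computed from include/exclude decisions. It is an editor's illustration with made-up toy labels, not Elicit's evaluation code; the function name `sensitivity_specificity` is hypothetical.

```python
def sensitivity_specificity(truth, predicted):
    """truth / predicted: lists of booleans, True meaning "include".

    Sensitivity = TP / (TP + FN): share of truly relevant papers screened in.
    Specificity = TN / (TN + FP): share of truly irrelevant papers screened out.
    """
    pairs = list(zip(truth, predicted))
    tp = sum(t and p for t, p in pairs)
    fn = sum(t and not p for t, p in pairs)
    tn = sum(not t and not p for t, p in pairs)
    fp = sum(not t and p for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)

# Toy data: 4 relevant papers (one missed), 6 irrelevant papers (one let through).
truth = [True] * 4 + [False] * 6
predicted = [True, True, True, False] + [False] * 5 + [True]

sens, spec = sensitivity_specificity(truth, predicted)
print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
```

A missed relevant paper lowers sensitivity, which is why a screener tuned to avoid false negatives accepts somewhat lower specificity.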
Cohort design, also this quote, and there's another quote about 6 months of utilization beforehand. So, you can check every single decision. Of course, sometimes you might disagree with Elicit's decision, or you want to override it, and that's also possible. You can very simply just say, "Okay, Elicit said this should be an include. Actually, I'm going to say this is an exclude." And I can select exclusion reasons as well. So, I would say, "Actually, I think cohort design was off here." In this case, it's not, but if it was, I could choose that and then hit save and next. And then you can do that with every single paper. I'm going to jump down to some of the excludes. Try a little bit lower. Okay, here we go. These are papers that Elicit thinks should be excluded because they're failing at least one criterion. So, here I'm going to go into reading mode. Elicit will tell you, "Okay, it's an exclude." It'll also tell you why it thinks it should be excluded, what the primary exclusion reason is. And the way we do that is essentially we try and go in PICO order. If there are multiple failing criteria, we go in PICO order to say which is the most important exclusion reason. In this one, it was sample size. And I can go check, okay, why is sample size an exclude? And that's the wrong one. Let's see. Yeah. This analyzed 203 patients, and in our criterion we requested that it be more than 500. And so, I can go and verify that that is correct. And then I can do this for every single paper, and so on. And then one last thing to note is, of course, there are exports. If you want to take this out into your own systems to do analysis, you can download it as a CSV or an Excel file. And those CSVs will have every single paper, every single criterion, the judgments, the rationales, and the source quotes as well, in case you want to check those against other systems. All right.
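The strict-criteria rule and the PICO-ordered primary exclusion reason described above can be sketched as follows. This is an editor's illustration under stated assumptions, not Elicit's implementation: the criterion names, their ordering in `CRITERION_ORDER`, and the handling of "maybe" answers are all hypothetical.

```python
from collections import Counter

# Assumed ordering: PICO elements first, then design-level criteria (hypothetical names).
CRITERION_ORDER = ["population", "intervention", "comparator", "outcome",
                   "study_design", "sample_size"]

def screen(judgments):
    """judgments: dict of criterion -> "yes" | "no" | "maybe".

    Strict mode: a single "no" excludes the paper, and the primary reason
    is the first failing criterion in CRITERION_ORDER.
    Assumption: with no "no" but at least one "maybe", the paper is "maybe".
    """
    failed = [c for c in CRITERION_ORDER if judgments.get(c) == "no"]
    if failed:
        return "exclude", failed[0]
    if "maybe" in judgments.values():
        return "maybe", None
    return "include", None

# Made-up judgments for three papers.
papers = {
    "A": {"population": "yes", "intervention": "yes", "study_design": "yes", "sample_size": "no"},
    "B": {"population": "no", "intervention": "yes", "study_design": "yes", "sample_size": "no"},
    "C": {"population": "yes", "intervention": "yes", "study_design": "yes", "sample_size": "yes"},
}

decisions = {pid: screen(j) for pid, j in papers.items()}
# Tally primary exclusion reasons, as a flow diagram would report them.
reason_counts = Counter(reason for decision, reason in decisions.values() if decision == "exclude")
print(decisions)
print(dict(reason_counts))
```

Paper B fails two criteria but is counted once, under the earlier criterion, which keeps the per-reason counts in a flow diagram summing to the number of excluded papers.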
And so, we've added a lot of these features more recently, primarily to make sure that we have PRISMA-compliant status. And when you do your screening, for example, Elicit in the final report will show you... let me go back down to the excludes. I'll start from the includes here. Okay. This one is going to be excluded due to the sample size. It will also list that in the PRISMA flow diagram that we do at the end, which shows, okay, what are all the reasons why 650, 660 abstracts were excluded. It's going to show you 20 of them were for sample size, 10 of them were because the study design was off, or something like that. All right. I see there's a question. You can use the chat. You can also use the Q&A feature if you'd like. How does this compare versus Nested Knowledge? It's a great question. I think Nested Knowledge came from the more manual side of systematic review, so they've built out meta-analysis, critical appraisal, and so on and so forth, versus we are coming more from "can we build with AI from the very beginning." But I think within the world of AI systematic review, we have a pretty defined philosophy. I have some blog posts I can link you to, but essentially, when you try and use something like ChatGPT or a generic AI tool, it tries to cram everything into one answer, versus we have some philosophies around factored cognition. Essentially, we need to break down every single large task into smaller tasks. At the smaller task level, Elicit, our AI models, can be more accurate and more traceable.
And so, with those things combined, we can bring in AI and have lots of oversight and accuracy. So, different approaches to the world of systematic review, but I think we have a similar mission of helping people do much better research, much more research, and helping systematic reviews keep up with the amount of evidence piling up out there. Okay. All the abstracts were evaluated, 984 of them. I could go in and review every single one if I would like to, but I'm going to move on to the full-text screening stage. The full-text screening stage is very similar to the title and abstract stage in that we have the criteria that we selected before. I might add one here, because there are some details that are only available in the full text that are not in the title and abstract. And then the second thing is we need to have the full text. I'm going to go ahead, and there's one criterion I want to add. I'll add that here. And then we can kick off full-text screening. Okay. So, very similar to the title and abstract stage, but we're going to do only full text. There's a big question here that I'm sure you're thinking, which is: how do we get the full text? So, Elicit actually has a couple of ways to provide the full text to the stage. One is that Elicit will go out and try to find all the open access papers that we can. So, for example, already we have about 24 that we were able to find, either their full text or their conference abstracts. So, we find them automatically for you. We also have a Chrome extension, and if you use that and you have institutional access (I do not, but if you do), you can use those credentials to get the papers automatically from the journal subscriptions, too. And then finally, we show all the abstracts that we couldn't get full text for ourselves. And so, what you can do is we'll show you the exact link.
If you click on this, it'll take you to the exact paper, and then you can get the PDF in some way and upload it to Elicit here. Once it's uploaded, that full text will be processed with all the rest of the full texts. So, let that continue. And then, let me share the stats, actually. So, I'm going to go back to this. Similar to title and abstract, you can click in, you can see the reading mode, and I think it's actually even more useful here, where there are many pages per paper. So, okay, focus on adults with type 2 diabetes. Where is that coming from? In this case it might be the abstract. But if there's something... let's see here. Oops. Yes. It will jump to the place in the paper where the information was pulled from, and as you can see here in the screening, we're actually looking at the tables themselves. So, we can see, for example, this has percentages for 12, 18, 24, and 39 months. And so, we can use that information even if it's in a table or a figure, and we will show you exactly where it came from in the paper. Okay. Before we go on, I'm just going to share one stat here. We also evaluated full-text screening on Elicit. We took 74 Cochrane reviews and 377 full texts, and then we looked at sensitivity and specificity again. Here, we saw about 99.5% sensitivity and 70.1% specificity. And then, we also looked at the per-criterion level, which is to say, for every single criterion, how many of them do we get correct? And we saw about 94.8% accuracy on the per-criterion basis. And where we got it wrong was that we over-indexed towards maybes and includes instead of excluding. Again, for the same reason: we don't want to miss any papers. All right. So, screening is done. Out of the 984 sources that we originally saw, we have about 40 sources included here.
If we wanted to, we could also get the full text for the 50 that Elicit couldn't find for us, and then go from there. So, now we're going to go into extraction. Extraction is essentially: for all the sources that we've included, let's extract the data points that we actually need to be able to do our analysis, our syntheses, and so on. Elicit is going to suggest some, but I wanted to show what it's like to create some custom columns, so I have some that I'm going to pull from the protocol. So, give me a second to pull those in. And you can see that we're extracting from figures as well. There's also an ability to have an answer structure. Generally, I default to "any," which allows Elicit to look across all the data and then pull the answers out into the right structure and right format based off your prompt. But if you know that you only want a couple of choices to choose from, you can always choose that. So, for example, it's going to be an observational study versus an RCT or something: you can define those, and you can do yes, no, and maybe in case you want answers in that structure. But I'm going to default to "any" right now. So, that's one. Let me add the others. And again, if you want to see where I'm getting these from, they're in the protocol Google Doc that I shared. Okay. The task is going to run. Let me let it start running, because it's going to take some time. This is probably the lengthiest part of the Elicit systematic review, because for every single extraction data field you just defined, it's going to go against every paper and pull out all the data, add all the quotes, and make sure that nothing is unsupported. In the meantime, the interface looks similar to the screening phases, and that's on purpose. With using AI, using Elicit, prompts do matter.
So, the pilot phase here is really to make sure that when you put in your criteria, the format and the information that Elicit is pulling out is to your liking. And if you don't like something, again, you can go in and specify a format. I did not specify a particular format on these ones, but you could say, use parentheses for standard deviations, brackets for the means, or vice versa. Add those in and Elicit will follow that. You can iterate and make sure that the extraction is to your liking before you run it against all of them. And so, we've selected 10 here. All right. And then, Elicit has also suggested some columns based off the protocol I put in earlier, but I've kept these off for now because I'd like to go with my own extractions. Okay. So, Elicit is going to take its time to do this. As before, there's a download, so you can export to CSV and Excel and show exactly what the papers are, what columns we're pulling, and what the source quotes are from each paper that show why that value exists. Let me see if I can show a couple here. Okay, so this is an example of an extraction, patient population. I gave it a long list of things to pull out. It's pulling them out from the paper, and you can see at all times the quotes that it's pulling from. And you can go through every single quote. It pulls from tables, charts, all kinds. And then when something is not mentioned, I put in the prompt to say "not mentioned," and it will say specifically "not mentioned." Yes. So this is a more complex diagram. Key eligibility criteria. Elicit is essentially looking at this figure and saying, okay, what do I understand about it, and then using that to inform the extraction and the analysis that it's doing. Okay. So, we're going to let this run for a second. I'm going to take a pause. Okay, cool. There are a couple of questions. Let me answer them.
Can we add or create our own agents in Elicit instead of just using the pre-set research agent? So that is another workflow that exists in Elicit. Let me see if I can pull that up. I will stop sharing and share this tab instead. Okay. So on the home page, if you notice, we had selected systematic review, but the base is the research agent. At the current moment, no, the main agent is the research agent that Elicit has defined. We put a lot of effort into the prompts and the tools and so on that it has, to make sure that everything is verifiable, everything is backed up, and it's never finding evidence where there is none or overextending claims. So at the current moment, no, but I believe we made it flexible enough that you could go in and do sort of any task that you want to do with it. If there's something in particular that you're curious about, drop it as a question. I can also answer that in terms of what kind of agent you're looking for. Okay, does it make sense to check all decisions in title and abstract screening by human overview, or should a specific percentage be checked? So, I think this is something the industry has not yet settled on. I know a lot of tools do have priority screening, or check at 20%, and if it's all good after that point then you can just let the system take over. From the research I've seen online, even these kinds of systems leave some valuable papers on the table. So, it feels like at the current moment we cannot yet trust a system to do all of it by itself. So, I think human oversight is going to be required. But I think that might change in the next year or two as we get more validation. For example, if Elicit is performing at 97% sensitivity and 92% specificity, beating dual human review, I think that's some evidence that actually we should move toward: let's check a percentage, let's take a random sample, and so on.
So, I think for now it still makes sense to check them, and we've added some features like the reading mode to allow you to quickly check every single decision very easily. But over time we hope to cut that down to where you check a percentage, where you can do calibration within the review and then show that. Is it possible to show off KM curve digitization? Yes, it should be. So, the way that it works in Elicit is we don't replicate the diagram as an internal thing. What we do instead is look at the KM curve and try to extract all the data points. If I can find one in here, I'll try and show it to you. Should be somewhere. Should probably be in primary outcomes. Okay. Let's see. So, I'm guessing where that does work. Okay. So, for example, persistence at 12 months here. Here, what Elicit is doing is it's got this figure, and based off of this figure it's estimating (let me find the exact number from the KM curve) 48 to 50% at 12 months. Let's see. And this might be a case where Elicit was off, and it might be closer to 60%, so 48 to 50 is about 10 percentage points off. This is probably on the larger end of errors, but Elicit can digitize from the KM curve, and essentially what it's going to do is try and estimate based off of this sort of eyeballing. If you go to the 12 and then go up, what is actually... did I misunderstand that? Yeah. It's going to eyeball up from the 12-month mark and then look at what the chart is saying. Here, I think it's probably closer to 60%, versus Elicit was saying 48 to 50. In other cases that I've seen, it's been a little more accurate. I've seen, on average, the estimate is within about five percentage points of the actual value. Okay. And then, I wanted to show one more thing while we wait for extraction to load, which is... oh, sorry, I realized I was on the wrong tab. This is the KM curve digitization.
So, here Elicit is looking at the curve. It's looking up at the 12-month mark and it's saying, "Okay, this is right around the 0.6 probability of persistence." Whereas, I think Elicit has overestimated that, and it's gone to 58, or like 48 to 58 percentage points. So, Elicit is usually within about five percentage points when I've looked at other examples. Okay. I want to show one more thing while we wait for extraction to load, which is dual review. It's a new feature that we've added, and I didn't do it for the sake of this one because that adds another person, which is going to take a little more time, but we do have it as a feature, and there are two ways it can work. Elicit can be decision support: the two individual humans review Elicit's recommendations to make sure they're not missing anything, and then they make their decisions and reconcile. Or Elicit can act as a second screener, in which case you run Elicit like I'm doing here, you download the Excel, and then you compare it to what a human has done manually with their Excel and reconcile. Let me show you what Elicit as decision support for two humans looks like. I believe that should be this tab. Okay. So, this is a separate review that I had set up beforehand. It has far fewer papers. It's only got 10 papers in here. And what I did is I assigned dual review. So, in that first protocol step, you can say "add a second reviewer." I added my other email so that I could do both sides to show what it looks like. But essentially, what you'll see here is title and abstract screening is finished, and then I've started dual review. And so, now what happens is Elicit has made its screening recommendation. It's up to me to say, do I agree with this, or do I actually have a different decision? So, I'm going to check every single decision. Let's just say I don't agree with this one for whatever reason.
I think it should be a maybe, because I've dug through it, and I found that a criterion like human adult participants is a strong enough criterion that, since it's getting a maybe on that, I think the paper is a maybe. I can save that, and I can also add a note for later use in reconciliation. Okay. That one's been marked. My decision is maybe. I'm going to accept this one. I'll accept this one. I'll accept this one. Let's say I think this should actually be an exclude. I'll save that. And then I'll go forth. So now what's going to happen is another human is going to do the same thing across all of these papers independently. They're not going to see my answers, so they're blinded. They will see Elicit's screening recommendations. They're going to do the same thing. And then once they're done, which I've already done beforehand, we can reconcile. So what that means is I will actually see, okay, here's what Elicit recommended, here's what I, reviewer one, decided, and here's what reviewer two, which is the peer, decided. And so we can see where we disagree. And really the disagreements are going to be between my decision and my peer's decision. So here I'm going to see my decision and the notes I added earlier, and the other reviewer's decision and the note they added earlier. And the intent here is, however review teams want to work, maybe they do it over email, maybe they get into a room and adjudicate, maybe they bring in a third person to make those decisions, however that wants to be done, that can be done, and then reviewer one, who is the creator of the systematic review, can just put in the final decision that was made. So here, let's say we're just going to include it. And here we're going to exclude it. Okay. That will finalize that. So at this point I've locked in my entire title and abstract screening. Elicit's done its recommendations, human one's done their recommendations, human two has done their recommendations, and we've reconciled.
And so now we have our final title abstract screening. But now we've also added some statistics, which I think are really helpful for calibrating how well Elicit's doing. So what we've added here is, okay, one, reviewer one, how did they calibrate with Elicit? We had an 87.5% agreement rate across the papers, and when we excluded a paper, we had the same exclusion reason all the time. And then we also have the ability to compute Cohen's kappa, but it's pretty unstable underneath about 50 to 100 papers, so it's available after you've added about 50 papers. And the second thing is, how do the reviewers agree with each other? So, reviewer one versus reviewer two, what's their agreement rate, exclusion reason agreement, and same thing with the kappa. And then when you do dual review, we also have an audit trail that I think is super helpful. So, let me show you this. That was review stats. We also have review history. So, for example, this shows you every single decision that was made, what Elicit recommended, what human one recommended, and what human two recommended. So, for example, this will show every single decision and where we disagreed. And same with reviewer two, we can show every single decision, where we disagreed, and the notes. It'll show you exactly when each phase is reached. So, you locked your protocol in, you started screening, when did you go to conflict resolution, what were the conflicts that were resolved, including which ones you resolved, and then when it was done. And you can also export the CSV, so if you want to put it in your PRISMA reporting documents, you can do that as well. So, that is dual review. You can do it that way with two humans on Elicit, or you can do it where Elicit is one screener or reviewer, and a human is the other. Okay. Let's see. Couple more things. Extraction is about 70% done. 
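The two review statistics mentioned above, agreement rate and Cohen's kappa, can be sketched in a few lines of Python. This is only an illustration of the standard formulas, not Elicit's implementation; the function names and the eight example decisions are made up, chosen so that seven of eight decisions match and the agreement rate comes out to the 87.5% quoted in the demo.

```python
from collections import Counter


def agreement_rate(rater_a, rater_b):
    """Fraction of papers on which the two raters made the same decision."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters need the same, non-zero number of decisions")
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    p_observed = agreement_rate(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    p_expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical screening decisions (not data from the webinar).
elicit = ["include", "include", "exclude", "exclude",
          "include", "exclude", "include", "exclude"]
reviewer_one = ["include", "include", "exclude", "exclude",
                "include", "exclude", "include", "maybe"]

rate = agreement_rate(elicit, reviewer_one)
kappa = cohens_kappa(elicit, reviewer_one)
print(f"agreement rate: {rate:.1%}, kappa: {kappa:.3f}")
```

With only eight decisions, a single changed label moves kappa a lot, which is the instability on small samples that the talk refers to.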
Again, extraction is going to be the longest part of the systematic review, because it's going against all the criteria, pulling it out in detail, finding the quotes, analyzing the figures. We've talked a little bit about how, you know, auditability and traceability are super important at every single step of the way. We showed that in screening, where you can check every decision, see the rationale, override it, and then the same thing in extraction as well. You can, for each paper and each extraction field, see where it came from in the paper and see the rationale that Elicit had for why it extracted that data out. And then in terms of editing it, you can always go back to extraction results and change the results, or if you want to export, you can download the CSV and edit it there, and we have on our roadmap coming up also the ability to edit inside of this view as well. All right, and then let me talk about the stats of this a little bit. So I shared stats on title abstract screening and full text screening. On extraction, what we did is we took about 28 Cochrane reviews and about 111 full-text papers that came with them, and we focused on extracting out the methods, participants, and interventions the same way they did in the Cochrane review, and we got to about 95.6% accuracy with that evaluation. There are some caveats to it that I would recommend you check out on the blog post, where I think in Cochrane reviews sometimes it's underspecified what you want to extract out, so we sort of had a synthetic question that was formed from the extractions. It's a bit involved, but I think you can go check it out on the blog post. I would still stand by the 95.6% number as being accurate for the methods, participants, and interventions. That was our internal evaluation. We also have had external evaluations independently done. 
For example, Lee et al. from earlier this year evaluated an earlier version of Elicit systematic review and found that we had about a 1% hallucination rate, which is actually the same as the humans that they measured in their study. And the overall error rate actually was lower than humans. Since then, we've increased our accuracy and improved our system and its performance, which is why we're seeing 95% accuracy rates, which is better than before. But we've had both internal and external evaluations. Okay. Let me answer some more questions. So there's a couple questions. When you want to add a PDF to your search that was not downloaded automatically, what is the best way to feed it back to your current search? In the past I probably had to start everything from scratch. Let me see. Okay. So in terms of adding papers or PDFs to the systematic review, there's a couple things. One is at the full text stage, which is really when it starts to become important, you can always upload there to the specific abstract that is missing a PDF. So that will always persist through the rest of the systematic review. Or if you know you need some PDFs for sure added, you can upload them directly in the search stage or to the library and then pull them from the library into the systematic review from there. Okay. And then there are three questions here. Are there any copyright issues if I upload a PDF that Elicit does not have access to? That is something to check with the journal that you're specifically pulling from. There might be, and we're working on ways to easily give you the ability to check whether you have access. That is something you'll have to check with the specific paper. Is a risk of bias assessment already possible or is one being planned? So you can always add in risk of bias columns in the extraction stage, but we are specifically planning on adding critical appraisal to Elicit. 
So that'll be a separate stage where you can say, you know, what's the risk of bias for each paper and then do analysis based off that. How realistic is it that Elicit will one day be able to conduct meta-analysis? That's the last question here. Um, very realistic. We have it on our roadmap coming up soon. So, we have the systematic review, then we add critical appraisal. Once you have critical appraisal, we're going to essentially add in the ability to do meta-analysis on top of it, so you can do quantitative synthesis as well as the narrative synthesis. Let me jump over to the Q&A for a second. Okay, that was already addressed. Okay. I know you mentioned that Elicit is different from NKN in that it's not AI native, but what about AutoSR? Uh, to be honest, I have not used AutoSR. I know they put out some great validation numbers, which are great to see. I don't know. I cannot speak to their traceability, auditability, reproducibility. I just haven't seen the system. But, I will emphasize that the thing that we've been doing from the start is focusing on every decision being traceable, auditable, reportable. So, when you go to submit this as an SLR and you have to do the PRISMA documentation, every decision you can report and say exactly where it came from, which system made it, and when it was run. Okay. See, extraction should be almost done. Cool. Extraction's complete, and now we go on to generating the report. So, in terms of generating the report, we'll see as it runs. Here is sort of, you know, the PRISMA funnel. We'll have an explicit PRISMA diagram that shows up at the end, but we've looked at 984 sources. We screened in 230 of them at the title abstract stage. At full text screening, we included 40 sources from that 230 pool, and then from the 40 sources we extracted about 200 data points, which are going to go into the final report. I'm going to let that generate for a couple minutes. 
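The funnel numbers just quoted (984 sources, 230 after title abstract screening, 40 after full text screening) imply the per-stage exclusion counts that a PRISMA flow diagram reports. As a small worked example, assuming nothing beyond those three numbers, the arithmetic can be sketched like this; the function and field names are illustrative, not part of Elicit.

```python
def prisma_funnel(stages):
    """Turn an ordered list of (stage name, records remaining) into per-stage rows.

    The first entry is the starting pool; each later entry is a screening
    stage and the number of records that survived it.
    """
    rows = []
    for (_, assessed), (name, included) in zip(stages, stages[1:]):
        if included > assessed:
            raise ValueError(f"{name}: cannot include more records than were assessed")
        rows.append({
            "stage": name,
            "assessed": assessed,
            "included": included,
            "excluded": assessed - included,
        })
    return rows


# Counts quoted in the live demo.
funnel = prisma_funnel([
    ("records identified", 984),
    ("title/abstract screening", 230),
    ("full-text screening", 40),
])

for row in funnel:
    print(f"{row['stage']}: assessed {row['assessed']}, "
          f"included {row['included']}, excluded {row['excluded']}")
```

So 754 records were screened out at the title abstract stage and 190 at the full text stage; the breakdown by exclusion reason comes from the review itself and is not reconstructed here.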
But, in the meantime, I'm going to show a couple cool things. Actually, let me show a done report. Let me let this report generate. So, I had run essentially the same protocol yesterday, just, you know, to check it out, and then I generated a report of it this morning. What this is doing is essentially it's taken all the data points and then looked at the research question you had at the beginning and the PICO criteria and so on, and then it tries to come up with essentially a Cochrane review-esque or, um, very rigorous analysis of that. So, we'll have our summary. We'll have our abstract, which essentially summarizes across all the evidence. And then I want to specifically call out the methods section because this is going to be very important for showing that it's rigorous and done in a PRISMA-compliant way. We have our PRISMA diagram, which shows you all the steps that we did. And if you let me zoom in a little bit, essentially it shows you the papers that came in, the numbers, how many were screened out at each stage, and the specific exclusion reasons for each of the papers, done at both title abstract and full text. It shows you the full search string and the number of results. It shows you the exact screening questions that we used as well as the number of papers that were excluded. And then in terms of extraction, it shows you the exact extraction criteria or definitions we had as well. Cool. And then the rest of the report goes into, you know, the things you'd expect in a review of this kind. What are the characteristics of included studies? And then it shows you all the data for that, and it will show you also the claims and why it says that. And again, it'll go into the paper and show you the quotes from the underlying paper as well. And then it'll do this pretty consistently throughout. It will do, here are the claims. Um. Ah, apologies. Cool. Let me rewind a little bit. Apologies. 
So, this would be what a finished report looks like. Essentially, I ran a very similar one earlier, and it's going to look at the research question. It's going to look at all the data points that we pulled out and essentially come up with what is the synthesis across all the data, while also, you know, factoring in that some papers might disagree, and so temper the evidence. So, it has an abstract and a summary in the way that you expect. It has a methods section, which shows you all the steps we did in the actual process. So, for example, it has a PRISMA diagram, and if you zoom in a little bit, it shows you the number of papers we pulled in, what are the abstract screening criteria we used, and how many papers were screened out for each specific criterion. And then it'll go into full-text screening, the number of papers screened out of full-text screening based off of those criteria, and then the number of papers we finally included. It also includes the full query that we used and the database we searched against, and then it will list out in detail all the screening questions and the extraction definitions we wanted as well. And then past that, it will do a characteristics of included studies table. So, for all 34 papers that we included, what are the characteristics, you know, that we wanted to look at and what did it show? And then for each of them, it will show the full quote as well, where in the paper that claim comes from. And then it continues on, essentially summarizing and synthesizing all the evidence to show what are the persistence and adherence rates at 12 months. It will show again tables showing, for each paper, what it shows, and then summarizing or synthesizing across those numbers, adherence and so on. And then often it'll do subgroup analyses, dosing schedules, and so on, to essentially get all the subgroup analyses you would like in your review. 
I will caveat that essentially, as it stands, the report that Elicit generates is more of a first pass. There's a lot of stuff here that you want to add. We want to be able to do meta-analyses. You want to do more papers and do more rigorous checks of all the claims. But as a first pass, it sort of gives you the lay of the land. All right. Let's see how the report is going. Report is still generating. It'll take a couple more minutes. Let me answer some more questions. I know this presentation is focused on a systematic review of published studies, but can you provide information or suggestions for using Elicit for something like a landscape scan that also looks at gray literature and websites? Yes, so there are two answers to that. One is, as part of our corpus, we do have some gray literature, conference abstracts, and so on. And so when you do a systematic review with the Elicit corpus and you do a semantic search or something, we will pull those in. But if you want to get even more and get websites and so on, I would start with the research agent on Elicit, where it will essentially be able to scan across the web, find all these data sources, and compile them in there. And something we have planned that, you know, we will be implementing soon is, can we combine these two? So you do your web searches, you pull those in, and do a systematic review over the results you found from the web search as well. Okay. Are we going to receive the recording after the meeting? Yes, absolutely. We will be sending out the full recording. And then, Cochrane is evaluating tools based on their alignment with the responsible AI in evidence synthesis (RAISE) recommendations. Have you submitted your tool for evaluation? Yeah, we are talking with Cochrane about joining their study and what the steps would be. Um we have a we have internal uh service. 
Our evaluations have tried to sort of align with the RAISE guidelines in terms of showing here is independent or validated performance of each stage of the pipeline, which is a big thing that the RAISE guidelines look for. Another thing is we've also added a lot of human oversight. That was something we've always wanted from the beginning, and we've added more and more stages of it. So for example, dual review, and being able to do the review stats within dual review so you can see where Elicit agrees with me, sort of calibrating as you go to make sure that, beyond just independent validation or external validation, Elicit also is doing well in your specific review. And then the final part is all the reporting that we have at the end, which you can use to make sure that when you use Elicit, you can report it accurately in your submissions. Yeah, the Lee paper says there's about a 20% error rate in both Elicit and humans, but if I remember correctly you had mentioned 1%. Yeah, so the error rate versus the hallucination rate is different. I believe Lee defined hallucination rate as separate, as more like just confabulating something that is not there. There I think it was 1%, which is similar to humans. And then the other thing is the 20% error rate was on a previous version of Elicit. You know, the paper came out in February, but I think it was using a version of Elicit probably a year old. The newer version I think is showing maybe closer to a 5% or 4% error rate. All right, and can literature searches conducted in Elicit be considered reproducible if they're not run in PubMed? With PubMed, replication is straightforward because the exact search string can be copied and rerun. How can someone replicate a search performed on Semantic Scholar's broad corpus, especially if the underlying query logic isn't fully transparent? Yeah, that's a great question. To specifically combat that, when I ran this PubMed query, I ran it against specifically the PubMed corpus. 
So if you take the results, uh, let me pull in the exact query we ran. It's in the exact PubMed syntax, so it will show... Let's see if I can pull it. Ah. I'm also sharing that we finished our review. We have about 8 minutes left till the hour ends, so we finished the entire systematic review within an hour. So, we did that successfully. The report is going to be very similar to the previous one that I shared because it's the same question with very similar sources and of course the same methods. But, what I'm going to show is, if you take this exact query, we run this query against PubMed. And let's go to PubMed here. Going to run it. Should get about the same. So, 984 versus I think 985, and there's one paper that I think maybe PubMed added since we started or something like that. But, essentially the exact same papers that flow through from PubMed will flow through Elicit. So, when you report the PubMed query or the keyword query, it's going to be reproducible every single time, especially if you put in the date filter. Does that answer your question? If not, let me know if there's a follow-up you want to address. Okay. So, we've done our review. We started with a protocol. We ended with the report. And you can always save. You can download PDFs. You can download the .bib files that are associated with this, and the RIS files. You can also download, as we mentioned throughout every stage, the CSVs and Excels, especially the extraction data. I think that's the most valuable thing. Downloading the CSV will show all the extracted data. Verify that. In the future, we'll be able to do critical appraisal and meta-analysis on top of this. So, this extraction data will be used to do that meta-analysis. And then finally, we generate the report including citations. And when you download the PDF for the report, the citations will be moved to the APA format and have numbers instead. 
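The point about reproducibility, rerunning the exact search string against PubMed with a date filter, can also be done programmatically through NCBI's public E-utilities `esearch` endpoint. The sketch below only builds the request URL so the search can be logged alongside the review; the search string and dates are placeholders, not the query used in the webinar, and the helper name is made up.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint for Entrez databases such as PubMed.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(query, min_date, max_date, date_type="pdat"):
    """Build a PubMed esearch URL with a fixed date window.

    Dates use YYYY/MM/DD. date_type "pdat" filters on publication date;
    "edat" filters on the date a record entered the database.
    """
    params = {
        "db": "pubmed",
        "term": query,
        "datetype": date_type,
        "mindate": min_date,
        "maxdate": max_date,
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Placeholder query and dates, for illustration only.
url = build_esearch_url(
    query="adherence[Title/Abstract] AND diabetes[Title/Abstract]",
    min_date="2015/01/01",
    max_date="2026/05/13",
)
print(url)
```

Pinning `maxdate` is what keeps the hit count stable between runs; a record added after that date, like the 985th paper in the demo, would otherwise change the total.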
There's a couple things I want to add now that we've hit the end. One question is, how did I come up with this protocol? Usually, this would take weeks of, you know, checking against the literature. I am not a librarian. I don't necessarily have the expertise to define a protocol and do it super well. So, I actually used Elicit to help me define the protocol to run in this. And I'll show you how I did that. So, I'm going to take the research question, which is simply, among adults with type 2 diabetes using GLP-1 receptor agonists, what is the persistence and adherence? And what I did is I went to the research agent and I said, "Can you help me define a protocol for a systematic review on this?" And I often prompt it to do some searches against... So, I put that in. And then the research agent actually has access to the full corpus of Elicit as well as keyword search, clinical trials, and PubMed. And so, what it can do is run queries against the corpus and essentially do like a mini scoping review, where it does searches, finds papers, tries to figure out what are the anchor papers we want in every single systematic review that would go here, like what are the really important papers, and then helps me formulate queries, then makes sure that I get those papers in there. And then it looks at papers that should be included versus shouldn't be included and tries to figure out what are the screening criteria that really matter to make sure that the papers that do make it through are of high quality or high signal. So, here it's finding a bunch of papers. It will continue to do that. And then it will come back with a systematic review protocol in just a second. It's going to take a second to do that. 
So, it needs to go through all the sources, and similar to the systematic review workflow, the research agent also has access to the papers and will do summaries, and it will do quotations as well, to make sure that every analysis it's doing will show up with quotes and rationale for why it's included. Here we go. It's almost done, like a mini version of a systematic review in about a minute. Of course, I would never submit this for rigorous submission, but it does help inform me on what I need to put in my systematic review. I might get bugged with the H's, but nonetheless, here we go. Draft systematic review protocol: the background rationale, what the interventions, comparators, and outcomes should be, study design, what should be included versus excluded. I think what I'd probably do is iterate with this a little bit more and look at more searches, but I think in about 10 minutes, which is what it took me when I first did it, I got to a pretty sizable first version of the protocol. I have one more thing to share as well. We also released an API. I think it's maybe the world's first systematic review API, and the reason is you can do a systematic review on Elicit using a lot of Elicit's corpus, but if you're, you know, on a team and you have, you know, hundreds and hundreds of research questions to investigate, how do you do this at scale? How do you make sure that you don't drop the ball on any single one of them? And how do you bring your context into the systematic review? Let me show you how to do that in the 2 minutes we have left. Okay. Using everyone's favorite, Claude Code, here, but basically I've set it up with my API key already, and I have a bunch of protocols in this folder that I want to do. And you can fire it off like that. You can also do it more programmatically: write a Python script and say do a systematic review for each of these protocols. But I've defined them already in a folder. 
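As a rough idea of what such a Python script could look like, here is a minimal sketch of the batch pattern only: read every protocol file in a folder, submit each one, and keep the URL that comes back so the review can be watched as it runs. It deliberately does not call the Elicit API. The JSON protocol format, the `submit` callable, and the example URLs are all assumptions for illustration; the real endpoints, authentication, and payload shape are described at docs.elicit.com.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def load_protocols(folder) -> dict[str, dict]:
    """Read every *.json protocol in the folder, keyed by file name, in sorted order."""
    return {
        path.stem: json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(Path(folder).glob("*.json"))
    }


def run_batch(folder, submit: Callable[[dict], str]) -> dict[str, str]:
    """Submit each protocol and record the returned review URL, or the failure.

    `submit` stands in for whatever client call starts a review; it is a
    hypothetical hook, not a function from an Elicit SDK.
    """
    results = {}
    for name, protocol in load_protocols(folder).items():
        try:
            results[name] = submit(protocol)
        except Exception as exc:  # keep going so one bad protocol does not stop the batch
            results[name] = f"FAILED: {exc}"
    return results


def fake_submit(protocol: dict) -> str:
    """Stand-in for a real API call; returns a made-up URL."""
    return f"https://example.invalid/reviews/{protocol['id']}"


def demo() -> dict[str, str]:
    """Write two placeholder protocols to a temporary folder and run the batch."""
    with tempfile.TemporaryDirectory() as folder:
        for review_id in ("glp1-adherence", "statin-persistence"):
            protocol = {"id": review_id, "question": f"placeholder question for {review_id}"}
            Path(folder, f"{review_id}.json").write_text(json.dumps(protocol), encoding="utf-8")
        return run_batch(folder, fake_submit)


if __name__ == "__main__":
    for name, outcome in demo().items():
        print(name, "->", outcome)
```

Recording a failure per protocol instead of raising means one malformed file does not drop the rest of the queue, which matters when there are hundreds of research questions in the folder.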
And then what Claude Code is going to do is look to that folder, get my API key, and just fire off a bunch of requests to Elicit's API to do the systematic reviews. And then the neat thing is, once it fires them off, it also gives me a URL where I can see the systematic review running in real time. So, let's see if it does that. It'll take a second. I should have the docs, which you can also go access at docs.elicit.com. And it should start firing them off in about a second. Cool. All right, and we're just about at time. I think it should be almost done firing them off. And so, the whole purpose of this is, when you have, you know, dozens and dozens of research questions, can you just do them in one go? Yes, you can. Now, okay, one more minute. While it goes, are there any questions on the API? Or on anything we've discussed today? So, while it goes, I'll just do the final wrap-up, which is, we've done our goal, which was to start with a protocol and end with a report that covers the search, title and abstract screening, full text screening, and extraction phases of the systematic review. We can, of course, scale it up much more. I only looked at 984 papers. We can do up to 40,000 per systematic review, so you can really make sure that you get every single paper that matters. We'll share out the recordings of this webinar, so you can go back and check it. We will share out the finished systematic review as well. And then, I will also see if I can put these final systematic reviews that ran with the API in there as well, if you'd like to check them out. But, thank you so much for attending. This is starting to do systematic reviews. You can extrapolate and see that it would run the systematic review against it. It's got a couple errors. We'll fix that and then go. 
Run the systematic review, and then you could check the actual results of the systematic review API, download them, pull the extraction tables, and do analysis, all from the API. And the final thing is, we've seriously upgraded our systematic review. If you want to chat more, feel free to reach out to me at hamster@elicit.com. I can put you in contact. I can answer any questions if you want to upgrade to scale or enterprise, which are the plans that have these upgraded versions of systematic review. I can also help you, uh, figure out the right way to do that. But, this is going. I will send the results in the final doc. Thank you so much for joining. Cool. Hey y'all. For the people that are still left here, it's running the systematic reviews. It's going to give me the URL, and I can show you what it looks like to actually see it running. Let's see. Stop sharing that. Share this. I believe this is it. Yep. So, this is the systematic review. We started it from the API, and it's already into the screening phase. This one is just doing title abstract screening, but we're already about 16% of the way through screening. I can modify here as well and take control, but otherwise it'll just continue on all the way to generating a report. And with that, thank you so much for joining. I will talk to you later. Bye.